Spark Release 1.3.0
Spark 1.3.0 is the fourth release on the 1.X line. This release brings a new DataFrame API alongside the graduation of Spark SQL from an alpha project. It also brings usability improvements in Spark’s core engine and expansion of MLlib and Spark Streaming. Spark 1.3 represents the work of 174 contributors from more than 60 institutions in more than 1000 individual patches.
To download Spark 1.3 visit the downloads page.
Spark Core
Spark 1.3 sees a handful of usability improvements in the core engine. The core API now supports multi-level aggregation trees to help speed up expensive reduce operations. Improved error reporting has been added for certain gotcha operations. Spark’s Jetty dependency is now shaded to help avoid conflicts with user programs. Spark now supports SSL encryption for some communication endpoints. Finally, real-time GC metrics and record counts have been added to the UI.
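The idea behind a multi-level aggregation tree can be sketched in plain Python. This is a conceptual illustration only, not Spark's implementation: the function name tree_reduce, the depth parameter, and the fan-in heuristic are all assumptions made for the example. Each partition is reduced locally, then the partial results are merged in small groups over several levels, so no single step has to combine every partition at once.

```python
from functools import reduce


def tree_reduce(partitions, op, depth=2):
    """Combine per-partition results level by level instead of all at once."""
    # Step 1: reduce each non-empty partition locally.
    partials = [reduce(op, part) for part in partitions if part]
    if not partials:
        raise ValueError("cannot reduce an empty dataset")
    # Pick a fan-in so that roughly `depth` merge levels are needed
    # (a heuristic chosen for this sketch).
    fan_in = max(2, int(round(len(partials) ** (1.0 / depth))))
    # Step 2: merge groups of `fan_in` partial results until one remains.
    while len(partials) > 1:
        partials = [
            reduce(op, partials[i:i + fan_in])
            for i in range(0, len(partials), fan_in)
        ]
    return partials[0]


# Four "partitions" of integers, summed through a two-level tree.
partitions = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
total = tree_reduce(partitions, lambda a, b: a + b)
```

With many partitions and an expensive merge operator, spreading the merges over intermediate levels keeps the final step from becoming the bottleneck.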
DataFrame API
Spark 1.3 adds a new DataFrame API that provides powerful and convenient operators for working with structured datasets. The DataFrame is an evolution of the base RDD API that includes named fields along with schema information. It’s easy to construct a DataFrame from sources such as Hive tables, JSON data, a JDBC database, or any implementation of Spark’s new data source API. DataFrames will become a common interchange format between Spark components and when importing and exporting data to other systems. DataFrames are supported in Python, Scala, and Java.
Spark SQL
In this release Spark SQL graduates from an alpha project, providing backwards compatibility guarantees for the HiveQL dialect and stable programmatic APIs. Spark SQL adds support for writing tables in the data sources API. A new JDBC data source allows importing from and exporting to MySQL, Postgres, and other relational databases. A variety of small changes have expanded the coverage of HiveQL in Spark SQL. Spark SQL also adds support for schema evolution, with the ability to merge compatible schemas in Parquet.
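Schema merging can be pictured with a small plain-Python sketch. It assumes, for illustration only, that a schema is a mapping from column name to type name and that two schemas are "compatible" when any column they share has the same type; Spark's actual compatibility rules may differ. The function name merge_schemas and the sample schemas are hypothetical.

```python
def merge_schemas(left, right):
    """Merge two schemas given as {column name: type name} dicts."""
    merged = dict(left)
    for name, dtype in right.items():
        # A shared column must agree on its type to count as compatible here.
        if name in merged and merged[name] != dtype:
            raise ValueError(
                "incompatible types for column %r: %s vs %s"
                % (name, merged[name], dtype)
            )
        merged[name] = dtype
    return merged


# An older file and a newer file that gained a column and dropped another.
old_file = {"id": "long", "name": "string"}
new_file = {"id": "long", "email": "string"}
merged = merge_schemas(old_file, new_file)
```

The merged schema is the union of both sets of columns, which is what lets files written at different times be read as one table.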
Spark ML/MLlib
In this release Spark MLlib introduces several new algorithms: latent Dirichlet allocation (LDA) for topic modeling, multinomial logistic regression for multiclass classification, Gaussian mixture model (GMM) and power iteration clustering for clustering, FP-growth for frequent pattern mining, and a block matrix abstraction for distributed linear algebra. Initial support has been added for model import/export in an exchangeable format, which will be expanded in future versions to cover more model types in Java/Python/Scala. The implementations of k-means and ALS receive updates that lead to significant performance gains. PySpark now supports the ML pipeline API added in Spark 1.2, as well as gradient boosted trees and Gaussian mixture models. Finally, the ML pipeline API has been ported to support the new DataFrame abstraction.
Spark Streaming
Spark 1.3 introduces a new direct Kafka API (docs) which enables exactly-once delivery without the use of write-ahead logs. It also adds a Python Kafka API along with infrastructure for additional Python APIs in future releases. An online version of logistic regression and the ability to read binary records have also been added. For stateful operations, support has been added for loading an initial state RDD. Finally, the streaming programming guide has been updated to include information about SQL and DataFrame operations within streaming applications, as well as important clarifications to the fault-tolerance semantics.
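One way to picture how a replayable source can give exactly-once results without a write-ahead log is sketched below in plain Python. It uses neither the Kafka nor the Spark API. It assumes, purely for illustration, that each batch is identified by a range of offsets and that the output stores the next expected offset together with the result. All names (read_range, apply_batch, the state dict) are hypothetical.

```python
def read_range(log, start, until):
    """Re-readable source: the same offset range always yields the same records."""
    return log[start:until]


def apply_batch(state, log, start, until):
    """Apply one batch unless its offsets were already committed."""
    if until <= state["next_offset"]:
        return state  # batch was already applied, so a replay is a no-op
    if start != state["next_offset"]:
        raise ValueError("gap or overlap in offset ranges")
    records = read_range(log, start, until)
    # Result and offset are updated together, standing in for an atomic write.
    return {"next_offset": until, "total": state["total"] + sum(records)}


log = [5, 1, 4, 2, 8, 3]
state = {"next_offset": 0, "total": 0}
state = apply_batch(state, log, 0, 3)
state = apply_batch(state, log, 0, 3)  # replay after a simulated failure
state = apply_batch(state, log, 3, 6)
```

Because a failed batch can be recomputed from the same offset range and the replay is detected from the stored offset, each record contributes to the total once.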
GraphX
GraphX adds a handful of utility functions in this release, including conversion into a canonical edge graph.
Upgrading to Spark 1.3
Spark 1.3 is binary compatible with Spark 1.X releases, so no code changes are necessary. This excludes APIs explicitly marked as unstable.
As part of stabilizing the Spark SQL API, the SchemaRDD class has been renamed to DataFrame. Spark SQL’s migration guide describes the upgrade process in detail. Spark SQL also now requires that column identifiers which use reserved words (such as “string” or “table”) be escaped using backticks.
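For applications that build query strings, the backtick requirement can be handled with a small helper. The sketch below is plain Python and makes several assumptions: the RESERVED_WORDS set contains only the two words named above and is not the full list, the helper names are hypothetical, and identifiers that already contain a backtick are rejected rather than escaped.

```python
# Illustrative subset only; the real list of reserved words is longer.
RESERVED_WORDS = {"string", "table"}


def quote_identifier(name):
    """Wrap an identifier in backticks when it collides with a reserved word."""
    if "`" in name:
        raise ValueError("identifier must not contain a backtick: %r" % name)
    if name.lower() in RESERVED_WORDS:
        return "`%s`" % name
    return name


def build_select(columns, table):
    """Build a simple SELECT statement with escaped identifiers."""
    cols = ", ".join(quote_identifier(c) for c in columns)
    return "SELECT %s FROM %s" % (cols, quote_identifier(table))


# A column named "string" is escaped, while "id" and "events" are left alone.
query = build_select(["id", "string"], "events")
```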
Known Issues
This release has a few known issues, which will be addressed in Spark 1.3.1:
- SPARK-6194: A memory leak in PySpark’s collect().
- SPARK-6222: An issue with failure recovery in Spark Streaming.
- SPARK-6315: Spark SQL can’t read parquet data generated with Spark 1.1.
- SPARK-6247: Errors analyzing certain join types in Spark SQL.
Credits
- Aaron Davidson – Bug fixes in Core
- Alex Baretta – Improvement in Core
- Alex Liu – Improvements in Core and SQL; bug fixes in SQL
- Alexander Bezzubov – Documentation in Core
- Alexander Ulanov – Umbrella in MLlib; documentation in Core and MLlib; new features in MLlib
- Andrew Ash – Documentation in Core
- Andrew Or – Improvements in Core and YARN; bug fixes in Core and YARN
- Andrew Rowson – Bug fixes in YARN
- Andrey Zagrebin – Improvements in Core and PySpark
- Antonio Navarro Perez – Documentation in Core
- Ben Cook – Test in MLlib and PySpark; improvements in PySpark and SQL; new features in Core
- Bilna P – Test in Streaming
- Brennon York – New features in Core; bug fixes in Core, GraphX, and scheduler; improvement in Core
- Burak Yavuz – Improvements in spark submit and MLlib; new features in Core and MLlib; bug fixes in Core and spark submit; documentation in Core and MLlib
- Cheng Hao – Improvements in SQL; new features in SQL; bug fixes in Core and SQL
- Cheng Lian – Documentation in Core; test in SQL; improvements in Core and SQL; bug fixes in Core, tests, and SQL; improvement in SQL
- Cheolsoo Park – Bug fixes in YARN
- Chip Senkbeil – Bug fixes in Core
- Christophe Preaud – Improvements in Core
- Cody Koeninger – Improvements in Streaming
- DB Tsai – Improvements in MLlib; documentation in Core and MLlib; new features in MLlib; bug fixes in MLlib; improvement in MLlib
- Dale Richardson – Improvement in Core
- Daniel Darabos – Bug fixes in Core
- Daoyuan Wang – Improvement in SQL; improvements in Core and SQL; new features in Core and SQL; bug fixes in SQL; documentation in Core
- David Y. Ross – Umbrella in Core
- Davies Liu – Improvements in PySpark; documentation in Core and PySpark; new features in Streaming and PySpark; bug fixes in Streaming, Core, PySpark, MLlib, and SQL; improvement in PySpark and SQL
- Derek Ma – Bug fixes in Shuffle
- Doing Done – Improvements in SQL
- Elmer Garduno – Bug fixes in Core
- Emre Sevinc – Documentation in Core and MLlib
- Eric Moyer – Documentation in Core
- Ernest – Improvements in Core and GraphX
- Evan Yu – Bug fixes in Core
- Fan Jiang – New features in MLlib
- Fernando Otero (ZeoS) – Improvements in MLlib
- Gabe Mulley – Bug fixes in PySpark and SQL
- Gang Li – Bug fixes in Core
- Gankun Luo – Improvements in Core; bug fixes in SQL
- Gaspar Munoz – Documentation in Core
- Gen TANG – Bug fixes in EC2
- Grzegorz Dubicki – Improvements in EC2
- Guo Wei – Bug fixes in SQL
- GuoQiang Li – Improvements in Core; bug fixes in Core and YARN
- Hari Shreedharan – Bug fixes in Streaming, tests, and YARN
- Holden Karau – Improvements in EC2
- Huang Zhaowei – Bug fixes in Core and YARN
- Hung Lin – Improvements in SQL
- Ilayaperumal Gopinathan – Bug fixes in Streaming
- Ilya Ganelin – Improvements in Core; bug fixes in Core and Shuffle
- Imran Rashid – Bug fixes in Core
- Iulian Dragos – Test in Streaming
- Ivan Vergiliev – Improvements in Core
- Jacek Lewandowski – Bug fixes in Core
- Jacky Li – Improvements in MLlib and SQL; new features in MLlib; bug fixes in MLlib and SQL
- Jakub Dubovsky – Improvements in MLlib
- Jeremy Freeman – Improvements in Streaming and PySpark; new features in Streaming and MLlib; bug fixes in MLlib and PySpark
- Jesper Lundgren – Bug fixes in Streaming
- Jongyoul Lee – Improvements in Core and Mesos; documentation in Streaming; bug fixes in Core, Mesos, and SQL
- Joseph J.C. Tang – Bug fixes in MLlib
- Joseph K. Bradley – New features in MLlib; umbrella in MLlib; documentation in Core and MLlib; improvement in MLlib; improvements in GraphX, MLlib, and SQL; bug fixes in Core, GraphX, PySpark, MLlib, and SQL
- Josh Rosen – Bug fixes in Core
- Josh Rosen – Improvements in Core, tests, EC2, and SQL; new features in Core; bug fixes in Core, tests, PySpark, Streaming, scheduler, SQL, spark submit, and Web UI
- Judy Nash – New features in SQL
- Kai Sasaki – Documentation in Core and PySpark; bug fixes in Core and MLlib
- Kanwaljit Singh – Bug fixes in Core
- Kashish Jain – Bug fixes in YARN
- Kay Ousterhout – Improvements in Web UI; new features in Core; bug fixes in Core and SQL
- Kazuki Taniguchi – New features in MLlib and PySpark
- Kenji Kikushima – Bug fixes in GraphX
- Kenneth Myers – Documentation in Streaming
- Kirill A. Korinskiy – Bug fixes in Web UI
- Kostas Sakellis – Improvements in Core, Web UI, and YARN; bug fixes in Core; improvement in Core
- Kousuke Saruta – Improvements in Core, Web UI, and YARN; new features in Streaming and PySpark; bug fixes in Core and Web UI; documentation in Core
- Kuldeep – Bug fixes in SQL
- Li Zhihui – Documentation in Core
- Liang-Chi Hsieh – Improvements in Core, MLlib, and SQL; test in Core; documentation in Core; bug fixes in Core and SQL
- Liangliang Gu – Bug fixes in Web UI
- Lianhui Wang – Improvements in YARN; bug fixes in Core and YARN
- Liu Hao – Bug fixes in GraphX
- Liu Jiongzhou – Bug fixes in MLlib
- Lu Yan – Improvements in SQL
- Lukasz Jastrzebski – Bug fixes in Core
- Madhu Siddalingaiah – Documentation in Core
- Makoto Fukuhara – Improvements in Core
- Manoj Kumar – Improvements in MLlib and PySpark; documentation in Core and MLlib
- Marcelo Vanzin – Improvements in Core and YARN; bug fixes in Core, PySpark, YARN, and SQL
- Markus Dale – Bug fixes in Core
- Martin Zapletal – Documentation in Core and MLlib; new features in MLlib
- Masayoshi TSUZUKI – Improvements in Web UI; bug fixes in Windows, Core, and YARN
- Matei Zaharia – Improvements in Core
- Matt Whelan – Bug fixes in Core
- Matthew Cheah – Bug fixes in Core
- Mayur Rustagi – Documentation in Streaming
- Meethu Mathew – New features in MLlib and PySpark
- Michael Armbrust – Improvements in Core; bug fixes in Core, MLlib, and SQL; improvement in SQL
- Michael Davies – Improvements in SQL
- Michael Nazario – Improvements and bug fixes in PySpark
- Mike Jennings – New features in EC2
- Mingyu Kim – Bug fixes in Core
- Nan Zhu – Improvements in Streaming; documentation in Core; bug fixes in Core and Streaming
- Nate Crosswhite – Improvements in MLlib and PySpark
- Nathan Kronenfeld – Bug fixes in Core
- Nathan McCarthy – Bug fixes in Core
- Nicholas Chammas – Improvements in EC2; umbrella in EC2; bug fixes in EC2; documentation in Core
- Nishkam Ravi – Bug fixes in Core
- Octavian Geagla – Improvements in MLlib
- Patrick Wendell – Improvements in Core; bug fixes in Core, tests, and Streaming; improvement in Core
- Paul Power – Documentation in Core
- Peishen Jia – New features in MLlib
- Peng Xu – Documentation in Core
- Peter Klipfel – Documentation in Core
- Peter Rudenko – Improvements in MLlib
- Peter Vandenabeele – Documentation in Core
- Prabeesh K – Improvements in Streaming
- Prashant Sharma – New features in Core; bug fixes in Core; improvement in Core and Web UI
- RJ Nowling – New features in MLlib and PySpark
- Ravindra Pesala – Improvements in SQL
- Reynold Xin – Improvements in Core, Shuffle, and SQL; documentation in Core; bug fixes in Core and SQL; improvement in Java API and SQL
- Reza Zadeh – Improvements in MLlib
- Ryan Williams – Improvements, bug fixes, and documentation in Core
- Sadhan Sood – Bug fixes in SQL
- Saisai Shao – Improvements in Streaming; bug fixes in Streaming, SQL, and Core; improvement in Streaming
- Sam Halliday – Improvements in Core
- Sandy Ryza – Improvements in Core and YARN; bug fixes in Core and YARN; improvement in YARN
- Sasaki Toru – Improvements in SQL
- Sean Owen – Documentation in Core; wish in Core; improvements in Java API, Core, MLlib, EC2, and Streaming; bug fixes in Core, tests, MLlib, YARN, Streaming, SQL, Java API, Web UI, and GraphX; improvement in Core
- Shekhar Bansal – Bug fixes in YARN
- Sheng Li – Improvements in Core and SQL; new features in SQL; bug fixes in SQL; documentation in Core
- Shixiong Zhu – Test in Core; improvement in Core; improvements in Streaming, SQL, Shuffle, YARN, and Core; bug fixes in Core, SQL, and Streaming; documentation in Core, YARN, and Streaming
- Shuo Xiang – New features in MLlib
- Soumitra Kumar – New features in Streaming
- Stephen Boesch – Documentation in Core and MLlib
- Stephen Haberman – Bug fixes in Core
- Su Yan – Improvements in Core; bug fixes in Core and Web UI
- Takayuki Hasegawa – Bug fixes in Project Infra
- Takeshi Yamamuro – Improvements in GraphX; documentation in Core and SQL; bug fixes in GraphX
- Takuya UESHIN – Improvements and bug fixes in SQL
- Tathagata Das – Improvements in Streaming; bug fixes in Core, Web UI, PySpark, tests, and Streaming
- Thomas Graves – Bug fixes in Core
- Thu Kyaw – Improvements in Core and SQL
- Timothy Chen – Documentation in Core
- Tingjun Xu – Improvements in Core; bug fixes in Core and YARN
- Tobias Schlatter – Improvements and bug fixes in Core
- Tom Panning – Bug fixes in SQL
- Tor Myklebust – Improvements in SQL
- Travis Galoppo – Improvements in MLlib; documentation in Core and MLlib; new features in MLlib
- Tsuyoshi Ozawa – Documentation in Core and YARN
- Uncle Gen – Improvements in spark submit and Web UI; bug fixes in Core
- Varun Saxena – Improvements in Core
- Venkata Ramana Gollamudi – Bug fixes in Core and SQL; improvement in Core
- Vladimir Grigor – Bug fixes in EC2
- Vladimir Vladimirov – Improvements in PySpark
- Wang Fei – Improvement in SQL; improvements in Web UI and SQL; bug fixes in SQL; documentation in Core
- Wang Tao – Improvements in Core and YARN; bug fixes in Core and YARN
- Wenchen Fan – Bug fixes in SQL
- Winston Chen – Bug fixes in PySpark
- Xiangrui Meng – Improvements in PySpark, Core, Streaming, EC2, and MLlib; documentation in Core and MLlib; new features in MLlib and PySpark; bug fixes in PySpark, MLlib, and SQL; improvement in MLlib and PySpark
- Xiaohua Yi – Bug fixes in SQL
- Xiaojing Wang – Test in SQL; improvements in SQL; documentation in Core
- Xu Kun – Bug fixes in Core
- Yadong Qi – Bug fixes in SQL; improvements in Streaming
- Yanbo Liang – Bug fixes in SQL, MLlib, and PySpark
- Yandu Oppacher – Improvements in PySpark
- Yantang Zhai – Improvements in Core and SQL; bug fixes in SQL
- Yash Datta – Bug fixes in SQL
- Ye Xianjin – Bug fixes in Core
- Yi Tian – Improvements and bug fixes in SQL
- Yin Huai – Documentation in Core; improvements in SQL; bug fixes in SQL; improvement in SQL
- Yuhao Yang – Improvements and bug fixes in MLlib
- Yuri Saito – Improvements in MLlib
- Yuu ISHIKAWA – New features in MLlib
- Zhan Zhang – Bug fixes in Core and YARN
- Zhang, Liye – Improvements in Core and Web UI; bug fixes in Core
Thanks to everyone who contributed!