Notes from Databricks’ blog post about the plans for Spark SQL

You should instead read the original blog post, Shark, Spark SQL, Hive on Spark, and the future of SQL on Spark, on Databricks’ blog. What follows are my notes, written to build up a better understanding of Spark SQL (and potentially to help the project by bringing more eyes to it).

  • Spark SQL = a new SQL engine for Spark
  • compatibility with Shark/Hive
  • the new Hive on Spark effort – https://issues.apache.org/jira/browse/HIVE-7292
  • ending development of Shark – all hands on Spark SQL = moving all Databricks’ development resources to Spark SQL
  • a superset of Shark’s features for existing Shark users
  • designed from the ground up to leverage the power of Spark
  • upgrade path from Shark 0.9 to Spark SQL to help migrate people from Hive to Spark
  • Hive (on MapReduce) was the only choice for SQL on Hadoop 3 years ago
  • Hive compiled SQL into scalable MapReduce jobs and could work with a variety of formats (through its SerDes)
    • performance was not satisfactory though = the deficiencies that made Hive slow were fundamental
    • no interactive queries
  • Shark became one of the first interactive SQL on Hadoop systems
    • the only one built on top of a general runtime (Spark)
  • Shark built on the Hive codebase and achieved performance improvements by swapping out the physical execution engine part of Hive
  • Shark inherited a large, complicated code base from Hive that made it hard to optimize and maintain.
    • constrained by a legacy code base that was designed for MapReduce.
  • Spark SQL supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
  • Spark SQL becomes the narrow waist for manipulating (semi-)structured data as well as for ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs (Enterprise Data Warehouses)
  • unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics – see the first sketch after this list
  • Spark SQL proposes a novel, elegant way of building query planners
    • easy to add new optimisations – see the second sketch after this list
  • Spark SQL is becoming the standard for SQL on Spark
  • Hive on Spark adds Spark as an alternative execution engine to Hive = migrating just the execution layer to Spark
  • Spark SQL will be the future of not only SQL, but also structured data processing on Spark.
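
To make the mix-and-match point concrete, here is a minimal sketch of combining declarative SQL with imperative RDD code, assuming the Spark 1.0-era API (SQLContext, SchemaRDD, parquetFile, registerAsTable). The events.parquet file and its level/message columns are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations like reduceByKey
import org.apache.spark.sql.SQLContext

object MixAndMatch {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("MixAndMatch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Ingest a self-describing source: Parquet files carry their own schema.
    val events = sqlContext.parquetFile("events.parquet")
    events.registerAsTable("events")

    // Declarative part: a plain SQL query...
    val errors = sqlContext.sql("SELECT level, message FROM events WHERE level = 'ERROR'")

    // ...whose result behaves like an ordinary RDD of rows, so imperative
    // transformations apply to it directly.
    val counts = errors
      .map(row => (row(1).toString.take(20), 1)) // key by a message prefix
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}
```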
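
And a toy illustration of the query-planner point: in Catalyst, query plans are immutable Scala trees, and each optimisation is a small pattern-matching rule applied to the tree. The classes below are not Catalyst’s actual API, only the shape of the idea.

```scala
// Toy plan tree, not Catalyst's real classes.
sealed trait Plan
case class Scan(table: String) extends Plan
case class Filter(predicate: String, child: Plan) extends Plan

object CombineFilters {
  // One "rule": collapse two stacked filters into a single one.
  def apply(plan: Plan): Plan = plan match {
    case Filter(p1, Filter(p2, child)) => apply(Filter(s"($p1) AND ($p2)", child))
    case Filter(p, child)              => Filter(p, apply(child))
    case other                         => other
  }
}

// CombineFilters(Filter("a > 1", Filter("b < 2", Scan("t"))))
// ==> Filter("(a > 1) AND (b < 2)", Scan("t"))
```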

BTW, the examples in the Spark SQL Programming Guide work like a charm. They’re so easy to run and comprehend that at first I couldn’t understand what the fuss was about.
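
For reference, a sketch in the style of those guide examples, again assuming the Spark 1.0-era API; the schema is inferred from the Person case class via reflection.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Schema is inferred from the case class fields (name: String, age: Int).
case class Person(name: String, age: Int)

object GuideStyleExample {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("GuideStyleExample").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicitly turns RDD[Person] into a SchemaRDD

    // Register an in-memory RDD as a table and query it with SQL.
    val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 17)))
    people.registerAsTable("people")

    val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.map(row => "Name: " + row(0)).collect().foreach(println)

    sc.stop()
  }
}
```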
