Learning #Spark from @OReillyMedia – Look no further for the definitive guide to big data analytics with Apache Spark

This review covers the Early Release (Raw & Unedited) version of the book Learning Spark: Lightning-Fast Big Data Analytics from O’Reilly. I read the book in mobi format on a Kindle Paperwhite. It was a review copy from O’Reilly.

tl;dr Read the book if you’re curious about Apache Spark and are looking for a more systematic way to learn its features. Even at this stage of writing, the book can be very useful both for newcomers to the field and for more experienced people.

I have never worked, commercially or as a hobby, on any project that claimed to be some sort of Big Data solution, and Apache Spark was no exception. My interest in Spark grew because of the Scala language the project is developed in (I was reviewing the changes to the source code) and because, at some point in the project’s history, the developers became more inclined to use sbt, which I hoped to help both myself and them understand. Later on, it turned out that I could learn not only Scala and sbt, but also Akka and the concept of data stream mining with Spark Streaming (I had also been considering Storm, but it’s in Clojure, which I once left aside for Scala). The Spark shell (based on the Scala REPL) made learning so much easier that I was sure I would stay with Spark for longer. I also developed an Activator template to let others get started with Spark Streaming – Spark Streaming with Scala and Akka (spark-streaming-scala-akka).

With a few weeks of learning Spark under my belt, I needed a book to overcome the initial hurdles and reach a higher level of confidence in applying Spark where it would fit well. I simply needed a mentor who would guide me through the “what, when, how” of Spark, and the book did that far beyond my expectations.

There are already 5 chapters of a quality I didn’t expect from a book in early release. I must admit that the content is polished, and having read the chapters I want more of it. The book is written by committers to the project, and their writing style is very engaging, with enough theory and code samples in Java, Scala and Python. There are many use cases for which Spark is a valid software offering, and I can hardly imagine how much my Spark skills will have grown after the chapters yet to come, such as Advanced Programming with RDDs, Spark Architecture and Deploying Spark. Waiting for them to show up is undoubtedly going to be a painful experience.

If the 5 chapters (out of 13 planned) are any indication of what the book is going to look like in its final version, I’m fully confident of its success – it’s going to be the bestseller in the area of Big Data Analytics. No programming language – out of Java, Scala and Python – is favoured. As the authors point out in the initial pages, they show examples of using Spark in all three programming languages, and they do so for each and every use case. That’s also one of the selling points of Spark that the book highlights very well – the samples are simple to comprehend, almost no-brainers, and can easily fit on a page, even in all three languages. Without Spark the samples would not have been so easy to implement and would have required much more from the implementer, be it an engineer or a data analyst. The book demonstrates this well.

While we’re at it, those two job titles – software engineer and data analyst – are the audiences the book targets. It was this book that helped me notice the difference between them and how Spark blends their needs into a single software offering. After the 5 chapters Spark appears so simple that I doubt anything can surprise me that is not a bug or an intended (yet surprising) feature. The book has helped me build confidence in my understanding of the benefits of using Spark in my projects, and I’m really looking forward to reading the remaining chapters. I’m hoping that the authors and the publisher won’t make me wait too long.

This entry was posted in Books, Frameworks, Tools.

2 Responses to Learning #Spark from @OReillyMedia – Look no further for the definitive guide to big data analytics with Apache Spark

  1. abicz says:

    Jacek, can you be more specific about what you have learned from this book and what kind of project you are going to use it in?

    • The five chapters have helped me fill the gaps in the knowledge I had gained on my own. I would often find a sentence that explained more to me than I had managed to understand before. I remember one Aha moment when I realized that Spark hides the complexity of distributing your stream processing over the nodes without you having to say much about how it should be done – you just declare what needs to be done, not where and how. No specific projects so far – just a couple of ideas where Spark could be a good fit. There’s nothing really going on yet.
