Spark Starter Kit

Mihai Chelaru-Centea

Online Course by Hadoop in Real World

Resource last updated: June 1, 2017


Overview

If you haven't already done the Hadoop Starter Kit course and don't know what Hadoop is or how it works, then I recommend you check the course out or look at my review on the subject. This course explains the core concepts of Spark, as well as where it fits into the Hadoop ecosystem and what challenges it specifically addresses. There's also a short chapter at the end of the course for those unfamiliar with Scala.

Format

Once again the course has a short runtime of about 3.5 hours, broken up into chapters much like the Hadoop Starter Kit course. Unlike that course, however, there are no quizzes at the end of each section to test whether you were paying attention. The material is a mix of theory and practical examples, including live demonstrations of code executing on a Spark cluster and a look at the Web UI, which displays logs for Spark jobs run on the cluster.

Content

The start of the course is all about context, laying the foundations for why Spark was developed and what limitations of Hadoop's MapReduce implementation it tries to address. In particular, it focuses on why Spark is faster than MapReduce but does not completely replace Hadoop, since it provides no alternative to HDFS. Spark still requires some sort of distributed file system to operate on, so you could conceivably run Spark and Hadoop on the same cluster.
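To make that relationship concrete (this is my own sketch, not an example from the course), here's roughly what a minimal Spark job reading from and writing to HDFS looks like in Scala; the namenode host and file paths are hypothetical placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object HdfsWordCount {
      def main(args: Array[String]): Unit = {
        // Spark handles the computation; the data itself still lives in HDFS
        // (or some other distributed file system).
        val sc = new SparkContext(new SparkConf().setAppName("HdfsWordCount"))

        // Hypothetical HDFS path; substitute your own namenode and file.
        val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs://namenode:8020/data/output")
        sc.stop()
      }
    }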

An entire chapter is devoted to RDDs (resilient distributed datasets) and how they allow Spark to ensure fault tolerance even though it computes in memory. Quite a bit of time is also spent explaining the difference between a logical plan and a physical plan, as well as narrow and wide dependencies. There are also some nice tidbits about how to know when you might need to cache RDDs in memory.
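As a rough illustration of those ideas (again my own sketch, not code from the course), the snippet below chains transformations, caches an RDD that is reused by two actions, and only triggers execution when an action runs; the paths and log format are made up for the example:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddCachingSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RddCachingSketch"))

        // Transformations only build up the logical plan; nothing executes yet.
        val events = sc.textFile("hdfs://namenode:8020/logs/events.txt") // hypothetical path
        val errors = events.filter(_.contains("ERROR"))                  // narrow dependency

        // Worth caching because 'errors' is reused by two separate actions below.
        errors.cache()

        // Actions turn the logical plan into a physical plan and run it.
        val total = errors.count()
        val byService = errors
          .map(line => (line.split(" ")(1), 1)) // assumes the service name is the second field
          .reduceByKey(_ + _)                   // wide dependency: requires a shuffle
          .collect()

        println(s"$total errors: ${byService.mkString(", ")}")
        sc.stop()
      }
    }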

Another chapter is dedicated to the core concepts of memory management and fault tolerance, two of the big challenges of in-memory computing that Spark had to overcome in its implementation. Understanding these concepts lays a solid foundation for any future learning, and they might also show up in an interview question or two.

The Scala chapter is quite a short introduction, but it does the job for the purposes of the course, although I wish they'd placed it before any Scala is used in the course instead of at the very end. It's a bit jarring to have to skip to the end for information you need in the middle.
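For anybody who has never seen Scala, the features the Spark examples lean on most are anonymous functions and method chaining, along the lines of this small snippet (mine, not the course's):

    // Anonymous functions and method chaining, the Scala style Spark code relies on.
    val numbers = List(1, 2, 3, 4, 5)
    val sumOfSquaresOfEvens = numbers
      .filter(n => n % 2 == 0) // keep the even numbers
      .map(n => n * n)         // square them
      .sum                     // add them up
    println(sumOfSquaresOfEvens) // prints 20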

Difficulty

As with the Hadoop Starter Kit course, this one is geared towards beginners with no experience whatsoever with Spark. You do need to know about Hadoop, in particular how HDFS works, and have some programming knowledge, or you might not fully understand some of the practical examples. The lectures are scripted and have a good logical flow, so the course should be accessible to learners at any level.

The Bottom Line


Another free course that offers great value in its short runtime. There isn't a lot of unnecessary repetition, so you can watch all of it and feel your time is well spent. I'd recommend it to anybody as a nice primer into the world of Spark. A fair number of data scientist positions ask for at least some familiarity with the Hadoop ecosystem, Spark included, so why not take this course and see what it's all about?