Two weeks ago I had zero experience with Spark, Hive, or Hadoop. Two weeks later I was able to reimplement Artsy sitemaps using Spark and even gave a “Getting Started” workshop to my team (with some help from @izakp). I’ve also made some pull requests into Hive-JSON-Serde and am starting to really understand what’s what in this fairly complex, yet amazing ecosystem.
This post will get you started with Hadoop, HDFS, Hive and Spark, fast.
What is Spark?
Apache Spark is a fast, general-purpose engine for large-scale data processing. You can write code in Scala or Python and it will automagically parallelize itself on top of Hadoop. Under the hood it basically runs map/reduce.
What is Hadoop and HDFS?
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. In practice it's a common library, called Hadoop Common, and a framework, called Hadoop MapReduce, that sit on top of a distributed file system called HDFS.
What is Yarn?
The Hadoop Distributed File System, or HDFS, is a way to distribute file system data across a bunch of workers. Job scheduling and cluster resource management on top of it are handled by a system called Yarn.
What is Hive?
Hadoop alone doesn't know much about the structure of your data and deals with text files. Most humans work with SQL, so the Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed HDFS storage using SQL. It lets you create and query a SQL schema on top of text files, which can be in various formats, including the usual CSV or JSON.
Running Locally
A good place to start is to run a few things locally.
Hadoop
On OSX run brew install hadoop, then configure it (This post was helpful.) I turned the configuration into a script in my dotfiles. Once installed you can run hstart. Once everything is up you can navigate to the HDFS NameNode web UI on http://localhost:50070 or the Yarn resource manager on http://localhost:8088 and run a small test.
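A small test could look something like this (a sketch assuming a Hadoop 2.x Homebrew install; swap X.Y.Z for your version):

```bash
# create your HDFS home directory and make sure HDFS responds
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -ls /

# run the bundled MapReduce example to estimate pi
hadoop jar /usr/local/Cellar/hadoop/X.Y.Z/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-X.Y.Z.jar pi 10 100
```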
Hive
Once you’ve installed Hadoop, install Hive. On OSX run brew install hive, then configure it. I turned the configuration into a script in my dotfiles as well. The biggest difficulty is that you need to initialize a metastore, where Hive keeps its table metadata, with schematool -initSchema -dbType derby (or another dbType, such as mysql or postgres). In the case of Derby, the metastore_db directory is created wherever you happen to run the command from, so it needs to be pinned to a fixed location via hive-site.xml.
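For the Derby case, the relevant hive-site.xml property looks something like this (the /usr/local/var/hive path is just a suggestion, any fixed directory will do):

```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <!-- pin the Derby metastore to a fixed directory instead of wherever you happen to be -->
    <value>jdbc:derby:;databaseName=/usr/local/var/hive/metastore_db;create=true</value>
  </property>
</configuration>
```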
Once installed you can run hive and get a hive> prompt.
Before we do that, let’s get some data into HDFS.
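For example (gdp.csv is a hypothetical comma-separated file of country, year and GDP rows that the rest of this post will reuse):

```bash
# make a data directory in HDFS and upload a local file into it
hdfs dfs -mkdir -p /user/$USER/data
hdfs dfs -put gdp.csv /user/$USER/data/
hdfs dfs -ls /user/$USER/data
```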
Load data into Hive. If you have existing data files you can just use those and add a schema on top of them with CREATE EXTERNAL TABLE.
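At the hive> prompt, a sketch of that might look like this, assuming the hypothetical gdp.csv from above:

```sql
-- the table just points at the files already sitting in HDFS
CREATE EXTERNAL TABLE gdp (
  country STRING,
  `year` INT,
  gdp DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/<your-user>/data';  -- adjust to where you put the file

-- and now it's queryable with plain SQL
SELECT * FROM gdp LIMIT 10;
```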
Spark
Neither Hadoop nor Hive is a prerequisite for running Spark. On OSX install it with brew install apache-spark, then run ./bin/run-example SparkPi 10 from your installation in /usr/local/Cellar/apache-spark/X.Y.Z. You should see Pi is roughly 3.1413551413551413.
You can run a Spark shell with spark-shell. Let’s play with a sample dataset of country GDPs (Update: unfortunately lost forever) curated for us by @ByzantineFault during a recent @ArtsyOpenSource Spark workshop.
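Since the original dataset is gone, here’s a hedged stand-in: assume the hypothetical gdp.csv we put into HDFS earlier, with one country,year,gdp row per line, and load it in spark-shell:

```scala
// read the raw lines out of HDFS; adjust the path to wherever you put the file
val lines = sc.textFile("hdfs:///user/<your-user>/data/gdp.csv")
lines.take(3).foreach(println)
```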
We can turn this data into a proper resilient distributed dataset (RDD).
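For example, by parsing each line into a (country, gdp) pair (the field positions are assumptions about the stand-in file):

```scala
// split each CSV line and keep a (country, gdp) pair per row
val gdpByCountry = lines
  .map(_.split(","))                              // Array(country, year, gdp)
  .map(fields => (fields(0), fields(2).toDouble)) // (country, gdp)
```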
Let’s calculate the total GDP for each country during the years surveyed.
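With the pair RDD above this is a one-line reduceByKey:

```scala
// sum GDP per country across the surveyed years and bring the result back to the driver
val totals = gdpByCountry.reduceByKey(_ + _)
totals.collect().foreach { case (country, total) => println(s"$country: $total") }
```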
We can also get our data from the Hive installation above (Spark comes with its own “standalone” Hive, too) by symlinking hive-site.xml into Spark’s libexec/conf/hive-site.xml, as in my dotfiles.
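Once the metastore is visible to Spark you can query the hypothetical gdp table straight from spark-shell. This sketch assumes Spark 1.x, where the shell exposes a HiveContext as sqlContext; in Spark 2.x+ use spark.sql instead:

```scala
// run HiveQL against the shared metastore from spark-shell (Spark 1.x style)
val totalGdp = sqlContext.sql("SELECT country, SUM(gdp) AS total_gdp FROM gdp GROUP BY country")
totalGdp.show()
```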
Next Steps
The next step should be to write a Spark job in Scala. If you want to share one, I’ll add it to this tutorial.
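In the meantime, here’s a minimal sketch of what such a job could look like (Spark 1.x style, reusing the hypothetical gdp.csv; package it with sbt and run it with spark-submit --class TotalGdp):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// a standalone version of the spark-shell session above
object TotalGdp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TotalGdp")
    val sc = new SparkContext(conf)

    sc.textFile("hdfs:///user/<your-user>/data/gdp.csv")
      .map(_.split(","))                              // country,year,gdp
      .map(fields => (fields(0), fields(2).toDouble)) // (country, gdp)
      .reduceByKey(_ + _)                             // total GDP per country
      .saveAsTextFile("hdfs:///user/<your-user>/data/gdp-totals")

    sc.stop()
  }
}
```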