When we run Databricks training for Apache Spark, we generally emphasize how to launch a Spark shell on a laptop and work from that basis. It’s a great way to get started using Spark. For many people who use Spark in production, it’s also the preferred approach for developing apps.
To wit, once you have an application running correctly on your laptop with a relatively small data set, then move your app to a cluster and work with data at scale. Many organizations provide Hadoop clusters on which it is quite simple to launch a Spark app.
If you don’t have a cluster available already, another good approach is to leverage cloud computing. One quick and easy way to launch a Spark cluster in the cloud is to run atop Apache Mesos on the Google Compute Engine cloud service. This is both simple for a beginner to get started, and robust at scale for production use.
#no #hadoop #needed
The following five steps show how to launch, use, and monitor Spark running on a Mesos cluster on the Google Cloud Platform. Well, more like seven steps – if you include a brief wait time while the VMs launch, plus your little happy dance while celebrating at the end.
Step 1: Set up your GCP account

Set up an account on Google Cloud Platform by going to https://console.developers.google.com/project and creating a project. Let’s use a project called `spark-launch-00` for this example. Once that is created, be sure to click on the Billing link and arrange your payment details.
Step 2: Launch a Mesosphere cluster

Next, check out the free-tier service Mesosphere for Google Cloud Platform by going to https://google.mesosphere.io/ and launching a cluster. That requires a login from your Google account. Then click the +New Cluster button to get started. You will be prompted to choose a configuration for your Mesosphere cluster. For this example, click on the Select Development button to run a four-node cluster.
Then you need to provide both a public SSH key and a GCP project identifier to launch a cluster with this service. If you need help on the former, Mesosphere provides a tutorial for how to generate SSH keys. Copy and paste your SSH public key into the input field and click the Create button.
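If you don’t have a key pair handy, here’s a minimal sketch of generating one — the key path `~/.ssh/mesosphere_gcp` is just an example, any path works:

```shell
# Generate a passphrase-less RSA key pair for the Mesosphere launch form.
# The key path here is only an example; adjust to taste.
mkdir -p "$HOME/.ssh"
ssh-keygen -q -t rsa -b 2048 -N "" -f "$HOME/.ssh/mesosphere_gcp"

# Print the public half -- this is the text to paste into the input field.
cat "$HOME/.ssh/mesosphere_gcp.pub"
```

The private half stays on your laptop; only the `.pub` file ever gets pasted into the form.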
|Google Cloud Console|
Next, go to the Google Cloud Console in your browser and click on the Projects link. Find the GCP project identifier in the second column of the table that lists your projects. In this example, we’ll use `spark-launch-00` for that identifier. Copy and paste that string into the input field and click the Next button.
The price tag for this development configuration will run you a walloping total of approximately US$0.56 per hour. Mesosphere charges absolutely nada on top of the cost of the VMs. Depending on how long you run the example below, it should cost much less than the price of a respectable chai latte. You’re welcome.
Wait…It will take a few minutes for those VMs to launch and get configured. You can use this precious time to meditate for some serious non-thinking, or catch up on YouTube videos. Or something.
Within a mere matter of minutes, you should receive a delightful email message from Mesosphere, indicating that your new cluster is ready to roll. Or, if you’re impatient, or OCD, or something, then just keep refreshing either the GCP console or the Mesosphere cluster console. Or refresh both, if you must. In any case, you should see the VMs updating.
Step 3: The Master and The Margarita

Check your Mesosphere cluster console in the browser, and scroll down to the Topology section. There should be one VM listed under the Master section, with both internal and external IP addresses shown for it. Copy the internal IP address for the Mesos master, and make a note of its external IP address.
|Mesosphere Cluster Console|
Next, you need to log in to the Mesos master through SSH. You could use the OpenVPN configuration through the Mesosphere console – which is great for production use, but it involves a steeper learning curve for those who are just getting started. It’s much simpler to log in through the GCP console:
- click on the project name (`spark-launch-00` in this example)
- click on the Compute section
- click on the Compute Engine subsection
- click on the VM instances subsection

In this example, the table showed the external IP address `184.108.40.206` for the master. You’ll need to change that to whatever your master’s external IP address happens to be… Anywho, click on the SSH button for the master to launch a terminal window in your browser.
Once the SSH login completes, you must change to the `jclouds` user:

```bash
sudo bash
su - jclouds
```

Next, set an environment variable to point to your Mesos master. In this example, the internal IP address was `10.224.168.177` for the master:

```bash
export MESOS_MASTER=10.224.168.177
```
Now let’s download a binary distribution for Apache Spark. This example uses the 1.0.2 release, pre-built for Hadoop 2:

```bash
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz
tar xzvf spark-1.0.2-bin-hadoop2.tgz
```

Great. We need to configure just a few variables:

```bash
cd spark-1.0.2-bin-hadoop2/conf
cp spark-env.sh.template spark-env.sh
echo "export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so" >> spark-env.sh
echo "export SPARK_EXECUTOR_URI=http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz" >> spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
echo "spark.mesos.coarse=true" >> spark-defaults.conf
cd ..
```

Bokay, ready to roll. Launch a Spark shell that points to the Mesos master:

```bash
./bin/spark-shell --master mesos://$MESOS_MASTER:5050
```

Once the Spark shell launches, you should see a `scala>` prompt.
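If the shell fails to launch, one quick thing to check is whether the settings from the configuration step actually landed in the two files. A small sketch, run from the `spark-1.0.2-bin-hadoop2` directory:

```shell
# grep exits non-zero if a line is missing, so these double as assertions.
grep "SPARK_EXECUTOR_URI" conf/spark-env.sh
grep "spark.mesos.coarse" conf/spark-defaults.conf
```

If either command prints nothing, re-run the `echo ... >>` lines from the configuration step.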
Step 4: Run a simple Spark app

Next, let’s run a simple Spark app to verify your cluster operation. Copy and paste the following two-liner at the `scala>` prompt:

```scala
val data = 1 to 10000
sc.parallelize(data).sum()
```

That code will do three things:

- parallelize ten thousand numbers across your cluster as an RDD
- sum them together
- print the result on your console

The result should be a `5` followed by zeros, with another `5` in its midst. Specifically, for the OCD math geeks in the audience, the value needs to be the same as `(10000*10001)/2`, at least in the northern hemisphere.
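You can verify that expected value without any cluster at all, using Gauss’s closed form for the sum of the first n integers in plain shell arithmetic:

```shell
# Sum of 1..n equals n*(n+1)/2 -- here n = 10000.
n=10000
echo $(( n * (n + 1) / 2 ))   # prints 50005000
```

So the number you see in the Spark shell output should be 50005000 (possibly rendered as a Double, since `sum()` returns one).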
Step 5: Welcome to the Panopticon

For extra goodness, check out the Spark and Mesos consoles while your Mesos cluster is running Spark apps. First, open some TCP ports by adding a firewall rule… Go back to the GCP console window in your browser and locate the Mesos master, then click on any link in the Network column. Next, under the Firewall rules section, click on the Create new link:
- give it a name of your choosing
- copy and paste the following into the Protocols and Ports input field: `tcp:4040; tcp:5050`
- click on the shiny blue Create button
In this example, the master’s external IP address was `184.108.40.206`. For the Spark console, open `http://184.108.40.206:4040` in your browser. For the Mesos console, open `http://184.108.40.206:5050` in your browser. Of course, you’ll need to substitute the external IP address for your cluster.
Then click through these consoles to see how the cluster resources are getting used. The Spark docs give more details about monitoring.
Finally, after a good gawk through the monitoring consoles, you’ll probably want to shut down the cluster. Go back to the Mesosphere cluster window and click on the shiny red Destroy cluster button… the jolly, candy-like button.
Celebrate

Congrats, you have just run a Spark app on a Mesos cluster, based on the Mesosphere free-tier service, which you launched on Google Cloud Platform. That’s quite an accomplishment! Do you feel all DevOps-ish suddenly?
It’s time to celebrate, ergo the Margarita.