Spark atop Mesos on Google Cloud Platform
When we run Databricks training for Apache Spark, we generally emphasize how to launch a Spark shell on a laptop and work from that basis. It’s a great way to get started using Spark, and for many people who use Spark in production it’s also the preferred approach for developing apps.
To wit: once you have an application running correctly on your laptop with a relatively small data set, move your app to a cluster and work with data at scale. Many organizations provide Hadoop clusters on which it is quite simple to launch a Spark app.
If you don’t have a cluster available already, another good approach is to leverage cloud computing. One quick and easy way to launch a Spark cluster in the cloud is to run atop Apache Mesos on the Google Compute Engine cloud service. This is both simple for a beginner to get started, and robust at scale for production use.
#no #hadoop #needed
The following five steps show how to launch, use, and monitor Spark running on a Mesos cluster on the Google Cloud Platform.
Well, more like seven steps – if you include a brief wait time while the VMs launch, plus your little happy dance while celebrating at the end.

Step 1: Set up your GCP account
Set up an account on Google Cloud Platform by going to https://console.developers.google.com/project and creating a project. Let’s use a project called spark-launch-00 for this example.
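If you’re more of a terminal person and have the gcloud CLI installed, roughly the same thing can be done from the command line – consider this a sketch rather than gospel, and not something the walkthrough below depends on:

# create the project used throughout this example
gcloud projects create spark-launch-00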
Once that is created, be sure to click on the Billing link and arrange your payment details.

Step 2: Launch a Mesosphere cluster
Next, check out the free-tier service Mesosphere for Google Cloud Platform by going to https://google.mesosphere.io/ and launching a cluster. That requires a login from your Google account. Then click the +New Cluster button to get started.

You will be prompted to choose a configuration for your Mesosphere cluster. For this example, click on the Select Development button to run a four-node cluster.

Then you need to provide both a public SSH key and a GCP project identifier to launch a cluster with this service. If you need help with the former, Mesosphere provides a tutorial for how to generate SSH keys. Copy and paste your SSH public key into the input field and click the Create button.
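BTW, if you don’t already have a key pair handy, here’s a minimal sketch using OpenSSH on your laptop – the file name is just an example:

# generate a new RSA key pair for the Mesosphere service
ssh-keygen -t rsa -f ~/.ssh/mesosphere_gcp
# print the public half, to copy and paste into the input field
cat ~/.ssh/mesosphere_gcp.pub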
Next, go to the Google Cloud Console in your browser and click on the Projects link. Find the GCP project identifier in the second column of the table that lists your projects. In this example, we’ll use spark-launch-00 for that identifier.
Copy and paste that string into the input field and click the Next button.

The price tag for this development configuration will run you a walloping total of approximately US$0.56 per hour. Mesosphere charges absolutely nada on top of the cost of the VMs. Depending on how long you run the example below, it should cost much less than the price of a respectable chai latte. You’re welcome.
Wait…
It will take a few minutes for those VMs to launch and get configured. You can use this precious time to meditate for some serious non-thinking, or catch up on YouTube videos. Or something.

Within a mere matter of minutes, you should receive a delightful email message from Mesosphere, indicating that your new cluster is ready to roll. Or, if you’re impatient, or OCD, or something, then just keep refreshing either the GCP console or the Mesosphere cluster console. Or refresh both, if you must. In any case, you should see the VMs updating.
Step 3: The Master and The Margarita
Check your Mesosphere cluster console in the browser, and scroll down to the Topology section. There should be one VM listed under the Master section, with both internal and external IP addresses shown for it. Copy the internal IP address for the Mesos master, and make a note of its external IP address.
Next, you need to log in through SSH to the Mesos master. You could use the OpenVPN configuration through the Mesosphere console – which is great for production use, but a steeper learning curve for those who are just getting started. It’s much simpler to log in through the GCP console:

- click on the spark-launch-00 project link
- click on the Compute section
- click on the Compute Engine subsection
- click on the VM instances subsection
In this example, the console showed an external IP address of 146.148.63.167 for the master. You’ll need to change that to whatever your master’s external IP address happens to be…
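As an alternative to the browser-based SSH described next, the Cloud SDK can do the same thing from a local terminal – a sketch, with a hypothetical instance name and zone; use whatever your VM instances page actually shows:

# SSH into the Mesos master VM from your laptop
gcloud compute ssh mesos-master --zone us-central1-a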
Anywho, click on the SSH button for the master to launch a terminal window in your browser.

Once the SSH login completes, you must change to the jclouds user:

sudo bash
su - jclouds
Next, set an environment variable to point to your Mesos master. In this example, the internal IP address was 10.224.168.177 for the master:

export MESOS_MASTER=10.224.168.177
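Optional, and not strictly part of the walkthrough: if you want to sanity-check that the master is answering before going further, you can ask its HTTP endpoint to describe the cluster as JSON – state.json being the endpoint name in Mesos releases of this vintage, and assuming curl is installed on the VM:

# ask the Mesos master for its view of the cluster
curl http://$MESOS_MASTER:5050/master/state.json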
Now let’s download a binary distribution for Apache Spark. This example uses the latest 1.0.2 production release:

wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz
tar xzvf spark-1.0.2-bin-hadoop2.tgz
Great. We need to configure just a few variables…

cd spark-1.0.2-bin-hadoop2/conf
cp spark-env.sh.template spark-env.sh
echo "export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so" >> spark-env.sh
echo "export SPARK_EXECUTOR_URI=http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz" >> spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
echo "spark.mesos.coarse=true" >> spark-defaults.conf
cd ..
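In case you’re wondering: the two exports tell Spark where to find the Mesos native library and tell Mesos slaves where to download the Spark binaries for executors, while spark.mesos.coarse=true selects coarse-grained mode – one long-running Mesos task per node, rather than one Mesos task per Spark task, trading fine-grained cluster sharing for lower task-launch latency. Later, once the shell is up, you can double-check that the setting got picked up from spark-defaults.conf – a sketch, typed at the scala> prompt:

// confirm that coarse-grained mode is enabled
sc.getConf.get("spark.mesos.coarse")   // res: String = true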
Bokay, ready to roll. Launch a Spark shell that points to the Mesos master:

./bin/spark-shell --master mesos://$MESOS_MASTER:5050
Once the Spark shell launches, you should see a scala> prompt. ¡Bueno!

Step 4: Run a simple Spark app
Next, let’s run a simple Spark app to verify your cluster operation. Copy and paste the following two-liner at the scala> prompt:

val data = 1 to 10000
sc.parallelize(data).sum()
That code will do three things:

- parallelize ten thousand numbers across your cluster as an RDD
- sum them together
- print the result on your console

The result shown should be a 5 followed by zeros, with another 5 in its midst. Specifically, for the OCD math geeks in the audience, the value needs to be the same as (10000*10001)/2, at least in the northern hemisphere.
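If you’d like to poke at the cluster a bit more before moving on, here’s a small variation to try at the same prompt – just a sketch, using standard RDD methods:

// spread the same range across an explicit number of partitions
val rdd = sc.parallelize(1 to 10000, 8)
// same sum, now computed by 8 tasks across the cluster
rdd.sum()              // res: Double = 5.0005E7, i.e., 50005000
// count the partitions -- these show up as tasks in the consoles below
rdd.partitions.size    // res: Int = 8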
Step 5: Welcome to the Panopticon
For extra goodness, check out the Spark and Mesos consoles while your Mesos cluster is running Spark apps. First, open some TCP ports by adding a firewall rule… Go back to the GCP console window in your browser and locate the Mesos master, then click on any link in the Network column. Next, under the Firewall rules section, click on the Create new link:

- give it a name, e.g., spark-console
- copy and paste the following into the Protocols and Ports input field: tcp:4040,5050
- click on the shiny blue Create button
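Alternatively, the same rule can be created from a terminal with the Cloud SDK – a hedged equivalent, assuming the default network:

# open the Spark (4040) and Mesos (5050) console ports
gcloud compute firewall-rules create spark-console --allow tcp:4040,tcp:5050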
Recall that in this example the GCP console showed an external IP address of 146.148.63.167 for the master.

For the Spark console, open http://146.148.63.167:4040 in your browser. For the Mesos console, open http://146.148.63.167:5050 in your browser.
Of course, you’ll need to substitute the external IP address for your cluster.

Then click through these consoles to see how the cluster resources are getting used. The Spark docs give more details about monitoring.
Finally, after a good gawk through the monitoring consoles, you’ll probably want to shut down the cluster. Go back to the Mesosphere cluster window and click on the shiny red Destroy cluster button… the jolly, candy-like button.
Celebrate
Congrats, you have just run a Spark app on a Mesos cluster, based on the Mesosphere free-tier service, which you launched on Google Cloud Platform. That’s quite an accomplishment! Do you feel all DevOps-ish suddenly?

It’s time to celebrate, ergo the Margarita.