2014-09-14

Data & Analytics Fellowship - O'Reilly Strata conf

Amplify Partners Data & Analytics Fellowship — designed for engineers, analysts, students, and anyone else passionate about data science, analytics, data-driven apps, and data infrastructure. The fellowship includes full conference registration, airfare, and hotel accommodation to attend the Strata NY conference, Oct 15-17 in NYC.

Fellows will be invited to join Amplify Partners along with a select group at a private dinner during the event, as well as for selected gatherings and Amplify Partners events ongoing throughout the year.

Applications are due Sep 30
http://www.amplifypartners.com/fellowships/amplify-partners-data-analytics-fellowship/

2014-09-10

Spark atop Mesos on Google Cloud Platform


When we run Databricks training for Apache Spark, we generally emphasize how to launch a Spark shell on a laptop and work from that basis. It’s a great way to get started using Spark. It’s also the preferred approach for developing apps, for many people who use Spark in production.

To wit, once you have an application running correctly on your laptop with a relatively small data set, then move your app to a cluster and work with data at scale. Many organizations provide Hadoop clusters on which it is quite simple to launch a Spark app.

If you don’t have a cluster available already, another good approach is to leverage cloud computing. One quick and easy way to launch a Spark cluster in the cloud is to run atop Apache Mesos on the Google Compute Engine cloud service. This is both simple for a beginner to get started, and robust at scale for production use. #no #hadoop #needed

The following five steps show how to launch, use, and monitor Spark running on a Mesos cluster on the Google Cloud Platform. Well, more like seven steps – if you include a brief wait time while the VMs launch, plus your little happy dance while celebrating at the end.

Step 1: Set up your GCP account

Set up an account on Google Cloud Platform by going to https://console.developers.google.com/project and creating a project. Let’s use a project called spark-launch-00 for this example. Once that is created, be sure to click on the Billing link and arrange your payment details.

Step 2: Launch a Mesosphere cluster

Next, check out the free-tier service Mesosphere for Google Cloud Platform by going to https://google.mesosphere.io/ and launching a cluster. That requires a login from your Google account. Then click the +New Cluster button to get started. You will be prompted to choose a configuration for your Mesosphere cluster. For this example click on the Select Development button to run a four-node cluster.

Then you need to provide both a public SSH key and a GCP project identifier to launch a cluster with this service. If you need help on the former, Mesosphere provides a tutorial for how to generate SSH keys. Copy and paste your SSH public key into the input field and click the Create button.

Google Cloud Console

Next, go to the Google Cloud Console in your browser and click on the Projects link. Find the GCP project identifier in the second column of the table that lists your projects. In this example, we’ll use spark-launch-00 for that identifier. Copy and paste that string into the input field and click the Next button.

Now it’s time to launch your cluster. Click the shiny purple Launch Cluster button. NB: do not click the shiny red History Eraser button.

The price tag for this development configuration will run you a walloping total of approximately US$0.56 per hour. Mesosphere charges absolutely nada on top of the cost of the VMs. Depending on how long you run the example below, it should cost much less than the price of a respectable chai latte. You’re welcome.
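
To put a number on that, here is a back-of-the-envelope sketch. The hourly rate is the approximate figure quoted above; the two-hour run time is purely an assumption for illustration, so plug in your own.

```shell
# Approximate hourly rate quoted above for the development configuration (US dollars).
RATE_PER_HOUR=0.56
# Assumption for illustration only: the cluster stays up for two hours.
HOURS=2

# POSIX shell arithmetic is integer-only, so let awk handle the multiplication.
total=$(awk -v rate="$RATE_PER_HOUR" -v hours="$HOURS" 'BEGIN { printf "%.2f", rate * hours }')
echo "Estimated cost: US\$${total}"
```

With those assumed values the estimate comes to US$1.12. Actual billing depends on Google's pricing at the time you run it.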

Wait…

It will take a few minutes for those VMs to launch and get configured. You can use this precious time to meditate for some serious non-thinking, or catch up on YouTube videos. Or something.

Within a mere matter of minutes, you should receive a delightful email message from Mesosphere, indicating that your new cluster is ready to roll. Or, if you’re impatient, or OCD, or something, then just keep refreshing either the GCP console or the Mesosphere cluster console. Or refresh both, if you must. In any case, you should see the VMs updating.

Step 3: The Master and The Margarita

Check your Mesosphere cluster console in the browser, and scroll down to the Topology section. There should be one VM listed under the Master section, with both internal and external IP addresses shown for it. Copy the internal IP address for the Mesos master, and make a note about its external IP address.

Mesosphere Cluster Console

Next, you need to log in through SSH to the Mesos master. You could use the OpenVPN configuration through the Mesosphere console – which is great for production use, but has a bit more of a learning curve for those who are just getting started. It’s much simpler to log in through the GCP console:
  1. click on the spark-launch-00 project link
  2. click on the Compute section
  3. click on the Compute Engine subsection
  4. click on the VM instances subsection
Then find your Mesos master, based on its external IP address. In this example, the external IP address was 146.148.63.167 for the master. You’ll need to change that to whatever your master’s external IP address happens to be… Anywho, click on the SSH button for the master to launch a terminal window in your browser.

Once the SSH login completes, you must change to the jclouds user:
sudo bash
su - jclouds
Next, set an environment variable to point to your Mesos master. In this example, the internal IP address was 10.224.168.177 for the master:
export MESOS_MASTER=10.224.168.177
Now let’s download a binary distribution for Apache Spark. This example uses the latest 1.0.2 production release:
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz
tar xzvf spark-1.0.2-bin-hadoop2.tgz
Great. We need to configure just a few variables…
cd spark-1.0.2-bin-hadoop2/conf

cp spark-env.sh.template spark-env.sh
echo "export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so" >> spark-env.sh
echo "export SPARK_EXECUTOR_URI=http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz" >> spark-env.sh
 
cp spark-defaults.conf.template spark-defaults.conf
echo "spark.mesos.coarse=true" >> spark-defaults.conf

cd ..
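
Before launching, it can be worth confirming that those three settings actually landed in the config files. Here is a minimal sketch, assuming you are still in the spark-1.0.2-bin-hadoop2 directory. The names has_setting and check_spark_conf are hypothetical helpers made up for this example, not part of Spark or Mesos.

```shell
# has_setting FILE TEXT: succeeds if FILE contains TEXT as a fixed string.
# (Hypothetical helper for this example.)
has_setting() {
    grep -qF -- "$2" "$1" 2>/dev/null
}

# check_spark_conf DIR: reports any of the three settings from the steps
# above that are missing from DIR; returns non-zero if one is absent.
check_spark_conf() {
    conf_dir="$1"
    rc=0
    has_setting "$conf_dir/spark-env.sh" "MESOS_NATIVE_LIBRARY=" \
        || { echo "missing: MESOS_NATIVE_LIBRARY in spark-env.sh"; rc=1; }
    has_setting "$conf_dir/spark-env.sh" "SPARK_EXECUTOR_URI=" \
        || { echo "missing: SPARK_EXECUTOR_URI in spark-env.sh"; rc=1; }
    has_setting "$conf_dir/spark-defaults.conf" "spark.mesos.coarse=true" \
        || { echo "missing: spark.mesos.coarse=true in spark-defaults.conf"; rc=1; }
    return "$rc"
}

# Run from the Spark directory; skipped when there is no conf/ here.
if [ -d conf ]; then
    check_spark_conf conf && echo "Spark config looks complete"
fi
```

This only checks that the text is present. It does not verify that the library path or the executor URI are valid on your cluster.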
Bokay, ready to roll. Launch a Spark shell that points to the Mesos master:
./bin/spark-shell --master mesos://$MESOS_MASTER:5050
Once the Spark shell launches, you should see a scala> prompt. ¡Bueno!
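
If the shell does not come up, one thing worth ruling out is an unset or mistyped MESOS_MASTER, since the export only lives in the session where you ran it. The sketch below builds the same mesos:// URL used above and complains if the value does not look like an IPv4 address. The function name mesos_master_url is a hypothetical helper for this example, and the check is deliberately loose.

```shell
# mesos_master_url IP: prints the Spark master URL for a Mesos master at IP,
# or fails if IP is not shaped like an IPv4 address.
# (Hypothetical helper; checks the shape only, not the range of each octet.)
mesos_master_url() {
    if printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        echo "mesos://$1:5050"
    else
        echo "not an IPv4 address: '$1'" >&2
        return 1
    fi
}

# Uses whatever MESOS_MASTER holds in the current session.
mesos_master_url "${MESOS_MASTER:-}" || echo "Set MESOS_MASTER to your master's internal IP first" >&2
```

With the example value exported earlier, this prints mesos://10.224.168.177:5050.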

Step 4: Run a simple Spark app

Next, let’s run a simple Spark app to verify your cluster operation. Copy and paste the following two-liner at the scala> prompt:
val data = 1 to 10000
sc.parallelize(data).sum()
That code will do three things:
  1. parallelize ten thousand numbers across your cluster as an RDD
  2. sum them together
  3. print the result on your console
The result should be a ginormous number beginning with 5 followed by zeros and another 5 in its midst: 50005000, which the Spark shell prints as a Double (5.0005E7). Specifically, for the OCD math geeks in the audience, the value needs to be the same as (10000*10001)/2, at least in the northern hemisphere.
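
If you want to double-check that figure without a cluster, the same arithmetic fits in a few lines of plain shell. This is only a local sanity check, not anything Spark does internally: it compares the closed form n(n+1)/2 against a brute-force loop.

```shell
# Closed form for 1 + 2 + ... + n, using integer shell arithmetic.
n=10000
closed_form=$(( n * (n + 1) / 2 ))

# Brute-force loop in awk as a cross-check.
brute_force=$(awk -v n="$n" 'BEGIN { s = 0; for (i = 1; i <= n; i++) s += i; print s }')

echo "closed form: $closed_form"
echo "brute force: $brute_force"
```

Both lines should report 50005000.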

Step 5: Welcome to the Panopticon

For extra goodness, check out the Spark and Mesos consoles while your Mesos cluster is running Spark apps. First, open some TCP ports by adding a firewall rule… Go back to the GCP console window in your browser and locate the Mesos master, then click on any link in the Network column. Next, under the Firewall rules section, click on the Create new link:
  1. give it a name, e.g., spark-console
  2. copy and paste the following into the Protocols and Ports input field: tcp:4040,5050
  3. click on the shiny blue Create button
That new firewall rule will take a few moments to propagate across your cluster, but you should see its notification updating on that web page. Once you see that the rule is in place, browse to the Spark and Mesos consoles using the Mesos master’s external IP address. In this example, the external IP address was 146.148.63.167 for the master.

For the Spark console, open http://146.148.63.167:4040 in your browser. For the Mesos console, open http://146.148.63.167:5050 in your browser. Of course, you’ll need to substitute the external IP address for your cluster.
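
To avoid retyping the address, you could build both URLs from a single variable. A tiny sketch follows; the variable names are made up for this example, and the IP shown is the one from this walkthrough, so swap in your own.

```shell
# Assumption: replace this with your own master's external IP address.
MASTER_EXTERNAL_IP="146.148.63.167"

# Ports match the firewall rule created above: 4040 for Spark, 5050 for Mesos.
spark_console_url="http://${MASTER_EXTERNAL_IP}:4040"
mesos_console_url="http://${MASTER_EXTERNAL_IP}:5050"

echo "Spark console: $spark_console_url"
echo "Mesos console: $mesos_console_url"
```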

Then click through these consoles to see how the cluster resources are getting used. The Spark docs give more details about monitoring.

Finally, after a good gawk through the monitoring consoles, you’ll probably want to shut down the cluster. Go back to the Mesosphere cluster window and click on the shiny red Destroy cluster button… the jolly, candy-like button.

Celebrate

Congrats, you have just run a Spark app on a Mesos cluster, based on the Mesosphere free-tier service, which you launched on Google Cloud Platform. That’s quite an accomplishment! Do you feel all DevOps-ish suddenly?

It’s time to celebrate, ergo the Margarita.

2014-07-31

NdGT pseudoscience

I'm thinking that Neil deGrasse Tyson, an astrophysicist, making broad claims about genomics, soil science, agronomics, etc., with obvious political outcomes, is sufficiently similar to William Shockley, a physicist who invented the transistor, making broad claims about population genetics, with obvious political outcomes. Both represent failures for science.

The notion that an observation over "tens of thousands of years" is scientifically valid for sweeping pronouncements about complex biological processes (e.g., our digestive tracts, topsoil ecosystems, beehive population dynamics, etc.) that have evolved over many millions of years -- that represents pseudoscience on the part of NdGT. Leading agronomists (who are not employed by Monsanto) such as at The Land Institute present vastly different opinions on the subject.

Moreover, many of the arguments against GMOs are based on the political process and business outcomes, not the science per se. For example, why should a transnational corporation spend many millions of dollars to prevent a state government from enacting reasonable laws -- or from even allowing voters to voice an opinion? That's just over the labeling... Clearly, the GMO issues *aren't* so much about the scientific aspects as they are about the commercial aspects.

Last time that I checked, NdGT was not qualified to act as an attorney. Nor should he be giving legal advice to voters. Which is what the subtext indicates.

Huge points off for NdGT in my book. Meanwhile, that guy gets funded by somebody. What are the political linkages and business agendas for his funders?