IPython Notebooks with PySpark
When you launch PySpark from the command line, you are greeted with a nice 1980s-style ASCII banner.
This is fine as far as it goes, but you can install IPython, a Python interactive shell with a ton of extra features, with a single pip command.
In order for PySpark to use the IPython shell instead of the default one, you will need to do the following.
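The original commands aren't reproduced here; a minimal sketch using PySpark's documented driver setting (the `IPYTHON=1` note applies to older 1.x releases and is my assumption about the version in use):

```shell
# Install IPython first (assumes pip is on your PATH):
pip install ipython

# Tell PySpark to use IPython as the driver shell (Spark 1.2+);
# older Spark 1.x releases used `IPYTHON=1` instead.
export PYSPARK_DRIVER_PYTHON=ipython

# Then launch the shell as usual from your Spark directory:
# ./bin/pyspark
```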
This will launch the PySpark shell with IPython. The first thing you will notice is that the prompt has changed from Python's default `>>>` to IPython's `In [1]:`.
This is a good indicator that you are now using IPython, and with IPython installed you get a lot more functionality.
You get code completion by pressing the TAB key.
You get what IPython calls magic commands by typing `%` and then pressing TAB.
This is an example of using the %time magic command, which measures how long a statement takes to run. The example iterates over the numbers 1 to 100 and joins them with a hyphen between each pair; %time reports how long the statement took.
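The snippet itself isn't reproduced here; a minimal sketch of that kind of timed statement could look like this (the exact expression in the original is an assumption):

```python
# Inside IPython, prefix the statement with the %time magic:
#   %time result = "-".join(str(n) for n in range(1, 101))
# The plain-Python equivalent of the timed statement:
result = "-".join(str(n) for n in range(1, 101))
print(result[:11])  # the joined string starts "1-2-3-4-5-6"
```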
There are 123 magic commands on my system (run %lsmagic to list yours), so have a look at the documentation and see what is available.
Jupyter Notebooks (formerly known as IPython Notebooks)
When you install IPython, it also installs a feature called the Notebook. According to the IPython website:
"The IPython Notebook is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media…"
This sounds pretty awesome to me, so let’s see how we can do some of this cool stuff with Spark.
Instead of launching the PySpark shell, we will launch the notebook instead using the following command.
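The command isn't shown here; a minimal sketch using PySpark's documented driver environment variables (assuming Spark 1.2+):

```shell
# Run the Notebook server as the PySpark driver instead of the plain shell:
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

# Then launch from your Spark directory:
# ./bin/pyspark
```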
This will launch your browser and point it at a local Tornado web server on port 8888 (the default). It opens a file-browser interface that reads the directory structure from your current working directory.
To create a new notebook, go to the upper-right corner and click on the New drop-down button and select Python <version>. I am using Python 3, but this could also be Python 2, which is the default in Spark.
This will open up a new window which is your empty notebook. Notice the prompt is the same as it was when using the ipython shell.
The notebook has all the same features available in the shell, except it runs in the browser and is much more aesthetically pleasing.
To use the notebook, you enter commands in a cell and click the Run button on the toolbar. It is straightforward, and more information can be found in the Jupyter documentation.
Here is a simple PySpark program that reads the README.md file that is in the Spark source code and counts the top 5 words.
Not only can you run code in Notebooks, but you can also plot visualizations right in the browser. Here is a simple bar plot of our top 5 words.
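The plot itself isn't shown here; a minimal matplotlib sketch (the `top5` values are placeholders, not the real README.md counts):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook, use %matplotlib inline instead
import matplotlib.pyplot as plt

# Placeholder counts standing in for the word-count result above.
top5 = [("the", 21), ("Spark", 14), ("to", 14), ("and", 10), ("a", 9)]
words = [w for w, _ in top5]
counts = [c for _, c in top5]

plt.bar(range(len(words)), counts)          # one bar per word
plt.xticks(range(len(words)), words)        # label each bar with its word
plt.ylabel("occurrences")
plt.title("Top 5 words in README.md")
plt.savefig("top5_words.png")
```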
That is all for now. I hope you investigate IPython and Jupyter and see all the possibilities of integrating these tools with Spark.
Labels: data science, machine learning, python