Installing Spark on Windows

Here I’m going to provide step-by-step instructions on how to install Spark on Windows.

Computer: Windows 7 x64, 8 GB RAM, i5 CPU.

Spark is written in Scala and runs on the Java Virtual Machine (JVM). To build Spark we first need to prepare the environment by installing the JDK, Scala, SBT and Git. Versions are important.

Let’s start with Java. The latest JDK 7 will be used:

The default installation folder is fine.

Next, install Scala, version 2.10.x.

Pay attention to the installation path and remove any spaces from it.

Next, install the latest version of SBT.

And finally, install Git.

Before we continue, let’s make a few configuration changes.

The default SBT memory limits are too low, so we need to increase them:

-Xmx – sets the maximum Java heap size

-XX:MaxPermSize= – the maximum permanent generation size, class files are kept here

If you do not increase these sizes, SBT will fail with out-of-memory errors during the assembly process.
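
To check that the new limits were actually picked up, here is a minimal sketch that can be pasted into a Scala REPL started through SBT (for example with "sbt console", assuming the console runs inside SBT's own JVM, which I believe is the default). It only uses standard JDK calls; the names maxHeapMb and memoryFlags are mine.

```scala
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

// Maximum heap the JVM will try to use (the effective -Xmx), in megabytes.
val maxHeapMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
println(s"Max heap: $maxHeapMb MB")

// Memory-related flags this JVM was launched with, such as -Xmx and -XX:MaxPermSize.
// The list is empty if no such flags were passed.
val memoryFlags = ManagementFactory.getRuntimeMXBean.getInputArguments.asScala
  .filter(arg => arg.startsWith("-Xm") || arg.startsWith("-XX:MaxPermSize"))
memoryFlags.foreach(println)
```

If the printed values still show the old defaults, the configuration file you edited is probably not the one SBT reads.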

Another thing to configure is the PATH entries for Java, Scala, SBT and Git. At the command prompt type “SET” and look for the Path variable.

Make sure the correct path to the bin folder of every product is set. If any are missing, add them manually based on the installation paths chosen earlier. In my case Java was not added, and the SBT entry was inserted with a mistake in the path.
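
Once the scala command itself starts, the rest of the check can be automated. The sketch below walks the folders on the Path variable and reports where each tool is found. The executable names in tools are my assumptions (your installers may have created different launchers, for example a .cmd file instead of .exe), so adjust them as needed.

```scala
import java.io.File

// Executable names are assumptions; adjust them to what your installers created.
val tools = Seq("java.exe", "scala.bat", "sbt.bat", "git.exe")

// Windows usually spells the variable "Path"; fall back to that if "PATH" is absent.
val rawPath = sys.env.getOrElse("PATH", sys.env.getOrElse("Path", ""))

// Split the variable into its folders (";" on Windows, ":" elsewhere).
val pathDirs = rawPath.split(File.pathSeparator).filter(_.nonEmpty).map(new File(_))

// For each tool, look up the first folder on the path that contains it, if any.
val found = tools.map(tool => tool -> pathDirs.find(dir => new File(dir, tool).isFile))

found.foreach {
  case (tool, Some(dir)) => println(s"$tool found in $dir")
  case (tool, None)      => println(s"$tool NOT found on the path")
}
```

Any tool reported as not found needs its bin folder added to the Path variable by hand.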

OK, by now the environment is ready.

Download Spark from the official web site:

It’s simply an archive.

Extract it somewhere on the disk.

Run cmd.exe and navigate to the “build” subfolder. Execute the “sbt package” command.

If successful, you will see the “Done packaging” message.

Now navigate back to the root folder and run another command, “sbt assembly”.

This time it’s going to take much longer. Depending on how fast your PC and Internet connection are, it might take from 15 minutes to 1 hour. You can launch the Task Manager to see how Java eats memory 🙂

You may get error messages during the process. Do not worry, just address the issue accordingly and run “sbt assembly” again. If all is good, the process will end with the “Done packaging” message.

What the assembly process did was collect all the required subpackages into a single jar file. This file is located in the “assembly\target\scala-2.10” folder.

If you recall, I downloaded Spark with Hadoop, so the name of the jar reflects this fact. When you start Spark with the cmd files, they will actually be calling the above jar.

OK, we are done, Spark is installed. How to test it?

From the command prompt we can enter Spark’s interactive mode. Just navigate to the bin folder and run “spark-shell.cmd”.

We have got a Scala prompt. Great 🙂

Before we run some simple code, let me quickly show you how to remove the verbose INFO messages from the console output. Navigate to the “conf” subfolder. There you will find a number of template files. Create a copy of the “log4j.properties.template” file and rename it to “log4j.properties”.

Then open it with a text editor and substitute “INFO” with “WARN”.
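
If you prefer not to edit the file by hand, the copy and the substitution can be done in one go. This is only a sketch: writeQuietLog4j is a helper name I made up, and it blindly replaces every “INFO” in the template, exactly as the manual step does.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

// Hypothetical helper: reads log4j.properties.template from the given conf folder,
// swaps INFO for WARN on every line and writes the result as log4j.properties.
// Returns the path of the file it wrote.
def writeQuietLog4j(confDir: Path): Path = {
  val template = confDir.resolve("log4j.properties.template")
  val target = confDir.resolve("log4j.properties")
  val lines = Files.readAllLines(template, StandardCharsets.UTF_8).asScala
  val quieter = lines.map(_.replace("INFO", "WARN"))
  Files.write(target, quieter.asJava, StandardCharsets.UTF_8)
}
```

With Spark extracted to the folder used in this post, the call would be writeQuietLog4j(Paths.get("C:/spark-1.3.1/conf")). An existing log4j.properties is overwritten, so keep a copy if you have already customised it.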

OK, start a new cmd session and run “spark-shell.cmd” again. This time the output is much tidier.

The “Hello World” program for big data tools is a “Word Count”:

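// Read the README as an RDD of lines
val file = sc.textFile("C:/spark-1.3.1/README.md")
// Split every line into words
val wordsplit = file.flatMap(l => l.split(" "))
// Pair each word with an initial count of 1
val wordmap = wordsplit.map(word => (word, 1))
// Sum the counts per word
val wordcount = wordmap.reduceByKey(_ + _)
// Write the result; despite its name, "C:/words.txt" becomes a folder of part files
wordcount.saveAsTextFile("C:/words.txt")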

What we get at the output is a folder with the results in partitioned form.

And if I open a partition file, I will see the word counts.

Cool 🙂
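
To see what each step of the word count does without starting Spark at all, the same pipeline can be sketched on plain Scala collections. The sample lines are made up, and since an ordinary Seq has no reduceByKey, the sketch groups the pairs by word and sums them, which gives the same result on a single machine.

```scala
// Made-up sample standing in for the lines of the README.
val lines = Seq("to be or not to be", "that is the question")

// Same first two steps as in the Spark version.
val words = lines.flatMap(l => l.split(" "))
val pairs = words.map(word => (word, 1))

// Stand-in for reduceByKey(_ + _): group the pairs by word, then sum the ones.
val counts = pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

// Print the words with the highest counts first.
counts.toSeq.sortBy { case (_, n) => -n }.foreach(println)
```

The difference in Spark is that the lines, and therefore the grouping and summing, are spread over partitions, which is why the saved output comes as several part files.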

Finally, Spark also supports a Python interactive mode. Just execute another cmd file called “pyspark.cmd”.

All right, that’s it for today. There will be more because this is the future 🙂
