Here I’m going to provide step-by-step instructions on how to install Spark on Windows.
Computer: Windows 7 x64, 8 GB RAM, i5 CPU.
Spark is written in Scala and runs on the Java Virtual Machine. To build Spark we first need to prepare the environment by installing the JDK, Scala, SBT and Git. Versions are important.
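Once all four tools are installed (the steps follow below), each of them can report its version from the command prompt, which is a handy way to double-check what you actually have:

java -version
scala -version
sbt sbtVersion
git --version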
Let’s start with Java. The latest JDK 7 will be used:
The default SBT memory limits are too low, so we need to increase them:
-XX:MaxPermSize= – the maximum permanent generation size; loaded class metadata is kept here
If you don’t increase these sizes, SBT will fail during the assembly process with out-of-memory errors.
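On Windows these JVM options usually live in SBT’s conf\sbtconfig.txt (the exact location and default values depend on the SBT version you installed); values along these lines are commonly suggested for building Spark:

# conf\sbtconfig.txt – raise the heap, PermGen and code cache for the Spark build
-Xmx2048M
-XX:MaxPermSize=512M
-XX:ReservedCodeCacheSize=512M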
Another thing to configure is the PATH entries for Java, Scala, SBT and Git. In the command prompt type “SET” and look for the Path variable:
Make sure the correct paths to the bin folders of every product are set. If any are missing, add them manually based on the installation paths from the previous steps. In my case Java was not added at all and the SBT entry contained a mistake in the path.
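Missing entries can be added permanently through Control Panel (System, Advanced system settings, Environment Variables), or just for the current cmd session as shown below (the folder names here are only an illustration, substitute your own install locations):

rem extends Path for the current cmd session only; use your real install folders
set PATH=%PATH%;C:\Program Files\Java\jdk1.7.0_79\bin;C:\Program Files (x86)\scala\bin;C:\Program Files (x86)\sbt\bin;C:\Program Files (x86)\Git\bin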
OK, by now the environment is ready.
Download Spark from the official web site:
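Unpack the downloaded sources and kick off the build from the Spark root folder (I’m assuming the archive was extracted to C:\spark-1.3.1, adjust the path to match your download):

rem run from the folder where the Spark sources were unpacked
cd C:\spark-1.3.1
sbt assembly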
This time it’s going to take much longer. Depending on how fast your PC and Internet connection are, it might take anywhere from 15 minutes to an hour. You can launch the Task Manager to see how java eats memory 🙂
During the process you may get error messages. Do not worry, just address the issue accordingly and run “sbt assembly” again. If all goes well, the process will end with the “Done packaging” message:
OK, we are done, Spark is installed. How do we test it?
From the command prompt we can enter Spark’s interactive mode. Just navigate to the bin folder and run “spark-shell.cmd”:
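For example, with the same C:\spark-1.3.1 layout assumed above:

cd C:\spark-1.3.1\bin
spark-shell.cmd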
Before we run some simple code, let me quickly show you how to remove the verbose INFO messages from the console output. Navigate to the “conf” subfolder. There you will find a number of template files. Create a copy of the “log4j.properties.template” file and rename it to “log4j.properties”:
Then open it with a text editor and substitute “INFO” with “WARN”:
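The line to change is the root logging level near the top of the file; after the edit it should look roughly like this (the surrounding template contents may differ slightly between Spark versions):

# Set everything to be logged to the console
log4j.rootCategory=WARN, console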
OK, start a new cmd session and run “spark-shell.cmd” again. This time it’s much tidier:
The “Hello World” program for big data tools is “Word Count”:
val file = sc.textFile("C:/spark-1.3.1/README.md")  // load the README as an RDD of lines
val wordsplit = file.flatMap(l => l.split(" "))      // split every line into words
val wordmap = wordsplit.map(word => (word, 1))       // pair each word with a count of 1
val wordcount = wordmap.reduceByKey(_ + _)           // sum the counts per word
wordcount.take(10).foreach(println)                  // transformations are lazy, so trigger the job and print a sample
Finally, Spark also supports a Python interactive mode; just execute another cmd file called “pyspark.cmd”:
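For comparison, the same word count in the Python shell could look like this (a sketch using the standard RDD API; the README path is the same assumption as in the Scala example):

file = sc.textFile("C:/spark-1.3.1/README.md")          # load the README as an RDD of lines
wordcount = (file.flatMap(lambda l: l.split(" "))       # split every line into words
                 .map(lambda word: (word, 1))           # pair each word with a count of 1
                 .reduceByKey(lambda a, b: a + b))      # sum the counts per word
for pair in wordcount.take(10):                         # trigger the job and print a sample
    print(pair)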
All right, that’s it for today. There will be more because this is the future 🙂