Spark on Windows

Developing a Spark Scala application on Windows is a tedious task. Even for a single line of code change, a jar has to be built and copied to a cluster, where it then has to be executed manually. There is a time-saving option for developers: create a local environment on the Windows machine, integrate it with Eclipse, and get the most out of it. Code, build, run and debug instantly, making your development life easier! In this article, you’ll learn how to set up the environment for Spark and develop a simple Spark app that runs inside Eclipse.

Setting up the environment:

  1. Download pre-compiled winutils for Hadoop.
    Extract the contents of the bin folder to C:\hadoop\bin
  2. Download Spark.
    Extract the contents to C:\spark\
  3. Set the environment variables.
    HADOOP_HOME=c:\hadoop
    SPARK_HOME=c:\spark
    PATH=%JAVA_HOME%\bin;%HADOOP_HOME%\bin;%SPARK_HOME%\bin;
  4. Create the C:\tmp\hive directory.
  5. Execute the following command in a Command Prompt started with the Run as administrator option.
    winutils.exe chmod -R 777 C:\tmp\hive
  6. Check the permissions.
    winutils.exe ls -F C:\tmp\hive
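
Once the environment variables are set, a quick sanity check (an optional extra, assuming the PATH entries from step 3) is to open a new Command Prompt and print the Spark version:

    spark-submit --version

If the version banner appears, the Spark binaries are on the PATH and the local installation is ready.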

Creating a sample application

Download the sample application

  1. Create a Maven project in Eclipse.
    File -> New -> Other -> Maven -> Maven Project
  2. Right-click on the project -> Scala -> Set the Scala Installation and select 2.11.11
  3. Right-click on the project -> Properties -> Scala Compiler -> check “Use Project Settings”
  4. Edit pom.xml
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>com.dataqlo.spark</groupId>
        <artifactId>spark-app</artifactId>
        <version>0.0.1-SNAPSHOT</version>
        <dependencies>
            <dependency> <!-- Spark dependency -->
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.11</artifactId>
                <version>2.2.0</version>
                <scope>provided</scope>
            </dependency>
        </dependencies>
    </project>
  5. Create the directory src/main/scala
  6. Create a new Scala object there: New -> Scala Object. Name the object “ScalaWordCount”.
    package com.dataqlo.spark

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext

    object ScalaWordCount {
        def main(args: Array[String]) {
            // Point hadoop.home.dir at the winutils install (same location as HADOOP_HOME)
            System.setProperty("hadoop.home.dir", "c:\\hadoop\\")
            // Create the Spark context with a local master using 2 threads
            val sc = new SparkContext(new SparkConf().setAppName("Spark WordCount").setMaster("local[2]"))
            // Load the input file
            val inputFile = sc.textFile("src/main/resources/input.txt")
            // Split lines into words, pair each word with 1, and sum the counts per word
            val counts = inputFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
            counts.foreach(println)
            sc.stop()
        }
    }
  7. To run: right-click on the project -> Run As -> Run Configurations -> Scala Application -> select the project and the fully qualified name of the main class (com.dataqlo.spark.ScalaWordCount). A sample input and the expected output are shown below.
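
For a quick test, create src/main/resources/input.txt (the path the code loads). Any text works; the line below is just a made-up sample:

    to be or not to be

With that input, the program prints one (word, count) pair per line to the console, in no particular order:

    (not,1)
    (or,1)
    (to,2)
    (be,2)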

The example was tested under the following configuration:

  • Java: 1.8.0_91
  • winutils (Hadoop): 2.7.1
  • Spark: 2.3.1, pre-built for Hadoop 2.7 and later
  • Eclipse Scala IDE: 4.7.0
  • Scala: fixed Scala installation 2.11.11 (built-in, packaged with the Scala IDE)
  • Maven: 3.3.9
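
As a closing note: the provided scope in the pom above keeps Spark’s own classes out of your application jar, which is what you want for the build-a-jar-and-submit workflow mentioned in the introduction. Building that jar from the command line additionally needs a Scala compiler plugin; a minimal sketch, assuming the commonly used scala-maven-plugin (this is not part of the Eclipse setup above):

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
            <!-- Compiles the Scala sources under src/main/scala during the Maven build -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

With that in place, mvn clean package produces target/spark-app-0.0.1-SNAPSHOT.jar, which can then be submitted to a cluster with spark-submit --class com.dataqlo.spark.ScalaWordCount (after removing or overriding the hard-coded setMaster("local[2]")).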
