INTRODUCTION TO
APACHE SPARK
Mohamed Hedi Abidi - Software Engineer @ebiznext
@mh_abidi
CONTENT
Spark Introduction
Installation
Spark-Shell
SparkContext
RDD
Persistence
Simple Spark Apps
Deployment
Spark SQL
Spark GraphX
Spark MLlib
Spark Streaming
Spark & Elasticsearch
INTRODUCTION
An open source data analytics cluster computing framework
In-memory data processing
Up to 100x faster than Hadoop MapReduce for in-memory workloads
Supports the MapReduce programming model
INTRODUCTION
Handles batch, interactive, and real-time within a single framework
INTRODUCTION
Programming at a higher level of abstraction : faster, easier development
INTRODUCTION
Highly accessible through standard APIs built in Java, Scala, Python, or SQL (for interactive queries), and a rich set of machine learning libraries
Compatibility with the existing Hadoop v1 (SIMR) and 2.x (YARN) ecosystems so companies can leverage their existing infrastructure.
INSTALLATION
Install JDK 1.7+, Scala 2.10.x, sbt 0.13.7, Maven 3.0+
Download and unzip Apache Spark 1.1.0 sources
Or clone development Version :
git clone git://github.com/apache/spark.git
Run Maven to build Apache Spark
mvn -DskipTests clean package
Launch Apache Spark standalone REPL
[spark_home]/bin/spark-shell
Go to SparkUI @
http://localhost:4040
SPARK-SHELL
we’ll run Spark’s interactive shell… within the “spark” directory, run:
./bin/spark-shell
then from the “scala>” REPL prompt, let’s create some data…
scala> val data = 1 to 10000
create an RDD based on that data…
scala> val distData = sc.parallelize(data)
then use a filter to select values less than 10…
scala> distData.filter(_ < 10).collect()
SPARKCONTEXT
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster.
In the shell for either Scala or Python, this is the sc variable, which is created automatically
Other programs must use a constructor to instantiate a new SparkContext:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
RDDS
Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark: an immutable, distributed collection of data, partitioned across machines in a cluster
There are currently two types:
parallelized collections : Take an existing Scala collection and run functions on it in parallel
External datasets : Spark can create distributed datasets from any storage source supported by Hadoop, including local file system, HDFS, Cassandra, HBase, Amazon S3, etc.
RDDS
Parallelized collections
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:14
External datasets
scala> val distFile = sc.textFile("README.md")
distFile: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[7] at textFile at <console>:12
RDDS
Two types of operations on RDDs:
transformations and actions
A transformation is a lazy (not computed immediately) operation on an RDD that yields another RDD
An action is an operation that triggers a computation, returns a value back to the Master, or writes to a stable storage system
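For example, here is a minimal sketch of that laziness (the README.md file name is only an assumption carried over from the earlier shell example):
// Transformations only record lineage; nothing is computed yet.
val lines = sc.textFile("README.md")                 // lazy
val sparkLines = lines.filter(_.contains("Spark"))   // lazy
// The action triggers the computation and returns a value to the driver.
val n = sparkLines.count()
println("Lines mentioning Spark: " + n)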
RDDS : COMMONLY USED TRANSFORMATIONS
Transformation & Purpose Example & Result
filter(func) Purpose: returns a new RDD by selecting the data elements on which func returns true
scala> val rdd = sc.parallelize(List("ABC", "BCD", "DEF"))
scala> val filtered = rdd.filter(_.contains("C"))
scala> filtered.collect()
Result: Array[String] = Array(ABC, BCD)
map(func) Purpose: returns a new RDD by applying func to each data element
scala> val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
scala> val times2 = rdd.map(_ * 2)
scala> times2.collect()
Result: Array[Int] = Array(2, 4, 6, 8, 10)
flatMap(func) Purpose: similar to map, but func returns a Seq instead of a single value. For example, mapping a sentence into a Seq of words
scala> val rdd = sc.parallelize(List("Spark is awesome", "It is fun"))
scala> val fm = rdd.flatMap(str => str.split(" "))
scala> fm.collect()
Result: Array[String] = Array(Spark, is, awesome, It, is, fun)
RDDS : COMMONLY USED TRANSFORMATIONS
Transformation & Purpose Example & Result
reduceByKey(func, [numTasks]) Purpose: aggregates the values of a key using a function. “numTasks” is an optional parameter to specify the number of reduce tasks
scala> val word1 = fm.map(word => (word, 1))
scala> val wrdCnt = word1.reduceByKey(_ + _)
scala> wrdCnt.collect()
Result: Array[(String, Int)] = Array((is,2), (It,1), (awesome,1), (Spark,1), (fun,1))
groupByKey([numTasks]) Purpose: converts (K, V) to (K, Iterable<V>)
scala> val cntWrd = wrdCnt.map { case (word, count) => (count, word) }
scala> cntWrd.groupByKey().collect()
Result: Array[(Int, Iterable[String])] = Array((1,ArrayBuffer(It, awesome, Spark, fun)), (2,ArrayBuffer(is)))
distinct([numTasks]) Purpose: eliminates duplicates from the RDD
scala> fm.distinct().collect()
Result: Array[String] = Array(is, It, awesome, Spark, fun)
RDDS : COMMONLY USED ACTIONS
Action & Purpose Example & Result
count() Purpose: get the number of data elements in the RDD
scala> val rdd = sc.parallelize(List('A', 'B', 'C'))
scala> rdd.count()
Result: Long = 3
collect() Purpose: get all the data elements of an RDD as an Array
scala> val rdd = sc.parallelize(List('A', 'B', 'C'))
scala> rdd.collect()
Result: Array[Char] = Array(A, B, C)
reduce(func) Purpose: aggregate the data elements of an RDD using func, which takes two arguments and returns one
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.reduce(_ + _)
Result: Int = 10
take(n) Purpose: fetch the first n data elements of an RDD. Computed by the driver program.
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.take(2)
Result: Array[Int] = Array(1, 2)
RDDS : COMMONLY USED ACTIONS
Action & Purpose Example & Result
foreach(func) Purpose: execute func for each data element in the RDD. Usually used to update an accumulator (discussed later) or to interact with external systems.
scala> val rdd = sc.parallelize(List(1, 2))
scala> rdd.foreach(x => println("%s*10=%s".format(x, x * 10)))
Result:
1*10=10
2*10=20
first() Purpose: retrieves the first data element of an RDD. Similar to take(1)
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.first()
Result: Int = 1
saveAsTextFile(path) Purpose: writes the content of the RDD as a text file or a set of text files to the local file system/HDFS
scala> val hamlet = sc.textFile("readme.txt")
scala> hamlet.filter(_.contains("Spark")).saveAsTextFile("filtered")
Result:
.../filtered$ ls
_SUCCESS part-00000 part-00001
RDDS :
For a more detailed list of actions and transformations, please refer to:
http://spark.apache.org/docs/latest/programming-guide.html#transformations
http://spark.apache.org/docs/latest/programming-guide.html#actions
PERSISTENCE
Spark can persist (or cache) a dataset in memory across operations
Each node stores in memory any slices of it that it computes and reuses them in other actions on that dataset – often making future actions more than 10x faster
The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it
PERSISTENCE : STORAGE LEVEL
Storage Level Purpose
MEMORY_ONLY (default level)
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
Same as the levels above, but replicate each partition on two cluster nodes.
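As a minimal sketch of how these levels are requested (MEMORY_AND_DISK is just an illustrative choice, not a recommendation):
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("README.md")
// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
// an explicit level can be set instead (a level can only be set once per RDD):
lines.persist(StorageLevel.MEMORY_AND_DISK)

// The first action materializes the cached partitions; later actions reuse them.
lines.count()
lines.filter(_.contains("Spark")).count()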
SIMPLE SPARK APPS : WORDCOUNT
Download project from github:
https://github.com/MohamedHedi/SparkSamples
sbt
compile
assembly
WordCount.scala:
import org.apache.spark.{SparkConf, SparkContext}

val logFile = args(0)
val conf = new SparkConf().setAppName("WordCount")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val numApache = logData.filter(line => line.contains("apache")).count()
val numSpark = logData.filter(line => line.contains("spark")).count()
println("Lines with apache: %s, Lines with spark: %s".format(numApache, numSpark))
SPARK-SUBMIT
./bin/spark-submit
--class <main-class>
--master <master-url>
--deploy-mode <deploy-mode>
--conf <key>=<value>
... # other options
<application-jar>
[application-arguments]
SPARK-SUBMIT : LOCAL MODE
./bin/spark-submit
--class com.ebiznext.spark.examples.WordCount
--master local[4]
--deploy-mode client
--conf <key>=<value>
... # other options
.\target\scala-2.10\SparkSamples-assembly-1.0.jar
.\ressources\README.md
CLUSTER MANAGER TYPES
Spark supports three cluster managers:
Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
Hadoop YARN – the resource manager in Hadoop 2.
MASTER URLS
Master URL Meaning
local Run Spark locally with one worker thread (no parallelism at all).
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT Connect to the given Spark standalone cluster master. Default master port : 7077
mesos://HOST:PORT Connect to the given Mesos cluster. Default mesos port : 5050
yarn-client Connect to a YARN cluster in client mode. The cluster location will be found based on the HADOOP_CONF_DIR variable.
yarn-cluster Connect to a YARN cluster in cluster mode. The cluster location will be found based on HADOOP_CONF_DIR.
SPARK-SUBMIT : STANDALONE CLUSTER
./sbin/start-master.sh
(Windows users: spark-class.cmd org.apache.spark.deploy.master.Master)
Go to the master’s web UI
SPARK-SUBMIT : STANDALONE CLUSTER
Connect workers to the master:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
Go to the master’s web UI
SPARK-SUBMIT : STANDALONE CLUSTER
./bin/spark-submit --class com.ebiznext.spark.examples.WordCount
--master spark://localhost:7077 .\target\scala-2.10\SparkSamples-assembly-1.0.jar .\ressources\README.md
SPARK SQL
Shark is being migrated to Spark SQL
Spark SQL blurs the lines between RDDs and relational tables
val conf = new SparkConf().setAppName("SparkSQL")val sc = new SparkContext(conf)val peopleFile = args(0)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)import sqlContext._
// Define the schema using a case class.case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.val people = sc.textFile(peopleFile).map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.// The columns of a row in the result can be accessed by ordinal.teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
SPARK GRAPHX
GraphX is the new (alpha) Spark API for graphs and graph-parallel computation.
GraphX extends the Spark RDD by introducing the Resilient Distributed Property Graph
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Peep(name: String, age: Int)

val vertexArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45)))
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9))

val conf = new SparkConf().setAppName("SparkGraphx")
val sc = new SparkContext(conf)
val vertexRDD: RDD[(Long, Peep)] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(vertexRDD, edgeRDD)

val results = g.triplets.filter(t => t.attr > 7)
for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}
SPARK MLLIB
MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities.
Use cases :
Recommendation Engine
Content classification
Ranking
Algorithms
Classification and regression : linear regression, decision trees, naive Bayes
Collaborative filtering : alternating least squares (ALS)
Clustering : k-means (see the API sketch after this list)
…
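As a hedged sketch of how MLlib's built-in k-means is called (the whitespace-separated input format, k = 3 and maxIterations = 20 are assumptions for illustration; the hand-rolled version on the next slide implements the same idea directly on RDDs):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse a whitespace-separated file of numbers into dense feature vectors.
val features = sc.textFile(args(0))
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Train a model with 3 clusters and at most 20 iterations.
val model = KMeans.train(features, 3, 20)

model.clusterCenters.foreach(println)
println("Within set sum of squared errors: " + model.computeCost(features))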
SPARK MLLIB
SparkKMeans.scala
val sparkConf = new SparkConf().setAppName("SparkKMeans")val sc = new SparkContext(sparkConf)val lines = sc.textFile(args(0))val data = lines.map(parseVector _).cache()val K = args(1).toIntval convergeDist = args(2).toDoubleval kPoints = data.takeSample(withReplacement = false, K, 42).toArrayvar tempDist = 1.0while (tempDist > convergeDist) {val closest = data.map(p => (closestPoint(p, kPoints), (p, 1)))val pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }val newPoints = pointStats.map { pair =>(pair._1, pair._2._1 * (1.0 / pair._2._2))
}.collectAsMap()tempDist = 0.0for (i <- 0 until K) {tempDist += squaredDistance(kPoints(i), newPoints(i))
}for (newP <- newPoints) yield {kPoints(newP._1) = newP._2
}println("Finished iteration (delta = " + tempDist + ")")
}println("Final centers:")kPoints.foreach(println)sc.stop()
SPARK STREAMING
Spark Streaming extends the core API to allow high-throughput, fault-tolerant stream processing of live data streams
Data can be ingested from many sources: Kafka, Flume, Twitter, ZeroMQ, TCP sockets…
Results can be pushed out to filesystems, databases, live dashboards…
Spark’s MLlib algorithms and graph processing algorithms can be applied to data streams
SPARK STREAMING
Create a StreamingContext by providing the Spark configuration and a batch duration:
val ssc = new StreamingContext(sparkConf, Seconds(10))
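Below is a hedged end-to-end sketch of a streaming word count (the socket source on localhost:9999 and the 10-second batch interval are illustrative assumptions):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Count the words received on a TCP socket in each 10-second batch.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()            // start receiving and processing data
ssc.awaitTermination() // block until the streaming job is stopped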
TWITTER - SPARK STREAMING - ELASTICSEARCH
1. Twitter access
2. Streaming from Twitter
val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
sparkConf.set("es.index.auto.create", "true")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Read the four OAuth keys from a file and set the system properties so that the
// Twitter4j library used by the twitter stream can use them to generate OAuth credentials.
val keys = ssc.sparkContext.textFile(args(0), 2).cache()
val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = keys.take(4)
System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
System.setProperty("twitter4j.oauth.accessToken", accessToken)
System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)

val stream = TwitterUtils.createStream(ssc, None)
val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
val topCounts10 = hashTags.map((_, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(10))
  .map { case (topic, count) => (count, topic) }
  .transform(_.sortByKey(false))
TWITTER - SPARK STREAMING - ELASTICSEARCH
3. Index in Elasticsearch
Add the elasticsearch-spark dependency to build.sbt:
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.0.Beta3"
Writing an RDD to Elasticsearch:
import org.elasticsearch.spark._

val sparkConf = new SparkConf().setAppName(appName).setMaster(master)
sparkConf.set("es.index.auto.create", "true")

val apache = Map("hashtag" -> "#Apache", "count" -> 10)
val spark = Map("hashtag" -> "#Spark", "count" -> 15)
val rdd = ssc.sparkContext.makeRDD(Seq(apache, spark))
rdd.saveToEs("spark/hashtag")
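To connect the previous two slides, here is a hedged sketch (not from the original deck) that indexes each batch of hashtag counts from the topCounts10 stream into the same spark/hashtag index via foreachRDD:
import org.elasticsearch.spark._

// For every window, turn the (count, topic) pairs into documents and index them.
topCounts10.foreachRDD { rdd =>
  rdd.map { case (count, topic) => Map("hashtag" -> topic, "count" -> count) }
    .saveToEs("spark/hashtag")
}

ssc.start()
ssc.awaitTermination()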