INTRODUCTION TO
APACHE SPARK
Mohamed Hedi Abidi - Software Engineer @ebiznext
@mh_abidi
CONTENT
Spark Introduction
Installation
Spark-Shell
SparkContext
RDD
Persistence
Simple Spark Apps
Deployment
Spark SQL
Spark GraphX
Spark MLlib
Spark Streaming
Spark & Elasticsearch
INTRODUCTION
An open source data analytics cluster computing framework
In-memory data processing
Up to 100x faster than Hadoop MapReduce for in-memory workloads
Supports the MapReduce programming model
INTRODUCTION
Handles batch, interactive, and real-time within a single framework
INTRODUCTION
Programming at a higher level of abstraction : faster, easier development
INTRODUCTION
Highly accessible through standard APIs built in Java, Scala, Python, or SQL (for interactive queries), and a rich set of machine learning libraries
Compatibility with the existing Hadoop v1 (SIMR) and 2.x (YARN) ecosystems so companies can leverage their existing infrastructure.
INSTALLATION
Install JDK 1.7+, Scala 2.10.x, sbt 0.13.7, Maven 3.0+
Download and unzip Apache Spark 1.1.0 sources
Or clone development Version :
git clone git://github.com/apache/spark.git
Run Maven to build Apache Spark
mvn -DskipTests clean package
Launch Apache Spark standalone REPL
[spark_home]/bin/spark-shell
Go to SparkUI @
http://localhost:4040
SPARK-SHELL
we’ll run Spark’s interactive shell… within the “spark” directory, run:
./bin/spark-shell
then from the “scala>” REPL prompt, let’s create some data…
scala> val data = 1 to 10000
create an RDD based on that data…
scala> val distData = sc.parallelize(data)
then use a filter to select values less than 10…
scala> distData.filter(_ < 10).collect()
SPARKCONTEXT
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster.
In the shell for either Scala or Python, this is the sc variable, which is created automatically
Other programs must use a constructor to instantiate a new SparkContext:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
RDDS
Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark: an immutable, distributed collection of data, partitioned across machines in a cluster
There are currently two types:
parallelized collections : Take an existing Scala collection and run functions on it in parallel
External datasets : Spark can create distributed datasets from any storage source supported by Hadoop, including local file system, HDFS, Cassandra, HBase, Amazon S3, etc.
RDDS
Parallelized collections
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:14
External datasets
scala> val distFile = sc.textFile("README.md")
distFile: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[7] at textFile at <console>:12
RDDS
Two types of operations on RDDs:
transformations and actions
A transformation is a lazy (not computed immediately) operation on an RDD that yields another RDD
An action is an operation that triggers a computation, returns a value back to the Master, or writes to a stable storage system
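For example, here is a minimal sketch of that laziness (the README.md file name is only an assumption carried over from the earlier shell example):
// Transformations only record lineage; nothing is computed yet.
val lines = sc.textFile("README.md")                 // lazy
val sparkLines = lines.filter(_.contains("Spark"))   // lazy
// The action triggers the computation and returns a value to the driver.
val n = sparkLines.count()
println("Lines mentioning Spark: " + n)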
RDDS : COMMONLY USED TRANSFORMATIONS
Transformation & Purpose Example & Result
filter(func) Purpose: returns a new RDD by selecting the data elements on which func returns true
scala> val rdd = sc.parallelize(List("ABC", "BCD", "DEF"))
scala> val filtered = rdd.filter(_.contains("C"))
scala> filtered.collect()
Result: Array[String] = Array(ABC, BCD)
map(func) Purpose: returns a new RDD by applying func to each data element
scala> val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
scala> val times2 = rdd.map(_ * 2)
scala> times2.collect()
Result: Array[Int] = Array(2, 4, 6, 8, 10)
flatMap(func) Purpose: similar to map, but func returns a Seq instead of a single value. For example, mapping a sentence into a Seq of words
scala> val rdd = sc.parallelize(List("Spark is awesome", "It is fun"))
scala> val fm = rdd.flatMap(str => str.split(" "))
scala> fm.collect()
Result: Array[String] = Array(Spark, is, awesome, It, is, fun)
RDDS : COMMONLY USED TRANSFORMATIONS
Transformation & Purpose Example & Result
reduceByKey(func, [numTasks]) Purpose: aggregates the values of a key using a function. “numTasks” is an optional parameter to specify the number of reduce tasks
scala> val word1 = fm.map(word => (word, 1))
scala> val wrdCnt = word1.reduceByKey(_ + _)
scala> wrdCnt.collect()
Result: Array[(String, Int)] = Array((is,2), (It,1), (awesome,1), (Spark,1), (fun,1))
groupByKey([numTasks]) Purpose: converts (K, V) to (K, Iterable<V>)
scala> val cntWrd = wrdCnt.map { case (word, count) => (count, word) }
scala> cntWrd.groupByKey().collect()
Result: Array[(Int, Iterable[String])] = Array((1,ArrayBuffer(It, awesome, Spark, fun)), (2,ArrayBuffer(is)))
distinct([numTasks]) Purpose: eliminates duplicates from the RDD
scala> fm.distinct().collect()
Result: Array[String] = Array(is, It, awesome, Spark, fun)
RDDS : COMMONLY USED ACTIONS
Action & Purpose Example & Result
count() Purpose: get the number of data elements in the RDD
scala> val rdd = sc.parallelize(List('A', 'B', 'C'))
scala> rdd.count()
Result: Long = 3
collect() Purpose: get all the data elements of an RDD as an Array
scala> val rdd = sc.parallelize(List('A', 'B', 'C'))
scala> rdd.collect()
Result: Array[Char] = Array(A, B, C)
reduce(func) Purpose: aggregate the data elements of an RDD using func, which takes two arguments and returns one
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.reduce(_ + _)
Result: Int = 10
take(n) Purpose: fetch the first n data elements of an RDD. Computed by the driver program.
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.take(2)
Result: Array[Int] = Array(1, 2)
RDDS : COMMONLY USED ACTIONS
Action & Purpose Example & Result
foreach(func) Purpose: execute func for each data element in the RDD. Usually used to update an accumulator (discussed later) or to interact with external systems.
scala> val rdd = sc.parallelize(List(1, 2))
scala> rdd.foreach(x => println("%s*10=%s".format(x, x * 10)))
Result:
1*10=10
2*10=20
first() Purpose: retrieves the first data element of an RDD. Similar to take(1)
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.first()
Result: Int = 1
saveAsTextFile(path) Purpose: writes the content of the RDD as a text file or a set of text files to the local file system/HDFS
scala> val hamlet = sc.textFile("readme.txt")
scala> hamlet.filter(_.contains("Spark")).saveAsTextFile("filtered")
Result:
.../filtered$ ls
_SUCCESS part-00000 part-00001
RDDS :
For a more detailed list of actions and transformations, please refer to:
http://spark.apache.org/docs/latest/programming-guide.html#transformations
http://spark.apache.org/docs/latest/programming-guide.html#actions
PERSISTENCE
Spark can persist (or cache) a dataset in memory across operations
Each node stores in memory any slices of it that it computes and reuses them in other actions on that dataset – often making future actions more than 10x faster
The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it
PERSISTENCE : STORAGE LEVEL
Storage Level Purpose
MEMORY_ONLY (default level)
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
Same as the levels above, but replicate each partition on two cluster nodes.
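As a minimal sketch of how these levels are requested (MEMORY_AND_DISK is just an illustrative choice, not a recommendation):
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("README.md")
// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
// an explicit level can be set instead (a level can only be set once per RDD):
lines.persist(StorageLevel.MEMORY_AND_DISK)

// The first action materializes the cached partitions; later actions reuse them.
lines.count()
lines.filter(_.contains("Spark")).count()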
SIMPLE SPARK APPS : WORDCOUNT
Download project from github:
https://github.com/MohamedHedi/SparkSamples
sbt
compile
assembly
WordCount.scala:
import org.apache.spark.{SparkConf, SparkContext}

val logFile = args(0)
val conf = new SparkConf().setAppName("WordCount")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val numApache = logData.filter(line => line.contains("apache")).count()
val numSpark = logData.filter(line => line.contains("spark")).count()
println("Lines with apache: %s, Lines with spark: %s".format(numApache, numSpark))
SPARK-SUBMIT
./bin/spark-submit
--class <main-class>
--master <master-url>
--deploy-mode <deploy-mode>
--conf <key>=<value>
... # other options
<application-jar>
[application-arguments]
SPARK-SUBMIT : LOCAL MODE
./bin/spark-submit
--class com.ebiznext.spark.examples.WordCount
--master local[4]
--deploy-mode client
--conf <key>=<value>
... # other options
.\target\scala-2.10\SparkSamples-assembly-1.0.jar
.\ressources\README.md
CLUSTER MANAGER TYPES
Spark supports three cluster managers:
Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
Hadoop YARN – the resource manager in Hadoop 2.
MASTER URLS
Master URL Meaning
local Run Spark locally with one worker thread (no parallelism at all).
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT Connect to the given Spark standalone cluster master. Default master port : 7077
mesos://HOST:PORT Connect to the given Mesos cluster. Default mesos port : 5050
yarn-client Connect to a YARN cluster in client mode. The cluster location will be found based on the HADOOP_CONF_DIR variable.
yarn-cluster Connect to a YARN cluster in cluster mode. The cluster location will be found based on HADOOP_CONF_DIR.
SPARK-SUBMIT : STANDALONE CLUSTER
./sbin/start-master.sh
(Windows users: spark-class.cmd org.apache.spark.deploy.master.Master)
Go to the master’s web UI
SPARK-SUBMIT : STANDALONE CLUSTER
Connect workers to the master:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
Go to the master’s web UI
SPARK-SUBMIT : STANDALONE CLUSTER
./bin/spark-submit --class com.ebiznext.spark.examples.WordCount
--master spark://localhost:7077 .\target\scala-2.10\SparkSamples-assembly-1.0.jar .\ressources\README.md
SPARK SQL
Shark is being migrated to Spark SQL
Spark SQL blurs the lines between RDDs and relational tables
val conf = new SparkConf().setAppName("SparkSQL")val sc = new SparkContext(conf)val peopleFile = args(0)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)import sqlContext._
// Define the schema using a case class.case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.val people = sc.textFile(peopleFile).map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.// The columns of a row in the result can be accessed by ordinal.teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
SPARK GRAPHX
GraphX is the new (alpha) Spark API for graphs and graph-parallel computation.
GraphX extends the Spark RDD by introducing the Resilient Distributed Property Graph
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Peep(name: String, age: Int)

val vertexArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45)))
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9))

val conf = new SparkConf().setAppName("SparkGraphx")
val sc = new SparkContext(conf)
val vertexRDD: RDD[(Long, Peep)] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(vertexRDD, edgeRDD)

val results = g.triplets.filter(t => t.attr > 7)
for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}
SPARK MLLIB
MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities.
Use cases :
Recommendation Engine
Content classification
Ranking
Algorithms
Classification and regression : linear regression, decision trees, naive Bayes
Collaborative filtering : alternating least squares (ALS)
Clustering : k-means (see the API sketch after this list)
…
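As a hedged sketch of how MLlib's built-in k-means is called (the whitespace-separated input format, k = 3 and maxIterations = 20 are assumptions for illustration; the hand-rolled version on the next slide implements the same idea directly on RDDs):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse a whitespace-separated file of numbers into dense feature vectors.
val features = sc.textFile(args(0))
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Train a model with 3 clusters and at most 20 iterations.
val model = KMeans.train(features, 3, 20)

model.clusterCenters.foreach(println)
println("Within set sum of squared errors: " + model.computeCost(features))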
SPARK MLLIB
SparkKMeans.scala
val sparkConf = new SparkConf().setAppName("SparkKMeans")val sc = new SparkContext(sparkConf)val lines = sc.textFile(args(0))val data = lines.map(parseVector _).cache()val K = args(1).toIntval convergeDist = args(2).toDoubleval kPoints = data.takeSample(withReplacement = false, K, 42).toArrayvar tempDist = 1.0while (tempDist > convergeDist) {val closest = data.map(p => (closestPoint(p, kPoints), (p, 1)))val pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }val newPoints = pointStats.map { pair =>(pair._1, pair._2._1 * (1.0 / pair._2._2))
}.collectAsMap()tempDist = 0.0for (i <- 0 until K) {tempDist += squaredDistance(kPoints(i), newPoints(i))
}for (newP <- newPoints) yield {kPoints(newP._1) = newP._2
}println("Finished iteration (delta = " + tempDist + ")")
}println("Final centers:")kPoints.foreach(println)sc.stop()
SPARK STREAMING
Spark Streaming extends the core API to allow high-throughput, fault-tolerant stream processing of live data streams
Data can be ingested from many sources: Kafka, Flume, Twitter, ZeroMQ, TCP sockets…
Results can be pushed out to filesystems, databases, live dashboards…
Spark’s MLlib algorithms and graph processing algorithms can be applied to data streams
SPARK STREAMING
Create a StreamingContext by providing the Spark configuration and a batch duration:
val ssc = new StreamingContext(sparkConf, Seconds(10))
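Below is a hedged end-to-end sketch of a streaming word count (the socket source on localhost:9999 and the 10-second batch interval are illustrative assumptions):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Count the words received on a TCP socket in each 10-second batch.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()            // start receiving and processing data
ssc.awaitTermination() // block until the streaming job is stopped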
TWITTER - SPARK STREAMING - ELASTICSEARCH
1. Twitter access
2. Streaming from Twitter
val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
sparkConf.set("es.index.auto.create", "true")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Read the four OAuth keys from a file and set the system properties so that the
// Twitter4j library used by the twitter stream can use them to generate OAuth credentials.
val keys = ssc.sparkContext.textFile(args(0), 2).cache()
val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = keys.take(4)
System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
System.setProperty("twitter4j.oauth.accessToken", accessToken)
System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)

val stream = TwitterUtils.createStream(ssc, None)
val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
val topCounts10 = hashTags.map((_, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(10))
  .map { case (topic, count) => (count, topic) }
  .transform(_.sortByKey(false))
TWITTER - SPARK STREAMING - ELASTICSEARCH
3. Index in Elasticsearch
Add the elasticsearch-spark dependency to build.sbt:
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.0.Beta3"
Writing an RDD to Elasticsearch:
import org.elasticsearch.spark._

val sparkConf = new SparkConf().setAppName(appName).setMaster(master)
sparkConf.set("es.index.auto.create", "true")

val apache = Map("hashtag" -> "#Apache", "count" -> 10)
val spark = Map("hashtag" -> "#Spark", "count" -> 15)
val rdd = ssc.sparkContext.makeRDD(Seq(apache, spark))
rdd.saveToEs("spark/hashtag")
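To connect the previous two slides, here is a hedged sketch (not from the original deck) that indexes each batch of hashtag counts from the topCounts10 stream into the same spark/hashtag index via foreachRDD:
import org.elasticsearch.spark._

// For every window, turn the (count, topic) pairs into documents and index them.
topCounts10.foreachRDD { rdd =>
  rdd.map { case (count, topic) => Map("hashtag" -> topic, "count" -> count) }
    .saveToEs("spark/hashtag")
}

ssc.start()
ssc.awaitTermination()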