Posted on 17-Jul-2015
Spark Meetup at Viadeo, Wednesday 4 February 2015
• 19:00-19:45: Introduction to Spark, with examples of new business use cases enabled by real-time Big Data. Cédric Carbone, co-founder of Influans (cedric@influans.com). Spark vs Hadoop MapReduce; Spark Streaming vs Storm; machine learning with Spark; business use case: NextProductToBuy.
• 19:45-20:00: Spark extensions (Tachyon / Spark JobServer). Jonathan Lamiel, Talend Labs (jlamiel@talend.com). Shared memory for Spark with Tachyon; making Spark interactive with Spark JobServer.
• 20:00-21:00: Big Data analytics with Spark & Cassandra. DuyHai DOAN, Technical Advocate at DataStax (duy_hai.doan@datastax.com). Apache Spark is a general data processing framework which lets you perform data processing tasks in memory. Apache Cassandra is a highly available and massively scalable NoSQL data store. By combining Spark's flexible API with Cassandra's performance, we get an interesting combo for both real-time and batch processing.
MapReduce
➜ Map(): parses the input and emits 0 to n <key, value> pairs
➜ Reduce(): sums all values for the same key and emits a single <key, value> pair
WordCount Example
➜ Each mapper takes a line as input and breaks it into words
• It emits a key/value pair of the word and 1
➜ Each reducer sums the counts for each word
• It emits a key/value pair of the word and the sum
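The word-count flow above can be sketched in plain Scala, with no Hadoop dependency; `WordCountSketch` and its method names are illustrative only, not a real Hadoop API:

```scala
// Minimal map/shuffle/reduce sketch of WordCount in plain Scala.
object WordCountSketch {
  // Map phase: each input line emits (word, 1) pairs.
  def mapper(line: String): Seq[(String, Int)] =
    line.split("\\s+").toSeq.filter(_.nonEmpty).map(w => (w.toLowerCase, 1))

  // Reduce phase: sum the values collected for one key.
  def reducer(key: String, values: Seq[Int]): (String, Int) =
    (key, values.sum)

  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(mapper)      // map phase
      .groupBy(_._1)        // shuffle: group pairs by key (done by the framework)
      .map { case (k, pairs) => reducer(k, pairs.map(_._2)) } // reduce phase
}
```

On real clusters, the shuffle step is what moves all pairs with the same key to the same reducer; here `groupBy` plays that role in-memory.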
Hadoop MapReduce v1
MapReduce v1
➜ Good for off-line batch jobs on massive data
➜ Not good for low-latency jobs on small datasets
Hadoop 1
➜ Batch ONLY
• High latency jobs
HDFS (Redundant, Reliable Storage)
MapReduce 1 (Cluster Resource Management + Data Processing)
BATCH: Hive (query), Pig (scripting), Cascading (accelerate dev.)
Hadoop 2: Big Data Operating System
➜ Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS
• Simultaneously & with predictable levels of service
• Data analysts and real-time applications
HDFS (Redundant, Reliable Storage)
MapReduce 1 (Data Processing): BATCH
Other Data Processing: …
YARN (Cluster Resource Management)
HDFS (Redundant, Reliable Storage)
YARN (Cluster Resource Management)
BATCH (MapReduce)
INTERACTIVE (Tez)
STREAMING (Storm, Samza, Spark Streaming)
GRAPH (Giraph, GraphX)
MACHINE LEARNING (MLlib)
IN-MEMORY (Spark)
ONLINE (HBase, HOYA)
OTHER (Elasticsearch)
https://spark.apache.org
Apache Spark™ is a fast and general engine for large-scale data processing.
The most active project
[Charts] Patches submitted and lines of code added, per project (MapReduce, Storm, YARN, Spark): Spark leads on both counts.
Spark won the Daytona GraySort contest!
➜ Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
➜ Sorted 100 TB of data on disk 3x faster than Hadoop MapReduce, using 10x fewer machines
RDDs & Operations
Resilient Distributed Datasets (RDDs)
Operations
➜ Transformations (e.g. map, filter, groupBy)
➜ Actions (e.g. count, collect, save)
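The transformation/action split can be mimicked with Scala's lazy collection views; this is a sketch of the idea, not Spark itself: nothing runs when transformations are declared, only when an action forces traversal.

```scala
// Sketch (plain Scala, not Spark): transformations build a lazy pipeline,
// analogous to RDD lineage; only an "action" triggers the actual work.
object LazySketch {
  // Returns (action result, predicate calls before forcing, calls after).
  def run(lines: Seq[String]): (Int, Int, Int) = {
    var evaluated = 0
    // "Transformation": declared lazily via a view; no work happens here.
    val withSpark = lines.view.filter { l => evaluated += 1; l.contains("Spark") }
    val before = evaluated            // still 0: nothing evaluated yet
    // "Action": traversal forces the pipeline, like count() on an RDD.
    val n = withSpark.toList.length
    (n, before, evaluated)
  }
}
```

This laziness is what lets Spark fuse a chain of transformations into a single pass over the data.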
Spark
scala> val textFile = sc.textFile("README.md")
➜ textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> textFile.count()
➜ res0: Long = 126
scala> textFile.first()
➜ res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
➜ linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4
scala> textFile.filter(line => line.contains("Spark")).count()
➜ res3: Long = 15
Streaming
Storm
Storm vs Spark
                           Spark Streaming     Storm               Storm Trident
Processing model           Micro-batches       Record-at-a-time    Micro-batches
Throughput                 ++++                ++                  ++++
Latency                    Seconds             Sub-second          Seconds
Reliability model          Exactly once        At least once       Exactly once
Embedded in Hadoop distro  HDP, CDH, MapR      HDP                 HDP
Support                    Databricks          N/A                 N/A
Community                  ++++                ++                  ++

                           Spark                                   Storm
Scope                      Batch, Streaming, Graph, ML, SQL        Streaming only
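The processing-model row of the comparison can be illustrated with a toy sketch in plain Scala (function names are hypothetical): record-at-a-time processing invokes the handler once per record for low latency, while micro-batching invokes it once per fixed-size batch, trading latency for throughput.

```scala
// Toy illustration of the two streaming models compared above.
object StreamModels {
  // Storm-style: one handler call per record (low latency).
  def processRecordAtATime[A, B](stream: Seq[A])(f: A => B): Seq[B] =
    stream.map(f)

  // Spark Streaming-style: one handler call per fixed-size batch
  // (higher throughput, latency bounded by the batch interval).
  def processMicroBatches[A, B](stream: Seq[A], batchSize: Int)(f: Seq[A] => B): Seq[B] =
    stream.grouped(batchSize).map(f).toSeq
}
```

In real Spark Streaming the batches are time-based (e.g. one RDD per second) rather than count-based; a count-based split keeps the sketch deterministic.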
Machine Learning Library (MLlib)
Collaborative Filtering (learning)
Collaborative Filtering: let's use the model
Collaborative Filtering: similar behaviors
Collaborative Filtering: prediction
Netflix Prize (2009)
Netflix is a provider of on-demand Internet streaming media
Input Data (UserID::MovieID::Rating::Timestamp)
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
2::1357::5::978298709
2::3068::4::978299000
2::1537::4::978299620
Etc…
The result
1 ; Lyndon Wilson ; 4.608531808535918 ; 858 ; Godfather, The (1972)
1 ; Lyndon Wilson ; 4.596556961095434 ; 318 ; Shawshank Redemption, The (1994)
1 ; Lyndon Wilson ; 4.575789377957803 ; 527 ; Schindler's List (1993)
1 ; Lyndon Wilson ; 4.549694932928024 ; 593 ; Silence of the Lambs, The (1991)
1 ; Lyndon Wilson ; 4.46311974037361 ; 919 ; Wizard of Oz, The (1939)
2 ; Benjamin Harrison ; 4.99545499047152 ; 318 ; Shawshank Redemption, The (1994)
2 ; Benjamin Harrison ; 4.94255532354725 ; 356 ; Forrest Gump (1994)
2 ; Benjamin Harrison ; 4.80168679606128 ; 527 ; Schindler's List (1993)
2 ; Benjamin Harrison ; 4.7874247577586795 ; 1097 ; E.T. the Extra-Terrestrial (1982)
2 ; Benjamin Harrison ; 4.7635998147872325 ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; 4.962687467351026 ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; 4.8316542374095315 ; 318 ; Shawshank Redemption, The (1994)
3 ; Richard Hoover ; 4.7307103243995385 ; 356 ; Forrest Gump (19…
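MLlib's collaborative filtering actually learns an ALS matrix-factorization model to produce ratings like those above; as a much simpler illustration of the "similar behaviors" idea, here is a toy user-based sketch (all names and data are hypothetical): a user's unknown rating is predicted as a similarity-weighted average of ratings from users with a similar history.

```scala
// Toy user-based collaborative filtering (NOT MLlib's ALS algorithm).
object CFSketch {
  type Ratings = Map[Int, Double] // movieId -> rating

  // Cosine similarity over the movies both users have rated.
  def cosine(a: Ratings, b: Ratings): Double = {
    val common = a.keySet intersect b.keySet
    if (common.isEmpty) 0.0
    else {
      val dot = common.toSeq.map(m => a(m) * b(m)).sum
      val na  = math.sqrt(a.values.map(x => x * x).sum)
      val nb  = math.sqrt(b.values.map(x => x * x).sum)
      dot / (na * nb)
    }
  }

  // Predict target's rating for `movie` as a similarity-weighted
  // average of the ratings other users gave that movie.
  def predict(target: Ratings, others: Seq[Ratings], movie: Int): Double = {
    val weighted =
      for (o <- others if o.contains(movie)) yield (cosine(target, o), o(movie))
    val wSum = weighted.map(_._1).sum
    if (wSum == 0.0) 0.0
    else weighted.map { case (w, r) => w * r }.sum / wSum
  }
}
```

ALS instead factors the whole user x movie rating matrix into low-rank user and item vectors, which scales far better than pairwise user comparisons.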
Real Time Big Data
Use Case
Next Product To Buy
➜ Right Person
➜ Right Product
➜ Right Price
➜ Right Time
➜ Right Channel
Questions?
Cédric Carbone
cedric@influans.com
@carbone
www.hugfrance.fr
hug-france-orga@googlegroups.com
@hugfrance