Morning Tech#1 BigData - Oxalide Academy

MorningTech #1 – BigData, 15 December 2016 – Ludovic Piot


Page 1: Morning Tech#1 BigData - Oxalide Academy

MorningTech #1 – BigData, 15 December 2016 – Ludovic Piot

Page 2: Morning Tech#1 BigData - Oxalide Academy

Oxalide events

Apérotech

• Goal: present a business or technical topic
• Open to all: 80 to 100 people
• Format: one evening per quarter, 6 pm to 9 pm
• Topic introduced by a partner
• Round table with customers and non-customers
• Informal discussion over drinks and a buffet

Workshop

• Goal: present a technology
• Customers only: technical audience with laptops, 30 people
• Format: one morning per quarter, 9 am to 1 pm
• Presentation of the technology
• Tutorial on command-line configuration

Morning Tech

• Goal: present a business or technical topic
• Customers only: 30 people
• Format: one morning per quarter, 9 am to noon
• Big picture
• Demonstration and lessons learned

Page 3: Morning Tech#1 BigData - Oxalide Academy

The speakers

Ludovic Piot
Consulting / Architecture / DevOps @ Oxalide

@lpiot

Page 4: Morning Tech#1 BigData - Oxalide Academy

Oxalide is hiring! Contact us at [email protected]

Page 5: Morning Tech#1 BigData - Oxalide Academy

Challenges & trends

Page 6: Morning Tech#1 BigData - Oxalide Academy

SoLoMo and IoT – the data explosion

SOcial

LOcal

MObile

Page 7: Morning Tech#1 BigData - Oxalide Academy

IoT – the data explosion

Copyright © 2014, Hortonworks, Inc. All rights reserved.

Enterprise Data Trends @ Scale

The volume of data that is available for analysis is transforming organizations, as well as the entire IT industry. Everyone is seeing data external to an organization as becoming just as strategic as internal data. Semi-structured and unstructured data volume is beginning to dwarf the traditional data in relational databases and data warehouses.

• Facebook has around 50 PB warehouse and it’s constantly growing.
• Twitter messages are 140 bytes each generating 8TB data per day.
• Data is more than doubling every year.
• Almost 80% of data will be unstructured data.
• Netflix: 75% of streaming video results from recommendations.
• Amazon: 35% of product sales come from product recommendations.

Organizations are redefining data strategies due to the requirements of the evolving Enterprise Data Warehouse (EDW).

Data sources shown on the slide: Enterprise Data, VoIP, Machine Data, Social Media.

Page 8: Morning Tech#1 BigData - Oxalide Academy

The 3 Vs: Gartner's dimensions

• Volume: the volume of data created and managed keeps growing (+59% per year in 2011).
• Variety: the types of data collected are highly varied (text, sound, images, logs…). Processing tools have to handle this diversity.
• Velocity: data must be usable as it is collected. It has to be used quickly, or it has no value.

The 2 new, emerging Vs:

• Veracity: adds a notion of data quality for the business.
• Visibility: stresses that data must be accessible to the business so that decisions can be made quickly.

Page 9: Morning Tech#1 BigData - Oxalide Academy

Evolution of BigData trends

batch / real time / predictive

reports / alerts / forecasts

Page 10: Morning Tech#1 BigData - Oxalide Academy

Principles

Page 11: Morning Tech#1 BigData - Oxalide Academy

BigData vs. traditional data management

Traditional Systems vs. Hadoop

Hadoop is not designed to replace existing relational databases or data warehouses. Relational databases are designed to manage transactions. They contain a lot of features and functionality designed around managing transactions. They are based upon schema-on-write. Organizations have spent years building Enterprise Data Warehouses (EDW) and reporting systems for their traditional data. The traditional EDWs are not going anywhere either. EDWs are also based on schema-on-write.

Hadoop is not:

• Relational
• NoSQL
• Real-time
• A database

Hadoop is a data platform that complements existing data systems. Hadoop is designed for schema-on-read and can handle the large data volumes coming from semi-structured and unstructured data. With the low cost of storage on Hadoop, organizations are looking at using Hadoop more for archiving.

Scale (storage & processing): Traditional Database, EDW, MPP Analytics, NoSQL, Hadoop Distribution

Criterion | Traditional Database | Hadoop Distribution
schema | Required on write | Required on read
speed | Reads are fast | Writes are fast
governance | Standards and structured | Loosely structured
processing | Limited, no data processing | Processing coupled with data
data types | Structured | Multi and unstructured
best fit use | Interactive OLAP Analytics; Complex ACID Transactions; Operational Data Store | Data Discovery; Processing unstructured data; Massive Storage/Processing
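The difference between schema-on-write and schema-on-read can be illustrated with a minimal Python sketch. It is not Hadoop code: the `SCHEMA` dictionary, the function names and the sample records are all hypothetical. The first store rejects non-conforming records at write time, the second accepts any raw line and applies a schema only when the data is read.

```python
import json

# Hypothetical schema for the "traditional" store: field name -> expected type.
SCHEMA = {"user": str, "amount": float}

def write_with_schema(store, record):
    """Schema-on-write: reject records that do not match SCHEMA before storing."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    store.append(record)

def write_raw(store, line):
    """Schema-on-read: store the raw line untouched, so writes stay cheap."""
    store.append(line)

def read_amounts(store):
    """Apply a schema only at read time; skip lines that do not fit it."""
    for line in store:
        try:
            record = json.loads(line)
            yield float(record["amount"])
        except (ValueError, KeyError, TypeError):
            continue

table = []
write_with_schema(table, {"user": "alice", "amount": 12.5})

raw_store = []
write_raw(raw_store, '{"user": "alice", "amount": 12.5}')
write_raw(raw_store, "free-text log line without structure")
write_raw(raw_store, '{"user": "bob", "amount": "7"}')

total = sum(read_amounts(raw_store))
print(total)
```

The raw store accepts the unstructured log line and the string-typed amount that the schema-on-write store would refuse. The cost moves to read time, where every query has to decide how to interpret the data.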

Page 12: Morning Tech#1 BigData - Oxalide Academy

Distributed storage

Data Integrity – Writing Data

High performing applications stream data to files. HDFS does this as well; the HDFS client caches packets of data in memory. Once that data reaches the HDFS block size, the client will notify the NameNode. The NameNode will provide the DataNode information and the locations for the block replicas. The client will then stream the packet of data to the first targeted DataNode. Replication is performed in a pipeline fashion; the first DataNode will start writing the block and will then transfer that data to the second DataNode. The second DataNode will start sending the data to the third DataNode and so on.

When the blocks in a directory reach a defined limit, which is controlled via dfs.datanode.numblocks, the DataNode will define a new subdirectory. After defining the subdirectory it will start placing new data blocks and the corresponding metadata in that subdirectory. This is performed using a fan-out structure ensuring no single directory is overloaded with files or becomes too deep.

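Data Pipeline (diagram: Client, NameNode, DataNode 1, DataNode 4, DataNode 12)

1. Client to NameNode: "I want to write a block of data."
2. NameNode to Client: "OK, please use DataNodes 1, 4, 12."
3. Client to DataNode 1: data + checksum
4. Verify checksum; data and checksum are passed along the pipeline (DataNode 1, DataNode 4, DataNode 12)
5. Success! (acknowledged between DataNodes)
6. Success! (returned to the Client)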
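The write pipeline described above can be mimicked with a toy Python model. This is a sketch, not HDFS code: the `DataNode` class and `write_block` function are hypothetical in-memory stand-ins, and `zlib.crc32` stands in for whatever checksum the cluster actually uses. The client sends data plus checksum to the first DataNode. Each node verifies the checksum, stores the block and forwards it to the next node. The success signal travels back along the same chain.

```python
import zlib

class DataNode:
    """Hypothetical in-memory stand-in for an HDFS DataNode."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def receive(self, block_id, data, checksum, downstream):
        # Verify the checksum before storing anything.
        if zlib.crc32(data) != checksum:
            raise IOError(f"{self.name}: checksum mismatch for {block_id}")
        self.blocks[block_id] = data
        # Forward data and checksum to the next DataNode in the pipeline, if any.
        if downstream:
            downstream[0].receive(block_id, data, checksum, downstream[1:])
        # Returning normally plays the role of the "Success!" acknowledgement.
        return True

def write_block(pipeline, block_id, data):
    """Client side: compute the checksum and stream to the first DataNode."""
    checksum = zlib.crc32(data)
    return pipeline[0].receive(block_id, data, checksum, pipeline[1:])

# The NameNode's answer ("please use DataNodes 1, 4, 12") is hard-coded here.
nodes = [DataNode("DataNode 1"), DataNode("DataNode 4"), DataNode("DataNode 12")]
acked = write_block(nodes, "blk_1", b"hello hdfs")
print(acked, [sorted(n.blocks) for n in nodes])
```

A real pipeline streams packets concurrently rather than through recursive calls. The sketch only shows the order of verification, forwarding and acknowledgement.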

Page 13: Morning Tech#1 BigData - Oxalide Academy

The CAP theorem

Page 14: Morning Tech#1 BigData - Oxalide Academy

Map/Reduce

MapReduce

The original use case for Hadoop was distributed batch processing. MapReduce is a powerful application paradigm for processing massive amounts of data. Core features of MapReduce are:

• Co-locating processing with data blocks: Take the computing to where the data lives, rather than querying or reading data into a remote application. Would you rather move hundreds of GB/TB of data around your network, or would you rather move an application that processes the same data to where the data actually lives?
• Map Phase: This is the initial phase of all MapReduce jobs. This is where raw data can be read, extracted, transformed, and results written out to HDFS or moved on to Reducers for aggregate processing, such as a final count, sum, min, max, etc. The Map phase can also be thought of as the ETL or projection step for MapReduce.
• Reduce Phase: This is the final phase where data is sorted on a user-defined key and grouped by that same key. The Reducer has the option to perform an…

Diagram: Map Phase (three Mappers, each on an NM + DN node) → Shuffle/Sort (data is shuffled across the network and sorted) → Reduce Phase (two Reducers, each on an NM + DN node).
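The three stages (map, shuffle/sort, reduce) can be sketched in a single Python process with the classic word-count example. This is an illustration only: the function names are hypothetical, and nothing is distributed, whereas real jobs run mappers and reducers on separate nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: read one raw record and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1

def shuffle_sort(pairs):
    """Shuffle/sort: group all values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce: aggregate the values of one key (here, a sum)."""
    return key, sum(values)

def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle_sort(mapped))

counts = word_count(["big data", "big cluster"])
print(counts)
```

In a cluster, `map_phase` would run next to each data block, and only the much smaller (key, value) pairs would cross the network during the shuffle.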

Page 15: Morning Tech#1 BigData - Oxalide Academy

The latency table

Page 16: Morning Tech#1 BigData - Oxalide Academy

The BigData pipeline

data → ingest / collect → store → process → analyse → answers

Time to answer (latency)

Throughput

Cost

Page 17: Morning Tech#1 BigData - Oxalide Academy

The Lambda Architecture

Defining Data Layers

There are multiple ways of organizing data in an Enterprise Data Warehouse and the same goes for Hadoop.

One way is the Lambda Architecture, which defines different data layers. A Hadoop cluster can work by itself or be integrated with HBase and other EDWs and ODSs to build different data layers that meet the data needs of an organization.

The process of building different data layers is a familiar concept within data warehousing and analytics. The data layers are built in a Hadoop cluster for the same reasons they have been built in data warehouses for the last 30 years: to facilitate speed. There are 3 data layers:

• Batch Layer: Immutable master data set (source of truth). Used to create views for the batch layer.
• Serving Layer: Contains pre-computed views.
• Speed Layer: Contains additional levels of pre-computed views, structures and indexes to reduce the latency that exists in the serving layer.

Diagram: Defining Data Layers

• Batch Layer: Extract & Load
• Serving Layer: Standardize, Cleanse, Integrate, Filter, Transform
• Speed Layer: Conform, Summarize, Access

• Organize data based on source/derived relationships
• Allows for fault and rebuild process
• There are lots of different ways of organizing data in an enterprise data platform that includes Hadoop.
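One way to picture how the three layers cooperate is the toy page-view counter below. It is a sketch under assumptions: it follows the commonly described reading of the Lambda Architecture, in which the speed layer covers events not yet absorbed by a batch run, and this is worded somewhat differently from the slide text. The `LambdaCounter` class and its method names are hypothetical.

```python
from collections import Counter

class LambdaCounter:
    """Toy page-view counter split into batch, serving and speed layers."""

    def __init__(self):
        self.master_dataset = []     # batch layer: append-only source of truth
        self.batch_view = Counter()  # serving layer: view precomputed by the batch job
        self.speed_view = Counter()  # speed layer: events not yet covered by a batch run

    def ingest(self, page):
        # Every event is appended to the master data set and counted right away.
        self.master_dataset.append(page)
        self.speed_view[page] += 1

    def run_batch(self):
        # Recompute the view from scratch, then drop the now-redundant increments.
        self.batch_view = Counter(self.master_dataset)
        self.speed_view.clear()

    def query(self, page):
        # A query merges the precomputed view with the recent increments.
        return self.batch_view[page] + self.speed_view[page]

store = LambdaCounter()
store.ingest("/home")
store.ingest("/home")
before_batch = store.query("/home")
store.run_batch()
store.ingest("/home")
after_batch = store.query("/home")
print(before_batch, after_batch)
```

Because the batch view is always rebuilt from the immutable master data set, a bug in the view logic can be fixed and the view recomputed. This is one reading of the "fault and rebuild" bullet on the slide.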

Page 18: Morning Tech#1 BigData - Oxalide Academy

Ecosystem

Page 19: Morning Tech#1 BigData - Oxalide Academy

Evolution of Big Data processing

Page 20: Morning Tech#1 BigData - Oxalide Academy

Evolution of Big Data processing

Dataflow

Dataproc

BigQuery

BigTable

CloudSQL

Cloud Pub/Sub

Page 21: Morning Tech#1 BigData - Oxalide Academy

Demo Time

Amazon S3

http://bit.ly/2grJMMf

Shard 0

Amazon Kinesis

Amazon Cognito

Amazon EC2

R Shiny-Server

https://github.com/lpiot/amazon-kinesis-IoT-sensor-demo

Page 22: Morning Tech#1 BigData - Oxalide Academy

Machine learning & deep learning

Page 23: Morning Tech#1 BigData - Oxalide Academy

The data science approach

Page 24: Morning Tech#1 BigData - Oxalide Academy

Machine Learning

Supervised learning

• Data set: labeled (with the answers)
• Learning objective:
  • Regression (forecasting)
  • Classification

Page 25: Morning Tech#1 BigData - Oxalide Academy

Hypothesis and cost function

Goal: find a function h that faithfully represents the data.

Linear regression: $h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$
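The linear hypothesis and a cost function can be written out in a few lines of plain Python. The slide does not state which cost function it uses. The sketch assumes the common mean squared error, J(θ) = 1/(2m) · Σ (h(x) − y)², minimized by batch gradient descent. The data set, the learning rate `alpha` and the iteration count are made up for the illustration.

```python
def h(theta, x):
    """Hypothesis: theta[0] + theta[1]*x[0] + ... + theta[n]*x[n-1]."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

def cost(theta, X, y):
    """Assumed cost function: half the mean squared error over m examples."""
    m = len(X)
    return sum((h(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)

def gradient_step(theta, X, y, alpha):
    """One batch gradient descent update of every coefficient."""
    m = len(X)
    errors = [h(theta, xi) - yi for xi, yi in zip(X, y)]
    grad0 = sum(errors) / m
    grads = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
             for j in range(len(theta) - 1)]
    return [theta[0] - alpha * grad0] + [t - alpha * g
                                         for t, g in zip(theta[1:], grads)]

# Made-up labeled data generated by y = 1 + 2x.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [3.0, 5.0, 7.0, 9.0]

theta = [0.0, 0.0]
initial_cost = cost(theta, X, y)
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.05)
final_cost = cost(theta, X, y)
print(theta, final_cost)
```

The coefficients move step by step toward the values that generated the data (intercept 1, slope 2), and the cost shrinks accordingly.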

Page 26: Morning Tech#1 BigData - Oxalide Academy

Machine Learning

Unsupervised learning

• Data set: unlabeled (no answers)
• Learning objective:
  • Identify / detect structures in the data

Page 27: Morning Tech#1 BigData - Oxalide Academy

Classification algorithms

Goal: find the algorithm that best distinguishes the structures in the data.
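The slide names no particular algorithm. As one possible illustration of detecting structure in unlabeled data, here is a minimal one-dimensional k-means sketch in plain Python. The function name, the points and the starting centers are all made up.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Unlabeled data with two visible groups, one near 1 and one near 9.5.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
```

No labels are supplied: the two groups emerge only from the distances between points. That is the difference from the supervised setting.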

Page 28: Morning Tech#1 BigData - Oxalide Academy

Neural networks

• Based on how a brain works
• Non-linear hypothesis!
• Multi-class classification
• As before, we try to minimize the cost function by gradually adjusting the coefficients Θ(i)
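To show what a non-linear hypothesis looks like, the sketch below runs the forward pass of a tiny two-layer network in plain Python. The coefficients are hand-picked for the illustration, not learned: training would adjust them gradually to minimize the cost, as the last bullet says. With these values the network computes XNOR, a function that no single linear boundary can separate. All names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, inputs):
    """One unit: weights[0] is the bias, the others multiply the inputs."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(z)

# Hand-picked (not learned) coefficients, one list per unit.
THETA_1 = [[-30.0, 20.0, 20.0],   # hidden unit 1: x1 AND x2
           [10.0, -20.0, -20.0]]  # hidden unit 2: (NOT x1) AND (NOT x2)
THETA_2 = [-10.0, 20.0, 20.0]     # output unit: OR of the two hidden units

def forward(x1, x2):
    """Forward pass: inputs -> hidden layer -> output in (0, 1)."""
    hidden = [neuron(w, [x1, x2]) for w in THETA_1]
    return neuron(THETA_2, hidden)

predictions = {(a, b): round(forward(a, b)) for a in (0, 1) for b in (0, 1)}
print(predictions)
```

Each unit on its own is a simple thresholded sum. The non-linearity of the overall hypothesis comes from stacking the units in layers.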

Page 29: Morning Tech#1 BigData - Oxalide Academy

Questions?

Page 30: Morning Tech#1 BigData - Oxalide Academy

Sources

• [6, 10]: Hortonworks, Operations Management with HDP

• [8, 11, 12]: http://www.slideshare.net/1Strategy/2016-utah-cloud-summit-big-data-architectural-patterns-and-best-practices-on-aws