One Trillion Edgescourse/DBMS/class/bigdata/giraph.pdf · to Apache Giraph ‣Code available as...

Indian Institute of Science Bangalore, India

भारतीय विज्ञान संस्थान

बंगलौर, भारत

One Trillion Edges :

Graph Processing at Facebook Scale A v e r y C h i n g , S e r g e y E d u n o v , M a j a K a b i l j o ,

D i o n y s i o s L o g o t h e t i s , S a m b h a v i M u t h u k r i s h n a n

F a c e b o o k

P r e s e n t e d b y : S w a p n i l G a n d h i

2 1 s t N o v e m b e r 2 0 1 8

Konigsberg* Bridge Problem

Its negative resolution by Leonhard Euler in 1736 laid the foundations of graph theory.

* Located in Kingdom of Prussia (Now in Russia)

Graphs are common Web & Social Networks

‣ Web graph, Citation Networks, Twitter, Facebook, Internet

Knowledge networks & relationships ‣ Google’s Knowledge Graph, CMU’s NELL

Cybersecurity ‣ Telecom call logs, financial transactions, Malware

Internet of Things ‣ Transport, Power, Water networks

Bioinformatics ‣ Gene sequencing, Gene expression networks

Graph Algorithms

Traversals: Paths & flows between different parts of the graph

‣ Breadth First Search, Shortest path, Minimum Spanning Tree, Eulerian paths, Max-Cut

Clustering: Closeness between sets of vertices ‣ Community detection & Evolution, Connected

components, K-means clustering, Max Independent Set

Centrality: Relative importance of vertices ‣ PageRank, Betweenness Centrality

Graphs are Central to Analytics

Wikipedia

< / > < / > < / > XML

Hyperlinks PageRank Top 20 Pages

Title PR Text

Title Body Topic Model

(LDA) Word Topics

Word Topic

Editor Graph

Community

Detection

Community

User Com.

Term-Doc

Discussion

User Disc.

Community

Topic Com.

But Graphs can be challenging Shared memory algorithms don’t scale!

Do not fit naturally to Hadoop/MapReduce ‣ Multiple MR jobs (iterative MR)

‣ Topology & Data written to HDFS each time

‣ Tuple, rather than graph-centric, abstraction

Lot of work on parallel graph libraries for HPC ‣ Boost Graph Library, Graph500

‣ Storage & compute are (loosely) coupled, not fault tolerant

‣ But everyone does not have a supercomputer! • If in-case you do own a supercomputer, stick with HPC

algorithms 6

PageRank using MR

7 MapReduce : https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-

osdi04.pdf

PageRank using MR MR will run for multiple iterations (Typically 30)

Mapper will ‣ Initially, load adjacency list and initialize default PR ‣ Subsequent iterations will load adjacency list and

new PR ‣ Emit two types of messages from Map

Reducer will ‣ Reconstruct the adjacency list for each vertex ‣ Update the PageRank values for the vertex based

on neighbour’s PR messages ‣ Write adjacency list and new PR values to HDFS, to

be used by next Map iteration

8 SQL v/s MapReduce : http://www.science.smith.edu/dftwiki/images/6/6a/ComparisonOfApproachesToLargeScaleDataAnalysis.p

Two Birds

Half-caf Double Expresso

Less Data movement over Network

Fault Tolerance

Credits : Google Images

One Stone

Pregel

Credits : Google Images

It’s a word play on the English proverb : “Kill two birds with one stone”

Pregel To overcome these challenges, Google came up with Pregel

Valiant’s BSP

Superstep 1

Superstep 2

P1 P2 P3 P4

Computation

Communication

Barrier Synchronization

Computation

“Often expensive and should be used as sparingly as possible”

Vertex State Machine

In superstep 0, every vertex is in the active state.

A vertex deactivates itself by voting to halt.

It can be reactivated by receiving an (external) message.

Algorithm termination is based on every vertex voting to halt.

Vertex Centric Programming

Vertex Centric Programming Model ‣ Logic written from perspective on a single vertex.

‣ Executed on all vertices.

Vertices know about ‣ Their own value(s) ‣ Their outgoing edges

6 6 2 6

6 6 6 6

3 6 2 1 Superstep 0

Superstep 1

Superstep 2

Superstep 3

Active

to Halt

Message

Finding Largest Value in a Graph using Pregel

Worker

Advantages

Makes distributed programming easy ‣ No locks, semaphores, race conditions ‣ Separates computing from communication phase

Vertex-level parallelization ‣ Bulk message passing for efficiency

Stateful (in-memory) ‣ Only messages & checkpoints hit disk

Lifecycle of a Pregel Program

17 Apache Giraph, Claudio Martella, Hadoop Summit, Amsterdam, April

Applications

SSSP class ShortestPathVertex: public Vertex<int, int, int> {

void Compute(MessageIterator* msgs) { int mindist = IsSource(vertex_id()) ? 0 : INF; for ( ; !msgs->Done(); msgs->Next())

mindist = min(mindist, msgs->Value());

if (mindist < GetValue()) {

*MutableValue() = mindist; OutEdgeIterator iter = GetOutEdgeIterator();

for ( ; !iter.Done(); iter.Next())

SendMessageTo(iter.Target(),

mindist + iter.GetValue());

VoteToHalt();

In the 0th superstep, only source vertex will

update its value

SSSP (1/6)

Input Graph

Worker 1

Worker 2

Worker 3

Worker 4

SSSP (2/6)

Superstep 0

Worker 1

Worker 2

Worker 3

Worker 4

SSSP (3/6)

Superstep 1

Worker 1

Worker 2

Worker 3

Worker 4

SSSP (4/6)

Superstep 2

Worker 1

Worker 2

Worker 3

Worker 4

SSSP (5/6)

Superstep 3

Worker 1

Worker 2

Worker 3

Worker 4

SSSP (6/6)

Worker 1

Worker 2

Worker 3

Worker 4

Algorithm has converged

Apache Giraph

Platform Improvements (1/2)

Efficient Memory Management (MM) ‣ Vertex and Edge data stored using serialized byte array

‣ Better MM -> Less GC

Support for Multi-Threading ‣ Maximized resource utilization

‣ Linear speed-up for CPU bound applications like K-Means Clustering

Platform Improvements (2/2)

Flexible IO Format ‣ Reduces Pre-processing

‣ Allows Vertex and Edge data to be loaded from different sources

Sharded Aggregator ‣ Aggregator responsibilities are balanced across workers

‣ Different Aggregators can be assigned to different workers.

Refer Class-room Discussion

Beyond Pregel

Master Compute ‣ Allows centralized execution of computation

‣ Refer Class-room Discussion

Worker Phases ‣ Special methods which by-pass Pregel Model, but add

ease of usability

‣ Applicability is application specific

Computation Composability ‣ Decouples Vertex and Computation

‣ Existing Computation implementation can be decoupled for multiple applications

Superstep Splitting

Master runs same “Message Heavy” Superstep for fixed number of iterations

For an iteration: ‣ Vertex computation invoked if vertex passes hash

function H ‣ Message sent to destination only if destination passes

hash function H’

Applicable to computation where messages are not “aggregatable”. ‣ If they can be aggregated (commutative and associative)

then stick with Combiners

Example : Friends-of-Friends Computation

Key Take-aways

Usability, Performance and scalability improvement to Apache Giraph ‣ Code available as open-source to try out !

Memoir detailing Facebook’s experience of using Giraph for production applications

Headline Grabber :

“Scales to a Trillion Edge graph”

One Trillion Edgescourse/DBMS/class/bigdata/giraph.pdf · to Apache Giraph ‣Code available as...

Documents

Transcript of One Trillion Edgescourse/DBMS/class/bigdata/giraph.pdf · to Apache Giraph ‣Code available as...

Bienvenue à CFX - Cash FX Group Forex Group... · 2020. 4. 28. · Welcome To The CFX Online Learning Center The successful 53 trillion Forex Market now available to 8 billon people.

Chapitre 7 ANALYSE DE RÉSEAUX - Campus de Metz · • Passage à l’échelle / Big Data : –Giraph on Hadoop –GraphX on Spark –Bases de données de type graphes (Neo4J, etc)

Bases de données multimédia I - Introduction · Un système de gestion de base de données (DBMS) est un logiciel facilitant la création, la maintenance et la manipulation d’une

Developpement Web - JDBC´Developpement Web - JDBC´ Preambule´ Drivers Types de pilotes • Type 1 : traduit JDBC en ODBC et utilise le pilote ODBC pour communiquer avec le DBMS

Un document publié par le United States Environmental Protection Agency démontre qu´entre cinq cents milliard et un trillion de sacs de plastique sont.

Offshore RMB Express - Bank of China...2 Offshore RMB Express annual amount of RMB 4.21 trillion in 2018 was up by 7.2% YoY. RTGS turnover was RMB 24.7 trillion in January 2019, up

Apache giraph

-Publilius Syrus - Texas Medical Center2016 - $3.35 Trillion $10,000 per person living in the United States 17.8 percent of the GDP ~ 20% by 2025 Another perspective: U.S. healthcare

データベース - DB Lab.ishikawa/lectures/db...– SQL Server（Microsoft） – HiRDB（日立） • フリーのDBMS – PostgreSQL – MySQL – Firebird – SQLite：組込み用DBMS

GVdK = CUY Bases de données Access, Mysql et autres Systèmes de Gestion de Base de Données Relationnelle (=SGBDR) (=DBMS)

Les nanocelluloses, une nouvelle voie de valorisation de la ......Global potential market for products using nanotechnology is estimated at ~US $ 1 to 3 trillion by 2020. • Wood

Scaling Iterative Graph Computations with GraphMaplingliu/papers/2015/IEEE-SC2015-GraphMap.pdf · Apache Giraph [1] and Hama [2] have shown remark-able initial success. However, existing

Un document publié par le « United States Environmental Protection Agency » démontre qu’entre 500 milliards et 1 trillion de sacs de plastique sont utilisés.

Base de données - ÉTS : Cours par sigle · •Une base de données, ou plus précisément un système de gestion de base de données (SGBD - DBMS), est un ensemble de logiciels

Cenaero - 2017.forum-immobilier.be2017.forum-immobilier.be › sites › default › files › pdf › conference_-_l… · “Smart Cities market to reach US$1.56 trillion by 2020

Bases de données multimédia I - Introductionlear.inrialpes.fr/~douze/enseignement/2014-2015/cours_chap1_2pp.pdf · Un système de gestion de base de données (DBMS) est un logiciel

ORM2Pwn: Exploiting injections in Hibernate ORM2015.zeronights.ru/assets/files/36-Egorov-Soldatov.pdfMotivation Modern applications work with DBMS not directly but via ORM In Java,

IT 인프라자원및 DBMS(Application)...보안장비운영자 데이터베이스운영자 유닉스서버운영자 SSO 를 통한웹 속 ②발급승인 V I P 실시간데이터제