Running the largest HDFS cluster - University of …cloud.berkeley.edu/data/hdfs-scalability.pdf4!...

Running the largest HDFS cluster

Hairong Kuang (hairong@fb.com), Tom Nykiel (tomasz@fb.com)

Agenda

•  HDFS at Facebook
•  Improving HDFS scalability
•  HDFS federation

What is HDFS

•  HDFS:
    •  Storage layer for Hadoop
    •  Open-source Apache project
    •  Scale: petabytes of data on thousands of nodes
•  Characteristics:
    •  Uses clusters of commodity computers
    •  Uses replication across servers to deal with unreliable storage/servers
    •  Metadata-data separation: simple design
    •  Slightly restricted file semantics
        •  Focus is mostly on sequential access
        •  Single writers
        •  No file-locking features
    •  Supports moving computation close to data
        •  Single "storage + compute" cluster vs. separate clusters

HDFS Architecture

[Figure: HDFS architecture. Clients send metadata ops to the Namenode, which keeps the metadata (name, #replicas, …), e.g. /users/foo/data, 3, …. Clients perform block ops against Datanodes on Rack 1 and Rack 2, and blocks are replicated between Datanodes.]
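The split between metadata ops and block ops in the architecture slide can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyNamenode`, `write_file`, `read_file`, the tiny block size, and the round-robin rack policy are all hypothetical names and simplifications introduced here, not the HDFS implementation. Real HDFS, for example, pipelines writes from one datanode to the next rather than having the client write every replica.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNamenode:
    """Metadata only: which blocks make up a file, and where replicas live."""
    racks: dict                                    # datanode name -> rack name
    replication: int = 3
    files: dict = field(default_factory=dict)      # path -> [block ids]
    locations: dict = field(default_factory=dict)  # block id -> [datanode names]
    next_block_id: int = 0

    def add_block(self, path):
        """Metadata op: allocate a block and pick datanodes for its replicas."""
        block_id = self.next_block_id
        self.next_block_id += 1
        targets = self._choose_targets()
        self.files.setdefault(path, []).append(block_id)
        self.locations[block_id] = targets
        return block_id, targets

    def _choose_targets(self):
        # Toy policy (an assumption, not HDFS's real one): walk the racks
        # round-robin so replicas span more than one rack when possible.
        by_rack = {}
        for dn, rack in sorted(self.racks.items()):
            by_rack.setdefault(rack, []).append(dn)
        queues = [dns for _, dns in sorted(by_rack.items())]
        targets = []
        while len(targets) < self.replication and any(queues):
            for queue in queues:
                if queue and len(targets) < self.replication:
                    targets.append(queue.pop(0))
        return targets


def write_file(namenode, datanodes, path, data, block_size=4):
    """Split data into blocks; ask the namenode where to put each one."""
    for start in range(0, len(data), block_size):
        block_id, targets = namenode.add_block(path)       # metadata op
        for dn in targets:                                 # block ops (simplified)
            datanodes[dn][block_id] = data[start:start + block_size]


def read_file(namenode, datanodes, path):
    """Look up block locations, then read each block from one replica."""
    chunks = []
    for block_id in namenode.files[path]:                  # metadata op
        first_replica = namenode.locations[block_id][0]
        chunks.append(datanodes[first_replica][block_id])  # block op
    return b"".join(chunks)


racks = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack2"}
namenode = ToyNamenode(racks=racks)
datanodes = {name: {} for name in racks}   # datanode name -> {block id: bytes}
write_file(namenode, datanodes, "/users/foo/data", b"hello world")
print(read_file(namenode, datanodes, "/users/foo/data"))
```

Note that the namenode in this sketch never touches file bytes; it only records names, block ids, and replica locations, which is the "metadata-data separation" the slides refer to.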

Facebook Use of HDFS

[Figure: systems shown using HDFS at Facebook: titan, ODS, Hive, Scribe, …]

•  Quiz: What is the total number of HDFS clusters used at Facebook?

•  The biggest one: warehouse cluster storing Hive tables

The largest HDFS cluster

•  Thousands of nodes
•  Close to 100 PB of configured capacity
•  100+ million files
•  Thousands of concurrent clients access the cluster
•  At peak hour, thousands of audit requests per second
•  It is growing each day

Growth of the cluster

[Figure: growth of the cluster. Series: number of files (millions) and used capacity (PB).]

Agenda

•  Improving HDFS scalability
    •  Scale of the system
    •  System monitoring
    •  Communication
    •  Synchronization
    •  Data structures and algorithms
    •  Network awareness
    •  Handling persistency
    •  Memory management
    •  Tiny bugs, huge losses

Scale of the system

•  FSDirectory
    •  Information about all files/directories in the namespace
•  BlocksMap
    •  Information about all the blocks in the filesystem
•  Other associated structures, for example:
    •  Queues for storing replication status
    •  Table of datanodes

Memory utilization:

•  In-memory state
•  FSImage + FSEditsLog
    •  Ensuring persistency
•  System logs
    •  Debugging and monitoring
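Because FSDirectory and BlocksMap both live in the Namenode's heap, the file count translates directly into memory pressure. The following back-of-the-envelope sketch shows the arithmetic. Only the 100 million file count comes from the slides; the blocks-per-file ratio and the bytes-per-object figure are assumptions chosen for illustration (the latter is a commonly quoted rule of thumb, not a measurement of this cluster), and directories and the other structures are ignored.

```python
def estimate_heap_gb(num_files, blocks_per_file, bytes_per_object):
    """Rough size of file objects plus block objects, in GB (1e9 bytes)."""
    file_objects = num_files
    block_objects = num_files * blocks_per_file
    return (file_objects + block_objects) * bytes_per_object / 1e9


FILES = 100_000_000       # from the slides: 100+ million files
BLOCKS_PER_FILE = 2       # assumption for illustration
BYTES_PER_OBJECT = 150    # assumption: rule-of-thumb figure, not measured

print(estimate_heap_gb(FILES, BLOCKS_PER_FILE, BYTES_PER_OBJECT))  # 45.0
```

Under these assumptions the namespace alone needs tens of gigabytes of heap, which is consistent with the slides' later remark about the Namenode running with an enormous heap.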

System monitoring - logging

[Figure: HDFS architecture diagram with Client, Namenode, and Datanodes; labels: metadata ops, block ops, replication, blocks.]

Communication

[Figure: Namenode, Datanodes, blocks.]

Synchronization

[Figure: "ONE LANE ROAD" sign.]

Data structures and algorithms

Network awareness

[Figure: Namenode, Datanodes, master rack, network switch.]

Handling persistency

[Figure: Namenode, SecondaryNamenode, and Client; FSImage at t0, FSEdits, and a new FSImage.]
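The persistency slide shows an FSImage taken at time t0, an FSEdits log, and a newer FSImage. One way to read that picture is as a checkpoint: replaying the edit log on top of the old image yields the new image. The sketch below illustrates that idea only. `LoggingNamenode`, `apply_edit`, `checkpoint`, and the three edit operations are hypothetical names introduced here, and the namespace is reduced to a path-to-replication-factor mapping.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to a namespace dict (path -> replication)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = args[0]
    elif op == "set_replication":
        if path not in namespace:
            raise KeyError(path)
        namespace[path] = args[0]
    elif op == "delete":
        del namespace[path]
    else:
        raise ValueError(f"unknown op: {op}")


class LoggingNamenode:
    """Keeps the namespace in memory and logs every change before applying it."""

    def __init__(self, fsimage):
        self.namespace = dict(fsimage)   # in-memory state loaded from the image
        self.edits = []                  # stands in for the on-disk edit log

    def mutate(self, edit):
        self.edits.append(edit)          # persist the intent first
        apply_edit(self.namespace, edit)


def checkpoint(fsimage, edits):
    """Merge an old image with the edit log into a new image (old one untouched)."""
    new_image = dict(fsimage)
    for edit in edits:
        apply_edit(new_image, edit)
    return new_image


image_t0 = {"/users/foo/data": 3}
nn = LoggingNamenode(image_t0)
nn.mutate(("create", "/tmp/job1", 3))
nn.mutate(("set_replication", "/users/foo/data", 2))
nn.mutate(("delete", "/tmp/job1"))

image_t1 = checkpoint(image_t0, nn.edits)
print(image_t1 == nn.namespace)
```

After a checkpoint like this, the edit log can be truncated, which keeps restart time bounded because only edits made since the last image need to be replayed.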

Memory management

•  Namenode running with an enormous heap space
•  Problem: a full GC takes at least 10 minutes
    •  The NameNode is unresponsive during that time!
•  Improvements:
    •  Configuration changes
    •  Avoid unnecessary creation of temporary data

Tiny bugs, huge losses

•  A bug in the MR application layer caused a scan of the whole /tmp subtree for each job submitted:
    •  Huge number of VALID requests to the NameNode
•  Another bug in the application layer increased the number of metadata read requests 12-fold.

Agenda

•  HDFS federation

Static Partitioned Clusters

Cluster overlay

[Figure: two Namenodes, NN1 and NN2; Datanodes DN1, DN2, and DN3 each appear twice.]

Federation

[Figure: two Namenodes, NN1 and NN2, sharing one set of Datanodes: DN1, DN2, DN3.]
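The federation picture, with several Namenodes over one shared set of Datanodes, can be sketched as two pieces: a table that maps parts of the namespace to Namenodes, and Datanodes that keep a separate set of blocks per Namenode. This is a minimal sketch under assumptions. The mount points `/warehouse` and `/tmp`, the helper names `resolve_namenode` and `store_block`, and the per-Namenode sets are hypothetical and are not taken from the slides.

```python
def resolve_namenode(mount_table, path):
    """Return the namenode owning `path`, using longest-prefix match."""
    best = None
    for prefix, namenode in mount_table.items():
        matches = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if matches and (best is None or len(prefix) > len(best[0])):
            best = (prefix, namenode)
    if best is None:
        raise KeyError(f"no namenode mounted for {path}")
    return best[1]


# Hypothetical split of the namespace between the two namenodes.
MOUNT_TABLE = {"/warehouse": "NN1", "/tmp": "NN2"}

# Every datanode serves both namenodes, keeping one set of blocks for each.
datanodes = {dn: {"NN1": set(), "NN2": set()} for dn in ("DN1", "DN2", "DN3")}


def store_block(path, block_id, targets):
    """Record a block on the target datanodes under the owning namenode."""
    owner = resolve_namenode(MOUNT_TABLE, path)
    for dn in targets:
        datanodes[dn][owner].add(block_id)
    return owner


print(store_block("/warehouse/hive/t1", 1, ["DN1", "DN2", "DN3"]))
print(store_block("/tmp/job1/part-0", 1, ["DN2", "DN3"]))
```

Block id 1 appears twice without conflict because each Namenode's blocks are tracked separately on every Datanode, so the Namenodes need no coordination with each other.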

Conclusion

•  We have tons of data stored in HDFS in many clusters, including one of the largest clusters in the world.
•  We need to deal with problems never faced before.
•  Our job is to keep it running efficiently, not lose data, and make it highly available!

Future

•  Improve NameNode availability
    •  Manual / automatic failover
•  Improve I/O efficiency
•  Cross data-center support
