Running the largest HDFS cluster - University of …cloud.berkeley.edu/data/hdfs-scalability.pdf4!...

Running the largest HDFS cluster

Hairong Kuang, Tom Nykiel [email protected] [email protected]

Agenda �  HDFS at Facebook

�  Improving HDFS scalability

�  HDFS federation

What is HDFS �  HDFS:

�  Storage layer for Hadoop Open Source Apache project �  Scale: petabytes of data on thousands of nodes

�  Characteristics: �  Uses clusters of commodity computers �  Use replication across servers to deal with unreliable storage/servers �  Metadata-data separation - simple design �  Slightly Restricted file semantics

�  Focus is mostly sequential access �  Single writers �  No file locking features

�  Supports moving computation close to data �  Single ‘storage + compute’ cluster vs. Separate clusters

4

HDFS Architecture

Read

Metadata ops

Client

Metadata (Name, #replicas, …): /users/foo/data, 3, …

Namenode

Client

Datanodes

Rack 1 Rack 2

Replication

Block ops

Datanodes

Write

Blocks

Metadata ops

Facebook Use of HDFS

HDFS

HBase

titan ODS

Hive Scribe …

•  Quiz: What is the total number of HDFS clusters used in Facebook ???

•  The biggest one: warehouse cluster storing Hive tables

The largest HDFS cluster �  Thousands of nodes

�  Close to 100 PB of configured capacity

�  100+ million files

�  Thousands of concurrent clients access the cluster

�  At peak hour, thousands of audit requests per second

�  It is growing each day

Growth of the cluster

Number of files(m) Used Capacity (PB)

2009

2010

2011

Agenda

�  Improving HDFS scalability

�  Scale of the system �  System monitoring �  Communication �  Synchronization �  Data structures and algorithms �  Network awareness �  Handling persistency �  Memory management �  Tiny bugs – huge losses

Scale of the system �  FSDirectory

�  Information about all files/directories in the namespace

�  BlocksMap �  Information about all the blocks in the filesystem

�  Other associated structures - examples: �  Queues for storing replication status �  Table of datanodes

Memory utilization:

�  In memory state

�  FSImage + FSEditsLog �  ensuring persistency

�  System logs �  debugging and monitoring

System monitoring - logging

10

Read

Metadata ops

Client

Namenode

Datanodes

Replication

Block ops

Datanodes

Blocks

Communication

Namenode

Datanodes

Datanodes

Blocks

Synchronization

ONE LANE ROAD

AHEAD

Data structures and algorithms

Network awareness

Namenode

Datanodes

Datanodes

Rack1

Master Rack

Rack2

Network switch

Handling persistency

Namenode

SecondaryNamenode

FSImage t0

FSEdits FSImage

t1

Client

Client

Memory management

�  Namenode running with enormous heap space

�  Problem: A full GC takes at least 10 minutes �  The NameNode is non-responsive !

�  Improvements: �  Configuration changes

�  Avoid unnecessary creations of temporary data

Tiny bugs - huge losses

�  A bug in the MR application layer caused the scan of the whole /tmp subtree for each job submitted: �  Huge number of VALID requests to the NameNode

�  Another bug in the application layer exploded the number of metadata read requests by 12 times.

Agenda

�  HDFS federation

Static Partitioned Clusters

HDFS2

Cluster overlay

NN1 NN2

DN1 DN1 DN2 DN2 DN3 DN3

Federation

NN1 NN2

DN1 DN2 DN3

Federation

NN1 NN2

DN1 DN2 DN3

NN3

Conclusion

�  We have tons of data stored in HDFS in many clusters,

including one of the largest clusters in the world.

�  We need to deal with problems never faced before

�  Our job is to keep it running efficiently, not lose data,

and make it highly available !

Future �  Improve NameNode availability

�  Manual / Automatic failover

�  Improve I/O efficiency

�  Cross data-center support

Running the largest HDFS cluster - University of …cloud.berkeley.edu/data/hdfs-scalability.pdf4!...

Documents

Transcript of Running the largest HDFS cluster - University of …cloud.berkeley.edu/data/hdfs-scalability.pdf4!...