This post provides an introduction to the following concepts :
- Hadoop Basics
- What is HDFS ?
- What is YARN ?
Let's start with the simplest question first.
What is Big Data ?
Big data is a term coined for huge volumes of data (in terabytes or petabytes) that are difficult to manage using a traditional DBMS.
Here are some examples:
Stock markets generate terabytes of new trade data every day.
Every day, millions of images and posts get added to Facebook, Twitter, etc.
This data can be categorized into 3 types :
- Structured data : This is data from RDBMS systems
- Semi-structured data : emails, posts etc
- Unstructured data : Photos, videos, music etc
What is Hadoop ?
Hadoop is a framework that addresses the big data problem by allowing distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management tool that provides scale-out storage (you can keep adding commodity hardware as the need increases) and distributed processing.
Hadoop has 2 main components :
- HDFS : Storage component
- YARN : Processing component
Let's try to get an understanding of HDFS and YARN.
HDFS (Hadoop Distributed File System)
Hadoop Distributed File System (HDFS) is built on a Master/Slave architecture, where each cluster has 1 name node (master) and several data nodes (slaves).
In HDFS, data is stored in blocks. The default block size is 128 MB in Hadoop 2.x and 64 MB in Hadoop 1.x. The block size can also be configured (in Hadoop 2.x, through the dfs.blocksize property).
So, if you store a 1 GB (= 1024 MB) file in Hadoop 2.x, it is split into 1024/128 = 8 blocks.
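The block arithmetic above can be sketched in a few lines of Python. This is only an illustration, not Hadoop code: the function name `split_into_blocks` is made up for this post. It also shows that when the file size is not an exact multiple of the block size, the last block holds only the remainder.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks needed to store a file."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


print(len(split_into_blocks(1024)))      # 1 GB file, 128 MB blocks
print(split_into_blocks(300))            # file that is not a multiple of 128 MB
print(len(split_into_blocks(1024, 64)))  # same 1 GB file with the Hadoop 1.x default
```

With the Hadoop 2.x default, the 1 GB file needs 8 blocks, while the 300 MB file needs two full blocks plus one 44 MB block.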
In Hadoop 2.x, the following components provide storage :
- Name node
- Secondary name node
- Data nodes
Name node :
- This is the master component in HDFS
- It manages the data stored as blocks on the data nodes by keeping the metadata in main memory. The metadata includes the list of files, the list of blocks for each file, the list of data nodes for each block, etc.
- The name node saves these details into files called FSImage and EditLogs. The FSImage is a snapshot of the file system state that is loaded when the server starts, and the EditLogs contain the log of transactions made since that snapshot.
- It keeps track of the overall directory structure and the placement of data blocks.
- In a write operation, it tells the client which data nodes the data will be written to, and in a read operation, it tells the client which data nodes the data will be read from.
Data nodes :
- These are the slaves, which are deployed on each commodity machine and actually store the data.
- They are responsible for serving read/write requests from the clients.
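To make the name node's bookkeeping more concrete, here is a minimal sketch of the two mappings it keeps in memory: files to blocks, and blocks to data nodes. The class `ToyNameNode`, its methods, and the block and node names are all hypothetical and exist only for this illustration. The real name node is far more involved.

```python
class ToyNameNode:
    """Keeps file -> blocks and block -> data nodes mappings in memory."""

    def __init__(self):
        self.file_blocks = {}      # file path -> list of block ids
        self.block_locations = {}  # block id -> list of data node names

    def add_block(self, path, block_id, datanodes):
        """Record that a new block of the file lives on the given data nodes."""
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = list(datanodes)

    def locate(self, path):
        """Answer a read request: which data nodes hold each block of the file?"""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


nn = ToyNameNode()
nn.add_block("/logs/app.log", "blk_1", ["dn1", "dn2", "dn3"])
nn.add_block("/logs/app.log", "blk_2", ["dn2", "dn3", "dn4"])
for block_id, nodes in nn.locate("/logs/app.log"):
    print(block_id, "->", nodes)
```

Note that the toy name node never touches the file contents. It only answers "where are the blocks?", and the client then talks to the data nodes directly.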
Secondary Name node :
- In Hadoop 1.x, the name node is a single point of failure: if you lose the name node, the whole cluster goes down.
- The secondary name node is a checkpointing helper, not a hot standby, i.e. it can’t take over from the name node in case of failure.
- By default, the secondary name node connects to the name node every hour, merges the EditLogs into the FSImage and keeps a copy of the result, so its backup of the name node can be up to an hour old.
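The periodic job of the secondary name node is usually called a checkpoint: the transactions in the EditLogs are replayed on top of the last FSImage to produce a fresh snapshot. The sketch below mimics that idea with a set of file paths standing in for the file system state. The function names and the two operations (`create`, `delete`) are assumptions made for this example, not the real on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace (a set of paths)."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into the snapshot; return (new_fsimage, empty_log)."""
    namespace = set(fsimage)
    for edit in edit_log:
        apply_edit(namespace, edit)
    return frozenset(namespace), []


fsimage = frozenset({"/data/a.csv"})
edit_log = [("create", "/data/b.csv"), ("delete", "/data/a.csv")]
fsimage, edit_log = checkpoint(fsimage, edit_log)
print(sorted(fsimage))
print(edit_log)
```

After the checkpoint, the snapshot contains only `/data/b.csv` and the edit log is empty, which is why regular checkpoints keep the log short and a restart fast.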
Replication and Rack awareness in Hadoop
Replication is a concept in Hadoop where copies of the same block of data are stored on different data nodes, so that data isn’t lost if a machine fails.
The default replication factor is 3, so each data block has 2 other copies. Replication is a sequential process: the block is written to the first data node, which forwards it to the second, which in turn forwards it to the third.
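Here is a small sketch of replication with the default factor of 3. It is a toy model under loud assumptions: `choose_datanodes` just takes the first available nodes and ignores racks entirely, whereas real HDFS uses a rack-aware placement policy, and all names are invented for this post.

```python
def choose_datanodes(datanodes, replication=3):
    """Pick distinct data nodes for the replicas (toy policy: first N nodes)."""
    if len(datanodes) < replication:
        raise ValueError("not enough data nodes for the requested replication")
    return datanodes[:replication]


def write_block(block_id, pipeline, storage):
    """Store the block on each data node in turn, one after the other."""
    for position, datanode in enumerate(pipeline, start=1):
        storage.setdefault(datanode, set()).add(block_id)
        print(f"replica {position} of {block_id} stored on {datanode}")


storage = {}  # data node name -> set of block ids stored on it
pipeline = choose_datanodes(["dn1", "dn2", "dn3", "dn4"])
write_block("blk_1", pipeline, storage)

del storage["dn1"]  # simulate one machine failing
survivors = [dn for dn, blocks in storage.items() if "blk_1" in blocks]
print("blk_1 is still available on:", survivors)
```

Even after `dn1` is lost, two copies of the block remain, which is the whole point of keeping 3 replicas.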
YARN (Yet Another Resource Negotiator)
YARN is the processing layer of Hadoop: it manages the resources of the cluster and uses them to run jobs.
YARN consists of the following components :
- Resource Manager (Master)
  - Typically, only one Resource Manager is active per cluster
  - The Resource Manager manages the use of resources across the cluster
- Node Manager (Slave)
  - Node Managers run on the slave nodes of the cluster
  - They launch and monitor containers
- Application Master
  - The Resource Manager assigns a Node Manager to run the Application Master
  - The Application Master executes a job using one or more containers
- Container
  - Containers execute specific processes with a set of resources, such as CPU and memory, allocated to them
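The interplay of these components can be sketched as a toy scheduler. Everything here is hypothetical (class names, the first-fit allocation rule, the memory and core figures) and only illustrates the idea that the Resource Manager hands out containers on Node Managers that still have free resources. Real YARN schedulers are considerably more sophisticated.

```python
class ToyNodeManager:
    """Tracks free resources and running containers on one slave node."""

    def __init__(self, name, memory_mb, vcores):
        self.name = name
        self.free_memory_mb = memory_mb
        self.free_vcores = vcores
        self.containers = []

    def can_fit(self, memory_mb, vcores):
        return self.free_memory_mb >= memory_mb and self.free_vcores >= vcores

    def launch(self, container_id, memory_mb, vcores):
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        self.containers.append(container_id)


class ToyResourceManager:
    """Grants containers on the first node with enough free resources."""

    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.next_id = 1

    def allocate(self, memory_mb, vcores):
        for nm in self.node_managers:
            if nm.can_fit(memory_mb, vcores):
                container_id = f"container_{self.next_id}"
                self.next_id += 1
                nm.launch(container_id, memory_mb, vcores)
                return container_id, nm.name
        return None  # the cluster has no capacity left for this request


rm = ToyResourceManager([ToyNodeManager("nm1", 4096, 2),
                         ToyNodeManager("nm2", 4096, 2)])
am = rm.allocate(1024, 1)                         # container for the Application Master
tasks = [rm.allocate(2048, 1) for _ in range(3)]  # containers for the job's tasks
print("Application Master:", am)
print("Task containers:", tasks)
```

In this run, the Application Master and the first task land on `nm1`, and the remaining two tasks spill over to `nm2` once `nm1` runs out of cores.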
© 2015, www.techkatak.com. All rights reserved.