
Hadoop Ecosystem Explained | Big Data Hadoop Ecosystem Components | Hadoop Training

Hadoop is an open-source framework that allows distributed processing of large data sets across clusters of commodity machines using a simple programming model.

Hadoop was first developed as a Big Data Processing system in 2006 at Yahoo! 

The design is based on the MapReduce paper that Google published describing its proprietary MapReduce implementation.

In the past few years, Hadoop has become a widely used framework for storing & processing Big Data.

Big Data Picture With Hadoop

 


HDFS (Hadoop Distributed File System) - Storage

Stores different types of large data sets (i.e. structured, unstructured and semi-structured data).

HDFS creates a level of abstraction over the storage resources, so the whole of HDFS can be seen as a single unit.

Stores data across various nodes and maintains a log of the stored data (metadata).

HDFS has two core components : the NameNode, which holds the metadata, and the DataNodes, which hold the actual data blocks.
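
As a quick sketch, a client can treat the whole cluster as one file system through the HDFS Java API; the NameNode address and path below are assumptions for the example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode address; adjust to your cluster.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // Write a small file: the NameNode records the metadata,
            // the DataNodes store the actual blocks.
            Path file = new Path("/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("Hello, HDFS!");
            }

            // Read it back through the same single-unit abstraction.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }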

Hadoop-based Big Data System :

Hadoop : Master/Slave Architecture

Scenario : A project manager manages a team of four employees; he assigns work to each of them and tracks their progress, just as the Hadoop master assigns tasks to the slave nodes and tracks them.


YARN (Yet Another Resource Negotiator)

Performs all your processing activities by allocating resources and scheduling tasks.

Two services : ResourceManager and NodeManager

ResourceManager : Manages resources and schedules applications running on top of YARN

NodeManager : Manages containers and monitors resource utilization in each container

 

ResourceManager & NodeManager 

ResourceManager : Master daemon that arbitrates cluster resources among applications & accepts job submissions.

Allocates the first container for the AppMaster.

NodeManager : Responsible for containers, monitoring their resource usage (CPU, memory, disk, network) & reporting it to the RM.

AppMaster : 

  • One per application
  • Coordinates and manages MapReduce jobs
  • Negotiates resources from the RM

Container : A slice of cluster resources (memory, CPU etc.) allocated on a slave node (NM), inside which a task runs.
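
As a sketch, the ResourceManager location and the resources each NodeManager offers to containers are set in yarn-site.xml; the hostname and values below are illustrative:

    <configuration>
      <!-- Where NodeManagers and clients find the ResourceManager (assumed hostname). -->
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>rm-host</value>
      </property>
      <!-- Resources each NodeManager offers for containers (illustrative values). -->
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value>
      </property>
    </configuration>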

MapReduce : Data Processing Using Programming 

The core component of the Hadoop ecosystem for data processing.

Helps in writing applications that process large data sets using distributed and parallel algorithms.

In a MapReduce program, Map() and Reduce() are the two core functions.

The Map function performs actions like filtering, grouping and sorting.

The Reduce function aggregates and summarizes the results produced by the Map function.

Anatomy of a MapReduce Program 
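
To make the Map()/Reduce() split concrete, here is a compact word-count sketch using the standard Hadoop MapReduce Java API (input and output paths come from the command line; class names are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: emit (word, 1) for every word in the input split.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts emitted for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Between the two phases, the framework shuffles and sorts the map output by key, which is what makes the per-word summation in the reducer possible.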


PIG : Data Processing Service using Query 

PIG has two parts: Pig Latin, the language, and the Pig runtime, the execution environment.

1 line of Pig Latin = approx. 100 lines of MapReduce code.

The compiler internally converts Pig Latin into MapReduce jobs.

It gives you a platform for building data flows for ETL (Extract, Transform and Load).

PIG first loads the data, then performs various functions like grouping, filtering, joining, sorting, etc., and finally dumps the data on the screen or stores it in HDFS.
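
For example, the classic word count takes only a few lines of Pig Latin (the input path is an assumption):

    -- Load raw lines from HDFS (path is illustrative).
    lines  = LOAD '/data/input.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
    -- DUMP prints to the screen; STORE would write back to HDFS.
    DUMP counts;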

HIVE : Data Processing Service using Query 

A data warehousing component which analyses data sets in a distributed environment using SQL-like interface.

The query language of Hive is called Hive Query Language (HQL).

Two basic components : Hive Command Line and JDBC/ODBC driver

Supports user defined functions (UDF) to accomplish specific needs.
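
A small HQL sketch of that SQL-like interface; the table, columns and input path are assumptions for the example:

    -- Define a table over files in HDFS (schema-on-read).
    CREATE TABLE IF NOT EXISTS page_views (
      user_id STRING,
      url     STRING,
      ts      BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;

    -- Hive compiles this query into distributed jobs under the hood.
    SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;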

 What is Apache Hive ?

  • Data Warehousing Software Utility.
  • Used for Data Analysis
  • Built for SQL users
  • Manages Querying of Structured data
  • Simplifies and abstracts load on Hadoop
  • No need to learn Java and Hadoop API

Apache Hive Features :


Spark : In-memory Data Processing 

A framework for real-time data analytics in a distributed computing environment.

Written in Scala; originally developed at the University of California, Berkeley.

It executes in-memory computations, which increases the speed of data processing over MapReduce.

Up to 100x faster than Hadoop MapReduce for large-scale data processing, by exploiting in-memory computation.

Spark comes packed with high-level libraries.

Provides services such as MLlib, GraphX, SQL + DataFrames, and Spark Streaming.

Supports various languages like R, SQL, Python, Scala and Java.

Seamlessly integrates into complex workflows.
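
A minimal Spark word-count sketch in Java; the local master setting and the paths are assumptions for the example:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            // local[*] runs on all local cores; on a cluster this would be YARN etc.
            SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile("/data/input.txt");   // illustrative path
                JavaPairRDD<String, Integer> counts = lines
                    .flatMap(l -> Arrays.asList(l.split("\\s+")).iterator())
                    .mapToPair(w -> new Tuple2<>(w, 1))
                    .reduceByKey(Integer::sum);   // computed in memory across the cluster
                counts.saveAsTextFile("/data/output");
            }
        }
    }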


HBase : NoSQL Database

An open-source, non-relational, distributed database; in other words, a NoSQL database.

Supports all types of data, which is why it can handle anything and everything.

It is modelled after Google's BigTable.

It gives us a fault-tolerant way of storing sparse data.

It is written in Java, and HBase applications can be accessed through REST, Avro and Thrift APIs.
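
A minimal sketch using the HBase Java client API; the table, column family and values are assumptions for the example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Put: row key -> column family:qualifier -> value (sparse by design).
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);

                // Get the row back by key.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }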

HDFS v/s HBase 


Oozie : Job Scheduler 

Oozie is a job scheduler in the Hadoop ecosystem.

Two kinds of Oozie jobs : Oozie Workflow and Oozie Coordinator.

Oozie Workflow : Sequential set of actions to be executed.

Oozie Coordinator : Oozie jobs that are triggered when data becomes available, or on a time schedule.
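
A skeletal Oozie workflow definition with a single MapReduce action; the workflow name, paths and parameters are illustrative:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
      <start to="wordcount"/>
      <action name="wordcount">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name>mapred.input.dir</name>
              <value>/data/input</value>
            </property>
            <property>
              <name>mapred.output.dir</name>
              <value>/data/output</value>
            </property>
          </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Job failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>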

Apache Flume : Data Ingesting Service

Ingests unstructured and semi-structured data into HDFS.

It helps us in collecting, aggregating and moving large amounts of data.

It helps us ingest online streaming data from various sources like network traffic, social media, email messages, log files, etc., into HDFS.
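
A minimal Flume agent configuration sketch, assuming an agent named agent1 that tails an application log into HDFS (all names and paths are illustrative):

    # Source: tail a log file (exec source; path is illustrative).
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    agent1.sources.src1.type = exec
    agent1.sources.src1.command = tail -F /var/log/app.log
    agent1.sources.src1.channels = ch1

    # Channel: buffer events in memory between source and sink.
    agent1.channels.ch1.type = memory

    # Sink: write the events into HDFS.
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = hdfs://namenode:9000/flume/logs
    agent1.sinks.sink1.channel = ch1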

 Apache Sqoop : Data Ingesting Service 

Another data ingesting service.

Sqoop can import structured data from an RDBMS into HDFS, and export it back.

Flume, by contrast, only ingests unstructured or semi-structured data into HDFS.
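
A typical Sqoop import invocation might look like this; the JDBC URL, credentials, table and target directory are all assumptions for the example:

    # Import a relational table from MySQL into HDFS.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /data/orders \
      --num-mappers 4

    # The reverse direction: sqoop export pushes HDFS data back into the RDBMS.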

 ZooKeeper : Coordinator

An open-source server which enables highly reliable distributed coordination.

Apache ZooKeeper coordinates the various Hadoop services in a distributed environment.

Performs synchronization, configuration maintenance, grouping and naming.
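
A minimal sketch with the ZooKeeper Java client; the ensemble address and znode path are assumptions for the example:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher.Event.KeeperState;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // Connect string is illustrative: a three-node ensemble.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> {
                if (event.getState() == KeeperState.SyncConnected) connected.countDown();
            });
            connected.await();

            // Create a znode holding a piece of shared configuration.
            zk.create("/demo-config", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Any service in the cluster can now read (and watch) this znode.
            byte[] data = zk.getData("/demo-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }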

Apache Ambari : Cluster Manager

Software for provisioning, managing and monitoring Apache Hadoop clusters.

Gives us a step-by-step process for installing Hadoop services.

Handles configuration of Hadoop services.

Provides a central management service for starting, stopping and re-configuring Hadoop services.

Monitors health and status of the Hadoop cluster.
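
Ambari also exposes its management functions over a REST API; a quick sketch with curl, where the host, credentials and cluster name are assumptions (8080 is Ambari's usual default port):

    # List the clusters managed by this Ambari server (default credentials shown for illustration).
    curl -u admin:admin http://ambari-host:8080/api/v1/clusters

    # Query the state of a single service, e.g. HDFS (cluster name is an assumption).
    curl -u admin:admin http://ambari-host:8080/api/v1/clusters/mycluster/services/HDFS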
