What is Big Data and Why do Enterprises care about Big Data?

Posted in Hadoop

In this article, we will discuss what Big Data is and why enterprises care about it. We will learn: what is wrong with our traditional DWH solutions; when an RDBMS could not help much; the technical issues we face with an RDBMS; how Hadoop is different from an RDBMS; and the core features of Hadoop. To understand: what is Big […]

CAP Theorem

Posted in Cassandra, Hadoop, HBase, MongoDB

CAP Theorem for a distributed architecture: Consistency: the data in the database remains consistent after the execution of an operation. Example: after an insert operation, all clients will see the same data. Availability: the system is always on/available. Partition Tolerance: the system continues to function even if the servers are not able to communicate with each other. CAP provides […]

HBase Basics

Posted in Hadoop

Difference between HBase and RDBMS?

HBase | RDBMS
Column oriented | Row oriented
Flexible schema, add columns on the fly | Fixed schema
Good with sparse tables | Not optimized for sparse tables
Join using MapReduce, not optimized | Not applicable
Horizontal scalability (add hardware) | Hard to shard and scale
Good for structured and semi-structured data | Good for structured […]

Data loading using Flume

Posted in Hadoop

What is Flume? Flume is a service to collect, aggregate, and move large amounts of streaming data to HDFS efficiently. Components in Flume: source, sinks, channels, interceptors, channel selectors, agent, sink processor, and Flume event. Application servers generate a lot of streaming data such as audio, video, continuous tweets, etc., and that data is passed to Flume […]

Sqoop basics

Posted in Hadoop

What is Sqoop? Sqoop is a tool for efficiently bulk loading/transferring data between Hadoop and an RDBMS, in either direction. Sqoop imports tables or entire databases to HDFS and generates Java classes/MapReduce code to transfer the data. It also allows importing/exporting data from an SQL database to the Hive warehouse. Sqoop was originally developed at Cloudera. Connectors for Sqoop: Teradata, MySQL, Oracle, Microstrategy […]

Security in Hadoop Cluster

Posted in Hadoop

Hadoop has 2 core components: HDFS (NameNode, DataNode) and YARN (ResourceManager, NodeManager). Hadoop cluster modes: Standalone mode: no daemons, everything runs in a single JVM; suitable for the development phase; has no DFS. Pseudo-distributed mode: Hadoop daemons run on the local machine. Fully distributed mode: Hadoop daemons run on a cluster of machines. Security […]

Spark for fast batch processing

Posted in Hadoop

What is Spark? Apache Spark is an open-source framework written in Scala, a functional programming language that runs on the JVM. Spark is a parallel data processing framework that complements Hadoop for developing fast, unified big data applications combining batch, streaming, and interactive analytics. Spark was developed at UC Berkeley, and it generalizes the MapReduce framework. […]