Working of Hive in Hadoop ecosystem

Vikrant thakur
Oct 6, 2020


Introduction

Big data is one of the most discussed topics in the technology world today. Its potential is far greater than what is realized today, because the amount of data generated every day is more than traditional data management systems can handle. To solve the challenges of big data management and processing, the Apache Software Foundation introduced Hadoop. Hadoop is an open-source framework for storing and processing big data in a distributed environment. The Hadoop ecosystem contains a large set of modules and tools for handling different big data processing tasks. Hive is a tool from the Hadoop ecosystem for processing structured big data stored in HDFS.

To understand the need for Hive and how it is used, we first need to understand how Hadoop and its ecosystem work.

Hadoop Ecosystem: A platform that solves different problems related to big data and includes various commercial and Apache projects. It addresses four major stages of big data handling:

1) Data storage

2) Data processing

3) Data access

4) Data management

The following components together form the Hadoop ecosystem and cover all of the stages listed above.

· HDFS [Hadoop Distributed File System]

· YARN [Yet Another Resource Negotiator]

· MapReduce

· Spark: In-memory data processing

· Pig, Hive: Query-based processing of data

· HBase: NoSQL database

· Mahout, Spark MLlib: Machine learning libraries

· Solr, Lucene: Searching and indexing

· ZooKeeper: Cluster coordination and management

· Oozie: Job scheduling

Hadoop contains two major modules: MapReduce and HDFS. Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It sits on top of Hadoop to summarize big data and makes querying and analysis easy.
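To make "summarizing" concrete, here is a minimal Python sketch of the kind of work a simple aggregation query stands for. The `employees` table, its columns and the HiveQL text in the comment are made-up illustrations, and the map and reduce functions are a toy in-memory imitation of the idea, not the code Hive actually generates.

```python
from collections import defaultdict

# Hypothetical HiveQL this sketch imitates (table and columns are made up):
#   SELECT dept, COUNT(*) FROM employees GROUP BY dept;

# Toy rows standing in for a structured file stored in HDFS.
employees = [
    ("alice", "sales"),
    ("bob", "engineering"),
    ("carol", "sales"),
    ("dave", "engineering"),
    ("erin", "support"),
]

def map_phase(rows):
    """Emit a (key, 1) pair per row, keyed by the GROUP BY column."""
    for _name, dept in rows:
        yield dept, 1

def reduce_phase(pairs):
    """Sum the values for each key, as a reducer would after the shuffle."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

dept_counts = reduce_phase(map_phase(employees))
print(dept_counts)
```

The point of the sketch is the contrast: the one-line query in the comment replaces the hand-written map and reduce functions.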

The main components of the Hive architecture are:

1) User Interface - Hive is data warehouse software that lets users interact with HDFS. It supports three user interfaces: the Hive Web UI, the Hive command line and HDInsight.

2) Metastore - A database server that stores metadata about the data: the tables, their column names and data types, and the mapping to HDFS.

3) HiveQL process engine - HiveQL is a SQL-like query language that queries the data using the table metadata and other schema information held in the Metastore.

4) Hive execution engine - The link between the HiveQL process engine and MapReduce. It processes the query and produces the results.

5) HDFS - The distributed file system in which the data is stored.
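As a rough illustration of what the Metastore holds, the following Python sketch models one table entry with its column names, data types and HDFS mapping. The class names `TableMetadata` and `ToyMetastore`, the `employees` table and the path are hypothetical and chosen only for this example; a real Metastore is a database service with a much richer schema.

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    """Hypothetical record of what a Metastore keeps for one table."""
    name: str
    columns: dict       # column name -> data type
    hdfs_location: str  # where the table's files live in HDFS

@dataclass
class ToyMetastore:
    """In-memory stand-in for the Metastore database."""
    tables: dict = field(default_factory=dict)

    def register(self, table):
        self.tables[table.name] = table

    def lookup(self, name):
        return self.tables[name]

metastore = ToyMetastore()
metastore.register(TableMetadata(
    name="employees",  # made-up table
    columns={"name": "string", "dept": "string", "salary": "double"},
    hdfs_location="/user/hive/warehouse/employees",  # example path
))

meta = metastore.lookup("employees")
print(meta.columns["salary"], meta.hdfs_location)
```

Note that only the description of the table lives here; the rows themselves stay in files in HDFS.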

Working of Hive

The working of Hive can be explained in the following steps:

1) Execute Query - The Hive user interface sends the query to the driver (JDBC, ODBC or another database driver).

2) Get Plan - The driver uses the query compiler to check the syntax of the query and work out its plan and requirements.

3) Get Metadata - The compiler sends a metadata request to the Metastore.

4) Send Metadata - The Metastore sends the metadata back to the compiler as its response.

5) Send Plan - The compiler checks the requirements and sends the plan back to the driver.

6) Execute Plan - The driver sends the plan to the execution engine.

7) Execute Jobs - The execution engine sends the job to the JobTracker, where it runs as a MapReduce job.

8) Metadata Ops - While the job runs, the execution engine performs metadata operations using the Metastore.

9) Fetch Result - The results of the job are retrieved from the execution engine.

10) Send Results - The results are sent to the driver, which passes them on to the user interface.
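The steps above can be sketched as a small Python simulation that records the order in which the components talk to each other. Every class and method name here is hypothetical and none of it is Hive's real API; the "parsing" is deliberately naive, the storage is a plain dictionary standing in for HDFS, and step 8 (metadata operations during execution) is left out to keep the example short.

```python
trace = []  # records the order of the steps as they happen

class Metastore:
    def __init__(self, tables):
        self.tables = tables

    def get_metadata(self, table):
        trace.append("get metadata")
        meta = self.tables[table]
        trace.append("send metadata")
        return meta

class Compiler:
    def __init__(self, metastore):
        self.metastore = metastore

    def get_plan(self, query):
        trace.append("get plan")
        # Toy "parsing": treat the word after FROM as the table name.
        words = query.rstrip(";").split()
        table = words[[w.upper() for w in words].index("FROM") + 1]
        meta = self.metastore.get_metadata(table)
        trace.append("send plan")
        return {"table": table, "columns": meta["columns"]}

class ExecutionEngine:
    def __init__(self, storage):
        self.storage = storage  # stands in for HDFS

    def execute(self, plan):
        trace.append("execute plan")
        trace.append("execute job")  # a real engine would submit a job here
        return self.storage[plan["table"]]

class Driver:
    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine

    def run(self, query):
        trace.append("execute query")
        plan = self.compiler.get_plan(query)
        rows = self.engine.execute(plan)
        trace.append("fetch result")
        trace.append("send results")
        return rows

# Made-up table and rows, used only to drive the simulation.
metastore = Metastore({"employees": {"columns": ["name", "dept"]}})
engine = ExecutionEngine({"employees": [("alice", "sales"), ("bob", "engineering")]})
driver = Driver(Compiler(metastore), engine)

rows = driver.run("SELECT * FROM employees;")
print(rows)
print(trace)
```

Printing `trace` shows the same ordering as the numbered list: the driver first obtains a plan (which requires a round trip to the Metastore), and only then hands the plan to the execution engine.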

Conclusion

Hive is a data warehouse platform built on top of the Hadoop ecosystem. It is used to process big data with SQL-like queries written in the Hive Query Language (HiveQL), which Hive turns into MapReduce jobs. Hive works systematically: the work is divided into distinct steps that are executed by its major components, such as the user interface, the execution engine and HDFS.
