Apache Hive is a data warehouse infrastructure project built on top of Apache Hadoop for providing data summarization, ad-hoc queries, data aggregation, and analysis of datasets. The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data stored in Apache Hadoop and perform analysis on top of it.
Changing the default Metastore in Hive
Even though the Derby database is the default metastore in the hive, we can change it by editing hive-site.xml in the hive installation directory.
<configuration>
<property>
<name>hive.metastore.local</name>
<value>true</value></property>
<property><name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://<ip address of mysqlmachine>/hive?createDatabaseIfNotExist=true</value>
</property>
<property><name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value></property>
<property><name>javax.jdo.option.ConnectionUserName</name>
<value>username</value></property>
<property><name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
</configuration>
Hive Queries
Create Database
First, create the database testdb
if does not exist and list the databases.
-- Create Database if does not exists
CREATE DATABASE IF NOT EXISTS testdb;
-- Use the newly craete Database
USE testdb;
-- List all the databases
SHOW DATABASES;
SELECT Unique Records in Hive
Given below Sample Table
hive> SELECT * FROM employees LIMIT 10;
+--------+------------+------------+-----------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date |
+--------+------------+------------+-----------+--------+------------+
| 10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 |
| 10002 | 1964-06-02 | Bezalel | Simmel | F | 1985-11-21 |
| 10003 | 1959-12-03 | Parto | Bamford | M | 1986-08-28 |
| 10004 | 1954-05-01 | Chirstian | Koblick | M | 1986-12-01 |
| 10005 | 1955-01-21 | Kyoichi | Maliniak | M | 1989-09-12 |
| 10006 | 1953-04-20 | Anneke | Preusig | F | 1989-06-02 |
| 10007 | 1957-05-23 | Tzvetan | Zielinski | F | 1989-02-10 |
| 10008 | 1958-02-19 | Saniya | Kalloufi | M | 1994-09-15 |
| 10009 | 1952-04-19 | Sumant | Peac | F | 1985-02-18 |
| 10010 | 1963-06-01 | Duangkaew | Piveteau | F | 1989-08-24 |
+--------+------------+------------+-----------+--------+------------+
SELECT DISTINCT birth_date,first_name,last_name,gender,hire_date
FROM employees
GROUP BY last_name;
Upload Data in HDFS
Upload the drivers.csv
and timesheet data into HDFS(/user/maria-dev/) from the below data source
Now Upload the Files
Create an Internal Table for the Driver’s Data
As the database is recently created, no tables exist in the new database. The following query will result in an empty result.
-- Show all the Tables
SHOW TABLES;
Now create a Table using the Hive DDL statement.
-- Create Table drivers if does not exists
CREATE TABLE IF NOT EXISTS drivers_int(driverId INT ,name STRING ,ssn BIGINT, location STRING ,certified STRING ,wageplan STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES('skip.header.line.count' = '1');
SHOW TABLES;
In this DDL statement notice the following things
FIELDS TERMINATED BY ','
It means that the data for this table will be separated by a comma or the data will be in CSV formatSTORED AS TEXTFILE
ROW FORMAT DELIMITED
TBLPROPERTIES
Table Definition in Hive
describe <table name>;
Describe table schema in detail
describe extended <tablename>;
-- Describes Table
-- describe extended <tablename>;
DESCRIBE EXTENDED drivers_int;
Load the data by overwriting the existing data
-- Insert the drivers_int table with the drivers.csv data by overwriting existing data
LOAD DATA INPATH '/user/maria_dev/drivers.csv' OVERWRITE INTO TABLE drivers_int;
--List the data from Tables
SELECT * FROM temp_drivers LIMIT 10;
After executing LOAD DATA
we can see the table drivers_int
was populated with data from drivers.csv
.
Note that Hive consumed the data file drivers.csv
during this step. If you look in the File Browser, you will see drivers.csv
is no longer there.
As we are creating an internal table, data from HDFS will be moved to the Hive warehouse.
Create a table for the driver timesheet
-- Create Table TimeSheet
CREATE TABLE IF NOT EXISTS timesheet_int( driverId INT ,week INT ,hours_logged INT ,miles_loggged INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES('skip.header.line.count' = '1');
SHOW TABLES;
Load CSV data into the Table timesheet
-- Load data into Time Sheet Table
LOAD DATA INPATH '/user/maria_dev/timesheet.csv' OVERWRITE INTO TABLE timesheet_int;
-- List the data from TimeSheet Table
SELECT * FROM timesheet_int LIMIT 10;
Find the sum of hours and miles logged score for a year for each driver
SELECT driverId ,sum(hours_logged) ,sum(miles_loggged) FROM timesheet_int GROUP BY driverId LIMIT 10;
Join the two tables to get the driver Name for aggregated data
-- Join the two tables to get the aggregrated hours and miles data along with the respective driver name
SELECT d.driverId, d.name ,t.total_hours ,t.total_miles
FROM drivers_int d
JOIN
(SELECT driverId ,sum(hours_logged) total_hours ,sum(miles_loggged) total_miles
FROM timesheet_int
GROUP BY driverId) t
ON (d.driverId = t.driverId)
LIMIT 10;
Change Column Data Type in Hive
ALTER TABLE <Table Name> CHANGE <column name> <column name> <New Data Type>;
//Example
ALTER TABLE Employee CHANGE salary salary BIGINT;
Rename a Table in Hive
ALTER TABLE hive_table_name RENAME TO new_table_name;
Retrieve Maximum Salary in Hive Table
Let us suppose we have a table employee consisting of name and salary
- Second-Highest Salary
SELECT * FROM (SELECT salary ,ROW_NUMBER() over (ORDER BY salary) as row_no FROM employee GROUP BY SALARY) res WHERE res.row_no = 2 ;
- Third-Highest Salary
SELECT * FROM
(SELECT salary ,ROW_NUMBER() over (ORDER BY salary) as row_num FROM
employee GROUP BY SALARY) res
WHERE res.row_num = 3;