If you are switching from Hortonworks Data Platform (HDP) 2.6 to 3.0+, you will have a hard time accessing Hive tables through the Apache Spark shell. HDP 3 introduced the Hive Warehouse Connector (HWC), a Spark library/plugin that is launched with the Spark application. You need to understand how to use HWC to access Hive tables from Spark in HDP 3.0 and later. You can also export tables from Spark to Hive and vice versa using this connector.
In HDP 3.0 and later, Spark and Hive use independent catalogs for accessing SparkSQL or Hive tables on the same or different platforms. A table created by Spark resides in the Spark catalog, whereas a table created by Hive resides in the Hive catalog. When we create a database in the new platform, it falls under a catalog namespace, similar to how tables belong to a database namespace. These tables become interoperable when we use the Hive Warehouse Connector.
Hive Warehouse Connector Operations
You can use the Hive Warehouse Connector (HWC) API to access any type of table in the Hive catalog from Spark. When you use SparkSQL, standard Spark APIs access tables in the Spark catalog.
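For instance, the two calls below (a minimal sketch that assumes the hive session object built later in this article) hit different catalogs: spark.sql goes through the standard Spark APIs and the Spark catalog, while the HWC call reads the Hive catalog.
// Standard Spark API: table names resolve against the Spark catalog
spark.sql("show tables").show()
// Hive Warehouse Connector API: table names resolve against the Hive catalog
hive.showTables().show()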
We can read and write Apache Spark DataFrames and Streaming DataFrames to and from Apache Hive using this connector. It supports the following applications:
- Spark shell
- PySpark
- spark-submit script
The Spark Thrift Server is not supported.
Operations Supported by the Hive Warehouse Connector
Below are some of the operations supported by the Hive Warehouse Connector; a brief Scala sketch of several of these calls follows the list.
- Describing a Table
- Creating a table for ORC-formatted data
- Selecting Hive data and retrieving a DataFrame
- Writing a DataFrame to Hive in batch
- Executing a Hive update statement
- Reading table data from Hive, transforming it in Spark, and writing it to a new Hive table
- Writing a DataFrame or Spark stream to Hive using Hive Streaming
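The snippet below is a minimal Scala sketch of a few of these operations. The database, table, and column names (sales, sales_copy, id, amount) are placeholders, and it assumes the hive session and imports set up in the next section.
// Describe an existing Hive table
hive.describeTable("sales").show()
// Create a table for ORC-formatted data in the Hive catalog
hive.createTable("sales_copy")
  .ifNotExists()
  .column("id", "bigint")
  .column("amount", "double")
  .create()
// Select Hive data and retrieve a DataFrame
val salesDF = hive.executeQuery("SELECT id, amount FROM sales")
// Write the DataFrame to Hive in batch
salesDF.write
  .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "sales_copy")
  .save()
// Execute a Hive update statement
hive.executeUpdate("ALTER TABLE sales_copy SET TBLPROPERTIES ('comment' = 'copy of sales')")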
Launching Spark Shell with HWC for Scala
1. Locate the hive-warehouse-connector-assembly jar in /usr/hdp/current/hive_warehouse_connector/.
2. Add the connector jar to the application submission using --jars:
/usr/hdp/current/spark2-client/bin/spark-shell --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" --conf spark.hadoop.hive.zookeeper.quorum="sandbox-hdp.hortonworks.com:2181" --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar
scala>
Replace sandbox-hdp.hortonworks.com with the IP address of your cluster.
3. Use the Hive Warehouse API to access Apache Hive databases and tables:
scala> import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession
scala> import com.hortonworks.hwc.HiveWarehouseSession._
import com.hortonworks.hwc.HiveWarehouseSession._
scala> val hive = HiveWarehouseSession.session(spark).build()
hive: com.hortonworks.spark.sql.hive.llap.HiveWarehouseSessionImpl = com.hortonworks.spark.sql.hive.llap.HiveWarehouseSessionImpl@1bfafce1
//Select the Hive database
scala> hive.setDatabase("foodmart")
//Show tables
scala> hive.showTables()
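With the session in place, a typical round trip reads a Hive table into a DataFrame, transforms it with ordinary Spark APIs, and writes the result to a new Hive table. The table and column names below (sales_fact_1997, store_id, store_counts) are placeholders from the foodmart sample database, so adjust them to your own data.
//Read a Hive table into a Spark DataFrame
scala> val df = hive.executeQuery("SELECT * FROM sales_fact_1997")
//Transform it in Spark
scala> val counts = df.groupBy("store_id").count()
//Write the result to a new Hive table through the connector
scala> counts.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", "store_counts").save()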
Launching Spark Shell with HWC for PySpark
1. Locate the hive-warehouse-connector-assembly jar in /usr/hdp/current/hive_warehouse_connector/.
2. Add the connector jar and the Python zip to the application submission using --jars and --py-files:
pyspark --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar \
--py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.0.1.0-187.zip \
--conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" \
--conf spark.hadoop.hive.zookeeper.quorum="sandbox-hdp.hortonworks.com:2181"
Replace sandbox-hdp.hortonworks.com with the IP address of your cluster.
3. Use the Hive Warehouse API to access Apache Hive databases and tables:
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
# Select the Hive database
hive.setDatabase("foodmart")
# Show tables
hive.showTables()
Since this connector is still in an early phase, you may experience issues while using some of the features of this API.