HIve_JSON_Image
Hive JSON

 

JSON is the most popular data exchange format over web. Your application may directly interact with live system or gets dump from another application; Whatsoever is the problem, being Data Engineer or Scientist its essential to know how json data can be loaded to existing hive tables.

Simplified Understanding :

  1. [CONFIGURE] A JAR’s (hive-hcatalog-core.jar) location needs to be added in your hive-site.xml OR your hive session which contains JSON SerDe
  2. [STAGE] Load the complete file in a temporary table with a single cell  as a string.
  3. Create a table with the columns which you required from JSON file
  4. [LOAD] ]Insert into table select from temp table, using get_json_object() function parse the json string from temp table and load into main table

Detailed Steps:

HCatalog is a storage management layer of Hadoop which enables different Hadoop processing tools to interact with each other. Hive-HCatalog-Core.jar contains the json SerDe which helps to parse the json string and extract fields from it.

The Jar is found in Cloudera VM at

/usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core-1.1.0-cdh5.10.0.jar

There are two approaches to configure jars. One by setting paths in hive-env.sh. and another by directly setting jars location in session. It depends on your cluster administration, but both will get your job done.

Configuring via hive-env.sh

  • hive-env.sh is located at /usr/lib/hive/conf
  • Use hive-env.sh.template to create hive-env.sh if it does not exist. By using sudo cp hive-env.sh.template hive-env.sh
  • Find “HIVE_AUX_JARS_PATH”
    # Folder containing extra ibraries required for hive compilation/execution can be controlled by:
    # export HIVE_AUX_JARS_PATH=
  • Replace
    export HIVE_AUX_JARS_PATH=/usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core-1.1.0-cdh5.10.0.jar
  • Hive Services need to be restarted to reflect changes.
    sudo service hive-server2 restart

Configuring via Hive Session

hive> set hive.aux.jars.path=/usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core-1.1.0-cdh5.10.0.jar
  • The function to parse JSON string is get_json_object(param1,param2) which requires two parameters and returns values corresponding to its key.
  • Param1 is the column name where the json string is placed in this example “jsondata” column of stage_circuits table.
  • param2 is the json field inside document. $ represents the root object and FIELD1…9 are the key names of the json document.
# creating datasbe
CREATE DATABASE IF NOT EXISTS f1;

# creating the main circuits table
CREATE TABLE f1.circuits (
circuitId INT,
circuitRef STRING,
name STRING,
location STRING,
country STRING,
lat DECIMAL(38,18),
lng DECIMAL(38,18),
alt STRING,
url STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/cloudera/f1/Hive/circuits';

# staging table 
CREATE TEMPORARY TABLE f1.stage_circuits(
jsondata STRING
);

# loading staging table
LOAD DATA INPATH '/user/cloudera/f1/Dataset/circuits.json' OVERWRITE INTO TABLE f1.stage_circuits;

# parsing json string from staging table and loading into main table
INSERT OVERWRITE TABLE f1.circuits
SELECT  CAST(get_json_object(jsondata,'$.FIELD1') AS INT) circuitId,
get_json_object(jsondata,'$.FIELD2') circuitRef,
get_json_object(jsondata,'$.FIELD3') name,
get_json_object(jsondata,'$.FIELD4') location,
get_json_object(jsondata,'$.FIELD5') country,
CAST(get_json_object(jsondata,'$.FIELD6') AS DECIMAL(38,18)) lat,
CAST(get_json_object(jsondata,'$.FIELD7') AS DECIMAL(38,18)) lng,
get_json_object(jsondata,'$.FIELD8') alt,
get_json_object(jsondata,'$.FIELD9') url
from stage_circuits;

External References:

get_json_object()
hive.aux.jars.path

Dataset Download

 

LEAVE A REPLY

Please enter your comment!
Please enter your name here

This site uses Akismet to reduce spam. Learn how your comment data is processed.