JSON

The JSON SerDe supports querying of JSON data stored in Hive tables through the use of JSON Pointer definitions. The Serde features are used to define tables which can include a combination of regular columns, columns parsed out of the JSON document, fragments from the source document.

The SerDe supports querying single dimension as well as repeating groups sourced from the JSON document. In case of repeating groups, the SerDe returns the value as arrays. These arrays can be further flattened through the use of functions provided by other UDFs (using a combination of record_index and index_range from the UDF library) which support nullable values, positional values within the array etc.

The table definition using the SerDe consists of the following sections

  • column definitions - which will be the table columns available for queries

  • source table definition - details of the source table from where the regular columns - JSON document is queried

  • column mapping definition - the mapping of the exposed columns to the source table - the JSON Pointer mappings of values in the JSON document to the table columns

  • SerDe classes - the input and output classes used by the table definition

The base table used for this example

CREATE TABLE source_document (
   source_id  int COMMENT 'pkey',
   docum_msg  string COMMENT 'json data',
   person_id  string
 )
 STORED AS ORC
 TBLPROPERTIES ("orc.compress"="SNAPPY");
hive> describe source_document;
+----------------+--------------------+------------------------+
|   col_name     |   data_type        |    comment             |
+----------------+--------------------+------------------------+
| source_id      |    int             |      pkey              |
| docum_msg      |    string          |      json data         |
| person_id      |    string          |                        |
+----------------+--------------------+------------------------+

JSON document used for the example.

{
    "person": [
    {
      "name": "John",
      "age": 30,
      "car": "honda"
    },
    {
      "name": "Julie",
      "age": 18,
      "car": "corolla"
    },
    {
      "name": "Pat",
      "age": 8,
      "car": "mini"
    }
  ],
  "included": [
    {
      "type": "people",
      "count": "3",
      "attributes": {
        "type": "students",
        "full-time": "y",
        "gender": "mix"
      }
    }
  ]
}

An external table definition, which references the source table containing JSON data is shown below.

create external table target_document (
   target_id   int,
   name        array<String>,
   age         array<String>
 )
 ROW FORMAT SERDE 'io.invariant.hive.serde2.json.JSONSerDe'
 with SERDEPROPERTIES (
   "hive.table.type"="orc",
   "hive.table.schema"="scratch",
   "hive.table.name"="source_document",
   "table.derived.column.source_id"="target_id",
   "table.derived.jsondoc.column"="docum_msg",
   "column.jsonpath.node.person"="/person",
   "column.jsonpath.person.name"="/name",
   "column.jsonpath.person.age"="/age"
 )
 STORED AS
 INPUTFORMAT 'io.invariant.hive.serde2.json.format.InvHiveBasedJSONInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
 LOCATION "/apps/hive/warehouse/scratch.db/source_document"
 TBLPROPERTIES (
 "json.path.property"="json-pointer"
 );

Table definition

create external table target_document (
  target_id  int,
  name       array<String>,
  age        array<String>
 )
 ROW FORMAT SERDE 'io.invariant.hive.serde2.json.JSONSerDe'

The SerDe used to define the row format. This will be the SerDe used when defining tables querying JSON String values

SerDe Properties

Key

Sample Value

Description

hive.table.type

ORC

Type of table. Currently ORC & AVRO types are supported

hive.table.schema

scratch

schema of the source table

hive.table.name

source_document

source table name

table.derived.jsondoc.column

docum_msg

Column in the source table that contains the JSON string to be queried

table.derived.column.source_id

target_id

An existing column from source table mapped to target

Indicates the source_id column from source table to be mapped to target_id in the target table. Please ensure the target column definition is consistent with the set of columns defined in the external table in the top section

JSON mapping

JSON mapping follows a definition of a node and leaves of a node made up of the elements under the leaf. This allows for supporting repeating groups where a node may have repeating elements which will be mapped to an array datatype on the target table. Please ensure repeating groups are defined as arrays in the target table definition.

  "column.jsonpath.node.person"="/person"

Node definition “column.jsonpath.node.person" indicates node named person defined by "/person" which represents the repeating group of person and related attributes

All elements, attributes to be queried mapped within this node should be called column.jsonpath.person.COLUMNNAME and this column name should be defined in the external table definition.

"column.jsonpath.person.name"="/name",
"column.jsonpath.person.age"="/age"

Element definition " column.jsonpath.person.name " indicates an element to be found under person node as the JSON pointer definition “/name "

INPUTFORMAT io.invariant.hive.serde2.json.format.InvHiveBasedJsonInputFormat'

=> default value for defining external tables that query JSON string value

OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' 

=> default value for defining external tables that query JSON string value

LOCATION  "/apps/hive/warehouse/scratch.db/source_document"

=> location of the data files of the source table

TBLPROPERTIES (
 "json.path.property"="json-pointer"
 );

The table properties are included to indicate the type of JSON path definition used for indicating parsed column values. This version includes support for json-pointer and follows the RFC 6901, the following table contains examples of JSON pointer definitions and the parsed result.

Sample JSON

{
      "foo": ["bar", "baz"],
      "": 0,
      "a/b": 1,
      "c%d": 2,
      "e^f": 3,
      "g|h": 4,
      "i\\j": 5,
      "k\"l": 6,
      " ": 7,
      "m~n": 8
 }
   
  The following JSON strings evaluate to the accompanying values:
    "/foo"       ["bar", "baz"]
    "/foo/0"     "bar"
    "/"          0
    "/a~1b"      1
    "/c%d"       2
    "/e^f"       3
    "/g|h"       4
    "/i\\j"      5
    "/k\"l"      6
    "/ "         7
    "/m~0n"      8

However our recommendation is to shred them using nodes and simple elements and use the Hive SQL for iterative querying as that would result in one shredding routine that can support many query use cases.

Normal operations can be performed on this external table which outputs a combination of base table columns and elements from the JSON document. The below output is an example of a query using the record index and index range to display a combination that includes repeating groups.

hive>  select target_id, record_index( name, n ) as nameval,
       record_index( age, n ) as ageval from target_document lateral view  
       index_range( size( age )) n1 as n;

11      John    30
11      Julie   18
11      Pat     8
12      jill    21
12      snow    11
12      prat    8
13      ashok   23
13      ram     29
13      praty   8
hive>  select target_id, record_index( name, n ) as nameval,
       record_index( age, n ) as ageval from target_document lateral view  
       index_range( size( age )) n1 as n where target_id = 11;

11      John    30
11      Julie   18
11      Pat     8

Select JSON Nodes as String Output

The SerDe can be used to select nodes from an JSON document as string values to support JSON Shredding efforts. It follows the same convention in terms of establishing nodes of interest and subsequently defining elements within a node group to be exposed as columns. To expose the node as a json string, the column needs to be defined with a mapping of "="/_nodeToString". The SerDe supports both single and array of nodes in the result if the document node element in unbounded. Arrays can be further exposed as records by using provided user defined functions.

Example

create external table jsontestWithSrcJson (
   target_id    int,
   name         array<String>,
   age          array<String>
 )
 ROW FORMAT SERDE 'io.invariant.hive.serde2.json.JSONSerDe'
 with SERDEPROPERTIES (
   "hive.table.type"="orc",
   "hive.table.schema"="scratch",
   "hive.table.name"="source_document",
   "table.derived.column.source_id"="target_id",
   "table.derived.jsondoc.column"="docum_msg",
   "column.jsonpath.node.person"="/person",
   "column.jsonpath.person.name"="/name",
   "column.jsonpath.person.age"="/age",
   "column.jsonpath.person.nodestring"="_nodeToString"
 )
 STORED AS
 INPUTFORMAT 'io.invariant.hive.serde2.json.format.InvHiveBasedJSONInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
 LOCATION "/apps/hive/warehouse/scratch.db/source_document"
 TBLPROPERTIES (
 "json.path.property"="json-pointer"
 );

Last updated