JSON
The JSON SerDe supports querying of JSON data stored in Hive tables through the use of JSON Pointer definitions. The Serde features are used to define tables which can include a combination of regular columns, columns parsed out of the JSON document, fragments from the source document.
The SerDe supports querying single dimension as well as repeating groups sourced from the JSON document. In case of repeating groups, the SerDe returns the value as arrays. These arrays can be further flattened through the use of functions provided by other UDFs (using a combination of record_index and index_range from the UDF library) which support nullable values, positional values within the array etc.
The table definition using the SerDe consists of the following sections
column definitions - which will be the table columns available for queries
source table definition - details of the source table from where the regular columns - JSON document is queried
column mapping definition - the mapping of the exposed columns to the source table - the JSON Pointer mappings of values in the JSON document to the table columns
SerDe classes - the input and output classes used by the table definition
The base table used for this example
JSON document used for the example.
An external table definition, which references the source table containing JSON data is shown below.
Table definition
The SerDe used to define the row format. This will be the SerDe used when defining tables querying JSON String values
SerDe Properties
Key
Sample Value
Description
hive.table.type
ORC
Type of table. Currently ORC & AVRO types are supported
hive.table.schema
scratch
schema of the source table
hive.table.name
source_document
source table name
table.derived.jsondoc.column
docum_msg
Column in the source table that contains the JSON string to be queried
table.derived.column.source_id
target_id
An existing column from source table mapped to target
Indicates the source_id column from source table to be mapped to target_id in the target table. Please ensure the target column definition is consistent with the set of columns defined in the external table in the top section
JSON mapping
JSON mapping follows a definition of a node and leaves of a node made up of the elements under the leaf. This allows for supporting repeating groups where a node may have repeating elements which will be mapped to an array datatype on the target table. Please ensure repeating groups are defined as arrays in the target table definition.
Node definition “column.jsonpath.node.person" indicates node named person defined by "/person" which represents the repeating group of person and related attributes
All elements, attributes to be queried mapped within this node should be called column.jsonpath.person.COLUMNNAME and this column name should be defined in the external table definition.
Element definition " column.jsonpath.person.name " indicates an element to be found under person node as the JSON pointer definition “/name "
=> default value for defining external tables that query JSON string value
=> default value for defining external tables that query JSON string value
=> location of the data files of the source table
The table properties are included to indicate the type of JSON path definition used for indicating parsed column values. This version includes support for json-pointer and follows the RFC 6901, the following table contains examples of JSON pointer definitions and the parsed result.
Sample JSON
However our recommendation is to shred them using nodes and simple elements and use the Hive SQL for iterative querying as that would result in one shredding routine that can support many query use cases.
Normal operations can be performed on this external table which outputs a combination of base table columns and elements from the JSON document. The below output is an example of a query using the record index and index range to display a combination that includes repeating groups.
Select JSON Nodes as String Output
The SerDe can be used to select nodes from an JSON document as string values to support JSON Shredding efforts. It follows the same convention in terms of establishing nodes of interest and subsequently defining elements within a node group to be exposed as columns. To expose the node as a json string, the column needs to be defined with a mapping of "="/_nodeToString". The SerDe supports both single and array of nodes in the result if the document node element in unbounded. Arrays can be further exposed as records by using provided user defined functions.
Example
Last updated