XML

The XML SerDe allows users to query XML data stored in Hive tables through the use of XPath definitions. It can be used to define tables with a combination of regular columns and values parsed from XML string. Additionally, repeating groups within the XML document can be queried and extracted values returned as arrays. These arrays can be further flattened by using functions provided by other UDFs which support nullable values, positional values within the array etc.

The SerDe creates a single XML document sourced from a column in the source table. In essence the definition expects the root elements to be defined and any repeating groups within the document to be handled by the node and node leaf definition.

The table definition using the SerDe consists of the following sections

  • column definitions - which will be the table columns available for queries

  • source table definition - details of the source table from where the regular columns -XML string is queried

  • column mapping definition - the mapping of the exposed columns to the source table - the XPath mappings of values in the XML string to the table columns

  • SerDe classes - the input and output classes used by the table definition

A sample table definition is shown below.

CREATE external TABLE customer (
          uid                  string,
          firstname            string,
          lastname             string,
          ssn                  string,
          dob                  string
)
ROW FORMAT SERDE 'io.invariant.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "hive.table.type"="orc",
  "hive.table.schema"="scratch",
  "hive.table.name"="customer_ty2016",
  "table.derived.column.dln"="uid",
  "table.derived.xml.column"="cust_detl_xml",
  "column.xpath.node.cheader"="/CUST-DETAIL/Header",
  "column.xpath.cheader.firstname"="Account/Primary/FirstName/text()",
  "column.xpath.cheaderlastname"="Account/Primary/LastName/text()",
  "column.xpath.cheader.ssn"="Account/Primary/SSN/text()",
  "column.xpath.cheader.dob"="Account/Primary/DateOfBirth/text()"
 )
STORED AS 
INPUTFORMAT 'io.invariant.hive.serde2.xml.format.InvHiveBasedXmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION  "/etl/data/app.db/customer_ty2016"
TBLPROPERTIES (
  "xmlinput.start"="<CUST-DETAIL",
  "xmlinput.end"="</CUST_DETAIL>"
);

Table definition

CREATE external TABLE customer (
          uid                  string,
          firstname            string,
          lastname             string,
          ssn                  string,
          dob                  string
)

An external table is created with a set of columns

ROW FORMAT SERDE 'io.invariant.hive.serde2.xml.XmlSerDe'

The SerDe used to define the row format. This will be the SerDe used when defining tables querying XML String values

SerDe properties

Key

Sample Value

Description

hive.table.type

ORC

Type of table. Currently ORC & AVRO types are supported

hive.table.schema

app

schema of the source table

hive.table.name

customer_ty2016

source table name

table.derived.xml.column

cust_detl_xml

Column in the source table that contains the XML string to be queried

table.derived.column.custid

uid

Non XML column mapped to the target.

Indicates the uid column from source table to be mapped to custid in the target table. Please ensure the target column definition is consistent with the set of columns defined in the external table in the top section

XML mapping

XML mapping follows a definition of a node and leaves of a node made up of the elements/attributes under the leaf. This allows for supporting repeating groups where a node may have repeating elements which will be mapped to an array datatype on the target table. Please ensure repeating groups are defined as arrays in the target table definition.

 "column.xpath.node.cheader"="/CUST-DETAIL/Header",

Node definition "column.xpath.node.cheader" indicates node named cheader defined by "/CUST-DETAIL/Header"

All elements, attributes to be queried mapped within this node should be called column.xpath.cheader.COLUMNNAME and this column name should be defined in the external table definition.

"column.xpath.cheader.firstname"="Filer/Primary/FirstName/text()",
"column.xpath.cheader.lastname"="Filer/Primary/LastName/text()",
"column.xpath.cheader.ssn"="Filer/Primary/SSN/text()",              
"column.xpath.cheader.dob"="Filer/Primary/DateOfBirth/text()"

Element/attribute definition "column.xpath.cheader.prifirstname" indicates an element/attribute to be found under cheader node as the XPath definition "Filer/Primary/FirstName/text()"

INPUTFORMAT 'io.invariant.hive.serde2.xml.format.InvHiveBasedXmlInputFormat'

=> default value for defining external tables that query XML string value

OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' 

=> default value for defining external tables that query XML string value

LOCATION  "/apps/hive/warehouse/scratch.db/customer_ty2016"  

=> location of the data files of the source table

TBLPROPERTIES (                                                              
     "xmlinput.start"="<CUST-DETAIL",
     "xmlinput.end"="</CUST-DETAIL>"
   );               

=> XML start and end string definition. Please note the start of the XML string should not be the root element "<root element name" without the > tag to handle attributes etc. The end will be the end tag of the root element.

Select XML Nodes as String Output

The SerDe can be used to select nodes from an XML document as string values to support XML Shredding efforts. It follows the same convention in terms of establishing nodes of interest and subsequently defining elements/attributes within a node group to be exposed as columns. To expose the node as an xml string, the column needs to be defined with a mapping of "="_nodeToString". The SerDe supports both single and array of nodes in the result if the document node element in unbounded. Arrays can be further exposed as records by using provided user defined functions.

Example

CREATE external TABLE doc_node (
           cust_id              string,
           case_no              int,
           case_vals            array<string>
 )
ROW FORMAT SERDE 'io.invariant.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "hive.table.type"="orc",
  "hive.table.schema"="default",
  "hive.table.name"="helloworldorc",
  "hive.table.type"="orc",
  "table.derived.column.custid"="cust_id",
  "table.derived.column.caseid"="case_no",
  "table.derived.xml.column"="xmlstring",
  "column.xpath.node.caformw2"="/Submission/Attachment[ScheduleHeader/FormName='CAFormW2']",
  "column.xpath.caformw2.case_vals"="_nodeToString"
)
STORED AS 
INPUTFORMAT 'io.invariant.hive.serde2.xml.format.InvHiveBasedXmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION  "/etl/data/app.db/doc_node"
TBLPROPERTIES (
  "xmlinput.start"="<Submission",
  "xmlinput.end"="</Submission>"
);

Last updated