# XML

The XML SerDe allows users to query XML data stored in Hive tables through the use of XPath definitions. It can be used to define tables with a combination of regular columns and values parsed from XML string. Additionally, repeating groups within the XML document can be queried and extracted values returned as arrays. These arrays can be further flattened by using functions provided by other UDFs which support nullable values, positional values within the array etc. &#x20;

The SerDe creates a single XML document sourced from a column in the source table. In essence the definition expects the root elements to be defined and any repeating groups within the document to be handled by the node and node leaf definition. &#x20;

The table definition using the SerDe consists of the following sections

* column definitions  - which will be the table columns available for queries
* source table definition -  details of the source table from where the regular columns -XML string is queried
* column mapping definition - the mapping of the exposed columns to the source table - the XPath mappings of values in the XML string to the table columns
* SerDe classes  -  the input and output classes used by the table definition

A sample table definition is shown below.

```
CREATE external TABLE customer (
          uid                  string,
          firstname            string,
          lastname             string,
          ssn                  string,
          dob                  string
)
ROW FORMAT SERDE 'io.invariant.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "hive.table.type"="orc",
  "hive.table.schema"="scratch",
  "hive.table.name"="customer_ty2016",
  "table.derived.column.dln"="uid",
  "table.derived.xml.column"="cust_detl_xml",
  "column.xpath.node.cheader"="/CUST-DETAIL/Header",
  "column.xpath.cheader.firstname"="Account/Primary/FirstName/text()",
  "column.xpath.cheaderlastname"="Account/Primary/LastName/text()",
  "column.xpath.cheader.ssn"="Account/Primary/SSN/text()",
  "column.xpath.cheader.dob"="Account/Primary/DateOfBirth/text()"
 )
STORED AS 
INPUTFORMAT 'io.invariant.hive.serde2.xml.format.InvHiveBasedXmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION  "/etl/data/app.db/customer_ty2016"
TBLPROPERTIES (
  "xmlinput.start"="<CUST-DETAIL",
  "xmlinput.end"="</CUST_DETAIL>"
);
```

### &#x20;Table definition

```
CREATE external TABLE customer (
          uid                  string,
          firstname            string,
          lastname             string,
          ssn                  string,
          dob                  string
)
```

An external table is created with a set of columns &#x20;

```
ROW FORMAT SERDE 'io.invariant.hive.serde2.xml.XmlSerDe'
```

The SerDe used to define the row format. This will be the SerDe used when defining tables querying XML String values

### &#x20;SerDe properties

| Key                         | Sample Value     | **Description**                                                                                                                                                                                                                                                            |
| --------------------------- | ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| hive.table.type             | ORC              | Type of table. Currently ORC & AVRO types are supported                                                                                                                                                                                                                    |
| hive.table.schema           | app              | schema of the source table                                                                                                                                                                                                                                                 |
| hive.table.name             | customer\_ty2016 | source table name                                                                                                                                                                                                                                                          |
| table.derived.xml.column    | cust\_detl\_xml  | Column in the source table that contains  the XML string to be queried                                                                                                                                                                                                     |
| table.derived.column.custid | uid              | <p>Non XML column mapped to the target.</p><p>Indicates the uid column from source table to be mapped to custid in the target table. Please ensure the target column definition is consistent with the set of columns defined in the external table in the top section</p> |

### &#x20;XML mapping

XML mapping follows a definition of a node and leaves of a node made up of the elements/attributes under the leaf. This allows for supporting repeating groups where a node may have repeating elements which will be mapped to an array datatype on the target table. Please ensure repeating groups are defined as arrays in the target table definition.

```
 "column.xpath.node.cheader"="/CUST-DETAIL/Header",
```

&#x20;Node definition  "column.xpath.node.cheader" indicates node named cheader defined by **"/CUST-DETAIL/Header"**

All elements, attributes to be queried mapped within this node should be called   **column.xpath.cheader.COLUMNNAME** and this column name should be defined in the external table definition.

```
"column.xpath.cheader.firstname"="Filer/Primary/FirstName/text()",
"column.xpath.cheader.lastname"="Filer/Primary/LastName/text()",
"column.xpath.cheader.ssn"="Filer/Primary/SSN/text()",              
"column.xpath.cheader.dob"="Filer/Primary/DateOfBirth/text()"
```

Element/attribute definition "column.xpath.cheader.prifirstname" indicates an element/attribute to be found under cheader node as the XPath definition "Filer/Primary/FirstName/text()"

```
INPUTFORMAT 'io.invariant.hive.serde2.xml.format.InvHiveBasedXmlInputFormat'
```

&#x20;            \=> default value for defining external tables that query XML string value

```
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' 
```

&#x20;            \=> default value for defining external tables that query XML string value&#x20;

```
LOCATION  "/apps/hive/warehouse/scratch.db/customer_ty2016"  
```

&#x20;            \=> location of the data files of the source table

```
TBLPROPERTIES (                                                              
     "xmlinput.start"="<CUST-DETAIL",
     "xmlinput.end"="</CUST-DETAIL>"
   );               
```

&#x20;            \=> XML start and end string definition. Please note the start of the XML string should not be the root element "\<root element name" without the >   tag to handle attributes etc. The end will be the end tag of the root element.

#### Select XML Nodes as String Output

The SerDe can be used to select nodes from an XML document as string values to support XML Shredding efforts. It follows the same convention in terms of establishing nodes of interest and subsequently defining elements/attributes within a node group to be exposed as columns. To expose the node as an xml string, the column needs to be defined with a mapping of "="\_nodeToString".  The SerDe supports both single and array of nodes in the result if the document node element in unbounded. Arrays can be further exposed as records by using provided user defined functions.

#### Example

```
CREATE external TABLE doc_node (
           cust_id              string,
           case_no              int,
           case_vals            array<string>
 )
ROW FORMAT SERDE 'io.invariant.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "hive.table.type"="orc",
  "hive.table.schema"="default",
  "hive.table.name"="helloworldorc",
  "hive.table.type"="orc",
  "table.derived.column.custid"="cust_id",
  "table.derived.column.caseid"="case_no",
  "table.derived.xml.column"="xmlstring",
  "column.xpath.node.caformw2"="/Submission/Attachment[ScheduleHeader/FormName='CAFormW2']",
  "column.xpath.caformw2.case_vals"="_nodeToString"
)
STORED AS 
INPUTFORMAT 'io.invariant.hive.serde2.xml.format.InvHiveBasedXmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION  "/etl/data/app.db/doc_node"
TBLPROPERTIES (
  "xmlinput.start"="<Submission",
  "xmlinput.end"="</Submission>"
);
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.invariant.io/platform/apache-hadoop/hive/serde/xml.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
