Inventory Manager

Schema Management, Data Pipeline and Workflow

Introduction

The Invariant Inventory Manager module manages metadata and configurations across source and target systems, and tracks data inventory between the source and target data stores. It also provides a set of utilities to move data between, and reconcile, the data stores used by the Invariant HDFS pipeline.

Metadata Changes and Versioning

The inventory manager is initialized as part of the installation with a starting schema version. From that point on, each change to a data source schema, addition of a data source, or addition of a data source table is tracked as a unique version within the inventory manager.

Version updates

The inventory manager is updated with changes from source systems, and each change is tracked as a new version. Tracked changes include:

  • Setting up a new data source by additions to config file

  • Adding tables to an existing data source by additions to config file

  • Changes to table metadata due to additions/changes to data types in the source system

  • A combination of any of these changes

After the necessary changes are applied to the configuration files, they are recorded in the inventory manager using the following shell commands:

[usr@inv]# systemVersion.sh updateVersion <versionNo>

Example:
[usr@inv]# systemVersion.sh updateVersion 1.23

The "versionNo" is the version number of the metadata to be recorded by the system. It can follow the project's established numbering scheme and is an alphanumeric field of up to 20 characters.
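As a light safeguard, a wrapper can validate the version number against the documented constraint before recording it. The helper below is a hypothetical sketch, not part of the inventory manager tooling; it assumes dots are permitted alongside letters and digits (as in the 1.23 example):

```shell
# Hypothetical pre-check before calling systemVersion.sh updateVersion.
# Assumes the documented constraint: alphanumeric, at most 20 characters,
# with dots allowed for the x.y numbering scheme shown above.
valid_version() {
  case "$1" in
    *[!A-Za-z0-9.]*) return 1 ;;   # reject anything outside [A-Za-z0-9.]
  esac
  [ -n "$1" ] && [ "${#1}" -le 20 ]
}

valid_version "1.23" && echo "1.23 ok"                       # accepted
valid_version "this-version-is-way-too-long-1.0" || echo "rejected"
```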

To obtain a list of the versions tracked by the inventory manager, use the list utility. It returns the versions along with the date each was saved.

[usr@inv]# systemVersion.sh listAllVersions
     1.23   2018-08-01
     1.24   2018-09-01
     1.25   2018-10-01

Generate Pipeline XML and Discovery Views DDL

The metadata is used to generate the XML configurations that support the discovery HDFS pipeline, as well as the table DDL for the incremental, reconciliation, and current views. The artifacts can be generated on demand using the console utilities, and the output is written to the configured directory. Artifacts are generated for the current version of the metadata only.

The genArtifacts script is used to generate the XML configuration supporting the discovery HDFS pipeline. It generates the target table definition and the message structures corresponding to insert, update, and delete events. The target table definition file is named <datasourcename>_<table-name>.xml; the database event definition files are named <datasourcename>_<table-name>_d.xml, …_u.xml, and …_i.xml, corresponding to delete, update, and insert respectively.

[usr@inv]# genArtifacts.sh generatePPLXML
12/21/2018  12:40 AM             2,226 datasource1_TABLTWO.xml
12/21/2018  12:40 AM             6,134 datasource1_TABLTWO_d.xml
12/21/2018  12:40 AM             6,134 datasource1_TABLTWO_i.xml
12/21/2018  12:40 AM             6,134 datasource1_TABLTWO_u.xml
12/21/2018  12:40 AM             2,226 datasource2_TABLTWO.xml
12/21/2018  12:40 AM             6,134 datasource2_TABLTWO_d.xml
12/21/2018  12:40 AM             6,134 datasource2_TABLTWO_i.xml
12/21/2018  12:40 AM             6,134 datasource2_TABLTWO_u.xml
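Since each configured table is expected to produce four files, a small post-generation check can confirm the full set is present. This is a hypothetical sketch; the OUT_DIR variable and the verification step are assumptions for illustration, not part of the genArtifacts tooling:

```shell
# Hypothetical check that all four expected XML artifacts exist for a given
# <datasource>_<table> pair after running genArtifacts.sh generatePPLXML.
# OUT_DIR is an assumed location for the configured output directory.
OUT_DIR="${OUT_DIR:-./artifacts}"

check_artifacts() {
  src="$1"; tbl="$2"; missing=0
  for suffix in "" "_d" "_i" "_u"; do
    f="$OUT_DIR/${src}_${tbl}${suffix}.xml"
    [ -f "$f" ] || { echo "missing: $f"; missing=1; }
  done
  return $missing
}

check_artifacts datasource1 TABLTWO || echo "artifacts not yet generated"
```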

The genArtifacts script is also used to generate the DDL needed to set up the discovery HDFS views. Running genArtifacts with the generateDDL option produces the CREATE TABLE DDL for the incremental, reconciliation, and current views corresponding to the current version in the inventory manager.

[usr@inv]# genArtifacts.sh generateDDL
12/21/2018  01:01 AM    <DIR>          ..
12/21/2018  01:01 AM               589 datasource1_TABLTWO_ddl_incr.sql
12/21/2018  01:01 AM               589 datasource1_TABLTWO_ddl_recon.sql
12/21/2018  01:01 AM               589 datasource1_TABLTWO_ddl_curr.sql

The table create DDL files are named <datasourcename>_<table-name>_ddl_incr.sql, <datasourcename>_<table-name>_ddl_recon.sql, and <datasourcename>_<table-name>_ddl_curr.sql, corresponding to the incremental, reconciliation, and current views respectively.

Once the initial schema version is established in a project, subsequent changes will mostly be alterations to existing tables and the occasional addition of new tables. In future releases, the inventory manager will support these incremental changes by generating ALTER DDL that can be applied incrementally to the target database. The genArtifacts script will be extended to generate the incremental DDL by passing a different set of parameters.

[usr@inv]# genArtifacts.sh generateAlterDDL 1.2 1.3
12/21/2018  01:01 AM    <DIR>          ..
12/21/2018  01:01 AM               589 datasource1_TABLTWO_altr_ddl_incr.sql
12/21/2018  01:01 AM               589 datasource1_TABLTWO_altr_ddl_recon.sql
12/21/2018  01:01 AM               589 datasource1_TABLTWO_altr_ddl_curr.sql

The DDL generated by the inventory manager is meant to serve as a starting point. Additions such as bucketing, additional indexes, and partitions may be made to suit project objectives. In that case, the generated artifacts can be altered as necessary and the result saved in source control to support code migration from staging to production environments.
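For illustration, a hand-tuned current-view DDL might look like the following. This is a hypothetical sketch in Hive-style SQL; the column names and the PARTITIONED BY / CLUSTERED BY additions are assumptions of the kind described above, not actual generated output:

```shell
# Write a hypothetical hand-tuned version of a generated DDL file. The
# table layout is illustrative only; the partitioning and bucketing clauses
# are the sort of project-specific additions discussed above.
cat > datasource1_TABLTWO_ddl_curr.sql <<'SQL'
CREATE TABLE IF NOT EXISTS curr.datasource1_TABLTWO (
  id        BIGINT,
  payload   STRING,
  upd_ts    TIMESTAMP             -- merge timestamp used by downstream jobs
)
PARTITIONED BY (load_date STRING)  -- project-specific addition
CLUSTERED BY (id) INTO 16 BUCKETS  -- project-specific addition
STORED AS ORC;
SQL
echo "wrote datasource1_TABLTWO_ddl_curr.sql"
```

The edited file, rather than the raw generated artifact, is what would be committed to source control for promotion from staging to production.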

Reconciliation, Merging Current View

The inventory manager sources data into the reconciliation view for all tables from the configured data sources. This should be scheduled nightly, ideally right after midnight. The reconciliation data is then used to validate the incremental data, which is sourced in near-real time. Any errors in the pipeline are addressed as part of this validation. By default, reconciliation data is retained for a 30-day period, but this retention is configurable; inventory counts and reconciliation fixes are expected to be recorded within that period. The reconciliation data is sourced into the recon schema as configured in the target data store configuration.
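One way to schedule the nightly sourcing is a cron entry shortly after midnight. This is a hypothetical example only: the install path is an assumption, and the documented sourceData form is invoked with the previous day's date. Note that '%' must be escaped in crontab command fields.

```shell
# Hypothetical crontab entry: source the previous day's reconciliation data
# for TableA at 00:15 each night. Paths and log locations are assumptions.
# m  h  dom mon dow  command
15 0 * * * /opt/invariant/bin/reconDataSourcing.sh sourceData TableA $(date -d yesterday +\%Y-\%m-\%d) >> /var/log/invariant/recon.log 2>&1
```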

The data from the incremental view is merged into the current view periodically for all configured tables, by data source. The merge uses the sourced date-time and database action to keep the current view reflecting the data state of the tables in the source system. In addition, a date column is updated to record the time of the update in the current view, so downstream data jobs can use it to drive processing logic, e.g. selecting data changed within a particular time window.
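The merge described above can be pictured as a Hive-style MERGE statement keyed on the database action. This is a hedged sketch only; the table and column names (id, payload, db_action, upd_ts) are illustrative assumptions, not the actual merge logic:

```shell
# Hypothetical illustration of the current-view merge as Hive-style SQL,
# written to a file via a heredoc. Column names are assumptions: db_action
# carries the source event type (I/U/D), upd_ts is the merge timestamp
# that downstream jobs can filter on.
cat > curr_view_merge_example.sql <<'SQL'
MERGE INTO curr.datasource1_TABLTWO AS c
USING incr.datasource1_TABLTWO AS i
  ON c.id = i.id
WHEN MATCHED AND i.db_action = 'D' THEN DELETE
WHEN MATCHED AND i.db_action = 'U' THEN
  UPDATE SET payload = i.payload, upd_ts = current_timestamp()
WHEN NOT MATCHED AND i.db_action <> 'D' THEN
  INSERT VALUES (i.id, i.payload, current_timestamp());
SQL
echo "wrote curr_view_merge_example.sql"
```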

Reconciliation Sourcing

A set of utilities is provided to obtain the status of reconciliation data sourcing and the current view merge. In addition, administrative configurations are available to email an operations group about any errors in sourcing the recon data or merging the current view, so the operations and environments teams can troubleshoot further.

[usr@inv]# reconDataSourcing.sh dataSourcingStatus allTables
[usr@inv]# reconDataSourcing.sh dataSourcingStatus allTables 2018-10-01  2018-10-10
[usr@inv]# reconDataSourcing.sh dataSourcingStatusByTable TableA
[usr@inv]# reconDataSourcing.sh dataSourcingStatusByTable TableA 2018-10-01  2018-10-10

The reconDataSourcing command can be used to get the status of all configured tables or of a particular table. By default, the status is returned for the last day. Both options also support a date-based query using a start date and an end date; a maximum window of 10 days is supported.
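Since the date-based options cap the window at 10 days, a wrapper can validate the range before invoking the command. The helper below is a hypothetical sketch, not part of the tooling, and assumes GNU date for the day arithmetic:

```shell
# Hypothetical pre-check for the 10-day maximum window on date-based
# queries. Assumes GNU date for parsing YYYY-MM-DD into epoch seconds.
days_between() {
  start_s=$(date -d "$1" +%s)
  end_s=$(date -d "$2" +%s)
  echo $(( (end_s - start_s) / 86400 + 1 ))
}

check_window() {
  n=$(days_between "$1" "$2")
  if [ "$n" -lt 1 ] || [ "$n" -gt 10 ]; then
    echo "window of $n days outside the supported 1-10 day range" >&2
    return 1
  fi
}

check_window 2018-10-01 2018-10-10 && echo "window ok"
```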

For operational support purposes, the reconDataSourcing command can be used to source data on demand into the reconciliation schema, for validation or other data fixes. This is supported for a single day of data or for a range of up to 10 days. The team maintaining the inventory should perform due diligence to ensure there is no adverse impact on the source systems from excessive data sourcing; make sure the volume of records being sourced, and its impact on the source system, is well understood before starting on-demand sourcing.

[usr@inv]# reconDataSourcing.sh sourceData  TableA    2018-10-01  
[usr@inv]# reconDataSourcing.sh sourceData  TableA    2018-10-01  2018-10-10

Current View Merge

The currViewMerge command can be used to get the status of the current view data merging that happens daily. Using the currViewMerge console command, the status of merges into the current view can be retrieved for all tables or for a particular table, for the last day or for a period of up to 10 days.

[usr@inv]# currViewMerge.sh statusOfAllTables
[usr@inv]# currViewMerge.sh statusByTable TableA
[usr@inv]# currViewMerge.sh statusOfAllTables     2018-10-01 2018-10-10
[usr@inv]# currViewMerge.sh statusByTable TableA  2018-10-01 2018-10-10

For operational support purposes, the currViewMerge command can be used to merge data into the current view on demand to support data fixes. This is supported for a single day or for a range of up to 10 days. The team needs to perform due diligence regarding the impact of merging many days of data at once, as it is likely to load the job queues and affect other data operations. It is highly recommended to merge in small increments to keep the impact on the data infrastructure to a minimum.
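The small-increment recommendation can be sketched as a day-by-day loop. Because the on-demand merge subcommand is not shown above, the per-day command is echoed rather than executed, and its name is a hypothetical placeholder (assumes GNU date for the day arithmetic):

```shell
# Hypothetical sketch: work through a multi-day backlog one day at a time
# instead of issuing one large merge. "mergeData" is a placeholder name for
# the on-demand merge subcommand; the call is echoed, not executed.
merge_day_by_day() {
  day="$1"; end="$2"
  while [ "$(date -d "$day" +%s)" -le "$(date -d "$end" +%s)" ]; do
    echo "currViewMerge.sh mergeData TableA $day"   # one small merge per day
    day=$(date -d "$day + 1 day" +%Y-%m-%d)
  done
}

merge_day_by_day 2018-10-01 2018-10-03
```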
