
Configuration

Pipeline System Configurations

The data pipeline also includes a set of definition files that provide additional tuning options and hold the environment-specific details:

  • avrotypemappingdef.yml

  • orctypemappingdef.yml

  • brkadapter.properties

  • inbound.topics.brkadapter.properties

  • invariant-hdfs-adapter.yml

Mapping Configuration

The data pipeline is configured to translate data types between the source and target systems based on the target HDFS file format. The supported sources are DB2 and Oracle, and the supported target HDFS formats are Avro and ORC.

The avrotypemappingdef.yml file defines how source data types map to Avro data types.

The orctypemappingdef.yml file defines how source data types map to ORC data types.

In the definition below, the CHAR, VARCHAR, TIMESTAMP, DATE, and XML data types from a DB2 source are mapped to the Avro STRING type.

avrotypemappingdef.yml

dataserdetype: AVRO
mapping:
- STRING:                                    => Target (Avro) data type
  - !dbtypeMapping
    dbtype: DB2                              => Source DBMS
    type: [CHAR,VARCHAR,TIMESTAMP,DATE,XML]  => Source DBMS data types
- DECIMAL:
  - !dbtypeMapping
    dbtype: DB2
    type: [DECIMAL]
- INT:
  - !dbtypeMapping
    dbtype: DB2
    type: [SMALLINT,INTEGER]
- BIGINT:
  - !dbtypeMapping
    dbtype: DB2
    type: [BIGINT,LONG]

In the definition below, the CHAR, VARCHAR, and XML data types from a DB2 source are mapped to the ORC STRING type.

orctypemappingdef.yml

dataserdetype: ORC
mapping:
- STRING:
  - !dbtypeMapping
    dbtype: DB2
    type: [CHAR,VARCHAR,XML]
- DATE:
  - !dbtypeMapping
    dbtype: DB2
    type: [DATE]
- TIMESTAMP:
  - !dbtypeMapping
    dbtype: DB2
    type: [TIMESTAMP]
- DECIMAL:
  - !dbtypeMapping
    dbtype: DB2
    type: [DECIMAL]
- INT:
  - !dbtypeMapping
    dbtype: DB2
    type: [SMALLINT,INTEGER]
- BIGINT:
  - !dbtypeMapping
    dbtype: DB2
    type: [BIGINT,LONG]
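
Both definition files share the same structure, so a target type can be resolved by scanning the mapping entries for a matching source DBMS and source data type. The following is a minimal sketch, assuming PyYAML; the resolve_target_type helper and the handling of the !dbtypeMapping tag are illustrative only and are not part of the pipeline itself.

import yaml

# Assumption for this sketch: treat the custom !dbtypeMapping tag as a plain
# mapping so the definition files can be loaded outside the pipeline.
yaml.SafeLoader.add_constructor(
    '!dbtypeMapping',
    lambda loader, node: loader.construct_mapping(node, deep=True),
)

def resolve_target_type(mapping_file, source_dbtype, source_type):
    """Return the target (Avro/ORC) type for a source DBMS type, or None."""
    with open(mapping_file) as fh:
        definition = yaml.safe_load(fh)
    for entry in definition['mapping']:
        # Each entry is a single-key mapping: {TARGET_TYPE: [dbtypeMapping, ...]}
        (target_type, rules), = entry.items()
        for rule in rules:
            if rule['dbtype'] == source_dbtype and source_type in rule['type']:
                return target_type
    return None

print(resolve_target_type('avrotypemappingdef.yml', 'DB2', 'VARCHAR'))    # STRING
print(resolve_target_type('orctypemappingdef.yml', 'DB2', 'TIMESTAMP'))   # TIMESTAMP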

Broker Configuration

brkadapter.properties defines the broker properties used to read the database events streamed from the DBMS. This includes the broker list, topic, and consumer group information used to read from the brokers.

metadata.outbound.broker.topic=testtopic            ==> Topic for publishing outbound messages; does not apply to the database event processing pipeline
metadata.broker.list=brk1.invariant.io:9092,brk2.invariant.io:9092,brk3.invariant.io:9092 ==> List of broker:port entries used for reading the message set
metadata.message.processer.group=HdfsGrp28Sep2018   ==> Unique consumer group name used for this pipeline
metadata.auto.offset.reset=earliest                 ==> Offset to start reading from - earliest reads from the beginning of the queue when no offset has been committed for the consumer group
metadata.max.partition.fetch.bytes=10240000         ==> Maximum size in bytes for fetching message sets from the broker

Broker Topics

The inbound.topics.brkadapter.properties file lists the topics from which data will be consumed. The list of topics should be comma-separated.

metadata.inbound.broker.topic=table1,table2      ==> Topics containing the database events consumed by the pipeline
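
Taken together, these two properties files describe a standard Kafka consumer: broker list, consumer group, offset reset policy, fetch size, and subscribed topics. As an illustration only, the equivalent consumer settings with the kafka-python client would look like the sketch below; load_properties is a hypothetical helper, and the adapter's own consumer implementation is not shown here.

from kafka import KafkaConsumer

def load_properties(path):
    """Read a Java-style .properties file into a dict (hypothetical helper)."""
    props = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith('#'):
                key, _, value = line.partition('=')
                props[key.strip()] = value.strip()
    return props

broker = load_properties('brkadapter.properties')
topics = load_properties('inbound.topics.brkadapter.properties')

# Subscribe to the inbound topics with the group, offset, and fetch settings above.
consumer = KafkaConsumer(
    *topics['metadata.inbound.broker.topic'].split(','),            # table1, table2
    bootstrap_servers=broker['metadata.broker.list'].split(','),
    group_id=broker['metadata.message.processer.group'],
    auto_offset_reset=broker['metadata.auto.offset.reset'],
    max_partition_fetch_bytes=int(broker['metadata.max.partition.fetch.bytes']),
)
for record in consumer:
    print(record.topic, record.offset)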

HDFS Adapter Configuration

invariant-hdfs-adapter.yml configures the Hadoop environment variables, the target schema, and the credentials used to interact with HDFS.

hadoop.home: /opt/inv/current/hadoop-client/             ==> Hadoop home used to source the libraries required by the processor
hive.home: /opt/inv/current/hive-server2/                ==> Hive home used to source the libraries required by the processor
connect.hdfs.principal: invapp                           ==> Hadoop user the processor uses to write to HDFS and perform supporting operations
hadoop.conf.dir: /opt/inv/current/hadoop-client/conf/    ==> Hadoop configuration directory used to source the Hadoop configurations required by the processor
hdfs.url: hdfs://invnmnd.invariant.io:8020               ==> Hadoop NameNode URI
hive.metastore.uris: thrift://invmeta.invariant.io:9083  ==> Hive metastore URI
hive.integration: true                                   ==> Update the Hive metastore with partition and data file information
hive.database: stream                                    ==> Target database for writes
tables.dir: stream                                       ==> Target directory for persisting data files
hadoop.hive.warehouse.basedir: /etl/                     ==> Top-level directory for the stream system
flush.size: 50                                           ==> Write size at which buffered in-memory data is flushed to a file
hdfs.authentication.kerberos: false                      ==> Kerberos-based authentication flag
hdfs.namenode.principal: invapp                          ==> Kerberos principal used for HDFS interactions
connect.hdfs.keytab: ./invapp.server.keytab              ==> Kerberos keytab used for authentication
HDFS_AUTHENTICATION_KERBEROS_CONFIG: false               ==> Kerberos-based authentication flag; to be deprecated in future updates
partitioner.class: io.invariant.invhdfsadapter.partitioner.DailyPartitioner ==> Partitioning strategy used for file naming and file splits
locale: Locale.US                                        ==> Locale used for pipeline processing
timezone: America/Los_Angeles                            ==> Time zone used for pipeline processing
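
The partitioner.class, timezone, and warehouse settings together determine the directory in which each day's data files land under HDFS. The sketch below only illustrates the idea of a daily partition path derived from these settings; the actual layout is defined by io.invariant.invhdfsadapter.partitioner.DailyPartitioner, and daily_partition_path is a hypothetical helper written for this example.

from datetime import datetime, timezone
from zoneinfo import ZoneInfo
import yaml

with open('invariant-hdfs-adapter.yml') as fh:
    cfg = yaml.safe_load(fh)

def daily_partition_path(table, event_time):
    """Illustrative only: one directory per calendar day under the warehouse base dir."""
    day = event_time.astimezone(ZoneInfo(cfg['timezone'])).strftime('%Y-%m-%d')
    base = cfg['hadoop.hive.warehouse.basedir'].rstrip('/')
    return f"{base}/{cfg['tables.dir']}/{table}/{day}"

print(daily_partition_path('table1', datetime.now(timezone.utc)))
# e.g. /etl/stream/table1/2018-09-28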

