Processors

Parsing and normalizing data

Operational Insight uses the Logstash framework in its data pipeline for log event processing. Logstash has a pluggable pipeline architecture: it accepts input from a variety of sources, parses and transforms the data using user-defined rules, and writes the parsed data to an Elasticsearch cluster. Logstash provides rich capabilities for processing and transforming logs as well as other forms of data, and it supports a large, extensible set of input, filter, and output plugins and codecs, allowing any type of event to be enriched and transformed as part of the ingestion process.

Log Event Processing Pipeline

The event processing pipeline has three stages: inputs → filters → outputs. Inputs generate events, filters modify them, and outputs ship them elsewhere. Inputs and outputs support codecs that can be used to encode or decode the data as it enters or exits the pipeline without having to use a separate filter.
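
In configuration terms, each stage corresponds to a section of the pipeline definition file. The sketch below is a minimal, hypothetical pipeline (the plugins and settings are illustrative, not taken from a shipped configuration): it reads lines from standard input, adds a field to each event, and prints the structured result.

    input {
      stdin { }                                # inputs generate events
    }

    filter {
      mutate {
        add_field => { "stage" => "demo" }     # filters modify events (hypothetical field)
      }
    }

    output {
      stdout { codec => rubydebug }            # outputs ship events on; the codec formats them
    }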

Inputs

Inputs are used to get data into the pipeline (a configuration sketch follows the list below). Some commonly used inputs are:

  • file: reads from a file on the filesystem, much like the UNIX command “tail -0F”.

  • syslog: listens on the well-known port 514 for syslog messages in the RFC3164 format.

  • beats: processes events sent by filerelay and other metric beats.
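
A sketch of how these inputs might be declared together; the file path and the beats port are assumptions to adjust for your environment.

    input {
      file {
        path => "/var/log/app/app.log"         # hypothetical path; tailed like "tail -0F"
      }
      syslog {
        port => 514                            # listen for RFC3164 syslog messages
      }
      beats {
        port => 5044                           # receive events from Filerelay/Beats shippers
      }
    }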

Filters

Filters are intermediary processing stages in the pipeline. They can be combined with conditional logic to perform an action on an event only if it meets certain criteria (a configuration sketch follows the list). Some examples include:

  • grok: parses and structures arbitrary text. Use it to turn unstructured log data into structured fields that can be queried.

  • mutate: performs general transformations on event fields, such as renaming, removing, replacing, and modifying fields.

  • drop: drops an event completely.

  • clone: makes a copy of an event, optionally adding or removing fields.
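
A sketch combining these filters with conditional logic; the grok pattern and field names assume a "timestamp level message" log layout and are illustrative only.

    filter {
      grok {
        # assumed layout: "2023-01-01T12:00:00 INFO something happened"
        match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
      }
      mutate {
        rename => { "msg" => "short_message" } # rename a parsed field
        remove_field => ["timestamp"]          # drop a field that is no longer needed
      }
      if [level] == "DEBUG" {
        drop { }                               # discard low-value events entirely
      }
    }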

Outputs

Outputs form the final phase of the pipeline. An event can pass through multiple outputs, but once all output processing is complete, the event has finished its execution. Some common outputs are listed below, followed by a configuration sketch:

  • elasticsearch: sends event data to an Elasticsearch cluster.

  • file: writes event data to a file on disk.
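
A sketch of a corresponding output section; the Elasticsearch address, index name, and archive path are assumptions.

    output {
      elasticsearch {
        hosts => ["http://localhost:9200"]     # assumed cluster address
        index => "ops-logs-%{+YYYY.MM.dd}"     # hypothetical daily index naming scheme
      }
      file {
        path => "/var/log/archive/events-%{+YYYY-MM-dd}.log"   # keep a copy on disk
      }
    }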

Codecs

Codecs are stream filters that can operate as part of an input or output. They make it easy to separate the transport of messages from the serialization process (a configuration sketch follows the list). Popular codecs include json, multiline, and plain (text).

  • json: encodes or decodes data in JSON format.

  • multiline: merges multi-line text events, such as Java exception and stack-trace messages, into a single event.
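
A sketch showing codecs attached directly to an input and an output; the multiline pattern assumes that continuation lines (such as stack-trace frames) begin with whitespace.

    input {
      file {
        path => "/var/log/app/app.log"         # hypothetical path
        codec => multiline {
          pattern => "^\s"                     # continuation lines start with whitespace
          what => "previous"                   # fold them into the previous event
        }
      }
    }

    output {
      stdout { codec => json }                 # emit each event as JSON
    }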


For more details, please refer to the Logstash documentation.