Concepts

Key concepts for understanding the Elasticsearch index store are described next.

Cluster

A cluster is a collection of nodes that together hold the data to be stored. The cluster is identified by a unique name and provides federated indexing and search capabilities across all nodes.

Make sure that the cluster name is unique for different environments. For example, use ops-insight-dev for development and ops-insight-prod for production clusters.

Node

A node is a single server that is part of the cluster. It stores data and participates in the cluster’s indexing and search capabilities. A node is identified by a unique name. The name matters for administration, when you need to identify which servers in the network correspond to which nodes in the cluster. A node can join a cluster only if it is configured with that cluster's name.
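The naming rules for clusters and nodes can be sketched as a small helper that renders the two relevant lines of a node's elasticsearch.yml. This is a minimal sketch, not the product's own tooling: `cluster.name` and `node.name` are standard Elasticsearch settings, but the function `render_node_settings`, the `ops-insight` prefix default, and the node name `node-1` are assumptions made for illustration.

```python
def render_node_settings(environment: str, node_name: str,
                         prefix: str = "ops-insight") -> str:
    """Return the elasticsearch.yml lines that tie a node to one cluster.

    The helper and its arguments are hypothetical; only the setting
    names cluster.name and node.name come from Elasticsearch itself.
    """
    if not environment or not node_name:
        raise ValueError("environment and node_name must be non-empty")
    # One cluster name per environment, e.g. ops-insight-dev.
    cluster_name = f"{prefix}-{environment}"
    return f"cluster.name: {cluster_name}\nnode.name: {node_name}\n"


if __name__ == "__main__":
    print(render_node_settings("dev", "node-1"))
```

Every node that should belong to the development cluster would carry the same `cluster.name` value and a different `node.name`.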

Index

An index is a collection of documents that have somewhat similar characteristics. For example, you can define one index for business process data, another for application logs, and yet another for customer data. An index is identified by a name, and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents within it.

In a single cluster, you can define as many indexes as you want. An example index is shown below:

"inv_logs": {
     "aliases": {},
     "mappings": {
        "_default_": {
           "_all": {
              "enabled": true
           },
           "properties": {
              "@version": {
                 "type": "keyword"
              },
              "geoip": {
                 "dynamic": "true",
                 "properties": {
                    "continent_code": {
                       "type": "keyword"
                    },
                    "country_name": {
                       "type": "keyword"
                    },
                    "ip": {
                       "type": "ip"
                    },
                    "location": {
                       "type": "geo_point"
                    }
                 }
              }
              ...
              "verb": {
                 "type": "text",
                 "norms": false,
                 "fields": {
                    "raw": {
                       "type": "keyword",
                       "ignore_above": 256
                    }
                 }
              }
           }
        }
     },
     "settings": {
        "index": {
           "refresh_interval": "5s",
           "number_of_shards": "5",
           "provided_name": "inv_logs",
           "creation_date": "1488616444658",
           "number_of_replicas": "1",
           "uuid": "2VLaSZpXQUeHftnAvqu8eQ",
           "version": {
              "created": "5000099"
           }
        }
     }
 }
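As a rough illustration of how a definition like the one above could be produced, the sketch below builds the settings and mappings as a Python dictionary and serializes it to JSON, the form a create-index request body takes. It assumes an Elasticsearch 5.x cluster, as the `_default_` mapping in the listing suggests. It reproduces only fields visible in the listing; the elided fields are left out rather than guessed. The function name `build_index_body` is hypothetical.

```python
import json


def build_index_body(shards: int = 5, replicas: int = 1,
                     refresh_interval: str = "5s") -> dict:
    """Build a create-index body with the fields shown in the listing.

    Hypothetical helper: it covers only the visible part of the
    example mapping and assumes 5.x-style "_default_" mappings.
    """
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
                "refresh_interval": refresh_interval,
            }
        },
        "mappings": {
            "_default_": {
                "_all": {"enabled": True},
                "properties": {
                    "@version": {"type": "keyword"},
                    "geoip": {
                        "dynamic": "true",
                        "properties": {
                            "continent_code": {"type": "keyword"},
                            "country_name": {"type": "keyword"},
                            "ip": {"type": "ip"},
                            "location": {"type": "geo_point"},
                        },
                    },
                    # Full-text field with an exact-match sub-field.
                    "verb": {
                        "type": "text",
                        "norms": False,
                        "fields": {
                            "raw": {"type": "keyword", "ignore_above": 256}
                        },
                    },
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=3))
```

Values such as `uuid`, `creation_date`, `provided_name`, and `version` in the listing are assigned by the cluster when the index is created, so the sketch does not set them.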

Type

Within an index, there can be one or more types. A type is a logical category or partition of the index, defined entirely by the user. In general, a type is defined for documents that share a set of common fields. For example, in a case management system with multiple case types, you may wish to store all data in a single index. In this index, you may define one type for case data, another for user data, and yet another for comment data.

Document

A document is the basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. Documents are expressed in JSON (JavaScript Object Notation).

Within an index/type, you can store as many documents as needed.
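To make the relationship between index, type, and document concrete, the sketch below shows how one JSON document might be addressed. It assumes an Elasticsearch version that still supports mapping types (such as 5.x), where a document lives at the REST path `/{index}/{type}/{id}`. The index name `cases`, the type `comment`, the field names, and the helper `document_request` are all hypothetical, chosen to match the case management example above.

```python
import json
from urllib.parse import quote


def document_request(index: str, doc_type: str, doc_id: str,
                     document: dict) -> tuple:
    """Return the REST path and JSON body for indexing one document.

    Hypothetical helper: it only builds the request parts and does
    not contact a cluster.
    """
    # Percent-encode each segment so special characters stay in one segment.
    path = "/" + "/".join(quote(part, safe="")
                          for part in (index, doc_type, doc_id))
    return path, json.dumps(document, sort_keys=True)


if __name__ == "__main__":
    path, body = document_request(
        "cases", "comment", "42",
        {"case_id": "C-1001", "text": "Reviewed by supervisor"},
    )
    print(path)
    print(body)
```

Sending the body with an HTTP PUT to that path would store the comment as document `42` of type `comment` in the `cases` index.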


Last updated 4 years ago