Polyglot Data Manager

Managing data from diverse sources

Last updated 2 months ago

Invariant Polyglot Data Manager (PDM) integrates with the Invariant data platform and other data sources and provides SQL query support for both interactive and batch workloads. PDM is designed to scale with dataset size to meet the demands of enterprise environments. It uses file system caching for data held in object storage, together with a range of connectors, to reduce latency and avoid repeated data retrieval. Caching frequently accessed data on local storage devices reduces the load on distributed file systems such as HDFS, which shortens query execution times and improves overall system performance. Because cached reads do not go back to the source, ad-hoc work can run against the same data lake without disrupting ongoing workloads.
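The caching idea described above can be illustrated with a minimal read-through file cache. This is only a sketch of the general technique, not PDM's implementation or configuration: the class `ReadThroughFileCache`, the `fetch` callback, and the in-memory `FAKE_OBJECT_STORE` are all hypothetical names introduced for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable


class ReadThroughFileCache:
    """Keep copies of remote objects on local disk; fetch only on a miss."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str], bytes]) -> None:
        self.cache_dir = cache_dir
        self.fetch = fetch        # hypothetical loader for the remote store
        self.remote_reads = 0     # counts trips to the slow backing store
        cache_dir.mkdir(parents=True, exist_ok=True)

    def _local_path(self, key: str) -> Path:
        # Flatten the object key into a single file name.
        return self.cache_dir / key.replace("/", "__")

    def read(self, key: str) -> bytes:
        path = self._local_path(key)
        if path.exists():
            # Cache hit: served from local storage, no remote access.
            return path.read_bytes()
        # Cache miss: fetch once from the backing store, then keep a copy.
        data = self.fetch(key)
        self.remote_reads += 1
        path.write_bytes(data)
        return data


# Stand-in for object storage or HDFS (an assumption for this sketch).
FAKE_OBJECT_STORE = {"sales/2024/part-0.csv": b"id,amount\n1,10\n2,25\n"}

cache = ReadThroughFileCache(Path(tempfile.mkdtemp()), FAKE_OBJECT_STORE.__getitem__)
first = cache.read("sales/2024/part-0.csv")   # miss: goes to the backing store
second = cache.read("sales/2024/part-0.csv")  # hit: served from local disk
```

In this sketch the second read never touches the backing store, which is the property that lets repeated ad-hoc reads avoid adding load to the shared file system.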

At its core, PDM is a distributed query engine that processes data in parallel across multiple servers. Users can query across data sources directly, without building ETL processes to centralize the data first. Analysts can therefore run both ad-hoc and batch workloads and perform SQL-based analysis on large, distributed datasets. PDM is geared toward analytics, in particular sampling large datasets to uncover patterns and trends that support quick, data-driven decisions. Once these patterns are identified and validated, the resulting models can be scaled using existing data lake resources and predefined ETL and ELT pipelines.
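To make the idea of querying across sources without a centralizing ETL step concrete, the sketch below joins two independent databases in a single SQL statement. It uses Python's built-in `sqlite3` module and `ATTACH DATABASE` purely as a small stand-in; it does not use PDM, its driver, or its SQL dialect, and the table names, schema alias `crm`, and sample rows are invented for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def make_source(path: Path, ddl: str, insert_sql: str, rows: list) -> None:
    """Create one independent database file to act as a separate data source."""
    conn = sqlite3.connect(path)
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    conn.commit()
    conn.close()


workdir = Path(tempfile.mkdtemp())
orders_db = workdir / "orders.db"
customers_db = workdir / "customers.db"

# Two sources that are never copied into a common store.
make_source(
    orders_db,
    "CREATE TABLE orders (customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (1, 15.0), (2, 7.5)],
)
make_source(
    customers_db,
    "CREATE TABLE customers (id INTEGER, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)

# One session sees both sources; the join runs across them in place.
conn = sqlite3.connect(orders_db)
conn.execute("ATTACH DATABASE ? AS crm", (str(customers_db),))
totals = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
conn.close()

print(totals)  # [('Ada', 25.0), ('Grace', 7.5)]
```

A distributed engine additionally splits such a query across worker nodes, which this single-process sketch does not attempt to show.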

The PDM driver is the client component that handles communication between applications and the PDM Server Engine. Through it, users can run queries on data stored across diverse environments such as databases, data lakes, and cloud storage. Clients can connect to the PDM engine from desktop applications, web interfaces, and BI tools such as Tableau for analysis and reporting.

Use Cases

PDM works with both cloud and on-premises data sources. You can use PDM for:

  • Serving as the SQL query engine behind business intelligence tools.

  • Providing a fast and interactive SQL querying experience for big data.

  • Federating queries across multiple data sources.

  • Querying data stored in distributed data lakes such as Amazon S3, Google Cloud Storage, or Hadoop HDFS.