Data Analysis Methodology

Data Analytics Lifecycle

Defining an analytics roadmap, and a strategic framework to execute it, is key to the success of any data analytics project. Every data analytics project must start with a clear understanding of the customer requirements and organizational needs, followed by an established data discovery lifecycle to structure the project. The discovery lifecycle is key to successfully planning and executing all steps from start to finish.

Because the discovery lifecycle is iterative, the high-level goal or problem statement can be continuously refined. The objective is to refine the business questions and models until you arrive at a final model that can be operationalized. The discovery lifecycle is loosely based on CRISP-DM (Cross-Industry Standard Process for Data Mining), a widely adopted methodology for data mining and knowledge discovery.

CRISP-DM is a measured, step-by-step approach with a systems perspective for managing the complete lifecycle of analytics initiatives. The process is broken into six major phases, but the exact sequence is not strict: implementers are free to move back and forth between phases as needed.

In the figure above, the outer loop represents the cyclical nature of the data mining and knowledge discovery process. Discovery typically continues even after the model is built and operationalized: analysts can keep using the process to look for new insights, with the results of each iteration triggering new questions and building on the learnings of previous ones.

The Discovery lifecycle includes the following major stages, executed iteratively:

  • Understand Business Needs

  • Acquire and Understand Data

  • Build and Refine Model

  • Evaluate Model

  • Operationalize Model
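The iterative flow through these stages can be sketched as a simple loop. This is an illustrative sketch only, not part of the Discovery product: the stage names come from the list above, while `evaluate`, `MAX_ITERATIONS`, and the pass/fail gate are hypothetical stand-ins for a real model-evaluation step.

```python
# Sketch of the iterative Discovery lifecycle described above.
# Stage names are taken from the text; the evaluation callback and
# the iteration cap are hypothetical illustration devices.

STAGES = [
    "Understand Business Needs",
    "Acquire and Understand Data",
    "Build and Refine Model",
    "Evaluate Model",
]

MAX_ITERATIONS = 3  # hypothetical cap on refinement cycles


def run_discovery_lifecycle(evaluate):
    """Cycle through the stages until evaluation passes, then
    operationalize. `evaluate(iteration)` returns True once the
    model is good enough to deploy."""
    history = []
    for iteration in range(1, MAX_ITERATIONS + 1):
        for stage in STAGES:
            history.append((iteration, stage))
        if evaluate(iteration):
            # The exact sequence is not strict; here we exit the loop
            # once the evaluation stage signals the model is ready.
            history.append((iteration, "Operationalize Model"))
            return history
    return history  # model never passed evaluation; refine further later


# Example: suppose the model passes evaluation on the second iteration.
run = run_discovery_lifecycle(lambda i: i >= 2)
```

Each tuple in `run` records which iteration a stage was executed in, mirroring how results from one pass feed the questions of the next.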