Introduction

Discovery

Discovery helps businesses carry out ad-hoc analysis of business data in an iterative manner. It provides a fast ingestion and lookup system for heterogeneous datasets and their contents. Businesses may want to store heterogeneous data from various sources in Hadoop or NoSQL databases and run analytics on its contents. This requires a platform that can help the enterprise build a scalable, structured, centralized data store.

Discovery is built to leverage Hadoop and is an ideal platform for exploratory data analysis. Transforming and loading data into an enterprise data warehouse (EDW) for reporting and business intelligence is expensive, so it makes sense to analyze the data first and confirm we are loading the right data before incurring the cost of ETL.

Concepts

Data Warehouse

The modern data warehouse is a key part of the enterprise analytics solution. How data is captured, stored and analyzed is central to the company’s operations.

A data warehouse is a copy of transaction data specifically structured for query and analysis — Ralph Kimball

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process — Bill Inmon

The choice of technology and methodology should be scalable and able to evolve with the organization's data needs.

Enterprise Data Lake

A data lake is a centralized repository that stores structured, semi-structured and unstructured data from various sources in as-is raw format. This is different from data warehouses, which are geared more towards cleaning and storing data for efficient queries and feeding data marts for reporting. Data lakes can store data in a variety of formats, which can then be used for ad-hoc SQL queries, search, analytical processing and machine learning. Hadoop is one of the most popular platforms for building data lakes.
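The "as-is raw format" idea above can be sketched with a file-based raw zone that simply lands incoming files, whatever their format, under a path partitioned by source and dataset. This is a minimal illustration, not the Discovery implementation; the layout, source names and dataset names are assumptions made for the example.

```python
import json
import pathlib
import tempfile

# Hypothetical raw-zone layout for a file-based data lake
# (paths, source names and dataset names are illustrative only).
lake = pathlib.Path(tempfile.mkdtemp()) / "lake" / "raw"

def land(source: str, dataset: str, filename: str, payload: bytes) -> pathlib.Path:
    """Store an incoming file as-is, partitioned by source system and dataset."""
    target = lake / source / dataset / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target

# Heterogeneous inputs are kept in their original formats: JSON from a CRM,
# CSV from a point-of-sale system. No cleaning happens at ingestion time.
land("crm", "contacts", "contacts.json",
     json.dumps([{"id": 1, "name": "Ada"}]).encode())
land("pos", "sales", "sales.csv", b"id,amount\n1,19.99\n")

files = sorted(p.relative_to(lake).as_posix()
               for p in lake.rglob("*") if p.is_file())
print(files)  # ['crm/contacts/contacts.json', 'pos/sales/sales.csv']
```

Because nothing is transformed on the way in, downstream consumers can reinterpret the raw files later for new use cases, which is the key difference from a warehouse's load-time schema.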

When building a data warehouse or data lake, it is important to understand the analytics lifecycle and align it to the organization’s strategic goals. The next sections will outline the discovery methodology and how the Invariant platform can support you in achieving those analytical goals.

ETL/ELT

Traditional data collection processes extract data from a source, transform it into the desired structure, and then load it into the target system for analysis. This is commonly referred to as the Extract-Transform-Load (ETL) process. Another processing technique is Extract-Load-Transform (ELT), in which the extracted data is loaded onto the target system first and transformed later if needed.
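The difference between the two orderings can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite database as a stand-in for the target system; the table names and fields are assumptions made for the example, not part of the platform.

```python
import sqlite3

# Hypothetical raw source records (field names are illustrative only).
raw_rows = [
    {"id": 1, "amount": "19.99", "region": "us-east"},
    {"id": 2, "amount": "5.50", "region": "EU-WEST"},
]

def etl(rows, conn):
    """ETL: transform records *before* loading them into the target."""
    transformed = [(r["id"], float(r["amount"]), r["region"].lower()) for r in rows]
    conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, region TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", transformed)

def elt(rows, conn):
    """ELT: load raw records as-is, then transform inside the target system."""
    conn.execute("CREATE TABLE raw_sales (id INTEGER, amount TEXT, region TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)",
                     [(r["id"], r["amount"], r["region"]) for r in rows])
    # The transformation runs in the store itself (here, SQL CAST and LOWER).
    conn.execute("""CREATE TABLE sales_clean AS
                    SELECT id, CAST(amount AS REAL) AS amount,
                           LOWER(region) AS region
                    FROM raw_sales""")

conn = sqlite3.connect(":memory:")
etl(raw_rows, conn)
elt(raw_rows, conn)
print(conn.execute("SELECT region FROM sales_clean ORDER BY id").fetchall())
# [('us-east',), ('eu-west',)]
```

Both paths end with the same cleaned data; the practical difference is where the transformation compute runs, which is why ELT pairs naturally with scalable stores such as Hadoop.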

Data Archival

Enterprises may need to adhere to certain data retention requirements for regulatory or business reasons. However, it is costly to keep all the data in a database or warehouse, so data is eventually backed up to external offline stores. Hadoop provides an inexpensive option for keeping data available for online analysis.
