Schema Design

Hadoop supports schema-less storage, but we still need to decide on a directory structure for the data as it flows through the system. To access and manage data via Hive, schemas must be defined first. Metadata for the stored data plays an important role in the analysis process, and a shared catalog can serve load scripts, query tools, and BI applications alike.

HDFS Schema Design

It is important to create a structured and organized repository of data:

Standard directory structure
Stage data in separate location
Enforce access control
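
As a rough illustration of staging data in a separate location, the sketch below builds (but does not run) the `hdfs dfs` commands that would upload a file into a staging directory and then move it into its final location. The staging root `/data/_staging`, the per-batch subdirectory, and the function names are assumptions made for this example, not part of the layout described here.

```python
import posixpath

# Hypothetical staging root; the text only says staging should be separate.
STAGING_ROOT = "/data/_staging"
FINAL_ROOT = "/data"


def staging_path(dataset: str, batch_id: str) -> str:
    """Directory where a batch lands before it is published."""
    return posixpath.join(STAGING_ROOT, dataset, batch_id)


def final_path(dataset: str, batch_id: str) -> str:
    """Directory that readers see once the batch is complete."""
    return posixpath.join(FINAL_ROOT, dataset, batch_id)


def publish_commands(dataset: str, batch_id: str, local_file: str) -> list:
    """Return the hdfs dfs commands, as argument lists, for one batch."""
    stage = staging_path(dataset, batch_id)
    final = final_path(dataset, batch_id)
    return [
        # Create the staging directory and upload the file into it.
        ["hdfs", "dfs", "-mkdir", "-p", stage],
        ["hdfs", "dfs", "-put", local_file, stage],
        # Make sure the parent of the final directory exists.
        ["hdfs", "dfs", "-mkdir", "-p", posixpath.dirname(final)],
        # Move the staged directory into place.
        ["hdfs", "dfs", "-mv", stage, final],
    ]


if __name__ == "__main__":
    for cmd in publish_commands("logs", "2024-01-01", "app.log"):
        print(" ".join(cmd))
```

Each argument list could be passed to `subprocess.run` on a machine with the Hadoop client installed. Moving a fully written directory into place means consumers never read a partially uploaded batch.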

File Location

Files are stored in standard locations. For example, user files go under /user/<name>.

data/
    ods/
    bpm/
    logs/
group/
    fraud-analysis/
    claims-analysis/
user/
    bob/
    john/
app/
    hive/
dashboard/
    call-center/
app-logs/
metadata/

Data is stored as files within the Hadoop filesystem. Separating data by functional use allows access control to be enforced on each directory.
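
One way to make such a layout repeatable is to describe it as data and generate the `hdfs dfs` commands from it. The sketch below is a minimal example under stated assumptions: the paths are taken from the layout above and assumed to sit at the filesystem root, while the owners, groups, and permission modes are hypothetical placeholders chosen only to show per-directory access control.

```python
# Path -> (owner, group, mode). Owners, groups, and modes are hypothetical;
# only the directory names come from the layout in the text.
LAYOUT = {
    "/data/ods": ("etl", "etl", "750"),
    "/data/logs": ("etl", "etl", "750"),
    "/group/fraud-analysis": ("etl", "fraud", "770"),
    "/group/claims-analysis": ("etl", "claims", "770"),
    "/user/bob": ("bob", "bob", "700"),
    "/user/john": ("john", "john", "700"),
}


def layout_commands(layout: dict) -> list:
    """Return hdfs dfs commands that create and secure each directory."""
    commands = []
    for path, (owner, group, mode) in layout.items():
        # Create the directory, including any missing parents.
        commands.append(f"hdfs dfs -mkdir -p {path}")
        # Assign ownership so the mode bits apply to the right principals.
        commands.append(f"hdfs dfs -chown {owner}:{group} {path}")
        # Restrict access according to the directory's functional use.
        commands.append(f"hdfs dfs -chmod {mode} {path}")
    return commands


if __name__ == "__main__":
    for line in layout_commands(LAYOUT):
        print(line)
```

The script only prints the commands, so they can be reviewed before being run by an HDFS superuser. Keeping the layout in one dictionary gives a single place to audit who can read which functional area.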

