DataNode
In the Hadoop Distributed File System (HDFS), the DataNode is a crucial component that is responsible for the actual storage and retrieval of data. While the NameNode manages the metadata and namespace (e.g., the directory structure and block locations), the DataNode is responsible for holding the actual data blocks that make up the files in the system.
The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Each DataNode runs on a worker machine in the Hadoop cluster, and there are typically many DataNodes in a large Hadoop cluster, each storing a portion of the data. The number of DataNodes can grow or shrink depending on the cluster’s size and storage requirements.
Key Responsibilities of the DataNode
Storing Data:
DataNodes are responsible for storing the actual data in the form of blocks. Each file in HDFS is split into blocks (128 MB per block by default in Hadoop 2.x and later, configurable via the dfs.blocksize property), and these blocks are distributed across various DataNodes in the cluster.
The DataNode writes, reads, and replicates data blocks when requested by the NameNode or clients.
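As a sketch of how the block size can be controlled per file, the snippet below uses the public Java FileSystem API to create a file with an explicit 256 MB block size; the path and file contents are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Create a file whose blocks are 256 MB instead of the 128 MB default:
        // overwrite=true, bufferSize=4096, replication=3, blockSize=256 MB.
        try (FSDataOutputStream out = fs.create(
                new Path("/data/example.txt"), true, 4096, (short) 3, 256L * 1024 * 1024)) {
            out.writeUTF("hello HDFS");
        }
    }
}
```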
Block Management:
Each block stored on a DataNode is part of some file in the namespace; the DataNode keeps track of which blocks it holds and manages them on its local disks.
DataNodes are responsible for replicating data blocks to ensure fault tolerance. They also delete blocks when instructed by the NameNode (e.g., when files are deleted or blocks need to be re-replicated).
Heartbeats:
DataNodes send regular heartbeats to the NameNode to let it know that the DataNode is alive and functioning properly.
If the NameNode does not receive a heartbeat from a DataNode within a set period, it marks the DataNode as dead and re-replicates the blocks that node held, using the surviving copies on other healthy DataNodes.
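The loop below is a minimal, hypothetical sketch of what a heartbeat sender looks like. NameNodeClient and sendHeartbeat are illustrative stand-ins, not Hadoop's actual internal interfaces; the 3-second interval matches the dfs.heartbeat.interval default.

```java
import java.util.concurrent.TimeUnit;

// Illustrative stand-in for the DataNode's RPC connection to the NameNode.
interface NameNodeClient {
    void sendHeartbeat(); // carries liveness plus capacity/usage statistics
}

class HeartbeatLoop implements Runnable {
    private final NameNodeClient nameNode;
    private final long intervalSeconds; // dfs.heartbeat.interval, 3 s by default

    HeartbeatLoop(NameNodeClient nameNode, long intervalSeconds) {
        this.nameNode = nameNode;
        this.intervalSeconds = intervalSeconds;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            nameNode.sendHeartbeat(); // tell the NameNode this node is alive
            try {
                TimeUnit.SECONDS.sleep(intervalSeconds);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop cleanly on shutdown
            }
        }
    }
}
```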
Block Reporting:
DataNodes send a block report to the NameNode periodically. This report contains a list of all the blocks stored on that DataNode.
This helps the NameNode keep track of which blocks are located on which DataNodes and whether any blocks are missing or under-replicated.
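The reporting interval is controlled by a standard HDFS property. The sketch below simply reads it, assuming the stock default of six hours and that the property holds a plain numeric value.

```java
import org.apache.hadoop.conf.Configuration;

public class BlockReportInterval {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Full block reports are sent every dfs.blockreport.intervalMsec
        // milliseconds (21600000 ms = 6 hours by default).
        long intervalMs = conf.getLong("dfs.blockreport.intervalMsec", 21_600_000L);
        System.out.println("Block report interval: " + intervalMs / 3_600_000.0 + " h");
    }
}
```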
Data Block Retrieval and Writing:
When a client requests data from HDFS, the NameNode provides the client with the locations of the blocks. The client can then directly communicate with the DataNodes to retrieve the blocks of the file.
Similarly, when a client wants to write data, the NameNode directs the client to multiple DataNodes where the file’s blocks should be stored.
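This block lookup is exposed to clients through the public FileSystem API. The sketch below asks the NameNode which DataNodes hold each block of a file; the path is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.txt"); // assumed path
        FileStatus status = fs.getFileStatus(file);
        // Ask the NameNode which DataNodes hold each block of the file.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                + " length=" + loc.getLength()
                + " hosts=" + String.join(",", loc.getHosts()));
        }
    }
}
```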
Data Integrity:
DataNodes are responsible for ensuring that the data stored on them is not corrupted. Each block stored on the DataNode has an associated checksum that is used to verify the integrity of the data.
If the checksum doesn’t match the actual data during a read operation, the NameNode is notified, and the block is either repaired or re-replicated from another copy.
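HDFS checksums data in small chunks (512 bytes by default, per dfs.bytes-per-checksum) using CRC32C. The simplified sketch below shows only the verify step, substituting java.util.zip.CRC32 for brevity.

```java
import java.util.zip.CRC32;

public class ChunkVerify {
    // Simplified stand-in for HDFS's per-chunk check: HDFS uses CRC32C over
    // 512-byte chunks; plain CRC32 is used here for illustration.
    static boolean verify(byte[] chunk, long storedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        // On a mismatch the replica is reported to the NameNode as corrupt
        // and the reader retries from another DataNode.
        return crc.getValue() == storedChecksum;
    }
}
```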
Block Replication:
DataNodes help maintain the replication factor of blocks. If the replication level of a block falls below the desired factor (e.g., due to a DataNode failure), the NameNode instructs a DataNode holding a healthy replica to copy the block to another node, restoring redundancy.
By default, HDFS maintains 3 replicas of each block, but this replication factor can be configured.
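The per-file replication factor can also be changed through the FileSystem API (the cluster-wide default comes from dfs.replication). A minimal sketch with an illustrative path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask for 5 replicas of this file; the NameNode schedules the extra
        // copies in the background, so the call only confirms the request.
        boolean accepted = fs.setReplication(new Path("/data/example.txt"), (short) 5);
        System.out.println("replication change accepted: " + accepted);
    }
}
```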
How DataNodes Work in HDFS
When a file is written to HDFS, the data is broken into blocks, and these blocks are distributed across various DataNodes. Here's a high-level overview of the process; a minimal client-side sketch in Java follows each list of steps.
File Write Process:
A client wants to store a file in HDFS.
The NameNode determines the DataNodes that should store the blocks for that file based on the available storage and replication factor.
The client streams each block to the first DataNode in a write pipeline; that DataNode writes the block to disk and forwards it to the next replica in the pipeline.
Acknowledgments flow back through the pipeline to the client, and each DataNode notifies the NameNode that it has received the block.
The NameNode updates its metadata to track the locations of the blocks.
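These steps map onto a few lines of client code. A minimal sketch using the public Java API, with an illustrative path and contents:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() contacts the NameNode for target DataNodes; the returned
        // stream then pipelines data to those DataNodes block by block.
        try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            out.writeUTF("hello HDFS");
        }
    }
}
```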
File Read Process:
A client wants to read a file from HDFS.
The NameNode provides the client with the locations of the blocks for that file.
The client communicates directly with the DataNodes to fetch the blocks of the file.
Once the blocks are retrieved, the client reassembles them into the original file.
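The matching read path is just as short. A minimal sketch that reads back the file written above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() fetches block locations from the NameNode; the reads that
        // follow go directly to the DataNodes holding each block.
        try (FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
            System.out.println(in.readUTF());
        }
    }
}
```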
Block Replication:
Through heartbeats and block reports, DataNodes keep the NameNode informed of which blocks they hold; the NameNode uses this information to track each block's replication status.
If a DataNode fails or if a block becomes under-replicated (due to a failure or a node going down), the NameNode instructs other DataNodes to replicate the missing blocks.
This ensures that HDFS maintains the configured replication factor and provides fault tolerance.
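Operators typically check replication health with the hdfs fsck command, which lists under-replicated blocks. Programmatically, a file's target replication factor can be read as below (the path is illustrative; note this is the target, not the live replica count):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationOf {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));
        // The target replication factor recorded by the NameNode; the actual
        // replica count can temporarily fall below this after failures.
        System.out.println("target replication: " + status.getReplication());
    }
}
```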
Data Integrity and Error Handling:
Each block of data stored on the DataNode has a checksum.
The DataNode verifies the integrity of its blocks against these checksums, both when serving reads and via its periodic block scanner. If a mismatch is found, the corrupt replica is reported to the NameNode.
The NameNode then triggers recovery: it schedules a new copy of the block from a healthy replica on another DataNode and eventually deletes the corrupt one.
Communication Between DataNodes and NameNode
Heartbeats: DataNodes send periodic heartbeats to the NameNode to indicate they are alive and functioning. The absence of heartbeats from a DataNode signals to the NameNode that the DataNode has failed.
Block Reports: DataNodes send block reports to the NameNode to inform it of the blocks stored on them. This helps the NameNode maintain an up-to-date inventory of where each block is stored in the cluster.
DataNode Failures and Recovery
Detection: The NameNode continuously monitors the health of DataNodes by waiting for heartbeats. If a DataNode does not send a heartbeat within a specified interval, the NameNode considers it unavailable.
Replication for Fault Tolerance: In the event of a DataNode failure, HDFS ensures that the data is still available by replicating the blocks that were stored on the failed DataNode to other DataNodes. The NameNode triggers replication of these blocks to ensure the required replication factor is met.
Data Recovery: If a block is corrupted or lost due to a DataNode failure, the NameNode identifies the missing or corrupted blocks and schedules replication from other available copies of the blocks.
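With the stock settings, the dead-node timeout works out to 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval, i.e. 2 × 300 s + 10 × 3 s = 630 s (10.5 minutes). The sketch below reads both properties and computes it, assuming they hold plain numeric values:

```java
import org.apache.hadoop.conf.Configuration;

public class DeadNodeTimeout {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
        long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300_000L);
        // The NameNode marks a DataNode dead after roughly
        // 2 * recheck-interval + 10 * heartbeat-interval.
        long timeoutMs = 2 * recheckMs + 10 * heartbeatSec * 1000;
        System.out.println("dead-node timeout: " + timeoutMs / 1000.0 + " s"); // 630 s by default
    }
}
```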
DataNode in High Availability Setup
Hadoop High Availability (HA) is primarily about the NameNode: an active NameNode is paired with a standby that can take over, because the NameNode is otherwise a single point of failure. DataNodes need no such standby; a large cluster always has many of them, and if one goes down, recovery is handled at the block level through replication rather than by another node "taking its place".
Block-level Replication: If a DataNode fails, the NameNode replicates the lost blocks to other DataNodes to restore the replication factor.
DataNode Recovery: Once a DataNode comes back online after a failure, it re-registers with the NameNode and sends a full block report. The NameNode reintegrates its replicas into the block map and deletes any copies that are now in excess of the replication factor.
DataNode Architecture
The DataNode architecture consists of the following components:
Data Block Storage: The actual data blocks are stored in the local storage of the DataNode’s machine.
DataNode Daemon: The daemon (or service) running on the worker machine that manages local block storage, replication work, and communication with the NameNode.
Block Scanner: A background component in the DataNode that periodically scans stored blocks and verifies their checksums to detect corruption.
Disk and Network Management: The DataNode interacts with the local disk and network to store/retrieve data blocks and communicate with the NameNode and clients.
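Two standard properties tie directly to these components: dfs.datanode.data.dir (the local directories where blocks live) and dfs.datanode.scan.period.hours (how often the block scanner re-verifies each block; 504 hours, i.e. three weeks, by default). A minimal sketch that reads both:

```java
import org.apache.hadoop.conf.Configuration;

public class DataNodeSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Comma-separated local directories where this DataNode stores blocks.
        String dataDirs = conf.get("dfs.datanode.data.dir", "file://${hadoop.tmp.dir}/dfs/data");
        // Block scanner period; each block is re-verified at most this often.
        long scanHours = conf.getLong("dfs.datanode.scan.period.hours", 504L);
        System.out.println("data dirs:   " + dataDirs);
        System.out.println("scan period: " + scanHours + " h");
    }
}
```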