At the heart of the open source framework is a clustered file system called the Hadoop Distributed File System (HDFS). HDFS was designed to store very large volumes of data across a large number of machines equipped with commodity hard drives.

Originally, HDFS was designed around a "write once, read many" model, well suited to the batch processing logic of MapReduce, Hadoop's original data analysis framework.

Built on the local storage of each node, the first HDFS versions ensured data safety by replicating all data written to the cluster. By default, each block was written to three different nodes (including one copy on the local node).
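To illustrate, the sketch below uses the Hadoop FileSystem Java API to write a file with the default replication factor of 3; the namenode address and file path are hypothetical, and the standard hadoop-client dependency is assumed to be on the classpath.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor: three copies of every block.
        conf.set("dfs.replication", "3");
        // Hypothetical cluster address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Write a file; its blocks are replicated to three datanodes.
        Path file = new Path("/data/example.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }

        // The replication factor can also be changed per file after the fact.
        fs.setReplication(file, (short) 2);
    }
}
```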

This approach was neither the most elegant nor the most efficient in terms of redundancy, especially since the first version of HDFS relied on a single "namenode", a real weak point in the cluster's metadata management architecture. But because it typically relied on inexpensive SATA hard disk drives, HDFS still offered an economical storage solution compared with traditional storage arrays.


Another problem: the first HDFS versions were essentially optimized to maximize throughput, not to handle random, transactional access patterns. This, again, was a consequence of the batch nature of MapReduce.

Gradually, HDFS evolved to correct its initial limitations. With version 2.0 of Hadoop, the main weakness of HDFS was addressed: the HDFS High Availability feature made it possible to run a second namenode in an active/passive configuration, protecting the file system against the failure of one of the namenodes and bringing fault tolerance to metadata management.
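As an illustration of what high availability means for clients, the sketch below shows the client-side configuration of a logical nameservice backed by two namenodes; the nameservice name ("mycluster"), the namenode identifiers, and the host names are all placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Logical nameservice that hides the two physical namenodes (names are illustrative).
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
        // Proxy provider that fails over to the standby namenode if the active one is lost.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The client addresses the nameservice, not a single namenode.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/")));
    }
}
```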

More recently, version 3.0 of Hadoop brought native integration of erasure coding to HDFS. Erasure coding (contributed by Intel and Cloudera) allows HDFS to store data more efficiently on the cluster while improving data recovery and reliability. Most importantly, with erasure coding, Hadoop no longer has to store three complete replicas of the data on the cluster. Technically, each directory can be associated with a particular policy that determines how stored blocks are distributed, and files created in the directory inherit this rule. According to the Hadoop documentation, "compared to triple replication, the implementation of erasure coding saves 50% of storage capacity while improving fault tolerance." These gains obviously translate into an overall reduction in storage costs.
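As an illustration of directory-level policies, the sketch below assigns the built-in RS-6-3-1024k Reed-Solomon policy to a directory through the DistributedFileSystem API available in Hadoop 3.x; the cluster address and directory path are hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ErasureCodingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address; requires a Hadoop 3.x cluster.
        DistributedFileSystem dfs = (DistributedFileSystem)
                FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Attach the built-in Reed-Solomon 6+3 policy to a directory;
        // files created under it inherit the policy instead of 3x replication.
        Path dir = new Path("/warehouse/cold-data");
        dfs.mkdirs(dir);
        dfs.setErasureCodingPolicy(dir, "RS-6-3-1024k");

        // CLI equivalent: hdfs ec -setPolicy -path /warehouse/cold-data -policy RS-6-3-1024k
        System.out.println(dfs.getErasureCodingPolicy(dir));
    }
}
```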

Overcoming the limitations of HDFS

Regardless of the improvements made to HDFS, another "flaw" of the distributed file system is that it does not conform to the POSIX standard: some features and familiar commands of a traditional file system are not available. This has led some vendors to look for a replacement for HDFS. MapR, one of the pioneers of Hadoop, developed its own distributed file system, which addresses the fragility of HDFS namenodes by distributing metadata across the data nodes.

It also adds advanced features such as snapshots, replication, and cloning. The file system has recently evolved to gain multi-datacenter and multi-cloud (Azure, AWS) capabilities and, above all, sophisticated tiering capabilities. Renamed MapR-XD, it has also gained support for the S3 protocol, which allows MapR to include AWS cloud object storage in its storage system and tiering policies.

Several storage array manufacturers, such as Dell EMC, HPE, and IBM, have also developed HDFS compatibility layers for some of their storage arrays. Their goal is to replace the decentralized storage model of HDFS with a centralized SAN or NAS model. Their argument is twofold: centralized storage is more efficient, better optimized, and easier to manage than HDFS storage. Above all, it offloads storage functions from the cluster nodes, allowing them to concentrate solely on analytical computations.

For companies, other benefits can be highlighted. First, they are already familiar with SAN and NAS storage models. Second, they are used to the rich storage services offered by the arrays (snapshots, replication, etc.). Finally, relying on arrays that already hold their primary and secondary data avoids having to transfer that data into HDFS, and it allows them to use standard protocols such as NFS or SMB to feed additional data into the Hadoop cluster.

Finally, note that traditional arrays are not the only alternative to HDFS. Increasingly, object storage is also emerging as an alternative to native Hadoop storage, thanks in particular to the maturation of the s3a connector.
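For illustration, the sketch below reads an object from S3 through the same FileSystem API used for HDFS, via the s3a scheme; the bucket name, object key, and credentials are placeholders, and the hadoop-aws module is assumed to be on the classpath.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3aReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; in practice these usually come from the environment or IAM roles.
        conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
        conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

        // The same FileSystem API used for HDFS works against object storage via the s3a scheme.
        FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket"), conf);
        try (FSDataInputStream in = fs.open(new Path("s3a://my-bucket/input/data.csv"))) {
            byte[] buffer = new byte[4096];
            int read = in.read(buffer);
            System.out.println("Read " + read + " bytes from object storage");
        }
    }
}
```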