Since its creation, Hadoop has established itself as the reference framework for big data analysis. For a long time, however, data had to be loaded into the Hadoop cluster on the HDFS file system before it could be analyzed, and HDFS is neither the most efficient, nor the simplest, nor the most economical way to store large volumes of data.

The problem is that the increasingly sophisticated algorithms developed by data scientists need ever more data to feed them, and that data increasingly resides in object storage systems, which have the double merit of offering satisfactory performance at a very low cost per gigabyte. Object stores are also the storage medium of choice for a large share of next-generation web applications.

The idea of adding S3 protocol support to Hadoop emerged very quickly, so that the framework could natively access data stored in any S3-compatible store. This support allows Hadoop to analyze that data without having to preload it onto the cluster's HDFS nodes.

Native S3 support in Hadoop

S3 support in Hadoop has been steadily growing over the last ten years. The first native support for the object storage protocol appeared in the framework in 2008 with the inclusion of the native S3 client (s3n, for S3 Native Filesystem). The integration of s3n allowed Hadoop to read and write data natively in an existing S3 store, and thus to directly manipulate data held in one or more "buckets" without importing it into HDFS beforehand. This first implementation, however, suffered from performance problems and did not support some essential features of the S3 protocol.

With version 2.7, Hadoop officially replaced the s3n client with a new-generation client called s3a. This new client brought many improvements, notably around S3's identity and access management mechanisms, and significantly better performance (especially with tools such as Hive or Spark).
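
To make this concrete, here is a minimal sketch of listing a bucket through the s3a client with Hadoop's FileSystem API (the same access pattern applied to s3n, via s3n:// URIs). The bucket name, paths, and credentials are placeholders, and the hadoop-aws module is assumed to be on the classpath; the s3a:// scheme in the URI is what selects the s3a client.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3aListingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials: in practice these usually come from
        // core-site.xml, environment variables, or an IAM role.
        conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
        conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

        // "my-bucket" and "/data" are hypothetical names.
        FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}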

The arrival of s3a, however, did not eliminate all the problems of using the S3 protocol with Hadoop. One of the major remaining issues is that, unlike most on-premises object stores, the AWS S3 service is only eventually consistent. This means that after a write, update, or delete operation, a certain amount of time may need to elapse before the change is reflected in the system. When listing the contents of a directory, S3 may therefore return a state that does not reflect the most recent operations, which can be catastrophic for a MapReduce, Hive, or Spark job.
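
The sketch below illustrates how this can bite in practice: a job writes a new object and immediately lists the parent prefix. On an eventually consistent store, the fresh key may be absent from the listing, so a downstream step consuming that listing can silently miss data. Bucket and paths are again hypothetical.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EventualConsistencyDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), new Configuration());

        // Write a new object under a hypothetical output prefix.
        Path newFile = new Path("/output/part-00042");
        try (FSDataOutputStream out = fs.create(newFile)) {
            out.writeBytes("result data\n");
        }

        // Immediately list the parent "directory". On an eventually
        // consistent store, the new key may not show up yet, so a
        // downstream job consuming this listing could miss the file.
        for (FileStatus status : fs.listStatus(new Path("/output"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}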

S3Guard: a mechanism to ensure data consistency with AWS S3

To address this, the Hadoop developers created S3Guard, which protects against potential inconsistencies in AWS S3 by maintaining a log of the write operations performed on the service. This log is kept as metadata in a DynamoDB database and allows reads to verify the consistency of the information returned.

Since the metadata stored in DynamoDB can be used as a cache, S3Guard also speeds up certain read operations and boosts the performance of queries running directly against an S3 store. Note that, to date, S3Guard remains an experimental feature of the s3a client.
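
Enabling S3Guard essentially amounts to pointing the s3a client at a DynamoDB metadata store through configuration. Here is a minimal sketch, assuming a Hadoop build that ships S3Guard and using an illustrative table name and region; the same keys can equally be set in core-site.xml.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class S3GuardConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route s3a metadata through S3Guard's DynamoDB metadata store.
        conf.set("fs.s3a.metadatastore.impl",
                "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore");
        // Table name and region are illustrative values.
        conf.set("fs.s3a.s3guard.ddb.table", "s3guard-metadata");
        conf.set("fs.s3a.s3guard.ddb.region", "us-east-1");
        // Create the DynamoDB table on first use if it does not already exist.
        conf.setBoolean("fs.s3a.s3guard.ddb.table.create", true);

        // Directory listings through this FileSystem are now checked against
        // the DynamoDB write log, so reads see recent writes consistently.
        FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
        System.out.println("Connected to " + fs.getUri() + " with S3Guard enabled");
        fs.close();
    }
}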