Hadoop in the Cloud or: The Evolution of Hadoop into an API - Storage & Processing at Scale

Note: In an effort to share more often and react to feedback I might receive, I'm going to break this up into a series of posts rather than writing a book.

Part 1: An Introduction
Part 2: Capabilities and Components
Part 3: Storage & Processing at Scale (you are here)

Hadoop provides two core benefits: it can store substantial amounts of data, and it provides a framework for efficient large-scale processing of the data held in a cluster's storage layer. Given Hadoop's popularity and the innovation it has spurred across its ecosystem, we have many tools we can use to meet many needs, some part of Hadoop itself and many built beyond it.

Much of the information below is not new, and I'll only go into it to ensure all readers understand the value components like HDFS & YARN can bring (remember capabilities from the last post in this series?), which is critical in the conversation about running them in the cloud. Why run ANY technology system ANYWHERE if there is no clear value coming from it? I'm going to assume some familiarity with Hadoop, but if anything is not clear, please ask questions in the comments and I will refine.

HDFS

The Hadoop Distributed File System (HDFS) provides a general-purpose file service that can store data of virtually any size, type or characteristic. There are some limitations: for example, small files typically aren't stored efficiently in HDFS, and the write-once (read-many) nature of the file system can complicate some scenarios, but many times these are solved for in the larger architecture of a solution or implementation (out of scope in this case; you'll understand why in the next post).
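
As a small sketch of how an application talks to HDFS, the snippet below uses Hadoop's Java FileSystem API to write a file and read it back. The path and contents are made up for illustration, and the cluster settings are assumed to come from the usual core-site.xml/hdfs-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteReadExample {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath,
            // so fs.defaultFS points at the cluster's NameNode.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path, used only for illustration.
            Path path = new Path("/data/raw/events/sample.txt");

            // Files are write-once: create, write, close.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello from HDFS");
            }

            // Read the file back.
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
        }
    }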

I would say that HDFS is largely what enables the traditional ETL workload to become an ELT workload. ETL stands for Extract, Transform and Load, a widespread practice in data warehousing and analytics in previous decades. It involved reading data from sources, transforming it (either in an expensive data store like a data warehouse appliance or at least an SMP RDBMS that used expensive storage) and then writing it out to a data store that was consumable for analytics purposes by a business unit that needed answers yesterday. ELT changes the workflow: the same process still unlocks value for the business and answers its questions, but in a more flexible and agile fashion. For example, with ETL you always have to know the structure of your data before you can write it into, or transform it in, the expensive RDBMS, while ELT at least lets you land datasets in a store first and explore the data beyond samples.
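
As a hedged sketch of the "EL" part of ELT, the snippet below lands a raw export in HDFS as-is, with no schema or transformation required up front; the source file and landing directory names are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LandRawDataExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Extract & Load: copy the raw export into HDFS untouched.
            // Transformation can happen later, once we know what questions to ask.
            fs.copyFromLocalFile(
                    new Path("file:///tmp/crm_export.csv"),  // hypothetical source file
                    new Path("/data/raw/crm/"));             // hypothetical landing zone
        }
    }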

HDFS, like many enterprise technologies, has been built with the expectation that hardware will fail. The solution in HDFS is to keep multiple copies of the data (three by default), which allows inexpensive disks to be used rather than the expensive disks that typically populate enterprise-class storage systems. Multiple copies are also key to fault tolerance for any processing that needs the data: you can't continue processing if there is only one copy of the data and the cluster node storing it goes offline.
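
As a small sketch (reusing the hypothetical path from above), the default number of copies comes from the dfs.replication setting, and the Java API also lets you inspect or adjust replication per file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/data/raw/events/sample.txt"); // hypothetical path

            // Report the file's current replication factor (dfs.replication, 3 by default).
            FileStatus status = fs.getFileStatus(path);
            System.out.println("Replication factor: " + status.getReplication());

            // Ask the NameNode to keep an extra copy of this particular file.
            fs.setReplication(path, (short) 4);
        }
    }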

Another key capability of HDFS is that files are split into blocks, which allows the data (and any subsequent transformation or processing of it) to be read on multiple nodes. Since we're already storing data on multiple computers, why not also process it on multiple computers? We can break large processing jobs into smaller tasks running on as many nodes in the cluster as possible, and when we take this approach overall processing time goes down. That brings us to negotiating resources across the many nodes in a cluster.
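
To make block placement concrete, here is a hedged sketch that asks the NameNode where each block of a file lives (again reusing the hypothetical path from earlier); processing frameworks use the same information to schedule tasks close to the data.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/data/raw/events/sample.txt"); // hypothetical path
            FileStatus status = fs.getFileStatus(path);

            // One entry per block; each block reports the DataNodes holding a replica.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }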

YARN

When running large-scale processing across many computers, it can be a challenge to ensure the cluster is being used efficiently. Yet Another Resource Negotiator (YARN for short) was created to give Hadoop operators the ability to maximize the resources of a cluster of nodes that both store and process data. YARN allows different workloads to run on the same pool of storage and compute resources, with fine-grained control over the resources available to each. For example, if a Hadoop cluster is used by the data warehousing team for reading data from line-of-business source systems and writing into an enterprise data warehouse, the administrators can give that team 70% of the overall compute resources (CPU cores and memory) while ensuring the application R&D department gets the remaining 30% for its needs. This level of control is almost cloud-like in its elasticity (although, when deployed on premises, a Hadoop cluster does have a fixed top-end compute and storage capacity; more on the cloud soon).
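
Here is a hedged sketch of what that looks like from the client side, assuming a cluster whose administrator has defined queues and assigned each a share of the cluster (the 70/30 split and the queue names "warehouse" and "research" are made up); applications can query those shares through YARN's Java client API.

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.QueueInfo;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class QueueCapacityExample {
        public static void main(String[] args) throws Exception {
            // Reads yarn-site.xml from the classpath to find the ResourceManager.
            YarnConfiguration conf = new YarnConfiguration();
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(conf);
            yarn.start();

            // Each queue's capacity is the fraction of the cluster the administrator
            // configured for it (e.g. 0.70 for a hypothetical "warehouse" queue and
            // 0.30 for "research").
            List<QueueInfo> queues = yarn.getAllQueues();
            for (QueueInfo queue : queues) {
                System.out.printf("%s: capacity=%.2f max=%.2f current=%.2f%n",
                        queue.getQueueName(),
                        queue.getCapacity(),
                        queue.getMaximumCapacity(),
                        queue.getCurrentCapacity());
            }

            yarn.stop();
        }
    }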

Both HDFS and YARN provide capabilities that are otherwise hard to come by, yet are relatively straightforward and easy to understand. The architecture of each service is simple, typically built around master and worker roles that keep the cluster operational and provide necessary capabilities like fault tolerance, scalability, resiliency and other enterprise requirements.

Next time I'll discuss cloud concepts like virtualization, IaaS and PaaS, and why organizations find them valuable.

Comments

  • Anonymous
    June 07, 2017
    Nice article and looking forward to the rest of the series. I have a question about the basis for this statement: "It involved reading data from sources, transforming it (either in an expensive data store like a data warehouse appliance or at least an SMP RDBMS that used expensive storage)". In reviewing this full disclosure report from a recent TPC-H data warehouse benchmarking result, http://c970058.r58.cf2.rackcdn.com/fdr/tpch/hp~tpch~1000~hpe_proliant_dl380_gen9~fdr~2016-03-24~v01.pdf, I see the total storage cost is $144,000 while the MS SQL Server software cost is $297,000, or almost 2X the cost of storage. Isn't it time to retire the myth that storage costs are the reason that data warehouses built on RDBMS products are so expensive?
    Phil Hummel, Dell EMC
    • Anonymous
      June 07, 2017
      Hi Phil - Thank you for commenting. I wouldn't disagree that RDBMS software can be quite expensive. When thinking about costs I tend to think of incremental costs vs. overall. For example, if I already have an appliance or SMP database hosting 2 TB (an arbitrary example size) of data and I need to expand that to 4 TB, what do the expansion costs look like? Depending on the overall data usage you may not need to add CPU cores (or many of them) or more memory, just additional disk storage. I've run into cases like this when additional sources of data need to be transformed into the output warehouse dataset. If all transformation takes place in the data warehouse (appliance or SMP RDBMS), then the additional storage is typically the primary expansion cost.
      And when buying additional storage, one has to consider the price per GB/TB of the storage used. This is highly dependent on the type of SAN as well as the types of drives, but given the resiliency built into Hadoop software (HDFS & YARN) it's not uncommon to buy very inexpensive disks in the neighborhood of $0.07 to $0.10 per GB, while many enterprise storage systems require disks that cost in the range of $0.30 to $0.50 per GB. So it's not that the storage is more expensive than the software; it's more about the cost of the storage needed vs. the lower-cost storage that could be an option if the software and/or engineered systems were tuned to work with it.
      Oh, and I've run into a few enterprise customers who don't pay for support of their open source software, so with Hadoop (and some RDBMSs like PostgreSQL and others) that can sometimes be an option (though I wouldn't recommend it).