Hadoop: The Definitive Guide (2015)
Appendix B. Cloudera’s Distribution Including Apache Hadoop
Cloudera’s Distribution Including Apache Hadoop (hereafter CDH) is an integrated Apache Hadoop–based stack containing all the components needed for production, tested and packaged to work together. Cloudera makes the distribution available in several formats: Linux packages, virtual machine images, tarballs, and tools for running CDH in the cloud. CDH is free, released under the Apache 2.0 license, and available at http://www.cloudera.com/cdh.
As of CDH 5, the following components are included, many of which are covered elsewhere in this book:
Apache Avro
A cross-language data serialization library; includes rich data structures, a fast/compact binary format, and RPC
Apache Crunch
A high-level Java API for writing data processing pipelines that can run on MapReduce or Spark
Apache DataFu (incubating)
A library of useful statistical user-defined functions (UDFs) for large-scale analyses
Apache Flume
Highly reliable, configurable streaming data collection
Apache Hadoop
Highly scalable data storage (HDFS), resource management (YARN), and processing (MapReduce)
Apache HBase
Column-oriented real-time database for random read/write access
Apache Hive
SQL-like queries and tables for large datasets
Hue
Web UI to make it easy to work with Hadoop data
Cloudera Impala
Interactive, low-latency SQL queries on HDFS or HBase
Kite SDK
APIs, examples, and docs for building apps on top of Hadoop
Apache Mahout
Scalable machine-learning and data-mining algorithms
Apache Oozie
Workflow scheduler for interdependent Hadoop jobs
Apache Parquet (incubating)
An efficient columnar storage format for nested data
Apache Pig
Data flow language for exploring large datasets
Cloudera Search
Free-text, Google-style search of Hadoop data
Apache Sentry (incubating)
Granular, role-based access control for Hadoop users
Apache Spark
A cluster computing framework for large-scale in-memory data processing in Scala, Java, and Python
Apache Sqoop
Efficient transfer of data between structured data stores (like relational databases) and Hadoop
Apache ZooKeeper
Highly available coordination service for distributed applications
Cloudera also provides Cloudera Manager for deploying and operating Hadoop clusters running CDH.
To download CDH and Cloudera Manager, visit http://www.cloudera.com/downloads.