Big Data Resources - Big Data Bootcamp: What Managers Need to Know to Profit from the Big Data Revolution (2014)


APPENDIX A. Big Data Resources

General Information

· The home of the U.S. Government’s open data.

·        Feinleib, David. “Actionable Insights from Big Data.”

·        McCandless, David. “Information Is Beautiful: Ideas, Issues, Knowledge, Data—Visualized!”

·        McCandless, David. “The Beauty of Data Visualization.”

·        Mayer-Schönberger, Viktor and Kenneth Cukier. Big Data: A Revolution That Will Transform How We Live, Work, and Think. John Murray Publishers, London: 2013.

·        Rosling, Hans. “The Best Stats You’ve Ever Seen.”

·        Silver, Nate. The Signal and the Noise: Why So Many Predictions Fail—But Some Don’t. Penguin Press, New York: 2012.

·        Tableau Software data visualization gallery.

Big Data Software and Services

The Big Data Landscape provides the most comprehensive list of Big Data software and services. The following are great launching points for getting started with Big Data:

·        Amazon Elastic MapReduce (EMR):

·        Amazon Kinesis:

·        Apache Hadoop:

·        CartoDB:

·        Cassandra:

·        Cloudera:

·        Google Cloud Dataflow:

·        Hortonworks:

·        MapR:

·        MongoDB:

·        New Relic:

·        QlikTech:

·        Splunk:

·        Tableau Software:

Big Data Glossary

The following glossary defines some of the key terms used in the Big Data world. There are many more—and more to come as Big Data continues to evolve.

Amazon Kinesis—A cloud-based service for real-time processing of streaming Big Data such as financial data and tweets.

Apache Hadoop—An open source framework for processing large quantities of data, traditionally batch-oriented in nature.1

Apache Hive—Software for querying large datasets contained in distributed storage like Hadoop. Hive enables data querying using a SQL-like query language called HiveQL.

Batch—An approach to analyzing data in which data is processed in large chunks called batches.

Data analyst—A person responsible for analyzing, processing, and visualizing data.

Google BigQuery—A cloud-based service provided by Google for analyzing large quantities of data on Google’s infrastructure, using SQL-like queries.

Google Cloud Dataflow—A cloud-based service for data integration, preparation, real-time stream processing, and multi-step data processing pipelines. Google positions Cloud Dataflow as the successor to Hadoop and MapReduce.

HDFS—Hadoop Distributed File System, a distributed file system for storing large quantities of data that runs on commodity hardware.

Machine data—Data such as system logs generated by machines like computers, network equipment, cars, and other devices.

MapReduce—A programming model for processing large quantities of data in parallel on many nodes in a computer cluster.
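As a rough illustration of the MapReduce model, the sketch below counts words in plain Python. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for this example; they are not the Hadoop API. Everything runs on one machine, whereas a real cluster would spread the map and reduce calls across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # On a cluster, each document could be mapped on a different node.
    # Here the map calls simply run one after another.
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle(pairs))


counts = word_count(["big data is big", "data moves fast"])
print(counts)
```

The appeal of the model is that the map and reduce steps do not depend on each other's internal state, so the same logic can in principle be run on two machines or two thousand.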

MongoDB—An open source document database and the leading NoSQL database.2

NoSQL—A database system in which data is not stored in traditional relational form but rather in key-value, graph, or other format.
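To make the contrast with relational storage concrete, here is a minimal sketch that uses an ordinary Python dictionary as a stand-in for a key-value or document store. The sample records and the `get_field` helper are made up for illustration and do not correspond to any particular NoSQL product.

```python
# A relational table fixes its columns up front: every row has the same fields.
relational_rows = [
    ("u1", "Ana", "ana@example.com"),
    ("u2", "Ben", None),  # no email on file, but the column still exists
]

# A key-value / document store maps a key to a free-form record,
# so two records can carry different fields.
document_store = {
    "u1": {"name": "Ana", "email": "ana@example.com"},
    "u2": {"name": "Ben", "devices": ["phone", "watch"]},
}


def get_field(store, key, field, default=None):
    """Look up one field of one record, tolerating missing keys and fields."""
    record = store.get(key, {})
    return record.get(field, default)


print(get_field(document_store, "u2", "devices"))
print(get_field(document_store, "u2", "email"))
```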

Quantified Self—A movement to better understand ourselves by tracking our personal data.3

Real-time—The analysis of data as it becomes available. Real-time analysis is in contrast to batch-based analysis, which occurs minutes, hours, or days after data is received.
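The difference between batch and real-time analysis can be sketched with a simple average. The names `batch_average` and `StreamingAverage` and the sample readings are assumptions made for this example. The batch version waits for all the data before computing anything, while the streaming version updates its answer as each value arrives.

```python
def batch_average(readings):
    """Batch: wait until all readings are collected, then compute once."""
    return sum(readings) / len(readings)


class StreamingAverage:
    """Real-time: keep a running result and update it per reading."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count


readings = [10.0, 20.0, 30.0]

# Batch result is available only after the last reading has arrived.
final = batch_average(readings)

# Streaming result is available after every single reading.
stream = StreamingAverage()
running = [stream.update(r) for r in readings]

print(final, running)
```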

Relational database—Software that stores data in a structured form that indicates the relationships between different data tables and elements. Oracle, Microsoft SQL Server, MySQL, and PostgreSQL are well-known relational databases.

Semi-structured data—Data that shares characteristics of both structured and unstructured data.
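JSON is one common example of semi-structured data. In the hedged sketch below, the sample record and its field names (`user`, `text`, `tags`, `geo`) are invented for illustration: some fields are named and predictable, the `text` field is free-form, and the `geo` field may or may not be present.

```python
import json

# One record as it might arrive from a hypothetical feed.
raw = (
    '{"user": "ana", "text": "Loving the new phone", '
    '"tags": ["phone"], "geo": {"city": "Austin"}}'
)

record = json.loads(raw)

user = record["user"]  # structured: a named, predictable field
text = record["text"]  # unstructured: free-form text inside the record
city = record.get("geo", {}).get("city")  # optional: may be absent

print(user, city)
```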

SQL—Structured Query Language; a language for storing data in and retrieving data from relational databases.
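A small example may help show what storing and retrieving data with SQL looks like. This sketch uses SQLite through Python's built-in `sqlite3` module, with an in-memory database. The `customers` and `orders` tables and their contents are made up for illustration.

```python
import sqlite3

# An in-memory database that disappears when the connection closes.
conn = sqlite3.connect(":memory:")

# Two related tables: each order points at the customer who placed it.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "amount REAL)"
)

# Storing data.
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ana"), (2, "Ben")],
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 25.0), (1, 15.0), (2, 40.0)],
)

# Retrieving data: total order amount per customer, via a join.
rows = conn.execute(
    "SELECT c.name, SUM(o.amount) "
    "FROM customers c JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
conn.close()

print(rows)
```

The join on `customer_id` is the relationship between tables that the relational database entry describes.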

Unstructured data—Data such as raw text, images, and videos that does not contain well-defined structures defining how one piece of data relates to another.

Visualization—A way to see large quantities of data in graphical form.

Volume, variety, and velocity—Also known as the three Vs, these are three commonly used measures of Big Data. Volume is how much data there is; variety refers to the kinds of data; and velocity refers to how quickly that data is moving.