Time Series Databases: New Ways to Store and Access Data (2014)
Chapter 8. What’s Next?
The shape of the data landscape has changed, and it’s about to undergo an even bigger upheaval. New technologies have made it reasonable and cost-effective to collect and analyze much larger amounts of data, including time series data. That change, in turn, has enticed people to greatly expand where, how, and how much data they want to collect. This is not just a matter of doing familiar things with more data at higher frequency, such as tracking stock trades in fractions of a second or measuring residential energy usage every few minutes instead of once a month. The combination of rapidly increasing scale and emerging technologies for collecting and analyzing data is creating both the desire and the ability to do new things.
This ability to try something new raises the question: what’s next? Before we take a look forward, let’s review the key ideas we have covered so far.
A New Frontier: TSDBs, Internet of Things, and More
The way we watch the world is new. Machine sensors “talk to” servers, and machines talk to each other. Analysts collect data from social media for sentiment analysis to find trends and see whether they correlate with the behavior of stock trading. Robots wander across the surface of the oceans, taking repeated measurements of a variety of parameters as they go. Manufacturers not only monitor manufacturing processes to fine-tune quality control, they also produce “smart parts” as components of high-tech equipment that report back on their function from the field. The already widespread use of sensor data is about to expand vastly as creative companies find new ways to deploy sensors, such as embedding them into fabric to make “smart clothes” that monitor parameters including heart function. There are also many wearable devices for reporting on a person’s health and activity. One of the most widespread sources of machine data already in use is system logs from data center monitoring. As techniques such as those described in this book become widely known, more and more people are choosing to collect data as time series. Going forward, where will you find time series data? The answer is: essentially everywhere.
These types of sensors take an enormous number of measurements, which raises the issue of how to make use of the enormous influx of data they produce. New methods are needed to deal with the entire time series pipeline from sensor to insight. Sensor data must be collected at the site of measurement and communicated. Transport technologies are needed to carry this information to the platform used for central storage and analysis. That’s where the methods for scalable time series databases come in. These new TSDB technologies lie at the heart of the IoT and more.
This evolution is natural: doing new things calls for new tools, and time series databases for very large-scale datasets are important tools. Services are emerging to provide technology that is custom-designed to handle the large-scale time series data typical of sensors. In this book, however, we have focused on how to build your own time series database, one that is cost-effective and provides excellent performance at high data rates and very large volumes.
We recommend using Apache Hadoop–based NoSQL platforms—such as Apache HBase or MapR-DB—for building large-scale, non-relational time series databases because of their scalability and the efficiency of data retrieval they provide for time series data. When is that the right solution? In simple terms, a time series database is the right choice when you have a very large amount of data that requires a scalable technology and when the queries you want to make are mainly based on a time span.
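To illustrate why queries based on a time span suit this kind of platform, here is a minimal sketch, not HBase or MapR-DB code. HBase keeps rows sorted by row key, so if the key is built from a metric name followed by a timestamp, all the samples for a time span sit next to each other and can be read as one contiguous range. The `SortedStore` class and its `put` and `scan` methods are hypothetical names invented for this illustration.

```python
import bisect


class SortedStore:
    """Toy in-memory stand-in (hypothetical) for a store that keeps rows sorted by key."""

    def __init__(self):
        self._keys = []     # sorted list of (metric, timestamp) keys
        self._values = {}   # key -> measured value

    def put(self, metric, timestamp, value):
        key = (metric, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys in sorted order
        self._values[key] = value

    def scan(self, metric, start, end):
        """Return (timestamp, value) pairs with start <= timestamp < end."""
        lo = bisect.bisect_left(self._keys, (metric, start))
        hi = bisect.bisect_left(self._keys, (metric, end))
        # The matching rows form one contiguous slice of the sorted keys.
        return [(ts, self._values[(m, ts)]) for m, ts in self._keys[lo:hi]]


store = SortedStore()
store.put("cpu", 100, 1.0)
store.put("cpu", 200, 2.0)
store.put("cpu", 300, 3.0)
store.put("mem", 150, 9.0)
print(store.scan("cpu", 100, 300))
```

The scan locates the two ends of the range by binary search and never touches rows outside it, which is the property that makes time-span queries cheap on a sorted key space.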
New Options for Very High-Performance TSDBs
We’ve described some open source tools and new approaches for building large-scale time series databases. These include OpenTSDB, code extensions developed by MapR that modify OpenTSDB, and a convenient user interface called Grafana that works with OpenTSDB.
The design of the data workflow, data format, and table organization all affect the performance of a time series database. Data can be loaded point by point into wide tables in a NoSQL-style, non-relational storage tier, which gives better performance and scalability than a traditional relational database schema with one row per data point. For even faster retrieval, a hybrid table design can be achieved with a data flow that retrieves data from the wide table, compresses it into blobs, and reloads the table with compacted rows. Unmodified OpenTSDB produces this hybrid-style storage tier. To greatly improve the rate of ingestion, you can use the new open source extensions developed by MapR to enable direct blob insertion. This style also solves the problem of how to ingest enough data quickly to test a very large-volume database. This novel design has achieved rates as high as 100 million data points per second, a stunning advancement.
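The three storage styles summarized above can be compared in a small sketch. This is an illustration under stated assumptions, not the actual storage format or API of the tools described in this book: the table is a plain Python dictionary, the one-hour row window is an assumed value, JSON stands in for real compression, and the names `insert_point`, `compact_row`, `insert_blob`, and `read_row` are hypothetical.

```python
import json
from collections import defaultdict

WINDOW_SECONDS = 3600  # assumption: one row per metric per hour
BLOB = "blob"          # hypothetical name for the single compacted column


def row_key(metric, timestamp):
    """Row key: the metric plus the start of the window containing the timestamp."""
    return (metric, timestamp - timestamp % WINDOW_SECONDS)


def insert_point(table, metric, timestamp, value):
    """Wide table, point-by-point load: one column per sample, named by its offset."""
    table[row_key(metric, timestamp)][timestamp % WINDOW_SECONDS] = value


def compact_row(table, key):
    """Hybrid design: read a wide row back and rewrite it as a single blob column."""
    columns = table[key]
    if BLOB in columns:
        return  # already compacted
    table[key] = {BLOB: json.dumps(sorted(columns.items()))}


def insert_blob(table, metric, window_start, points):
    """Direct blob insertion: write a whole window at once, skipping the wide stage."""
    table[(metric, window_start)] = {BLOB: json.dumps(sorted(points))}


def read_row(table, key):
    """Return sorted (offset, value) pairs whether the row is wide or compacted."""
    columns = table[key]
    if BLOB in columns:
        return [tuple(point) for point in json.loads(columns[BLOB])]
    return sorted(columns.items())


table = defaultdict(dict)
insert_point(table, "temp", 7200, 20.5)
insert_point(table, "temp", 7260, 20.7)
key = row_key("temp", 7200)
compact_row(table, key)
print(read_row(table, key))
```

In this sketch a reader sees the same points before and after compaction. The difference is that a compacted row is one value to fetch rather than many columns, and direct blob insertion reaches that state with a single write per window instead of one write per point followed by a rewrite.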
We’ve also described some of the ways in which time series data is useful in practical machine learning. For example, models based on the combination of a time series database for sensor measurements and long-term, detailed maintenance histories make it possible to do predictive maintenance scheduling. This book also looked at the advanced topic of building a geo-temporal database.
Looking to the Future
What’s next? The sky’s the limit…and so is the ocean, the farm, your cell phone, the stock market, medical nano-sensors implanted in your body, and possibly the clothes you are wearing. We started our discussion with some pioneering examples of extremely valuable insights discovered through patterns and trends in time series data. From Maury’s Wind and Current Charts and the long-term environmental monitoring started by Keeling with his CO2 measurements to the modern exploration of our planet by remote sensors, time series data has been shown to be a rich resource. And now, as we move into the uncharted waters of new invention, who knows where the journey will take us?
The exciting thing is that by building the fundamental tools and approaches described here, the foundation is in place to support innovations with time series data. The rest is up to your imagination.
Figure 8-1. An excerpt from Maury’s Wind and Current Charts, which were based on time series data. Ship captains used these charts to optimize their routes.