Data Science For Dummies (2016)

Part 5

Applying Domain Expertise to Solve Real-World Problems Using Data Science

Chapter 19

Delving into Environmental Data Science

IN THIS CHAPTER

· Modeling environmental-human interaction

· Applying statistical modeling to natural resources in the raw

· Predicting for a location-dependent environmental phenomenon

Because data science can be used to successfully reverse-engineer business growth and increase revenues, many of its more noble applications often slide by, completely unnoticed. Environmental data science, one such application, is the use of data science techniques, methodologies, and technologies to address or solve problems that are related to the environment. This branch of data science falls into three main categories: environmental intelligence, natural resource modeling, and spatial statistics for predicting environmental variation. In this chapter, I discuss each type of environmental data science and how it’s being used to make a positive impact on human health, safety, and the environment.

Modeling Environmental-Human Interactions with Environmental Intelligence

The purpose of environmental intelligence (EI) is to convert raw data into insights that can be used for data-informed decision making about matters that pertain to environmental-human interactions. EI solutions are designed to support the decision making of community leaders, humanitarian response decision-makers, public health advisors, environmental engineers, policy makers, and more. If you want to collect and analyze environmentally relevant data in order to produce content that’s crucial for your decision making process — like real-time maps, interactive data visualizations, and tabular data reports — look into an EI solution.

In the following four sections, I discuss the type of problems being solved by using EI technologies and specify which organizations are out there using EI to make a difference. I explain the ways in which EI is similar to business intelligence (BI) and the reasons it qualifies as applied data science despite those similarities. I wrap up this main section with a real-world example of how EI is being used to make a positive impact.

Examining the types of problems solved

EI technologies are used to monitor and report on interactions between humans and the natural environment. This information provides decision-makers and stakeholders with real-time intelligence about on-the-ground happenings, in the hope of avoiding preventable disasters and community hardships through proactive, data-informed decision making. EI technologies are being used to achieve the following types of results:

· Make responsible energy-consumption plans: EI technology is just what you need in order to audit, track, monitor, and predict energy consumption rates. Here, energy is the natural resource, and people’s consumption of energy is the human interaction. (Note that you can use KNIME to help you build this type of EI solution directly on your desktop computer; check out Chapter 17 for more on KNIME.)

· Expedite humanitarian relief efforts: EI of this type involves crisis mapping, where EI technology is used to collect, map, visualize, analyze, and report environmental data that is relevant to the crisis at hand. Crisis mapping provides real-time humanitarian decision support. Here, water supply, sanitation, natural disaster, and hygiene status are measures of natural resources, and the effects that these resources have on the health and safety of people are the human interaction.

· Improve water resource planning: You can use EI technologies to generate predictive models for water consumption, based on simple inference from statistically derived population forecasts, historical water-consumption rates, and spatial data on existing infrastructure. Water supply is the natural resource here, and people’s consumption of it is the human interaction.

· Combat deforestation: EI technologies are being used in real-time, interactive, map-based data visualization platforms to monitor deforestation in remote regions of developing nations. This type of solution utilizes conditional autoregressive modeling, spatial analysis, web-based mapping, data analytics, data cubes, and time series analyses to map, visualize, analyze, and report deforestation in near real time. These uses of EI increase transparency and public awareness of this important environmental issue. In this application, forestry and land are the natural resources, and the act of cutting down trees is the human interaction.
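To make the water-planning idea concrete, here’s a minimal Python sketch of simple inference: It derives an average per-person consumption rate from historical records and multiplies that rate by a population forecast. The function names and every number are hypothetical, invented purely for illustration. A real EI solution would also bring in the spatial infrastructure data mentioned in the list.

```python
def per_capita_rate(history):
    """Average liters per person per day from (population, liters_per_day) records."""
    rates = [liters_per_day / population for population, liters_per_day in history]
    return sum(rates) / len(rates)


def project_water_demand(population_forecast, liters_per_person_per_day):
    """Return projected annual demand in cubic meters for each forecast year."""
    demand = {}
    for year, population in population_forecast.items():
        liters_per_year = population * liters_per_person_per_day * 365
        demand[year] = liters_per_year / 1000.0  # 1 cubic meter = 1,000 liters
    return demand


# Hypothetical inputs, made up for this example only.
history = [(90_000, 13_500_000), (95_000, 14_250_000)]
population_forecast = {2025: 100_000, 2030: 110_000}

rate = per_capita_rate(history)
demand = project_water_demand(population_forecast, rate)

for year, cubic_meters in demand.items():
    print(year, round(cubic_meters), "cubic meters per year")
```

Notice that nothing here is a statistical model: The prediction is plain arithmetic applied to a forecast that someone else has already produced, which is exactly what makes this kind of inference fast and easy to explain to decision-makers.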

Defining environmental intelligence

Although environmental intelligence (EI) and business intelligence (BI) technologies have a lot in common, EI is still considered applied data science. Before I delve into the reasons for this difference, first consider the following ways in which EI and BI are similar:

· Simple inference from mathematical models: As discussed in Chapter 3, BI generates predictions based on simple mathematical inference, and not from complex statistical predictive modeling. Many of the simpler EI solutions derive their predictions from simple mathematical inference as well.

· Data structure and type: Like BI, many of the simpler EI products are built solely from structured data that sits in a relational SQL database.

· Decision-support product outputs: Both EI and BI produce data visualizations, maps, interactive data analytics, and tabular data reports as decision-support products.

Much of the EI approach was borrowed from the BI discipline. However, EI evolved away from BI technologies when its features were upgraded and expanded to solve real-world environmental problems. When you look into the data science features that are central to most EI solutions, the evolution of EI away from standard BI becomes increasingly obvious. Here are a few data science processes you won’t find used in BI but will find in EI technology:

· Statistical programming: The most basic EI solutions deploy time series analysis and autoregressive modeling. Many of the more advanced solutions make use of complex statistical models and algorithms as well.

· GIS technology: Because environmental insights are location-dependent, it’s almost impossible to avoid integrating geographic information systems (GIS) technologies into EI solutions. Not only that, but almost all web-based EI platforms require advanced spatial web programming. (For more on GIS technologies, check out Chapter 13.)

· Web-based data visualization: Almost all web-based EI platforms offer interactive, near-real-time data visualizations that are built on the JavaScript programming language.

· Data sources: Unlike BI, EI solutions are built almost solely from external data sources. These sources often include data autofeeds derived from image, social media, and SMS sources. Other external data comes in the form of satellite data, scraped website data, or .pdf documents that need to be converted via custom optical text-recognition scripts. In EI, the reported data is almost always updated in real time.

Remember: Web scraping is a process that involves setting up automated programs to scour and extract the data you need straight from the Internet. The data you generate from this type of process is commonly called scraped data.

· Coding requirements: EI solutions almost always require advanced custom coding. Whether in the form of web programming or statistical programming, extra coding work is required in order to deliver EI products.
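The autoregressive modeling mentioned in the statistical programming bullet can be sketched in a few lines of Python. This example fits a first-order autoregressive model, x[t] = c + phi * x[t-1], by ordinary least squares and then rolls it forward to forecast. The function names and the toy series are my own assumptions. The series is noise-free and was built with c = 2 and phi = 0.5 so that you can check the fit by eye; real environmental data is never this tidy.

```python
def fit_ar1(series):
    """Estimate c and phi in x[t] = c + phi * x[t-1] by ordinary least squares."""
    x_prev = series[:-1]
    x_next = series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    cov = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    var = sum((a - mean_prev) ** 2 for a in x_prev)
    phi = cov / var
    c = mean_next - phi * mean_prev
    return c, phi


def forecast_ar1(last_value, c, phi, steps):
    """Roll the fitted model forward, feeding each forecast into the next step."""
    forecasts = []
    value = last_value
    for _ in range(steps):
        value = c + phi * value
        forecasts.append(value)
    return forecasts


# Hypothetical, noise-free series generated from c = 2 and phi = 0.5.
series = [10.0, 7.0, 5.5, 4.75, 4.375]

c, phi = fit_ar1(series)
next_two = forecast_ar1(series[-1], c, phi, steps=2)
print(c, phi, next_two)
```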

Identifying major organizations that work in environmental intelligence

Because EI is a social-good application of data science, there aren’t a ton of funding sources out there, which is probably the chief reason not many people are working in this line of data science. EI is small, but some folks in dedicated organizations have found a way to earn a living by creating EI solutions that serve the public good. In the following list, I name a few of those organizations, as well as the umbrella organizations that fund them. If your goal is to use EI technologies to build products that support decision making for the betterment of environmental health and safety, one of these organizations will likely be willing to help you with advice or even support services:

· DataKind (www.datakind.org): A nonprofit organization of data science volunteers who donate their time and skills to work together in the service of humanity, DataKind was started by the data science legend Jake Porway. The organization has donated EI support to projects in developing nations and first-world countries alike. DataKind’s sponsors include National Geographic, IBM, and Pop! Tech.

· Elva (www.elva.org): A nongovernmental organization, Elva was built by a small, independent group of international digital humanitarians — knowledge workers who use data and disruptive technologies to build solutions for international humanitarian problems. Elva founders gave their time and skills to build a mobile-phone platform, which allows marginalized communities to map local needs and to work with decision-makers to develop effective joint-response plans. Elva offers EI support for environmental projects that are centered in underserved, developing nations. Elva is directed by Jonne Catshoek and is sponsored by UNDP, USAID, and Eurasia Partnership.

· Vizzuality (www.vizzuality.com): Here’s a business started by the founders of CartoDB — a technology that’s discussed further in Chapter 11. Almost all of Vizzuality’s projects involve using EI to serve the betterment of the environment. Vizzuality was founded by Javier de la Torre, and some of the organization’s bigger clients have included Google, UNEP, NASA, the University of Oxford, and Yale University.

· QCRI (www.qcri.com): The Qatar Computing Research Institute (QCRI) is a national organization that’s owned and funded by a private, nonprofit, community development foundation in Qatar. The social-innovation section delivers some ongoing environmental projects, including Artificial Intelligence in Disaster Response (AIDR) and a crowdsourced verification-for-disaster-response platform (Verily).

Making positive impacts with environmental intelligence

Elva is a shining example of how environmental intelligence technologies can be used to make a positive impact. This free, open-source platform facilitates cause mapping and data visualization reporting for election monitoring, human rights violations, environmental degradation, and disaster risk in developing nations.

In one of its more recent projects, Elva has been working with Internews, an international nonprofit devoted to fostering independent media and access to information, to map crisis-level environmental issues in the Central African Republic, one of the most impoverished, underdeveloped nations in the world. As part of these efforts, local human rights reporters and humanitarian organizations are using Elva to monitor, map, and report information derived from environmental data on natural disasters, infrastructure, water, sanitation, hygiene, and human health. The purpose of Elva’s involvement on this project is to facilitate real-time humanitarian-data analysis and visualization to support the decision making of international humanitarian-relief experts and community leaders.

With respect to data science technologies and methodologies, Elva implements

· Autofeeds for data collection: The data that’s mapped, visualized, and reported through the Elva platform is actually created by citizen activists on the ground who use SMS and smartphones to report environmental conditions by way of reports or surveys. The reporting system is built so that all reports come in with the correct structure, are collected by service-provider servers, and then are pushed over to the Elva database.

· Non-relational database technologies: Elva uses a non-relational NoSQL database infrastructure to store survey data submitted by smartphone and SMS, as well as other sources of structured, unstructured, and semistructured data.

· Open data: OpenStreetMap powers the map data that the Elva platform uses. You can find out more about OpenStreetMap in Chapter 22, where I focus on open data resources — data resources that have been made publicly available for use, reuse, modification, and sharing with others.

· Inference from mathematical and statistical models: Elva’s data analysis methods aren’t overly complex, but that’s perfect for producing fast, real-time analytics for humanitarian decision support. Elva depends mostly on time series analysis, linear regression, and simple mathematical inference.

· Data visualization: Elva produces data visualizations directly from reported data and also from inferential analyses. These are interactive JavaScript visualizations built from the Highcharts API.

· Location-based predictions: Such predictions are based on simple inference and not on advanced spatial statistics, as discussed in the section “Using Spatial Statistics to Predict for Environmental Variation across Space,” later in this chapter. Elva staff can infer locations of high risk based on historical time series reported in the region.
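Here’s a rough Python sketch of what location-based inference from historical time series could look like. It fits a least-squares trend line to the report counts for each location and ranks the locations by how fast their counts are rising. This is not Elva’s code: The function names, district names, and counts are all hypothetical, and the approach is just one simple way to combine linear regression with time series data.

```python
def linear_trend(values):
    """Least-squares slope and intercept of values against their index 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rank_locations_by_trend(reports_by_location):
    """Order locations from the fastest-rising report counts to the fastest-falling."""
    slopes = {
        name: linear_trend(counts)[0]
        for name, counts in reports_by_location.items()
    }
    return sorted(slopes, key=slopes.get, reverse=True)


# Hypothetical weekly counts of incident reports per district.
reports_by_location = {
    "District A": [2, 4, 6, 8],
    "District B": [5, 5, 5, 5],
    "District C": [9, 7, 5, 3],
}

ranking = rank_locations_by_trend(reports_by_location)
print(ranking)
```

A location at the top of the ranking isn’t proven to be high risk; it’s simply flagged for a closer look by someone who knows the region.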

Modeling Natural Resources in the Raw

You can use data science to model natural resources in their raw form. This type of environmental data science generally involves some advanced statistical modeling to better understand natural resources. You model the resources in the raw — water, air, and land conditions as they occur in nature — to better understand the natural environment’s organic effects on human life.

In the following sections, I explain a bit about the type of natural-resource issues that most readily lend themselves to exploration via environmental data science. Then I offer a brief overview about which data science methods are particularly relevant to environmental resource modeling. Lastly, I present a case in which environmental data science has been used to better understand the natural environment.

Exploring natural resource modeling

Environmental data science can model natural resources in the raw so that you can better understand environmental processes in order to comprehend how those processes affect life on Earth. After environmental processes are clearly understood, then and only then can environmental engineers step in to design systems to solve problems that these natural processes may be creating. The following list describes the types of natural-resource issues that environmental data science can model and predict:

· Water issues: Rainfall rates, geohydrologic patterns, groundwater flows, and groundwater toxin concentrations

· Air issues: The concentration and dispersion of particulate-matter levels and greenhouse gas concentrations

· Land issues: Soil contaminant migration and geomorphology as well as geophysics, mineral exploration, and oil and gas exploration

If your goal is to build a predictive model that you can use to help you better understand natural environmental processes, you can use natural resource modeling to help you. Don’t expect natural-resource modeling to be easy, though. The statistics that go into these types of models can be incredibly complex.

Dabbling in data science

Because environmental processes and systems involve many different interdependent variables, most natural-resource modeling requires the use of incredibly complex statistical algorithms. The following list shows a few elements of data science that are commonly deployed in natural-resource modeling:

· Statistics, math, and machine learning: Bayesian inference, multilevel hierarchical Bayesian inference, multitaper spectral analysis, copulas, Wavelet Autoregressive Method (WARM), autoregressive moving average (ARMA) models, Monte Carlo simulations, structured additive regression (STAR) models, regression on order statistics (ROS), maximum likelihood estimation (MLE), expectation-maximization (EM), linear and nonlinear dimension reduction, wavelet analysis, frequency domain methods, Markov chains, k-nearest neighbor (kNN), kernel density, and logspline density estimation, among other methods

· Spatial statistics: Generally, something like probabilistic mapping

· Data visualization: As in other data science areas, needed for exploratory analysis and for communicating findings with others

· Web-scraping: Many times, required when gathering data for environmental models

· GIS technology: Spatial analysis and mapmaking

· Coding requirements: Using Python, R, SPSS, SAS, MATLAB, Fortran, and SQL, among other programming languages
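Two of the methods in the preceding list, Markov chains and Monte Carlo simulation, combine nicely in a small example. The following Python sketch treats each day as either wet or dry, lets the chance of rain depend on whether the previous day was wet, and then simulates many seasons to see how the number of wet days varies. The transition probabilities, season length, and function names are assumptions chosen for illustration, not values from any real rainfall record.

```python
import random


def simulate_wet_days(p_wet_after_dry, p_wet_after_wet, n_days, rng):
    """Simulate one season with a two-state Markov chain; return the wet-day count."""
    wet = False  # assumption: the season starts after a dry day
    wet_days = 0
    for _ in range(n_days):
        p_wet = p_wet_after_wet if wet else p_wet_after_dry
        wet = rng.random() < p_wet
        if wet:
            wet_days += 1
    return wet_days


def monte_carlo_wet_days(p_wet_after_dry, p_wet_after_wet, n_days, n_runs, seed=0):
    """Repeat the season simulation n_runs times with a seeded random generator."""
    rng = random.Random(seed)
    return [
        simulate_wet_days(p_wet_after_dry, p_wet_after_wet, n_days, rng)
        for _ in range(n_runs)
    ]


# Hypothetical transition probabilities for a 90-day season.
runs = monte_carlo_wet_days(0.2, 0.6, n_days=90, n_runs=2000)
mean_wet_days = sum(runs) / len(runs)
print(mean_wet_days, min(runs), max(runs))
```

With these assumed probabilities, the chain settles at roughly one wet day in three, so the simulated average should land near 30 wet days per season. The spread between the smallest and largest counts is the interesting part: It gives you a feel for how much a season can vary purely by chance.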

Modeling natural resources to solve environmental problems

The work of Columbia Water Center’s director, Dr. Upmanu Lall, provides a world-class example of using environmental data science to solve incredibly complex water resource problems. (For an overview of the Columbia Water Center’s work, see http://water.columbia.edu/.) Dr. Lall uses advanced statistics, math, coding, and staggering subject-matter expertise in environmental engineering to uncover complex, interdependent relationships between global water-resource characteristics, national gross domestic products (GDPs), poverty, and national energy consumption rates.

In one of Dr. Lall’s recent projects, he found that in countries with high rainfall variability — countries that experience extreme droughts followed by massive flooding — the instability results in a lack of stable water resources for agricultural development, more runoff and erosion, and overall decreases in that nation’s GDP. The inverse is also true, where countries that have stable, moderate rainfall rates have a better water resource supply for agricultural development, better environmental conditions overall, and higher average GDPs. So, using environmental data science, Dr. Lall has been able to draw strong correlations between a nation’s rainfall trends and its poverty rates.

With respect to data science technologies and methodologies, Dr. Lall implements these tools:

· Statistical programming: Dr. Lall’s arsenal includes multilevel hierarchical Bayesian models, multitaper spectral analysis, copulas, the Wavelet Autoregressive Method (WARM), autoregressive moving average (ARMA) models, and Monte Carlo simulations.

· Mathematical programming: Tools here include linear and nonlinear dimension reduction, wavelet analysis, frequency domain methods, and nonhomogeneous hidden Markov models.

· Clustering analysis: In this case, Dr. Lall relies on the tried-and-true methods, including k-nearest neighbor, kernel density, and logspline density estimation.

· Machine learning: Here, Dr. Lall focuses on minimum variance embedding.
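Kernel density estimation, one of the methods named in the preceding list, is simple enough to sketch in plain Python. The idea is to place a small bell curve on every observation and add the curves up, which turns a handful of measurements into a smooth estimate of how likely each value is. This is a generic textbook version with a Gaussian kernel, not Dr. Lall’s code, and the rainfall values and bandwidth are hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the probability density at a value x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical annual rainfall totals in millimeters.
rainfall_mm = [620, 650, 700, 710, 720, 900]

density = gaussian_kde(rainfall_mm, bandwidth=40)
print(density(700), density(900))
```

The bandwidth controls how wide each bell curve is. A small value produces a spiky estimate that hugs the individual observations, and a large value smooths everything into one broad hump, so in practice you choose it with care rather than guessing as I did here.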

Using Spatial Statistics to Predict for Environmental Variation across Space

By their very nature, environmental variables are location-dependent: They change with changes in geospatial location. The purpose of modeling environmental variables with spatial statistics is to enable accurate spatial predictions so that you can use those predictions to solve problems related to the environment.

Spatial statistics is distinguished from natural-resource modeling because it focuses on predicting how changes in space affect environmental phenomena. Naturally, the time variable is considered as well, but spatial statistics is all about using statistics to model the inner workings of spatial phenomena. The difference is in the manner of approach.

In the following three sections, I discuss the types of issues you can address with spatial statistical models and the data science that goes into this type of solution. You can read about a case in which spatial statistics has been used to correlate natural concentrations of arsenic in well water with incidence of cancer.

Addressing environmental issues with spatial predictive analytics

You can use spatial statistics to model environmental variables across space and time so that you can predict changes in environmental variables across space. The following list describes the types of environmental issues that you can model and predict using spatial statistical modeling:

· Epidemiology and environmental human health: Disease patterns and distributions

· Meteorology: Weather phenomena

· Fire science: The spread of a fire (by channeling your inner Smokey Bear!)

· Hydraulics: Aquifer conductivity

· Ecology: Microorganism distribution across a sedimentary lake bottom

If your goal is to build a model that you can use to predict how change in space will affect environmental variables, you can use spatial statistics to help you do this. In the next section, I give a quick overview of the basics that are involved in spatial statistics.

Describing the data science that’s involved

Because spatial statistics involves modeling the x-, y-, z-parameters that comprise spatial datasets, the statistics involved can get rather interesting and unusual. Spatial statistics is, more or less, a marriage of GIS spatial analysis and advanced predictive analytics. The following list describes a few data science processes that are commonly deployed when using statistics to build predictive spatial models:

· Spatial statistics: Spatial statistics often involves krige and kriging, as well as variogram analysis. The terms “kriging” and “krige” denote different things. Kriging methods are a set of statistical estimation algorithms that curve-fit known point data and produce a predictive surface for an entire study area. Krige represents an automatic implementation of kriging algorithms, where you use simple default parameters to help you generate predictive surfaces. A variogram is a statistical tool that measures how different spatial data becomes as the distance between data points increases; in other words, it’s a measure of “spatial dissimilarity.” When you krige, you use variogram models with internally defined parameters to generate interpolative, predictive surfaces.

· Statistical programming: This one involves probability distributions, time series analyses, regression analyses, and Monte Carlo simulations, among other processes.

· Clustering analysis: Processes can include nearest-neighbor algorithms, k-means clustering, or kernel density estimations.

· GIS technology: GIS technology pops up a lot in this chapter, but that’s to be expected because its spatial analysis and map-making offerings are incredibly flexible.

· Coding requirements: Programming for a spatial statistics project could entail using R, SPSS, SAS, MATLAB, and SQL, among other programming languages.
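To see what “spatial dissimilarity” means in practice, consider this minimal Python sketch of an empirical semivariogram. It groups every pair of sample points by how far apart they are and, for each distance bin, reports half the average squared difference between the paired values. The function name, bin settings, and sample points are hypothetical and chosen so that the arithmetic is easy to follow.

```python
import math
from itertools import combinations


def empirical_semivariogram(points, bin_width, n_bins):
    """points: list of (x, y, value).

    Returns a list of (bin_center, semivariance, n_pairs) for non-empty bins.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for (x1, y1, z1), (x2, y2, z2) in combinations(points, 2):
        distance = math.hypot(x2 - x1, y2 - y1)
        index = int(distance // bin_width)
        if index < n_bins:
            sums[index] += (z1 - z2) ** 2
            counts[index] += 1
    result = []
    for i in range(n_bins):
        if counts[i]:
            center = (i + 0.5) * bin_width
            result.append((center, sums[i] / (2 * counts[i]), counts[i]))
    return result


# Hypothetical measurements taken at four points along a line.
points = [(0, 0, 1.0), (1, 0, 2.0), (2, 0, 4.0), (3, 0, 7.0)]

variogram = empirical_semivariogram(points, bin_width=1, n_bins=4)
for center, semivariance, n_pairs in variogram:
    print(center, semivariance, n_pairs)
```

In this toy dataset, the semivariance climbs as the separation distance grows, which is the pattern a variogram is designed to reveal: Points that are close together tend to have similar values, and points that are far apart don’t. In a real project, you would fit a smooth variogram model to these binned values before kriging.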

Addressing environmental issues with spatial statistics

A great example of using spatial statistics to generate predictions for location-dependent environmental variables can be seen in the recent work of Dr. Pierre Goovaerts. Dr. Goovaerts uses advanced statistics, coding, and his authoritative subject-matter expertise in agricultural engineering, soil science, and epidemiology to uncover correlations between spatial disease patterns, mortality, environmental toxin exposure, and sociodemographics.

In one of Dr. Goovaerts’s recent projects, he used spatial statistics to model and analyze data on groundwater arsenic concentrations, location, geologic properties, weather patterns, topography, and land cover. Through his recent environmental data science studies, he discovered that the incidence of bladder, breast, and prostate cancers is spatially correlated with long-term arsenic exposure.

With respect to data science technologies and methodologies, Dr. Goovaerts commonly implements the following:

· Spatial statistical programming: Once again, kriging and variogram analysis top the list.

· Statistical programming: Least squares regression and Monte Carlo (a random simulation method) are central to Dr. Goovaerts’s work.

· GIS technologies: If you want map-making functionality and spatial data analysis methodologies, you’re going to need GIS technologies.
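If you’re curious about what kriging does under the hood, the following Python sketch shows a bare-bones version of ordinary kriging. It uses a variogram model to build a small system of linear equations, solves for weights that sum to 1, and predicts the value at a new location as a weighted average of the known values. This is a generic textbook formulation and not Dr. Goovaerts’s code. The exponential variogram and its sill and range parameters are assumptions (in a real study, you would fit them to data), and the sample points are hypothetical.

```python
import math


def exponential_variogram(h, sill=1.0, range_=10.0):
    """Semivariance at separation h, with no nugget; parameters are assumed, not fitted."""
    return sill * (1.0 - math.exp(-3.0 * h / range_))


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def ordinary_kriging(points, target, variogram=exponential_variogram):
    """points: list of (x, y, value); target: (x, y). Returns the kriged estimate."""
    n = len(points)

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # Semivariances between known points, bordered by the sum-to-one constraint.
    a = [
        [variogram(dist(points[i], points[j])) for j in range(n)] + [1.0]
        for i in range(n)
    ]
    a.append([1.0] * n + [0.0])
    b = [variogram(dist(p, target)) for p in points] + [1.0]

    solution = solve_linear_system(a, b)
    weights = solution[:n]  # the final entry is the Lagrange multiplier
    return sum(w * p[2] for w, p in zip(weights, points))


# Hypothetical measurements at the four corners of a 10-by-10 square.
samples = [(0, 0, 1.0), (0, 10, 2.0), (10, 0, 3.0), (10, 10, 4.0)]

center_estimate = ordinary_kriging(samples, (5, 5))
corner_estimate = ordinary_kriging(samples, (0, 0))
print(center_estimate, corner_estimate)
```

Two sanity checks are built into the example. The center of the square is equally far from all four corners, so the weights are equal and the estimate is the plain average of the four values. And because this variogram has no nugget, predicting at a sampled location hands back the measured value. To produce a full predictive surface, you would repeat the estimate over a grid of target locations.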

To find out more about Dr. Goovaerts’s work, check out his website at https://sites.google.com/site/goovaertspierre.