Data and how to get it - Business statistics with Excel and Tableau (2015)

4. Data and how to get it

In this chapter, we’ll look at the data collection process and the problems that can accompany poorly collected data, discuss so-called big data, and provide links to some useful public sources of data.

As you might expect, this book works through the analysis of numbers, that is, quantitative data, but it is important to note that the analysis of qualitative data is a rapidly growing area. Unfortunately, the analysis of qualitative data is beyond the scope of this book and the software available to us.

Quantitative data comes from two main sources. Primary data is collected by you or the company you are working for. For example, the market research you do to find out the possible market share for your product provides primary data. Primary data includes both internal company data and data from automated equipment, such as website hits. Collecting primary data is expensive, and the results are highly proprietary. Primary data is therefore unlikely to be published or made available outside the enterprise.

By contrast, secondary data is plentiful and mostly free. It is collected by governments of all sizes and by many non-government organizations. There is an increasing trend towards the liberation of data under various open government initiatives. Government data is usually reliable, but be sure to check the accompanying notes, which warn of any problems such as limited sample size or changes in classification or collection methods over time. There are some links to secondary data at the end of this chapter.

Experimental design

Experiments don’t necessarily have to be conducted in laboratories by people in white coats: a survey in a shopping mall is also a form of experiment, as is analyzing the results of your favorite cricket team. There are two different design types: experimental and observational. The key difference is the amount of randomization that is built into the experiment. In general, more randomization is good, because it helps to remove the influence of fixed effects, such as the quality of the soil in a particular location.

Here’s an example to make the difference clear. Imagine a researcher wanting to test the effect of a new drug on mice. In an experimental design, the mice are examined before the drug is administered and are then split into a treatment group, which receives the drug, and a control group, which receives no drug. The job of the control group is to act as a reference group. All the mice are kept in identical conditions apart from the application of the drug. The allocation of mice to the treatment group and to the control group is entirely random.
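The random allocation described above can be sketched in a few lines of Python. This is only an illustration, not a procedure from any particular study: the function name `allocate`, the mouse labels, and the choice of 20 mice are all assumptions made for the example.

```python
import random

def allocate(subject_ids, seed=None):
    """Randomly split subjects into treatment and control groups.

    The groups are equal in size when the number of subjects is even;
    otherwise the control group gets the extra subject.
    """
    rng = random.Random(seed)      # a seed makes the allocation repeatable
    shuffled = list(subject_ids)   # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "treatment": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }

# Hypothetical labels for 20 mice.
mice = [f"mouse_{i:02d}" for i in range(1, 21)]
groups = allocate(mice, seed=42)

print("Treatment:", groups["treatment"])
print("Control:  ", groups["control"])
```

Because chance alone decides which group each mouse joins, any fixed difference between mice (weight, age, temperament) should be spread roughly evenly across the two groups.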

While we can (and do) carry out drug experiments on mice, it would clearly be very wrong to try to do the same with humans as subjects. Instead, in an observational design, we collect data or observations and then analyze those observations, looking for differences.

To compensate for the lack of randomization, we control for observed differences by including as many relevant variables as possible. If we knew that person X had received a particular drug and had developed a particular condition, we would want to compare person X with somebody else who had not developed that condition. Relevant variables that we would want to know might be age, gender and possibly pre-existing health conditions. Including these variables reduces fixed effects and allows us to concentrate on the effect of the drug.
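To see why including a relevant variable matters, here is a small simulation in Python. Everything in it is invented for illustration: the records are generated at random, the probabilities are arbitrary, and the drug is deliberately given no effect at all. Age group is the only thing that drives the condition, but older people are also more likely to take the drug.

```python
import random

rng = random.Random(7)

# Hypothetical observational records: (older, took_drug, has_condition).
# By construction the drug has NO effect; only age changes the risk.
records = []
for _ in range(20_000):
    older = rng.random() < 0.5
    took_drug = rng.random() < (0.7 if older else 0.2)
    has_condition = rng.random() < (0.30 if older else 0.05)
    records.append((older, took_drug, has_condition))

def condition_rate(rows):
    """Share of rows in which the condition is present."""
    return sum(r[2] for r in rows) / len(rows)

def rate_gap(rows):
    """Condition rate among drug takers minus the rate among non-takers."""
    takers = [r for r in rows if r[1]]
    others = [r for r in rows if not r[1]]
    return condition_rate(takers) - condition_rate(others)

# Ignoring age: drug takers appear to do worse.
naive_gap = rate_gap(records)

# Controlling for age: compare takers and non-takers within each age group.
gap_older = rate_gap([r for r in records if r[0]])
gap_younger = rate_gap([r for r in records if not r[0]])

print("Gap ignoring age:    ", round(naive_gap, 3))
print("Gap among older:     ", round(gap_older, 3))
print("Gap among younger:   ", round(gap_younger, 3))
```

The gap that appears when age is ignored comes entirely from the mix of ages in each group. Once we compare like with like, it should shrink to something close to zero. Splitting the data by age group is only one simple way to control for a variable; it is used here because it needs nothing beyond counting.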

Problems with data

It is obvious that to be credible, your analyses must be based on reliable data. Problems with data are usually connected to poor sampling and experimental design techniques, especially:

Sample size too small. The relative size of the sample to the population usually does not matter. It is the absolute size of the sample that counts. You usually want at least several hundred observations.

Population of interest not clearly defined. It is clear that we need to take a sample from a population, but what exactly is the population? Here’s an example. You want to survey shoppers in a shopping mall regarding your new product. But of the people inside the mall, who exactly are your population? People just entering, people just leaving, people having their lunch in the food court? Singles, couples, elderly people or the teenagers hanging around outside the door? You can see that picking any one of these groups on its own will lead to a biased sample.

Non-response bias. Many people don’t answer those irritating telephone calls which come in the evening because they’re busy with dinner. As a result, only answers from those who do choose to answer the survey are counted. Those respondents most likely aren’t representative of the population. Perhaps they live alone or do not have too much to do. I’m not saying that they should not be in the sample, just that including only those who do respond may bias your sample.

Voluntary response bias. If you feel strongly about an issue, then you are more likely to respond than if you are indifferent. That’s simply human nature. As a result, the survey results will be skewed by the views of those who feel most passionately. This is hardly a representative sample because the strongly-held views drown out the more moderate voices.
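The first point in the list above, that the absolute size of the sample matters far more than its size relative to the population, can be checked with a short calculation. The sketch below uses the standard approximate margin of error for a proportion from a simple random sample, with the finite population correction applied when a population size is given. The function name, the 95% level (z = 1.96), the worst-case proportion of 0.5, and the sample and population sizes are all choices made for this illustration.

```python
import math

def margin_of_error(n, population=None, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    n          -- sample size
    population -- population size; if given, the finite population
                  correction is applied
    p          -- assumed proportion (0.5 is the worst case)
    z          -- critical value (1.96 for roughly 95% confidence)
    """
    moe = z * math.sqrt(p * (1 - p) / n)
    if population is not None:
        moe *= math.sqrt((population - n) / (population - 1))
    return moe

# Same sample size, very different population sizes.
for population in (10_000, 1_000_000, 100_000_000):
    print(f"n=400, N={population:>11,}: {margin_of_error(400, population):.4f}")

# Same (effectively unlimited) population, different sample sizes.
for n in (25, 100, 400, 1600):
    print(f"n={n:>4}: {margin_of_error(n):.4f}")
```

With 400 observations the margin of error is a little under 5 percentage points whether the population is ten thousand or a hundred million. Changing the sample size, on the other hand, changes the margin a great deal: quadrupling the sample halves it.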

4.1 Big data

Primary data frequently comes from automated collection devices, such as scanners, websites, social media, and the like. The volume of such data is enormous, and is aptly called big data. Big data is the term used to describe large datasets generated by traditional business activities and from new sources such as social media. Typical big data includes information from store point-of-sale terminals, bank ATMs, Facebook posts and YouTube videos.

One of the apparently attractive features of big data is simply its size, which supposedly enables deeper insights and reveals connections which would not appear in smaller samples. This argument neglects the power of statistics, and in particular inferential statistics. A small sample, properly collected, can yield superior insights to a very large, poorly collected sample. Think of it this way. Which is better: a very large sample in which all the respondents are in the same age group and of the same gender, or a smaller one which more accurately reflects the population?
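A quick simulation makes the point concrete. The population below is entirely made up: two equally sized age groups whose average spend differs (40 and 80, in arbitrary units). We then compare a very large sample drawn from only one age group with a small sample drawn at random from everyone. The group sizes, means and sample sizes are assumptions chosen for the illustration.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: two age groups with different average spend.
young = [rng.gauss(40, 10) for _ in range(50_000)]
older = [rng.gauss(80, 10) for _ in range(50_000)]
population = young + older
true_mean = statistics.fmean(population)

# Very large but poorly collected: 20,000 people, all from one age group.
big_biased = rng.sample(young, 20_000)

# Small but properly collected: 400 people drawn at random from everyone.
small_random = rng.sample(population, 400)

big_biased_mean = statistics.fmean(big_biased)
small_random_mean = statistics.fmean(small_random)

print("True population mean:    ", round(true_mean, 1))
print("Large biased sample mean:", round(big_biased_mean, 1))
print("Small random sample mean:", round(small_random_mean, 1))
```

The large sample estimates the average spend of the young group very precisely, but that is the wrong quantity. The small random sample is noisier, yet it lands much closer to the figure for the population as a whole.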

4.2 Some useful sites

You can of course easily just Google for data, or look at these more focused sites:

Gapminder data (free to use, but be sure to attribute): http://www.gapminder.org/data/

Worker employment and compensation: http://www.bls.gov/fls/country/canada.htm

Interesting and wide-ranging historical data: http://www.historicalstatistics.org/

Google Public Data: http://www.google.com/publicdata/directory

International Monetary Fund: http://www.imf.org/external/data.htm

Food and Agriculture Organisation: http://www.fao.org/statistics/en/

United Nations Data: http://data.un.org/

The World Bank: http://databank.worldbank.org/data/home.aspx