Being presented with a new business intelligence problem can be overwhelming. If you didn’t collect the data yourself, no doubt there is someone else who is proud of what has been pulled together and has high expectations for its use. Likely someone is expecting some straightforward and definitive insights into the current business situation. You may already have some idea about what the data is going to show, and you just want to prove it and move on to other things.
Right here is where you should stop and take a step back. Not only does this sort of situation often lead to confirmation bias, but it also encourages you to skip an important milestone in the analysis process, one that can provide vital and often unexpected insights.
The general process of data analysis goes like this:
- Collect data
- Process and clean the data (Normally most of your time is spent here.)
- Perform exploratory data analysis
- Hypothesis testing and modeling (Normally the shortest step.)
- Make decisions
I’m going to talk a little bit about the important distinction between steps 3 and 4 and why both are important. As I described in the hypothetical situation above, you often have a theory about the situation before any business data is even collected. From this narrow starting point, the data can serve only to confirm or reject your predetermined hypothesis.
The hypothesis testing and modeling step is what most people think of when they think about data analysis. It is the part that analysts and statisticians spend most of their time learning about in school. But in reality, this is usually only a small part of the data analysis process. Not only that, but the highly mathematical techniques used for modeling will only provide valuable information if they are used to test the right theory on the right data. Too often, analysts jump directly from step 2 to step 4 and bypass the benefits of exploratory data analysis.
Exploratory data analysis (EDA) is a loosely defined term that involves using graphics and basic sample statistics (mean, median, standard deviation, etc.) to get a feeling for what information might be obtainable from your dataset. The concept of EDA is relatively new, having been first developed by mathematician John Tukey in the late 1970s. Tukey described a set of techniques that allow analysts to quickly look at data for trends, outliers, and patterns. The eventual goal of EDA is to develop theories that can later be tested in the modeling step.
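Those basic sample statistics are often all it takes to surface something worth investigating. A minimal sketch, using only Python's standard library and invented sales figures, shows how a mean that drifts far from the median can hint at an outlier:

```python
import statistics

# Hypothetical monthly sales figures (in thousands) -- illustrative data only
sales = [42, 45, 44, 47, 43, 46, 120, 44, 45, 43]

mean = statistics.mean(sales)
median = statistics.median(sales)
stdev = statistics.stdev(sales)

print(f"mean={mean:.1f}, median={median:.1f}, stdev={stdev:.1f}")

# The mean (51.9) sits well above the median (44.5), which suggests a
# single large value (here, 120) is pulling it up -- exactly the kind of
# observation EDA is meant to surface before any formal modeling.
```

A gap like this between mean and median is one of the simplest diagnostics Tukey's approach relies on, and it takes seconds to compute.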
The fact is that many of us are already regularly performing EDA whether we realize it or not. When I am presented with a new business intelligence problem, my first steps are to write a few simple queries, pull the data out of the database and into Excel or some other software, like BOARD, and start to play around with it to see if I can figure out what’s going on. From there I create graphs and visuals, filter data in various ways, and look for outliers and possible entry errors. I see what jumps out at me and try to develop a narrative that explains the data. I think that many people approach problems in a similar way.
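That kind of first pass can be sketched in a few lines of Python. The order records, customer names, and the two-standard-deviation threshold below are all invented for illustration; the point is the habit of screening for values that don't fit before trusting any summary:

```python
import statistics

# Hypothetical order records -- (customer, amount); names and values are invented
orders = [
    ("Acme", 120.0), ("Beta", 95.5), ("Acme", 110.0),
    ("Gamma", 9800.0),   # suspiciously large -- entry error or genuine outlier?
    ("Beta", 102.5), ("Delta", 98.0),
]

amounts = [amt for _, amt in orders]
mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

# Simple screen: flag anything more than two standard deviations from the mean
outliers = [(cust, amt) for cust, amt in orders if abs(amt - mean) > 2 * stdev]
print("flagged for review:", outliers)
```

Whether a flagged value is a data-entry mistake or a genuinely unusual transaction is exactly the kind of question EDA raises and a later modeling step should not have to discover on its own.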
An important part of exploratory data analysis is to approach it without expectations. During this step, it is important to think outside of the box. EDA is perhaps more of an art than a science. Interestingly, it can be performed by nearly anyone who knows how to use a computer, yet it is also the process that allows talented data scientists to show their genius. Performing EDA is like sifting through the unbound pages of a novel and trying to put together the story. Most people could determine the main plot of the book, but perhaps only the best could find insight into the suggestive themes that make it worth reading.
Once you have sifted through the data and feel like you know what’s going on, you can proceed with hypothesis testing and modeling to make predictions. There are many methods for doing this. Typically some form of regression is used to interpolate or extrapolate the results of possible future conditions. These methods can be very complicated mathematically and may require lots of computational resources. Other times satisfactory answers can be obtained from a single function. Regardless of the method used, the purpose of this step is to confirm or reject the theories developed during the EDA step. So clearly you aren’t going to get any useful information unless the EDA led you to ask the right questions.
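To make the regression idea concrete, here is a minimal least-squares linear fit written with only the standard library. The quarterly revenue figures are invented; the sketch simply shows how a "steady growth" theory from EDA might be turned into a testable forecast:

```python
# Hypothetical quarterly revenue (in $ millions) -- illustrative data only
quarters = [1, 2, 3, 4, 5, 6]
revenue = [10.0, 11.5, 13.0, 14.4, 16.1, 17.5]

n = len(quarters)
mean_x = sum(quarters) / n
mean_y = sum(revenue) / n

# Least-squares slope = covariance(x, y) / variance(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(quarters, revenue)) / \
        sum((x - mean_x) ** 2 for x in quarters)
intercept = mean_y - slope * mean_x

# Extrapolate one quarter ahead to check the growth theory against reality later
forecast_q7 = intercept + slope * 7
print(f"slope={slope:.2f}, forecast for Q7 = {forecast_q7:.2f}")
```

In practice a statistics package would also report goodness-of-fit and confidence intervals, but even this bare-bones version illustrates the shape of the modeling step: a theory from EDA, expressed as a function, producing a prediction that can be confirmed or rejected.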
When John Tukey first wrote about exploratory data analysis in 1977, he described a labor-intensive process that could be performed with only a basic calculator and hand-drawn plots. It’s not difficult to imagine the amount of work that went into sorting and plotting the data in a variety of ways to get a better understanding of the situation. His work led to the development of visualization software that made the process much faster. Now there are many software options for EDA; BOARD, for example, can be used by someone without an extensive background in statistics. Whereas developing deeper insight into a business problem used to take hours or days of painstaking work, a practiced user can quickly view many visualizations and get useful insight.