Data Exploration – What, Why, and How

Author: Harald Piringer – November 18, 2020

This blog post takes a closer look at data exploration, an approach to data analytics that is rapidly gaining popularity with the rise of data science and interactive visualization. We will cover what data exploration is, why it is necessary, and how to do it.

All truths are easy to understand once they are discovered; the point is to discover them.

What is data exploration?

The goal of data exploration is to learn about the characteristics and potential problems of a data set without having to formulate assumptions about the data beforehand. In statistics, data exploration is often referred to as “exploratory data analysis” and contrasts with traditional hypothesis testing. It was most prominently promoted by the statistician John W. Tukey [1] as a way to suggest hypotheses about the causes of observed phenomena and to assess the assumptions on which statistical methods are based.

Nowadays, data exploration has become established as an essential phase of every data science project. Since its beginnings, exploratory data analysis has been a very graphical approach. Typical plots include histograms, box plots, scatter plots, and many more, which help to reveal distributions, correlations, outliers, trends, and other data characteristics.

Since the advent of interactive graphics, data exploration has become increasingly popular in fields beyond classical statistics and data science. Examples include engineering, business intelligence, and even online media coverage of phenomena such as the COVID-19 outbreak.

Why is data exploration necessary?

Data exploration has two key advantages:

  1. It enables unexpected discoveries in the data.
  2. It fosters a deep understanding of the data, which is an important foundation for successful and efficient data science projects.

A simple and famous example by the statistician F. J. Anscombe [2] illustrates these aspects very well (Fig. 1). It shows four different data sets, each data set with two variables and eleven data points.

Although the four data sets look quite different when plotted, their summary statistics are nearly identical: the means and standard deviations of both variables agree to two decimal places, and the correlation coefficients and linear regression lines are almost the same.

Fig. 1: Anscombe’s quartet [2]
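
The summary statistics described above can be checked with a few lines of Python. This is a minimal sketch using only the standard library. The data values are those published by Anscombe [2], while the names `QUARTET`, `summarize`, and `SUMMARIES` are chosen here purely for illustration.

```python
from math import sqrt

# Anscombe's quartet: data sets I to III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, sample standard deviations, Pearson r, and the least-squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "sd_x": sqrt(sxx / (n - 1)),
        "sd_y": sqrt(syy / (n - 1)),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

# Print all statistics rounded to two decimal places, one line per data set.
for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimal places, the four printed lines are identical, which is exactly why the numbers alone cannot distinguish data sets that look so different in Fig. 1.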

How can this mismatch be explained?

Many statistical methods make implicit assumptions about the data, such as a unimodal, normal distribution, the absence of strong outliers, and so on. If these assumptions are violated, the results may be misleading. While more robust statistics would avoid these problems in this simple example, two conclusions still hold:

  1. Inspecting raw, non-aggregated data values may contribute a lot to learning about the real structure of data.
    This is especially true for measured time series data such as industrial process data, which comes with many additional complexities, e.g., sudden process changes and very irregular patterns.
  2. Good analytical results require suitable data, and methods that match the characteristics of the data.
    Even the most advanced methods such as deep neural networks will not work well if trained with data that contains too much noise.

Data exploration in data science projects

For the reasons outlined above, data exploration should be a key phase in every data science as well as data-based engineering project. The Cross Industry Standard Process for Data Mining (CRISP-DM) [3], a widely adopted process model for conducting data science projects, lists “data understanding” as its second phase (see Fig. 2).

The results of this step are answers to important questions such as:

  • Is the available data sufficient and suitable to achieve the project goals?
  • Is the data plausible, or does it suffer from acquisition or transformation artefacts?
  • Do we know enough about the causes of anomalies, trends, gaps, etc.?
  • Which data subsets (e.g., variables, time periods) can or cannot be used for building predictive models?

Fig. 2: Process model of the Cross Industry Standard Process for Data Mining (CRISP-DM)

Answers to these questions also inform the third step of CRISP-DM, namely data preparation. In practice, there is a tight interrelation between data exploration and data preparation.

Discoveries about issues such as outliers or clusters may trigger immediate preparatory actions (e.g., filtering outlying data) before the exploration process can continue. This can be tedious and may require consulting domain experts, who know how to explain specific data problems and can decide on valid actions to deal with them.

Data preparation – a simple real-world example

As a real-world example, Fig. 3 shows four years of data from a temperature sensor in a hydropower plant. It reveals three types of problems for modelling the current normal behavior: an irrelevant period of a few months due to a revision (yellow), repeated spikes due to shut-down and start-up procedures (red), and different levels of the time series before (blue) and after (green) the revision. It turns out that only the green part of the data can be used for building a predictive model that reflects the current normal behavior.

Fig. 3: Four years of raw data from a temperature sensor in a hydropower plant.
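
Once such findings are known, the corresponding preparation step can be written down in code. The following Python sketch labels each sample in the spirit of the color coding of Fig. 3. It rests on several assumptions that are not part of the original example: the revision dates are known, spikes are detected by their distance from the median level after the revision, and the function name `label_samples`, the tolerance, and all data values are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def label_samples(samples, revision_start, revision_end, spike_tolerance):
    """Label (timestamp, value) samples as before_revision, revision, spike, or usable."""
    # Assumption: the "current normal" level is the median of all post-revision values.
    post_values = [value for time, value in samples if time > revision_end]
    level = median(post_values) if post_values else None
    labels = []
    for time, value in samples:
        if time < revision_start:
            labels.append("before_revision")  # different level, not representative
        elif time <= revision_end:
            labels.append("revision")  # irrelevant period
        elif abs(value - level) > spike_tolerance:
            labels.append("spike")  # e.g., shut-down or start-up
        else:
            labels.append("usable")
    return labels


# Hypothetical daily readings; the real data in Fig. 3 spans four years.
start = datetime(2020, 1, 1)
values = [40.0, 40.5, 39.8, 0.0, 0.0, 45.1, 44.8, 20.0, 45.3, 45.0]
samples = [(start + timedelta(days=i), v) for i, v in enumerate(values)]

labels = label_samples(
    samples,
    revision_start=start + timedelta(days=3),
    revision_end=start + timedelta(days=4),
    spike_tolerance=5.0,
)
training_values = [v for (_, v), label in zip(samples, labels) if label == "usable"]
print(labels)
print(training_values)
```

In practice, the thresholds and period boundaries would come from visual inspection and from domain experts rather than from a fixed rule like this one.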

The interrelation between discoveries, interpretations, and preparatory actions is one reason why those two steps – understanding and preparing data – are often cited as accounting for up to 80% of the overall effort of data science projects [4]. One goal of Visplore is to reduce this effort by keeping these steps as integrated as possible.

How should data exploration be done?

Data exploration has always been closely linked to data visualization. The idea is to learn about data by depicting (often raw) data as it is. Humans have powerful visual pattern recognition capabilities with more than 20 billion neurons of our brain dedicated to this purpose [5]. “Seeing is understanding” and “A picture is worth a thousand words” are just two famous proverbs stressing the power of human visual perception. For data exploration, a particular advantage is the ability to discover patterns which were not anticipated or which would sometimes be hard to describe mathematically.

A proven approach to data exploration is to gradually increase the dimensionality of the inspected data plots from 1D through 2D to nD [6].

Roughly, a data exploration process could look as follows:

Fig. 4: Three phases of an exploration process: 1D distributions in histograms (left), 2D correlations in scatter plots (middle), and nD patterns – in this case as horizon graphs (right)

  1. Start by looking at variables in isolation.
    Issues to check at this stage include the shape and plausibility of distributions as well as the presence of missing values and univariate outliers. Histograms (Fig. 4 left) are useful for assessing the distributions of quantitative variables, while bar charts are frequently used for categorical data to understand which categories exist and how frequently each category occurs. For time series data, it is natural to look at time series plots.
  2. Investigate correlations between variables.
    This reveals bivariate outliers and clusters that would not be visible in 1D plots. It may also hint at potential dependencies, even though it’s a common saying in statistics that correlation must not be confused with causation! For quantitative data, the most common technique for this task is the scatter plot (Fig. 4 middle). For categorical data, heatmaps are often a good choice.
  3. Proceed towards understanding the data set in its entirety.
    This may include multivariate visualization types such as parallel coordinate plots and visual overview techniques, e.g., horizon graphs for time series data (Fig. 4 right). It may also include statistical techniques for dimensionality reduction such as Principal Component Analysis (PCA). In general, however, this step largely depends on the data and the ultimate goal of the analysis. Concrete steps will differ depending on whether your task is to, for example, build a predictive model, understand the impact of parameters for a design problem, or identify the root cause of a machine failure. Therefore, it’s hard to generalize.
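
The numbers behind the first two steps are easy to compute, even without a plotting library. The following Python sketch bins one variable into histogram counts (step 1) and computes the Pearson correlation between two variables (step 2). The function names, the variable names `temperature` and `power`, and all values are hypothetical and serve only as an illustration. Step 3 is omitted because it depends too much on the task at hand.

```python
from math import sqrt


def histogram(values, bins):
    """Count values per equal-width bin; None entries are reported as missing."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data
    counts = [0] * bins
    for v in present:
        # The maximum value would fall just outside, so it goes into the last bin.
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts, len(values) - len(present)


def pearson(x, y):
    """Pearson correlation over the pairs in which both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy)


# Hypothetical measurements with one missing value and one suspicious reading.
temperature = [20.1, 20.4, None, 21.0, 35.0, 20.7, 20.2, 20.9]
power = [5.0, 5.2, 5.1, 5.6, 9.8, 5.4, 5.1, 5.5]

# Step 1: univariate view (distribution shape, missing values, outliers).
counts, missing = histogram(temperature, bins=3)
print("counts per bin:", counts, "missing:", missing)

# Step 2: bivariate view (strength of the linear relationship).
r = pearson(temperature, power)
print("correlation:", round(r, 3))
```

The empty middle bin already hints at a univariate outlier, and the very high correlation is driven largely by that single point, which is precisely the kind of effect a scatter plot makes visible at a glance.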

Boosting data exploration with interactive visualization

As the complexity and dimensionality of the inspected data increase, the limitations of static charts soon become obvious. Interactive data visualization is a powerful extension to overcome these limitations [5]. For example, it enables the user to rapidly adapt the visualization to answer specific questions, sometimes simply by zooming into a particular part of the plot. It also empowers the user to express interest in certain features of the data by selecting their corresponding visual representations.

In Visplore, for example, selecting a cluster in a scatter plot of two sensor variables highlights the corresponding time period in a linked time series view or a calendar view. Linking multiple views thus makes it possible to overcome the limitations of individual visualization types and reveals correlations between different aspects of the data that would be hard or impossible to show in a single plot. This turns data visualization into an intuitive user interface for entering into an informative dialog with the data for rapid exploration.

Fig. 5: Selecting a cluster of data points in a scatter plot. A linked bar chart reveals that this cluster corresponds to night times. The bars reflect the percentages of selected data per hour of the day.
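
The logic behind such a linked selection can be sketched in a few lines of Python. This is not how Visplore implements it. It is a minimal illustration that assumes a rectangular selection in the scatter plot and reads the bars of Fig. 5 as the share of each hour's records that fall into the selection. The record layout, the function names, and the synthetic data are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def brush(records, x_range, y_range):
    """Return the records whose (x, y) position lies inside a rectangular selection."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [r for r in records if x_lo <= r["x"] <= x_hi and y_lo <= r["y"] <= y_hi]


def percent_selected_per_hour(records, selected_ids):
    """For each hour of the day, the percentage of that hour's records that are selected."""
    totals = Counter(r["time"].hour for r in records)
    hits = Counter(r["time"].hour for r in records if r["id"] in selected_ids)
    return {hour: 100.0 * hits[hour] / totals[hour] for hour in sorted(totals)}


# Synthetic hourly data over two days: night-time records form a separate cluster.
start = datetime(2020, 6, 1)
records = []
for i in range(48):
    time = start + timedelta(hours=i)
    night = time.hour < 6 or time.hour >= 22
    records.append({"id": i, "time": time,
                    "x": 1.0 if night else 5.0,
                    "y": 2.0 if night else 8.0})

# Selecting the lower-left cluster in the scatter plot ...
selected = brush(records, x_range=(0.0, 2.0), y_range=(0.0, 3.0))
# ... updates the linked bar chart over the hours of the day.
shares = percent_selected_per_hour(records, {r["id"] for r in selected})
print(shares)
```

The key design idea is that the selection is stored as a set of record identifiers, so any number of other views can highlight the same records without knowing how the selection was made.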

The next level – guided data exploration

While interactivity multiplies the power of data visualization, it can sometimes still be hard to pick the most useful view for answering a specific question – especially if your data encompasses hundreds of variables. The answer is guided data exploration. The goal is to present the user with views of the data that could be most helpful for answering the current question at hand. Let’s illustrate this by means of some examples in Visplore:

Fig. 6: Example of guided data exploration: Selecting an anomalous period (left) ranks other variables by how much their distribution also differs during that time (middle). This makes it easy to identify other anomalous time series (right).

  1. Defining a target variable…
    …ranks all other variables by their degree of correlation with that target.
  2. Selecting an anomaly…
    …in a time series view ranks histograms of all other variables by their potential correlation with that anomalous period (see Fig. 6).
  3. When building a regression model (in Visplore Professional)…
    …a list orders all input variable candidates by their potential to explain the remaining prediction errors and thus to further improve the model.
  4. When selecting a pattern in a time series view…
    …Visplore Professional can search for occurrences of similar patterns.

In all cases, advanced data analytics use the power of today’s computers behind the scenes to do the tedious work of searching and analyzing the large bulk of data for relevance.

At the same time, the user is kept in the loop and can benefit from intuition, knowledge, and human perception skills to efficiently drive the data exploration by selecting between a few candidates for next steps.
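
To make the idea of ranking more concrete, the following Python sketch mimics the first two examples from the list above. The measures Visplore actually uses are not described here, so the sketch assumes two simple stand-ins: the absolute Pearson correlation with the target, and the difference between the mean inside and outside the selected period, scaled by the overall standard deviation. All function names, variable names, and values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def rank_by_correlation(columns, target):
    """Order all other variables by their absolute correlation with the target."""
    scores = {name: abs(pearson(values, columns[target]))
              for name, values in columns.items() if name != target}
    return sorted(scores, key=scores.get, reverse=True)


def rank_by_shift(columns, selected, exclude):
    """Order variables by how strongly their mean shifts inside the selected period.

    Assumes the selection is neither empty nor covers all samples.
    """
    scores = {}
    for name, values in columns.items():
        if name == exclude:
            continue
        inside = [v for v, s in zip(values, selected) if s]
        outside = [v for v, s in zip(values, selected) if not s]
        spread = pstdev(values) or 1.0  # guard against constant variables
        scores[name] = abs(mean(inside) - mean(outside)) / spread
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical sensor data with eight samples per variable.
columns = {
    "power": [1, 2, 3, 4, 5, 6, 7, 8],
    "flow": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.2, 8.0],
    "vibration": [3, 3, 3, 3, 9, 9, 3, 3],
    "bearing_temp": [50, 51, 50, 49, 70, 71, 50, 51],
    "ambient": [5, 4, 6, 5, 4, 6, 5, 5],
}

# 1. Defining a target variable ranks the others by correlation.
by_correlation = rank_by_correlation(columns, target="power")
print(by_correlation)

# 2. Selecting the anomalous period seen in "vibration" ranks the others by shift.
anomaly = [False, False, False, False, True, True, False, False]
by_shift = rank_by_shift(columns, anomaly, exclude="vibration")
print(by_shift)
```

With hundreds of variables, the same ranking would decide which few views are shown first, while the user still judges whether the top candidates are actually meaningful.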

Conclusion

Data exploration is very helpful whenever the goal is to gain new insights from data. It can be the most effective, and sometimes the only, approach when the data suffers from quality problems (e.g., gaps, anomalies, outliers) or other real-world complexities such as process changes, or when it is not practical to formulate a precise question beforehand that can be answered by a numeric result.

In this respect, research & development, engineering as well as data science are among those fields that can benefit a lot from data exploration. With today’s computing power and the support of modern analytics, interactive data exploration can be an exciting and engaging experience for everyone to discover unexpected value in large amounts of complex data.

Literature

[1] J. W. Tukey. “Exploratory Data Analysis”. Addison-Wesley, 1977.

[2] F. J. Anscombe. “Graphs in Statistical Analysis”. The American Statistician, vol. 27, no. 1, pp. 17–21, 1973.

[3] R. Wirth and J. Hipp. “CRISP-DM: Towards a standard process model for data mining”. In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, pp. 29–39, 2000.

[4] New York Times. “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”. August 18, 2014.

[5] T. Munzner. “Visualization Analysis and Design”. AK Peters Visualization Series, CRC Press, 2015.

[6] J. Seo and B. Shneiderman. “A rank-by-feature framework for interactive exploration of multidimensional data”. Information Visualization, vol. 4, no. 2, pp. 96–113, 2005.