Understanding complex data is time-consuming and requires domain expertise
Gaining a deep understanding of the available data is an essential step in every data science project. Process models such as CRISP-DM stress the importance of the data understanding phase for building good predictive models, and for assessing whether the project goals can be achieved at all. However, obtaining that understanding can be time-consuming and challenging for data scientists and consultants, especially when the data comes from new projects or clients. For example, sensor data from energy systems or industrial processes may comprise time series from thousands of measuring points. Typically, many of these series show complex patterns, characteristics that vary with process changes, phases of process interruption, and quality issues such as gaps and outliers. New questions about the data therefore come up quickly during this phase and often require lengthy iterations with domain experts. This helps explain why data understanding, together with data preparation, is commonly estimated to account for up to 80% of the overall effort and cost of a data science project.