An important step of data preparation is to label (or annotate) parts of the data with semantic information. For example, labeling inexplicable anomalies in sensor time series allows for an efficient discussion with co-workers, or assessment by the data owners. Moreover, AI solutions for predicting anomalies require labeled training data for learning to distinguish impending problems from normal behavior. In many cases, only the domain experts are familiar enough with the data to externalize this kind of knowledge in the form of labels, and they need tool support to annotate efficiently.
In this tutorial, you will learn how to label time series data interactively (much like painting the labels), as well as semi-automatically (algorithmic rule + manual correction). Moreover, we demonstrate how you can audit and correct existing labels. The tutorial starts from Python (but you can skip this when starting from another data source), and ends with exporting the labeling back to Python or CSV.
Tutorial for data labeling: Look at the detailed tutorial to see an entire workflow starting with data from Python, exporting labels back to Python, pattern search features for automatic labeling, and how to apply labelings to new data.
Preparation (to follow this lesson using demo data)
If you're not working with Python, you can start Visplore directly, and load your dataset from any other source (CSV, ...).
For the tutorial, load the solar power demo dataset from the welcome dialog, as shown below.
This section shows you how to interactively assign labels to the data, much like painting the labels on data points.
The dataset in this example contains almost 200 time series. Let's focus on a particular temperature sensor that is known to have had anomalies, and zoom in:
Note: if you accidentally zoomed in on the wrong area, you can zoom out again with the button in the bottom left of the time series view, and try again.
Note: if you tried to zoom using the left mouse button instead of the right mouse button, you have made a data selection. You can clear it again by pressing the X, then try again.
Once you have zoomed in to September, you can see the anomalies: some implausible oscillations of the temperature. Let's select the horizontal interval selection tool, so we can label these times:
With this selection tool, select the first of the anomalies by dragging the left mouse button:
Now you can give this selection a name to start labeling. Name it "Oscillation":
Alternatively use the top level menu and select "Create" and "Label":
Now you are in a labeling mode, where your labels are shown using colors. You can label further periods by selecting, and clicking the button "Oscillation":
Great! You have mastered the basic workflow for manual data labeling. There are several ways you can continue from here:
- You can scroll through the time series (e.g. using the mouse wheel or the scrollbar) and label further occurrences.
- You can also create new labels within the labeling, e.g. for other types of anomaly or any other semantics, by pressing the "New" button next to the "Oscillation" button after selecting data.
- You may also select and label data in many other views, such as the Scatter Plot or the Histogram, and use any selection tool you want, like the Lasso or the vertical selection brush.
- When you're done, export the labeling as shown in the section "Exporting the labeling" below.
Note that the labeling functionality shown so far works for all versions of Visplore, including the Free version.
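For readers who think in code, the result of the manual workflow above can be pictured as a categorical column that holds one label per data record. The following is a minimal pandas sketch of that idea, not Visplore's implementation: the data values, the intervals, and the column name "Labeling" are made up for illustration, and "Rest" mirrors the default value mentioned in the export section for records without a label.

```python
import pandas as pd

# Hypothetical stand-in for one temperature sensor (values are made up).
index = pd.date_range("2021-09-01", periods=48, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"temperature": range(48)}, index=index)

# Every record starts unlabeled.
df["Labeling"] = "Rest"

# Hypothetical intervals standing in for two painted selections.
oscillation_intervals = [
    (pd.Timestamp("2021-09-01 06:00"), pd.Timestamp("2021-09-01 09:00")),
    (pd.Timestamp("2021-09-02 10:00"), pd.Timestamp("2021-09-02 12:00")),
]
for start, end in oscillation_intervals:
    # Label-based slicing on a DatetimeIndex includes both end points.
    df.loc[start:end, "Labeling"] = "Oscillation"

label_counts = df["Labeling"].value_counts()
print(label_counts)
```

Each painted selection simply overwrites the label of the records it covers, which is also why labeling a second interval with the same button extends the same label.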
Auditing existing labelings
Just as important as the initial labeling is the use case of inspecting and correcting a labeling that was made by somebody else, or by an external algorithm. Visplore supports a workflow of auditing and correcting any categorical data attribute, in particular ones that contain labels. This works regardless of whether the categorical attribute was already part of the originally loaded dataset, or was merged into the Visplore data table at runtime from Python, R, or Matlab.
In our example, we use a labeling that states wind directions for each time stamp of wind speed sensor measurements. The problem in this example is that for weak wind speeds, the wind direction label is not reliable enough for using it in a follow-up algorithm. Thus, we want to correct the labels for weak wind speeds to an "unspecified" state.
Reload the demo dataset, including the wind time series we need, by clicking the corresponding button at the top of the screen.
Then, select the demo data as follows:
After the cockpit starts, search for and select the time series Wind_Speed_BrightCounty_Weather as shown below.
If you want, collapse the Histogram view by clicking on the dark gray heading "Distribution" to have more space for the time series view.
Now you see the wind speed time series in the "Time Series" view. The labeling we want to audit consists of the corresponding wind direction measurements.
Start auditing the existing categorical data attribute Wind_Direction_BrightCounty_Weather as shown in the following:
As a result, a new, editable labeling is created based on the selected original categories. The time series plot shows it in colors, and you are now in the same Labeling mode as in part 2 of this tutorial:
You can also audit the distribution of the labels.
Open the Drill Down views, and configure the Bar Chart to show the distribution of the new labeling:
Then, the bar chart shows how often each wind direction label has occurred (bar length). As the views are linked, you can easily see when no wind direction was recorded; this is the label called "---".
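Outside of Visplore, the same kind of distribution check can be sketched with pandas. The example below is only an illustration under assumptions: the label values are invented, and the series name is shortened.

```python
import pandas as pd

# Hypothetical wind direction labels; "---" marks records without a direction.
directions = pd.Series(
    ["N", "NNE", "---", "N", "SW", "---", "N"], name="wind_direction"
)

# One count per label, analogous to the bar lengths in the Bar Chart.
distribution = directions.value_counts()
print(distribution)

# Positions of the records that carry the "---" label.
undefined_positions = directions.index[directions == "---"].tolist()
print(undefined_positions)
```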
Hover over the "---" label to see the corresponding points in the time series:
We see that these are mainly time periods with wind speed around zero. For our imaginary follow-up application, we only want very reliable direction measurements, so we also change the labels to "---" for all speeds below 2 m/s. For this, we can simply select these periods and re-label them as shown in the previous parts.
Use the vertical selection tool to select all times below 2 m/s, then click the "---" button to label all these times as undefined wind direction:
Now, all time periods with weak wind are labeled as "---".
Note that our selection was not very exact, but roughly below 2 m/s. If you wanted to make it exact, you could have used the corresponding icon in the Focus bar to set the borders exactly to 0 and 2.
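The correction with exact borders amounts to a simple rule: wherever the wind speed lies between 0 and 2 m/s, the direction label becomes "---". A minimal pandas sketch of that rule follows. The column names and values are hypothetical, and treating the upper border as exclusive ("below 2 m/s") is an assumption, since the text does not state how Visplore handles the border value itself.

```python
import pandas as pd

# Hypothetical sample; column names are shortened for illustration.
df = pd.DataFrame({
    "wind_speed": [0.0, 1.5, 2.0, 3.2, 0.4, 5.1],
    "wind_direction": ["---", "N", "NNE", "SW", "E", "W"],
})

# Assumed rule: 0 <= speed < 2 m/s counts as weak wind.
weak_wind = (df["wind_speed"] >= 0) & (df["wind_speed"] < 2)

# Replace the label where the rule holds, keep it everywhere else.
corrected = df["wind_direction"].mask(weak_wind, "---")
print(corrected.tolist())
```

Keeping the corrected labels in a separate series leaves the original column untouched, similar to how auditing in Visplore creates a new, editable labeling based on the original categories.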
From here, there are multiple directions to go forward:
- You can do further corrections. For example, re-label time periods, or merge certain categories like N, NNW and NNE by selecting them in the "Bar Chart" and re-labeling them as N.
- When you're done, export the corrected labeling as shown in the section "Exporting the labeling" below.
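Merging categories, as mentioned in the first option above, corresponds to mapping several label values onto one. A short pandas sketch with made-up label values:

```python
import pandas as pd

# Hypothetical labels before merging.
labels = pd.Series(["N", "NNW", "NNE", "S", "---", "NNE"])

# Map the neighboring directions onto "N"; all other labels stay unchanged.
merged = labels.replace({"NNW": "N", "NNE": "N"})
print(merged.tolist())
```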
Exporting the labeling
This section describes how you get the result of your labeling out of Visplore for further tasks. First, various ways of exporting are listed, then exporting to Python is described in more detail.
Note: Labels refer to entire data records (= table rows), not just to the values of the particular variable you labeled. This becomes evident when looking at the "Table" view, which you find in the bottom right of the screen.
There are multiple ways of exporting the labels for further use:
- Exporting the table view as shown in the image above. For this, ensure the "Table" tab is opened, then click the view title "Focus data records" and use the export option there. This exports the table exactly as currently shown, to CSV or the clipboard.
You can adjust which columns are shown by clicking the view title "Focus data records" as well.
Note that the view only considers data records that are currently in Focus. If you want to export all data records, clear the Focus first by pressing the X on the Focus button.
- Without going through this view, you can always export all data records in focus in different formats using the button in the Focus bar, as described here. This includes the labeling columns.
- And finally, you can get back the labelings to Python and other external script environments as described in the following.
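Once the labels are exported to CSV, they can be read back with any tool. The sketch below uses pandas and an in-memory stand-in for the exported file. The column names, the comma separator, and the timestamp format are assumptions, so adjust them to what your export actually contains.

```python
import io

import pandas as pd

# Stand-in for an exported file; layout and column names are hypothetical.
exported_csv = io.StringIO(
    "timestamp,temperature,Labeling\n"
    "2021-09-01 00:00:00,21.5,Rest\n"
    "2021-09-01 01:00:00,35.0,Oscillation\n"
    "2021-09-01 02:00:00,12.0,Oscillation\n"
    "2021-09-01 03:00:00,21.0,Rest\n"
)

# With a real export, pass the file path instead of the StringIO object.
exported = pd.read_csv(exported_csv, parse_dates=["timestamp"])

# Keep only the records that carry the "Oscillation" label.
oscillations = exported[exported["Labeling"] == "Oscillation"]
print(oscillations)
```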
Getting the labelings to Python
If you have started your use case from Python, or connected to a Python environment by this point (see here), you can use the following API commands to retrieve the labels in Python.
get_classification(name): retrieves the labeling with the given name as categorical data column of a new data frame.
# you may need to adjust the name of the labeling, if you named it differently
In Jupyter, for example, the result would look like this (containing "Rest" where no label was added, otherwise the name of your labels, like "Oscillation").
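A typical next step is to attach the retrieved labeling to the data frame you started from. The following sketch does not call Visplore; it uses a hand-made stand-in for the returned frame and assumes that it has one row per data record, in the same order as the original data. The helper name `attach_labels` and the column name "Labeling" are hypothetical.

```python
import pandas as pd


def attach_labels(data, labeling, column="Labeling"):
    """Return a copy of `data` with the labeling added as a categorical column."""
    if len(data) != len(labeling):
        raise ValueError("labeling and data must have the same number of rows")
    result = data.copy()
    # Align by position, since the row order is assumed to match.
    result[column] = pd.Categorical(labeling.iloc[:, 0].to_numpy())
    return result


# Hypothetical original data.
data = pd.DataFrame({"temperature": [21.0, 35.2, 12.9, 21.3]})

# Stand-in for the frame that get_classification(name) would return.
labeling = pd.DataFrame({"Labeling": ["Rest", "Oscillation", "Oscillation", "Rest"]})

labeled = attach_labels(data, labeling)
mean_by_label = labeled.groupby("Labeling", observed=True)["temperature"].mean()
print(mean_by_label)
```

The length check guards against the case where the data in Visplore and the data frame in Python have diverged, for example after loading a different dataset.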
get_condition(name): Alternatively, this gets individual labels (like "Oscillation") that are part of a labeling, as a list of Boolean values (with True = part of the label, False = not part of the label):
# you may need to adjust the name of the label, if you named it differently
In Jupyter, for example, the result would look like this. The length of the list corresponds to the dataframe in Visplore. Somewhere in the middle, the list has 'True'-values where the oscillations were marked in Visplore:
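A Boolean list of this kind can be used directly as a row mask, or condensed into the intervals that were labeled. The sketch below works on a made-up list instead of a live result from get_condition(name). The helper `true_runs` is hypothetical, and the sketch assumes that the list has one entry per row, in the same order as the data frame.

```python
import pandas as pd


def true_runs(flags):
    """Return (first, last) positions of each consecutive run of True values."""
    runs = []
    start = None
    for position, flag in enumerate(flags):
        if flag and start is None:
            start = position
        elif not flag and start is not None:
            runs.append((start, position - 1))
            start = None
    if start is not None:
        # The last run extends to the end of the list.
        runs.append((start, len(flags) - 1))
    return runs


# Stand-in for the list that get_condition(name) would return.
flags = [False, False, True, True, True, False, True, False]
data = pd.DataFrame({"temperature": [20, 21, 35, 12, 33, 21, 30, 20]})

# Use the list as a row mask to keep only the labeled records.
oscillating = data.loc[flags]
runs = true_runs(flags)
print(runs)
```

The run positions are handy for reporting, for example to list the start and end of each labeled anomaly.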
Great! You have mastered the workflow for auditing and correcting labelings, and exported them for follow-up tasks.
>> Continue with Next lesson: Cleansing outliers and gaps
License Statement for the Photovoltaic and Weather dataset used for Screenshots:
"Contains public sector information licensed under the Open Government Licence v3.0."
Source of Dataset (in its original form): https://data.london.gov.uk/dataset/photovoltaic--pv--solar-panel-energy-generation-data
License: UK Open Government Licence OGL 3: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
Dataset was modified (e.g. columns renamed) for easier communication of Visplore USPs.