Exploring Data

Most machine learning models in use today rely on good-quality data. Without access to good data, you will struggle to build good models. So it is important to figure out as early as possible whether your data is any good and what you can do to improve it.

How to determine if your data is any good

Garbage in -> Garbage out

Before performing any analysis, use common sense to judge whether your data is any good. Here are a few questions worth answering about your data.

  • Is it reasonable that a human could find useful information in the given dataset given enough time?
  • Is the dataset properly labeled?
  • How much noise is present in the data?
  • Is the dataset large enough to be of use in machine learning models? (100 000 samples is a good rule of thumb)
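
Two of the questions above (size and labelling) can be checked mechanically. The following is a minimal sketch, assuming the data is loaded into a pandas DataFrame with one label column; `summarize_dataset` and the column names are made up for illustration.

```python
import pandas as pd


def summarize_dataset(df: pd.DataFrame, label_column: str) -> dict:
    """Return a few basic facts worth knowing before any modelling."""
    counts = df[label_column].value_counts()
    return {
        "n_samples": len(df),
        "n_features": df.shape[1] - 1,  # every column except the label
        "label_counts": counts.to_dict(),
        # Share of the most common label; close to 1.0 means heavy imbalance.
        "majority_share": float(counts.max() / len(df)),
    }


# Tiny made-up dataset, only to show the shape of the result.
toy = pd.DataFrame({"signal": [0.1, 0.4, 0.3, 0.9], "label": [0, 0, 0, 1]})
summary = summarize_dataset(toy, "label")
print(summary)
```

A very small sample count or a majority share close to 1.0 is a hint that the dataset needs more work before any model is trained on it.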

Labelling of a dataset

Labelling makes your dataset ready for a machine learning model by assigning each sample a value for whatever it is that you are measuring. In the famous MNIST example, the label is the correct answer to which digit was hand-drawn, but for a different dataset the label could be almost anything.

To use an example from brain imaging data: your label can be either high or low cognitive workload, assigned based on which task a participant performed.

This is already a source of error. Your dataset might be wrongly labelled for many reasons. Here are a couple.

The data scientist accidentally mislabelled the dataset

Perhaps they mixed something up and accidentally mislabelled your data.

The data scientist’s classifications are poorly made

Perhaps the tasks used to assign the labels were poorly constructed, so that the labels do not match what was actually measured. For instance, if we classify cognitive load in the brain as high or low, labelling people doing “nothing” as low cognitive load would be a mistake: the default mode network of the brain is active when we “do nothing”, so our brain can be quite busy even then.

How to label a dataset

Labelling unlabelled data is part of machine learning, and it is common to work with CSV files. Here is a simple way to add labels to two different datasets and combine them into a third dataset in the bash terminal. It is really important that you remember which label is which. Otherwise you will have a bad time.

Adding the labels for Easy (0) and Difficult (1)

awk '{printf("%s,0\n",$0)}' Easy > EasyL
awk '{printf("%s,1\n",$0)}' Difficult > DifficultL

Making into one file

cat EasyL DifficultL > CognitiveLoad
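
The same labelling step can also be sketched in Python with pandas. This is only an illustration under stated assumptions: `label_and_combine` and the `LABELS` mapping are hypothetical names, and the small in-memory frames stand in for the Easy and Difficult files.

```python
import pandas as pd

# Keep the meaning of each label in one place so it cannot be forgotten.
LABELS = {"Easy": 0, "Difficult": 1}


def label_and_combine(easy: pd.DataFrame, difficult: pd.DataFrame) -> pd.DataFrame:
    """Append a label column to each dataset and stack them into one."""
    easy = easy.assign(label=LABELS["Easy"])
    difficult = difficult.assign(label=LABELS["Difficult"])
    return pd.concat([easy, difficult], ignore_index=True)


# Made-up rows standing in for the contents of the two files.
easy = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]])
difficult = pd.DataFrame([[0.9, 0.8]])

combined = label_and_combine(easy, difficult)
print(combined)
```

With real files, the two frames could be loaded with `pd.read_csv("Easy", header=None)` and `pd.read_csv("Difficult", header=None)`, assuming the files have no header row.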

How to determine the level of noise in a dataset

There are many questions we should ask ourselves about a dataset to determine whether the data is of good quality. You can think of your dataset as data that contains information of varying quality. Your goal as a data scientist is to extract the information that is of value from the dataset.

How accurate is our data

One problem might be that our measurements aren’t very accurate. For instance, let’s imagine people filled out a form about how much they liked a certain company. Perhaps people got bored halfway through the form and answered the remaining questions carelessly. This makes the data less accurate.

How was the data acquired?

The best way is to get the dataset raw, ideally by acquiring the data yourself. Otherwise, data may already have been removed before it reached you. If you get the chance, always ask for the raw data; you can then remove what you don’t need yourself.

We got a dataset, now what?


Checking the distribution of your data is important for understanding it. Is it normally distributed, or does it follow another distribution?

Here is a list of distributions you can work with in Python: https://docs.scipy.org/doc/scipy/reference/stats.html#continuous-distributions
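
One possible way to check for normality is the Shapiro-Wilk test from scipy.stats. This is a minimal sketch: `looks_normal`, the 0.05 threshold and the synthetic samples are assumptions made for illustration, and a statistical test should be read together with a plot of the data.

```python
import numpy as np
from scipy import stats


def looks_normal(values, alpha=0.05):
    """Shapiro-Wilk test: True when normality is NOT rejected at level alpha."""
    statistic, p_value = stats.shapiro(values)
    return bool(p_value > alpha)


# Synthetic samples: one drawn from a normal distribution, one clearly skewed.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

print("normal sample looks normal:", looks_normal(normal_sample))
print("skewed sample looks normal:", looks_normal(skewed_sample))
# Skewness near 0 suggests symmetry; a large positive value means a long right tail.
print("skewness of skewed sample:", stats.skew(skewed_sample))
```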

Handle outliers

Extreme values can skew your dataset, so it is important to decide how to handle outliers. Here is a post detailing how to do it: https://medium.com/@dhwajraj/learning-python-regression-analysis-part-7-handling-outliers-in-data-d36ee9e2130b
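
One common rule of thumb for spotting outliers is the interquartile range (IQR) rule. This is a sketch of that rule, not necessarily the method used in the linked post; `iqr_outlier_mask`, the factor 1.5 and the readings are assumptions chosen for illustration.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True for values outside the IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Made-up sensor readings with one extreme value.
readings = np.array([10.0, 11.0, 12.0, 11.5, 10.5, 95.0])
mask = iqr_outlier_mask(readings)
cleaned = readings[~mask]

print("outliers:", readings[mask])
print("cleaned:", cleaned)
```

Whether an outlier should be removed, capped or kept depends on why it is there: a sensor fault is noise, while a rare but real event may be exactly the signal you are looking for.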

Check simple correlations

Figuring out how the different values in your dataset correlate helps you draw conclusions from your data.
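
A quick first look at correlations can be sketched with pandas, whose `DataFrame.corr` computes pairwise Pearson correlations by default. The column names and values below are made up for illustration.

```python
import pandas as pd

# Made-up data: two related columns and one unrelated column.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "shoe_size": [42, 38, 44, 39, 41],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one. Correlation only captures linear relationships and says nothing about cause.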

Visualising your data


What type of machine learning models should be used on the data?

Regression, classification, clustering (without labels), understanding (feature importance)

Checking bad data

Check for null, NaN, empty strings, weird strings (weird characters), and remove columns without data
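
Some of these checks can be sketched with pandas. This is a minimal example, assuming the data is in a DataFrame; `clean_frame` and the column names are hypothetical, and the check for weird characters is left out because it depends on what counts as weird in your data.

```python
import numpy as np
import pandas as pd


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Mark blank strings as missing and drop columns that hold no data."""
    # Treat empty or whitespace-only strings as missing values.
    cleaned = df.replace(r"^\s*$", np.nan, regex=True)
    # Drop columns in which every value is missing.
    return cleaned.dropna(axis="columns", how="all")


# Made-up data with a NaN, blank strings and a completely empty column.
raw = pd.DataFrame({
    "channel_1": [0.5, np.nan, 0.7],
    "note": ["ok", "", "   "],
    "unused": [np.nan, np.nan, np.nan],
})

cleaned = clean_frame(raw)
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Counting the missing values per column first, before deleting any rows, shows whether the gaps are spread evenly or concentrated in a few columns.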

The unsexy part of machine learning

The sexy part of machine learning might be building models. Sadly, that is a tiny part of your job as a data scientist. The big job is figuring out why the model that should be working doesn’t work. What could be wrong?

When we are working with AI it is important to learn how to troubleshoot. Everything that can go wrong will go wrong, so we just have to learn to deal with that.

The most common problems in machine learning

There is something wrong with your data

One of the most common problems in machine learning is that your data isn’t any good.

Remember? Garbage in = Garbage out

Depending on your dataset, many things could be wrong. So let’s take a moment to think: what could be wrong with our data?

Can you think of all the things that might be wrong with our data?

  • There was too much light in the room, which made the sensor have an overflow error.
  • The placement of the brain sensor was wrong, so it couldn’t detect activity in the brain properly.
  • There was something wrong with the sensor that detected the brain signal, so it couldn’t detect it properly.
  • There was something wrong with the program that gave us the data from the brain sensor, which corrupted the data.
  • A mistake was made while converting the data into a CSV file.
  • We wrote something wrong when we preprocessed the data (adding labels, removing missing values, shuffling the data).

Can you think of all the things that might be wrong with our machine learning model?

  • We didn’t assign our X and Y variables properly.
  • We are using a machine learning model that isn’t suitable for handling our data format.
  • Our activation function (currently ReLU) might be unsuitable.
  • Our loss function (currently mean_squared_error) might be unsuitable.
  • We need to tweak the number of layers in our neural network (currently a 16 neuron input layer, 4 hidden layers of 100 neurons each, and a 2 neuron output layer).
  • We need to add or remove dropout to deal with overfitting (currently done twice with 20% of the network).
  • We need to tweak our training parameters (validation_split=0.5, nb_epoch=5, batch_size=5). validation_split is the fraction of the training data that is held out for validation, nb_epoch is how many passes the network makes over the training data, and batch_size is how many samples are processed before the weights are updated.
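
One pitfall that connects the data checks and the training parameters: Keras documents that validation_split holds out the last fraction of the samples, taken before any shuffling. If the file was built by concatenating all Easy rows followed by all Difficult rows, an unshuffled 0.5 split would train on one class and validate on the other. The sketch below imitates that behaviour with NumPy only; `tail_split` is a hypothetical helper, not Keras code, and the tiny arrays are made up.

```python
import numpy as np


def tail_split(x, y, validation_split):
    """Hold out the last fraction of the samples, without shuffling."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])


# Labels in the order produced by concatenating Easy (0) and then Difficult (1).
y = np.array([0] * 6 + [1] * 6)
x = np.arange(len(y), dtype=float).reshape(-1, 1)

# Without shuffling, each half contains a single class.
(_, y_train), (_, y_val) = tail_split(x, y, validation_split=0.5)
print("unshuffled train labels:", y_train)
print("unshuffled validation labels:", y_val)

# Shuffling the rows first (same permutation for x and y) mixes the classes.
rng = np.random.default_rng(0)
order = rng.permutation(len(y))
(_, y_train_s), (_, y_val_s) = tail_split(x[order], y[order], validation_split=0.5)
print("shuffled train labels:", y_train_s)
print("shuffled validation labels:", y_val_s)
```

Shuffling the rows once before training is a cheap way to rule this problem out.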

Figure out if you have a problem with your data or your model

The reason it’s so important to work on data preprocessing is that you want to know it’s actually your neural network that is the problem before you spend all your time trying to fix it. I’ll save you the suspense for this dataset. To my knowledge, default mode network activation has never been reliably detected using fNIRS (the technology we are using for the dataset), so even with the world’s greatest neural network, some problems might still be impossible to solve. The reason, as always, has to do with the input data.

Artificial neural networks are (like our brain!) universal function approximators. This means that, given enough neurons, they can in theory approximate any continuous function to arbitrary accuracy. They can only do so, however, if the input data contains the information needed.

When you are working in deep learning, your neural networks are usually fine once you get them working, but getting the right data will always remain a big problem.

We need more data

Okay, so that didn’t work so well. Let’s try again with a larger dataset. This dataset uses data taken from the 52 Brodmann areas of the brain for 51 users.

The goal is to try to detect activation in the default mode network or task-positive networks in the brain with the help of fNIRS.