data science - tips

Hey All!

Nice to see you all! To start with, I have just moved to Amsterdam to begin my Master's in Data Science at UvA. I am suuuuuper excited - studying abroad someday was honestly one of my dreams, and it is so great that I managed to achieve it. So far we have had different assignments and projects, so I will keep updating you with the things that I learn :)! Today I would like to start with a brief sum-up of things from the first week of lectures.

How should we split a data set for building a model?

Training set - 80%, Validation - 10% and Test - 10% or 90/5/5

The training set is as simple as it sounds - it is the set used to fit the model. The validation set is used to check how good our model's performance is and is typically used for hyperparameter tuning. The test set, on the other hand, is held separately from the training and validation data and should stay completely unseen until the end of the entire analysis. With cross-validation, the training and validation sets are randomly re-split, but the test set stays the same. With time-ordered data, the split between sets follows the order of the data - we train on the past and test on the future.
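
As a quick illustration, here is a minimal sketch of the 80/10/10 split using scikit-learn's train_test_split (the toy data and the random_state are made up for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data just for this example: 1000 samples, 5 features
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# carve out the 10% test set first - it stays untouched until the very end
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42)

# split the remaining 90% into 80% train / 10% validation (1/9 of 90% = 10%)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=1/9, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```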

Types of Machine Learning

supervised - here we train our model using labeled data, so for example we might want to predict some time series (see the sketch after this list),

unsupervised - here we have data without labels, and for example we might want to cluster a dataset without prior knowledge about it,

reinforcement - one of the three basic machine learning techniques, along with supervised and unsupervised learning; reinforcement learning enables software-defined agents to learn the best possible actions in a virtual environment (for example DeepMind's algorithm AlphaGo), where the agents learn from the rewards they encounter,

zero / few-shot - training a model with a very small amount of labeled data (or, for zero-shot, none at all for some classes), mostly used in computer vision, for example when we want to categorize bird species from photos and some of the rarest species might not have enough photos,

weakly-supervised - used when we do not have reliably labeled data and instead make use of functions that label the data for us. There are three typical types of weak supervision: incomplete supervision, when only a subset of the training data is labeled; inexact supervision, when the training data are given with labels that are not as exact as desired; and inaccurate supervision, when some of the labels in the training data contain mistakes.

self-supervised - a means of training computers to do tasks without humans providing labeled data; the outputs or goals are derived by the machine itself, which labels, categorizes, and analyzes the information on its own and then draws conclusions based on connections and correlations.
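
To make the supervised vs. unsupervised distinction from the list above concrete, here is a small sketch (the toy data and labels are invented for the example): the logistic regression is fit on features and labels, while k-means only ever sees the features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # toy labels for the example

# supervised: the model is fit on features AND labels
clf = LogisticRegression().fit(X, y)
print("predicted labels:", clf.predict(X[:5]))

# unsupervised: k-means sees only the features, never the labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster ids:", km.labels_[:5])
```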

Error types

We might encounter two types of errors: statistical error (or statistical variance) - the random noise from repeated measurements, which generally sets the minimum error on a model's results and shrinks as we average more measurements - and systematic error (or statistical bias) - the error in measurement caused by anything that is not random; no matter how many measurements we take, the mean stays offset from the expected value.
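
A quick numpy simulation of the difference (the 0.5 offset and the unit noise level are numbers I made up for illustration) - averaging beats the statistical noise down, but the systematic offset survives no matter how large n gets:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0

for n in (10, 1_000, 100_000):
    # statistical error: zero-mean random noise, averages away as n grows
    noisy = true_value + rng.normal(0.0, 1.0, size=n)
    # systematic error: a constant 0.5 offset (e.g. a miscalibrated
    # instrument) that no amount of averaging can remove
    biased = true_value + 0.5 + rng.normal(0.0, 1.0, size=n)
    print(f"n={n:>6}: noisy mean={noisy.mean():.3f}, biased mean={biased.mean():.3f}")
```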

Gaussian approximation

The binomial distribution can be approximated with a normal distribution, where $$ \mu = np, \qquad \sigma^2 = np(1-p). $$ Thanks to that, we can approximate the statistical error of a count as $\sigma = \sqrt{np(1-p)}$.
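
A small numerical sanity check of this approximation (n, p and the sample size are arbitrary picks for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 0.3  # arbitrary values for the example

# draw many binomial counts and compare with the Gaussian approximation
samples = rng.binomial(n, p, size=100_000)
print("empirical mean / std:", samples.mean(), samples.std())
print("gaussian approx:     ", n * p, np.sqrt(n * p * (1 - p)))
```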

Final remarks

Once you have taken your dataset, cleaned it, modeled it and made your measurements, quoted the statistical errors and maybe even got an idea of the systematics, it is worth asking yourself whether the result is still accurate!



Thankssssssss,
szarki9