# Data leakages

### Week 2: Data leakages (How to Win a Data Science Competition: Learn from Top Kagglers)

The ID of a data point (row) in the train set correlates with the target variable.
The first half of the data points in the train set have a score of 0, while the second half have scores > 0.
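Both statements above describe the same symptom: the row order (or ID) carries information about the target. A minimal sketch of how one might check for this, assuming made-up toy data and a hypothetical helper `pearson` written here for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (invented for illustration): the first half of the rows has
# score 0, the second half has scores > 0.
row_ids = list(range(10))
target = [0, 0, 0, 0, 0, 2, 1, 3, 2, 1]

id_target_corr = pearson(row_ids, target)
print(f"correlation between row ID and target: {id_target_corr:.2f}")
```

A correlation far from zero between a supposedly meaningless ID and the target is a hint worth investigating, not proof of a leak on its own.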

Split the train, public, and private parts of the data by time. Remove the time variable from the test set but keep the features.
Split the train, public, and private parts of the data by time. Remove all features except IDs (e.g. timestamp) from the test set so that participants generate all the features from past data and join them themselves.
Make a time-based split for train/test and a random split for public/private.
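As an illustration of the last option above (time-based train/test, random public/private), here is a minimal sketch. The function name `split_time_series`, the `timestamp` key, the cutoff, and the 50/50 public/private ratio are all assumptions made for the example, not part of the course material.

```python
import random

def split_time_series(rows, cutoff, public_fraction=0.5, seed=0):
    """Split rows into train / public test / private test.

    Train vs. test is decided by time (rows before `cutoff` go to train);
    the test rows are then divided at random into public and private parts.
    """
    train = [r for r in rows if r["timestamp"] < cutoff]
    test = [r for r in rows if r["timestamp"] >= cutoff]

    # Shuffle a copy of the test rows so the public/private split is random.
    rng = random.Random(seed)
    shuffled = test[:]
    rng.shuffle(shuffled)

    n_public = int(len(shuffled) * public_fraction)
    return train, shuffled[:n_public], shuffled[n_public:]

# Toy data: 20 rows with integer timestamps 0..19.
rows = [{"id": i, "timestamp": i} for i in range(20)]
train, public, private = split_time_series(rows, cutoff=12)
print(len(train), len(public), len(private))  # prints: 12 4 4
```

The other two options split all three parts by time instead; they differ only in which columns are left in the test set.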

**Programming Assignment: Data leakages**

1. Suppose that you have a credit scoring task, where you have to create an ML model that approximates an expert evaluation of an individual’s creditworthiness. Which of the following can potentially be a data leak? Select all that apply.

2 points

2. What is the most foolproof way to set up a time series competition?

1 point

3. Suppose that you have a binary classification task evaluated by the logloss metric. You know that there are 10000 rows in the public chunk of the test set and that a constant prediction of 0.3 gives a public score of 1.01. The mean of the target variable in train is 0.44. What is the mean of the target variable in the public part of the test data (up to 4 decimal places)?

2 points

0.7711
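This answer can be reproduced from the definition of logloss. For a constant prediction p and target mean m, the score is L = -(m ln p + (1 - m) ln(1 - p)), which is linear in m and can be solved directly. The row count and the train mean are not needed for this calculation. A minimal sketch, with the hypothetical function name `target_mean_from_constant_logloss` chosen here for illustration:

```python
import math

def target_mean_from_constant_logloss(constant_pred, logloss):
    """Solve L = -(m*ln(p) + (1-m)*ln(1-p)) for the target mean m."""
    log_p = math.log(constant_pred)
    log_q = math.log(1 - constant_pred)
    # Rearranged: -L = m*(log_p - log_q) + log_q
    return (-logloss - log_q) / (log_p - log_q)

mean_public = target_mean_from_constant_logloss(0.3, 1.01)
print(round(mean_public, 4))  # prints: 0.7711
```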

4. Suppose that you are solving an image classification task. What is the label of this picture?

3

