Week 2: Exploratory data analysis >>> How to Win a Data Science Competition: Learn from Top Kagglers
1. Suppose we are given a data set with features X, Y, Z.
The top figure shows a scatter plot of variables X and Y. Variable Z is a function of X and Y, and the bottom figure shows a scatter plot of X and Z. Can you recover Z as a function of X and Y?
2. What Y value do the objects colored in red have?
3. The following code was used to produce these two plots:
# bottom plot
logX = np.log1p(x) # no NaNs after this operation
(Note that this is not the same variable X as in the previous questions.)
Which hypotheses about variable X do NOT contradict the plots? In other words, which hypotheses can we not reject (not in the statistical sense) based on the plots and our intuition?
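The comment in the code above says that np.log1p(x) produced no NaNs. As a rough illustration of what that rules out, the sketch below checks whether a candidate set of values would survive the same operation. The helper name log1p_is_nan_free and the sample values are hypothetical and not part of the quiz; the only fact relied on is that np.log1p returns NaN for inputs below -1 and -inf (not NaN) at exactly -1.

```python
import numpy as np

def log1p_is_nan_free(x):
    """Return True if np.log1p(x) contains no NaN values.

    Hypothetical helper for illustration only; it is not from the quiz.
    """
    x = np.asarray(x, dtype=float)
    # Silence the warnings NumPy emits for x < -1 (invalid) and x == -1 (divide).
    with np.errstate(invalid="ignore", divide="ignore"):
        return not np.isnan(np.log1p(x)).any()

# Made-up candidate values for the variable:
print(log1p_is_nan_free([0.0, 1.0, 5.0]))   # all values are >= -1
print(log1p_is_nan_free([-2.0, 0.0]))       # contains a value below -1
print(log1p_is_nan_free([-1.0, 0.0]))       # -1 maps to -inf, which is not NaN
```

Under this reading, any hypothesis that requires values below -1 would conflict with the "no NaNs" comment, while the remaining hypotheses still have to be judged against the plots themselves.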
4. Suppose we are given a dataset with features X and Y and need to learn to classify objects into 2 classes. The corresponding targets for the objects in the dataset are denoted y. The top-left plot shows a scatter plot of X vs. Y, produced with the following code:
# y is a target vector
plt.scatter(X, Y, c = y)
We use the target variable y to color-code the points.
The other three plots were produced by jittering the X and Y values:
def jitter(data, stdev):
    N = len(data)
    return data + np.random.randn(N) * stdev
# sigma is a given std. dev. for Gaussian distribution
plt.scatter(jitter(X, sigma), jitter(Y, sigma), c = y)
That is, we add Gaussian noise to the features before drawing the scatter plot.
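To see why jittering can change what a scatter plot reveals, the sketch below counts how many distinct marker positions exist before and after adding noise. The data here is made up (1,000 objects on a 3 x 3 integer grid), and the seed, the sigma value of 0.1, and the helper n_distinct_points are assumptions chosen for illustration. The jitter function follows the one above, except that it takes an explicit random generator so the result is reproducible.

```python
import numpy as np

def jitter(data, stdev, rng):
    """Return a copy of `data` with zero-mean Gaussian noise of std `stdev` added."""
    data = np.asarray(data, dtype=float)
    return data + rng.normal(0.0, stdev, size=data.shape)

def n_distinct_points(x, y):
    """Count distinct (x, y) positions, i.e. how many separate markers can be drawn."""
    return len(np.unique(np.column_stack([x, y]), axis=0))

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility

# Hypothetical data: integer-valued features, so many objects share a position.
X = rng.integers(0, 3, size=1000)
Y = rng.integers(0, 3, size=1000)

sigma = 0.1  # assumed noise level; the quiz varies this across the plots
before = n_distinct_points(X, Y)
after = n_distinct_points(jitter(X, sigma, rng), jitter(Y, sigma, rng))
print("distinct positions before jitter:", before)
print("distinct positions after jitter:", after)
```

Without jitter, objects that share a position are drawn on top of each other, so the plot shows only the color of whichever point was drawn last. With a small sigma each grid cell becomes a visible cloud, which makes the mix of classes inside it easier to judge. A sigma that is too large would blur neighboring cells together.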