Homework 1 Solution


Please review all homework guidance posted on the website before submitting to Gradescope. Reminders:

Please provide succinct answers along with succinct reasoning for all your answers. Points may be deducted if long answers demonstrate a lack of clarity. Similarly, when discussing the experimental results, concisely create tables and/or figures when appropriate to organize the experimental results. In other words, all your explanations, tables, and figures for any particular part of a question must be grouped together.

When submitting to Gradescope, please link each question from the homework in Gradescope to the location of its answer in your homework PDF. Failure to do so may result in point deductions. For instructions, see https://www.gradescope.com/get_started#student-submission.

Please recall that B problems, indicated in boxed text, are only graded for 546 students, and that they will be weighted at most 0.2 of your final GPA (see website for details). In Gradescope there is a place to submit solutions to A and B problems separately. You are welcome to create just a single PDF that contains answers to both and submit the same PDF twice, but associate the answers with the individual questions in Gradescope.

If you collaborate on this homework with others, you must indicate who you worked with on your homework. Failure to do so may result in accusations of plagiarism.

Short Answer and "True or False" Conceptual Questions

A.0. The following questions should be answerable without referring to external materials.

[2 points] In your own words, describe what bias and variance are. What is the bias-variance tradeoff? [2 points] What happens to bias and variance when the model complexity increases/decreases?

[1 point] True or False: The bias of a model increases as the amount of training data available increases.

[1 point] True or False: The variance of a model decreases as the amount of training data available increases.

[1 point] True or False: A learning algorithm will generalize better if we use fewer features to represent our data.

[2 points] To get better generalization, should we use the train set or the test set to tune our hyperparameters?

[1 point] True or False: The training error of a function on the training set provides an overestimate of the true error of that function.

Maximum Likelihood Estimation (MLE)

A.1. You're a Reign FC fan, and the team is five games into its 2018 season. The number of goals scored by the team in each game so far is given below:

$$[2, 0, 1, 1, 2].$$

Let's call these scores $x_1, \dots, x_5$. Based on your (assumed i.i.d.) data, you'd like to build a model to understand how many goals the Reign are likely to score in their next game. You decide to model the number of goals scored per game using a Poisson distribution. The Poisson distribution with parameter $\lambda$ assigns every non-negative integer $x = 0, 1, 2, \dots$ a probability given by

$$\mathrm{Poi}(x \mid \lambda) = e^{-\lambda} \frac{\lambda^x}{x!}.$$

So, for example, if $\lambda = 1.5$, then the probability that the Reign score 2 goals in their next game is $e^{-1.5} \cdot \frac{1.5^2}{2!} \approx 0.25$. To check your understanding of the Poisson, make sure you have a sense of whether raising $\lambda$ will mean more goals in general, or fewer.
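
As a quick numerical sanity check of this value (a sketch, assuming numpy/scipy are installed; not part of the graded work):

import math
from scipy.stats import poisson

lam = 1.5
print(poisson.pmf(2, lam))                            # Poi(2 | 1.5) via scipy, ~0.251
print(math.exp(-lam) * lam ** 2 / math.factorial(2))  # same value from the formula above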

a. [5 points] Derive an expression for the maximum-likelihood estimate of the parameter $\lambda$ governing the Poisson distribution, in terms of your goal counts $x_1, \dots, x_5$. (Hint: remember that the log of the likelihood has the same maximum as the likelihood function itself.)

b. [5 points] Suppose the team scores 4 goals in its sixth game. Derive the same expression for the estimate of the parameter $\lambda$ as in the prior part, now using the 6 games $x_1, \dots, x_5, x_6 = 4$.

c. [5 points] Given the goal counts, please give numerical estimates of $\lambda$ after 5 and 6 games.

A.2. [10 points] In World War 2, the Allies attempted to estimate the total number of tanks the Germans had manufactured by looking at the serial numbers of the German tanks they had destroyed. The idea was that if there were $n$ total tanks with serial numbers $\{1, \dots, n\}$, then it's reasonable to expect the observed serial numbers of the destroyed tanks constituted a uniform random sample (without replacement) from this set. The exact maximum likelihood estimator for this so-called German tank problem is non-trivial and quite challenging to work out (try it!). For our homework, we will consider a much easier problem with a similar flavor.

Let $x_1, \dots, x_n$ be independent, uniformly distributed on the continuous domain $[0, \theta]$ for some $\theta$. What is the maximum likelihood estimate for $\theta$?

Overfitting

A.3. Suppose we have $N$ labeled samples $S = \{(x_i, y_i)\}_{i=1}^N$ drawn i.i.d. from an underlying distribution $\mathcal{D}$. Suppose we decide to break this set into a set $S_{\text{train}}$ of size $N_{\text{train}}$ and a set $S_{\text{test}}$ of size $N_{\text{test}}$ samples for our training and test set, so $N = N_{\text{train}} + N_{\text{test}}$ and $S = S_{\text{train}} \cup S_{\text{test}}$. Recall the definition of the true least squares error of $f$:

$$\epsilon(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[(f(x) - y)^2\right],$$

where the subscript $(x,y) \sim \mathcal{D}$ makes clear that our input-output pairs are sampled according to $\mathcal{D}$. Our training and test losses are defined as:

$$\hat{\epsilon}_{\text{train}}(f) = \frac{1}{N_{\text{train}}} \sum_{(x,y) \in S_{\text{train}}} (f(x) - y)^2, \qquad \hat{\epsilon}_{\text{test}}(f) = \frac{1}{N_{\text{test}}} \sum_{(x,y) \in S_{\text{test}}} (f(x) - y)^2.$$

We then train our algorithm (for example, using linear least squares regression) using the training set to obtain $\hat{f}$.

a. [3 points] (bias: the test error) For all fixed $f$ (before we've seen any data) show that

$$\mathbb{E}_{\text{train}}[\hat{\epsilon}_{\text{train}}(f)] = \mathbb{E}_{\text{test}}[\hat{\epsilon}_{\text{test}}(f)] = \epsilon(f).$$

Use a similar line of reasoning to show that the test error is an unbiased estimate of our true error for $\hat{f}$. Specifically, show that:

$$\mathbb{E}_{\text{test}}[\hat{\epsilon}_{\text{test}}(\hat{f})] = \epsilon(\hat{f}).$$

b. [4 points] (bias: the train/dev error) Is the above equation true (in general) with regards to the training loss? Specifically, does $\mathbb{E}_{\text{train}}[\hat{\epsilon}_{\text{train}}(\hat{f})]$ equal $\mathbb{E}_{\text{train}}[\epsilon(\hat{f})]$? If so, why? If not, give a clear argument as to where your previous argument breaks down.

c. [8 points] Let $\mathcal{F} = (f_1, f_2, \dots)$ be a collection of functions and let $\hat{f}_{\text{train}}$ minimize the training error, such that $\hat{\epsilon}_{\text{train}}(\hat{f}_{\text{train}}) \le \hat{\epsilon}_{\text{train}}(f)$ for all $f \in \mathcal{F}$. Show that

$$\mathbb{E}_{\text{train}}[\hat{\epsilon}_{\text{train}}(\hat{f}_{\text{train}})] \le \mathbb{E}_{\text{train,test}}[\hat{\epsilon}_{\text{test}}(\hat{f}_{\text{train}})].$$

(Hint: note that

$$\mathbb{E}_{\text{train,test}}[\hat{\epsilon}_{\text{test}}(\hat{f}_{\text{train}})] = \sum_{f \in \mathcal{F}} \mathbb{E}_{\text{train,test}}\left[\hat{\epsilon}_{\text{test}}(f)\,\mathbf{1}\{\hat{f}_{\text{train}} = f\}\right] = \sum_{f \in \mathcal{F}} \mathbb{E}_{\text{test}}[\hat{\epsilon}_{\text{test}}(f)]\,\mathbb{E}_{\text{train}}\left[\mathbf{1}\{\hat{f}_{\text{train}} = f\}\right] = \sum_{f \in \mathcal{F}} \mathbb{E}_{\text{test}}[\hat{\epsilon}_{\text{test}}(f)]\,\mathbb{P}_{\text{train}}(\hat{f}_{\text{train}} = f),$$

where the second equality follows from the independence between the train and test set.)

Bias-Variance Tradeoff

B.1. For $i = 1, \dots, n$ let $x_i = i/n$ and $y_i = f(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, for some unknown $f$ we wish to approximate at the values $\{x_i\}_{i=1}^n$. We will approximate $f$ with a step function estimator. For some $m \le n$ such that $n/m$ is an integer, define the estimator

$$\hat{f}_m(x) = \sum_{j=1}^{n/m} c_j \mathbf{1}\left\{x \in \left(\tfrac{(j-1)m}{n}, \tfrac{jm}{n}\right]\right\}, \qquad \text{where } c_j = \frac{1}{m} \sum_{i=(j-1)m+1}^{jm} y_i.$$

Note that this estimator just partitions $\{1, \dots, n\}$ into intervals $\{1, \dots, m\}, \{m+1, \dots, 2m\}, \dots, \{n-m+1, \dots, n\}$ and predicts the average of the observations within each interval (see Figure 1).

Figure 1: Step function estimator with $n = 256$, $m = 16$, and $\sigma^2 = 1$.
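
To make the block-averaging concrete, here is a minimal sketch of how the fitted values of $\hat{f}_m$ can be computed (the function name and structure are illustrative, not part of any starter code; the example reuses the $f$ from part d):

import numpy as np

def step_estimator_values(y, m):
    # c_j is the average of the observations in the j-th block of m consecutive points;
    # the estimator predicts c_j at every x_i in that block.
    n = len(y)
    c = y.reshape(n // m, m).mean(axis=1)
    return np.repeat(c, m)

# Example matching Figure 1: n = 256, m = 16, sigma^2 = 1.
n, m = 256, 16
x = np.arange(1, n + 1) / n
f_true = 4 * np.sin(np.pi * x) * np.cos(6 * np.pi * x ** 2)
y = f_true + np.random.randn(n)
f_hat = step_estimator_values(y, m)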

By the bias-variance decomposition at some $x_i$ we have

$$\mathbb{E}\left[(\hat{f}_m(x_i) - f(x_i))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}_m(x_i)] - f(x_i)\right)^2}_{\text{Bias}^2(x_i)} + \underbrace{\mathbb{E}\left[\left(\hat{f}_m(x_i) - \mathbb{E}[\hat{f}_m(x_i)]\right)^2\right]}_{\text{Variance}(x_i)}.$$

a. [5 points] Intuitively, how do you expect the bias and variance to behave for small values of $m$? What about large values of $m$?

b. [5 points] If we define $\bar{f}^{(j)} = \frac{1}{m} \sum_{i=(j-1)m+1}^{jm} f(x_i)$ and the average bias-squared as $\frac{1}{n} \sum_{i=1}^{n} \left(\mathbb{E}[\hat{f}_m(x_i)] - f(x_i)\right)^2$, show that

$$\frac{1}{n} \sum_{i=1}^{n} \left(\mathbb{E}[\hat{f}_m(x_i)] - f(x_i)\right)^2 = \frac{1}{n} \sum_{j=1}^{n/m} \sum_{i=(j-1)m+1}^{jm} \left(\bar{f}^{(j)} - f(x_i)\right)^2.$$

c. [5 points] If we define the average variance as $\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} \left(\hat{f}_m(x_i) - \mathbb{E}[\hat{f}_m(x_i)]\right)^2\right]$, show (both equalities)

$$\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} \left(\hat{f}_m(x_i) - \mathbb{E}[\hat{f}_m(x_i)]\right)^2\right] = \frac{1}{n} \sum_{j=1}^{n/m} m\,\mathbb{E}\left[(c_j - \bar{f}^{(j)})^2\right] = \frac{\sigma^2}{m}.$$

d. [15 points] Let $n = 256$, $\sigma^2 = 1$, and $f(x) = 4\sin(\pi x)\cos(6\pi x^2)$. For values of $m = 1, 2, 4, 8, 16, 32$, plot the average empirical error $\frac{1}{n} \sum_{i=1}^{n} (\hat{f}_m(x_i) - f(x_i))^2$ using randomly drawn data, as a function of $m$ on the x-axis. On the same plot, using parts b and c above, plot the average bias-squared, the average variance, and their sum (the average error). Thus, there should be 4 lines on your plot, each described in a legend.

e. [5 points] By the Mean-Value theorem we have that $\min_{i=(j-1)m+1,\dots,jm} f(x_i) \le \bar{f}^{(j)} \le \max_{i=(j-1)m+1,\dots,jm} f(x_i)$. Suppose $f$ is $L$-Lipschitz so that $|f(x_i) - f(x_j)| \le \frac{L}{n}|i - j|$ for all $i, j \in \{1, \dots, n\}$ for some $L > 0$. Show that the average bias-squared is $O\!\left(\frac{L^2 m^2}{n^2}\right)$. Using the expression for average variance above, the total error behaves like $O\!\left(\frac{L^2 m^2}{n^2} + \frac{\sigma^2}{m}\right)$. Minimize this expression with respect to $m$. Does this value of $m$, and the total error when you plug this value of $m$ back in, behave in an intuitive way with respect to $n$, $L$, $\sigma^2$? That is, how does $m$ scale with each of these parameters? It turns out that this simple estimator (with the optimized choice of $m$) obtains the best achievable error rate, up to a universal constant, in this setup for this class of $L$-Lipschitz functions (see Tsybakov's Introduction to Nonparametric Estimation for details).

This setup of each $x_i$ deterministically placed at $i/n$ is a good approximation for the more natural setting where each $x_i$ is drawn uniformly at random from $[0, 1]$. In fact, one can redo this problem and obtain nearly identical conclusions, but the calculations are messier.

Polynomial Regression

Relevant files: polyreg.py, test_polyreg_univariate.py, test_polyreg_learningCurve.py, linreg_closedform.py, data/polydata.dat

A.4. [10 points] Recall that polynomial regression learns a function $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_d x^d$, where $d$ represents the polynomial's degree. We can equivalently write this in the form of a linear model

$$h_\theta(x) = \theta_0 \phi_0(x) + \theta_1 \phi_1(x) + \theta_2 \phi_2(x) + \dots + \theta_d \phi_d(x), \tag{1}$$

using the basis expansion $\phi_j(x) = x^j$. Notice that, with this basis expansion, we obtain a linear model where the features are various powers of the single univariate $x$. We're still solving a linear regression problem, but are fitting a polynomial function of the input.

Implement regularized polynomial regression in polyreg.py. You may implement it however you like, using gradient descent or a closed-form solution. However, I would recommend the closed-form solution since the data sets are small; for this reason, we've included an example closed-form implementation of linear regression in linreg_closedform.py (you are welcome to build upon this implementation, but make CERTAIN you understand it, since you'll need to change several lines of it). You are also welcome to build upon your implementation from the previous assignment, but you must follow the API below. Note that all matrices are actually 2D numpy arrays in the implementation.

__init__(degree=1, regLambda=1E-8): constructor with arguments of $d$ and $\lambda$

fit(X, Y): method to train the polynomial regression model

predict(X): method to use the trained polynomial regression model for prediction

polyfeatures(X, degree): expands the given $n \times 1$ matrix X into an $n \times d$ matrix of polynomial features of degree $d$. Note that the returned matrix will not include the zero-th power.

Note that the polyfeatures(X, degree) function maps the original univariate data into its higher-order powers. Specifically, X will be an $n \times 1$ matrix ($X \in \mathbb{R}^{n \times 1}$) and this function will return the polynomial expansion of this data, an $n \times d$ matrix. Note that this function will not add in the zero-th order feature (i.e., $x_0 = 1$). You should add the $x_0$ feature separately, outside of this function, before training the model. By not including the $x_0$ column in polyfeatures(), the polyfeatures function stays more general, so it could be applied to multi-variate data as well. (If it did add the $x_0$ feature, we'd end up with multiple columns of 1's for multivariate data.)
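
For instance, one minimal sketch of polyfeatures consistent with this description (your implementation may differ):

import numpy as np

def polyfeatures(X, degree):
    # Expand the n-by-1 array X into [X, X^2, ..., X^degree]; the column of ones
    # for the zero-th power is deliberately left out, per the note above.
    X = np.asarray(X).reshape(-1, 1)
    return np.hstack([X ** d for d in range(1, degree + 1)])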

A.5. [10 points] In this problem we will examine the bias-variance tradeoff through learning curves. Learning curves provide a valuable mechanism for evaluating the bias-variance tradeoff. Implement the learningCurve() function in polyreg.py to compute the learning curves for a given training/test set. The learningCurve(Xtrain, ytrain, Xtest, ytest, degree, regLambda) function should take in the training data (Xtrain, ytrain), the testing data (Xtest, ytest), and values for the polynomial degree $d$ and regularization parameter $\lambda$.

The function should return two arrays, errorTrain (the array of training errors) and errorTest (the array of testing errors). The $i$th index (starting from 0) of each array should return the training error (or testing error) for learning with $i + 1$ training instances. Note that the 0th index actually won't matter, since we typically start displaying the learning curves with two or more instances.

When computing the learning curves, you should learn on Xtrain[0:i] for $i = 1, \dots, \text{numInstances(Xtrain)} + 1$, each time computing the testing error over the entire test set. There is no need to shuffle the training data, or to average the error over multiple trials; just produce the learning curves for the given training/testing sets with the instances in their given order. Recall that the error for regression problems is given by

$$\frac{1}{n} \sum_{i=1}^{n} (h_\theta(x_i) - y_i)^2. \tag{2}$$
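
One possible shape of this loop, as a sketch (it assumes a PolynomialRegression class exposing the fit/predict API above; the class name and details are illustrative):

import numpy as np

def learningCurve(Xtrain, ytrain, Xtest, ytest, degree, regLambda):
    n = len(Xtrain)
    errorTrain, errorTest = np.zeros(n), np.zeros(n)
    for i in range(1, n):
        # index i of each array corresponds to learning on the first i+1 instances
        model = PolynomialRegression(degree=degree, regLambda=regLambda)
        model.fit(Xtrain[0:i + 1], ytrain[0:i + 1])
        errorTrain[i] = np.mean((model.predict(Xtrain[0:i + 1]) - ytrain[0:i + 1]) ** 2)
        errorTest[i] = np.mean((model.predict(Xtest) - ytest) ** 2)
    return errorTrain, errorTest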

Once the function is written to compute the learning curves, run the test_polyreg_learningCurve.py script to plot the learning curves for various values of $\lambda$ and $d$. You should see plots similar to the following:

Notice the following:

The y-axis is using a log scale and the ranges of the y-scale are all different for the plots. The dashed black line indicates the $y = 1$ line as a point of reference between the plots.

The plot of the unregularized model with $d = 1$ shows poor training error, indicating a high bias (i.e., it is a standard univariate linear regression fit).

The plot of the unregularized model ($\lambda = 0$) with $d = 8$ shows that the training error is low, but that the testing error is high. There is a huge gap between the training and testing errors caused by the model overfitting the training data, indicating a high variance problem.

As the regularization parameter increases (e.g., $\lambda = 1$) with $d = 8$, we see that the gap between the training and testing error narrows, with both the training and testing errors converging to a low value. We can see that the model fits the data well and generalizes well, and therefore does not have either a high bias or a high variance problem. Effectively, it has a good tradeoff between bias and variance.

Once the regularization parameter is too high ($\lambda = 100$), we see that the training and testing errors are once again high, indicating a poor fit. Effectively, there is too much regularization, resulting in high bias.

Make absolutely certain that you understand these observations, and how they relate to the learning curve plots.

In practice, we can choose the value for $\lambda$ via cross-validation to achieve the best bias-variance tradeoff.

Ridge Regression on MNIST

A.6. In this problem we will implement a regularized least squares classifier for the MNIST data set. The task is to classify handwritten images of numbers between 0 and 9.

You are NOT allowed to use any of the prebuilt classifiers in sklearn. Feel free to use any method from numpy or scipy. Remember: if you are inverting a matrix in your code, you are probably doing something wrong (Hint: look at scipy.linalg.solve).

Get the data from https://pypi.python.org/pypi/python-mnist.

Load the data as follows:

from mnist import MNIST
import numpy as np

def load_dataset():
    mndata = MNIST('./data/')
    X_train, labels_train = map(np.array, mndata.load_training())
    X_test, labels_test = map(np.array, mndata.load_testing())
    X_train = X_train / 255.0
    X_test = X_test / 255.0
    return X_train, labels_train, X_test, labels_test

Each example has features $x_i \in \mathbb{R}^d$ (with $d = 28 \times 28 = 784$) and label $z_i \in \{0, \dots, 9\}$. You can visualize a single example $x_i$ with imshow after reshaping it to its original $28 \times 28$ image shape (and noting that the label $z_i$ is accurate). We wish to learn a predictor $\hat{f}$ that takes as input a vector in $\mathbb{R}^d$ and outputs an index in $\{0, \dots, 9\}$. We define our training and testing classification error on a predictor $f$ as

$$\hat{\epsilon}_{\text{train}}(f) = \frac{1}{N_{\text{train}}} \sum_{(x,z) \in \text{Training Set}} \mathbf{1}\{f(x) \neq z\}, \qquad \hat{\epsilon}_{\text{test}}(f) = \frac{1}{N_{\text{test}}} \sum_{(x,z) \in \text{Test Set}} \mathbf{1}\{f(x) \neq z\}.$$

We will use one-hot encoding of the labels, i.e. for $(x, z)$ the original label $z \in \{0, \dots, 9\}$ is mapped to the standard basis vector $e_z$, where $e_z$ is a vector of all zeros except for a 1 in the $z$th position. We adopt the notation where we have $n$ data points in our training objective, with features $x_i \in \mathbb{R}^d$ and label one-hot encoded as $y_i \in \{0, 1\}^k$, where in this case $k = 10$ since there are 10 digits.
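
For example, the one-hot label matrix described here can be built as follows (a sketch; the function name is illustrative):

import numpy as np

def one_hot(labels, k=10):
    # Row i is e_{z_i}: all zeros except a 1 in position z_i.
    Y = np.zeros((len(labels), k))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y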

a. [10 points] In this problem we will choose a linear classifier to minimize the regularized least squares objective:

$$\widehat{W} = \operatorname*{argmin}_{W \in \mathbb{R}^{d \times k}} \sum_{i=1}^{n} \| W^T x_i - y_i \|_2^2 + \lambda \| W \|_F^2$$

Note that $\|W\|_F$ corresponds to the Frobenius norm of $W$, i.e. $\|W\|_F^2 = \sum_{i=1}^{d} \sum_{j=1}^{k} W_{i,j}^2$. To classify a point $x$ we will use the rule $\arg\max_{j=0,\dots,9} e_j^T \widehat{W}^T x$. Note that if $W = \begin{bmatrix} w_1 & \dots & w_k \end{bmatrix}$ then

$$\sum_{i=1}^{n} \| W^T x_i - y_i \|_2^2 + \lambda \| W \|_F^2 = \sum_{j=1}^{k} \left[ \sum_{i=1}^{n} \left( e_j^T W^T x_i - e_j^T y_i \right)^2 + \lambda \| W e_j \|^2 \right]$$

$$= \sum_{j=1}^{k} \left[ \sum_{i=1}^{n} \left( w_j^T x_i - e_j^T y_i \right)^2 + \lambda \| w_j \|^2 \right]$$

$$= \sum_{j=1}^{k} \left[ \| X w_j - Y e_j \|^2 + \lambda \| w_j \|^2 \right]$$

where $X = \begin{bmatrix} x_1 & \dots & x_n \end{bmatrix}^\top \in \mathbb{R}^{n \times d}$ and $Y = \begin{bmatrix} y_1 & \dots & y_n \end{bmatrix}^\top \in \mathbb{R}^{n \times k}$. Show that

$$\widehat{W} = (X^T X + \lambda I)^{-1} X^T Y.$$

b. [10 points]

Code up a function train that takes as input $X \in \mathbb{R}^{n \times d}$, $Y \in \{0,1\}^{n \times k}$, $\lambda > 0$ and returns $\widehat{W}$.

Code up a function predict that takes as input $W \in \mathbb{R}^{d \times k}$, $X' \in \mathbb{R}^{m \times d}$ and returns an $m$-length vector with the $i$th entry equal to $\arg\max_{j=0,\dots,9} e_j^T W^T x'_i$, where $x'_i$ is a column vector representing the $i$th example from $X'$.

Train $\widehat{W}$ on the MNIST training data with $\lambda = 10^{-4}$ and make label predictions on the test data. What is the training and testing error? Note that they should both be about 15%.
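
A minimal sketch of these two functions under the closed form from part a, using scipy.linalg.solve instead of an explicit inverse (one possible implementation, not the only acceptable one):

import numpy as np
from scipy import linalg

def train(X, Y, reg_lambda):
    # Solve (X^T X + lambda I) W = X^T Y for W, giving the d-by-k matrix W_hat.
    d = X.shape[1]
    return linalg.solve(X.T @ X + reg_lambda * np.eye(d), X.T @ Y)

def predict(W, Xprime):
    # For each row x'_i, return argmax_j e_j^T W^T x'_i, i.e. the highest-scoring digit.
    return np.argmax(Xprime @ W, axis=1)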

B.2

a. [10 points] We just fit a classifier that was linear in the pixel intensities to the MNIST data. For classification of digits the raw pixel values are very, very bad features: it's pretty hard to separate digits with linear functions in pixel space. The standard solution to this is to come up with some transform $h : \mathbb{R}^d \to \mathbb{R}^p$ of the original pixel values such that the transformed points are (more easily) linearly separable. In this problem, you'll use the feature transform:

$$h(x) = \cos(Gx + b),$$

where $G \in \mathbb{R}^{p \times d}$, $b \in \mathbb{R}^p$, and the cosine function is applied elementwise. We'll choose $G$ to be a random matrix, with each entry sampled i.i.d. from a Gaussian with mean $\mu = 0$ and variance $\sigma^2 = 0.1$, and $b$ to be a random vector sampled i.i.d. from the uniform distribution on $[0, 2\pi]$. The big question is: how do we choose $p$? Using cross-validation, of course!

Randomly partition your training set into proportions 80/20 to use as a new training set and validation set, respectively. Using the train function you wrote above, train a $\widehat{W}^p$ for different values of $p$ and plot the classification training error and validation error on a single plot with $p$ on the x-axis. Be careful: your computer may run out of memory and slow to a crawl if $p$ is too large ($p \le 6000$ should fit into 4 GB of memory, which is a minimum for most computers, but if you're having trouble you can set $p$ in the several hundreds). You can use the same value of $\lambda$ as above, but feel free to study the effect of using different values of $\lambda$ and $\sigma^2$ for fun.
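
A sketch of the random feature map and the 80/20 split, assuming X_train comes from load_dataset above (the value of p and the variable names are illustrative):

import numpy as np

p, d = 3000, X_train.shape[1]
G = np.sqrt(0.1) * np.random.randn(p, d)     # entries i.i.d. N(0, 0.1)
b = np.random.uniform(0, 2 * np.pi, size=p)

def h(X):
    return np.cos(X @ G.T + b)               # cosine applied elementwise

perm = np.random.permutation(X_train.shape[0])
split = int(0.8 * len(perm))
train_idx, val_idx = perm[:split], perm[split:]
H_train, H_val = h(X_train[train_idx]), h(X_train[val_idx])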

b. [5 points] Instead of reporting just the test error, which is an unbiased estimate of the true error, we would like to report a confidence interval around the test error that contains the true error.

Lemma 1. (Hoeffding's inequality) Fix $\delta \in (0, 1)$. If for all $i = 1, \dots, m$ we have that $X_i$ are i.i.d. random variables with $X_i \in [a, b]$ and $\mathbb{E}[X_i] = \mu$, then

$$\mathbb{P}\left( \left| \frac{1}{m} \sum_{i=1}^{m} X_i - \mu \right| \ge \sqrt{\frac{(b-a)^2 \log(2/\delta)}{2m}} \right) \le \delta.$$

We will use the above inequality to construct a confidence interval around the true classification error $\epsilon(\hat{f}) = \mathbb{E}_{\text{test}}[\hat{\epsilon}_{\text{test}}(\hat{f})]$, since the test error $\hat{\epsilon}_{\text{test}}(\hat{f})$ is just the average of indicator variables taking values in $\{0, 1\}$ corresponding to the $i$th test example being classified correctly or not, respectively, where an error happens with probability $\mu = \epsilon(\hat{f}) = \mathbb{E}_{\text{test}}[\hat{\epsilon}_{\text{test}}(\hat{f})]$, the true classification error.

Let $\hat{p}$ be the value of $p$ that approximately minimizes the validation error on the plot you just made, and use $\hat{f}(x) = \arg\max_j x^T \widehat{W}^{\hat{p}} e_j$ to compute the classification test error $\hat{\epsilon}_{\text{test}}(\hat{f})$. Use Hoeffding's inequality, above, to compute a confidence interval that contains $\mathbb{E}_{\text{test}}[\hat{\epsilon}_{\text{test}}(\hat{f})]$ (i.e., the true error) with probability at least 0.95 (i.e., $\delta = 0.05$). Report $\hat{\epsilon}_{\text{test}}(\hat{f})$ and the confidence interval.
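
A sketch of the resulting interval computation: here the $X_i$ of Lemma 1 are the 0/1 indicators of a test-set mistake, so $a = 0$, $b = 1$, $m = N_{\text{test}}$, and $\delta = 0.05$ (the function name is illustrative):

import numpy as np

def hoeffding_interval(test_error, m, delta=0.05, a=0.0, b=1.0):
    # Half-width from Lemma 1; [error - w, error + w] then contains the true error
    # with probability at least 1 - delta.
    w = np.sqrt((b - a) ** 2 * np.log(2 / delta) / (2 * m))
    return test_error - w, test_error + w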
