Homework 2 Solution

Conceptual Questions [10 points]

A0. The following questions should be answerable without referring to external materials. Briefly justify your answers with a few words.

a. [2 points] Suppose that your estimated model for predicting house prices has a large positive weight on "number of bathrooms". Does this imply that if we remove the feature "number of bathrooms" and refit the model, the new predictions will be strictly worse than before? Why?

b. [2 points] Compared to an L2 norm penalty, explain whether an L1 norm penalty is more likely to result in a larger number of zeros in the weight vector, and why.

c. [2 points] In at most one sentence each, state one possible upside and one possible downside of using the following regularizer: $\sum_i |w_i|^{0.5}$.

d. [1 point] True or False: If the step-size for gradient descent is too large, it may not converge.

e. [2 points] In your own words, describe why SGD works.

f. [2 points] In at most one sentence each, state one possible advantage of SGD (stochastic gradient descent) over GD (gradient descent) and one possible disadvantage of SGD relative to GD.

Convexity and Norms [30 points]

A1.

A norm $\|\cdot\|$ over $\mathbb{R}^n$ is defined by the properties: (i) non-negativity: $\|x\| \geq 0$ for all $x \in \mathbb{R}^n$, with equality if and only if $x = 0$; (ii) absolute scalability: $\|a x\| = |a| \, \|x\|$ for all $a \in \mathbb{R}$ and $x \in \mathbb{R}^n$; (iii) triangle inequality: $\|x + y\| \leq \|x\| + \|y\|$ for all $x, y \in \mathbb{R}^n$.

a. [3 points] Show that $f(x) = \sum_{i=1}^n |x_i|$ is a norm. (Hint: begin by showing that $|a + b| \leq |a| + |b|$ for all $a, b \in \mathbb{R}$.)

b. [2 points] Show that $g(x) = \left( \sum_{i=1}^n |x_i|^{1/2} \right)^2$ is not a norm. (Hint: it suffices to find two points in $n = 2$ dimensions such that the triangle inequality does not hold.)

Context: norms are often used in regularization to encourage specific behaviors of solutions. If we define $\|x\|_p := \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}$, then one can show that $\|x\|_p$ is a norm for all $p \geq 1$. The important cases of $p = 2$ and $p = 1$ correspond to the penalty for ridge regression and the lasso, respectively.
B1. [6 points] For any $x \in \mathbb{R}^n$, define the following norms: $\|x\|_1 = \sum_{i=1}^n |x_i|$, $\|x\|_2 = \sqrt{\sum_{i=1}^n |x_i|^2}$, and $\|x\|_\infty := \lim_{p \to \infty} \|x\|_p = \max_{i=1,\ldots,n} |x_i|$. Show that $\|x\|_\infty \leq \|x\|_2 \leq \|x\|_1$.

A2. [3 points] A set $A \subseteq \mathbb{R}^n$ is convex if $\lambda x + (1 - \lambda) y \in A$ for all $x, y \in A$ and $\lambda \in [0, 1]$.

For each of the grey-shaded sets above (I-III), state whether each one is convex, or state why it is not convex using any of the points a, b, c, d in your answer.

A3. [4 points] We say a function $f : \mathbb{R}^d \to \mathbb{R}$ is convex on a set $A$ if $f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$ for all $x, y \in A$ and $\lambda \in [0, 1]$.

For each of the grey-colored functions below (I-III), state whether each one is convex on the given interval or state why not with a counterexample using any of the points a, b, c, d in your answer.

a. Function in panel I on [a, c]

b. Function in panel II on [a, c]

c. Function in panel III on [a, d]

d. Function in panel III on [c, d]

B2. Use just the definitions above and let $\|\cdot\|$ be a norm.

a. [3 points] Show that $f(x) = \|x\|$ is a convex function.

b. [3 points] Show that $\{x \in \mathbb{R}^n : \|x\| \leq 1\}$ is a convex set.

c. [2 points] Draw a picture of the set $\{(x_1, x_2) : g(x_1, x_2) \leq 4\}$ where $g(x_1, x_2) = \left( |x_1|^{1/2} + |x_2|^{1/2} \right)^2$. (This is the function considered in 1b above specialized to $n = 2$.) We know $g$ is not a norm. Is the defined set convex? Why not?

Context: It is a fact that a function $f$ defined over a set $A \subseteq \mathbb{R}^n$ is convex if and only if the set $\{(x, z) \in \mathbb{R}^{n+1} : z \geq f(x), x \in A\}$ is convex. Draw a picture of this for yourself to be sure you understand it.

B3. For $i = 1, \ldots, n$ let $\ell_i(w)$ be convex functions over $w \in \mathbb{R}^d$ (e.g., $\ell_i(w) = (y_i - w^\top x_i)^2$), let $\|\cdot\|$ be any norm, and let $\lambda > 0$.

a. [3 points] Show that

$$\sum_{i=1}^n \ell_i(w) + \lambda \|w\|$$

is convex over $w \in \mathbb{R}^d$. (Hint: show that if $f, g$ are convex functions, then $f(x) + g(x)$ is also convex.)

b. [1 point] Explain in one sentence why we prefer to use loss functions and regularized loss functions that are convex.

Lasso [45 points]

Given $\lambda > 0$ and data $(x_1, y_1), \ldots, (x_n, y_n)$, the Lasso is the problem of solving

$$\arg\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \; \sum_{i=1}^n \left( x_i^\top w + b - y_i \right)^2 + \lambda \sum_{j=1}^d |w_j|$$

where $\lambda$ is a regularization tuning parameter. For the programming part of this homework, you are required to implement the coordinate descent method of Algorithm 1 that can solve the Lasso problem.

You may use common computing packages (such as NumPy or SciPy), but do not use an existing Lasso solver (e.g., that of scikit-learn).

Before you get started, here are some hints that you may find helpful:

Algorithm 1: Coordinate Descent Algorithm for Lasso

while not converged do
    $b \leftarrow \frac{1}{n} \sum_{i=1}^n \left( y_i - \sum_{j=1}^d w_j x_{i,j} \right)$
    for $k \in \{1, 2, \ldots, d\}$ do
        $a_k \leftarrow 2 \sum_{i=1}^n x_{i,k}^2$
        $c_k \leftarrow 2 \sum_{i=1}^n x_{i,k} \left( y_i - \left( b + \sum_{j \neq k} w_j x_{i,j} \right) \right)$
        $w_k \leftarrow \begin{cases} (c_k + \lambda)/a_k & \text{if } c_k < -\lambda \\ 0 & \text{if } c_k \in [-\lambda, \lambda] \\ (c_k - \lambda)/a_k & \text{if } c_k > \lambda \end{cases}$
    end
end
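For concreteness, here is a minimal NumPy sketch of one sweep of Algorithm 1; the function name coordinate_descent_pass and the exact argument layout are illustrative assumptions, not a required interface.

import numpy as np

def coordinate_descent_pass(X, y, w, b, lam):
    # One sweep of Algorithm 1: X is (n, d), y is (n,), w is (d,), b is a scalar,
    # and lam is the regularization parameter lambda.
    n, d = X.shape
    # Offset update: b is the mean residual under the current w.
    b = np.mean(y - X @ w)
    # a_k does not depend on w or b, so it can be precomputed once.
    a = 2.0 * np.sum(X ** 2, axis=0)
    for k in range(d):
        # Residual that excludes feature k's contribution.
        r = y - (b + X @ w - X[:, k] * w[k])
        c_k = 2.0 * np.dot(X[:, k], r)
        # Soft-thresholding update for w_k.
        if c_k < -lam:
            w[k] = (c_k + lam) / a[k]
        elif c_k > lam:
            w[k] = (c_k - lam) / a[k]
        else:
            w[k] = 0.0
    return w, b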

For-loops can be slow whereas vector/matrix computation in Numpy is very optimized; exploit this as much as possible.

The pseudocode provided has many opportunities to speed up computation by precomputing quantities like ak before the for loop. These small changes can speed things up considerably.

As a sanity check, ensure the objective value is nonincreasing with each step.

It is up to you to decide on a suitable stopping condition. A common criterion is to stop when no element of $w$ changes by more than some small $\delta$ during an iteration. If you need your algorithm to run faster, an easy place to start is to loosen this condition.

You will need to solve the Lasso on the same dataset for many values of $\lambda$. This is called a regularization path. One way to do this efficiently is to start at a large $\lambda$, and then for each consecutive solution, initialize the algorithm with the previous solution, decreasing $\lambda$ by a constant ratio (e.g., by a factor of 2) until finished.

The smallest value of $\lambda$ for which the solution $\widehat{w}$ is entirely zero is given by

$$\lambda_{\max} = \max_{k=1,\ldots,d} 2 \left| \sum_{i=1}^n x_{i,k} \left( y_i - \frac{1}{n} \sum_{j=1}^n y_j \right) \right|$$

This is helpful for choosing the first $\lambda$ in a regularization path.
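As a short sketch (the function name lasso_lambda_max is just illustrative), this formula takes only a couple of NumPy lines:

import numpy as np

def lasso_lambda_max(X, y):
    # lambda_max = max_k 2 | sum_i x_{i,k} (y_i - mean(y)) |
    residual = y - np.mean(y)
    return np.max(2.0 * np.abs(X.T @ residual))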

A4. We will first try out your solver with some synthetic data. A benefit of the Lasso is that if we believe many features are irrelevant for predicting $y$, the Lasso can be used to enforce a sparse solution, effectively differentiating between the relevant and irrelevant features. Suppose that $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, $k < d$, and pairs of data $(x_i, y_i)$ for $i = 1, \ldots, n$ are generated independently according to the model $y_i = w^\top x_i + \epsilon_i$ where

$$w_j = \begin{cases} j/k & \text{if } j \in \{1, \ldots, k\} \\ 0 & \text{otherwise} \end{cases}$$

and $\epsilon_i \sim N(0, \sigma^2)$ is some Gaussian noise (in the model above $b = 0$). Note that since $k < d$, the features $k + 1$ through $d$ are unnecessary (and potentially even harmful) for predicting $y$.

With this model in mind, let $n = 500$, $d = 1000$, $k = 100$, and $\sigma = 1$. Generate some data by choosing $x_i \in \mathbb{R}^d$, where each component is drawn from a $N(0, 1)$ distribution, and $y_i$ generated as specified above.
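One possible way to generate such a dataset (the seed and variable names are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)                    # arbitrary seed
n, d, k, sigma = 500, 1000, 100, 1.0
w_true = np.zeros(d)
w_true[:k] = np.arange(1, k + 1) / k              # w_j = j/k for j = 1, ..., k
X = rng.standard_normal((n, d))                   # each component ~ N(0, 1)
y = X @ w_true + sigma * rng.standard_normal(n)   # y_i = w^T x_i + eps_i (b = 0)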

a. [10 points] With your synthetic data, solve multiple Lasso problems on a regularization path, starting at $\lambda_{\max}$ where 0 features are selected and decreasing $\lambda$ by a constant ratio (e.g., 1.5) until nearly all the features are chosen. In plot 1, plot the number of non-zeros as a function of $\lambda$ on the x-axis (Tip: use plt.xscale('log')).

b. [10 points] For each value of $\lambda$ tried, record values for the false discovery rate (FDR) (number of incorrect nonzeros in $\widehat{w}$ / total number of nonzeros in $\widehat{w}$) and the true positive rate (TPR) (number of correct nonzeros in $\widehat{w}$ / $k$). In plot 2, plot these values with FDR on the x-axis and TPR on the y-axis. Note that in an ideal situation we would have an (FDR, TPR) pair in the upper left corner, but that we can always trivially achieve $(0, 0)$ and $\left(\frac{d - k}{d}, 1\right)$. (A small helper for computing these two rates is sketched after this list.)

c. [5 points] Comment on the effect of $\lambda$ in these two plots.
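A small helper for the rates in part b might look like the following sketch; it assumes, as in the synthetic model above, that the true nonzeros are the first k coordinates.

import numpy as np

def fdr_tpr(w_hat, k):
    # FDR = incorrect nonzeros / total nonzeros; TPR = correct nonzeros / k.
    nonzero = np.flatnonzero(w_hat)
    if nonzero.size == 0:
        return 0.0, 0.0                  # no features selected at all
    incorrect = np.sum(nonzero >= k)     # nonzeros outside the first k features
    correct = np.sum(nonzero < k)        # nonzeros among the first k features
    return incorrect / nonzero.size, correct / k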

A5. Now we put the Lasso to work on some real data. Download the training data set "crime-train.txt" and the test data set "crime-test.txt" from the website under Homework 2. Store your data in your working directory and read in the files with:

import pandas as pd

df_train = pd.read_table("crime-train.txt")
df_test = pd.read_table("crime-test.txt")

This stores the data as Pandas DataFrame objects. DataFrames are similar to Numpy arrays but more flexible; unlike Numpy arrays, they store row and column indices along with the values of the data. Each column of a DataFrame can also, in principle, store data of a different type. For this assignment, however, all data are floats. Here are a few commands that will get you working with Pandas for this assignment:

df.head()               # Print the first few lines of DataFrame df.
df.index                # Get the row indices for df.
df.columns              # Get the column indices.
df["foo"]               # Return the column named "foo".
df.drop("foo", axis=1)  # Return all columns except "foo".
df.values               # Return the values as a Numpy array.
df["foo"].values        # Grab column foo and convert to Numpy array.
df.iloc[:3, :3]         # Use numerical indices (like Numpy) to get 3 rows and cols.

The data consist of local crime statistics for 1,994 US communities. The response y is the crime rate. The name of the response variable is ViolentCrimesPerPop, and it is held in the first column of df_train and df_test. There are 95 features. These features include possibly relevant variables such as the size of the police force or the percentage of children that graduate high school. The data have been split for you into a training and test set with 1,595 and 399 entries, respectively.
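For example, one way (among others) to pull NumPy arrays out of these DataFrames for your solver; the column name ViolentCrimesPerPop comes from the description above, and the variable names are otherwise arbitrary:

y_train = df_train["ViolentCrimesPerPop"].values
X_train = df_train.drop("ViolentCrimesPerPop", axis=1).values
y_test = df_test["ViolentCrimesPerPop"].values
X_test = df_test.drop("ViolentCrimesPerPop", axis=1).values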

We’d like to use this training set to fit a model which can predict the crime rate in new communities and evaluate model performance on the test set. As there are a considerable number of input variables, overfitting is a serious issue. In order to avoid this, use the coordinate descent LASSO algorithm you just implemented in the previous problem.

Begin by running the LASSO solver with $\lambda = \lambda_{\max}$ defined above. For the initial weights, just use 0. Then, cut $\lambda$ down by a factor of 2 and run again, but this time pass in the values of $\widehat{w}$ from your $\lambda = \lambda_{\max}$ solution as your initial weights. This is faster than initializing with 0 weights each time. Continue the process of cutting $\lambda$ by a factor of 2 until the smallest value of $\lambda$ is less than 0.01. For all plots use a log-scale for the $\lambda$ dimension (Tip: use plt.xscale('log')).
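A sketch of this warm-started regularization path is shown below; train_lasso stands in for whatever you named your coordinate-descent solver, and lasso_lambda_max refers to the earlier sketch.

import numpy as np

lam = lasso_lambda_max(X_train, y_train)
w = np.zeros(X_train.shape[1])     # initial weights of 0 for the first lambda
b = 0.0
lambdas, solutions = [], []
while lam >= 0.01:
    # Warm start: pass the previous (w, b) in as the initial iterate.
    w, b = train_lasso(X_train, y_train, lam, w_init=w, b_init=b)
    lambdas.append(lam)
    solutions.append(w.copy())
    lam /= 2                       # cut lambda down by a factor of 2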

a. [4 points] Plot the number of nonzeros of each solution versus $\lambda$.

b. [4 points] Plot the regularization paths (in one plot) for the coefficients for input variables agePct12t29, pctWSocSec, pctUrban, agePct65up, and householdsize.

c. [4 points] Plot the squared error on the training and test data versus $\lambda$.

d. [4 points] Sometimes a larger value of $\lambda$ performs nearly as well as a smaller value, but a larger value of $\lambda$ will select fewer variables and perhaps be more interpretable. Inspect the weights (on features) for $\lambda = 30$. Which feature variable had the largest (most positive) Lasso coefficient? What about the most negative? Discuss briefly. A description of the variables in the data set can be found here: http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.names.

e. [4 points] Suppose there was a large negative weight on agePct65up and, upon seeing this result, a politician suggests policies that encourage people over the age of 65 to move to high crime areas in an effort to reduce crime. What is the (statistical) flaw in this line of reasoning? (Hint: fire trucks are often seen around burning buildings; do fire trucks cause fires?)

Logistic Regression

Binary Logistic Regression [30 points]

A6. Let us again consider the MNIST dataset, but now just binary classification: specifically, recognizing whether a digit is a 2 or a 7. Here, let $Y = 1$ for all the 7's in the dataset, and use $Y = -1$ for the 2's. The features have been standardized to have mean 0 and variance 1. We will use regularized logistic regression. Given a binary classification dataset $\{(x_i, y_i)\}_{i=1}^n$ for $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, we showed in class that the regularized negative log likelihood objective function can be written as

$$J(w, b) = \frac{1}{n} \sum_{i=1}^n \log\left( 1 + \exp(-y_i (b + x_i^\top w)) \right) + \lambda \|w\|_2^2$$

Note that the offset term $b$ is not regularized. For all experiments, use $\lambda = 10^{-1}$. Let $\mu_i(w, b) = \frac{1}{1 + \exp(-y_i (b + x_i^\top w))}$.

a. [8 points] Derive the gradients $\nabla_w J(w, b)$ and $\nabla_b J(w, b)$, and give your answers in terms of $\mu_i(w, b)$ (your answers should not contain exponentials).

b. [8 points] Implement gradient descent with an initial iterate of all zeros. Try several values of step size to find one that appears to make convergence on the training set as fast as possible. Run until you feel you are near to convergence. (A structural sketch of such a loop appears after this list.)

i. For both the training set and the test set, plot $J(w, b)$ as a function of the iteration number (and show both curves on the same plot).

ii. For both the training set and the test set, classify the points according to the rule $\operatorname{sign}(b + x_i^\top w)$ and plot the misclassification error as a function of the iteration number (and show both curves on the same plot).

Note that you are only optimizing on the training set. The $J(w, b)$ and misclassification error plots should be on separate plots.

c. [7 points] Repeat (b) using stochastic gradient descent with a batch size of 1. Note that the expected gradient with respect to the random selection should be equal to the gradient found in part (a). Take careful note of how to scale the regularizer.

d. [7 points] Repeat (b) using stochastic gradient descent with a batch size of 100. That is, instead of approximating the gradient with a single example, use 100. Note that the expected gradient with respect to the random selection should be equal to the gradient found in part (a).
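As a structural sketch of part b (step size and iteration count are placeholder values you would tune, and the gradient lines simply encode one derivation of the form asked for in part a):

import numpy as np

def logistic_gd(X, y, lam=1e-1, step_size=0.1, num_iters=1000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0                        # initial iterate of all zeros
    objective = []
    for _ in range(num_iters):
        z = b + X @ w
        # J(w, b) at the current iterate; logaddexp(0, t) = log(1 + exp(t)) is stable.
        objective.append(np.mean(np.logaddexp(0.0, -y * z)) + lam * np.dot(w, w))
        mu = 1.0 / (1.0 + np.exp(-y * z))          # mu_i(w, b) as defined in the text
        grad_w = -(X.T @ ((1.0 - mu) * y)) / n + 2.0 * lam * w
        grad_b = -np.mean((1.0 - mu) * y)          # the offset b is not regularized
        w -= step_size * grad_w
        b -= step_size * grad_b
    return w, b, objective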

B4. Multinomial Logistic Regression [25 points]

We’ve talked a lot about binary classification, but what if we have $k > 2$ classes, like the 10 digits of MNIST? Concretely, suppose you have a dataset $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \ldots, k\}$. Like in our least squares classifier of Homework 1 for MNIST, we will assign a separate weight vector $w^{(\ell)}$ for each class $\ell = 1, \ldots, k$; let $W = [w^{(1)}, \ldots, w^{(k)}] \in \mathbb{R}^{d \times k}$. We can generalize the binary classification probabilistic model to multiple classes as follows: let

$$P(y_i = \ell \mid W, x_i) = \frac{\exp(w^{(\ell)} \cdot x_i)}{\sum_{j=1}^k \exp(w^{(j)} \cdot x_i)}$$

The negative log-likelihood function is equal to

$$\mathcal{L}(W) = -\sum_{i=1}^n \sum_{\ell=1}^k \mathbf{1}\{y_i = \ell\} \log\left( \frac{\exp(w^{(\ell)} \cdot x_i)}{\sum_{j=1}^k \exp(w^{(j)} \cdot x_i)} \right)$$

Define the $\operatorname{softmax}(\cdot)$ operator to be the function that takes in a vector $\theta \in \mathbb{R}^d$ and outputs a vector in $\mathbb{R}^d$ whose $i$th component is equal to $\frac{\exp(\theta_i)}{\sum_{j=1}^d \exp(\theta_j)}$. Clearly, this vector is nonnegative and sums to one. If for any $i$ we have $\theta_i \gg \max_{j \neq i} \theta_j$, then $\operatorname{softmax}(\theta)$ approximates $e_i$, a vector of all zeros with a one in the $i$th component.
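As a side note, a direct NumPy transcription of this definition (with the standard max-subtraction trick for numerical stability, which does not change the output):

import numpy as np

def softmax(theta):
    # Subtracting max(theta) avoids overflow and leaves the result unchanged.
    exp_shifted = np.exp(theta - np.max(theta))
    return exp_shifted / np.sum(exp_shifted)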

For each $y_i$ let $\widetilde{y}_i$ be the one-hot encoding of $y_i$ (i.e., $\widetilde{y}_i \in \{0, 1\}^k$ is a vector of all zeros aside from a 1 in the $y_i$th index).

a. [5 points] If $\widehat{y}_i(W) = \operatorname{softmax}(W^\top x_i)$, show that $\nabla_W \mathcal{L}(W) = -\sum_{i=1}^n x_i \left( \widetilde{y}_i - \widehat{y}_i(W) \right)^\top$.

b. [5 points] Recall the Ridge Regression on MNIST problem from Homework 1 and define $J(W) = \frac{1}{2} \sum_{i=1}^n \|\widetilde{y}_i - W^\top x_i\|_2^2$. If $\widehat{y}_i(W) = W^\top x_i$, show that $\nabla_W J(W) = -\sum_{i=1}^n x_i \left( \widetilde{y}_i - \widehat{y}_i(W) \right)^\top$. Comparing the least squares linear regression gradient step of this part to the gradient step of minimizing the negative log likelihood of the logistic model of part a may shed light on why we call this classification problem logistic regression.

c. [15 points] Using the original representation of the MNIST flattened images $x_i \in \mathbb{R}^d$ ($d = 28 \times 28 = 784$) and all $k = 10$ classes, implement gradient descent (or stochastic gradient descent) for both $J(W)$ and $\mathcal{L}(W)$ and run until convergence on the training set of MNIST. For each of the two solutions, report the classification accuracy of each on the training and test sets using the most natural $\arg\max_j e_j^\top W^\top x_i$ classification rule.

We highly encourage you to use PyTorch for this problem! The base object in PyTorch is the torch.tensor, which is essentially a numpy.ndarray but with some powerful new features. Namely, tensors have accelerator support (GPU, TPU) and high-performance autodifferentiation. Don’t worry too much about the details of PyTorch! We will discuss PyTorch and the torch.autograd package in greater detail once we get to neural networks! At a high-level though, torch.autograd allows us to automatically calculate the gradients of our model parameters with minimal additional cost. Yep, that’s right! Your days of writing out gradients by hand are numbered! :D

We include some starter pseudocode for the negative log-likelihood + softmax portion of this question. You are expected to find good hyperparameters. You can install the library at https://pytorch.org/ and access the relevant beginner tutorial here.

import torch

W = torch.zeros(784, 10, requires_grad=True)

for epoch in range(epochs):
    y_hat = torch.matmul(X_train, W)
    # cross_entropy combines the softmax calculation with NLLLoss
    loss = torch.nn.functional.cross_entropy(y_hat, y_train)
    # compute derivatives of the loss with respect to W
    loss.backward()
    # gradient descent update
    W.data = W.data - step_size * W.grad
    # .backward() accumulates gradients into W.grad instead of
    # overwriting, so we need to zero out the gradients after each update
    W.grad.zero_()
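After training, a minimal sketch of evaluating the arg-max classification rule on held-out data (X_test and y_test are assumed to be torch tensors shaped like X_train and y_train):

with torch.no_grad():
    # arg max_j (W^T x)_j for each test image, compared against the true labels
    predictions = torch.argmax(torch.matmul(X_test, W), dim=1)
    accuracy = (predictions == y_test).float().mean().item()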