Homework Assignment 2

Starting point Your repository will now have a directory 'homework2/'. Please do not change the name of this repository or the names of any files we have added to it. Please perform a git pull to retrieve these files. You will find within it:

python script logistic_prog.py, which you will be amending by adding code for questions in Sect. 4.

python script dnn_misc.py, which you will be amending by adding code for questions in Sect. 5.

python script dnn_cnn_2.py, which you will be amending by adding code for questions in Sect. 5.

Various other python scripts: dnn_mlp.py, dnn_mlp_nononlinear.py, dnn_cnn.py, and dnn_im2col.py, which you are not allowed to modify.

Various scripts: q43.sh, q53.sh, q54.sh, q55.sh, q56.sh, q57.sh, q58.sh, and q510.sh; you will use these to generate output files.

Dataset: mnist_subset.json

Submission Instructions The following will constitute your submission:

The three python scripts above, amended with the code you added for Sect. 4 and Sect. 5. Be sure to commit your changes!

A PDF report named firstname_lastname_USCID.pdf, which contains your solutions for the questions in Sect. 1 and Sect. 2. For your written report, your answers must be typeset with LaTeX and generated as a PDF file. No handwritten submission will be permitted. There are many free integrated LaTeX editors that are convenient to use; please search for them and choose the one(s) you like the most. This http://www.andy-roberts.net/writing/latex seems like a good tutorial.

Eight .json files, which will be the output of the eight scripts above. We reserve the right to run your code to regenerate these files, but you are expected to include them:

logistic_res.json

MLP_lr0.01_m0.0_w0.0_d0.0.json

MLP_lr0.01_m0.0_w0.0_d0.5.json

MLP_lr0.01_m0.0_w0.0_d0.95.json

LR_lr0.01_m0.0_w0.0_d0.0.json

CNN_lr0.01_m0.0_w0.0_d0.5.json

CNN_lr0.01_m0.9_w0.0_d0.5.json

CNN2_lr0.001_m0.9_w0.0_d0.5.json

Collaboration You may discuss with your classmates. However, you need to write your own solutions and submit them separately. Also, in your written report, you need to list, for each problem, the classmates with whom you discussed it. Please consult the syllabus for what is and is not acceptable collaboration.


[Figure: x → u → h → a → (softmax) → z → ŷ, labeled left to right as input features, hidden layer, output layer, and predicted label.]

Figure 1: A diagram of a 1-hidden-layer multi-layer perceptron (MLP). The edges denote mathematical operations and the circles denote variables. Generally, we call the combination of a linear (or affine) operation and a nonlinear operation (such as the element-wise sigmoid or the rectified linear unit (relu) operation in eq. (3)) a hidden layer.

Algorithmic component

1 Neural networks (error-backpropagation, initialization, and non-linearity)

[Recommended maximum time spent: 1 hour]

In the lecture (see lec8.pdf), we have talked about error-backpropagation, a way to compute partial derivatives (or gradients) w.r.t. the parameters of a neural network. We have also mentioned that optimization is challenging and that nonlinearity is important for neural networks. In this question, you are going to (Q1.1) practice error-backpropagation, (Q1.2) investigate how initialization affects optimization, and (Q1.3) investigate the importance of nonlinearity.

Specifically, you are given the following 1-hidden-layer multi-layer perceptron (MLP) for a K-class classification problem (see Fig. 1 for illustration and details), and (x ∈ R^D, y ∈ {1, 2, ..., K}) is a labeled instance:

x ∈ R^D                                                                  (1)

u = W^(1) x + b^(1),    W^(1) ∈ R^{M×D} and b^(1) ∈ R^M                  (2)

h = max{0, u} = [max{0, u_1}, ..., max{0, u_M}]^T                        (3)

a = W^(2) h + b^(2),    W^(2) ∈ R^{K×M} and b^(2) ∈ R^K                  (4)

z = [ e^{a_1} / Σ_k e^{a_k}, ..., e^{a_K} / Σ_k e^{a_k} ]^T              (5)

ŷ = argmax_k z_k                                                         (6)
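For intuition only, here is a minimal numpy sketch of the forward pass in eqs. (1)-(6); the dimensions and random parameters below are made-up assumptions, and this is not code you need to submit:

    import numpy as np

    D, M, K = 4, 8, 10                                   # assumed sizes: input dim, hidden dim, number of classes
    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((M, D)), np.zeros(M)    # W^(1) in R^{M x D}, b^(1) in R^M
    W2, b2 = rng.standard_normal((K, M)), np.zeros(K)    # W^(2) in R^{K x M}, b^(2) in R^K

    x = rng.standard_normal(D)                           # eq. (1): x in R^D
    u = W1 @ x + b1                                      # eq. (2)
    h = np.maximum(0.0, u)                               # eq. (3): element-wise relu
    a = W2 @ h + b2                                      # eq. (4)
    z = np.exp(a - np.max(a)); z /= z.sum()              # eq. (5): softmax (shifted by max(a) for stability)
    y_hat = int(np.argmax(z))                            # eq. (6)
    print(y_hat, z.sum())                                # z sums to 1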

For a K-class classification problem, one popular loss function for training is the cross-entropy loss,


l = - Σ_k 1[y == k] log z_k,                                             (7)

where 1[True] = 1; otherwise, 0.                                         (8)

For ease of notation, let us define the one-hot (i.e., 1-of-K) encoding

y ∈ R^K with y_k = 1 if y = k, and y_k = 0 otherwise,                    (9)

so that

l = - Σ_k y_k log z_k = - y^T [log z_1, ..., log z_K]^T = - y^T log z.   (10)

Q1.1 Assume that you have computed u, h, a, z, given (x, y). Please first express ∂l/∂u in terms of ∂l/∂a, u, h, and W^(2):

∂l/∂u = ?

Then express ∂l/∂a in terms of z and y:

∂l/∂a = ?

Finally, compute ∂l/∂W^(1) and ∂l/∂b^(1) in terms of ∂l/∂u and x, and compute ∂l/∂W^(2) in terms of ∂l/∂a and h:

∂l/∂W^(1) = ?

∂l/∂b^(1) = ?

∂l/∂W^(2) = ?


You only need to write down the final answers for the above 5 question marks. You are encouraged to use matrix/vector forms to simplify your answers. Note that max{0, u} is not differentiable w.r.t. u at u = 0. Please note that

∂ max{0, u} / ∂u = 0 if u ≤ 0, and 1 if u > 0,                           (11)

which stands for the Heaviside step function. You can use

∂ max{0, u} / ∂u = H(u)                                                  (12)

in your derivation of ∂l/∂u.

You can also use ⊙ to represent the element-wise product between two vectors or matrices. For example,

v ⊙ c = [v_1 c_1, ..., v_I c_I]^T ∈ R^I, where v ∈ R^I and c ∈ R^I.      (13)

Also note that the partial derivatives of the loss function w.r.t. the variables (e.g., a scalar, a vector, or a matrix) will have the same shape as the variables.
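For concreteness, a tiny numpy illustration of the Heaviside mask in eq. (12) and the element-wise product in eq. (13) (the array values are arbitrary examples):

    import numpy as np

    u = np.array([-1.5, 0.0, 2.0, 3.5])        # example pre-activation vector
    H_u = (u > 0).astype(float)                # Heaviside mask H(u), eq. (12): 0 where u <= 0, 1 where u > 0
    v = np.array([1.0, 2.0, 3.0, 4.0])
    c = np.array([0.5, 0.5, 2.0, 0.0])
    elementwise = v * c                        # element-wise product of eq. (13); numpy's * is element-wise
    print(H_u, elementwise)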

What to submit: No more than 5 lines of derivation for each of the 5 partial derivatives.

Q1.2 Suppose we initialize W^(1), W^(2), b^(1) with zero matrices/vectors (i.e., matrices and vectors with all elements set to 0). Please first verify that ∂l/∂W^(1), ∂l/∂W^(2), ∂l/∂b^(1) are all zero matrices/vectors, irrespective of x, y, and the initialization of b^(2).

Now if we perform stochastic gradient descent for learning the neural network using a training set {(x_i ∈ R^D, y_i ∈ R^K)}_{i=1}^N, please explain with a concise mathematical statement in one sentence why no learning will happen on W^(1), W^(2), b^(1) (i.e., they will not change no matter how many iterations are run). Note that this will still be the case even with weight decay and momentum, if the initial velocity vectors/matrices are set to zero.

What to submit: No submission for the verification question. Your concise mathematical statement in one sentence for the explanation question.

Q1.3 As mentioned in the lecture (see lec8.pdf), nonlinearity is very important for neural networks. With nonlinearity (e.g., eq. (3)), the neural network shown in Fig. 1 can be seen as a nonlinear basis function (i.e., φ(x) = h) followed by a linear classifier f (i.e., f(h) = ŷ).

Please show that, by removing the nonlinear operation in eq. (3) and setting eq. (4) to be a = W^(2) u + b^(2), the resulting network is essentially a linear classifier. More specifically, you can now represent a as Ux + v, where U ∈ R^{K×D} and v ∈ R^K. Please write down the representation


of U and v using W^(1), W^(2), b^(1), and b^(2):

U = ?

v = ?

What to submit: No more than 2 lines of derivation for each of the question marks.

2 Kernel methods

[Recommended maximum time spent: 1 hour]

In the lecture (see lec10.pdf), we have seen the "kernelization" of the regularized least squares problem. The "kernelization" process depends on an important observation: the optimal model parameter can be expressed as a linear combination of the transformed features. You are now asked to prove a more general case.

Consider a convex loss function of the form ℓ(w^T φ(x), y), where φ(x) ∈ R^M is a nonlinear feature mapping, and y is a label or a continuous response value.

Now solve the regularized loss minimization problem on a training set D = {(x_1, y_1), ..., (x_N, y_N)}:

min_w Σ_n ℓ(w^T φ(x_n), y_n) + (λ/2) ||w||_2^2                           (14)

Q2.1 Show that the optimal solution of w can be represented as a linear combination of the training samples. You can assume ℓ(s, y) is differentiable w.r.t. s, i.e., during the derivation you can use the derivative ∂ℓ(s, y)/∂s and assume it is a known quantity.

What to submit: Your derivation (fewer than 10 lines) and the optimal solution of w.

Q2.2 Assume the combination coefficient is α_n for n = 1, ..., N. Rewrite the loss function in Eqn. 14 in terms of α_n and the kernel function values K_ij = k(x_i, x_j).

What to submit: Your objective function in terms of α and K.
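As a purely illustrative aside, the kernel values K_ij form an N × N Gram matrix; a minimal numpy sketch with a hypothetical RBF kernel (the kernel choice and names here are assumptions, not part of the assignment):

    import numpy as np

    def rbf_kernel(xi, xj, gamma=0.5):
        # k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2); the RBF choice here is only an example
        return np.exp(-gamma * np.sum((xi - xj) ** 2))

    X = np.random.randn(5, 3)                            # 5 toy samples with 3 features each
    N = X.shape[0]
    K = np.array([[rbf_kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    print(K.shape)                                       # (5, 5), with K_ij = k(x_i, x_j)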

Q2.3 After you obtain the general formulations from Q2.1 and Q2.2, please plug in three different loss functions we have seen so far, and examine what you get.

square loss:         ℓ(w^T φ(x), y) = (1/2) ||y - w^T φ(x)||_2^2,                              y ∈ R         (15)

cross entropy loss:  ℓ(w^T φ(x), y) = -( y log[σ(w^T φ(x))] + (1 - y) log[1 - σ(w^T φ(x))] ),  y ∈ {0, 1}    (16)

perceptron loss:     ℓ(w^T φ(x), y) = max(-y w^T φ(x), 0),                                     y ∈ {-1, 1}   (17)

What to submit: Nothing.


Programming component

3 High-level descriptions

3.1 Dataset

We will use mnist_subset (images of handwritten digits from 0 to 9). This is the same subset of the full MNIST that we used for Homework 1. As before, the dataset is stored in a JSON-formatted file mnist_subset.json. You can access its training, validation, and test splits using the keys 'train', 'valid', and 'test', respectively. For example, suppose we load mnist_subset.json into the variable x. Then, x['train'] refers to the training set of mnist_subset. This set is a list with two elements: x['train'][0] contains the features, of size N (samples) × D (dimension of features), and x['train'][1] contains the corresponding labels, of size N.
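For reference, a minimal sketch of loading the dataset as described above, assuming mnist_subset.json sits in the current working directory (the variable names below are only illustrative):

    import json
    import numpy as np

    with open('mnist_subset.json', 'r') as f:            # path is assumed; adjust if you run from elsewhere
        x = json.load(f)

    Xtrain = np.asarray(x['train'][0])                   # shape (N, D): features
    ytrain = np.asarray(x['train'][1])                   # shape (N,): labels in {0, ..., 9}
    Xvalid, yvalid = np.asarray(x['valid'][0]), np.asarray(x['valid'][1])
    Xtest, ytest = np.asarray(x['test'][0]), np.asarray(x['test'][1])
    print(Xtrain.shape, ytrain.shape)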

3.2 Tasks

You will be asked to implement 10-way classification using multinomial logistic regression (Sect. 4) and neural networks (Sect. 5). Specifically, you will

• finish the implementation of all python functions in our template code;

• run your code by calling the specified scripts to generate output files;

• add, commit, and push (1) all *.py files, and (2) all *.json files that you have created.

In the next two subsections, we will provide a high-level checklist of what you need to do. Furthermore, as in the previous homework, you are not responsible for loading/pre-processing the data; we have done that for you. For specific instructions, please refer to the text in Sect. 4 and Sect. 5, as well as the corresponding python scripts.

3.2.1 Multi-class Classification

Coding In logistic_prog.py, finish implementing the following functions: logistic_train_ovr, logistic_test_ovr, logistic_mul_train, and logistic_mul_test. Refer to logistic_prog.py and Sect. 4 for more information.

Running your code Run the script q43.sh after you finish your implementation. This will output logistic_res.json.

What to submit Submit both logistic_prog.py and logistic_res.json.

3.2.2 Neural networks

Preparation Read dnn_mlp.py and dnn_cnn.py.

Coding

First, in dnn_misc.py, finish implementing

• the forward and backward functions in class linear_layer,

• the forward and backward functions in class relu,


• the backward function in class dropout (before that, please read the forward function).

Refer to dnn_misc.py and Sect. 5 for more information.

Second, in dnn_cnn_2.py, finish implementing the main function. There are three TODO items.

Refer to dnn_cnn_2.py and Sect. 5 for more information.

Running your code Run the scripts q53.sh, q54.sh, q55.sh, q56.sh, q57.sh, q58.sh, and q510.sh after you finish your implementation. This will generate, respectively,

MLP_lr0.01_m0.0_w0.0_d0.0.json

MLP_lr0.01_m0.0_w0.0_d0.5.json

MLP_lr0.01_m0.0_w0.0_d0.95.json

LR_lr0.01_m0.0_w0.0_d0.0.json

CNN_lr0.01_m0.0_w0.0_d0.5.json

CNN_lr0.01_m0.9_w0.0_d0.5.json

CNN2_lr0.001_m0.9_w0.0_d0.5.json

What to submit Submit dnn_misc.py, dnn_cnn_2.py, and the seven .json files that are generated from the scripts.

3.3 Cautions

Please do not import packages that are not listed in the provided code. Follow the instructions in each section strictly to code up your solutions. Do not change the output format. Do not modify the code unless we instruct you to do so. A homework solution that does not match the provided setup, such as format, name, initializations, etc., will not be graded. It is your responsibility to make sure that your code runs with the provided commands and scripts on the VM. Finally, make sure that you git add, commit, and push all the required files, including your code and generated output files.

3.4 Advice

We are extensively using the softmax and sigmoid functions in this homework. To avoid numerical issues such as overflow and underflow caused by numpy.exp() and numpy.log(), please use the following implementations:

• Let x be an input vector to the softmax function. Use x̃ = x - max(x) instead of x directly in the softmax function f, i.e.,

  f(x̃)_i = exp(x̃_i) / Σ_{j=1}^{D} exp(x̃_j).

• If you are using numpy.log(), make sure the input to the log function is positive. Also, there may be cases where one of the outputs of the softmax, e.g. f(x̃)_i, is extremely small but you need the value ln(f(x̃)_i); you can convert that computation into x̃_i - ln(Σ_{j=1}^{D} exp(x̃_j)).
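A minimal numpy sketch of this max-subtraction trick (the function name is hypothetical, not from the template code):

    import numpy as np

    def stable_softmax(x):
        # x: 1-D score vector; subtracting max(x) leaves the softmax unchanged but avoids overflow in exp
        x_shift = x - np.max(x)
        exps = np.exp(x_shift)
        probs = exps / np.sum(exps)
        # a numerically safe log-probability uses the same shift: log f(x)_i = x_shift_i - log(sum_j exp(x_shift_j))
        log_probs = x_shift - np.log(np.sum(exps))
        return probs, log_probs

    probs, log_probs = stable_softmax(np.array([1000.0, 1001.0, 1002.0]))
    print(probs, log_probs)                    # no overflow despite the large raw scores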

We have implemented and run the code ourselves without any problems, so if you follow the instructions and settings provided in the python files, you should not encounter overflow or underflow.


4 Multi-class Classification

You will modify 4 python functions in logistic_prog.py. First, you will implement two functions that train and test a one-versus-rest multi-class classification model. Second, you will implement two functions that train and test a multinomial logistic regression model. Finally, you will run the command that trains and tests the two models using your implemented functions, and our code will automatically store your results in logistic_res.json.

Coding: One-versus-rest

Q4.1 Implement the code to solve the multi-class classification task with the one-versus-rest strategy. That is, train 10 binary logistic regression models following the setting provided in class: for each class C_k, k = 1, ..., 10, we create a binary classification problem as follows:

• Re-label training samples with label C_k as positive (namely 1).

• Re-label other samples as negative (namely 0).

We wrote functions to load, relabel, and sample the data for you, so you are not responsible for doing it.
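Just to make the relabeling concrete, a tiny sketch of the idea (again, the provided code already does this; the toy labels are arbitrary):

    import numpy as np

    ytrain = np.array([3, 0, 3, 7, 1, 3])      # toy labels in {0, ..., 9}
    k = 3                                      # class C_k currently treated as "positive"
    y_binary = (ytrain == k).astype(int)       # 1 for samples of class k, 0 for everything else
    print(y_binary)                            # [1 0 1 0 0 1]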

Training Finish the implementation of the function logistic_train_ovr(Xtrain, ytrain, w, b, step_size, max_iterations). As in the previous homework, we have pre-defined the hyper-parameters and initializations in the template code. Moreover, you will use the AVERAGE of the gradients from all training samples to update the parameters.

Testing Finish the implementation of the function logistic_test_ovr(Xtest, w_l, b_l). This function should return the predicted probability, i.e., the value output by the logistic function without thresholding, instead of the 0/1 label. Formally, for each test data point x_i, we get its final prediction by ŷ_i = argmax_{k ∈ {1,...,10}} f_k(x_i), where ŷ_i is the predicted label and f_k(x_i) is the probability predicted by the k-th logistic regression model f_k. Then, you compute the classification accuracy as follows:

Acc = ( Σ_{i=1}^{N_test} 1[ŷ_i == y_i] ) / N_test,                       (18)

where y_i is the ground-truth label of x_i and N_test is the total number of test data instances.
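A minimal numpy sketch of the prediction rule and the accuracy in Eqn. 18 (illustrative only; here probs is a hypothetical array of per-class predicted probabilities with one row per test point):

    import numpy as np

    probs = np.array([[0.1, 0.7, 0.2],         # toy example with 3 classes instead of 10
                      [0.5, 0.3, 0.2],
                      [0.2, 0.2, 0.6]])
    ytest = np.array([1, 0, 2])
    ypred = np.argmax(probs, axis=1)           # y_hat_i = argmax_k f_k(x_i)
    acc = np.mean(ypred == ytest)              # Eqn. (18): fraction of correct predictions
    print(acc)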

What to do and submit: Your logistic_prog.py with completed logistic_train_ovr and logistic_test_ovr.

Coding: Multinomial logistic regression

Q4.2 Implement multinomial logistic regression, training a 10-way classifier (with the softmax function) on the mnist_subset dataset.


Training Finish the implementation of the function logistic_mul_train(Xtrain, ytrain, w, b, step_size, max_iterations). Again, we have pre-defined the hyper-parameters and initializations in the template code. Moreover, you will use the AVERAGE of the gradients from all training samples to update the parameters.

Testing Finish the implementation of the function logistic_mul_test(Xtest, w_l, b_l). For each test data point x_i, compute ŷ = argmax_{k ∈ {1,...,10}} p(y = k | x), where p(y = k | x) is the probability predicted by the multinomial logistic regression model. Then, compute the accuracy following Eqn. 18.

What to do and submit: Your logistic_prog.py with completed logistic_mul_train and logistic_mul_test.

Training and generating output files from both the one-versus-rest and multinomial logistic regression models

Q4.3 What to do and submit: Run script q43.sh. It will generate logistic_res.json. Add, commit, and push both logistic_prog.py and logistic_res.json before the due date. What it does: q43.sh will run python3 logistic_prog.py. This will train your models (for both Q4.1 and Q4.2 above) and test the trained models (for both Q4.1 and Q4.2 above). The output file stores the accuracies of both models.

5 Neural networks: multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs)

In recent years, neural networks have been among the most powerful machine learning models. Many toolboxes/platforms (e.g., TensorFlow, PyTorch, Torch, Theano, MXNet, Caffe, CNTK) are publicly available for efficiently constructing and training neural networks. The core idea of these toolboxes is to treat a neural network as a combination of data transformation modules. For example, in Fig. 2, the edges correspond to module names of the same neural network shown in Fig. 1 and Sect. 1.

Now we will provide more information on the modules for this homework. Each module has its own parameters (though note that a module may have no parameters). Moreover, each module can perform a forward pass and a backward pass. The forward pass performs the computation of the module, given the input to the module. The backward pass computes the partial derivatives of the loss function w.r.t. the input and parameters, given the partial derivatives of the loss function w.r.t. the output of the module. Consider a module <module_name>. Let <module_name>.forward and <module_name>.backward be its forward and backward passes, respectively.


[Figure: x → linear(1) → u → relu → h → linear(2) → a → softmax → z → ŷ; input features at x, predicted label at ŷ.]

Figure 2: A diagram of a 1-hidden-layer multi-layer perceptron (MLP), with modules indicated on the edges. The circles correspond to variables. The rectangles shown in Fig. 1 are removed for clarity. The term relu stands for rectified linear units.

For example, the linear module may be defined as follows.

forward pass:

u = linear(1).forward(x) = W^(1) x + b^(1),                              (19)

where W^(1) and b^(1) are its parameters.

backward pass:

[∂l/∂x, ∂l/∂W^(1), ∂l/∂b^(1)] = linear(1).backward(x, ∂l/∂u).            (20)

Let us assume that we have implemented all the desired modules. Then, getting ŷ for x is equivalent to running the forward pass of each module in order, given x. All the intermediate variables (i.e., u, h, etc.) will be computed along the forward pass. Similarly, getting the partial derivatives of the loss function w.r.t. the parameters is equivalent to running the backward pass of each module in reverse order, given ∂l/∂z.
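To illustrate the forward/backward calling pattern in the abstract (this toy module is an assumption for illustration, not the interface used in dnn_misc.py):

    import numpy as np

    class ScaleByTwo:
        """Toy module with no parameters: forward computes 2*x; backward applies the chain rule."""
        def forward(self, x):
            self.x = x                         # cache the input for the backward pass
            return 2.0 * x

        def backward(self, grad_out):
            # given dl/d(output), return dl/d(input); since output = 2*x, dl/dx = 2 * dl/d(output)
            return 2.0 * grad_out

    m = ScaleByTwo()
    out = m.forward(np.array([1.0, -3.0]))     # forward pass
    grad_in = m.backward(np.ones_like(out))    # backward pass, given dl/d(out) = [1, 1]
    print(out, grad_in)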

In this question, we provide a Python environment based on the idea of modules. Every module is defined as a class, so you can create multiple modules of the same functionality by creating multiple object instances of the same class. Your work is to finish the implementation of several modules, where these modules are elements of a multi-layer perceptron (MLP) or a convolutional neural network (CNN). We will apply these models to the same 10-class classification problem introduced in Sect. 4. We will train the models using mini-batch stochastic gradient descent and explore how different optimizer hyperparameters and regularization techniques affect training and validation accuracies over training epochs. For deeper understanding, check out, e.g., the seminal work of Yann LeCun et al., "Gradient-based learning applied to document recognition," written in 1998.

We give a specific example below. Suppose that, at iteration t, you sample a mini-batch of N examples {(x_i ∈ R^D, y_i ∈ R^K)}_{i=1}^N from the training set (K = 10). Then, the loss of such a mini-batch given by Fig. 2 is


l_mb = (1/N) Σ_{i=1}^{N} l(softmax.forward(linear(2).forward(relu.forward(linear(1).forward(x_i)))), y_i)    (21)

     = (1/N) Σ_{i=1}^{N} l(softmax.forward(linear(2).forward(relu.forward(u_i))), y_i)                       (22)

     = (1/N) Σ_{i=1}^{N} l(softmax.forward(linear(2).forward(h_i)), y_i)                                     (23)

     = (1/N) Σ_{i=1}^{N} l(softmax.forward(a_i), y_i)                                                        (24)

     = -(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} y_{ik} log z_{ik}.                                                     (25)

That is, in the forward pass, we can perform the computation of a certain module on all N input examples, and then pass the N output examples to the next module. The same holds for the backward pass. For example, according to Fig. 2, if we are now to pass the partial derivatives of the loss w.r.t. {a_i}_{i=1}^N to linear(2).backward, then

∂l_mb / ∂{a_i}_{i=1}^N = [ (∂l_mb/∂a_1)^T ; (∂l_mb/∂a_2)^T ; ... ; (∂l_mb/∂a_{N-1})^T ; (∂l_mb/∂a_N)^T ].    (26)

linear(2).backward will then compute ∂l_mb / ∂{h_i}_{i=1}^N and pass it back to relu.backward.
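For intuition, a small numpy sketch of the mini-batch loss in eq. (25), where Z is a toy N × K matrix of softmax outputs and Y the matching one-hot labels (both arrays are made up, not from the template code):

    import numpy as np

    Z = np.array([[0.7, 0.2, 0.1],             # toy softmax outputs z_i for N = 2 examples, K = 3 classes
                  [0.1, 0.8, 0.1]])
    Y = np.array([[1, 0, 0],                   # one-hot labels y_i
                  [0, 1, 0]])
    N = Z.shape[0]
    l_mb = -np.sum(Y * np.log(Z)) / N          # eq. (25): average cross-entropy over the mini-batch
    print(l_mb)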

Preparation

Q5.1 Please read through dnn_mlp.py and dnn_cnn.py. Both files use modules defined in dnn_misc.py (which you will modify). Your work is to understand how modules are created, how they are linked to perform the forward and backward passes, and how parameters are updated based on gradients (and momentum). The architectures of the MLP and CNN defined in dnn_mlp.py and dnn_cnn.py are shown in Fig. 3 and Fig. 4, respectively.

What to submit: Nothing.

Coding: Modules


[Figure: x → linear(1) → relu → dropout → linear(2) → softmax → ŷ.]

Figure 3: The diagram of the MLP implemented in dnn_mlp.py. The circles denote variables and the edges denote modules.

[Figure: x → convolution → relu → max pooling → flatten → dropout → linear → softmax → ŷ.]

Figure 4: The diagram of the CNN implemented in dnn_cnn.py. The circles correspond to variables and the edges correspond to modules. Note that the input to the CNN may not be a vector (e.g., in dnn_cnn.py it is an image, which can be represented as a 3-dimensional tensor). The flatten layer reshapes its input into a vector.

Q5.2 You will modify dnn_misc.py. This script defines all the modules that you will need to construct the MLP and CNN in dnn_mlp.py and dnn_cnn.py, respectively. You have three tasks. First, finish the implementation of the forward and backward functions in class linear_layer. Please follow Eqn. (2) for the forward pass. Second, finish the implementation of the forward and backward functions in class relu. Please follow Eqn. (3) for the forward pass and Eqn. (11) for deriving the partial derivatives (note that relu itself has no parameters). Third, finish the implementation of the backward function in class dropout. We define the forward pass and the backward pass as follows.

forward pass:

s = dropout.forward(q ∈ R^J) = (1/(1 - r)) [ 1[p_1 >= r] q_1, ..., 1[p_J >= r] q_J ]^T,                      (27)

where p_j is sampled uniformly from [0, 1) for all j ∈ {1, ..., J}, and r ∈ [0, 1) is a pre-defined scalar named the dropout rate.

backward pass:

∂l/∂q = dropout.backward(q, ∂l/∂s) = (1/(1 - r)) [ 1[p_1 >= r] ∂l/∂s_1, ..., 1[p_J >= r] ∂l/∂s_J ]^T.        (28)

Note that p_j, j ∈ {1, ..., J}, and r are not learned, so we do not need to compute the derivatives w.r.t. them. Moreover, p_j, j ∈ {1, ..., J}, are re-sampled in every forward pass and are kept for the following backward pass. The dropout rate r is set to 0 during testing.
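As a purely illustrative sketch of the forward pass in eq. (27) (the forward function already provided in dnn_misc.py is the one you should actually read; this standalone version and its names are assumptions):

    import numpy as np

    def dropout_forward(q, r):
        # q: input vector in R^J; r: dropout rate in [0, 1)
        p = np.random.uniform(0.0, 1.0, size=q.shape)   # p_j ~ Uniform[0, 1), re-sampled every forward pass
        mask = (p >= r).astype(float)                   # 1[p_j >= r]
        s = mask * q / (1.0 - r)                        # eq. (27); the same mask would be reused in eq. (28)
        return s, mask

    s, mask = dropout_forward(np.array([1.0, 2.0, 3.0, 4.0]), r=0.5)
    print(s, mask)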

Detailed descriptions/instructions about each pass (i.e., what to compute and what to return) are included in dnn_misc.py. Please read them carefully.

Note that in this script we import numpy as np. Thus, to call a function XX from numpy, please use np.XX.


What to do and submit: Finish the implementation of the 5 functions specified above in dnn_misc.py. Submit your completed dnn_misc.py.

Testing dnn_misc.py

Q5.3 What to do and submit: Run script q53.sh. It will output MLP_lr0.01_m0.0_w0.0_d0.0.json. Add, commit, and push this file before the due date.

What it does: q53.sh will run python3 dnn_mlp.py with learning rate 0.01, no momentum, no weight decay, and dropout rate 0.0. The output file stores the training and validation accuracies over 30 training epochs.

Q5.4 What to do and submit: Run script q54.sh. It will output MLP_lr0.01_m0.0_w0.0_d0.5.json. Add, commit, and push this file before the due date.

What it does: q54.sh will run python3 dnn_mlp.py --dropout_rate 0.5 with learning rate 0.01, no momentum, no weight decay, and dropout rate 0.5. The output file stores the training and validation accuracies over 30 training epochs.

Q5.5 What to do and submit: Run script q55.sh. It will output MLP_lr0.01_m0.0_w0.0_d0.95.json. Add, commit, and push this file before the due date.

What it does: q55.sh will run python3 dnn_mlp.py --dropout_rate 0.95 with learning rate 0.01, no momentum, no weight decay, and dropout rate 0.95. The output file stores the training and validation accuracies over 30 training epochs.

You will observe that the model in Q5.4 gives better validation accuracy (at epoch 30) than the one in Q5.3. Specifically, dropout is widely used to prevent over-fitting. However, if we use too large a dropout rate (like the one in Q5.5), the validation accuracy (together with the training accuracy) will be relatively lower, essentially under-fitting the training data.

Q5.6 What to do and submit: Run script q56.sh. It will output LR_lr0.01_m0.0_w0.0_d0.0.json. Add, commit, and push this file before the due date.

What it does: q56.sh will run python3 dnn_mlp_nononlinear.py with learning rate 0.01, no momentum, no weight decay, and dropout rate 0.0. The output file stores the training and validation accuracies over 30 training epochs.

The network has the same structure as the one in Q5.3, except that we remove the relu (nonlinear) layer. You will see that the validation accuracies drop significantly (the gap is around 0.03). Essentially, without the nonlinear layer, the model is learning multinomial logistic regression, similar to Q4.2.

Q5.7 What to do and submit: Run script q57.sh. It will output CNN_lr0.01_m0.0_w0.0_d0.5.json. Add, commit, and push this file before the due date.

What it does: q57.sh will run python3 dnn_cnn.py with learning rate 0.01, no momentum, no weight decay, and dropout rate 0.5. The output file stores the training and validation accuracies over 30 training epochs.


[Figure: x → conv → relu → max-p → conv → relu → max-p → flatten → dropout → linear → softmax → ŷ.]

Figure 5: The diagram of the CNN you are going to implement in dnn_cnn_2.py. The term conv stands for convolution and max-p stands for max pooling. The circles correspond to variables and the edges correspond to modules. Note that the input to the CNN may not be a vector (e.g., in dnn_cnn_2.py it is an image, which can be represented as a 3-dimensional tensor). The flatten layer reshapes its input into a vector.

Q5.8 What to do and submit: Run script q58.sh. It will output CNN_lr0.01_m0.9_w0.0_d0.5.json. Add, commit, and push this file before the due date.

What it does: q58.sh will run python3 dnn_cnn.py --alpha 0.9 with learning rate 0.01, momentum 0.9, no weight decay, and dropout rate 0.5. The output file stores the training and validation accuracies over 30 training epochs.

You will see that Q5.8 leads to faster convergence than Q5.7 (i.e., the training/validation accuracies will be higher than 0.94 after 1 epoch). That is, using momentum leads to more stable updates of the parameters.

Coding: Building a deeper architecture

Q5.9 The CNN architecture in dnn_cnn.py has only one convolutional layer. In this question, you are going to construct a two-convolutional-layer CNN (see Fig. 5) using the modules you implemented in Q5.2. Please modify the main function in dnn_cnn_2.py. The code in dnn_cnn_2.py is similar to that in dnn_cnn.py, except that there are a few parts marked as TODO. You need to fill in your code so as to construct the CNN in Fig. 5.

What to do and submit: Finish the implementation of the main function in dnn_cnn_2.py (search for TODO in main). Submit your completed dnn_cnn_2.py.

Testing dnn_cnn_2.py

Q5.10 What to do and submit: Run script q510.sh. It will output CNN2_lr0.001_m0.9_w0.0_d0.5.json. Add, commit, and push this file before the due date.

What it does: q510.sh will run python3 dnn_cnn_2.py --alpha 0.9 with learning rate 0.001, momentum 0.9, no weight decay, and dropout rate 0.5. The output file stores the training and validation accuracies over 30 training epochs.

You will see that you can achieve slightly higher validation accuracies than those in Q5.8.
