Problem 1 (20 Points)
This problem investigates how changing the error measure can change the result of the learning process.
You have $N$ data points $y_1, \ldots, y_N$ and wish to estimate a 'representative' value.
(a) [10 pts] If your algorithm is to find the hypothesis $h$ that minimizes the in-sample sum of squared deviations,
$$E_{\text{in}}(h) = \sum_{n=1}^{N} (h - y_n)^2,$$
then show that your estimate will be the in-sample mean,
$$h_{\text{mean}} = \frac{1}{N} \sum_{n=1}^{N} y_n.$$
(b) [10 pts] If your algorithm is to find the hypothesis $h$ that minimizes the in-sample sum of absolute deviations,
$$E_{\text{in}}(h) = \sum_{n=1}^{N} |h - y_n|,$$
then show that your estimate will be the in-sample median $h_{\text{med}}$, which is any value for which half the data points are at most $h_{\text{med}}$ and half the data points are at least $h_{\text{med}}$.
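As a quick numerical sanity check (not part of the required derivations), the short Python sketch below minimizes both in-sample error measures over a fine grid of candidate values $h$ for a small made-up data set; the minimizers land at the sample mean and the sample median, respectively.

import numpy as np

# Made-up data; the claim concerns the minimizer of E_in(h), not these particular values.
y = np.array([1.0, 2.0, 2.5, 7.0, 10.0])

# Candidate hypotheses on a fine grid covering the data range.
h = np.linspace(y.min(), y.max(), 100001)

# E_in(h) for each candidate, under the two error measures.
squared = ((h[:, None] - y[None, :]) ** 2).sum(axis=1)    # sum of squared deviations
absolute = np.abs(h[:, None] - y[None, :]).sum(axis=1)    # sum of absolute deviations

print("argmin of squared error :", h[squared.argmin()], "  sample mean  :", y.mean())
print("argmin of absolute error:", h[absolute.argmin()], "  sample median:", np.median(y))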
Problem 2 (20 Points)
Consider
$$e_n(\mathbf{w}) = \max\left(0,\ 1 - y_n \mathbf{w}^\top \mathbf{x}_n\right).$$
(a) [5 pts] Show that $e_n(\mathbf{w})$ is continuous and differentiable except when $y_n = \mathbf{w}^\top \mathbf{x}_n$.
(b) [5 pts] Show that $e_n(\mathbf{w})$ is an upper bound for the "0-1 loss" $[\![\mathrm{sign}(\mathbf{w}^\top \mathbf{x}_n) \neq y_n]\!]$. Thus, $\frac{1}{N}\sum_{n=1}^{N} e_n(\mathbf{w})$ is an upper bound for the in-sample classification error $E_{\text{in}}(\mathbf{w})$.
(c) [10 pts] Apply stochastic gradient descent on $\frac{1}{N}\sum_{n=1}^{N} e_n(\mathbf{w})$ (ignoring the singular case of $\mathbf{w}^\top \mathbf{x}_n = y_n$) and derive a new perceptron learning algorithm.
Note: $e_n(\mathbf{w})$ corresponds to the "hinge loss" used for maximum-margin classification, most notably for support vector machines (SVMs).
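As a hedged illustration of what part (c) is after (the exact update rule is for you to derive), the sketch below runs stochastic gradient descent on the hinge loss with an illustrative learning rate eta: whenever $y_n \mathbf{w}^\top \mathbf{x}_n < 1$ the gradient of $e_n$ is $-y_n \mathbf{x}_n$, and otherwise the example contributes nothing, which gives a perceptron-like correction step.

import numpy as np

def sgd_hinge_perceptron(X, y, eta=0.1, epochs=1000, seed=0):
    """SGD on the average hinge loss; eta, epochs, and seed are illustrative choices."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):      # visit the examples in random order
            if y[n] * (w @ X[n]) < 1:          # hinge loss active: gradient of e_n is -y_n x_n
                w += eta * y[n] * X[n]         # perceptron-like update
            # otherwise the gradient is 0 (the singular case y_n = w^T x_n is ignored)
    return w

# Toy usage: three linearly separable points with a constant-1 feature playing the role of a bias.
X = np.array([[1.0, 0.0, 0.0], [1.0, 2.0, 2.0], [1.0, 2.0, 0.0]])
y = np.array([-1.0, -1.0, +1.0])
w = sgd_hinge_perceptron(X, y)
print("learned w:", w, " predictions:", np.sign(X @ w))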
Problem 3 (20 Points)
There are a number of bounds on the generalization error $\epsilon$, all holding with probability at least $1 - \delta$.
(a) Original VC-bound:
$$\epsilon \le \sqrt{\frac{8}{N} \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$$
(b) Rademacher penalty bound:
$$\epsilon \le \sqrt{\frac{2 \ln\left(2 N\, m_{\mathcal{H}}(N)\right)}{N}} + \sqrt{\frac{2}{N} \ln \frac{1}{\delta}} + \frac{1}{N}$$
(c) Parrondo and van den Broek:
$$\epsilon \le \sqrt{\frac{1}{N}\left(2\epsilon + \ln \frac{6\, m_{\mathcal{H}}(2N)}{\delta}\right)}$$
(d) Devroye:
$$\epsilon \le \sqrt{\frac{1}{2N}\left(4\epsilon(1+\epsilon) + \ln \frac{4\, m_{\mathcal{H}}(N^2)}{\delta}\right)}$$
Note that (c) and (d) are implicit bounds in $\epsilon$. Fix $d_{\mathrm{VC}} = 50$ and $\delta = 0.05$. Plot these bounds as a function of $N$. Which is the best?
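One possible starting point for the plot (a sketch, not the required solution): use the polynomial upper bound $m_{\mathcal{H}}(N) \approx N^{d_{\mathrm{VC}}}$, work with $\ln m_{\mathcal{H}}$ to avoid numerical overflow, and resolve the implicit bounds (c) and (d) by fixed-point iteration on $\epsilon$.

import numpy as np
import matplotlib.pyplot as plt

dvc, delta = 50, 0.05
N = np.arange(100, 10001, 50).astype(float)

def log_mH(n):
    # ln m_H(n) ~ dvc * ln(n): the usual polynomial bound on the growth function
    return dvc * np.log(n)

# (a) Original VC bound
eps_vc = np.sqrt(8.0 / N * (np.log(4.0) + log_mH(2 * N) - np.log(delta)))

# (b) Rademacher penalty bound
eps_rad = (np.sqrt(2.0 * (np.log(2.0 * N) + log_mH(N)) / N)
           + np.sqrt(2.0 / N * np.log(1.0 / delta)) + 1.0 / N)

def fixed_point(rhs, iters=1000):
    # Resolve an implicit bound eps <= rhs(eps) by iterating eps -> rhs(eps).
    eps = np.ones_like(N)
    for _ in range(iters):
        eps = rhs(eps)
    return eps

# (c) Parrondo and van den Broek (implicit in eps)
eps_pvdb = fixed_point(lambda e: np.sqrt(
    (2.0 * e + np.log(6.0) + log_mH(2 * N) - np.log(delta)) / N))

# (d) Devroye (implicit in eps); ln m_H(N^2) = 2 * dvc * ln(N) under the same approximation
eps_dev = fixed_point(lambda e: np.sqrt(
    (4.0 * e * (1.0 + e) + np.log(4.0) + 2.0 * dvc * np.log(N) - np.log(delta)) / (2.0 * N)))

for eps, label in [(eps_vc, "VC"), (eps_rad, "Rademacher"),
                   (eps_pvdb, "Parrondo & van den Broek"), (eps_dev, "Devroye")]:
    plt.plot(N, eps, label=label)
plt.xlabel("N")
plt.ylabel("bound on the generalization error")
plt.legend()
plt.show()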
Problem 4 (20 Points)
The bias-variance decomposition of out-of-sample error is based on squared error measures. Recall that the out-of-sample error is given by
$$E_{\text{out}}\left(g^{(\mathcal{D})}\right) = \mathbb{E}_{\mathbf{x}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right] \tag{1}$$
where $\mathbb{E}_{\mathbf{x}}$ denotes the expected value with respect to the input $\mathbf{x}$, and the notation $g^{(\mathcal{D})}$ makes explicit the dependence of $g$ on the data set $\mathcal{D}$. From (1) we can remove the dependence on $\mathcal{D}$ by taking the average over data sets:
$$\mathbb{E}_{\mathcal{D}}\left[E_{\text{out}}\left(g^{(\mathcal{D})}\right)\right] = \mathbb{E}_{\mathcal{D}}\left[\mathbb{E}_{\mathbf{x}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right]\right] = \mathbb{E}_{\mathbf{x}}\left[\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right]\right].$$
(a) [5 pts] To evaluate $\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right]$, we define the 'average' hypothesis $\bar{g}(\mathbf{x})$ as
$$\bar{g}(\mathbf{x}) \triangleq \mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})\right]. \tag{2}$$
Now imagine that we have $K$ data sets $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_K$; then what will be the average hypothesis $\bar{g}(\mathbf{x})$ estimated using these data sets?
(b) [10 pts] Let us define
$$\mathrm{var}(\mathbf{x}) \triangleq \mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})\right)^2\right], \quad \text{and} \quad \mathrm{bias}(\mathbf{x}) \triangleq \left(\bar{g}(\mathbf{x}) - f(\mathbf{x})\right)^2.$$
Show that
$$\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right] = \mathrm{var}(\mathbf{x}) + \mathrm{bias}(\mathbf{x}).$$
(c) [5 pts] Show that
$$\mathbb{E}_{\mathcal{D}}\left[E_{\text{out}}\left(g^{(\mathcal{D})}\right)\right] = \mathrm{bias} + \mathrm{var},$$
where $\mathrm{bias} \triangleq \mathbb{E}_{\mathbf{x}}[\mathrm{bias}(\mathbf{x})]$ and $\mathrm{var} \triangleq \mathbb{E}_{\mathbf{x}}[\mathrm{var}(\mathbf{x})]$.
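As an optional numerical illustration of the averaging over $K$ data sets in part (a) (a sketch with made-up choices: target $f(x) = \sin(\pi x)$, and each data set fit by a constant hypothesis equal to the mean of its $y$-values), the code below approximates $\bar{g}$, bias, and var by Monte Carlo and checks that bias + var matches the averaged out-of-sample error.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)            # made-up target function
K, N_per_set = 2000, 2                     # K data sets, each with two points

# Learn g^(D)(x) = constant = mean of the y-values in D (an illustrative, very simple model).
x_train = rng.uniform(-1, 1, size=(K, N_per_set))
g_D = f(x_train).mean(axis=1)              # one constant hypothesis per data set, shape (K,)

x_test = rng.uniform(-1, 1, size=2000)     # points for the expectation over x
g_bar = g_D.mean()                         # average hypothesis (also a constant here)

bias = ((g_bar - f(x_test)) ** 2).mean()   # E_x[(g_bar(x) - f(x))^2]
var = ((g_D - g_bar) ** 2).mean()          # var(x) does not depend on x for a constant model
E_out_avg = ((g_D[:, None] - f(x_test)[None, :]) ** 2).mean()

print(f"bias + var = {bias + var:.4f},  averaged E_out = {E_out_avg:.4f}")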
Problem 5 (20 Points)
In support vector machines, the hyperplane $h$ separates the data if and only if it can be represented by weights $(b, \mathbf{w})$ that satisfy
$$\min_{n=1,\ldots,N} y_n\left(\mathbf{w}^\top \mathbf{x}_n + b\right) = 1. \tag{3}$$
Consider the data below and a 'hyperplane' $(b, \mathbf{w})$ that separates the data:
$$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} -1 \\ -1 \\ +1 \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} 1.2 \\ -3.2 \end{bmatrix}, \qquad b = -0.5.$$
(a) [10 pts] Compute
$$\rho = \min_{n=1,\ldots,N} y_n\left(\mathbf{w}^\top \mathbf{x}_n + b\right).$$
(b) [5 pts] Compute the weights $\frac{1}{\rho}(b, \mathbf{w})$ and show that they satisfy Eq. (3).
(c) [5 pts] Plot both hyperplanes to show that they are the same separator.
Note: This problem gives a concrete example of re-normalizing the weights to show that condition (3) and the following condition are equivalent, as covered in class:
$$y_n\left(\mathbf{w}^\top \mathbf{x}_n + b\right) > 0. \tag{4}$$
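A small numerical check for parts (a) and (b), offered only as a sketch: it computes $\rho$ for the data above and verifies that the rescaled weights $(b, \mathbf{w})/\rho$ attain a minimum margin of exactly 1, i.e., that they satisfy Eq. (3).

import numpy as np

# Data from the problem statement.
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0]])
y = np.array([-1.0, -1.0, +1.0])
w = np.array([1.2, -3.2])
b = -0.5

# (a) rho = min_n y_n (w^T x_n + b)
margins = y * (X @ w + b)
rho = margins.min()
print("margins:", margins, " rho =", rho)

# (b) rescale (b, w) by 1/rho and check that condition (3) now holds
w_new, b_new = w / rho, b / rho
print("rescaled minimum margin:", (y * (X @ w_new + b_new)).min())   # should print 1.0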