Problem 1 (20 Points)
This problem investigates how changing the error measure can change the result of the learning process.
You have $N$ data points $y_1, \ldots, y_N$ and wish to estimate a 'representative' value.
(a) [10 pts] If your algorithm is to find the hypothesis $h$ that minimizes the in-sample sum of squared deviations,
$$E_{\text{in}}(h) = \sum_{n=1}^{N} (h - y_n)^2,$$
then show that your estimate will be the in-sample mean,
$$h_{\text{mean}} = \frac{1}{N} \sum_{n=1}^{N} y_n.$$
(b) [10 pts] If your algorithm is to find the hypothesis $h$ that minimizes the in-sample sum of absolute deviations,
$$E_{\text{in}}(h) = \sum_{n=1}^{N} |h - y_n|,$$
then show that your estimate will be the in-sample median $h_{\text{med}}$, which is any value for which half the data points are at most $h_{\text{med}}$ and half the data points are at least $h_{\text{med}}$.
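As a quick numerical sanity check (not part of the required derivations), the short Python sketch below minimizes both in-sample error measures over a fine grid of candidate values $h$ for a small made-up data set; the minimizers land at the sample mean and the sample median, respectively.

import numpy as np

# Made-up data; the claim concerns the minimizer of E_in(h), not these particular values.
y = np.array([1.0, 2.0, 2.5, 7.0, 10.0])

# Candidate hypotheses on a fine grid covering the data range.
h = np.linspace(y.min(), y.max(), 100001)

# E_in(h) for each candidate, under the two error measures.
squared = ((h[:, None] - y[None, :]) ** 2).sum(axis=1)    # sum of squared deviations
absolute = np.abs(h[:, None] - y[None, :]).sum(axis=1)    # sum of absolute deviations

print("argmin of squared error :", h[squared.argmin()], "  sample mean  :", y.mean())
print("argmin of absolute error:", h[absolute.argmin()], "  sample median:", np.median(y))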
Problem 2 (20 Points)
Consider
$$e_n(\mathbf{w}) = \max\left(0,\ 1 - y_n \mathbf{w}^\top \mathbf{x}_n\right).$$
(a) [5 pts] Show that $e_n(\mathbf{w})$ is continuous and differentiable except when $y_n = \mathbf{w}^\top \mathbf{x}_n$.
(b) [5 pts] Show that $e_n(\mathbf{w})$ is an upper bound for the "0-1 loss" $[\![\mathrm{sign}(\mathbf{w}^\top \mathbf{x}_n) \neq y_n]\!]$. Thus, $\frac{1}{N}\sum_{n=1}^{N} e_n(\mathbf{w})$ is an upper bound for the in-sample classification error $E_{\text{in}}(\mathbf{w})$.
(c) [10 pts] Apply stochastic gradient descent on $\frac{1}{N}\sum_{n=1}^{N} e_n(\mathbf{w})$ (ignoring the singular case of $\mathbf{w}^\top \mathbf{x}_n = y_n$) and derive a new perceptron learning algorithm.
Note: $e_n(\mathbf{w})$ corresponds to the "hinge loss" used for maximum-margin classification, most notably for support vector machines (SVMs).
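As a hedged illustration of what part (c) is after (the exact update rule is for you to derive), the sketch below runs stochastic gradient descent on the hinge loss with an illustrative learning rate eta: whenever $y_n \mathbf{w}^\top \mathbf{x}_n < 1$ the gradient of $e_n$ is $-y_n \mathbf{x}_n$, and otherwise the example contributes nothing, which gives a perceptron-like correction step.

import numpy as np

def sgd_hinge_perceptron(X, y, eta=0.1, epochs=1000, seed=0):
    """SGD on the average hinge loss; eta, epochs, and seed are illustrative choices."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):      # visit the examples in random order
            if y[n] * (w @ X[n]) < 1:          # hinge loss active: gradient of e_n is -y_n x_n
                w += eta * y[n] * X[n]         # perceptron-like update
            # otherwise the gradient is 0 (the singular case y_n = w^T x_n is ignored)
    return w

# Toy usage: three linearly separable points with a constant-1 feature playing the role of a bias.
X = np.array([[1.0, 0.0, 0.0], [1.0, 2.0, 2.0], [1.0, 2.0, 0.0]])
y = np.array([-1.0, -1.0, +1.0])
w = sgd_hinge_perceptron(X, y)
print("learned w:", w, " predictions:", np.sign(X @ w))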
Problem 3 (20 Points)
There are a number of bounds on the generalization error $\epsilon$, all holding with probability at least $1 - \delta$.
(a) Original VC-bound:
$$\epsilon \le \sqrt{\frac{8}{N} \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$$
(b) Rademacher penalty bound:
$$\epsilon \le \sqrt{\frac{2 \ln\left(2 N\, m_{\mathcal{H}}(N)\right)}{N}} + \sqrt{\frac{2}{N} \ln \frac{1}{\delta}} + \frac{1}{N}$$
(c) Parrondo and van den Broek:
$$\epsilon \le \sqrt{\frac{1}{N}\left(2\epsilon + \ln \frac{6\, m_{\mathcal{H}}(2N)}{\delta}\right)}$$
(d) Devroye:
$$\epsilon \le \sqrt{\frac{1}{2N}\left(4\epsilon(1+\epsilon) + \ln \frac{4\, m_{\mathcal{H}}(N^2)}{\delta}\right)}$$
Note that (c) and (d) are implicit bounds in $\epsilon$. Fix $d_{\mathrm{VC}} = 50$ and $\delta = 0.05$. Plot these bounds as a function of $N$. Which is the best?
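One possible starting point for the plot (a sketch, not the required solution): use the polynomial upper bound $m_{\mathcal{H}}(N) \approx N^{d_{\mathrm{VC}}}$, work with $\ln m_{\mathcal{H}}$ to avoid numerical overflow, and resolve the implicit bounds (c) and (d) by fixed-point iteration on $\epsilon$.

import numpy as np
import matplotlib.pyplot as plt

dvc, delta = 50, 0.05
N = np.arange(100, 10001, 50).astype(float)

def log_mH(n):
    # ln m_H(n) ~ dvc * ln(n): the usual polynomial bound on the growth function
    return dvc * np.log(n)

# (a) Original VC bound
eps_vc = np.sqrt(8.0 / N * (np.log(4.0) + log_mH(2 * N) - np.log(delta)))

# (b) Rademacher penalty bound
eps_rad = (np.sqrt(2.0 * (np.log(2.0 * N) + log_mH(N)) / N)
           + np.sqrt(2.0 / N * np.log(1.0 / delta)) + 1.0 / N)

def fixed_point(rhs, iters=1000):
    # Resolve an implicit bound eps <= rhs(eps) by iterating eps -> rhs(eps).
    eps = np.ones_like(N)
    for _ in range(iters):
        eps = rhs(eps)
    return eps

# (c) Parrondo and van den Broek (implicit in eps)
eps_pvdb = fixed_point(lambda e: np.sqrt(
    (2.0 * e + np.log(6.0) + log_mH(2 * N) - np.log(delta)) / N))

# (d) Devroye (implicit in eps); ln m_H(N^2) = 2 * dvc * ln(N) under the same approximation
eps_dev = fixed_point(lambda e: np.sqrt(
    (4.0 * e * (1.0 + e) + np.log(4.0) + 2.0 * dvc * np.log(N) - np.log(delta)) / (2.0 * N)))

for eps, label in [(eps_vc, "VC"), (eps_rad, "Rademacher"),
                   (eps_pvdb, "Parrondo & van den Broek"), (eps_dev, "Devroye")]:
    plt.plot(N, eps, label=label)
plt.xlabel("N")
plt.ylabel("bound on the generalization error")
plt.legend()
plt.show()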
Problem 4 (20 Points)
The bias-variance decomposition of out-of-sample error is based on squared error measures. Recall that the out-of-sample error is given by
$$E_{\text{out}}\left(g^{(\mathcal{D})}\right) = \mathbb{E}_{\mathbf{x}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right] \tag{1}$$
where $\mathbb{E}_{\mathbf{x}}$ denotes the expected value with respect to the input $\mathbf{x}$, and the notation $g^{(\mathcal{D})}$ makes explicit the dependence of $g$ on the data set $\mathcal{D}$. From (1) we can remove the dependence on $\mathcal{D}$ by taking the average over data sets:
$$\mathbb{E}_{\mathcal{D}}\left[E_{\text{out}}\left(g^{(\mathcal{D})}\right)\right] = \mathbb{E}_{\mathcal{D}}\left[\mathbb{E}_{\mathbf{x}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right]\right] = \mathbb{E}_{\mathbf{x}}\left[\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right]\right].$$
(a) [5 pts] To evaluate $\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right]$, we define the 'average' hypothesis $\bar{g}(\mathbf{x})$ as
$$\bar{g}(\mathbf{x}) \triangleq \mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})\right]. \tag{2}$$
Now imagine that we have $K$ data sets $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_K$; then what will be the average hypothesis $\bar{g}(\mathbf{x})$ estimated using these data sets?
(b) [10 pts] Let us define
$$\mathrm{var}(\mathbf{x}) \triangleq \mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})\right)^2\right], \quad \text{and} \quad \mathrm{bias}(\mathbf{x}) \triangleq \left(\bar{g}(\mathbf{x}) - f(\mathbf{x})\right)^2.$$
Show that
$$\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right] = \mathrm{var}(\mathbf{x}) + \mathrm{bias}(\mathbf{x}).$$
(c) [5 pts] Show that
$$\mathbb{E}_{\mathcal{D}}\left[E_{\text{out}}\left(g^{(\mathcal{D})}\right)\right] = \mathrm{bias} + \mathrm{var},$$
where $\mathrm{bias} \triangleq \mathbb{E}_{\mathbf{x}}[\mathrm{bias}(\mathbf{x})]$ and $\mathrm{var} \triangleq \mathbb{E}_{\mathbf{x}}[\mathrm{var}(\mathbf{x})]$.
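As an optional numerical illustration of the averaging over $K$ data sets in part (a) (a sketch with made-up choices: target $f(x) = \sin(\pi x)$, and each data set fit by a constant hypothesis equal to the mean of its $y$-values), the code below approximates $\bar{g}$, bias, and var by Monte Carlo and checks that bias + var matches the averaged out-of-sample error.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)            # made-up target function
K, N_per_set = 2000, 2                     # K data sets, each with two points

# Learn g^(D)(x) = constant = mean of the y-values in D (an illustrative, very simple model).
x_train = rng.uniform(-1, 1, size=(K, N_per_set))
g_D = f(x_train).mean(axis=1)              # one constant hypothesis per data set, shape (K,)

x_test = rng.uniform(-1, 1, size=2000)     # points for the expectation over x
g_bar = g_D.mean()                         # average hypothesis (also a constant here)

bias = ((g_bar - f(x_test)) ** 2).mean()   # E_x[(g_bar(x) - f(x))^2]
var = ((g_D - g_bar) ** 2).mean()          # var(x) does not depend on x for a constant model
E_out_avg = ((g_D[:, None] - f(x_test)[None, :]) ** 2).mean()

print(f"bias + var = {bias + var:.4f},  averaged E_out = {E_out_avg:.4f}")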
Problem 5 (20 Points)
In support vector machines, the hyperplane $h$ separates the data if and only if it can be represented by weights $(b, \mathbf{w})$ that satisfy
$$\min_{n=1,\ldots,N} y_n\left(\mathbf{w}^\top \mathbf{x}_n + b\right) = 1. \tag{3}$$
Consider the data below and a 'hyperplane' $(b, \mathbf{w})$ that separates the data:
$$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} -1 \\ -1 \\ +1 \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} 1.2 \\ -3.2 \end{bmatrix}, \qquad b = -0.5.$$
(a) [10 pts] Compute
$$\rho = \min_{n=1,\ldots,N} y_n\left(\mathbf{w}^\top \mathbf{x}_n + b\right).$$
(b) [5 pts] Compute the weights $\frac{1}{\rho}(b, \mathbf{w})$ and show that they satisfy Eq. (3).
(c) [5 pts] Plot both hyperplanes to show that they are the same separator.
Note: This problem gives a concrete example of re-normalizing the weights to show that condition (3) and the following condition are equivalent, as covered in class:
$$y_n\left(\mathbf{w}^\top \mathbf{x}_n + b\right) > 0. \tag{4}$$
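A small numerical check for parts (a) and (b), offered only as a sketch: it computes $\rho$ for the data above and verifies that the rescaled weights $(b, \mathbf{w})/\rho$ attain a minimum margin of exactly 1, i.e., that they satisfy Eq. (3).

import numpy as np

# Data from the problem statement.
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0]])
y = np.array([-1.0, -1.0, +1.0])
w = np.array([1.2, -3.2])
b = -0.5

# (a) rho = min_n y_n (w^T x_n + b)
margins = y * (X @ w + b)
rho = margins.min()
print("margins:", margins, " rho =", rho)

# (b) rescale (b, w) by 1/rho and check that condition (3) now holds
w_new, b_new = w / rho, b / rho
print("rescaled minimum margin:", (y * (X @ w_new + b_new)).min())   # should print 1.0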