The following languages are acceptable: Java, C/C++, Python, MATLAB. You can work in a team of up to 3 people. Each team only needs to submit one copy of the source code and report. You need to explicitly state each member's contribution in percentages (a rough estimate). Your source code and report…
Decision Tree Ensemble for Optical Character Recognition (total points: 80 pts + 10 report pts + 10 result pts)
In this assignment we continue working on the Optical Character Recognition task, classifying between the two digits 3 and 5. The goal of this assignment is to develop variations of the decision tree.
Data. The data for this assignment is generated from the data of implementation assignment 2. We apply Principal Component Analysis (PCA) to reduce the dimensionality from 784 to 100. Here is a short description of each train and validation split:
Important Guidelines. For all parts of this assignment:
Part 1 (20 pts) : Decision Tree (DT). For this part we are interested in using a decision tree with the following configuration:
The DT uses the gini-index to measure uncertainty. Specifically, if a node splits list A into left and right lists A_L and A_R, as depicted in Figure 1, then
Figure 1: Split of list A into two lists A_L and A_R with feature f_i at threshold T. A contains C^+ positive and C^- negative examples; examples with f_i < T go to A_L (with counts C_L^+ and C_L^-), and examples with f_i >= T go to A_R (with counts C_R^+ and C_R^-).
the benefit of the split for feature f_i at threshold T is computed as:

\[ B = U(A) - p_l\, U(A_L) - p_r\, U(A_R) \tag{1} \]

where U is the gini-index function, which for a list such as A_L is computed as:

\[ U(A_L) = 1 - p_+^2 - p_-^2 = 1 - \left( \frac{C_L^+}{C_L^+ + C_L^-} \right)^2 - \left( \frac{C_L^-}{C_L^+ + C_L^-} \right)^2 \tag{2} \]

and p_l and p_r are the probabilities of each split, given by:

\[ p_l = \frac{C_L^+ + C_L^-}{C^+ + C^-}, \qquad p_r = \frac{C_R^+ + C_R^-}{C^+ + C^-} \tag{3} \]
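To make equations (1)-(3) concrete, here is a minimal Python sketch of the gini-index and split-benefit computations. The function names gini and split_benefit and the +1/-1 label encoding are our own assumptions, not part of the assignment.

import numpy as np

def gini(labels):
    """Gini index U = 1 - p_+^2 - p_-^2 for an array of +1/-1 labels."""
    if len(labels) == 0:
        return 0.0
    p_pos = np.mean(labels == 1)   # fraction of positive examples
    return 1.0 - p_pos ** 2 - (1.0 - p_pos) ** 2

def split_benefit(labels, left_mask):
    """Benefit B = U(A) - p_l U(A_L) - p_r U(A_R) of splitting list A
    into A_L (left_mask True) and A_R (left_mask False)."""
    left, right = labels[left_mask], labels[~left_mask]
    p_l = len(left) / len(labels)  # equation (3)
    return gini(labels) - p_l * gini(left) - (1.0 - p_l) * gini(right)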
The feature values are continuous. Therefore you need to follow the descriptions and hints given in the slides to search for the best threshold T for a feature f_i.
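As a non-authoritative illustration of that search, one common scheme is to sort the values of feature f_i and evaluate the midpoints between consecutive distinct values as candidate thresholds; the slides may prescribe a different set of candidates. This sketch reuses split_benefit from the sketch above.

import numpy as np

def best_threshold(feature_values, labels):
    """Return the (threshold, benefit) pair with the highest benefit for
    one feature, trying midpoints between consecutive distinct values."""
    sorted_vals = np.sort(feature_values)
    best_T, best_B = None, -np.inf
    for lo, hi in zip(sorted_vals[:-1], sorted_vals[1:]):
        if lo == hi:
            continue                  # no decision boundary between equal values
        T = 0.5 * (lo + hi)
        B = split_benefit(labels, feature_values < T)
        if B > best_B:
            best_T, best_B = T, B
    return best_T, best_B             # (None, -inf) if the feature is constant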
Please implement the steps below:
Part 2 (30 pts) : Random Forest (Bagging). In this part we are interested in the random forest, which is a variation of bagging without some of its limitations. Please implement the steps below:
m : The number of features for a tree.
d : Maximum depth of the trees in the forest.
Here is how the forest is created: the random forest is a collection of n trees, all of which have a maximum depth of d. Each tree is built on a data set of size 4888 sampled (with replacement) from the train set. In the process of building a tree of the forest, each time we try to find the best feature f_i to split on, we first sub-sample (without replacement) m features from the 100-feature set and then pick the f_i with the highest benefit among the m sampled features.
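As a rough sketch of the two sampling operations just described (NumPy assumed; grow_tree is a hypothetical tree-building routine that calls its feature_sampler at every split, not something defined in this assignment):

import numpy as np

rng = np.random.default_rng(0)             # seed chosen arbitrarily

def bootstrap_sample(X, y, size=4888):
    """Draw `size` examples with replacement from the train set for one tree."""
    idx = rng.integers(0, len(X), size=size)
    return X[idx], y[idx]

def candidate_features(m, n_features=100):
    """Sub-sample (without replacement) m of the 100 PCA features;
    only these are scored with the benefit B at the current split."""
    return rng.choice(n_features, size=m, replace=False)

def build_forest(X, y, n, m, d):
    # grow_tree is hypothetical: it should fit one depth-limited tree,
    # calling feature_sampler() whenever it searches for the best split.
    return [grow_tree(*bootstrap_sample(X, y), max_depth=d,
                      feature_sampler=lambda: candidate_features(m))
            for _ in range(n)]

At prediction time the forest would take a majority vote over the n trees.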
Part 3 (30 pts) : AdaBoost (Boosting). For this part we are interested in applying AdaBoost to create yet another ensemble model with decision trees. Considering the AdaBoost algorithm described in the slides, please complete the steps below:
(Hint: it changes the gini-index and also the predictions at the leaves.)
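Below is a minimal sketch of the standard AdaBoost loop, assuming labels in {-1, +1}; the fit_weak_learner interface is our own assumption, standing in for a decision tree that, per the hint, uses a weighted gini-index and weighted votes at the leaves.

import numpy as np

def adaboost(X, y, n_rounds, fit_weak_learner):
    """fit_weak_learner(X, y, w) is assumed to fit a tree to the weighted
    data and return a function h with h(X) -> predictions in {-1, +1}."""
    w = np.full(len(X), 1.0 / len(X))            # start from uniform weights
    ensemble = []
    for _ in range(n_rounds):
        h = fit_weak_learner(X, y, w)
        pred = h(X)
        err = max(np.sum(w[pred != y]), 1e-12)   # weighted error, guarded against 0
        alpha = 0.5 * np.log((1.0 - err) / err)  # this learner's vote weight
        w *= np.exp(-alpha * y * pred)           # up-weight misclassified examples
        w /= w.sum()                             # renormalize to a distribution
        ensemble.append((alpha, h))
    return ensemble

def ensemble_predict(ensemble, X):
    """Sign of the alpha-weighted sum of the weak learners' votes."""
    return np.sign(sum(alpha * h(X) for alpha, h in ensemble))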
Submission. Your submission should include the following: