Description
Note: The assignment will be auto-graded. It is important that you do not use additional libraries or change the provided functions' inputs and outputs.
Part 1: Setup
- Remote connect to an EWS machine.
ssh (netid)@remlnx.ews.illinois.edu
- Load the Python module; this will also load pip and virtualenv.
module load python/3.4.3
- Reuse the virtual environment from mp1.
source ~/cs446sp_2018/bin/activate
- Copy mp2 into your svn directory, and change directory to mp2.
cd ~/(netid)
svn cp https://subversion.ews.illinois.edu/svn/sp18-cs446/_shared/mp2 .
cd mp2
- Install the requirements through pip.
pip install -r requirements.txt
- Create data directory and download the data into the data directory.
mkdir data
wget --user (netid) --ask-password \
https://courses.engr.illinois.edu/cs446/sp2018/secure/assignment2_data.zip \
-O data/assignment2_data.zip
- Unzip assignment2_data.zip.
unzip data/assignment2_data.zip -d data/
- Prevent svn from checking in the data directory.
svn propset svn:ignore data .
Part 2: Exercise
In this exercise we will build a system to predict housing prices. We illustrate the overall
pipeline of the system in Fig. 1. We will implement each of the blocks.
In main.py, the overall program structure is provided for you.
Figure 1: High-level pipeline
Part 2.1 Numpy Implementation
- Reading in data. In utils/io_tools.py, we will fill in one function for reading in the dataset. The dataset consists of housing features (e.g. the size of the house, location, …, etc.) and the price of the house.
There are three CSV files, train.csv, val.csv, and test.csv, each containing the examples in one of the dataset splits.
The format is comma separated, and the first line contains the header of each column:
Id,BldgType,OverallQual,GrLivArea,GarageArea,SalePrice
1,1Fam,7,1710,548,208500
Every column before SalePrice may serve as input to our system; SalePrice is the quantity we hope to predict.
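The actual function signature to implement is given in utils/io_tools.py; as a rough sketch of the idea (the helper name here is hypothetical, and the stdlib csv module is used since additional libraries are disallowed), reading the file might look like:

```python
import csv

def read_housing_csv(path):
    """Hypothetical helper: read one dataset split into a list of row dicts.

    The first line of the file is the header, so csv.DictReader maps each
    column name (e.g. 'SalePrice') to its string value for every row.
    """
    with open(path, 'r') as f:
        return list(csv.DictReader(f))
```

Note that all values come back as strings; numeric columns such as GrLivArea and SalePrice still need to be converted before use.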
- Data processing. In utils/data_tools.py, we will implement functions to transform the data into vector forms. For example, converting the location column into a one-hot encoding. There are a total of five building types: 1Fam, 2FmCon, Duplx, TwnhsE, TwnhsI. To represent this, we construct a vector of length five, one entry per type, where each element is a Boolean variable indicating the presence of that building type. For example,
1Fam = [1, 0, 0, 0, 0]
2FmCon = [0, 1, 0, 0, 0]
…etc.
More details are provided in the function docstring.
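As a minimal sketch of the one-hot idea (the helper name is hypothetical; the graded function and its docstring in utils/data_tools.py define the real interface):

```python
# The five building types, in a fixed order that determines the vector layout.
BUILDING_TYPES = ['1Fam', '2FmCon', 'Duplx', 'TwnhsE', 'TwnhsI']

def one_hot_bldg_type(bldg_type):
    """Map a building-type string to its length-5 one-hot vector."""
    return [1 if bldg_type == t else 0 for t in BUILDING_TYPES]
```

For example, one_hot_bldg_type('1Fam') gives [1, 0, 0, 0, 0], matching the table above.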
- Linear model implementation. In models/linear_model.py, we will implement an abstract base class for linear models, then extend it to linear regression. The models will support the following operations:
– Forward operation. The forward operation is the function which takes an input and outputs a score. For linear models, it is F = w^T x + b. For simplicity, we redefine x = [x, 1] and w = [w, b], so that F = w^T x.
– Loss function. The loss function takes a score and a ground-truth label and outputs a scalar indicating how well the model's predicted score fits the ground truth. We will use L to denote the loss.
– Backward operation. Backward operation is for computing the gradient of the loss function with respect to the model parameters. This is computed after the forward operation to update the model.
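The three operations above can be sketched as follows for linear regression with a squared-error loss. This is only an illustration under that assumption: the class, its method names, and the exact loss are hypothetical here, and the graded base class in models/linear_model.py defines the real interface.

```python
import numpy as np

class LinearRegressionSketch:
    """Illustrative only; the assignment's abstract base class may differ."""

    def __init__(self, ndims):
        # w holds [w, b] together: the last entry plays the role of the bias b.
        self.w = np.zeros((ndims + 1, 1))

    def forward(self, x):
        # x has shape (N, ndims + 1): a column of ones is appended to each
        # example so that F = w^T x already includes the bias term.
        return x.dot(self.w)

    def loss(self, f, y):
        # Mean squared error between scores f and ground-truth labels y.
        return float(np.mean((f - y) ** 2))

    def backward(self, f, x, y):
        # Gradient of the mean squared error with respect to w, computed
        # after the forward pass and used to update the model.
        return 2.0 * x.T.dot(f - y) / x.shape[0]
```

The backward pass reuses the scores f from the forward pass, which is why the update step below is described as one pass of the forward and backward operations.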
- Optimization
– Gradient descent. In models/train_eval_model.py, we will implement gradient descent. Gradient descent is an optimization algorithm in which the model adjusts its parameters in the direction of the negative gradient of L.
Repeat until convergence:
w^(t) = w^(t-1) − η ∇L^(t-1)
The above equation is referred to as an update step, which consists of one pass of the forward and backward operations.
– Linear regression also has an analytic solution, which we will also implement.
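Both optimizers can be sketched in a few lines, assuming a squared-error loss (the function names here are hypothetical; the graded code in models/train_eval_model.py defines the real interface). The analytic solution is the familiar normal equations, w = (X^T X)^(-1) X^T y.

```python
import numpy as np

def gradient_descent_sketch(x, y, lr=0.01, steps=1000):
    """Repeat the update w <- w - lr * grad(L) for a fixed number of steps."""
    w = np.zeros((x.shape[1], 1))
    for _ in range(steps):
        # Gradient of the mean squared error at the current w.
        grad = 2.0 * x.T.dot(x.dot(w) - y) / x.shape[0]
        w = w - lr * grad
    return w

def analytic_solution_sketch(x, y):
    # Normal equations; pinv is used in case X^T X is singular.
    return np.linalg.pinv(x.T.dot(x)).dot(x.T).dot(y)
```

On well-conditioned data the two should agree closely; comparing them is a useful sanity check for your gradient implementation.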
- Model selection. The optimization above learns the model parameters w; we use the training split of the dataset to "train" these parameters. Additionally, there are several hyper-parameters in this model (e.g. the learning rate, the weight decay factor, the choice of column features). These hyper-parameters should be chosen based on the validation split (i.e. for each hyper-parameter setting, we find the optimal w using the training set, then compute the loss on the validation set). We will choose the hyper-parameter setting with the lowest validation error as the final model.
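The selection loop might be sketched as follows, here searching over the learning rate only; the function name is hypothetical, and training is the plain gradient-descent update on a squared-error loss described above:

```python
import numpy as np

def select_learning_rate_sketch(x_train, y_train, x_val, y_val, candidates):
    """For each candidate learning rate: train on the training split,
    evaluate on the validation split, and keep the lowest-loss setting."""
    best_lr, best_loss = None, float('inf')
    for lr in candidates:
        # Train w from scratch with this learning rate.
        w = np.zeros((x_train.shape[1], 1))
        for _ in range(2000):
            grad = 2.0 * x_train.T.dot(x_train.dot(w) - y_train) / x_train.shape[0]
            w = w - lr * grad
        # Score this setting on the held-out validation split.
        val_loss = float(np.mean((x_val.dot(w) - y_val) ** 2))
        if val_loss < best_loss:
            best_lr, best_loss = lr, val_loss
    return best_lr, best_loss
```

The same pattern extends to any hyper-parameter (weight decay, feature choice): train on the training split, compare on the validation split, and never touch the test split until the very end.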
- Running Experiments. In main.py, experiment with different features, weight initializations, and learning rates. We will not grade the main.py file; feel free to modify it.
To run main.py
python main.py
- Things to think about. Here is a list of things to think about; you do not have to hand in anything for this part.
– How does the learning rate affect convergence?
– Which optimization is better, analytic solution or gradient descent?
– Are squared features better? Why?
– Which of the column features are important?
Part 3: Writing Tests
In test.py we have provided basic test cases. Feel free to write more. To test the code,
run
nose2
Part 4: Submit
Submitting the code is equivalent to committing the code. This can be done with the
following command:
svn commit -m "Some meaningful comment here."
Lastly, double check on your browser that you can see your code at
https://subversion.ews.illinois.edu/svn/sp18-cs446/(netid)/mp2/