Statistical assignment

Predicting Prices of Used Cars (Regression Trees). The file ToyotaCorolla.csv contains data on used cars (Toyota Corolla) on sale during late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications. (The example in Section 9.7 is a subset of this dataset.)

Data Preprocessing. Split the data into training (60%) and validation (40%) datasets.

a. Run a regression tree (RT) with outcome variable Price and predictors Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfg_Guarantee, Guarantee_Period, Airco, Automatic_Airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Set the minimum number of records in a terminal node to 1, the maximum number of tree levels to 100, and cp = 0.001, to make the run least restrictive.

   i. Which appear to be the three or four most important car specifications for predicting the car's price?

   ii. Compare the prediction errors of the training and validation sets by examining their RMS error and by plotting the two boxplots. What is happening with the training set predictions? How does the predictive performance of the validation set compare to the training set? Why does this occur?

   iii. How can we achieve predictions for the training set that are not equal to the actual prices?

   iv. Prune the full tree using the cross-validation error. Compared to the full tree, what is the predictive performance for the validation set?

b. Let us see the effect of turning the price variable into a categorical variable. First, create a new variable, Binned_Price, that categorizes price into 20 bins. Now repartition the data, keeping Binned_Price instead of Price. Run a classification tree (CT) with the same set of input variables as in the RT, and with Binned_Price as the output variable. Keep the minimum number of records in a terminal node to 1.

   i. Compare the tree generated by the CT with the one generated by the RT. Are they different? (Look at structure, the top predictors, size of tree, etc.) Why?

   ii. Predict the price, using the RT and the CT, of a used Toyota Corolla with the specifications listed in Table 9.6.

   iii. Compare the predictions in terms of the predictors that were used, the magnitude of the difference between the two predictions, and the advantages and disadvantages of the two methods.
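
The preparation and evaluation steps named in the exercise (the 60%/40% partition, the 20-bin price variable, and the RMS error) can be sketched in plain Python as follows. This is only an illustrative sketch, not a prescribed solution: the helper names partition, rmse, and bin_index are hypothetical, the bins are assumed to be of equal width (the exercise does not say how the 20 bins should be formed), and the price list is made-up placeholder data rather than values from ToyotaCorolla.csv. Fitting the trees themselves is left to whatever software the course uses.

```python
import math
import random


def partition(records, train_frac=0.6, seed=1):
    """Shuffle a copy of the records and split it into (training, validation)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    cut = round(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def bin_index(value, lo, hi, n_bins=20):
    """Equal-width bin number in 1..n_bins (assumption: equal-width bins).

    Values at or above hi are placed in the last bin.
    """
    width = (hi - lo) / n_bins
    idx = int((value - lo) // width) + 1
    return min(max(idx, 1), n_bins)


# Made-up placeholder prices, one per record; NOT the real dataset.
prices = [4000 + 25 * i for i in range(1436)]

train, valid = partition(prices)

# Bin edges are taken from the full price range before repartitioning.
lo, hi = min(prices), max(prices)
binned_train = [bin_index(p, lo, hi) for p in train]

# A baseline "model" that always predicts the training mean, just to
# show how training and validation RMS errors would be compared.
mean_price = sum(train) / len(train)
train_rmse = rmse(train, [mean_price] * len(train))
valid_rmse = rmse(valid, [mean_price] * len(valid))
```

With real tree predictions substituted for the constant baseline, a training RMS error near zero alongside a much larger validation RMS error is the pattern that part a.ii asks you to look for and explain.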