Advanced analytics: using a decision tree and random forest approach to predict bike rentals
In our last two blogs we explored the business case for advanced analytics, and how to create a linear regression model using the example of predicting bike rentals from a Kaggle challenge. A rudimentary linear regression model built from just the numeric variables did not produce great results: measured by root mean squared logarithmic error, its predictive power puts it towards the bottom of the rankings on the Kaggle website. To improve on this we will now bring in the categorical predictors, such as weather and season, and create two different models: a regression tree and a random forest.
Creating a Regression Tree in R
As we mentioned in the previous blog, we will separately predict the number of registered and casual rentals, because the two profiles are different, as you can see from the image below. The registered users tend to rent more in the morning (8 am) and after work (6-7 pm) coinciding with commuting patterns, while casual users are more evenly distributed throughout the day.
We will also predict the logarithm of the casual and registered rentals, since the distributions of the transformed variables are closer to normal.
Here is the code for creating a regression tree:
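A minimal sketch of what this step might look like using the `rpart` package, assuming the training set is loaded as a data frame called `train`; the column names used here are illustrative and may differ in your copy of the Kaggle data:

```r
library(rpart)

# Predict the log-transformed casual rentals from time, weather
# and calendar predictors
train$log_casual <- log(train$casual + 1)

tree_casual <- rpart(
  log_casual ~ hour + temp + atemp + humidity + windspeed +
    weekday + workingday + season + month + weather,
  data = train,
  method = "anova"   # regression tree: splits minimise squared error
)

summary(tree_casual)
```

The `method = "anova"` argument tells `rpart` to build a regression tree rather than a classification tree, since the outcome is numeric.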
This creates a regression tree and displays a summary of the model. Instead of showing the raw output from the model, we have created a visualisation of the model using OrgVue. Starting at the top of the tree, we have all casual cycle rentals. At each level below, we make a decision about the rental in question, moving it through the “splits” of the tree until it reaches a leaf (terminal) node. For instance, at the first split from the top, we ask: “did the rental occur between 9:00 and 20:00, or outside those hours?” If it occurred during these hours, we move right and make the next decision: “was the temperature below 19.27 degrees?” And so on. Once all of the rentals have been grouped into their leaf nodes, we can use this grouping to form a prediction of the number of rentals that follow the decision rules we made to place them there. We can see that the predictors used in the tree are hour, temperature (actual and real-feel), day of the week, and working day.
We’ll do the same to calculate a model for the registered users.
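Using the same illustrative column names, the registered-user tree might be built like this:

```r
# Same approach for the registered users, again on the log scale
train$log_registered <- log(train$registered + 1)

tree_registered <- rpart(
  log_registered ~ hour + temp + atemp + humidity + windspeed +
    weekday + workingday + season + month + weather,
  data = train,
  method = "anova"
)

summary(tree_registered)
```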
Here is the summary output, visualised in OrgVue. Notice that in this model, season and month are both used as predictors, while day of the week and climate variables are no longer important.
Now, let’s calculate the predictive power of the combined models. First, we’ll use the test data to predict bike rentals. Remember to exponentiate the results to reverse the log transformation.
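A sketch of the prediction and scoring step, again with assumed names (`test` for the holdout data frame, `count` for the actual total rentals):

```r
# Predict on the test set, back-transform from the log scale,
# and combine the two user types into a total count
pred_casual     <- exp(predict(tree_casual, newdata = test)) - 1
pred_registered <- exp(predict(tree_registered, newdata = test)) - 1
pred_total      <- pred_casual + pred_registered

# Root mean squared logarithmic error against the actual totals
rmsle <- sqrt(mean((log(pred_total + 1) - log(test$count + 1))^2))
rmsle
```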
It seems that the regression tree does a much better job of predicting bike rentals than the previous, cruder linear regression model. The difference in root mean squared logarithmic error between the two models is 0.56, a 43% improvement.
Let’s import this into Tableau and take a look at the actual and predicted values:
It seems that the model significantly over-predicts, as can be seen in the graph above, which compares actual and predicted values for three days in January.
Random Forest Approach
Now, let’s look at how a random forest performs in the same circumstances. As before, we will create two separate models, for casual and registered users, and then combine the results for prediction. We will also use the same log-transformed variables as in the linear regression and the decision tree.
Here is the code for building the random forest model using package “randomForest”:
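A minimal sketch of this step, assuming the same hypothetical data frame and column names as above:

```r
library(randomForest)

set.seed(42)  # make the forest reproducible

rf_casual <- randomForest(
  log_casual ~ hour + temp + atemp + humidity + windspeed +
    weekday + workingday + season + month + weather,
  data = train,
  ntree = 500,        # number of trees in the ensemble
  importance = TRUE   # record variable importance measures
)

rf_casual  # printing the model reports the % of variance explained
```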
The model explains 88.92% of the variance in the data, which is very high. After fitting the same model for the registered users, we can generate predictions with the following code and calculate the root mean squared logarithmic error.
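The prediction and scoring step mirrors the one used for the regression tree, with `rf_registered` assumed to be the analogous forest fitted to the registered users:

```r
# Back-transform each forest's predictions and combine into a total
rf_pred_casual     <- exp(predict(rf_casual, newdata = test)) - 1
rf_pred_registered <- exp(predict(rf_registered, newdata = test)) - 1
rf_pred_total      <- rf_pred_casual + rf_pred_registered

# Root mean squared logarithmic error against the actual totals
rf_rmsle <- sqrt(mean((log(rf_pred_total + 1) - log(test$count + 1))^2))
rf_rmsle
```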
So it seems that the random forest has more predictive power: it results in a much lower root mean squared logarithmic error. Looking at the output in Tableau, we can see that the predicted values (red line) are closer to the actuals (purple line). The model could still be improved, as peak values are clearly overestimated.
A comparison in Tableau of predicted values between the random forest (red line) and the decision tree (green line) clearly exhibits the increased accuracy of the random forest. The decision tree is almost always above the random forest and peak values are even more overestimated.
There is scope to improve both models, but even with these simple versions the random forest is more accurate. However, greater accuracy may not always be the factor that determines whether a model is used in practice, especially in business environments. Accuracy is usually weighed against model simplicity or the time a model takes to run. In this case the random forest suffers from two significant shortcomings:
- It is much more difficult to interpret than a decision tree, which is very clearly laid out. From a business point of view, a random forest is much more of a “black box”, and in a consulting environment clarity is always desired.
- The random forest takes significantly longer to train, which may not be ideal when results are needed quickly, especially if many logistical decisions must be made simultaneously.
There are always different ways to model a scenario, but you have to take into account the advantages and disadvantages of each model and how it will work in practice, in a business environment.
See the first two blogs in this series: The business case for using advanced analytics & Advanced analytics – creating a linear regression model