Notes from Karen: November 2019

Wednesday, November 27, 2019

Notes on Root Mean Squared Error and R^2

Linear Regression Metrics

Two popular linear regression metrics are the root mean squared error and R^2.

Say you have a data set and have plotted the line of best fit. How well does the line of best fit actually model the data set? Visually, if the data points sit very close to the line, you can say the model is pretty good; if they are scattered everywhere and not particularly hugging the line, it is not. But there is a more rigorous way of telling whether your linear model is good or not.

SSE- Sum of Squared Errors

The most straightforward way to see whether or not your model is a good fit compared to other models is to find the sum of the squared errors. This means you sum up all the squared values of the difference of the actual values and the predicted values. The lower the sum, the better the model.
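As a minimal sketch of that comparison, assume two hypothetical sets of predictions for the same actual values. The names `model_a` and `model_b` and all the numbers are made up for illustration.

```python
def sum_squared_errors(actual, predicted):
    """Sum of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


# Toy numbers, made up for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
model_a = [2.5, 5.5, 7.0, 8.0]   # predictions that stay close to the data
model_b = [1.0, 6.0, 9.0, 12.0]  # predictions that drift away from the data

sse_a = sum_squared_errors(actual, model_a)
sse_b = sum_squared_errors(actual, model_b)

# The model with the lower SSE is the better fit on this data.
print(sse_a, sse_b)
```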

MAE- Mean Absolute Error

Another way to compare models is to find the mean absolute error, which is just the average of the absolute values of the differences between the actual values and the predicted values. Again, the lower the mean, the better the model. SSE is more sensitive to outliers than MAE is, since we are squaring the errors. MAE also needs less refinement to be interpretable, since it is already in the original units of the values. You would need to divide SSE by the total number of values and then take the square root of that for it to be comparably interpretable. Which leads us to...
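To see the outlier sensitivity in numbers, here is a small sketch with made-up values. The two hypothetical prediction lists differ only in the last point, where one of them misses by 10 instead of 1.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def sum_squared_errors(actual, predicted):
    """Sum of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


# Toy numbers, made up for illustration only.
actual = [10.0, 12.0, 14.0, 16.0]
no_outlier = [11.0, 11.0, 15.0, 15.0]    # every error has size 1
with_outlier = [11.0, 11.0, 15.0, 26.0]  # the last error has size 10

mae_ratio = (mean_absolute_error(actual, with_outlier)
             / mean_absolute_error(actual, no_outlier))
sse_ratio = (sum_squared_errors(actual, with_outlier)
             / sum_squared_errors(actual, no_outlier))

# The single outlier inflates SSE far more than it inflates MAE.
print(mae_ratio, sse_ratio)
```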

Root Mean Squared Error

The root mean squared error is simply the square root of the SSE divided by the total number of values. This gives a more interpretable metric for how good our model is, because it is back in the original units of the values. Again, RMSE, like SSE, is more sensitive to outliers and penalizes them more than MAE does.
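A minimal sketch of that calculation, using only the standard library and made-up numbers:

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square root of (SSE divided by the number of values)."""
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sse / len(actual))


# Toy numbers, made up for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 3.0, 7.0, 11.0]  # errors of size 1, 1, 1 and 3

rmse = root_mean_squared_error(actual, predicted)
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# RMSE is in the same units as the data, but the size-3 error
# pulls it above the MAE.
print(rmse, mae)
```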

R^2

R^2, or the coefficient of determination, is another metric you could use to see how good your model is. R^2 is usually between 0 and 1, but there are cases where R^2 can be negative. (That's really 'bad'.) The idea behind R^2 is to measure the target "variance explained" by the model. (How much variance of the target can be explained by the variance of the inputs?)

SST (the sum of the squares total) measures the target's intrinsic variation around its mean. It is given by the sum of the squares of the differences between the actual values and the mean of the actual values.

SSE/SST gives us the fraction of the variance that remains once you've modeled the data. In other words, the sum of the squared errors over the sum of the squares total is the portion of the variance that the model cannot explain. If we subtract SSE/SST from 1, we get the portion of the variance that the model can explain.

1 - SSE/SST is the coefficient of determination, or R^2. "R^2 measures how good [your] regression fit is relative to a constant model that predicts the mean".

(Usually SSE is smaller than SST, assuming that our model does a little or much better than just predicting the mean all the time, so R^2 is between 0 and 1.)
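The three cases (a good model, a model that always predicts the mean, and a model that does worse than the mean) can be sketched as follows. The prediction lists are made-up numbers chosen so the arithmetic is easy to follow.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst


# Toy numbers, made up for illustration only. The mean of `actual` is 5.
actual = [2.0, 4.0, 6.0, 8.0]

good_fit = [2.5, 3.5, 6.5, 7.5]     # close to the data: SSE = 1, SST = 20
mean_only = [5.0, 5.0, 5.0, 5.0]    # always predicts the mean: SSE = SST
worse_fit = [8.0, 6.0, 4.0, 2.0]    # slopes the wrong way: SSE = 80 > SST

r2_good = r_squared(actual, good_fit)     # close to 1
r2_mean = r_squared(actual, mean_only)    # exactly 0
r2_worse = r_squared(actual, worse_fit)   # negative

print(r2_good, r2_mean, r2_worse)
```

The negative value appears whenever SSE is larger than SST, that is, when the model is worse than simply predicting the mean every time.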

Monday, November 25, 2019

Forest Fires in Brazil

Context

As a person who cares about environmental issues, coming across a data set about forest fires in Brazil on Kaggle was very exciting. The data set contains the number of forest fires in 23 states in each month over the span of years from 1998 to 2017 reported by the Brazilian government. I decided to analyze the data set to find the trend of the number of forest fires, to find when forest fires occur with the highest frequency, and to create a Tableau dashboard to visualize the change in the number of forest fires over time.

engine = 'python'

When reading the CSV file with the data set, I had to set the engine parameter to 'python' because some characters in the file could not be decoded as UTF-8.
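The exact call is not shown in these notes, so the following is only a sketch. It assumes the file is Latin-1 encoded (my assumption, not something stated above) and uses a few in-memory bytes as a stand-in for the Kaggle file. The column names are hypothetical.

```python
import io

import pandas as pd

# Stand-in for the real CSV: one row with a Portuguese month name,
# encoded as Latin-1 (assumed encoding, not confirmed by the notes).
raw = "state,month,number\nAcre,Março,10\n".encode("latin-1")

# These bytes are not valid UTF-8, which is the kind of problem described above.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# engine="python" as in the notes; the explicit encoding is my own addition.
df = pd.read_csv(io.BytesIO(raw), engine="python", encoding="latin-1")
print(utf8_ok, df.loc[0, "month"])
```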
I realized that the month names were in Portuguese, so I decided to convert them to numbers. I created a new column in my dataframe that holds the numeric equivalent of each month name.
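One way to build such a column is a dictionary lookup with `Series.map`. This is a sketch, not necessarily the approach used here, and the column names `month` and `month_num` are hypothetical.

```python
import pandas as pd

# Portuguese month names mapped to month numbers.
MONTH_NUMBERS = {
    "Janeiro": 1, "Fevereiro": 2, "Março": 3, "Abril": 4,
    "Maio": 5, "Junho": 6, "Julho": 7, "Agosto": 8,
    "Setembro": 9, "Outubro": 10, "Novembro": 11, "Dezembro": 12,
}

# Toy rows standing in for the real data set.
df = pd.DataFrame({"month": ["Janeiro", "Março", "Dezembro"]})

# New column with the numeric equivalent of each month name.
df["month_num"] = df["month"].map(MONTH_NUMBERS)
print(df)
```

Any month name missing from the dictionary would come back as NaN, which makes spelling variations in the source data easy to spot.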

Geocoding

I wanted to have the latitude and longitude for each state in the data set so that I could plot the states on a map later. So, I used LocationIQ's API to find the latitude and longitude for each state.

I then was able to create a dictionary of the states and their respective latitude and longitude values. I was able to use two apply functions to apply to each row the correct latitude and longitude based on the state.
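A sketch of the dictionary-plus-apply step is below. The coordinates are rough placeholder values typed in by hand, not LocationIQ output, and the column names are hypothetical.

```python
import pandas as pd

# State -> (latitude, longitude). Placeholder values for illustration only.
STATE_COORDS = {
    "Acre": (-9.0, -70.5),
    "Bahia": (-12.5, -41.7),
}

# Toy rows standing in for the real data set.
df = pd.DataFrame({"state": ["Acre", "Bahia", "Acre"]})

# Two apply calls, one per coordinate, each looking up the row's state.
df["latitude"] = df["state"].apply(lambda s: STATE_COORDS[s][0])
df["longitude"] = df["state"].apply(lambda s: STATE_COORDS[s][1])
print(df)
```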

Average Number of Forest Fires Per Year in Each State

The visual I wanted to create is a map that marks each state in the data set with a circle showing its average number of forest fires per year. The larger and darker the circle, the higher the average. First, I needed a dataframe with the average number of forest fires per year in each state, which required a groupby.
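The original code is not reproduced in these notes, so here is a minimal sketch of what such a groupby could look like. It assumes hypothetical column names `state`, `year`, and `number`, and it reads "average per year" as the mean of the reported counts within each state and year. The rows are toy numbers.

```python
import pandas as pd

# Toy rows standing in for the real data set (two reports per state and year).
df = pd.DataFrame({
    "state": ["Acre", "Acre", "Acre", "Acre",
              "Bahia", "Bahia", "Bahia", "Bahia"],
    "year": [1998, 1998, 1999, 1999, 1998, 1998, 1999, 1999],
    "number": [10, 20, 30, 50, 5, 15, 25, 35],
})

# Mean number of fires for every (state, year) pair.
avg_fires = (
    df.groupby(["state", "year"], as_index=False)["number"]
      .mean()
      .rename(columns={"number": "avg_fires"})
)
print(avg_fires)
```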

Total Number of Forest Fires Per Year

It would be great if I could see the overall trend of forest fires throughout the years. I decided to use another groupby to find the sum of all forest fires over the years. I found that there was an increasing trend.
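A sketch of the yearly total, again with hypothetical column names (`year`, `number`) and toy numbers rather than the real data:

```python
import pandas as pd

# Toy rows standing in for the real data set.
df = pd.DataFrame({
    "state": ["Acre", "Acre", "Bahia", "Bahia"],
    "year": [1998, 1999, 1998, 1999],
    "number": [30, 80, 20, 60],
})

# Total number of fires across all states for each year.
fires_per_year = df.groupby("year")["number"].sum()
print(fires_per_year)

# A year-over-year difference that is mostly positive suggests an increasing trend.
yearly_change = fires_per_year.diff().dropna()
```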

It would be pretty nice to have a dashboard where the user could click a point on the line graph, which corresponds to a specific year and its number of forest fires, and have the map of Brazil filter to that year and show the average number of forest fires in each state. So, I did that.

Monthly Trends in Forest Fires

Before I show the Tableau dashboard, I would like to show a graph that represents the monthly trend of forest fires over the years. As you can see, the graph below shows that the number of forest fires is low at the beginning of the year, increases quickly in June, peaks in July, drops a bit in September, then spikes again in October.

Looking at a more granular level, you can see that the monthly lines have shifted upward since 1998. This supports the positive trend seen in the number of forest fires over the years.

An Increasing Trend of Forest Fires in Brazil

Lastly, I would like to present the Tableau dashboard I created with a short video. You can see the change over time in the size and shade of the circles that represent the average number of forest fires in each state. The trend line of the total number of forest fires each year is also there. Sao Paulo always has a very large average number of forest fires each year.