
Tuesday, January 7, 2020

US Traffic Accidents Trends (Part 1)

Happy 2020!

A new year calls for a new project. I decided to look at a Kaggle dataset about US traffic accidents from February 2016 to March 2019. Since the dataset is really large, with over 2.5 million rows and 49 columns, I believe there is a lot to analyze. This entry is just part 1 of my analysis, but it already attempts to answer some interesting questions that I came up with.

I thought this dataset would be interesting mainly because it is so large. But I am also interested to see if I can find any insights in traffic accident data to understand more about the behavior of drivers. Some preliminary questions I wanted to answer were: how does weather affect the severity of traffic accidents, how does the time of day affect the severity of accidents, and where do the most accidents occur?

Those questions require some assumptions and definitions. Severity is ranked between 1 and 4, where 1 indicates the least impact on traffic and 4 indicates a significant impact on traffic. I also assume that the dataset is representative of all the accidents that occurred in the US and that there is little to no bias in the number of accidents reported in each location. (The dataset was gathered by tracking multiple Traffic APIs.)

Most accidents occur in California

484,706 of them. I am not surprised by this finding. California is a huge state and has the largest population in the U.S. Texas is in second place, but that makes sense too: it is also a huge state with a lot of people living there.



Big Picture - A Bimodal Distribution of Accidents Over the Hours

If I analyze the distribution of accidents over 24 hours, I see a bimodal distribution in all Severity levels. Take a look.







It seems as if the number of accidents peaks at around 7 am, dips in the afternoon, and comes back up in the late afternoon around 5-6 pm. The very early and very late hours have the fewest accidents. If you look at the y axis, which represents the number of accidents in each graph, you can see that there are relatively few Severity Level 1 accidents and the most Severity Level 2 accidents. (Level 2 > Level 3 > Level 4 > Level 1)
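For reference, here is a rough pandas sketch of how hourly counts like these can be pulled out of the data. The csv file name is a placeholder; 'Start_Time' and 'Severity' are column names from the Kaggle file.

```python
import pandas as pd

# Column names ('Start_Time', 'Severity') come from the Kaggle csv; the file name is a placeholder
df = pd.read_csv("US_Accidents.csv", parse_dates=["Start_Time"])
df["hour"] = df["Start_Time"].dt.hour

# Number of accidents in each hour of the day, one row per severity level (1-4)
hourly = df.groupby(["Severity", "hour"]).size().unstack(fill_value=0)

# One small chart per severity level, like the plots above
hourly.T.plot(subplots=True, layout=(2, 2), figsize=(10, 6))
```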

Zooming In: Top 10 Urban Cities vs Top 10 Rural States

Another question that came up was whether the big-picture trend, that the number of accidents has a bimodal distribution, holds when we zoom in on urban areas and rural areas. I found the top 10 urban cities (2014) on this website. (The website takes into account more than 100 variables to determine the urban-ness of a city.) I also found the most rural states from this Quora post. (I based a state's rural-ness on the amount of land that is 'rural'.) Since, according to the 2010 Census, some 80% of the U.S. population lives in urban areas, I felt comparing the top 10 urban cities and the top 10 rural states made sense.

The top urban cities are

1. New York, NY
2. San Francisco, CA
3. Boston, MA
4. Jersey City, NJ
5. Washington, DC
6. Miami, FL
7. Chicago, IL
8. Seattle, WA
9. Philadelphia, PA
10. Minneapolis, MN
The top rural states are
1. Alaska
2. Wyoming
3. Montana
4. North Dakota
5. South Dakota
6. Idaho
7. New Mexico
8. Nebraska
9. Nevada
10. Utah
* Data from Alaska does not exist in the dataset since the dataset only has data from the contiguous U.S.
Analysis of the top urban cities shows that the trends in Severity Level 2 and 3 accidents do look like the big-picture trend, but take a look at Severity Level 4 accidents -




The bimodal trend is not seen in Severity Level 4 accidents in the 10 most urban cities. Most Severity Level 4 accidents occur in the morning at around 6-7 am. Rush hour. Then the count drops throughout the day.

Analysis of the top rural states shows that Severity Level 2 and 3 accidents also follow a bimodal trend, but Severity Level 4 accidents look like this -



The bimodal trend is not seen here either! In fact, the most severe accidents occur in the middle of the day, at around 12-1 pm.

This is quite interesting and I wonder why the curve looks the way it does in urban and rural areas of the U.S.

Weather Condition Might Not Have An Effect on Severity of Accidents

It is quite interesting, but it may be the case that weather conditions, especially bad weather conditions, do not have an effect on the severity of accidents. For one, the largest number of accidents occurs on clear days. So we could say that if the day is clear, there may be more accidents, but that is simply because more people drive on clear days. There are fewer accidents on non-clear days, and the distribution of severity levels is still about the same for clear and non-clear days, with the most Severity 2 accidents and the fewest Severity 1 accidents.

Of the roughly 2.25 million accidents reported, 808,182 occurred on clear days. The weather condition with the next highest number of accidents is overcast, with 382,482 accidents.
Only one traffic accident each occurred in weather conditions of heavy smoke, dust whirls, and blowing sand.
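A quick sketch of how these weather counts can be pulled, again with a placeholder file name ('Weather_Condition' is a column in the Kaggle data):

```python
import pandas as pd

df = pd.read_csv("US_Accidents.csv")

# Accidents per weather condition
weather_counts = df["Weather_Condition"].value_counts()
print(weather_counts.head())   # Clear and Overcast sit at the top
print(weather_counts.tail())   # the rarest conditions have only a single accident each
```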

Wrapping up brief EDA

This exploratory data analysis of the U.S. traffic accident dataset was really fun. At first, I didn't know where to begin because I had so much data and so many questions. But, tackling one question at a time was a good start. As a conclusion, I would like to say that sqlite3 saved the day. I remember learning about sqlite3 at Metis very briefly and didn't understand why I would ever use it. 

sqlite3 turns out to be great when I have a csv file and want to run SQL queries on it in a Jupyter notebook. I simply create an empty database, to_sql my csv or a dataframe into the empty database, then run my query on that new table in my database. 

Here is the skeleton of the code I used to run a SQL query using sqlite3 in a Jupyter notebook setting:
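Something like the following, where the database, csv, and table names are just placeholders:

```python
import sqlite3
import pandas as pd

# Connect to (or create) an empty SQLite database file
conn = sqlite3.connect("accidents.db")

# Read the csv into a dataframe, then to_sql it into the database as a new table
df = pd.read_csv("US_Accidents.csv")
df.to_sql("accidents", conn, if_exists="replace", index=False)

# Run a SQL query against the new table and get the result back as a dataframe
query = """
    SELECT State, COUNT(*) AS num_accidents
    FROM accidents
    GROUP BY State
    ORDER BY num_accidents DESC;
"""
pd.read_sql_query(query, conn)
```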


Through this EDA, I also came to really understand aggregating over a CASE WHEN statement. I had struggled with that idea and with how to implement it so that I got the data in the format I wanted. I used both SQL and pandas to do groupbys, but since I was practicing a lot of SQL, I found that I gravitated toward using CASE WHEN in SQL first rather than the pandas groupby function.
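As an illustration (not the exact query from my notebook, and continuing from the connection above), this is the kind of CASE WHEN aggregation I mean: it pivots the severity levels into their own columns, one count per hour.

```python
# Sum over CASE WHEN expressions to get one column of counts per severity level
query = """
    SELECT
        CAST(strftime('%H', Start_Time) AS INTEGER) AS hour,
        SUM(CASE WHEN Severity = 1 THEN 1 ELSE 0 END) AS severity_1,
        SUM(CASE WHEN Severity = 2 THEN 1 ELSE 0 END) AS severity_2,
        SUM(CASE WHEN Severity = 3 THEN 1 ELSE 0 END) AS severity_3,
        SUM(CASE WHEN Severity = 4 THEN 1 ELSE 0 END) AS severity_4
    FROM accidents
    GROUP BY hour
    ORDER BY hour;
"""
pd.read_sql_query(query, conn)
```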

Overall, this EDA was a lot of fun and the insights were interesting. I will be continuing this EDA project, as there are so many other questions to ask and so many different variables to look at (there are 49 columns!). 

For now, thanks for coming along with me on my first EDA of 2020.

Thursday, December 19, 2019

An Exploration of Classification Models using Mobile Dataset

I wanted to review some classification algorithms, so I found a clean dataset on Kaggle about classifying mobile price ranges to practice with logistic regression, decision trees, and random forests. I will walk through my thought process for this exploration of the Kaggle dataset.

The Dataset

The dataset has its own story about how Bob wants to start his own mobile company but does not know how to estimate the price of the mobiles his company creates. He decides to collect sales data of mobile phones from various companies and to find some relation between a phone's features and its selling price.

The features of the phone include battery power, bluetooth capabilities, clock speed, dual sim option, front camera mega pixels, 4G capabilities, internal memory, mobile depth, mobile weight, number of core processors, primary camera mega pixels, pixel height resolution, pixel width resolution, RAM, screen height, screen width, talk time with a single battery charge, 3G capabilities, touch screen, and wifi capabilities.

The dataset has a test and a train file, but because there were no price ranges in the test file, I decided to just use the training data and carve a separate test set out of it. This decreased the amount of data I could train on, but I felt that it was ok for this exploration.

Exploratory Data Analysis

One thing I did before running any models was to look at pairplots using seaborn. Since there were so many features, I separated the dataset into three groups with about 7-8 features in each set of pairplots and compared them with the price range, which is denoted 0, 1, 2, or 3. I looked at the pairplots to see if any features are linearly separable. (That would show me that logistic regression could be used and would work well.)

How can I tell if some features are linearly separable or not? Well, I looked at each pairplot to see which features overlapped the least. For example, if I look at the distribution plot of battery power with the different price ranges as the "hue" attribute, I see that the blue and red hues are clearly separated. There is definitely some overlap, but there is a huge separation between the blue and red distributions.
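A rough sketch of how these plots can be built (the file name and the particular group of features are placeholders; the column names, like fc for the front camera megapixels, come from the Kaggle file):

```python
import pandas as pd
import seaborn as sns

train = pd.read_csv("train.csv")   # the Kaggle training file

# One group of features at a time, with price_range (0-3) as the hue
cols = ["battery_power", "px_width", "ram", "fc", "price_range"]
sns.pairplot(train[cols], hue="price_range")
```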



















There is some separation in the pixel width feature too.






















But the separation with the RAM is most significant.


















So you might ask: what are some examples of features that do not look separable? Well, I would say the front camera megapixels feature is not linearly separable, since all four colors seem to be distributed in the same way. Look at the front camera pixels distribution: the colors overlap almost perfectly.






















So, now I know that certain features like RAM, battery power, and pixel width may be very telling of the price range of mobile phones, but something like the front camera pixels might not be very telling.

Are classes balanced?

Another thing to look at before doing any modeling is to check the class balance of the target. My target has four classes that tell me how expensive a mobile phone is-- class 0, 1, 2, and 3. Looking at a distribution plot of the price range, I see that it is uniform. This rarely happens in real life, but again, this is a clean Kaggle data set.

Balanced classes tell me that I will not need to do any over- or undersampling, use class weights, or change the classification threshold after modeling.
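The check itself is a one-liner (again assuming the Kaggle training file):

```python
import pandas as pd
import seaborn as sns

train = pd.read_csv("train.csv")

# Class balance of the target: the four price ranges appear in equal numbers
print(train["price_range"].value_counts())
sns.countplot(x="price_range", data=train)   # a uniform distribution, as described above
```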
















Logistic Regression Model

I used scikit-learn's modules to run my logistic regression model. I did two train_test_splits to divide my dataset into a train-validate-test split of 60-20-20. I also imported StandardScaler to standardize my features. I looked at many metrics, including F1, accuracy, precision, and recall, and I also looked at a confusion matrix to practice and make sure that I was getting the metrics right.

For my logistic regression model, I decided to focus on the three features that looked more linearly separable than the rest: battery_power, ram, and px_width. I standard-scaled my features, then fit my model on my training set. I predicted on my validation set and used a classification report to see the precision, recall, F1, and accuracy scores. Using just three features, I got an accuracy of 82%, with F1 scores of .91 for class 0, .93 for class 3, .68 for class 1, and .71 for class 2. That's not bad.
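A minimal sketch of that pipeline (the file name and random_state values are placeholders):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

train = pd.read_csv("train.csv")
X = train[["battery_power", "ram", "px_width"]]   # the three most separable features
y = train["price_range"]

# Two splits to get a 60-20-20 train / validate / test division
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# Standard-scale the features (fit on train only), then fit and evaluate the model
scaler = StandardScaler().fit(X_train)
logreg = LogisticRegression(C=1).fit(scaler.transform(X_train), y_train)
y_pred = logreg.predict(scaler.transform(X_val))
print(classification_report(y_val, y_pred))
```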

Logistic Regression-Confusion Matrix





















This part will cover how to calculate precision, recall, and accuracy using the confusion matrix. The confusion matrix basically tells me the actual vs. predicted values of my target. For example, in this confusion matrix, there are 95 phones that were predicted as class 0 and were actually class 0.

So, what is accuracy in relation to the confusion matrix? Accuracy is the number of correctly classified cases over the total number of cases. (A correctly classified case for class 0 is one that is predicted to be 0 and is actually 0.) In this case, it is the sum of the diagonal that goes from top left to bottom right, divided by the sum of all the numbers in the matrix, which comes out to about .82 (the same number as reported above).

What is precision in relation to the confusion matrix? Precision is the total number of true positives over the sum of true positives and false positives. (An example of a false positive is a case that is predicted to be 0 but is not actually 0.) We have to look at precision class by class. In this case, the total number of true positives for class 0 is 95 and the total number of false positives is 19, so the precision is 95/114, which rounds to .83.

What is recall in relation to the confusion matrix? Recall is the total number of true positives over the sum of true positives and false negatives. (An example of a false negative is a case that is predicted to not be 0 but is actually 0.) In this case, the total number of true positives for class 0 is 95 and the number of false negatives is 0, so the recall is 1.

What is F1 in relation to precision and recall? F1 is actually the harmonic mean of the precision and recall scores.
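Written out, F1 = 2 * (precision * recall) / (precision + recall). For class 0, that works out to 2 * (.83 * 1) / (.83 + 1) ≈ .91, which matches the F1 score reported for class 0 above.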

Decision Tree

Using sklearn's decision tree algorithm is simple. You instantiate the DecisionTreeClassifier and set a max depth for the tree. Then you fit it on your training set and predict on your validation set. The decision tree performed poorly in comparison to the logistic regression model, with an accuracy of .26. We could guess the classes of the mobile phones at random and do just about as well as the decision tree classifier.

I want to make a note that instead of using just three features, I used all the features available to me when training and fitting my decision tree.
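A minimal sketch of that workflow (the max_depth value is just a placeholder, and the split mirrors the 60-20-20 division used above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

train = pd.read_csv("train.csv")
X_all = train.drop(columns="price_range")   # all of the features this time
y = train["price_range"]

# Same 60-20-20 train / validate / test division as before
X_train, X_test, y_train, y_test = train_test_split(X_all, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# Instantiate the classifier with a max depth, fit on the training set, predict on validation
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X_train, y_train)
print(classification_report(y_val, tree.predict(X_val)))
```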

The F1 score for class 0 was .28, class 1 was .33, class 2 was .14, and class 3 was .27.

Random Forest

Similarly, the random forest did not perform that well compared to logistic regression. I used 1000 estimators (basically individual decision trees) and a max depth of 2 in my random forest model. A random forest basically averages out many decision trees to give a more robust and accurate prediction, but in this case the accuracy score turned out to be .24. The F1 score for class 0 was .28, class 1 was .31, class 2 was .12, and class 3 was .27.

However, what was interesting to see was the feature importances in my random forest model. The following feature importance graph shows the relative importance of each feature, based on the information gained from splitting on that feature. Here, ram, battery_power, px_height, and px_width showed the most information gain. This is consistent with my observation that ram, battery_power, and px_width showed the most linear separability in the pairplots from the exploratory data analysis above.
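A sketch of the random forest fit and the feature importance plot. Only the 60% training portion of the split is carved out here, since that is all the importances need; the file name and random_state are placeholders.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")
X_all = train.drop(columns="price_range")
y = train["price_range"]

# Keep the same 60% training portion; the remaining 40% would become validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X_all, y, test_size=0.4, random_state=42)

# 1000 estimators (individual trees) with a max depth of 2
forest = RandomForestClassifier(n_estimators=1000, max_depth=2, random_state=42)
forest.fit(X_train, y_train)

# Sorted feature importances; ram, battery_power, px_height, and px_width came out on top in my run
importances = pd.Series(forest.feature_importances_, index=X_all.columns).sort_values()
importances.plot(kind="barh", figsize=(6, 8))
```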














Logistic Regression worked best

A reason why I think logistic regression worked best in this scenario is that there were features that were linearly separable. In cases where there is little linear separability, a random forest might have worked better.

A final note

Improvements could be made to my model with more investigation into how to obtain a better F1 score when predicting class 1 and class 2 mobiles. Perhaps running a logistic regression model with all of the features and lowering the C parameter (the inverse of regularization strength, which I left at the default of 1) would help make sure the model doesn't over-fit. That would be interesting to see.