A new year calls for a new project. I decided to look at a Kaggle dataset about US traffic accidents from February 2016 to March 2019. Since the dataset is really large, with about 2.25 million rows and 49 columns, I believe there is a lot to analyze. This entry is just part 1 of my analysis, but it already attempts to answer some interesting questions that I came up with.
I thought this dataset would be interesting mainly because it is so large, but I am also curious whether traffic accident data can tell me more about the behavior of drivers. The preliminary questions I wanted to answer were: how does weather affect the severity of traffic accidents, how does the time of day affect the severity of accidents, and where do the most accidents occur?
Those questions require some assumptions and definitions. Severity is ranked from 1 to 4, where 1 indicates the least impact on traffic and 4 indicates a significant impact on traffic. I also assume that the dataset is representative of all the accidents that occurred in the US and that there is little to no bias in the number of accidents reported in each location. (The dataset was gathered by tracking multiple traffic APIs.)
Most accidents occur in California
California has 484,706 of them. I am not surprised by this finding: California is a huge state and has the largest population in the U.S. Texas is in second place, which makes sense too, since it is also a huge state with a lot of people living there.
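As a rough sketch of how a per-state count like this can be produced with SQL: the table name `accidents` and the toy rows below are made up for illustration, and `State` is assumed to be the name of the column holding the state abbreviation.

```python
import sqlite3

# Toy rows standing in for the real data; only the State column matters here.
rows = [("CA",), ("CA",), ("TX",), ("CA",), ("FL",), ("TX",)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accidents (State TEXT)")
conn.executemany("INSERT INTO accidents VALUES (?)", rows)

# Count accidents per state, largest count first.
state_counts = conn.execute(
    """
    SELECT State, COUNT(*) AS n_accidents
    FROM accidents
    GROUP BY State
    ORDER BY n_accidents DESC
    """
).fetchall()
conn.close()

print(state_counts)
```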
Big Picture - A Bimodal Distribution of Accidents Over the Hours
If I analyze the distribution of accidents over 24 hours, I see a bimodal distribution in all Severity levels. Take a look.
It seems that the number of accidents peaks at around 7 am, dips in the afternoon, and rises again in the late afternoon around 5-6 pm. The very early and the very late hours have the fewest accidents. If you look at the y-axis, which represents the number of accidents in each graph, you can see that Severity Level 2 accidents are the most common and Severity Level 1 accidents are relatively rare. (Level 2 > Level 3 > Level 4 > Level 1)
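A minimal sketch of the hourly breakdown behind plots like these. It assumes the hour is taken from a `Start_Time` column stored as 'YYYY-MM-DD HH:MM:SS' text; the table name and the handful of rows are invented for the example.

```python
import sqlite3

# Made-up rows: (Severity, Start_Time).
rows = [
    (2, "2018-03-05 07:15:00"),
    (2, "2018-03-05 07:40:00"),
    (3, "2018-03-05 07:55:00"),
    (2, "2018-03-05 13:05:00"),
    (2, "2018-03-05 17:20:00"),
    (3, "2018-03-05 17:45:00"),
    (4, "2018-03-06 02:10:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accidents (Severity INTEGER, Start_Time TEXT)")
conn.executemany("INSERT INTO accidents VALUES (?, ?)", rows)

# Extract the hour (0-23) from the timestamp and count accidents
# for every (severity, hour) pair.
hourly = conn.execute(
    """
    SELECT Severity,
           CAST(strftime('%H', Start_Time) AS INTEGER) AS hour,
           COUNT(*) AS n_accidents
    FROM accidents
    GROUP BY Severity, hour
    ORDER BY Severity, hour
    """
).fetchall()
conn.close()

for severity, hour, n in hourly:
    print(severity, hour, n)
```

Plotting `n_accidents` against `hour` separately for each severity level gives one curve per level, which is where a two-peak shape would show up.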
Zooming In: Top 10 Urban Cities vs Top 10 Rural States
Another question that came up was whether the big-picture trend, that the number of accidents has a bimodal distribution, holds if we zoom in on urban areas and rural areas. I found the top 10 urban cities (2014) on this website. (The website takes into account more than 100 variables to determine the urban-ness of a city.) I also found the most rural states from this Quora post. (I based a state's rural-ness on the amount of land that is 'rural'.) Since, according to the 2010 Census, some 80% of the U.S. population lives in urban areas, I felt that comparing the top 10 urban cities with the top 10 rural states made sense.
The top urban cities are
1. New York, NY
2. San Francisco, CA
3. Boston, MA
4. Jersey City, NJ
5. Washington, DC
6. Miami, FL
7. Chicago, IL
8. Seattle, WA
9. Philadelphia, PA
10. Minneapolis, MN
The top rural states are
1. Alaska
2. Wyoming
3. Montana
4. North Dakota
5. South Dakota
6. Idaho
7. New Mexico
8. Nebraska
9. Nevada
10. Utah
* Data from Alaska does not exist in the dataset since the dataset only has data from the contiguous U.S.
Analysis of the top urban cities shows that the trends in Severity Level 2 and 3 accidents do look like the big-picture trend, but take a look at Severity Level 4 accidents -
The bimodal trend is not seen in Severity Level 4 accidents in the 10 most urban cities. Most Severity Level 4 accidents occur in the morning at around 6-7 am, during rush hour. Then the count drops throughout the day.
Analysis of the top rural states shows that Severity Level 2 and 3 accidents also follow a bimodal trend, but Severity Level 4 accidents look like this -
The bimodal trend is not seen here either! In fact, very severe accidents peak in the middle of the day at around 12-1 pm.
This is quite interesting and I wonder why the curve looks the way it does in urban and rural areas of the U.S.
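To make the zoom-in concrete, here is a small sketch of restricting the hourly count to Severity Level 4 accidents in the rural states listed earlier. It assumes the `State` column holds two-letter abbreviations and that the hour comes from `Start_Time`; the table name and rows are invented.

```python
import sqlite3

# The nine rural states that actually appear in the data (Alaska is absent).
RURAL_STATES = ["WY", "MT", "ND", "SD", "ID", "NM", "NE", "NV", "UT"]

# Made-up rows: (State, Severity, Start_Time).
rows = [
    ("MT", 4, "2017-06-01 12:30:00"),
    ("WY", 4, "2017-06-02 12:05:00"),
    ("UT", 4, "2017-06-03 08:45:00"),
    ("CA", 4, "2017-06-03 12:10:00"),  # not a rural state: filtered out
    ("MT", 2, "2017-06-04 12:20:00"),  # not Severity 4: filtered out
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accidents (State TEXT, Severity INTEGER, Start_Time TEXT)"
)
conn.executemany("INSERT INTO accidents VALUES (?, ?, ?)", rows)

# One "?" per state so the list is passed as bound parameters.
placeholders = ", ".join("?" for _ in RURAL_STATES)
query = f"""
    SELECT CAST(strftime('%H', Start_Time) AS INTEGER) AS hour,
           COUNT(*) AS n_accidents
    FROM accidents
    WHERE Severity = 4 AND State IN ({placeholders})
    GROUP BY hour
    ORDER BY hour
"""
rural_sev4 = conn.execute(query, RURAL_STATES).fetchall()
conn.close()

print(rural_sev4)
```

For the urban list, the filter would need to match on city and state together, since the same city name can appear in more than one state.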
Weather Condition Might Not Have An Effect on Severity of Accidents
It is quite interesting, but weather conditions, even bad ones, might not have an effect on the severity of accidents. For one, the largest number of accidents occurs on clear days. So we can say that if the day is clear, there may be more accidents, but that is likely just because more people drive on clear days. There are fewer accidents on non-clear days, and the ordering of the severity levels is still about the same for clear and non-clear days, with Severity 2 accidents the most common and Severity 1 accidents the least common.
Out of the 2.25 million accidents reported, 808,182 occurred on clear days. The weather condition with the next-highest count is overcast, with 382,482 accidents.
Only one traffic accident occurred in weather conditions of heavy smoke, dust whirls, and blowing sand.
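Because clear days have so many more accidents, raw counts are hard to compare across weather conditions. One way to check whether the severity mix really is similar is to compare shares instead of counts. The sketch below uses pandas with a made-up dataframe; `Weather_Condition` and `Severity` are assumed column names, and the numbers are not from the real data.

```python
import pandas as pd

# Invented toy data, only to show the mechanics.
df = pd.DataFrame(
    {
        "Weather_Condition": [
            "Clear", "Clear", "Clear", "Clear",
            "Overcast", "Light Rain", "Overcast", "Light Rain",
        ],
        "Severity": [2, 2, 3, 4, 2, 2, 3, 4],
    }
)

# Collapse the many weather labels into two groups.
df["sky"] = df["Weather_Condition"].eq("Clear").map(
    {True: "clear", False: "not clear"}
)

# Row-normalised table: each row sums to 1, so the two groups can be
# compared even though they contain different numbers of accidents.
severity_share = pd.crosstab(df["sky"], df["Severity"], normalize="index")

print(severity_share)
```

If the two rows look alike on the real data, that supports the idea that weather changes how many accidents happen more than how severe they are.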
Wrapping up brief EDA
This exploratory data analysis of the U.S. traffic accident dataset was really fun. At first, I didn't know where to begin because I had so much data and so many questions. But, tackling one question at a time was a good start. As a conclusion, I would like to say that sqlite3 saved the day. I remember learning about sqlite3 at Metis very briefly and didn't understand why I would ever use it.
sqlite3 turns out to be great when I have a csv file and want to run SQL queries on it in a Jupyter notebook. I simply create an empty database, to_sql my csv or a dataframe into the empty database, then run my query on that new table in my database.
Here is the skeleton of the code I used to run a SQL query using sqlite3 in a Jupyter notebook setting:
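A minimal sketch of that workflow, not necessarily the exact notebook code. The table name `accidents`, the three-row CSV, and the query are placeholders; with a real file, the `io.StringIO` stand-in would be replaced by the path to the CSV.

```python
import io
import sqlite3

import pandas as pd

# Stand-in for the real CSV file so the example runs on its own.
csv_text = io.StringIO(
    "ID,State,Severity\n"
    "A-1,CA,2\n"
    "A-2,CA,3\n"
    "A-3,TX,2\n"
)
df = pd.read_csv(csv_text)

# 1. Create an empty database (in memory here; a file path also works).
conn = sqlite3.connect(":memory:")

# 2. Load the dataframe into a new table in that database.
df.to_sql("accidents", conn, index=False, if_exists="replace")

# 3. Run a SQL query against the table and get a dataframe back.
query = """
    SELECT State, COUNT(*) AS n_accidents
    FROM accidents
    GROUP BY State
    ORDER BY n_accidents DESC
"""
result = pd.read_sql_query(query, conn)
conn.close()

print(result)
```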
Through this EDA, I was able to really understand aggregating over a CASE WHEN statement too. I had struggled with that idea and with how to implement it so that I get the data in the format I wanted. I used both SQL and pandas to do groupbys, but since I was practicing a lot of SQL, I found that I gravitated towards using CASE WHEN with SQL first, rather than the pandas groupby function.
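For anyone who has not seen the pattern, aggregating over CASE WHEN looks roughly like this: each CASE WHEN turns a row into a 1 or a 0, and SUM adds those up per group, which gives one column per severity level. The table name, column names, and rows here are illustrative assumptions.

```python
import sqlite3

# Made-up rows: (Weather_Condition, Severity).
rows = [
    ("Clear", 2), ("Clear", 2), ("Clear", 3),
    ("Overcast", 2), ("Overcast", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accidents (Weather_Condition TEXT, Severity INTEGER)")
conn.executemany("INSERT INTO accidents VALUES (?, ?)", rows)

# One output row per weather condition, one count column per severity level.
pivot = conn.execute(
    """
    SELECT Weather_Condition,
           SUM(CASE WHEN Severity = 1 THEN 1 ELSE 0 END) AS sev_1,
           SUM(CASE WHEN Severity = 2 THEN 1 ELSE 0 END) AS sev_2,
           SUM(CASE WHEN Severity = 3 THEN 1 ELSE 0 END) AS sev_3,
           SUM(CASE WHEN Severity = 4 THEN 1 ELSE 0 END) AS sev_4
    FROM accidents
    GROUP BY Weather_Condition
    ORDER BY Weather_Condition
    """
).fetchall()
conn.close()

print(pivot)
```

A plain GROUP BY on both columns would return the same counts in long form (one row per weather and severity pair); the CASE WHEN version returns them already pivoted into a wide table.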
Overall, this EDA was a lot of fun and the insights were interesting. I will be continuing this EDA project, as there are so many other questions to ask and so many different variables to look at (there are 49 columns!).
For now, thanks for coming along with me on my first EDA of 2020.