Sunday, January 19, 2020

Discover a Houseplant App

Did you know that January 10th was National Houseplant Day? It was a great coincidence that around that time, I was working on an app that recommends similar houseplants, so that people can discover different plants they might like.

Motivation

I got the inspiration to create an app revolving around houseplants because my mom, cousin, and best friend are plant lovers. Their houses are basically mini-forests. I scraped all my data off of houseplant411.com, created a crude recommendation system, wrapped it in a Flask app, and deployed it on Heroku.

Data

The data I scraped included images of 136 popular houseplants, their descriptions, and other information related to their care, such as light, water, fertilizer, temperature, humidity, flowering, pests, diseases, soil, pot size, pruning, propagation, special occasion, and poisonous plant information. I used Beautiful Soup to do all of my web scraping.


The Recommendation System

At the core of my app is a recommendation system built on cosine similarity over vectors created with TfidfVectorizer. I vectorized two sources of text for each plant: its description and its care information. Comparing the description vectors with cosine similarity gives one similarity matrix; comparing the care-information vectors the same way gives a second. Averaging the two matrices produces the final matrix I use to recommend 5 different, yet similar, plants for whichever plant the user chooses.
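As a sketch of that pipeline, here is a toy version with a three-plant corpus standing in for the real 136 (the plant texts below are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the scraped description and care texts.
descriptions = [
    "trailing vine with heart shaped leaves, tolerates low light",
    "upright succulent with thick fleshy leaves, loves bright light",
    "trailing plant with small round leaves, prefers indirect light",
]
care_info = [
    "water when soil is dry, low light, average humidity",
    "water sparingly, bright light, low humidity",
    "water weekly, indirect light, high humidity",
]

# One similarity matrix per text source, then average the two.
desc_sim = cosine_similarity(TfidfVectorizer().fit_transform(descriptions))
care_sim = cosine_similarity(TfidfVectorizer().fit_transform(care_info))
combined = (desc_sim + care_sim) / 2

def recommend(idx, n=2):
    # Rank all plants by combined similarity, skipping the plant itself.
    order = np.argsort(combined[idx])[::-1]
    return [int(i) for i in order if i != idx][:n]
```

With 136 plants, `recommend(idx, n=5)` on the full matrices gives the 5 suggestions the app displays.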

The App

The application can be found here - https://discover-a-houseplant.herokuapp.com/

The idea of the app is that three plants are chosen at random from my list of 136. The user picks the one they like and is brought to a new page showing 5 similar, yet different, plants. Each page of the application shows a photo, a name, and a description for every plant.

Results

You can decide if you like the recommendations put forth by this app or not! But, I think this is a fun way to discover new plants.

Code

The code for this project can be found on my GitHub at https://github.com/morningkaren/discover-a-houseplant.






Tuesday, January 7, 2020

US Traffic Accidents Trends (Part 1)

Happy 2020!

A new year calls for a new project. I decided to look at a Kaggle dataset about US traffic accidents from February 2016 to March 2019. Since the dataset is really large, with over 2.5 million rows and 49 columns, I believe there is a lot to analyze. This entry is just part 1 of my analysis, but it already attempts to answer some interesting questions that I came up with.

I thought this dataset would be interesting mainly because it is so large. But I am also curious to see whether traffic accident data holds any insights into the behavior of drivers. Some preliminary questions I wanted to answer: How does weather affect the severity of traffic accidents? How does the time of day affect severity? And where do the most accidents occur?

Those questions require some assumptions and definitions. Severity is ranked between 1 and 4, where 1 indicates the least impact on traffic and 4 indicates a significant impact on traffic. I also assume that the dataset is representative of all the accidents that occurred in the US and that there is little to no bias in the number of accidents reported in each location. (The dataset was gathered by tracking multiple traffic APIs.)

Most accidents occur in California

484,706 of them. I am not surprised by this finding: California is a huge state and the most populous in the U.S. Texas, another huge and populous state, comes in second, which makes sense too.



Big Picture - A Bimodal Distribution of Accidents Over the Hours

Analyzing the distribution of accidents over the 24 hours of the day, I see a bimodal shape at every severity level. Take a look.







The number of accidents peaks at around 7 am, dips in the early afternoon, and rises again in the late afternoon around 5-6 pm. The very early and very late hours have the fewest accidents. Looking at the y-axis of each graph, which represents the number of accidents, we can see that there are relatively few Severity Level 1 accidents and the most Severity Level 2 accidents (Level 2 > Level 3 > Level 4 > Level 1).
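The table behind each of those histograms is just a group-by on the hour of the accident. A toy version, with a few made-up rows standing in for the dataset's Start_Time and Severity columns:

```python
import pandas as pd

# Hypothetical miniature stand-in for the Kaggle accidents table.
df = pd.DataFrame({
    "Start_Time": pd.to_datetime([
        "2018-03-01 07:15", "2018-03-01 07:40", "2018-03-01 17:30",
        "2018-03-02 02:05", "2018-03-02 17:55",
    ]),
    "Severity": [2, 2, 3, 2, 2],
})

df["hour"] = df["Start_Time"].dt.hour
# Accidents per hour, per severity level: the counts each histogram plots.
hourly = df.groupby(["Severity", "hour"]).size()
```

Plotting `hourly` per severity level (one bar chart over hours 0-23 each) reproduces the panels above.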

Zooming In: Top 10 Urban Cities vs Top 10 Rural States

Another question that came up was whether the big-picture trend, that the number of accidents has a bimodal distribution, is consistent if we zoom in on urban areas and rural areas. I found the top 10 urban cities (2014) on this website. (The website takes into account more than 100 variables to determine the urban-ness of a city.) I also found the most rural states from this Quora post. (I based a state's rural-ness on the amount of its land that is 'rural'.) Since, according to the 2010 Census, some 80% of the U.S. population lives in urban areas, I felt comparing the top 10 urban cities and the top 10 rural states made sense.

The top urban cities are

1. New York, NY
2. San Francisco, CA
3. Boston, MA
4. Jersey City, NJ
5. Washington, DC
6. Miami, FL
7. Chicago, IL
8. Seattle, WA
9. Philadelphia, PA
10. Minneapolis, MN

The top rural states are

1. Alaska
2. Wyoming
3. Montana
4. North Dakota
5. South Dakota
6. Idaho
7. New Mexico
8. Nebraska
9. Nevada
10. Utah
* Alaska does not appear in the dataset, which covers only the contiguous U.S.
Analysis of the top urban cities shows that Severity Level 2 and 3 accidents do follow the big-picture trend, but take a look at Severity Level 4 accidents -




The bimodal trend is not seen in Severity Level 4 accidents in the 10 most urban cities. The greatest number of Severity Level 4 accidents occurs in the morning, around 6-7 am. Rush hour. Then the count drops throughout the day.

Analysis of the top rural states shows that Severity Level 2 and 3 accidents are also bimodal, but Severity Level 4 accidents look like this -



The bimodal trend is not seen here either! In fact, the most severe accidents occur in the middle of the day, around 12-1 pm.

This is quite interesting and I wonder why the curve looks the way it does in urban and rural areas of the U.S.

Weather Condition Might Not Have An Effect on Severity of Accidents

Interestingly, it may be the case that weather conditions, even bad ones, do not have much effect on the severity of accidents. For one, the greatest number of accidents occurs on clear days. But that is likely just because more people drive on clear days. There are fewer accidents on non-clear days, and the distribution of severity levels is about the same for clear and non-clear days, with Severity 2 accidents the most common and Severity 1 the least.

Out of the 2.25 million accidents reported, 808,182 occurred on clear days. The weather condition with the next-highest count is overcast, with 382,482 accidents. Only one accident occurred in each of the conditions of heavy smoke, dust whirls, and blowing sand.

Wrapping up a brief EDA

This exploratory data analysis of the U.S. traffic accident dataset was really fun. At first, I didn't know where to begin because I had so much data and so many questions, but tackling one question at a time was a good start. To conclude, I would like to say that sqlite3 saved the day. I remember learning about sqlite3 very briefly at Metis and not understanding why I would ever use it.

sqlite3 turns out to be great when I have a csv file and want to run SQL queries on it in a Jupyter notebook. I simply create an empty database, to_sql my csv (as a dataframe) into it, then run my query against that new table.

Here is the skeleton of the code I used to run a SQL query with sqlite3 in a Jupyter notebook:
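A minimal sketch of that pattern; the dataframe, database path, and table name here are placeholders (in the real project the dataframe comes from pd.read_csv on the Kaggle accidents csv):

```python
import sqlite3
import pandas as pd

# Tiny stand-in dataframe for the accidents csv.
df = pd.DataFrame({"State": ["CA", "CA", "TX"], "Severity": [2, 3, 2]})

# Create an (empty) database, then load the dataframe into it as a table.
conn = sqlite3.connect(":memory:")  # or a file path like "accidents.db"
df.to_sql("accidents", conn, if_exists="replace", index=False)

# Now any SQL query can run against that table, right in the notebook.
query = "SELECT State, COUNT(*) AS n FROM accidents GROUP BY State ORDER BY n DESC;"
result = pd.read_sql(query, conn)
```

`pd.read_sql` hands the query result straight back as a dataframe, so you can keep working in pandas afterwards.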


Through this EDA, I also came to really understand aggregating over a CASE WHEN statement. I had struggled with that idea and with implementing it so that I get the data in the format I wanted. I used both SQL and pandas to do group-bys, but since I was practicing a lot of SQL, I found that I gravitated towards CASE WHEN in SQL first, rather than pandas' groupby function.
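For the curious, here is a small sketch of what aggregating over CASE WHEN looks like, on a made-up miniature table (column names mimic the dataset; the counts are invented): each SUM(CASE WHEN ...) turns one severity level into its own column, which is essentially a pivot done in SQL.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature accidents table.
df = pd.DataFrame({
    "Severity": [1, 2, 2, 3, 4, 2],
    "Weather_Condition": ["Clear", "Clear", "Rain", "Clear", "Rain", "Overcast"],
})
conn = sqlite3.connect(":memory:")
df.to_sql("accidents", conn, index=False)

# One row per weather condition, one column per severity level.
query = """
SELECT Weather_Condition,
       SUM(CASE WHEN Severity = 2 THEN 1 ELSE 0 END) AS sev2,
       SUM(CASE WHEN Severity = 4 THEN 1 ELSE 0 END) AS sev4
FROM accidents
GROUP BY Weather_Condition;
"""
pivot = pd.read_sql(query, conn)
```

The same reshaping in pandas would be a groupby plus an unstack (or pd.crosstab); the CASE WHEN version just keeps everything in one query.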

Overall, this EDA was a lot of fun and the insights were interesting. I will be continuing this EDA project, as there are so many other questions to ask and so many different variables to look at (there are 49 columns!). 

For now, thanks for coming along with me on my first EDA of 2020.