Notes from Karen: code

Did you know that January 10th was National Houseplant Day? It was a great coincidence that around that time, I was working on an app that is supposed to recommend similar houseplants to people so that they can discover different plants they might like.

Motivation

I got the inspiration to create an app revolving around houseplants because my mom, cousin, and best friend are plant lovers. Their houses are basically mini-forests. I scraped all my data from houseplant411.com, built a crude recommendation system, and dressed it up in a Flask app, which I then deployed on Heroku.

Data

The data I scraped included images of 136 popular houseplants, their descriptions, and care information such as light, water, fertilizer, temperature, humidity, flowering, pests, diseases, soil, pot size, pruning, propagation, special occasion, and poisonous plant information. I used Beautiful Soup to do all of my web scraping.

The Recommendation System

At the core of my app is a recommendation system that applies cosine similarity to vectors created with TfidfVectorizer. I built one set of vectors from the plants' text descriptions and another from their care information. Comparing the description vectors with cosine similarity gave me one similarity matrix, and comparing the care vectors gave me a second. I then averaged the two matrices to get a final matrix, which I used to recommend 5 different, yet similar, plants for the plant the user chooses.
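A minimal sketch of that idea with scikit-learn, assuming toy data: the plant names, descriptions, and care text below are made up for illustration, and the helper names `similarity_matrix` and `recommend` are hypothetical, not the names used in the project.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for the scraped plant text.
names = ["Pothos", "Philodendron", "Snake Plant", "Jade Plant"]
descriptions = [
    "trailing vine with heart shaped green leaves",
    "climbing vine with large heart shaped leaves",
    "upright succulent with stiff sword shaped leaves",
    "succulent shrub with thick oval leaves",
]
care = [
    "low to bright indirect light, water when soil is dry",
    "bright indirect light, keep soil lightly moist",
    "low light tolerant, water sparingly, drought tolerant",
    "bright light, water sparingly, drought tolerant",
]


def similarity_matrix(texts):
    """Build TF-IDF vectors for the texts, then compare every pair."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    return cosine_similarity(vectors)


# Average the description-based and care-based similarity matrices.
combined = (similarity_matrix(descriptions) + similarity_matrix(care)) / 2


def recommend(plant, k=2):
    """Return the k plants most similar to `plant`, excluding itself."""
    i = names.index(plant)
    ranked = np.argsort(combined[i])[::-1]  # most similar first
    return [names[j] for j in ranked if j != i][:k]


print(recommend("Pothos"))
```

With 136 plants, the same `recommend` call with `k=5` would return five suggestions per plant.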

The App

The application can be found here - https://discover-a-houseplant.herokuapp.com/

The idea of the app is that three random plants are drawn from my list of 136 plants. The user chooses a plant they like and is then brought to a new page showing 5 similar, yet different, plants. Every page shows a photo, a name, and a description for each plant.

Results

You can decide whether you like the recommendations put forth by this app or not! But I think this is a fun way to discover new plants.

Code

The code for this project can be found on my GitHub at https://github.com/morningkaren/discover-a-houseplant

From a YouTube video by Python Programmer (Giles) titled "Can you LEARN DATA SCIENCE for FREE? YES! I'll show you HOW!", I discovered an online book about Python by Christian Hill.
In Chapter 6, there was a coding problem on the topic of airport distances. The problem gives a data set of airports with their locations, latitudes, and longitudes. I had to write code to find the distance between two airports based on their latitudes and longitudes.

The Importance of Reading the Question

Here is the problem, word for word:

The file busiest_airports.txt provides details of the 30 busiest airports in the world in 2014. The tab-delimited fields are: three-letter IATA code, airport name, airport location, latitude and longitude (both in degrees).

Write a program to determine the distance between two airports identified by their three-letter IATA code, using the Haversine formula (see, for example, Exercise 4.4.2) and assuming a spherical Earth of radius 6378.1 km).

I spent a lot of time parsing the data set because I did not read the second sentence in the question carefully. It says that the fields in the data set are "tab-delimited". I now understand that this means the columns are separated by tabs. At first, I thought the columns were separated by commas, so I had a hard time even reading the file with pandas.
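A minimal sketch of reading tab-delimited data with pandas. The two sample rows use approximate coordinates made up for illustration, the column names are my own assumption, and `header=None` assumes the file has no header row.

```python
import io

import pandas as pd

# Two made-up rows in the tab-delimited layout the problem describes;
# the coordinates are approximate and only for illustration.
sample = (
    "JFK\tJohn F. Kennedy International Airport\tNew York, United States\t40.64\t-73.78\n"
    "AMS\tAmsterdam Airport Schiphol\tAmsterdam, Netherlands\t52.31\t4.76\n"
)

# Assumed column names; header=None assumes there is no header row.
columns = ["IATA", "Name", "Location", "Latitude", "Longitude"]

# sep="\t" tells pandas the fields are separated by tabs, not commas.
data = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=columns)
print(data)
```

For the real file, the path to busiest_airports.txt would take the place of `io.StringIO(sample)`. Without `sep="\t"`, pandas splits on commas, so a location such as "New York, United States" gets cut in the wrong place.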

Importing modules

The first step is to import pandas and NumPy. Pandas will be used to read the data set. I usually import NumPy together with pandas, but I am not very familiar with NumPy yet. Later on, I also imported some functions from the math module, namely sqrt, sin, asin, and cos, because I would be using a form of the Haversine formula to calculate distances from the latitudes and longitudes of the airports.

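Derived from the Haversine formula, the distance $d$ between two points given their latitudes and longitudes is

$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$$

where

$r$ : radius of the sphere (here, the Earth),
$\varphi_1, \varphi_2$ : latitude of point 1 and latitude of point 2 (in radians),
$\lambda_1, \lambda_2$ : longitude of point 1 and longitude of point 2 (in radians). (Wikipedia)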

Using iloc to choose certain columns

The next step was to convert the latitudes and longitudes from degrees to radians, since the Haversine formula works on radian measurements. To convert degrees to radians, we multiply the degree measurement by pi and divide by 180. I approximated pi as 3.14.

The harder part was knowing how to isolate the column of latitudes and the column of longitudes so that I could apply the formula to them.

I used iloc to choose certain columns from my data set. iloc selects by integer position. The code I used to convert the latitudes from degrees to radians is as follows:

data['Latitude in Radians'] = ((data.iloc[:, 3:4]*3.14)/180)

This line of code creates a new column named "Latitude in Radians", in which each element of the 4th column of the dataframe is multiplied by 3.14 and divided by 180. The 4th column of the dataframe holds the latitudes of the airports in degrees. The iloc[:, 3:4] picks out all the rows ( : ) for the columns starting at the 4th column and stopping before the 5th column (3:4). Since iloc uses positions and we count starting from 0, the 4th column is denoted by 3.

Similar code was used to convert the longitudes in degrees to radians.
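A short sketch of the same conversion, assuming a small hypothetical dataframe with made-up column names and approximate coordinates. It also shows the difference between iloc[:, 3] and iloc[:, 3:4], and uses np.radians so that the full-precision value of pi is used instead of 3.14.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with the same column order as the data set.
data = pd.DataFrame(
    [
        ["JFK", "John F. Kennedy", "New York", 40.64, -73.78],
        ["AMS", "Schiphol", "Amsterdam", 52.31, 4.76],
    ],
    columns=["IATA", "Name", "Location", "Latitude", "Longitude"],
)

# iloc[:, 3] selects one column as a Series (one-dimensional);
# iloc[:, 3:4] selects the same column as a one-column DataFrame (two-dimensional).
lat_series = data.iloc[:, 3]
lat_frame = data.iloc[:, 3:4]

# np.radians multiplies by pi / 180 using the full-precision value of pi.
data["Latitude in Radians"] = np.radians(lat_series)
data["Longitude in Radians"] = np.radians(data.iloc[:, 4])
print(data[["IATA", "Latitude in Radians", "Longitude in Radians"]])
```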

Dictionary to map airport to their respective latitudes and longitudes

Given two three-letter IATA codes that represent two airports, I had to pull out their respective latitudes and longitudes and apply a formula to find the distance between the two airports.

This sounds like a dictionary is required. A dictionary can easily map two elements together. I wanted to create a dictionary that mapped the IATA code to the airport's latitude and another dictionary that mapped the IATA code to the airport's longitude.

I thought that I could create a list of IATA codes, a list of latitudes and a list of longitudes and create two dictionaries that way.

But first, I needed to create those lists.

List of lists to a single list

To create a list of IATA codes, a list of latitudes, and a list of longitudes from three columns of the dataframe, I had to use the values attribute and the tolist() method. The code I used to create a list of individual IATA codes is as follows:

dftolistIATA = data.iloc[:, 0:1].values.tolist()

The IATA code is the 1st column. The values attribute returns the underlying values as an array, and the tolist() method converts that array into a list. Because iloc[:, 0:1] selects a one-column dataframe (two-dimensional) rather than a series, tolist() actually created a LIST OF LISTS, so each individual IATA code sat in its own list. Lists are denoted with brackets, so a list of letters would be something like [a, b, c, d, e], but a list of lists would have brackets inside brackets, like so: [[a], [b], [c], [d], [e]].

I really just needed one list, so I needed a flattened list. How was I supposed to do that? I used Stack Overflow to find the code to create a flattened list. It is as follows (following the code from above):

flattened_list = [y for x in dftolistIATA for y in x]

This code takes the elements of each element in dftolistIATA (the list of lists), which are the individual IATA codes, and collects them into a single list. I still have to read more about flattening lists to really understand how this code works, though.
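One way to read the flattening line is as two nested for loops written on a single line. The sketch below uses a small hypothetical list of example IATA codes and spells out the equivalent loops.

```python
# A list of lists, like the result of .values.tolist() on a one-column dataframe.
list_of_lists = [["ATL"], ["PEK"], ["LHR"]]

# The comprehension reads left to right like nested for loops:
#   for x in list_of_lists  -> x is each inner list, e.g. ["ATL"]
#   for y in x              -> y is each item inside that inner list
flattened = [y for x in list_of_lists for y in x]

# The same thing written as explicit loops.
flattened_loop = []
for x in list_of_lists:
    for y in x:
        flattened_loop.append(y)

print(flattened)       # ['ATL', 'PEK', 'LHR']
print(flattened_loop)  # ['ATL', 'PEK', 'LHR']
```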

So, after I created my flattened lists, I was able to create my dictionaries using the following code:

mydictionary = dict(zip(flattened_list, flattened_list2))

mydictionary2 = dict(zip(flattened_list, flattened_list3))

Essentially, the first line of code created a dictionary called mydictionary which mapped the flattened_list (list of IATA codes) to flattened_list2 (list of latitudes).
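A small sketch of how dict and zip work together, assuming hypothetical flattened lists with approximate coordinates; the variable names here are illustrative, not the ones in the original script.

```python
# Hypothetical flattened lists; coordinates are approximate, for illustration only.
codes = ["JFK", "AMS"]
latitudes = [40.64, 52.31]
longitudes = [-73.78, 4.76]

# zip pairs items by position: ("JFK", 40.64), ("AMS", 52.31).
# dict turns each pair into a key -> value entry.
latitude_by_code = dict(zip(codes, latitudes))
longitude_by_code = dict(zip(codes, longitudes))

print(latitude_by_code["JFK"])   # 40.64
print(longitude_by_code["AMS"])  # 4.76
```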

Writing code for a form of the Haversine Formula

The last step was to create a function that takes two IATA codes and returns the distance in kilometers between the two airports. This was not that difficult, as the form of the Haversine formula I had gave the distance between two points explicitly in terms of their latitudes and longitudes. I just had to write it out in code.

The skeleton for a function in Python is as follows:

def function_name(parameters):
    # function code, if necessary
    return something
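A minimal sketch of how such a distance function might look, using sqrt, sin, asin, and cos from the math module and the Earth radius of 6378.1 km from the problem statement. The lookup dictionaries, their approximate coordinates, and the helper name `haversine_km` are assumptions for illustration, not the code from the original script.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6378.1  # radius given in the problem statement

# Hypothetical lookup tables in degrees; coordinates are approximate.
latitude_by_code = {"JFK": 40.64, "AMS": 52.31}
longitude_by_code = {"JFK": -73.78, "AMS": 4.76}


def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    lam1, lam2 = radians(lon1), radians(lon2)
    h = (
        sin((phi2 - phi1) / 2) ** 2
        + cos(phi1) * cos(phi2) * sin((lam2 - lam1) / 2) ** 2
    )
    return 2 * radius * asin(sqrt(h))


def distance(code1, code2):
    """Distance in km between two airports identified by IATA code."""
    return haversine_km(
        latitude_by_code[code1], longitude_by_code[code1],
        latitude_by_code[code2], longitude_by_code[code2],
    )


print(round(distance("JFK", "AMS")))
```

Keeping the formula in its own function means it can be checked on simple inputs: for example, two points on the equator 90 degrees of longitude apart should be a quarter of the Earth's circumference apart.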

Testing out my function

Lastly, I had to test out my function. It takes two arguments, the IATA codes of the two airports, and returns the distance between them. Printing distance(JFK, AMS) gave 5854.025421753566. On Google, the distance is 5850 km. Close enough!

Final Thoughts

I found this problem to be very interesting and worthwhile. I love to travel, so the topic resonated with me. I learned about the Haversine formula and gained a deeper understanding of iloc, parsing data sets, and dictionaries. I still don't understand the code for flattening lists, but that will come with time.

This problem took me a while, probably 5 hours, which could have been cut short if I had read the question carefully and understood more Python. But now that I have more of a taste of what Python and math are capable of doing, I am excited to keep on learning and experimenting.

GitHub code - Airport_distances.py

Notes from Karen

Sunday, January 19, 2020

Discover a Houseplant App

Sunday, June 16, 2019

Airport Distances