Thursday, June 20, 2019

The effect of sea-level rise on Great Britain

Original Map of Great Britain
A rise in sea level of 10 meters

A rise in sea level of 50 meters

A rise in sea level of 100 meters
The graphics above show a map of Great Britain and the effect of the sea level rising by 10 meters, 50 meters, and 100 meters. The black area represents the ocean and the lighter parts represent the land mass. You can see the lighter parts diminishing as the sea level rises.

I got the data from Christian Hill's online book in Chapter 7 on matplotlib.

The graphics were created from an .npy file containing an array of the average altitudes of the 10 km x 10 km hectad squares covering Great Britain.

I was told to use imshow to plot the map given in the .npy file and then plot three more maps in which the sea level rises by 10 m, 50 m, and 100 m. I also had to figure out the percentage of land area remaining in each case, relative to its present value.

.imshow

I used .imshow to plot the map after I opened the npy file with numpy's load function. 

The code looks like this:

plt.imshow(a, interpolation='nearest', cmap='gray')
plt.show()

where plt represents matplotlib.pyplot. interpolation='nearest' tells imshow to draw each array element as a solid block of pixels instead of smoothing between neighbouring values (so at this resolution it may not look any different), and cmap='gray' sets a grayscale color map, which is why higher altitudes appear lighter.

Manipulating the arrays

To find the altitudes of the land mass after a sea level rise of 10 m, I subtracted 10 from every element in my array. This was done by simply taking the given array a and subtracting 10. Since some values became negative, I had to replace all negative values with 0. That was done with the following code:

z[z<0] = 0

where z is an array. The above code is basically saying, for all negative values in the array z, let them equal 0.
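Putting the whole step together, a rough sketch of the code looks like this (the .npy file name here is an assumption; z is the adjusted copy of the array):

import numpy as np
import matplotlib.pyplot as plt

a = np.load('gb_altitudes.npy')   # hectad altitude grid (file name assumed)

z = a - 10        # sea level rises by 10 m, so every altitude drops by 10 m
z[z < 0] = 0      # anything now at or below sea level is treated as ocean

plt.imshow(z, interpolation='nearest', cmap='gray')
plt.show()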

Deducing percentage of land remaining

The next step was to deduce the percentage of land mass remaining after a rise in sea level of 10 m, 50 m, or 100 m.

Since the NumPy array gives the average altitude of each 100 km^2 (10 km x 10 km) hectad of Great Britain, we can find the total area of the map by counting the number of elements in the array. There are 8118 hectads, covering 811800 km^2. That counts the area of the ocean too. To find just the land area, we need to subtract the hectads that have an average altitude of 0.

To count how many hectads had altitude 0, I used a for loop like so:

counter = 0
for row in a:
    for value in row:
        if value == 0:
            counter += 1
print(counter)

where a is an array.

In the original map, there were 5261 hectads with altitude 0. I subtracted the number of hectads with altitude 0 from the total number of hectads (8118) to get the number of hectads of land mass: 8118 − 5261 = 2857 hectads, which means the land mass of the original map is 285700 km^2.

I used a similar for loop to find the number of hectads of land mass with altitude greater than 0 for the latter 3 maps. Then, I compared the number with the original map's land mass. 
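As a sketch, that comparison can be wrapped in a small helper. Here np.count_nonzero is just a shortcut for the counting loop shown above, and the 8118 and 5261 figures come from the original map:

import numpy as np

original_land = 8118 - 5261          # hectads above 0 m on the original map

def percent_remaining(flooded_map):
    # share of the original land area that is still above sea level
    land_left = np.count_nonzero(flooded_map > 0)
    return 100 * land_left / original_land

For the 10 m map, percent_remaining should come out close to the 83.86% figure below.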

When the sea level rises by 10 m, the percentage of land remaining compared to the original land is 83.86%.

When the sea level rises by 50 m, the percentage of land remaining compared to the original land is 64.02%.

When the sea level rises by 100 m, the percentage of land remaining compared to the original land is 45.15%.

Does it look like that in the maps?

Final Thoughts

I am very happy to know how to load some type of image through python. Although I did not create the code for the map, I have a bit of knowledge of how pictures can be created through arrays! That is very cool. I always wondered how pictures of maps were created through code, and now I know, kind of.

Climate change is also something I am very interested in. I hope to do my final project at Metis on something related to climate change.

On a side note, I will be attending the Spring cohort's career day today to see them present their final projects. I am excited about that. 


Github code - Sea Level






BMI variation with GDP

How does per capita GDP compare to the BMI of men across the world? That is a question in Chapter 7 of Christian Hill's book. I was asked to create a scatter plot (bonus if it was colored) to compare BMI and GDP. The size of each bubble is proportional to the population of the country, and the countries are color-coded by continent.

Reading tsv files with pandas

The 4 data sets were in tsv format, so I used pandas' read_csv to read them (read_csv handles tab-separated files if you pass sep='\t'). Each of the four data sets had a list of countries and either their population, average BMI for men, GDP per capita, or the continent they are in. The data sets did not have the same amount of data: some were missing countries, and some were missing values for BMI, GDP, or population.

Making Dictionaries

Since I did not know how to directly create a scatter plot from data sets that did not have an equal number of rows, I decided to create a dictionary whose keys are the countries with existing data on average BMI, GDP, and population, and whose values are those BMI, GDP, and population figures.

How can I do that?

I first made 4 dictionaries with the countries as keys and either BMI, GDP, population, or continent as values. To build them, I had four lists of lists that I had to flatten, so I wrote a function to create a dictionary from a dataframe object.

This did not get rid of my missing values, but they will be dealt with later.

For loop

I then wrote a for loop: for each country in my GDP dictionary, if that country is also in the set of countries in my BMI dictionary (ignoring the missing countries here), I appended a tuple consisting of the average BMI in that country and the GDP of that country to an empty list. I later added more elements to the tuple, so that the population of that country and the continent of that country were included as well. (The population element was divided by 8,000,000 because the bubbles were otherwise too large, so I had to scale the population down.)

Getting the above paragraph into code form required a lot of logical reasoning and some knowledge of how dictionaries worked!

The result is a list of tuples.
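A rough sketch of that loop looks like the following (the dictionary names are assumptions; the actual code is on Github):

list_of_bmi_gdp = []
for country, gdp in gdp_dictionary.items():
    # only keep countries that appear in every dictionary
    if country in bmi_dictionary and country in population_dictionary and country in continent_dictionary:
        list_of_bmi_gdp.append((
            bmi_dictionary[country],                    # average BMI for men (x-axis)
            gdp,                                        # GDP per capita (y-axis)
            population_dictionary[country] / 8000000,   # bubble size, scaled down
            continent_dictionary[country],              # continent, later mapped to a color
        ))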

List of tuples to scatterplot

The next step was to create a scatterplot from my list of tuples. I looked it up and there is a quick way to do that. My list of tuples was named list_of_bmi_gdp and I applied

x,y,z,q= zip(*list_of_bmi_gdp)

zip combined with * unzips the list of tuples back into separate sequences, which are assigned to x, y, z, and q.

Then, I plotted my scatter plot with the following code:

plt.scatter(x,y,s=z, c=q)

x represents my BMI (shown on my x-axis), y represents my GDP (shown on my y-axis), s = z means that my third dimension, size, will be represented by my list of population z, and c =q means that my fourth dimension, color, will be represented by my list of colors q.

Note that I had to first convert my continents into colors before I could assign c to q. That was done with a for loop like so:

for country, continent in color_dictionary.items():
    if continent == 'Europe':
        color_dictionary[country] = 'red'
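
The snippet above only covers Europe. A sketch of one way to handle every continent at once is a lookup dictionary; the pairs other than Europe/red and Asia/orange are assumptions here:

continent_colors = {
    'Europe': 'red',
    'Asia': 'orange',
    'Africa': 'yellow',
    'North America': 'green',
    'South America': 'blue',
    'Oceania': 'purple',
}
for country, continent in color_dictionary.items():
    color_dictionary[country] = continent_colors[continent]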

matplotlib.patches


The last step is to add a legend to my scatterplot to show which color represents which continent. I used patches to add a custom legend, importing matplotlib.patches as mpatches.

To make a red patch, I used this code:

red_patch = mpatches.Patch(color='red', label='Europe')

To add that to my legend, I used this code:

plt.legend(handles=[red_patch, orange_patch, yellow_patch, green_patch, blue_patch, purple_patch])

About the Scatterplot

From the scatterplot, we see that Asia has several countries with huge populations, shown by the large orange bubbles relative to the other bubbles. However, Africa and Asia have low GDP and low BMI. There seems to be a trend where countries with higher GDP have higher BMI on average. But you can still see some Asian countries with relatively high GDP and low BMI (the orange dots in the middle of the graph). North America seems to have the highest BMI, along with relatively high GDP. Europe is clustered at a relatively high BMI and high GDP.

Final Thoughts

I get really excited when I am asked to create multi-dimensional scatter plots because I know the results will be very nice looking. Just look at the scatter plot! It tells so much and there are so many layers to it, but not so many that it becomes overwhelming. I had to explain to my friend what was going on, but once she got the hang of reading it, there are correlations in the plot that are worthwhile to note.

I learned about patches and adding color to scatterplots in this exercise. I also learned more about the zip function and how to pull things from dictionaries.

Overall, another worthwhile problem!

Here is the code I used on Github- BMI.py

Tuesday, June 18, 2019

Big Mac Index

I attempted another problem in Christian Hill's book, this time in Chapter 7 which focuses on matplotlib.

The problem was about using the Big Mac index from The Economist to measure the purchasing power parity between two currencies. Basically, it can tell us how over or under valued a currency is relative to the dollar given the price of Big Macs in each country.

The percentage by which each currency is over- or under-valued is calculated by the formula:

percentage = ( (local price converted to USD − US price) / (US price) ) * 100

I used NumPy to read the txt files and convert each Big Mac price to USD given the exchange rate.
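
As a minimal sketch of that step, assuming three NumPy arrays read from the txt file (local price in local currency, exchange rate in local currency per US dollar, and the US price in dollars), the conversion and the valuation percentage look like this; the example values are made up:

import numpy as np

local_price = np.array([3.0, 320.0])      # assumed example Big Mac prices (e.g. euros, yen)
exchange_rate = np.array([0.9, 110.0])    # assumed local currency per US dollar
us_price = 5.0                            # assumed US Big Mac price in dollars

local_price_usd = local_price / exchange_rate
valuation_pct = (local_price_usd - us_price) / us_price * 100
print(valuation_pct)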

The hard part was really just converting the given columns of months and years into datetime format for matplotlib to plot.

Converting two columns to datetime format

Once I got a list of months and a list of years in integer format, I converted them into tuples of (month, year) using a while loop. I don't think this is the best way to do it, since you have to know how many rows of data there are; there is probably an easier way (see the sketch below).
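
For what it's worth, the easier way is probably zip, which pairs the two lists element by element without needing to know the row count (months and years here are the integer lists mentioned above):

list_of_month_years = list(zip(months, years))   # e.g. [(4, 2000), (4, 2001), ...]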

Once I had my list of tuples of (month, year), I used the datetime module to convert them into datetime objects. The data has no "day" value, so I used 1 as the day for every tuple, like so:

from datetime import datetime

dt = []
for element in list_of_month_years:
    # element is a (month, year) tuple and datetime expects (year, month, day),
    # so unpack it in that order and use 1 as the day
    dt_obj = datetime(element[1], element[0], 1)
    dt.append(dt_obj)
print(dt)

Now, I had my list of datetimes. I just had to plot it against the valuation percentages. 

The Plots





Pretty neat graphs!

Final Thoughts

I learned more about the datetime module and creating line graphs with matplotlib. It would be interesting to know more about the economic history of the countries in the graphs to explain the overall trend or the peaks and troughs of the graphs.

Github code - Big Mac Index


Linear Regression on four data sets

Another question in Chapter 6 of Christian Hill's book was about linear regression over four data sets.

Using NumPy, it was easy to find the mean and variance of both x and y for each data set. They turned out to be the same across the board, the mean of x is equal to 9, the mean of y is equal to 7.5, the variance of x is equal to 10 and the variance of y is equal to 3.75.

 The correlation coefficient is also the same across the board at .816.

The linear regression line was also the same across the board at approximately y= .5x + 3
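
A sketch of how those numbers can be computed with NumPy for one of the four data sets (x1 and y1 are assumed to already be arrays):

import numpy as np

print(np.mean(x1), np.mean(y1))            # 9.0 and 7.5
print(np.var(x1), np.var(y1))              # 10.0 and roughly 3.75 (population variance)
print(np.corrcoef(x1, y1)[0, 1])           # roughly 0.816
slope, intercept = np.polyfit(x1, y1, 1)   # roughly 0.5 and 3.0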

However, once the data sets were plotted, we can see that they are in fact very different data sets, even though all of them have the same mean, variance, correlation coefficient, and line of best fit.

The x1, y1 data set looks like this graphically:
The x2, y2 data set looks like this:
The x3, y3 data set looks like this:
The x4, y4 data set looks like this:
As you can see, the scatter plots are very different, so how come the data sets have so many similar statistical attributes?

The mean

The mean of a data set is the average or mathematically, the sum of all the elements divided by the number of elements. A dataset that is very spread out can have the same mean as a dataset that is clustered together. For example, the dataset with elements {0, 5, 10} has mean 5 and the data set with elements {5, 5, 5} also has mean 5, but they are essentially, very different data sets since the range of the first one is 10, while the range of the latter is 0.

The variance

The variance of a data set tells us something about the spread of a data set. It is mathematically, the sum of the squares of the difference of the mean and each element divided by the number of elements. The variance tells us the average of the squared distances of each element from the mean.

The correlation coefficient

The correlation coefficient tells us how closely two variables are related to each other. Correlation coefficients lie in the range -1 to 1: -1 implies a perfectly inverse linear relationship, while 1 implies a perfectly direct one. Mathematically, the correlation coefficient is calculated by dividing the sum of the products of each x element's difference from the mean of x and the corresponding y element's difference from the mean of y by the square root of the product of the sum of the squared differences of the x elements from the mean of x and the sum of the squared differences of the y elements from the mean of y.

It may be easier to show you the formula.
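Written out, this is the standard Pearson correlation coefficient, matching the description above:

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \; \sum_i (y_i - \bar{y})^2}}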


Intuitively, if the difference between each element of x and the mean of x is constant and the difference between each element of y and the mean of y is also constant, we get 1 as the correlation coefficient. For example, if the differences between the x data and the mean of x are {1, 1, 1, 1, 1} and the differences between the y data and the mean of y are {2, 2, 2, 2, 2}, we get (1*2)+(1*2)+(1*2)+(1*2)+(1*2) = 10 as the numerator and the square root of [(1+1+1+1+1)(4+4+4+4+4)] = square root of [5(20)] = square root of [100] = 10 as the denominator. This gives a correlation coefficient of 10/10 = 1.

This is saying that if the spread of the x data and the spread of the y data track each other closely, our correlation coefficient goes up.

Intuitively, if the x and y data both have large spreads that move together in the same way, our correlation coefficient will also go up.

But if there is a lot of variation in the spread of the x data and very little in the spread of the y data, our correlation coefficient goes down. For example, if the differences between the x data and the mean of x are {0, 0, 5, 5, 10, 10} and the differences between the y data and the mean of y are {1, 1, 1, 1, 1, 1}, we get 0+0+5+5+10+10 = 30 as the numerator and the square root of [(0+0+25+25+100+100)(1+1+1+1+1+1)] = square root of [(250)(6)] = square root of [1500] ≈ 38.7 as the denominator. Our coefficient becomes 30/38.7, which is less than 1.

So, the correlation coefficient in a sense measures how similar the spreads of our x and y data sets are. (Graphically, it represents the amount of clustering around the line of best fit.)

Regression Line

The regression line, or line of best fit, is the straight line that best models the data set. It takes the form of any line, y = mx + b, where the slope m is equal to the correlation coefficient times the quotient of the standard deviation of y and the standard deviation of x. The y-intercept b is found by taking the difference between the mean of y and the product of the slope m and the mean of x. This makes sense, since the y = mx + b equation tells us that b = y - mx; we are using the mean of y and the mean of x as a point on the regression line to find the y-intercept.

Why is the slope equal to the correlation coefficient times the quotient of the standard deviation of y and the standard deviation of x? The standard deviation of y over the standard deviation of x is, in a way, the spread of y relative to the spread of x. It is always positive or 0, since the standard deviation is the square root of the variance. We multiply that by the correlation coefficient, which supplies the sign of the slope, i.e. the way x and y are related to each other.
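In symbols, the two relationships from the last two paragraphs are:

m = r \, \frac{s_y}{s_x}, \qquad b = \bar{y} - m\,\bar{x}

where s_x and s_y are the standard deviations of x and y, and \bar{x}, \bar{y} are their means.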

Proper use of the Regression line

As we see from the graphs, the same regression line can be fitted to four different scatter plots.

But, perhaps other than in the first graph, a regression line is not suited to represent these scatter plots. Regression lines should only be used if the scatter plot behaves in a somewhat linear manner. In the second plot, the data follows a parabolic curve, the third plot has an outlier, and the fourth plot obviously does not behave in a linear manner.

We can always fit a regression line to any data set, so it must be up to our good judgement to use regression lines when they fit a data set. 

It is best to plot the graph of a data set first to visually see if a regression line would work to model the data. 

Final Thoughts

I learned how to plot a regression line on my scatter plot with matplotlib.pyplot. I also learned how to find the correlation coefficient using numpy. This problem also made me try to intuitively explain what the correlation coefficient and the line of regression is. :D 

Github code - Four Data Sets

Monday, June 17, 2019

Analyzing meteorological data with NumPy

Another interesting problem I wrote code for was a short analysis of the meteorological data from Heathrow. This was also found in Christian Hill's book, in Chapter 6, which is all about NumPy. I know that I used pandas in my previous post about a Chapter 6 problem, but this time I did use NumPy to complete my analysis.

The problem gives a data set of the meteorological data in Heathrow including the year, month, max temperature, min temperature, rainfall, and hours of sun.
I had to find, using NumPy:
1) the 10 hottest and 10 coldest months in the data set
2) the total rainfall in each year and the wettest year in the data set
3) the least sunny June.

Reading the txt file with NumPy

I first gave my txt file a name and then specified the datatypes and column names that existed in the file. I opened the txt file by passing that name, the datatypes, how many rows to skip, and which columns of the data set to use.

So, the code ended up looking like this:
import numpy as np

fname = 'heathrowdata.txt'
dtype1 = np.dtype([('year', 'f8'), ('month', 'f8'), ('maxtemp', 'f8'), ('mintemp', 'f8'), ('rainfall', 'f8'), ('sun', 'S8')])
a = np.loadtxt(fname, dtype=dtype1, skiprows=7, usecols=(0,1,2,3,5,6))

'f8' and 'S8' mean an 8-byte (64-bit) float and an 8-byte string, respectively.

Using argsort 

The NumPy function argsort sorts a given axis of an array in ascending order and returns the indices that would sort it. I used argsort on the maximum temperature column, which returned an array of indices corresponding to the sorted maximum temperatures. Then I used those indices to pick out the corresponding years and used slicing to get the last 10 entries, which are the 10 highest maximum temperatures in the data set.

The following is the code I used:

out_arr_max_temp = np.argsort(a['maxtemp'])

print(a['year'][out_arr_max_temp][-10:])

#prints the years of the 10 highest maximum temperatures, hottest last

print(a['month'][out_arr_max_temp][-10:])

#prints the months of the 10 highest maximum temperatures, hottest last


So, to answer number 1:

The first hottest date is 7.0, 2006.0, with a temperature of 28.2 degrees celsius.
The second hottest date is 7.0, 1983.0, with a temperature of 27.6 degrees celsius.
The third hottest date is 7.0, 2013.0, with a temperature of 27.0 degrees celsius.
The fourth hottest date is 8.0, 1995.0, with a temperature of 27.0 degrees celsius.
The fifth hottest date is 7.0, 1976.0, with a temperature of 26.6 degrees celsius.
The sixth hottest date is 8.0, 2003.0, with a temperature of 26.4 degrees celsius.
The seventh hottest date is 7.0, 1995.0, with a temperature of 26.3 degrees celsius.
The eighth hottest date is 7.0, 1994.0, with a temperature of 26.2 degrees celsius.
The ninth hottest date is 8.0, 1990.0, with a temperature of 26.0 degrees celsius.

The tenth hottest date is 8.0, 1975.0, with a temperature of 25.9 degrees celsius.

The first coldest date is 1.0, 1963.0, with a temperature of -4.6 degrees celsius.
The second coldest date is 2.0, 1956.0, with a temperature of -3.6 degrees celsius.
The third coldest date is 2.0, 1986.0, with a temperature of -2.7 degrees celsius.
The fourth coldest date is 1.0, 1979.0, with a temperature of -2.6 degrees celsius.
The fifth coldest date is 2.0, 1963.0, with a temperature of -2.2 degrees celsius.
The sixth coldest date is 1.0, 1985.0, with a temperature of -1.8 degrees celsius.
The seventh coldest date is 12.0, 1981.0, with a temperature of -1.5 degrees celsius.
The eighth coldest date is 12.0, 2010.0, with a temperature of -1.5 degrees celsius.
The ninth coldest date is 2.0, 1991.0, with a temperature of -1.3 degrees celsius.
The tenth coldest date is 12.0, 1962.0, with a temperature of -1.1 degrees celsius.


The formatting of the months and years is in float form because I made the datatype a float instead of a string...

While loop

The next step is to find the total rainfall in each year and then the wettest year in the data set. How can I sum up parts of a column and compare the sums? I used a while loop to loop through each year and append all the rainfall values for that year to a list. This formed a list of arrays: each element in my list is an array of the rainfall values for a particular year, ordered from the earliest to the latest year.
My while loop looks like this:

list_of_rainfalls = []
n= 1948
while n < 2017:
    m = n == a['year']
    list_of_rainfalls.append((a['rainfall'][m]))
    n += 1

How can I sum up my individual arrays? I decided to convert my list of arrays into a list of lists by using a list comprehension and the tolist() method. My list comprehension is as follows:

list_of_lists = [individual_arrays.tolist() for individual_arrays in list_of_rainfalls]


After I got my list of lists, I needed to find the individual sum of each list. I used another while loop. This time, I had an empty list called individual_sum to append my individual sums to. I let n start from -1, and while n < 68 (because there are 68 years between the earliest and latest years, 1948 and 2016), I would increment n by 1 and then sum the elements of the list at the nth index of my list of lists. Then I would append that sum to individual_sum. At the end of the loop, individual_sum is a list of the total rainfall for each year.

I can then find the index of the maximum sum and multiply it by 12 to get the original row index of the start of the wettest year. I multiply by 12 because each sum corresponds to the (at most) 12 monthly rainfall records of one year. Then, knowing the original index, I can find the year, which turns out to be 2014.
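
A sketch of those two steps, building on list_of_lists and the array a from the code above (and assuming the records really do start in January 1948 with 12 rows per year):

individual_sum = []
n = -1
while n < 68:
    n += 1
    individual_sum.append(sum(list_of_lists[n]))   # total rainfall for one year

wettest_index = individual_sum.index(max(individual_sum))
start_row = wettest_index * 12        # first monthly record of the wettest year
print(a['year'][start_row])           # prints the wettest year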

Therefore, 2014 is the wettest year.

List comprehension and Dictionaries

The last thing I had to do was to find the least sunny June, meaning the June with the fewest hours of sun.

Looking at the column for hours of sun, I noticed that there are some missing values. How can I deal with those missing values? I used list comprehension to replace those values with something else.

But before that, I decided to read the hours of sun column as strings. Then, using the following list comprehension, I replaced the --- with 999.


k = [i if i!=b'---' else 999 for i in a['sun']]


Then, I used another list comprehension to convert all my strings to floats:


j = [float(i) for i in k]


Now, I have a list of hours of sun. I can create a dictionary to map the hours of sun to their respective months.


mydictionary = dict(zip(j, a['month']))


Here, the keys are the hours of sun and the values are the months.

Another list comprehension coming up!


june_month_sun = [sunlight for sunlight, month in mydictionary.items() if month == 6.0]


The above list comprehension spits out all the values of sun hours if the month is June (or 6.0).

Given a list of sun hours in June, I can find the minimum hour, find its original index and then using that index find the year.
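
A sketch of that last lookup, using the list j and the array a from the earlier steps (index finds the first occurrence of the value, which is assumed to be unique here):

least_sun = min(june_month_sun)         # fewest hours of sun in any June
original_index = j.index(least_sun)     # position of that value in the full sun column
print(a['year'][original_index])        # prints the year of the least sunny June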

Finally, we find out that 2016 had the least sunny June.


Final Thoughts

Another great problem to work on because it honed my numPy knowledge. I also learned more about slicing and indexing. There was a lot of list comprehension and I just loved the logic behind the code. This is my first time using numPy to read and extensively analyze a data set. I think I like pandas more, but numPy is actually very useful too.

Github Code - Heathrow

Sunday, June 16, 2019

Airport Distances

From Python Programmer's (Giles') YouTube video titled "Can you LEARN DATA SCIENCE for FREE? YES! I'll show you HOW!", I discovered an online book about Python by Christian Hill.
In Chapter 6, there was a coding problem on the topic of airport distances. The problem gives a data set of airports, their location, their latitudes and their longitudes. I had to write code to find the distance between two airports based on their latitudes and longitudes.

The Importance of Reading the Question
Here is the problem, word for word:


The file busiest_airports.txt provides details of the 30 busiest airports in the world in 2014. The tab-delimited fields are: three-letter IATA code, airport name, airport location, latitude and longitude (both in degrees).
Write a program to determine the distance between two airports identified by their three-letter IATA code, using the Haversine formula (see, for example, Exercise 4.4.2) and assuming a spherical Earth of radius 6378.1 km).

I spent a lot of time parsing through the data set because I did not read the second sentence of the question carefully. It says that the fields in the data set are "tab-delimited". Now I understand that this means the columns are separated by tabs. At first, I thought the columns were separated by commas, so I had a hard time even reading the file with pandas.

Importing modules
The first step is to import pandas and numPy. Pandas will be used to read the data set. I usually import numPy together with pandas, but I am not very familiar with the module numPy yet. Later on, I also imported some math functions from the math module, like sqrt, sin, asin, and cos. That is because I will be using some form of the Haversine formula to calculate distances given latitude and longitude of airports.
Derived from the Haversine formula, the distance between two points given their latitudes and longitudes is

distance = 2r * arcsin( sqrt( sin²((φ2 − φ1)/2) + cos(φ1) * cos(φ2) * sin²((λ2 − λ1)/2) ) )

where
  • φ1, φ2: latitude of point 1 and latitude of point 2 (in radians),
  • λ1, λ2: longitude of point 1 and longitude of point 2 (in radians), (Wikipedia)
and r is the radius of the (spherical) Earth, 6378.1 km here.
Using iloc to choose certain columns
The next step was to convert the latitude and longitude degrees to radians, since the Haversine formula works with radian measurements. I knew that to convert degrees to radians, we multiply the degree measurement by pi and divide by 180. I estimated pi to be 3.14.
The harder part is to know how to isolate the column that has a series of latitudes and then the column that has a series of longitudes to apply the mathematical formula. 
I used iloc to choose certain columns from my data set. iloc works by index. The code I used to convert the latitude degrees to radians is as follows : 

data['Latitude in Radians'] = ((data.iloc[:, 3:4]*3.14)/180)

This line of code creates a new column named "Latitude in Radians", in which each element of the 4th column of the dataframe (the latitude in degrees) is multiplied by 3.14 and divided by 180. iloc[:, 3:4] picks out all the rows ( : ) of the slice starting at the 4th column and ending at the beginning of the 5th column (3:4). Since iloc uses indexes and we count starting from index 0, the 4th column is denoted by 3.
Similar code was used to convert the longitudes in degrees to radians.

Dictionary to map airport to their respective latitudes and longitudes
Given two three-letter IATA codes that represent two airports, I had to pull out their respective latitudes and longitudes and apply a formula to find the distance between the two airports.
This sounds like a dictionary is required. A dictionary can easily map two elements together. I wanted to create a dictionary that mapped the IATA code to the airport's latitude and another dictionary that mapped the IATA code to the airport's longitude.
I thought that I could create a list of IATA codes, a list of latitudes and a list of longitudes and create two dictionaries that way. 
But first, I needed to create those lists.

List of lists to a single list
To create a list of IATA codes, a list of latitudes, and a list of longitudes from three series objects, I had to use the values attribute and the tolist() method. The code I used to create the list of individual IATA codes is as follows:

dftolistIATA = data.iloc[:, 0:1].values.tolist()

The IATA code is the 1st column; the values attribute takes all the values in the series, and tolist() converts those values into a list. In fact, tolist() created a LIST OF LISTS, so each individual IATA code ended up inside its own one-element list. Lists are denoted with brackets, so a list of letters would be something like [a, b, c, d, e], but a list of lists has brackets inside brackets, like so: [[a], [b], [c], [d], [e]].
I really just need one list, so I needed a flattened list. How was I supposed to do that? I used stackoverflow to find the code to create a flattened list. It is as follows (following the code from above): 

flattened_list = []
flattened_list = [y for x in dftolistIATA for y in x]

This code takes the elements of the elements of dftolistIATA (the list of lists), which are the individual IATA codes, and gathers them into a single list. I still have to read more on flattening lists to really understand how this code works, though.
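
For reference, the comprehension is equivalent to this explicit double loop, which may make it easier to see what is going on:

flattened_list = []
for x in dftolistIATA:       # x is one inner list, e.g. ['JFK']
    for y in x:              # y is the single element inside it
        flattened_list.append(y)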

So, after I created my flattened lists, I was able to create my dictionaries using the following code:

mydictionary= dict(zip(flattened_list,  flattened_list2))
mydictionary2= dict(zip(flattened_list, flattened_list3))

Essentially, the first line of code created a dictionary called mydictionary which mapped the flattened_list (list of IATA codes) to flattened_list2 (list of latitudes). 

Writing code for a form of the Haversine Formula
The last step was to create a function where I input two IATA codes and it spits out the distance in kilometers between the two airports. This was not that difficult, as I had the form of the Haversine Formula that explicitly told me the distance between two points given their latitudes and longitudes. I just had to write it out in code.

The skeleton for a function in Python is as follows:
def function_name(arguments):
    # the body of the function goes here
    return result
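
A minimal sketch of such a function, assuming mydictionary and mydictionary2 (built above) map IATA codes to latitudes and longitudes already converted to radians; this is not necessarily the exact code on Github:

from math import sqrt, sin, cos, asin

def distance(code1, code2):
    lat1, lat2 = mydictionary[code1], mydictionary[code2]
    lon1, lon2 = mydictionary2[code1], mydictionary2[code2]
    r = 6378.1   # spherical Earth radius in km, as given in the problem
    h = sin((lat2 - lat1) / 2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)**2
    return 2 * r * asin(sqrt(h))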

Testing out my function
Lastly, I had to test out my function. My function takes two IATA codes and returns the distance between them, so I printed out distance('JFK', 'AMS'), which returned 5854.025421753566 (km). On Google, the distance is 5850 km. Close enough!

Final Thoughts
I found this problem to be very interesting and worthwhile. I love to travel, so the topic resonated with me. I learned about the Haversine formula and gained a deeper understanding of iloc, parsing through data sets, and dictionaries. I still don't fully understand the code for flattening lists, but that will come with time.
This problem took me a while, probably 5 hours, which could have been cut short if I had read the question carefully and understood more Python. But now that I have more of a taste of what Python and math are capable of, I am excited to keep on learning and experimenting.

Github code - Airport_distances.py