Monday, June 17, 2019

Analyzing meteorological data with NumPy

Another interesting problem I wrote code for was a short analysis of the meteorological data from Heathrow. It also comes from chapter 6 of Christian Hill's book, which is all about NumPy. I know that I used pandas in my previous post about a chapter 6 problem, but this time I used NumPy to complete the analysis.

The problem gives a data set of meteorological data for Heathrow, including the year, month, maximum temperature, minimum temperature, rainfall, and hours of sun.
Using NumPy, I had to find:
1) the 10 hottest and coldest months in the data set
2) the total rainfall in each year and the wettest year in the data set
3) the least sunny June.

Reading the txt file with NumPy

I first stored the name of my txt file in a variable and then specified the datatypes and column names for the columns in the file. I loaded the file by passing that file name, the datatypes, the number of rows to skip, and the columns I wanted to use.

So, the code ended up looking like this:
import numpy as np
fname = 'heathrowdata.txt'
dtype1 = np.dtype([('year', 'f8'), ('month', 'f8'), ('maxtemp', 'f8'), ('mintemp', 'f8'), ('rainfall', 'f8'), ('sun', 'S8')])
a = np.loadtxt(fname, dtype=dtype1, skiprows=7, usecols=(0,1,2,3,5,6))

'f8' means a 64-bit float and 'S8' means a byte string of up to 8 characters.
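To see how the structured dtype and usecols fit together, here is a minimal sketch that reads a few made-up rows from an in-memory string instead of heathrowdata.txt. The sample rows, the single header line, and the contents of the skipped column (index 4) are assumptions for illustration only.

```python
import io
import numpy as np

# Made-up rows in an assumed layout: year, month, tmax, tmin, (skipped column), rain, sun
sample = io.StringIO(
    "yyyy mm tmax tmin af rain sun\n"
    "1948 1 8.9 3.3 --- 85.0 ---\n"
    "1948 2 7.9 2.2 --- 26.0 ---\n"
    "1957 6 23.1 11.4 0 22.4 296.4\n"
)

dtype1 = np.dtype([('year', 'f8'), ('month', 'f8'), ('maxtemp', 'f8'),
                   ('mintemp', 'f8'), ('rainfall', 'f8'), ('sun', 'S8')])

# skiprows=1 drops the header line; usecols leaves out column 4
a = np.loadtxt(sample, dtype=dtype1, skiprows=1, usecols=(0, 1, 2, 3, 5, 6))

print(a['year'])   # one named field, returned as a float array
print(a['sun'])    # the 'S8' field holds byte strings such as b'---'
```

Each field of the resulting structured array can be pulled out by name, which is what the rest of the analysis relies on.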

Using argsort 

The NumPy function argsort returns the indices that would sort an array along a given axis in ascending order; it does not sort the array itself. I applied argsort to the maximum temperature column, which gave me an array of indices ordered from the lowest to the highest maximum temperature. I then used those indices to reorder the year and month columns and sliced off the last 10 entries, which correspond to the ten highest temperatures in the data set.

The following is the code I used:

out_arr_max_temp = np.argsort(a['maxtemp'])

# years of the ten highest maximum temperatures, in ascending order of temperature
print(a['year'][out_arr_max_temp][-10:])

# months of the ten highest maximum temperatures, in the same order
print(a['month'][out_arr_max_temp][-10:])
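As a small illustration of how argsort and slicing combine, here is a sketch on made-up values (the arrays and variable names are hypothetical, not taken from the Heathrow file). It also shows that the first indices give the coldest entries, which is one way the coldest months could be found.

```python
import numpy as np

# Hypothetical monthly maximum temperatures and the years they belong to
maxtemp = np.array([21.5, 28.2, 19.0, 27.6, 25.9])
year = np.array([1990.0, 2006.0, 1962.0, 1983.0, 1975.0])

order = np.argsort(maxtemp)        # indices that would sort maxtemp ascending
hottest_first = order[-3:][::-1]   # last three indices, reversed: hottest first
coldest_first = order[:3]          # first three indices: coldest first

print(year[hottest_first], maxtemp[hottest_first])
print(year[coldest_first], maxtemp[coldest_first])
```

Reversing the slice with [::-1] is optional; it only changes the order in which the top entries are listed.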


So, to answer number 1:

The hottest date is 7.0, 2006.0, with a temperature of 28.2 degrees Celsius.
The second hottest date is 7.0, 1983.0, with a temperature of 27.6 degrees Celsius.
The third hottest date is 7.0, 2013.0, with a temperature of 27.0 degrees Celsius.
The fourth hottest date is 8.0, 1995.0, with a temperature of 27.0 degrees Celsius.
The fifth hottest date is 7.0, 1976.0, with a temperature of 26.6 degrees Celsius.
The sixth hottest date is 8.0, 2003.0, with a temperature of 26.4 degrees Celsius.
The seventh hottest date is 7.0, 1995.0, with a temperature of 26.3 degrees Celsius.
The eighth hottest date is 7.0, 1994.0, with a temperature of 26.2 degrees Celsius.
The ninth hottest date is 8.0, 1990.0, with a temperature of 26.0 degrees Celsius.
The tenth hottest date is 8.0, 1975.0, with a temperature of 25.9 degrees Celsius.

The coldest date is 1.0, 1963.0, with a temperature of -4.6 degrees Celsius.
The second coldest date is 2.0, 1956.0, with a temperature of -3.6 degrees Celsius.
The third coldest date is 2.0, 1986.0, with a temperature of -2.7 degrees Celsius.
The fourth coldest date is 1.0, 1979.0, with a temperature of -2.6 degrees Celsius.
The fifth coldest date is 2.0, 1963.0, with a temperature of -2.2 degrees Celsius.
The sixth coldest date is 1.0, 1985.0, with a temperature of -1.8 degrees Celsius.
The seventh coldest date is 12.0, 1981.0, with a temperature of -1.5 degrees Celsius.
The eighth coldest date is 12.0, 2010.0, with a temperature of -1.5 degrees Celsius.
The ninth coldest date is 2.0, 1991.0, with a temperature of -1.3 degrees Celsius.
The tenth coldest date is 12.0, 1962.0, with a temperature of -1.1 degrees Celsius.


The months and years appear as floats because I declared those columns as 'f8' (float) in the dtype.
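If whole-number output is preferred, one option is to cast the values when printing. A small sketch with made-up values (the variable names are mine, not from the original code):

```python
import numpy as np

# Hypothetical float-typed year and month values, as produced by an 'f8' field
year = np.array([2006.0, 1983.0])
month = np.array([7.0, 7.0])

# astype(int) converts the floats to integers for display
labels = [f"{m}/{y}" for m, y in zip(month.astype(int), year.astype(int))]
print(labels)
```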

While loop

The next step was to find the total rainfall in each year and then the wettest year in the data set. How can I sum up parts of a column and compare the sums? I used a while loop to go through each year and append that year's rainfall values to a list. This produced a list of arrays, ordered from the earliest to the latest year, in which each element is an array holding the rainfall values for one year.
My while loop looks like this:

list_of_rainfalls = []
n = 1948
while n < 2017:
    m = n == a['year']    # boolean mask: True for the records of year n
    list_of_rainfalls.append(a['rainfall'][m])
    n += 1

How can I sum up my individual arrays? I decided to convert my list of arrays into a list of lists by using a list comprehension and the tolist() method. My list comprehension is as follows:

list_of_lists = [individual_arrays.tolist() for individual_arrays in list_of_rainfalls]


After I got my list of lists, I needed the sum of each list in it. I used another while loop, this time with an empty list called individual_sum to append the sums to. I let n start from -1, and while n < 68 (the data runs from 1948 to 2016, so the last index is 68), I would increment n by 1 and then sum the elements of the list at index n of my list of lists. Then I would append that sum to individual_sum. At the end of the loop, individual_sum is a list of the total rainfall for each year.
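The post does not show this second loop, so the following is only a reconstruction from the description, run on a short made-up list_of_lists (three years instead of 69, so the bound is 2 rather than 68). The name last_index is introduced here for illustration.

```python
# Made-up rainfall lists standing in for three years of data
list_of_lists = [[85.0, 26.0, 14.0], [23.0, 27.0, 58.0], [10.0, 95.5, 40.0]]

individual_sum = []
n = -1
last_index = len(list_of_lists) - 1   # would be 68 for the 1948-2016 data
while n < last_index:
    n += 1                                        # move to the next year first
    individual_sum.append(sum(list_of_lists[n]))  # then total that year's rainfall

print(individual_sum)
```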

I can then find the index of the maximum sum and multiply it by 12 to get the original index for the start of the wettest year. I multiply by 12 because each sum stands for the 12 monthly rainfall records of one year; this only lines up if every earlier year has a full 12 records. Since the sums are ordered from 1948 onwards, the year is also simply 1948 plus the index. Either way, the year turns out to be 2014.

Therefore, 2014 is the wettest year.
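For comparison, the yearly totals can also be computed without loops or the multiply-by-12 step. This is a different technique from the one used in the post, sketched on made-up records with hypothetical array names; it groups rainfall by year with np.unique and np.bincount.

```python
import numpy as np

# Made-up records: one entry per month, years may have different record counts
year = np.array([1948.0, 1948.0, 1949.0, 1949.0, 1949.0, 1950.0])
rainfall = np.array([85.0, 26.0, 23.0, 27.0, 58.0, 95.5])

# unique years plus, for every record, the position of its year in that list
years, position = np.unique(year, return_inverse=True)

# bincount with weights adds up the rainfall that falls into each position
totals = np.bincount(position, weights=rainfall)

wettest = years[np.argmax(totals)]
print(dict(zip(years.tolist(), totals.tolist())))
print(wettest)
```

Because the year is looked up directly from the list of unique years, the result does not depend on every year having exactly 12 records.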

List comprehension and Dictionaries

The last thing I had to do was to find the least sunny June, that is, the June with the fewest hours of sun.

Looking at the column for hours of sun, I noticed that there are some missing values. How can I deal with those missing values? I used a list comprehension to replace those values with something else.

But before that, I decided to read the hours of sun column as strings. Then, using the following list comprehension, I replaced the --- with 999.


k = [i if i!=b'---' else 999 for i in a['sun']]


Then, I used another list comprehension to convert all my strings to floats:


j = [float(i) for i in k]


Now, I have a list of hours of sun. I can create a dictionary to map the hours of sun to their respective months.


mydictionary = dict(zip(j, a['month']))


Here, the keys are the hours of sun and the values are the months.
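One thing to watch with this mapping: dictionary keys are unique, so two records with identical hours of sun collapse into a single entry, and the later one wins. The tiny sketch below uses made-up values and hypothetical names to show the effect.

```python
# Hypothetical sun hours and months; 210.0 appears twice
sun_hours = [210.0, 296.4, 210.0]
months = [6.0, 6.0, 7.0]

mapping = dict(zip(sun_hours, months))
print(mapping)   # the June entry for 210.0 has been overwritten by the July one
```

The same applies to the 999 placeholder, which every missing reading shares.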

Another list comprehension coming up!


june_month_sun = [sunlight for sunlight, month in mydictionary.items() if month == 6.0]


The above list comprehension spits out all the values of sun hours if the month is June (or 6.0).

Given a list of sun hours in June, I can find the minimum value, find its original index, and then use that index to find the year.

Finally, we find out that 2016 had the least sunny June.
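An alternative that stays within NumPy is to select the June records with a boolean mask and mark missing readings as NaN. This is a different technique from the dictionary approach described above, shown on made-up values; the array names are hypothetical.

```python
import numpy as np

# Made-up records; b'---' marks a missing sun reading, as in the post
year = np.array([1956.0, 1957.0, 1957.0, 1958.0, 1959.0])
month = np.array([6.0, 5.0, 6.0, 6.0, 6.0])
sun_raw = np.array([b'---', b'210.0', b'296.4', b'150.2', b'210.0'], dtype='S8')

# Convert to floats, using NaN for missing values instead of a sentinel like 999
sun = np.array([np.nan if s == b'---' else float(s) for s in sun_raw])

june = month == 6.0              # boolean mask selecting the June records
idx = np.nanargmin(sun[june])    # position of the smallest non-NaN value
print(year[june][idx], sun[june][idx])
```

Applying the same mask to the year column keeps the years aligned with the sun hours, so no separate index lookup is needed.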


Final Thoughts

It was another great problem to work on because it honed my NumPy knowledge. I also learned more about slicing and indexing. There were a lot of list comprehensions, and I just loved the logic behind the code. This was my first time using NumPy to read and extensively analyze a data set. I think I like pandas more, but NumPy is actually very useful too.

Github Code - Heathrow
