Reading tsv files with pandas
The 4 data sets were in tsv format, so I used pandas read_csv to read them. Each of the three data sets had a list of countries and either their population, average BMI for men, GDP per capita, or the continents the countries were in. The data sets did not have the same amount of data. Some were missing countries, some were missing values for BMI , GDP, or population.
Making Dictionaries
Since I did not know how to directly create a scatterplot from data sets that did not have an equal amount of rows, I thought I should create a dictionary that had keys as countries that have existing data on average BMI , GDP , and population to values of BMI, GDP, and population.
How can I do that?
I first made 4 dictionaries that had keys as the countries and values as either BMI, GDP, population, or continent. I first had four list of lists I had to flatten to form into four different dictionaries. I wrote a function to create a dictionary from a dataframe object.
This did not get rid of my missing values, but they will be dealt with later.
For loop
I then decided to write a for loop where for each country in my GDP dictionary, if that country is in the set of countries in my BMI dictionary (ignoring the missing countries here), then I would append a tuple that consisted of the average BMI in that country and the GDP of that country to an empty list. I later added in more elements into my tuple where the population of that country and the continent of that country was returned. (The population element was divided by 8000000 because the size of the bubbles were too large, so I had to scale the population down.)
Getting the above paragraph into code form required a lot of logical reasoning and some knowledge of how dictionaries worked!
The result is a list of tuples.
List of tuples to scatterplot
The next step was to create a scatterplot from my list of tuples. I looked it up and there is a quick way to do that. My list of tuples was named list_of_bmi_gdp and I applied
x,y,z,q= zip(*list_of_bmi_gdp)
zip combined with * will unzip a list, and the unzipped lists are assigned to x, y, z, and q.
Then, I plotted my scatter plot with the following code:
plt.scatter(x,y,s=z, c=q)
x represents my BMI (shown on my x-axis), y represents my GDP (shown on my y-axis), s = z means that my third dimension, size, will be represented by my list of population z, and c =q means that my fourth dimension, color, will be represented by my list of colors q.
Note that I had to first convert my continents into colors before I could assign c to q. That was done with a for loop like so:
for country, continent in color_dictionary.items():
if continent == 'Europe':
color_dictionary[country] = 'red'
The last step is to add a legend to my scatterplot to show which colors represented what continent. I used patches to add a custom legend. I imported matplotlib.patches and mpatches.
To make a red patch, I used this code:
red_patch = mpatches.Patch(color='red', label='Europe')
To add that to my legend, I used this code:
plt.legend(handles=[red_patch, orange_patch, yellow_patch, green_patch, blue_patch, purple_patch])
About the Scatterplot
From the scatterplot, we see that Asia has several countries that have a huge population, denoted by the large orange bubbles relative to the other bubbles. However, Africa and Asia has low GDP and low BMI. There seems to be a trend where countries with higher GDP has higher BMI on average.
But, you can still see some Asian countries having relatively high GDP, but low BMI (the orange dots that are in the middle of the graph). North America seems to have the highest BMI, but also relatively high GDP. Europe is clustered at a relatively high BMI and high GDP.
Final Thoughts
I get really excited when I am asked to create multi-dimensional scatter plots because I know the results will be very nice looking. Just look at the scatter plot! It tells so much and there are so many layers too it, but not too much to be overwhelming. I had to explain to my friend what was going on, but once she got the hang of reading it, there are correlations that come up from the plot that are worthwhile to note.
I learned about patches and adding color to scatterplots in this exercise. I also learned more about the zip function and how to pull things from dictionaries.
Overall, another worthwhile problem!
Here is the code I used on Github- BMI.py
No comments:
Post a Comment