Notes from Karen: scatterplot

Thursday, June 20, 2019

BMI variation with GDP

How does per capita GDP compare to the BMI of men across the world? That is a question in Chapter 7 of Christian Hill's book. I was asked to create a scatter plot (bonus if it was colored) to compare BMI and GDP. The size of each bubble is relative to the population of the country, and the countries are color-coded by continent.

Reading tsv files with pandas

The four data sets were in tsv format, so I used pandas read_csv to read them. Each of the four data sets had a list of countries and one of the following: population, average BMI for men, GDP per capita, or the continent the country is in. The data sets did not have the same number of rows. Some were missing countries, and some were missing values for BMI, GDP, or population.
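As a rough sketch of the reading step (not the actual code from BMI.py), read_csv can parse tab-separated data when given sep="\t". The column names and the numbers below are made up for illustration, and an in-memory string stands in for a real file path.

```python
import io

import pandas as pd

# Made-up sample standing in for one of the tsv files (hypothetical values).
# Brazil has an empty BMI cell to mimic a missing value.
SAMPLE_TSV = "Country\tBMI\nAlbania\t25.7\nBrazil\t\nCanada\t27.2\n"

# sep="\t" tells read_csv that columns are separated by tabs, not commas.
bmi_df = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")

# Empty cells are read as NaN, which is how missing values show up.
print(bmi_df)
print("missing BMI values:", int(bmi_df["BMI"].isna().sum()))
```

With a real file, the io.StringIO(...) argument would simply be replaced by the path to the tsv file.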

Making Dictionaries

Since I did not know how to create a scatterplot directly from data sets that did not have an equal number of rows, I thought I should create a dictionary whose keys are the countries that have existing data on average BMI, GDP, and population, and whose values are the BMI, GDP, and population.

How can I do that?

I first made four dictionaries that had countries as keys and either BMI, GDP, population, or continent as values. I started with four lists of lists that I had to flatten and turn into four different dictionaries. I wrote a function to create a dictionary from a dataframe object.

This did not get rid of my missing values, but they will be dealt with later.
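A function along these lines could do the conversion. This is a minimal sketch, not the function from BMI.py: the name rows_to_dict and the sample rows are hypothetical, and it assumes each row is a [country, value] pair, which is the shape df.values.tolist() gives for a two-column dataframe.

```python
def rows_to_dict(rows):
    """Turn [[country, value], ...] into {country: value}."""
    return {country: value for country, value in rows}


# Hypothetical rows, shaped like what df.values.tolist() returns
# for a two-column dataframe. None marks a missing value.
gdp_rows = [["Albania", 4000.0], ["Brazil", 11000.0], ["Canada", None]]

gdp_by_country = rows_to_dict(gdp_rows)
print(gdp_by_country["Brazil"])   # 11000.0
print(gdp_by_country["Canada"])   # None, the missing value is still there
```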

For loop

I then decided to write a for loop: for each country in my GDP dictionary, if that country is also in my BMI dictionary (ignoring the missing countries here), I would append a tuple consisting of the average BMI and the GDP of that country to an initially empty list. I later added more elements to the tuple: the population and the continent of that country. (The population was divided by 8000000 because the bubbles were too large, so I had to scale the population down.)

Getting the above paragraph into code form required a lot of logical reasoning and some knowledge of how dictionaries worked!

The result is a list of tuples.
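The loop described above might look roughly like this. It is a sketch under assumptions: the helper names, the sample dictionaries, and the choice to treat both None and NaN as missing are mine, not taken from BMI.py. Only the tuple order and the 8000000 divisor come from the description.

```python
import math

POPULATION_SCALE = 8000000  # divisor from the post, keeps the bubbles readable


def is_missing(value):
    """True for None or NaN, two ways a gap in the data might show up."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_points(bmi, gdp, population, continent):
    """Collect (BMI, GDP, scaled population, continent) for complete countries."""
    points = []
    for country in gdp:
        # Skip countries that are absent from any of the other dictionaries.
        if country not in bmi or country not in population or country not in continent:
            continue
        values = (bmi[country], gdp[country], population[country])
        # Skip countries that are present but have a missing value.
        if any(is_missing(v) for v in values):
            continue
        points.append((bmi[country], gdp[country],
                       population[country] / POPULATION_SCALE,
                       continent[country]))
    return points


# Hypothetical sample data: B has a missing BMI, D has no BMI entry at all.
bmi = {"A": 25.0, "B": float("nan"), "C": 27.0}
gdp = {"A": 4000.0, "B": 11000.0, "C": 40000.0, "D": 500.0}
population = {"A": 16000000, "B": 8000000, "C": 32000000, "D": 1000}
continent = {"A": "Europe", "B": "Asia", "C": "Europe", "D": "Africa"}

list_of_bmi_gdp = build_points(bmi, gdp, population, continent)
print(list_of_bmi_gdp)
```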

List of tuples to scatterplot

The next step was to create a scatterplot from my list of tuples. I looked it up and there is a quick way to do that. My list of tuples was named list_of_bmi_gdp and I applied

x, y, z, q = zip(*list_of_bmi_gdp)

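zip combined with * unzips a list of tuples: it groups the first elements of every tuple together, then the second elements, and so on. The four resulting tuples are assigned to x, y, z, and q.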
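Here is a tiny example of the unzipping trick on its own, with made-up numbers. The * spreads the list into separate arguments, so zip sees each tuple as its own argument and pairs up the elements position by position.

```python
# Two made-up points: (BMI, GDP, scaled population, color)
points = [(25.0, 4000.0, 2.0, "red"), (27.0, 40000.0, 4.0, "orange")]

# zip(*points) is the same as zip(points[0], points[1])
x, y, z, q = zip(*points)

print(x)  # (25.0, 27.0)
print(q)  # ('red', 'orange')

# Zipping the pieces back together recovers the original list.
print(list(zip(x, y, z, q)) == points)  # True
```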

Then, I plotted my scatter plot with the following code:

plt.scatter(x, y, s=z, c=q)

x represents my BMI (shown on my x-axis), y represents my GDP (shown on my y-axis), s=z means that my third dimension, size, is taken from my list of scaled populations z, and c=q means that my fourth dimension, color, is taken from my list of colors q.
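For anyone who wants to try the call in isolation, here is a small self-contained sketch with made-up values. The Agg backend line, the axis labels, and the alpha setting are my additions for illustration, not part of the original plot.

```python
import matplotlib

matplotlib.use("Agg")  # draw without a display; drop this line for interactive use
import matplotlib.pyplot as plt

# Made-up values: one entry per country in each list.
x = [25.0, 27.0, 22.5]           # average BMI
y = [4000.0, 40000.0, 1500.0]    # GDP per capita
z = [20.0, 40.0, 160.0]          # scaled population, used as marker size
q = ["red", "red", "orange"]     # one color name per country

fig, ax = plt.subplots()
scatter = ax.scatter(x, y, s=z, c=q, alpha=0.6)  # alpha lets overlapping bubbles show
ax.set_xlabel("Average BMI (men)")
ax.set_ylabel("GDP per capita")
```

Calling plt.show() or fig.savefig(...) afterwards displays or saves the figure.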

Note that I first had to convert my continents into colors before I could pass q as c. That was done with a for loop like so:

for country, continent in color_dictionary.items():
    if continent == 'Europe':
        color_dictionary[country] = 'red'
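An alternative to one if branch per continent is a lookup dictionary. This is just a sketch of that idea, not what BMI.py does. Only Europe as red and Asia as orange come from this post; the other pairings, the continent names, and the gray fallback are assumptions for illustration.

```python
# Only Europe -> red and Asia -> orange come from the post; the remaining
# pairings and continent names are assumptions for illustration.
CONTINENT_COLORS = {
    "Europe": "red",
    "Asia": "orange",
    "Africa": "yellow",
    "North America": "green",
    "South America": "blue",
    "Oceania": "purple",
}


def continents_to_colors(continent_by_country, default="gray"):
    """Return a new {country: color} dict; unknown continents get the default."""
    return {
        country: CONTINENT_COLORS.get(continent, default)
        for country, continent in continent_by_country.items()
    }


# Hypothetical input with made-up country names.
color_dictionary = continents_to_colors(
    {"A": "Europe", "B": "Asia", "C": "Atlantis"}
)
print(color_dictionary)  # {'A': 'red', 'B': 'orange', 'C': 'gray'}
```

Building a new dictionary also leaves the original continent data untouched, which helps if the continent names are needed later for the legend.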

matplotlib.patches

The last step was to add a legend to my scatterplot to show which color represented which continent. I used patches to add a custom legend. I imported matplotlib.patches as mpatches.

To make a red patch, I used this code:

red_patch = mpatches.Patch(color='red', label='Europe')

To add that to my legend, I used this code:

plt.legend(handles=[red_patch, orange_patch, yellow_patch, green_patch, blue_patch, purple_patch])
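Put together, a custom legend built from patches can be sketched like this. The data points are made up, only two continents are shown to keep it short, and the Agg backend line and legend title are my additions.

```python
import matplotlib

matplotlib.use("Agg")  # draw without a display; drop this line for interactive use
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt

# Continent-to-color pairs mentioned in the post.
legend_colors = {"Europe": "red", "Asia": "orange"}

# One colored patch per continent; the label is the text shown in the legend.
handles = [
    mpatches.Patch(color=color, label=name)
    for name, color in legend_colors.items()
]

fig, ax = plt.subplots()
# Made-up points so the legend has something to describe.
ax.scatter([25.0, 22.5], [40000.0, 1500.0], s=[40, 160], c=["red", "orange"])
legend = ax.legend(handles=handles, title="Continent")
```

Passing handles explicitly is what makes this work: the scatter call draws every bubble as part of one collection, so matplotlib has no per-continent labels to build a legend from on its own.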

About the Scatterplot

From the scatterplot, we see that Asia has several countries with a huge population, denoted by the orange bubbles that are large relative to the others. However, Africa and Asia have low GDP and low BMI. There seems to be a trend where countries with higher GDP have higher BMI on average.
Still, you can see some Asian countries with relatively high GDP but low BMI (the orange dots in the middle of the graph). North America seems to have the highest BMI, along with relatively high GDP. Europe is clustered at a relatively high BMI and high GDP.

Final Thoughts

I get really excited when I am asked to create multi-dimensional scatter plots because I know the results will be very nice looking. Just look at the scatter plot! It tells so much and there are so many layers to it, but not so many as to be overwhelming. I had to explain to my friend what was going on, but once she got the hang of reading it, she could see correlations in the plot that are worth noting.

I learned about patches and adding color to scatterplots in this exercise. I also learned more about the zip function and how to pull things from dictionaries.

Overall, another worthwhile problem!

Here is the code I used, on GitHub: BMI.py

Wednesday, March 20, 2019

Education vs. Earnings

I was curious about what Ivy League students do once they graduate. I can read anecdotes about what happened to students after they graduated from top universities, but I wanted to look at some data. I did find some more general data on education, specifically the National Household Education Survey. I don't think the data that I am about to show is particularly new, but creating the visual still required some thinking. A disclaimer: I didn't really write the code. I googled to find out how I could make my ideas come to life. I did tweak it, like adding the labels for the x and y axes. Throughout the process, I came up with some insights about how I wanted my data to look to others, and that guided me to googling the right questions. I think that is important because there is so much information out there that I could spend the whole day going through tutorials and never find what I need to create exactly what I want.

So, here is the finished product, created with matplotlib in Python.

It's a scatter plot that shows 3000 data points from a 2016 survey.

The actual data files are a lot bigger, but I chose just two columns to compare: the highest level of education obtained and the person's earnings in the past 12 months. As you go from left to right on the x-axis, the level of education obtained gets higher. As you go from bottom to top on the y-axis, the person's earnings in the past 12 months get higher.

There were also people who skipped this question on the survey, which is why there is a Skip row.

The size of each dot shows how often that combination of education level and earnings occurred in the survey, relative to the others.

Actually, the whole process of choosing to display the data as a scatterplot was not straightforward. My first instinct was to represent the data as a scatterplot, but the first one I made had dots of the same size and opacity covering basically the whole graph, so it didn't show anything significant. So I thought, perhaps I can use a bar graph? I could count how many people were in each category and then make a bar graph. It turned out there were too many categories. Perhaps I could have grouped the categories into intervals, but then I had an insight. The reason I could not find anything significant in the first scatterplot was that there was no way to tell how many people were represented by each dot. The magnitude was a dimension I wanted to show, so if I could show by color or by size the number of people who were in a certain category, then the scatterplot would make sense!

It turns out others had the same question that I did: how can I change the size or color of the dots in a scatterplot to represent the relative amount of data each one stands for? The answer turned out to be a simple line of code that I still don't really understand, but it does the job.
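I can't reproduce that exact line here, but one common way to get this effect is to count how many times each (education, earnings) combination appears and use the counts as marker sizes. The sketch below is my own illustration of that idea with made-up survey answers; the variable names and the SIZE_PER_PERSON factor are assumptions.

```python
from collections import Counter

# Made-up survey answers: (education level, earnings bracket) per person.
responses = [
    ("High school", "20-30k"),
    ("High school", "20-30k"),
    ("BA", "75-100k"),
    ("High school", "20-30k"),
    ("BA", "75-100k"),
    ("Masters", "75-100k"),
]

SIZE_PER_PERSON = 30  # hypothetical scale factor so small counts stay visible

# Count how often each combination occurs.
counts = Counter(responses)

# One dot per distinct combination, sized by how many people it represents.
pairs = list(counts)
x = [education for education, _ in pairs]
y = [earnings for _, earnings in pairs]
sizes = [counts[pair] * SIZE_PER_PERSON for pair in pairs]

print(sizes)  # [90, 60, 30]
# x, y, and sizes could then be passed to plt.scatter(x, y, s=sizes).
```

Each distinct combination is drawn once, so the plot no longer hides how many people sit behind a single dot.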

So, a little analysis on the scatterplot.

It seems that there are people with little education who make a fortune (to me at least) and others who make nothing, and people with a lot of education who make a fortune and others who make nothing.

Either there are just more people with high school diplomas than without, or the plot shows that getting a high school diploma does increase your chances of getting a higher salary, although there is still a large number of high school graduates who make very little. But without a high school diploma, it seems there is little to no chance of making over $150k a year.

Among high school graduates, more people still earn less than earn more (the dots get smaller from bottom to top), but the trend reverses for those with a college degree (BA). There the dots get bigger from bottom to top, showing that people with a college degree are more likely to earn more money than not. In fact, there are a lot of people with a college degree and nothing higher who earn between 75-100k a year.

There is a similar trend for those with a Master's degree: more likely than not you are earning more money, with smaller dots on the bottom and bigger dots on the top. But the share of those earning in the 75-100k range is a bit smaller. Maybe that is because there are just fewer people with Master's degrees.

If you get more than a Master's Degree, there is little chance that you will be making less than 20k, but it can still happen. There may be too few people who have higher than a Master's Degree to effectively say that getting more than a Master's will boost earning potential. But, if you compare the relative size of the dots in the Doctorate or Professional Degree Column, you will most likely be richer than poorer if you got that higher degree, it seems.

I remember reading articles about how getting more education does not equal higher earning potential, and I can perhaps see why now. But the scatter plot does seem to show that the chance of being poorer diminishes as you get more education.

This concludes my second project. And a quote from George Washington Carver: "Education is the key to unlock the golden door of freedom."

Is freedom money?