Wednesday, March 20, 2019

Education vs. Earnings

I was curious about what Ivy League students do once they graduate. I can read anecdotes about what happened to students after they graduated from top universities, but I wanted to look at some data. I did find some more general data on education, specifically the National Household Education Survey. I don't think the data that I am about to show is particularly new, but creating the visual still required some thinking. A disclaimer: I didn't really write the code. I googled to find out how I could make my ideas come to life, then tweaked what I found, like adding the labels for the x and y axes. Throughout the process, I did come up with some insights as to how I wanted my data to look to others, and that guided me to googling the right questions. I think that is important because there is so much information out there that I could spend the whole day parsing through tutorials and never find what I need to create exactly what I want.

So, here is the finished product, created with matplotlib in Python.

It's a scatter plot that shows 3000 data points from a 2016 survey.

The actual data files are a lot bigger, but again I chose just two columns to compare: the highest level of education obtained and the person's earnings in the past 12 months. As you go from left to right on the x-axis, the level of education obtained gets higher. As you go from bottom to top on the y-axis, the person's earnings in the past 12 months get higher.

There were also people who skipped this question on the survey, which is why there is a Skip row. 

The size of each dot shows the relative occurrence of that combination of education level and earnings in the survey.

Actually, the whole process of choosing to display the data as a scatterplot was not straightforward. My first instinct was to represent the data as a scatterplot, but the first one I made had dots of the same size and opacity covering basically the whole graph, so it didn't show anything significant. So I thought, perhaps I can use a bar graph? I could count how many people were in each category and then plot the counts. It turns out there are too many categories. Perhaps I could have made intervals for the categories, but then I got an insight. The reason I could not find anything significant in the first scatterplot was that there was no way to tell how many people each dot represented. The magnitude was a dimension I wanted to show, so if I could show by color or by size the number of people who were in a certain category, then the scatterplot would make sense!

It turns out others had the same question that I did: how can I change the size or color of the dots in a scatterplot to represent the relative amount of data each one stands for? The answer turned out to be a simple line of code that I still don't really understand, but it does the job.
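For anyone curious what that kind of line can look like, here is a minimal sketch. It is not the code I actually used; the column names, the numeric codes, and the scaling factor of 100 are all made up for illustration. The idea is to count how many rows share each (education, earnings) pair and then pass those counts to the `s` argument of `scatter`, which controls the marker area.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line to open a window with plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows with hypothetical numeric codes (higher code = more education / earnings).
survey = pd.DataFrame({
    "education": [2, 2, 2, 4, 4, 5, 2, 4],
    "earnings":  [1, 1, 2, 5, 5, 5, 1, 2],
})

# Count how many respondents share each (education, earnings) combination.
counts = survey.groupby(["education", "earnings"]).size().reset_index(name="n")

fig, ax = plt.subplots()
# s is the marker area in points squared, so scaling it by the count
# makes common combinations show up as bigger dots.
ax.scatter(counts["education"], counts["earnings"], s=counts["n"] * 100, alpha=0.6)
ax.set_xlabel("Highest level of education")
ax.set_ylabel("Earnings in the past 12 months")
```

Passing the same counts to the `c` argument instead would color the dots by frequency rather than resize them.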

So, a little analysis on the scatterplot.

It seems that there are people with little education who make a fortune (to me at least) and others who make nothing, and people with a lot of education who make a fortune and others who make nothing.

Either there are just more people with high school diplomas than without, or the plot shows that getting a high school diploma does increase your chances of getting a higher salary, although there is still a large number of high school graduates who make very little. But without a high school diploma, it seems there is little to no chance of making over $150k a year.

High school graduates are still more likely to make less money than more (the dots get smaller from bottom to top), but the trend reverses for those with a college degree (BA). There the dots get bigger from bottom to top, showing that people with a college degree are more likely to earn more money than not. In fact, there are a lot of people with a college degree and nothing higher who earn between 75k and 100k a year.

There is a similar trend for those with a Master's degree: more likely than not you are earning more money (smaller dots on the bottom, bigger dots on the top). But the share of those earning in the 75-100k range is a bit smaller. Maybe it's because there are just fewer people with Master's degrees, though.

If you get more than a Master's degree, there is little chance that you will be making less than 20k, but it can still happen. There may be too few people with more than a Master's degree to say with confidence that going beyond a Master's will boost earning potential. But if you compare the relative sizes of the dots in the Doctorate or Professional Degree column, it seems you are more likely to be richer than poorer if you got that higher degree.

I remember reading articles about how getting more education does not equal higher earning potential, and I can perhaps see why now. But the scatter plot does seem to show that the chance of being poorer diminishes as you get more education.

This concludes my second project. And a quote from George Washington Carver: "Education is the key to unlock the golden door of freedom."

Is freedom money? 



Sunday, March 17, 2019

Box Plot Analysis of NYC Total Capture Rates for Recyclables

As a newcomer to Python and data analysis, I decided to create some box plots for my first project. Troy advised me to work on some side projects to learn more about Python and data science. His project, which analyzed the amount of money given to doctors by pharmaceutical companies, inspired me to find a data set I was interested in and create a visual that represented the data.

So I searched for a data set on recycling. The disclaimer for this data set is that the Department of Sanitation of New York has not used capture rates since 2013, so there might be some accuracy issues in the data. Also, when I parsed through the data, there were entries for late 2019, which obviously could not have been recorded yet. Nonetheless, I would say that the data set is great for generating some visuals, like box plots.

The data set is about the capture rates of metal, plastic, glass, and paper. The capture rate of those recyclables is "the percentage of total paper or metal/glass/plastic in the waste stream that is disposed of by recycling". As you can see in the box plots below, the capture rates vary among the boroughs of New York City. Brooklyn and Queens are each split into two zones. (Also, the capture rates are above 100, which is odd for a percentage, and I am not sure what units are actually used. In the data set, the total capture rate is labeled as "Total Recycling - Leaves (Recycling)) / (Max Paper + Max MGP))x100". But it doesn't matter so much for my analysis, since I am only comparing the zones to each other.)

These box plots were created with matplotlib in Python. Before I jump into my general analysis of the box plots, I want to write about how I created them. I am 99% sure that there is an easier way to create the box plots, but here is how I did it:

1. I imported pandas and matplotlib in Python.
2. I wrote some code to group my data into the 6 zones (Bronx, Brooklyn N, Brooklyn S, Manhattan, Queens E, Queens W).
3. I chose to compare only the Zones and the Capture Rates and ignored the other information in the data set.
4. I wrote some code to find the median, minimum, maximum, first and third quartile of the capture rates grouped by the zones.
5. I wrote some code to create the box plots. (This step required a lot of Googling.)
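
The five steps above can be sketched roughly as follows. This is not my original code, just a minimal illustration with made-up numbers: the column names `zone` and `capture_rate` are hypothetical, the Bronx values are simply the approximate readings from my plot reused as five fake data points, and the Manhattan values are invented.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line to open a window with plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Step 1 is the imports above. The data below is made up for illustration.
data = pd.DataFrame({
    "zone": ["Bronx"] * 5 + ["Manhattan"] * 5,
    "capture_rate": [380, 470, 550, 680, 810, 500, 600, 700, 800, 900],
    "month": list(range(1, 6)) * 2,  # an extra column that gets ignored
})

# Steps 2-3: keep only the zone and the capture rate, grouped by zone.
grouped = data[["zone", "capture_rate"]].groupby("zone")["capture_rate"]

# Step 4: minimum, first quartile, median, third quartile, maximum per zone.
summary = grouped.describe()[["min", "25%", "50%", "75%", "max"]]
print(summary)

# Step 5: one box per zone on the same axes.
zones = list(grouped.groups)
fig, ax = plt.subplots()
ax.boxplot([grouped.get_group(z).to_numpy() for z in zones])
ax.set_xticklabels(zones)
ax.set_ylabel("Total capture rate")
```

Note that `boxplot` works out the quartiles on its own from the raw values, so step 4 is only needed if you want to see the numbers.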

Now, for my little analysis of the box plots I created...

I think it is important to talk about how I read a box plot first. There are actually 6 box plots plotted on the same axes. The box plot itself looks like a syringe to me, or a box with two whiskers extending from two sides. (That is why box plots are sometimes called box-and-whisker plots.) There is also a line that cuts the box into two parts in a box plot. 

A box plot shows you a lot of information, but the key information you can see right away is the maximum (highest number), minimum (lowest number), median (middle number), first quartile (25th percentile of your numbers), and third quartile (75th percentile of your numbers). Which way you read these depends on the orientation of the box plot (vertical or horizontal; in this case, vertical). Here, the maximum is represented by the tip of the upper whisker. (If you look at the box plot for the Bronx, the maximum would be around 810.) The minimum is represented by the tip of the lower whisker. (Bronx's minimum is around 380.) The median is represented by the line that cuts the box into two parts. (Bronx's median is around 550.) The first quartile is represented by the edge of the box that the lower whisker is attached to. (Bronx's first quartile is around 470.) The third quartile is represented by the edge of the box that the upper whisker is attached to. (Bronx's third quartile is around 680.)

As you can see, a box plot is divided into four parts, and each of the four parts, although they may not look equal in length, contains 25% of the data for the particular zone.

So, it seems that Bronx and Brooklyn North have lower capture rates than the other parts of the city. It is interesting to compare Brooklyn North to Brooklyn South and see how different their capture rates are. The greatest range (difference between the maximum and minimum values) appears to belong to the Bronx. Queens West, which has the highest minimum capture rate and one of the highest maximum rates, may win the prize of "Best Recycler in the City".

If you compare the IQR, or interquartile range (the difference between the first and third quartiles), you can see another measure of the spread of your data. Bronx and Manhattan have the highest IQR (their boxes are the longest), and perhaps the data is that way because Bronx and Manhattan are not broken up into sub-counties, while Brooklyn and Queens are. The bigger spread in the capture rates in Bronx and Manhattan may hide the fact that different areas within those large counties recycle differently. The IQR may be lower in the Brooklyn and Queens box plots because the data set is just smaller for those sub-counties.
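
To make the range and IQR concrete, here is a small sketch that uses the approximate Bronx values I read off the plot as if they were five data points. They are eyeballed readings, not the real data, so the results are only illustrative.

```python
import numpy as np

# Approximate Bronx readings from the plot, treated as five fake data points.
rates = np.array([380, 470, 550, 680, 810])

# First and third quartiles (25th and 75th percentiles).
q1, q3 = np.percentile(rates, [25, 75])

iqr = q3 - q1                              # length of the box
value_range = rates.max() - rates.min()    # whisker tip to whisker tip

print(iqr, value_range)
```

With these readings the box is about 210 long and the whole plot spans about 430, so roughly half of the total spread sits inside the box.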

It would be interesting to break up Bronx into Northern and Southern Bronx, as well as Manhattan. It would also be interesting to compare the socioeconomic status of the people who live in the areas to see if there is any correlation between that and the capture rates of recyclables. Why is it that Brooklyn North and South have such different capture rates, but Queens West and East seem to have a smaller difference in rates? 

This concludes my first visualization project using Python, pandas, and matplotlib.

Hopefully, there was something new to ponder here!