Sunday, March 17, 2019

Box Plot Analysis of NYC Total Capture Rates for Recyclables

As a newcomer to Python and data analysis, I decided to create some box plots for my first project. Troy advised me to work on some side projects to learn more about Python and data science. His project, which analyzed the amount of money given to doctors by pharmaceutical companies, inspired me to find a data set I was interested in and create a visual that represented the data.

So, I searched for a data set on recycling. The disclaimer for this data set is that the Department of Sanitation of New York has not used capture rates since 2013, so there might be some accuracy issues in the data. Also, when I parsed through the data, there were entries for late 2019, which obviously could not have been recorded yet. Nonetheless, I would say that the data set is great for generating some visuals, like box plots.

The data set is about the capture rates of metal, plastic, glass, and paper. The capture rate of those recyclables is "the percentage of total paper or metal/glass/plastic in the waste stream that is disposed of by recycling". As you can see in the box plots below, the capture rates vary across the boroughs of New York City. Brooklyn and Queens are each split into two zones. (Also, the capture rates are above 100, and I am not sure what units are actually used for them. In the data set, the total capture rate is labeled as "Total Recycling - Leaves (Recycling)) / (Max Paper + Max MGP))x100". But it doesn't matter so much for my analysis.)

These box plots were created with matplotlib in Python. Before I jump into my general analysis of the box plots, I want to write about how I created them. I am 99% sure that there is an easier way to create the box plots, but here is how I did it:

1. I imported pandas and matplotlib in Python.
2. I wrote some code to group my data into the 6 zones (Bronx, Brooklyn N, Brooklyn S, Manhattan, Queens E, Queens W).
3. I chose to compare only the Zones and the Capture Rates, and not the other information given in the data set.
4. I wrote some code to find the median, minimum, maximum, first and third quartile of the capture rates grouped by the zones.
5. I wrote some code to create the box plots. (This step required a lot of Googling.)
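
For anyone who wants to try something similar, here is a rough sketch of what those five steps could look like. It is not my exact code. The column names "Zone" and "Capture Rate" are assumptions, and the numbers are made up (only two zones, five values each) just so the example runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Step 1 is the imports above.
# Made-up rows standing in for the real data set; the column names are assumptions.
df = pd.DataFrame({
    "Zone": ["Bronx"] * 5 + ["Queens W"] * 5,
    "Capture Rate": [380, 470, 550, 680, 810, 600, 660, 700, 750, 850],
    "Month": list(range(1, 6)) * 2,  # an extra column that gets ignored
})

# Steps 2-3: keep only the zone and the capture rate, then group by zone.
grouped = df[["Zone", "Capture Rate"]].groupby("Zone")["Capture Rate"]

# Step 4: minimum, first quartile, median, third quartile, maximum per zone.
stats = []
for zone, rates in grouped:
    stats.append({
        "label": zone,
        "whislo": rates.min(),       # tip of the lower whisker
        "q1": rates.quantile(0.25),  # lower edge of the box
        "med": rates.median(),       # line inside the box
        "q3": rates.quantile(0.75),  # upper edge of the box
        "whishi": rates.max(),       # tip of the upper whisker
        "fliers": [],                # no separate outlier points
    })

# Step 5: draw one box per zone from the precomputed statistics.
fig, ax = plt.subplots()
ax.bxp(stats, showfliers=False)
ax.set_xlabel("Zone")
ax.set_ylabel("Total capture rate")
# Call plt.show() or fig.savefig("boxplots.png") to see the result.
```

Drawing with `ax.bxp` from precomputed statistics puts the whiskers at the true minimum and maximum. The shorter route, `ax.boxplot`, computes everything for you, but by default it stops the whiskers at 1.5 times the IQR and draws anything beyond that as separate points.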

Now, for my little analysis of the box plots I created...

I think it is important to talk about how I read a box plot first. There are actually 6 box plots plotted on the same axes. The box plot itself looks like a syringe to me, or a box with two whiskers extending from two sides. (That is why box plots are sometimes called box-and-whisker plots.) There is also a line that cuts the box into two parts in a box plot. 

A box plot shows you a lot of information, but the key information you can see right away is the maximum (highest number), minimum (lowest number), median (middle number), first quartile (25th percentile of your numbers), and third quartile (75th percentile of your numbers). How you read these depends on the orientation of the box plot (vertical or horizontal); in this case, it is vertical. The maximum is represented by the tip of the upper whisker. (If you look at the box plot for the Bronx, the maximum would be around 810.) The minimum is represented by the tip of the lower whisker. (Bronx's minimum is around 380.) The median is represented by the line that cuts the box into two parts. (Bronx's median is around 550.) The first quartile is represented by the edge of the box that the lower whisker is attached to. (Bronx's first quartile is around 470.) The third quartile is represented by the edge of the box that the upper whisker is attached to. (Bronx's third quartile is around 680.)

As you can see, a box plot is divided into four parts (the two whiskers and the two halves of the box), and each of the four parts, although they may not look equal in length, contains about 25% of the data for the particular zone.
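
A quick way to convince yourself of the 25% idea is to cut some numbers at the quartiles and count how many land in each part. This is only a small sketch with eight made-up values, not the real capture rates.

```python
import pandas as pd

# Eight made-up values for one zone (not real capture rates).
rates = pd.Series([380, 420, 470, 510, 550, 600, 680, 810])

# pd.qcut splits the values into 4 bins whose edges are the quartiles.
parts = pd.qcut(rates, 4)

# Count how many values fall into each of the four parts.
counts = parts.value_counts().sort_index()
print(counts)
```

Each of the four parts ends up with two of the eight values, even though the top part (roughly 620 to 810) is much wider than the others. That is why one whisker can be long and the other short while both still hold a quarter of the data.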

So, it seems that Bronx and Brooklyn North have lower capture rates than the other parts of the city. It is interesting to compare Brooklyn North to Brooklyn South and see how different their capture rates are. The greatest range (difference between the maximum and minimum values) appears to belong to the Bronx. Queens West, which has the highest minimum capture rate and one of the highest maximum rates, may win the prize of "Best Recycler in the City".

If you compare the interquartile range, or IQR (the difference between the third and first quartiles), you get another view of the spread of the data. Bronx and Manhattan have the highest IQRs (their boxes are the longest), perhaps because only Bronx and Manhattan are not broken up into sub-counties, while Brooklyn and Queens are. The bigger spread in the capture rates in Bronx and Manhattan may reflect the fact that different areas within those large counties recycle differently. The IQR may be lower in the Brooklyn and Queens box plots because the data set is just smaller for each of those sub-counties.
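
The range and IQR do not have to be read off the plot. Here is a small sketch of how they could be computed per zone with pandas, along with the number of rows in each zone, which would help check my guess about smaller data sets. The column names and the numbers are made up for illustration.

```python
import pandas as pd

# Made-up rows; the column names "Zone" and "Capture Rate" are assumptions.
df = pd.DataFrame({
    "Zone": ["Bronx"] * 5 + ["Queens W"] * 5,
    "Capture Rate": [380, 470, 550, 680, 810, 600, 660, 700, 750, 850],
})

grouped = df.groupby("Zone")["Capture Rate"]

spread = pd.DataFrame({
    "range": grouped.max() - grouped.min(),                  # max minus min
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),  # Q3 minus Q1
    "n": grouped.size(),                                     # rows per zone
})
print(spread)
```

With these made-up numbers, the Bronx rows give a range of 430 and an IQR of 210, while the Queens W rows give a range of 250 and an IQR of 90.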

It would be interesting to break up Bronx into Northern and Southern Bronx, as well as Manhattan. It would also be interesting to compare the socioeconomic status of the people who live in the areas to see if there is any correlation between that and the capture rates of recyclables. Why is it that Brooklyn North and South have such different capture rates, but Queens West and East seem to have a smaller difference in rates? 

This concludes my first visualization project using Python, pandas, and matplotlib.

Hopefully, there was something new to ponder here!



