Introduction
Some of you may know GoFundMe, a free online fundraising platform.
Through crowdfunding, people can use social media to raise money for anything
from medical expenses to honeymoon trips.
If you were a potential GoFundMe user, it would be nice to know roughly how
much of your goal you could expect to raise. Knowing your likely success rate
beforehand would give you an idea of whether or not GoFundMe is the right
place for you to raise money. I predict that what you are fundraising for,
your goal amount, and your story will give some idea of what percent of your
goal you will be able to raise.
To see how predictive the category of your campaign, the goal amount, and the
story are of the amount you will be able to raise, I decided to build a linear
regression model with those features.
Web Scraping
I used Beautiful Soup and Selenium to scrape the data for my dataset.
I then did some preprocessing, which included reformatting the data to
extract the numbers used in later analysis. Next, I did data analysis, which
included locating and removing outliers and plotting residuals. The residual
plots pointed me toward feature engineering, where I used polynomial features
and one-hot encoding of the campaign categories. Lastly, I fit models to my
data, using a train-test split and cross-validation to decide which model had
the most predictive power.
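As a rough illustration of the preprocessing step, the sketch below pulls a dollar amount out of a scraped string and computes the percent of goal raised. The helper name `parse_amount` and the example strings are hypothetical; the real page text may be formatted differently.

```python
import re

def parse_amount(text):
    """Extract the first number from a scraped string as a float.

    Handles thousands separators ("$1,250") and returns None when the
    string contains no digits.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))

# Hypothetical scraped strings, for illustration only.
raised = parse_amount("$1,250 raised")
goal = parse_amount("of $5,000 goal")

# The target used in this post: percent of goal raised.
percent_of_goal = 100 * raised / goal
print(raised, goal, percent_of_goal)  # 1250.0 5000.0 25.0
```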
I scraped the location, story, date, category, money raised, goal, and social media
shares. Only the money raised, goal, story, and category ended up being used,
because the nature of the problem means that a new campaign starts at time 0 with
0 social media shares. Money raised is needed only to compute the target, the
percent of goal raised. Location was removed from my features because most
of the campaigns were in the United States, so I didn’t feel it was as important.
From the story, I engineered three features: word count, polarity score, and
subjectivity score.
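A minimal sketch of that selection, assuming each scraped campaign is a dictionary: the field names and the `launch_features` helper are hypothetical. Polarity and subjectivity are left out because this post does not name the sentiment library used to compute them.

```python
def launch_features(record):
    """Keep only the fields that are known before a campaign goes live."""
    return {
        "goal": record["goal"],
        "category": record["category"],
        # Word count: number of whitespace-separated tokens in the story.
        "word_count": len(record["story"].split()),
    }

# Hypothetical scraped record; values are made up for illustration.
record = {
    "goal": 5000.0,
    "category": "medical",
    "story": "Help us cover surgery costs for my brother.",
    "shares": 312,             # 0 at launch, so not usable as a feature
    "date": "2017-03-01",      # time elapsed is 0 at launch
    "location": "Austin, TX",  # dropped: mostly United States
}

features = launch_features(record)
print(features["word_count"])  # 8
```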
Data Analysis
After preprocessing the data, I was able to create some charts. The
distribution of my target, the percent of goal raised, is left-skewed, which
shows that many people do reach their fundraising goal.
The next step was to locate and remove outliers. Looking at the goal amounts,
most people aim to raise $500,000 or less, so I removed the data points with a
goal greater than $500,000. Likewise, most stories are 1,500 words or fewer, so
I removed the data points with a word count greater than 1,500.
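The two cutoffs amount to a simple filter. The thresholds below come from this post, while the row layout and the `remove_outliers` helper are assumptions made for the example.

```python
MAX_GOAL = 500_000      # dollars, cutoff stated in the post
MAX_WORD_COUNT = 1500   # words, cutoff stated in the post

def remove_outliers(rows):
    """Keep rows at or below both cutoffs."""
    return [
        r for r in rows
        if r["goal"] <= MAX_GOAL and r["word_count"] <= MAX_WORD_COUNT
    ]

# Hypothetical rows, for illustration only.
rows = [
    {"goal": 5_000, "word_count": 300},
    {"goal": 2_000_000, "word_count": 250},  # goal too large
    {"goal": 10_000, "word_count": 4_000},   # story too long
]

kept = remove_outliers(rows)
print(len(kept))  # 1
```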
Looking at the residual plots for goal amount, word count, polarity score, and
subjectivity score, I see a systematic pattern: the model consistently
over-predicts or under-predicts the percent of goal raised. This suggests that
polynomial features might help.
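To make the idea of polynomial features concrete, here is a small sketch that expands a feature vector into all terms up to a given degree, including interaction terms. Degree 2 is an assumption (the post does not state the degree used), the helper name is hypothetical, and no constant bias column is generated.

```python
from itertools import combinations_with_replacement

def polynomial_features(values, names, degree=2):
    """Return (names, values) for every monomial of degree 1..degree."""
    out_names, out_values = [], []
    for d in range(1, degree + 1):
        # Each combination of indices (with repeats) is one monomial.
        for combo in combinations_with_replacement(range(len(values)), d):
            term = 1.0
            for i in combo:
                term *= values[i]
            out_names.append("*".join(names[i] for i in combo))
            out_values.append(term)
    return out_names, out_values

# Made-up input values for two of the story features.
names = ["polarity", "subjectivity"]
feat_names, feat_values = polynomial_features([0.5, 0.25], names)

print(feat_names)
# ['polarity', 'subjectivity', 'polarity*polarity',
#  'polarity*subjectivity', 'subjectivity*subjectivity']
print(feat_values)  # [0.5, 0.25, 0.25, 0.125, 0.0625]
```

With four input features, degree 2 produces 14 columns: the 4 originals, 4 squares, and 6 pairwise products.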
Modeling
Using linear regression on 941 observations with polynomial transformations of goal
amount, word count, polarity score, and subjectivity score, I got an R^2 score of 0.20.
Linear regression on the one-hot-encoded categories alone gave me an R^2 score of -0.03.
Combining the one-hot-encoded categories with the polynomial transformations of
the four features above, I got an R^2 score of 0.42, which is not bad
considering that I am trying to predict human behavior.
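A negative R^2 can look odd, so here is a short sketch of how the score is defined. R^2 compares the model's squared error with that of always predicting the mean of the observed values, so a score below 0 means the model did worse than that baseline on the data it was scored on. The numbers below are made up and are not from my dataset.

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [20.0, 40.0, 60.0, 80.0]  # made-up percent-of-goal values

# Predicting the mean (50) everywhere scores exactly 0.
baseline = r2_score(y_true, [50.0] * 4)
print(baseline)  # 0.0

# Predictions that are further off than the mean score below 0.
worse = r2_score(y_true, [70.0] * 4)
print(worse < 0)  # True
```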
Using cross-validation, I found that lasso regularization gave me the
highest R^2 score.
Interpretation
Looking at the coefficients, I was able to see the boost or reduction in
percent of goal raised for each category. These values assume that everything
else is held constant.
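The sketch below shows what "holding everything else constant" means for one-hot-encoded categories in a linear model. The intercept and coefficients are made-up numbers chosen only for illustration; they are NOT the fitted values from my model, and the category list is a small hypothetical subset.

```python
CATEGORIES = ["medical", "business", "community"]  # hypothetical subset

def one_hot(category):
    """Encode a category as 0/1 indicators in CATEGORIES order."""
    return [1.0 if category == c else 0.0 for c in CATEGORIES]

# Made-up numbers for illustration; NOT the fitted values.
intercept = 50.0
category_coefs = [10.0, -15.0, 0.0]

def predict(category, other_terms=0.0):
    """Linear prediction: intercept + category shift + everything else."""
    shift = sum(c * x for c, x in zip(category_coefs, one_hot(category)))
    return intercept + shift + other_terms

# With other_terms fixed, switching category moves the prediction by the
# difference between the two category coefficients.
print(predict("medical") - predict("community"))   # 10.0
print(predict("business") - predict("community"))  # -15.0
```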
Campaigns that on average get the largest reduction in their percent of goal
are in the newlywed, business, and competition categories. Campaigns that on
average get the largest boost in their percent of goal are in the medical,
emergency, and memorial categories.
It turns out that, in addition to the animals, community, creative, and event
categories, polarity score, subjectivity score, and their interaction (polarity
score times subjectivity score) also have coefficients of 0 under lasso
regularization. This makes sense given my earlier F-statistic analysis, which
showed that polarity might not have anything to do with percent of goal raised.
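For intuition about why lasso can set coefficients to exactly 0, consider the simplified textbook case of orthonormal features with the objective (1/2)·||y − Xβ||² + λ·||β||₁. There, each lasso coefficient is the least-squares coefficient shrunk toward zero by λ and clipped at zero (soft-thresholding). My features are not orthonormal, so this is only a sketch of the mechanism, with made-up values.

```python
def soft_threshold(beta_ols, lam):
    """Lasso solution for one coefficient when features are orthonormal."""
    if beta_ols > lam:
        return beta_ols - lam
    if beta_ols < -lam:
        return beta_ols + lam
    return 0.0  # weak effects are zeroed out entirely

lam = 1.0  # hypothetical penalty strength
for name, beta in [("strong", 4.0), ("weak", 0.5), ("negative", -2.5)]:
    print(name, soft_threshold(beta, lam))
# strong 3.0
# weak 0.0
# negative -1.5
```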
Further Tuning of Model
One glaring flaw in my model is that it assumes campaigns were complete by the
time I scraped the data. That is not true, so adding a time component should
improve the model considerably. The question of how much one might be able to
fundraise is also a bit flawed; a better question would be how much one could
expect to fundraise in a given amount of time.
I can group my data into bins according to a "days active" feature and do
further analysis with that. I suspect a similar trend in which categories get a
percent boost or reduction in their goal will remain, but the newer model may
make more accurate predictions.
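One possible way to do that binning is sketched below. The bin edges (7, 30, and 90 days), the `bin_label` helper, and the sample data are all hypothetical choices made for the example.

```python
from collections import defaultdict

def bin_label(days_active, edges=(7, 30, 90)):
    """Map a campaign age in days to a coarse bin label."""
    lower = 0
    for upper in edges:
        if days_active <= upper:
            return f"{lower}-{upper}"
        lower = upper + 1
    return f"{lower}+"

# Hypothetical (days_active, percent_of_goal) pairs.
campaigns = [(3, 12.0), (5, 20.0), (25, 55.0), (120, 90.0)]

# Group the target by bin so each bin can be analyzed on its own.
by_bin = defaultdict(list)
for days, pct in campaigns:
    by_bin[bin_label(days)].append(pct)

for label, values in by_bin.items():
    print(label, sum(values) / len(values))
# 0-7 16.0
# 8-30 55.0
# 91+ 90.0
```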