Tuesday, May 19, 2020

Diary Project Part 2: Topic Modeling

Why?

I thought it would be interesting to see the major themes that showed up throughout my life, specifically before high school, during high school, during college, and post-college. That was the rationale behind doing topic modeling on my diary entries.

What I Found

Just by looking at the most common bigrams (pairs of adjacent words that appear in my writing after I "cleaned" it up with some natural language processing techniques), I gained some insight into my life. It seemed like the central things I wrote about involved school of some sort, volunteering, career, and being Chinese.

Before high school, the 8 most common bigrams were: "table_leader", "fourth_grade", "social_study", "good_night", "younger_sister", "best_wish", "bad_news", and "picture_frame".
It is true that most of my diary entries before high school were written during the 3rd and 4th grades. "table_leader" most likely refers to when I was a table leader in my class, a role my teacher assigned to a student to collect work from the other students at the same table.

During high school, the 8 most common bigrams were "red_envelope", "kings_plaza", "nursing_home", "horseshoe_crab", "math_fair", "touch_pool", "fuck_fuck", and "ca_wait". Red envelopes are literally red envelopes stuffed with cash, usually given out during Chinese New Year by married people to children or unmarried relatives. I used to live in Brooklyn, and we used to go to Kings Plaza to shop for clothes. "nursing_home" probably refers to the time when I volunteered at a nursing home in high school. "horseshoe_crab" and "touch_pool" most likely refer to the time I volunteered at the NY Aquarium. "math_fair" refers to the time when I wrote a math research paper, and the expletive is probably from being stressed about college applications and decisions.

During college, the 8 most common bigrams were "actuarial_exam", "high_school", "study_abroad", "host_mom", "fall_asleep", "concrete_dream", "career_path", and "family_member". The bigrams "actuarial_exam", "concrete_dream", and "career_path" all relate to the time when I was searching for a career path and thought maybe actuarial science was something I wanted to do. "study_abroad" and "host_mom" are related to each other, as I studied abroad in Spain and lived with a host parent for four months. "high_school" probably appears because I was reminiscing about high school during college.

After college, the 8 most common bigrams were "data_science", "san_francisco", "data_scientist", "high_school", "imposter_syndrome", "chinese_culture", "walked_around", and "make_difference". What surprised me is that teaching-related bigrams did not show up in my top 30 bigrams. It may be because I did not write much about teaching during my time as a teacher.

Topic Modeling with LDA

To see what kinds of topics topic modeling would pick up in the four stages of my life, I decided to use LDA, or Latent Dirichlet Allocation. LDA assumes that documents with similar topics will use similar words. ("Documents are probability distributions over latent topics. Topics are probability distributions over words.")

In this case, I created a document for each sentence I had written over the years. I had 1005 documents for before high school, 4030 documents during high school, 2589 documents during college, and 3204 documents post-college.

Each document was then "cleaned" by removing punctuation, lower-casing words, removing stop-words, and lemmatizing words. Then, the documents were tokenized and bigrams were created.
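
Here is a rough sketch of that cleaning pipeline, assuming NLTK for stop-words and lemmatization and gensim for tokenizing and bigram detection (the exact libraries, thresholds, and example sentences are my assumptions, not the original code):

# Minimal sketch of the cleaning pipeline: lower-case, strip punctuation,
# remove stop-words, lemmatize, then merge frequent word pairs into bigrams.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.utils import simple_preprocess
from gensim.models.phrases import Phrases, Phraser

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(sentence):
    # lower-case, strip punctuation/accents, and tokenize in one step
    tokens = simple_preprocess(sentence, deacc=True)
    # drop stop-words, then lemmatize whatever is left
    return [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]

# one "document" per sentence (illustrative examples, not real entries)
documents = ["Dear journal, I was made table leader today.",
             "I can't believe the social studies report is due tomorrow."]
tokenized_docs = [clean(doc) for doc in documents]

# learn frequently co-occurring word pairs and join them into tokens like "table_leader"
bigram_model = Phraser(Phrases(tokenized_docs, min_count=5, threshold=10))
bigram_docs = [bigram_model[doc] for doc in tokenized_docs]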

Then comes LDA. We need to build a corpus from our bigrams by first assigning each bigram a number (an id) and then counting how many times a certain bigram appears in each document. This is done using gensim. I used the code below to create my corpus.
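
In gensim, that step looks roughly like this (a sketch, continuing from the bigram_docs list in the cleaning sketch above):

# Map each bigram/token to an integer id, then represent every document as
# (token_id, count) pairs -- gensim's bag-of-words corpus format.
from gensim.corpora import Dictionary

id2word = Dictionary(bigram_docs)
corpus = [id2word.doc2bow(doc) for doc in bigram_docs]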




Now, to use LDA, we have to specify how many topics we assume there are or want to have in our documents. We can pick an arbitrary number, but there is a way to use coherence scores to pick the number of topics. From my understanding, the higher the coherence score, the more semantic similarity there is between the high-scoring bigrams in a certain topic. (Topics are made up of a group of bigrams.) It is an okay way to evaluate the model in some cases, so I decided to find the coherence scores for every choice between 2 and 15 topics, inclusive.
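
A sketch of that search, assuming gensim's LdaModel and CoherenceModel with the "c_v" coherence measure (the specific settings here are assumptions, not the original ones):

# Train an LDA model for each candidate number of topics and record its
# coherence score, then plot coherence vs. number of topics to eyeball the peak.
from gensim.models import LdaModel, CoherenceModel
import matplotlib.pyplot as plt

ks, scores = [], []
for k in range(2, 16):                      # 2 through 15 topics, inclusive
    lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=k,
                   random_state=42, passes=10)
    cm = CoherenceModel(model=lda, texts=bigram_docs,
                        dictionary=id2word, coherence="c_v")
    ks.append(k)
    scores.append(cm.get_coherence())

plt.plot(ks, scores, marker="o")
plt.xlabel("Number of topics")
plt.ylabel("Coherence score (c_v)")
plt.show()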

What I found was that for the before-high-school corpus, the greatest coherence score was achieved at 9 topics. But it got trickier for the other corpora. Those graphs all had some sort of U shape, meaning the maximum coherence score came either at very few topics (2 or 3) or at a lot of topics (14-15). In that case, for the during-high-school corpus I picked 8 topics, which was the first peak after the drop in coherence score from 2 topics, as you can see in the graph below.

[Graph: coherence score vs. number of topics for the during-high-school corpus]

I could have picked 6 as well, but since I had a lot more documents during high school than before high school, it didn't make sense to me that I would have fewer topics than before high school.

For the college corpus, the coherence graph looked similar to the high school coherence graph, and I picked 6 topics. For post-college, I picked only 3 topics, as the graph looked like this:

[Graph: coherence score vs. number of topics for the post-college corpus]

I also checked the topics that came out for each chosen number of topics to see if I could make sense of them, and I was satisfied with them. I suppose it is hard to truly evaluate how many topics one should choose for LDA, as there will always be a subjective component to it.

The Topics

So after I chose how many topics I wanted each corpus to produce, I trained an LDA model on each corpus and got some results.
For before high school, these were my results:

[(0,
    '0.115*"eating_pizza" + 0.115*"digged_paper" + 0.104*"lowest_score" + '
    '0.092*"bumped_head" + 0.092*"sicky_problem" + 0.082*"hong_kong" + '
    '0.072*"butter_corn" + 0.007*"table_leader" + 0.007*"bad_news" + '
    '0.007*"new_year"'),
   (1,
    '0.306*"dragon_ball" + 0.121*"going_die" + 0.096*"coney_island" + '
    '0.082*"eleven_clock" + 0.066*"flying_chair" + 0.035*"adventure_fall" + '
    '0.035*"free_whale" + 0.028*"multiplication_bee" + 0.005*"year_old" + '
    '0.005*"jerwel_box"'),
   (2,
    '0.234*"whole_class" + 0.146*"fourth_grade" + 0.096*"sesame_place" + '
    '0.089*"student_month" + 0.089*"canal_street" + 0.056*"wat_nice" + '
    '0.038*"adventure_fall" + 0.005*"sign_return" + 0.005*"progress_report" + '
    '0.005*"social_study"'),
   (3,
    '0.182*"year_old" + 0.152*"report_card" + 0.152*"dear_journal" + '
    '0.074*"best_wish" + 0.064*"big_candle" + 0.051*"chat_chat" + '
    '0.051*"entered_multiplication" + 0.051*"ice_cream" + 0.005*"easy_peesy" + '
    '0.005*"whole_class"'),
   (4,
    '0.214*"report_card" + 0.197*"social_study" + 0.090*"even_though" + '
    '0.090*"progress_report" + 0.071*"hope_feel" + 0.007*"new_year" + '
    '0.007*"whole_class" + 0.007*"easy_peesy" + 0.007*"hersey_bar" + '
    '0.007*"paste_sticker"'),
   (5,
    '0.234*"report_dued" + 0.191*"good_night" + 0.103*"younger_sister" + '
    '0.083*"police_officer" + 0.083*"blood_black" + 0.006*"whole_class" + '
    '0.006*"dragon_ball" + 0.006*"dear_journal" + 0.006*"report_card" + '
    '0.006*"year_old"'),
   (6,
    '0.251*"dear_journal" + 0.203*"table_leader" + 0.097*"picture_frame" + '
    '0.067*"ca_unlock" + 0.067*"jerwel_box" + 0.006*"new_year" + '
    '0.006*"multiplication_bee" + 0.006*"adventure_fall" + 0.006*"report_card" + '
    '0.006*"year_old"'),
   (7,
    '0.312*"new_year" + 0.084*"phone_number" + 0.068*"red_eye" + '
    '0.067*"really_high" + 0.067*"around_five" + 0.067*"stop_bleeding" + '
    '0.067*"sign_return" + 0.036*"free_whale" + 0.005*"fourth_grade" + '
    '0.005*"progress_report"'),
   (8,
    '0.147*"easy_peesy" + 0.100*"smartest_girl" + 0.100*"paste_sticker" + '
    '0.100*"need_sleep" + 0.100*"hersey_bar" + 0.091*"bad_news" + '
    '0.079*"throw_dice" + 0.006*"lowest_score" + 0.006*"wat_nice" + '
    '0.006*"good_night"')]

As you can see, there are 9 groups of bigrams (labeled 0-8). Each group of bigrams represents one topic. LDA does not name your topics for you, so you have to come up with your own interpretation of what each grouping of bigrams means. The numbers represent the probability that the given bigram will be used in that particular topic. The bigrams are sorted by weight in descending order, so the most weight is given to the first bigram.
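
For reference, a listing in this format is what gensim prints for a trained model; something like the following sketch (with the number of topics, num_words, and other settings as assumptions) would produce it for the before-high-school corpus:

# Train the final model with the chosen number of topics and print each topic
# as a weighted list of its top bigrams, which is the format shown above.
from pprint import pprint

final_lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=9,
                     random_state=42, passes=10)
pprint(final_lda.print_topics(num_topics=9, num_words=10))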

The during-high-school corpus produced these topics:
[(0,
  '0.136*"ivy_league" + 0.082*"social_science" + 0.082*"fuck_shit" + '
  '0.043*"joyce_thomas" + 0.040*"wo_able" + 0.037*"sat_prep" + '
  '0.033*"park_ranger" + 0.033*"cough_suppressant" + '
  '0.029*"competitive_college" + 0.028*"garbage_bag"'),
 (1,
  '0.048*"last_forever" + 0.039*"aquarium_docent" + 0.038*"rice_ball" + '
  '0.034*"new_hairstyle" + 0.034*"spur_moment" + 0.032*"real_sorrow" + '
  '0.030*"blood_pressure" + 0.029*"tank_top" + 0.029*"force_sneaker" + '
  '0.029*"pair_air"'),
 (2,
  '0.229*"fuck_fuck" + 0.077*"bad_attribute" + 0.055*"sat_ii" + '
  '0.047*"fish_hook" + 0.047*"professional_job" + 0.038*"grandmother_grave" + '
  '0.033*"regent_week" + 0.031*"ap_spanish" + 0.026*"basketball_board" + '
  '0.025*"ap_class"'),
 (3,
  '0.051*"math_team" + 0.039*"play_basketball" + 0.034*"new_york" + '
  '0.030*"eat_breakfast" + 0.028*"rite_aid" + 0.027*"aunt_uncle" + '
  '0.026*"staten_island" + 0.024*"speak_english" + 0.022*"dim_sum" + '
  '0.021*"book_buddy"'),
 (4,
  '0.080*"birthday_party" + 0.052*"horseshoe_crab" + 0.046*"fur_seal" + '
  '0.044*"touch_pool" + 0.038*"sea_star" + 0.034*"math_fair" + '
  '0.032*"lunar_new" + 0.028*"king_highway" + 0.028*"back_forth" + '
  '0.025*"random_guy"'),
 (5,
  '0.115*"red_envelope" + 0.059*"king_plaza" + 0.049*"everyone_else" + '
  '0.049*"living_room" + 0.045*"ice_cream" + 0.038*"dark_secret" + '
  '0.037*"coney_island" + 0.036*"good_luck" + 0.033*"ca_wait" + '
  '0.030*"linda_mindy"'),
 (6,
  '0.067*"ipod_touch" + 0.058*"nursing_home" + 0.046*"photography_club" + '
  '0.036*"ap_lit" + 0.036*"talked_iff" + 0.035*"eats_sleep" + '
  '0.035*"brother_wife" + 0.035*"dad_trimmed" + 0.035*"turned_head" + '
  '0.035*"home_depot"'),
 (7,
  '0.050*"summer_reading" + 0.042*"hot_pot" + 0.038*"cent_store" + '
  '0.038*"took_picture" + 0.035*"year_eve" + 0.033*"celebrate_birthday" + '
  '0.032*"least_worry" + 0.031*"hope_safe" + 0.029*"talk_trivial" + '
  '0.028*"long_thin"')]

The during-college corpus produced these topics:
[(0,
  '0.086*"fall_asleep" + 0.082*"career_path" + 0.076*"host_mom" + '
  '0.039*"emergency_room" + 0.035*"chinese_american" + 0.035*"read_compass" + '
  '0.033*"fit_lifestyle" + 0.030*"stream_consciousness" + '
  '0.027*"everyone_else" + 0.024*"worth_gift"'),
 (1,
  '0.103*"high_school" + 0.074*"family_member" + 0.059*"study_abroad" + '
  '0.038*"pretty_cool" + 0.035*"certain_way" + 0.035*"paper_cup" + '
  '0.034*"month_ago" + 0.029*"un_poco" + 0.026*"food_court" + '
  '0.021*"por_dios"'),
 (2,
  '0.033*"dim_sum" + 0.029*"vanity_fancy" + 0.028*"spring_fling" + '
  '0.027*"graduate_school" + 0.023*"jodi_picoult" + 0.022*"grad_school" + '
  '0.022*"change_career" + 0.021*"nature_preserve" + 0.020*"apple_hill" + '
  '0.020*"education_system"'),
 (3,
  '0.061*"growing_vegetable" + 0.041*"going_hiking" + '
  '0.041*"peruvian_restaurant" + 0.041*"getting_married" + 0.041*"pio_pio" + '
  '0.041*"truly_understand" + 0.041*"bally_laundromat" + 0.030*"bite_nail" + '
  '0.026*"walked_around" + 0.023*"estas_imaginaciones"'),
 (4,
  '0.108*"actuarial_exam" + 0.039*"concrete_dream" + 0.033*"flawless_muse" + '
  '0.030*"teach_america" + 0.024*"new_shoe" + 0.024*"group_therapy" + '
  '0.021*"psychotic_episode" + 0.020*"mud_bath" + 0.019*"pas_exam" + '
  '0.018*"studying_p"'),
 (5,
  '0.058*"new_york" + 0.050*"estoy_aquĆ­" + 0.041*"el_mundo" + '
  '0.037*"siempre_estoy" + 0.035*"city_madrid" + 0.032*"best_worst" + '
  '0.028*"e_tan" + 0.028*"en_mi" + 0.028*"follow_compass" + '
  '0.025*"existential_crisis"')]

I can tell that topic 4 for the college corpus makes some sense to me, in that it is talking about career goals and a desire to attain certain things that aren't attainable, like being a "flawless_muse". Topic 4 also has a lot of bigrams related to the actuarial exams, including Exam P.

The post-college corpus produced these topics:
[(0,
  '0.033*"mental_health" + 0.030*"back_ny" + 0.027*"need_relax" + '
  '0.026*"high_school" + 0.024*"taking_risk" + 0.023*"comfort_home" + '
  '0.022*"dating_jon" + 0.017*"give_cash" + 0.016*"job_description" + '
  '0.014*"dim_sum"'),
 (1,
  '0.041*"kaggle_competition" + 0.028*"imposter_syndrome" + '
  '0.028*"throughout_day" + 0.025*"weekly_spread" + 0.024*"wisdom_tell" + '
  '0.023*"make_difference" + 0.019*"gratitude_log" + 0.017*"pay_attention" + '
  '0.017*"afraid_letting" + 0.017*"sam_harris"'),
 (2,
  '0.107*"data_science" + 0.037*"data_scientist" + 0.026*"bullet_journal" + '
  '0.019*"believed_ability" + 0.016*"someone_else" + 0.015*"bring_joy" + '
  '0.015*"apply_job" + 0.014*"passed_away" + 0.013*"new_jersey" + '
  '0.013*"letting_go"')]

Topic 2 in the post-college corpus is a solid topic as well. It is clearly about data science, and I was a data science intern in New Jersey at the time.

Trends

I would say that I had a pretty school-centered childhood, somewhat of an angsty high school phase, some sort of existential or career crisis point in college, and more of an acceptance-of-how-things-are period post-college. This may just be tainted by my own perspective, of course. I could keep on delving into the analysis of my life, but at this time, I don't want to. This diary project was interesting, and it gave me some insight into what I journal about and what I cared about in the distant and recent past.

This project aside, I thought keeping a journal was a habit I never had. But seeing all these entries that built up over the years, I can say I do have a habit of journal writing, and it is a habit I would like to keep. It is a joy to read them one more time.
