Tuesday, May 19, 2020

Diary Project Part 2: Topic Modeling

Why?

I thought it would be interesting to see the major themes that showed up throughout my life, specifically before high school, during high school, during college, and post-college. That was the rationale behind doing topic modeling on my diary entries.

What I Found

By just looking at the most common bigrams (two adjacent words that appear in my writing after I "cleaned" it up with some Natural Language Processing techniques), I gained some insight into my life. It seemed like the central things I wrote about involved school of some sort, volunteering, career, and being Chinese.

Before high school, the 8 most common bigrams were: "table_leader", "fourth_grade", "social_study", "good_night", "younger_sister", "best_wish", "bad_news", and "picture_frame".
It is true that most of my diary entries before high school were written during the 3rd and 4th grades. "table_leader" most likely refers to when I was a table leader in my class, a role my teacher assigned to a student to collect work from the other students at the same table.

During high school, the 8 most common bigrams were "red_envelope", "kings_plaza", "nursing_home", "horseshoe_crab", "math_fair", "touch_pool", "fuck_fuck", and "ca_wait". Red envelopes are exactly that: red envelopes stuffed with cash, usually given out during Chinese New Year by married people to children or unmarried relatives. I used to live in Brooklyn, and we used to go to Kings Plaza to shop for clothes. "nursing_home" probably refers to the time I volunteered at a nursing home in high school. "horseshoe_crab" and "touch_pool" most likely refer to the time I volunteered at the NY Aquarium. "math_fair" refers to the time I wrote a math research paper, and the expletive is probably from being stressed about college applications and decisions.

During college, the 8 most common bigrams were "actuarial_exam", "high_school", "study_abroad", "host_mom", "fall_asleep", "concrete_dream", "career_path", and "family_member". The bigrams "actuarial_exam", "concrete_dream", and "career_path" all relate to the time when I was searching for a career path and thought maybe actuarial science was something I wanted to do. "study_abroad" and "host_mom" are related to each other, as I studied abroad in Spain and lived with a host parent for four months. High school probably appears because I was reminiscing about it during college.

After college, the 8 most common bigrams were "data_science", "san_francisco", "data_scientist", "high_school", "imposter_syndrome", "chinese_culture", "walked_around", and "make_difference". What surprised me is that no teaching-related bigrams showed up in my top 30. It may be because I did not write about teaching much during my time as a teacher.

Topic Modeling with LDA

To see what kinds of topics topic modeling would pick up in the four stages of my life, I decided to use LDA, or Latent Dirichlet Allocation. LDA assumes that documents with similar topics will use similar words. ("Documents are probability distribution over latent topics. Topics are probability distribution over words.")

In this case, I created a document for each sentence I had written over the years. I had 1005 documents from before high school, 4030 documents during high school, 2589 documents during college, and 3204 documents post-college.

Each document was then "cleaned" by removing punctuation, lower-casing words, removing stop-words, and lemmatizing words. Then, the documents were tokenized and bigrams were created.
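As a sketch of that cleaning step, the pipeline below uses plain Python with a deliberately tiny stop-word list and a toy lemma table; the real run would use a full NLP toolkit's stop words and lemmatizer, so treat this as illustrative only.

```python
import string

# Tiny illustrative stop-word list and lemma table; a real run would
# use a full toolkit's stop words and a proper lemmatizer.
STOP_WORDS = {"i", "a", "an", "the", "and", "to", "of", "was", "in"}
LEMMAS = {"wrote": "write", "studies": "study", "entries": "entry"}

def clean_and_tokenize(sentence):
    """Lower-case, strip punctuation, drop stop words, lemmatize."""
    sentence = sentence.lower()
    sentence = sentence.translate(str.maketrans("", "", string.punctuation))
    return [LEMMAS.get(tok, tok)
            for tok in sentence.split()
            if tok not in STOP_WORDS]

def to_bigrams(tokens):
    """Join each pair of adjacent tokens with an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

tokens = clean_and_tokenize("I wrote a Social Studies report!")
print(tokens)              # ['write', 'social', 'study', 'report']
print(to_bigrams(tokens))  # ['write_social', 'social_study', 'study_report']
```

Notice how "Social Studies" becomes the bigram "social_study" after lower-casing and lemmatizing, which is exactly the form the bigrams above appear in.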

Then comes LDA. We need to build a corpus from our bigrams by first assigning each bigram a number and then counting how many times each bigram appears in a document. This is done using gensim. I used the code below to create my corpus.




Now, to use LDA, we have to specify how many topics we assume there are, or want to have, in our documents. We can pick an arbitrary number, but there is a way to use coherence scores to pick the number of topics. From my understanding, the higher the coherence score, the more semantic similarity there is between the high-scoring bigrams within a topic. (Topics are made up of groups of bigrams.) It is a reasonable evaluation of the model in some cases, so I decided to find the coherence scores for between 2 and 15 topics, inclusive.

What I found was that for the before-high-school corpus, the greatest coherence score was achieved at 9 topics. But it got trickier for the other corpora. Those graphs all had some sort of U shape, showing that the maximum coherence score occurred either at very few topics (2 or 3) or at many topics (14-15). In that case, for during high school I picked 8 topics, which was the first peak after the drop in coherence score from 2 topics, as you can see in the graph below.

[Graph: coherence score vs. number of topics for the during-high-school corpus]
I could have picked 6 as well, but I had many more documents during high school than before high school, and it didn't make sense that I would have fewer topics than before high school.

For the college corpus, the coherence graph looked similar to the high school one, and I picked 6 topics. For post-college, I picked only 3 topics, as the graph looked like this:

[Graph: coherence score vs. number of topics for the post-college corpus]
For each chosen number of topics, I also checked the topics that came out to see if I could make sense of them, and I was satisfied with them. I suppose it is hard to truly evaluate how many topics one should choose for LDA, as there will always be a subjective component to it.

The Topics

So after I chose how many topics I wanted each corpus to produce, I got some results.
For before high school, these were my results:

[(0,
    '0.115*"eating_pizza" + 0.115*"digged_paper" + 0.104*"lowest_score" + '
    '0.092*"bumped_head" + 0.092*"sicky_problem" + 0.082*"hong_kong" + '
    '0.072*"butter_corn" + 0.007*"table_leader" + 0.007*"bad_news" + '
    '0.007*"new_year"'),
   (1,
    '0.306*"dragon_ball" + 0.121*"going_die" + 0.096*"coney_island" + '
    '0.082*"eleven_clock" + 0.066*"flying_chair" + 0.035*"adventure_fall" + '
    '0.035*"free_whale" + 0.028*"multiplication_bee" + 0.005*"year_old" + '
    '0.005*"jerwel_box"'),
   (2,
    '0.234*"whole_class" + 0.146*"fourth_grade" + 0.096*"sesame_place" + '
    '0.089*"student_month" + 0.089*"canal_street" + 0.056*"wat_nice" + '
    '0.038*"adventure_fall" + 0.005*"sign_return" + 0.005*"progress_report" + '
    '0.005*"social_study"'),
   (3,
    '0.182*"year_old" + 0.152*"report_card" + 0.152*"dear_journal" + '
    '0.074*"best_wish" + 0.064*"big_candle" + 0.051*"chat_chat" + '
    '0.051*"entered_multiplication" + 0.051*"ice_cream" + 0.005*"easy_peesy" + '
    '0.005*"whole_class"'),
   (4,
    '0.214*"report_card" + 0.197*"social_study" + 0.090*"even_though" + '
    '0.090*"progress_report" + 0.071*"hope_feel" + 0.007*"new_year" + '
    '0.007*"whole_class" + 0.007*"easy_peesy" + 0.007*"hersey_bar" + '
    '0.007*"paste_sticker"'),
   (5,
    '0.234*"report_dued" + 0.191*"good_night" + 0.103*"younger_sister" + '
    '0.083*"police_officer" + 0.083*"blood_black" + 0.006*"whole_class" + '
    '0.006*"dragon_ball" + 0.006*"dear_journal" + 0.006*"report_card" + '
    '0.006*"year_old"'),
   (6,
    '0.251*"dear_journal" + 0.203*"table_leader" + 0.097*"picture_frame" + '
    '0.067*"ca_unlock" + 0.067*"jerwel_box" + 0.006*"new_year" + '
    '0.006*"multiplication_bee" + 0.006*"adventure_fall" + 0.006*"report_card" + '
    '0.006*"year_old"'),
   (7,
    '0.312*"new_year" + 0.084*"phone_number" + 0.068*"red_eye" + '
    '0.067*"really_high" + 0.067*"around_five" + 0.067*"stop_bleeding" + '
    '0.067*"sign_return" + 0.036*"free_whale" + 0.005*"fourth_grade" + '
    '0.005*"progress_report"'),
   (8,
    '0.147*"easy_peesy" + 0.100*"smartest_girl" + 0.100*"paste_sticker" + '
    '0.100*"need_sleep" + 0.100*"hersey_bar" + 0.091*"bad_news" + '
    '0.079*"throw_dice" + 0.006*"lowest_score" + 0.006*"wat_nice" + '
    '0.006*"good_night"')]

As you can see, there are 9 groups (labeled 0-8) of bigrams, and each group represents one topic. LDA does not name your topics for you, so you have to come up with your own interpretation of what each grouping of bigrams means. The numbers represent the probability that the given bigram will be used in that particular topic, and they are sorted in descending order, so the most weight is given to the first bigram.

The during-high-school corpus created these topics:
[(0,
  '0.136*"ivy_league" + 0.082*"social_science" + 0.082*"fuck_shit" + '
  '0.043*"joyce_thomas" + 0.040*"wo_able" + 0.037*"sat_prep" + '
  '0.033*"park_ranger" + 0.033*"cough_suppressant" + '
  '0.029*"competitive_college" + 0.028*"garbage_bag"'),
 (1,
  '0.048*"last_forever" + 0.039*"aquarium_docent" + 0.038*"rice_ball" + '
  '0.034*"new_hairstyle" + 0.034*"spur_moment" + 0.032*"real_sorrow" + '
  '0.030*"blood_pressure" + 0.029*"tank_top" + 0.029*"force_sneaker" + '
  '0.029*"pair_air"'),
 (2,
  '0.229*"fuck_fuck" + 0.077*"bad_attribute" + 0.055*"sat_ii" + '
  '0.047*"fish_hook" + 0.047*"professional_job" + 0.038*"grandmother_grave" + '
  '0.033*"regent_week" + 0.031*"ap_spanish" + 0.026*"basketball_board" + '
  '0.025*"ap_class"'),
 (3,
  '0.051*"math_team" + 0.039*"play_basketball" + 0.034*"new_york" + '
  '0.030*"eat_breakfast" + 0.028*"rite_aid" + 0.027*"aunt_uncle" + '
  '0.026*"staten_island" + 0.024*"speak_english" + 0.022*"dim_sum" + '
  '0.021*"book_buddy"'),
 (4,
  '0.080*"birthday_party" + 0.052*"horseshoe_crab" + 0.046*"fur_seal" + '
  '0.044*"touch_pool" + 0.038*"sea_star" + 0.034*"math_fair" + '
  '0.032*"lunar_new" + 0.028*"king_highway" + 0.028*"back_forth" + '
  '0.025*"random_guy"'),
 (5,
  '0.115*"red_envelope" + 0.059*"king_plaza" + 0.049*"everyone_else" + '
  '0.049*"living_room" + 0.045*"ice_cream" + 0.038*"dark_secret" + '
  '0.037*"coney_island" + 0.036*"good_luck" + 0.033*"ca_wait" + '
  '0.030*"linda_mindy"'),
 (6,
  '0.067*"ipod_touch" + 0.058*"nursing_home" + 0.046*"photography_club" + '
  '0.036*"ap_lit" + 0.036*"talked_iff" + 0.035*"eats_sleep" + '
  '0.035*"brother_wife" + 0.035*"dad_trimmed" + 0.035*"turned_head" + '
  '0.035*"home_depot"'),
 (7,
  '0.050*"summer_reading" + 0.042*"hot_pot" + 0.038*"cent_store" + '
  '0.038*"took_picture" + 0.035*"year_eve" + 0.033*"celebrate_birthday" + '
  '0.032*"least_worry" + 0.031*"hope_safe" + 0.029*"talk_trivial" + '
  '0.028*"long_thin"')]

The college corpus created these topics:
[(0,
  '0.086*"fall_asleep" + 0.082*"career_path" + 0.076*"host_mom" + '
  '0.039*"emergency_room" + 0.035*"chinese_american" + 0.035*"read_compass" + '
  '0.033*"fit_lifestyle" + 0.030*"stream_consciousness" + '
  '0.027*"everyone_else" + 0.024*"worth_gift"'),
 (1,
  '0.103*"high_school" + 0.074*"family_member" + 0.059*"study_abroad" + '
  '0.038*"pretty_cool" + 0.035*"certain_way" + 0.035*"paper_cup" + '
  '0.034*"month_ago" + 0.029*"un_poco" + 0.026*"food_court" + '
  '0.021*"por_dios"'),
 (2,
  '0.033*"dim_sum" + 0.029*"vanity_fancy" + 0.028*"spring_fling" + '
  '0.027*"graduate_school" + 0.023*"jodi_picoult" + 0.022*"grad_school" + '
  '0.022*"change_career" + 0.021*"nature_preserve" + 0.020*"apple_hill" + '
  '0.020*"education_system"'),
 (3,
  '0.061*"growing_vegetable" + 0.041*"going_hiking" + '
  '0.041*"peruvian_restaurant" + 0.041*"getting_married" + 0.041*"pio_pio" + '
  '0.041*"truly_understand" + 0.041*"bally_laundromat" + 0.030*"bite_nail" + '
  '0.026*"walked_around" + 0.023*"estas_imaginaciones"'),
 (4,
  '0.108*"actuarial_exam" + 0.039*"concrete_dream" + 0.033*"flawless_muse" + '
  '0.030*"teach_america" + 0.024*"new_shoe" + 0.024*"group_therapy" + '
  '0.021*"psychotic_episode" + 0.020*"mud_bath" + 0.019*"pas_exam" + '
  '0.018*"studying_p"'),
 (5,
  '0.058*"new_york" + 0.050*"estoy_aquí" + 0.041*"el_mundo" + '
  '0.037*"siempre_estoy" + 0.035*"city_madrid" + 0.032*"best_worst" + '
  '0.028*"e_tan" + 0.028*"en_mi" + 0.028*"follow_compass" + '
  '0.025*"existential_crisis"')]

I can tell that topic 4 for the college corpus makes some sense, in that it talks about career goals and a desire to attain things that aren't attainable, like being a "flawless_muse". It also has a lot of bigrams related to the actuarial exams, including Exam P.

The post college corpus created these topics:
[(0,
  '0.033*"mental_health" + 0.030*"back_ny" + 0.027*"need_relax" + '
  '0.026*"high_school" + 0.024*"taking_risk" + 0.023*"comfort_home" + '
  '0.022*"dating_jon" + 0.017*"give_cash" + 0.016*"job_description" + '
  '0.014*"dim_sum"'),
 (1,
  '0.041*"kaggle_competition" + 0.028*"imposter_syndrome" + '
  '0.028*"throughout_day" + 0.025*"weekly_spread" + 0.024*"wisdom_tell" + '
  '0.023*"make_difference" + 0.019*"gratitude_log" + 0.017*"pay_attention" + '
  '0.017*"afraid_letting" + 0.017*"sam_harris"'),
 (2,
  '0.107*"data_science" + 0.037*"data_scientist" + 0.026*"bullet_journal" + '
  '0.019*"believed_ability" + 0.016*"someone_else" + 0.015*"bring_joy" + '
  '0.015*"apply_job" + 0.014*"passed_away" + 0.013*"new_jersey" + '
  '0.013*"letting_go"')]

Topic 2 in the post-college corpus is a solid topic as well. It is clearly about data science, and I did intern in New Jersey as a data scientist.

Trends

I would say that I had a pretty school-centered childhood, a somewhat angsty high school phase, some sort of existential or career crisis point in college, and a period of accepting how things are post-college. This may just be tainted by my own perspective, of course. I could keep delving into the analysis of my life, but at this time, I don't want to. This diary project was interesting, and it gave me some insight into what I journal about and what I cared about in the distant and recent past.

This project aside, I used to think keeping a journal was a habit I never had. But seeing all these entries that built up over the years, I can say I do have a habit of journal writing, and it is a habit I would like to keep. It is a joy to read them once more.

Thursday, May 7, 2020

Diary Project Part 1: Diary Page Generator App

The History

I have kept a diary for 18 years. Although I don't write in my diary every day and some entries are missing, I recognized my diary collection as a treasure trove of the memories and events I have experienced in my life. It was also a great source of data for an NLP project.

I thought, what if I can program a computer to write like me? That was the start of my diary generator app.

A Moment

Here is an excerpt from a "page" of my generated diary:

"I am able to listen to my parents talk about things and my own needs. Honestly, I really want to be able to listen to them, but I don't listen very much. I think I really need to learn how to speak to my own needs. I think I know what I really want to know, but I haven't really communicated it to my parents very well. My dad doesn't like to talk to me so much, but he does talk to me occasionally. I think I am getting better at communicating with dad because I am reflecting back what I say to him. Honestly, I think I need to do more of the following: 1. Be more present. Be passionate about something. Be passionate about math. Talk to more people. Try to keep up with habits. 2. Respect my parents. Be authentic to people. 3. Give accurate and detailed descriptions of what you are doing. 4. Be up to speed with what you are doing. Learning is not just about finding what you are doing. It is also about being noticed."

I can say that there is an essence of me in this generated text. I truly want to be able to communicate with my parents better and I vaguely remember writing about something on that topic. And the numbered list of things "I think I need to do" is mostly true. The being "more present" part cannot be more true. But, aside from the meanings of the generated text, the writing style does feel like a diary entry of mine.

The Details

The heart of this text-generator is a GPT-2 model. I used Max Woolf's gpt-2-simple (https://github.com/minimaxir/gpt-2-simple) to fine-tune a GPT-2 model on my diary dataset.
According to Wikipedia, GPT-2 is a text-generating model developed by OpenAI, trained on a corpus called WebText, which contains over 8 million documents of text scraped from links shared on Reddit. It is great at generating fake news and was thought by parts of the academic community to pose a significant misuse risk. (This project was just for fun, and I don't intend to threaten anybody with my diary entries.)

I was able to find and type up 393 total diary entries of varying lengths and concatenate them into one .txt file. The GPT-2 model can be fine-tuned with a custom corpus supplied as a .txt file, with each sentence on a new line. I chose the "small" 124M-parameter model to fine-tune on my diary entries.

After fine-tuning the model for about 45 minutes, I was able to start generating text with it. Although my app only has one setting (the default length and temperature), it is possible to tweak the generation so that it produces more or less text with more or less "creativity" ("letting the model pick suboptimal predictions"). You can even enter a prefix to tell the generator to start from a certain letter, word, or phrase. But my app only has one button to run the model and generate text.
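For reference, a minimal fine-tune-and-generate script with gpt-2-simple looks roughly like the sketch below. The file name "diary.txt" and the step count are placeholders, not the exact values I used, and running this requires a GPU-backed environment like Colab:

```python
import gpt_2_simple as gpt2

# Download the "small" 124M-parameter base model (one-time step).
gpt2.download_gpt2(model_name="124M")

sess = gpt2.start_tf_sess()

# Fine-tune on the concatenated diary entries.
# "diary.txt" and steps=1000 are placeholder values.
gpt2.finetune(sess,
              dataset="diary.txt",
              model_name="124M",
              steps=1000)

# Generate text: length controls how much is produced, temperature
# controls the "creativity" of the sampling, and prefix seeds the
# start of the entry.
gpt2.generate(sess,
              length=300,
              temperature=0.7,
              prefix="Dear journal,")
```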

The fine-tuning was run in a Google Colab notebook, specifically with this notebook - https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce

After I downloaded the fine-tuned model on my computer, I created a flask app to run it. I then used Docker and Google Cloud to deploy my app.

This was my first time using Docker and Google Cloud.

You can access my diary generator app here - https://diary-khmwtmm5lq-uc.a.run.app/

And the code for it here - https://github.com/morningkaren/diary-generator

Stay tuned for Diary Project Part 2, where I delve into the topic modeling portion of my project.