LDA Topic Modeling on Singapore Parliamentary Debate Records

This interactive topic visualization was created mainly using two wonderful Python packages, gensim and pyLDAvis. I started this mini-project to explore how much "bandwidth" Parliament spent on each issue. I wasn't too surprised that the bulk of the discussion was on General/Budget (Topic 1) and Nation Building (Topic 2). Hot-button issues such as Employment (Topic 3), Social Support/Services (Topic 4) and Healthcare (Topic 5) also featured prominently. The close proximity of these major topics at the top of the Distance Map goes to show how interrelated major national issues are.

Other issues that took up substantial bandwidth include Law and Legislation (Topic 7) and Crimes (Topic 13). Another interesting observation (if you are from Singapore) is that in the topic on Town Council, the list of most relevant terms includes "ahpetc", "fund", "politics" and "audit", which you normally wouldn't expect.

Play around below and let me know if you have interesting findings!


In [1]:
import logging
from gensim import corpora, models, similarities
import os 
from pprint import pprint
import json
import numpy as np
import warnings
import pyLDAvis
warnings.filterwarnings('ignore')
In [2]:
dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus('corpus.mm')
lda = models.ldamodel.LdaModel.load('lda.model')
print(dictionary)
print(corpus)
print(lda)
Dictionary(34940 unique tokens: [u'10,900', u'tajudeen', u'circuitri', u'fawk', u'1,800']...)
MmCorpus(3882 documents, 34940 features, 997986 non-zero entries)
LdaModel(num_terms=34940, num_topics=50, decay=0.5, chunksize=2000)

Only the official reports of the 12th and 13th Parliamentary Debates are used. The reports were "downloaded efficiently" from http://www.parliament.gov.sg/publications-singapore-official-reports

Typical cleaning was done:

  • Remove reports of Adjournment
  • Tokenize
  • Remove stop words
  • Convert to lowercase
  • Stemming (using the Snowball stemmer)
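
The cleaning steps above could be sketched roughly as follows. This is a minimal illustration rather than the code actually used: the "title" and "text" keys, the tiny stop word list, the tokenizing regex and the clean_reports helper are all assumptions made for the example. Tokens are lowercased before stop word matching for simplicity. Stemming is left as a pluggable callable; something like NLTK's SnowballStemmer("english").stem could be passed in.

```python
import re

# Tiny illustrative stop word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "is", "are", "for", "that"}

# Keeps alphanumeric runs and numbers such as "10,900" or "1.5" as single tokens.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:[,.][0-9]+)*")


def clean_reports(reports, stem=lambda token: token):
    """Turn raw reports into lists of cleaned tokens.

    `reports` is a list of dicts with hypothetical "title" and "text" keys.
    `stem` is any callable mapping a token to its stem (identity by default).
    """
    cleaned = []
    for report in reports:
        # Remove reports of Adjournment.
        if "adjournment" in report["title"].lower():
            continue
        # Convert to lowercase and tokenize.
        tokens = TOKEN_PATTERN.findall(report["text"].lower())
        # Remove stop words, then stem what is left.
        cleaned.append([stem(t) for t in tokens if t not in STOP_WORDS])
    return cleaned


# Hypothetical sample data, only to show the shape of the input and output.
reports = [
    {"title": "Budget Debate",
     "text": "The Budget allocates 10,900 places for healthcare workers."},
    {"title": "Adjournment Motion",
     "text": "That Parliament do now adjourn."},
]
docs = clean_reports(reports)
print(docs)
```

The resulting token lists are the kind of input that gensim's Dictionary and doc2bow expect.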

Topic modelling was carried out using the gensim LDA model.

In [3]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda, corpus, dictionary)
Out[3]:
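
As far as I understand, the size of each circle in the visualization reflects how prevalent the topic is across the corpus. A rough way to put a number on that "bandwidth" is to weight each document's topic mixture by its length and normalize. The sketch below is an approximation under that assumption, not pyLDAvis's own code. The topic_shares helper and the sample numbers are hypothetical.

```python
from collections import defaultdict


def topic_shares(doc_topics, doc_lengths):
    """Estimate each topic's share of the corpus.

    `doc_topics` holds one list of (topic_id, probability) pairs per document,
    the same shape that gensim's lda.get_document_topics(bow) returns.
    `doc_lengths` holds the token count of each document.
    """
    totals = defaultdict(float)
    for topics, length in zip(doc_topics, doc_lengths):
        for topic_id, prob in topics:
            # A long document contributes more tokens to its topics.
            totals[topic_id] += prob * length
    grand_total = sum(totals.values())
    return {topic_id: weight / grand_total
            for topic_id, weight in sorted(totals.items())}


# With the real model, the inputs could be built like this:
#   doc_topics = [lda.get_document_topics(bow, minimum_probability=0.0)
#                 for bow in corpus]
#   doc_lengths = [sum(count for _, count in bow) for bow in corpus]

# Hypothetical numbers, only to show the calculation.
doc_topics = [[(0, 0.75), (1, 0.25)], [(0, 0.5), (1, 0.5)], [(1, 1.0)]]
doc_lengths = [100, 200, 100]
shares = topic_shares(doc_topics, doc_lengths)
print(shares)
```

Note that gensim numbers topics from 0 in its own order, so these ids will not necessarily match the topic numbers shown in the visualization.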

I am looking for ideas and data to play around with. Please let me know if you have any. Do also let me know if you have any comments. Jovian: hojunhao(at)gmail(dot)com