Topic modeling service

Modeling text topics using Latent Dirichlet Allocation.

Models

Text

  • fields:
    • created: text creation date;
    • true_similarity: float, LDA computed similarity between article headline and text;
    • false similarity: float, LDA computed similarity between article text and a reference article with a rare topic.
    • language: string of length 2 with a language code;
    • headline: article headline cleared by checker.CorrChecker;
    • cleared_text: article text cleared by checker.CorrChecker;
    • message_id: unique text identifier, integer in hex form, length 32.

Submodules

lda

Implements the interface for the corpora (collections of texts) and LDA model. The LDA algorithm implements clustering of texts for a predetermined number of clusters-topics. For each cluster, a set of keywords is formed, which is based on the Dirichlet distribution. The computation of the similarity of the texts is carried out on the basis of the constructed vectors of text distribution by keywords with a help of Jensen-Shannon distance.

Fitted models overview:

Model Number of topics Log Likelihood Perplexity
English 10 -30888501.372 3381.247
Russian 10 -6091576.967 3118.759
Ukrainian 10 -6587708.423 2348.799