ROMANIAN TOPIC MODELING – AN EVALUATION OF PROBABILISTIC VERSUS TRANSFORMER-BASED TOPIC MODELING FOR DOMAIN CATEGORIZATION
In: Revue roumaine des sciences techniques. Série électrotechnique et énergétique, Vol. 68, No. 3, pp. 295-300
A primary challenge for digital library systems digitizing millions of volumes is to automatically analyze and group the huge document collection into categories while identifying patterns and extracting the main themes. Topic modeling is a common method applied to such unlabeled texts. Given the wide range of datasets and evaluation criteria used by researchers, comparing the performance and outputs of existing unsupervised algorithms is a complex task. This paper introduces a domain-based topic modeling evaluation applied to Romanian documents. Several variants of Latent Dirichlet Allocation (LDA) combined with dimensionality reduction techniques were compared against Transformer-based topic models. Experiments were conducted on two datasets of varying text lengths: abstracts of novels and full-text documents. Evaluations relied on coherence and silhouette scores, while validation considered classification and clustering tasks. Results highlighted meaningful topics extracted from both datasets.
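To make the evaluation setup concrete, the sketch below illustrates, under stated assumptions, the kind of measurements the abstract refers to: an LDA model scored with topic coherence and a document clustering scored with the silhouette coefficient. It is not the authors' pipeline; the documents, the TF-IDF representation, and all parameter values are placeholder choices (the paper's Transformer-based models would instead operate on dense embeddings).

```python
# Minimal sketch (assumed setup, not the paper's actual pipeline):
# LDA coherence on one side, silhouette of a document clustering on the other.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Tokenized Romanian documents (placeholder data).
docs = [
    ["roman", "istoric", "razboi", "sat"],
    ["poezie", "natura", "iubire", "toamna"],
    ["calculator", "algoritm", "date", "model"],
    ["economie", "piata", "bani", "comert"],
]

# Probabilistic side: fit LDA and compute c_v topic coherence.
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)
coherence = CoherenceModel(
    model=lda, texts=docs, dictionary=dictionary, coherence="c_v"
).get_coherence()

# Clustering side: silhouette score over a vector representation of the documents
# (TF-IDF here as a stand-in for dense Transformer embeddings).
tfidf = TfidfVectorizer().fit_transform(" ".join(d) for d in docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
silhouette = silhouette_score(tfidf, labels)

print(f"c_v coherence: {coherence:.3f}, silhouette: {silhouette:.3f}")
```

Higher coherence indicates topics whose top words co-occur more often in the texts, while a silhouette score closer to 1 indicates well-separated document clusters; the paper uses both kinds of scores to compare the probabilistic and Transformer-based approaches.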