Data preprocessing for 20newsgroups dataset
Description
Hi, guys. First of all, thanks for sharing your work with us. I'd like to reproduce the experiments from the CTM paper on the 20newsgroups
dataset. I have some questions:
- are there any additional preprocessing steps other than those reported in the paper ("removing digits, punctuation, stopwords, and infrequent words"), e.g. removal of email headers, email addresses, etc.?
- is there any data partition (e.g. train/validation), or are the reported metrics calculated on the full training data?
- where does the count of 18,173 documents come from? I checked the data from both sklearn and the URL in the paper and both contain 18,846 docs (quick check below).
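For reference, this is the quick check I ran with scikit-learn:

```python
from sklearn.datasets import fetch_20newsgroups

# Full corpus, train and test splits combined
raw = fetch_20newsgroups(subset="all")
print(len(raw.data))  # 18846
```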
Thanks in advance!
Yes, it's a little bit strange. A topic model is different from a normal clustering task: it can run inference on an unseen dataset. I have read some papers that split the evaluation into transductive and inductive settings; the former uses the full dataset, while the latter works like a classification task, i.e. train the model on a training set and then run inference on an unseen test set. See https://ojs.aaai.org/index.php/AAAI/article/view/6152.
Hello @marcospiau!
When we remove infrequent words, we already remove most things like URLs/addresses.
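If you want to roll your own version of that step, here is a minimal sketch of such a pipeline (the stopword list and the count threshold below are assumptions, not necessarily the exact settings used for the paper):

```python
import string
from collections import Counter

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def preprocess(docs, min_count=5):
    """Remove digits, punctuation, stopwords, and infrequent words."""
    table = str.maketrans("", "", string.punctuation + string.digits)
    tokenized = [
        [w for w in doc.lower().translate(table).split()
         if w not in ENGLISH_STOP_WORDS]
        for doc in docs
    ]
    # Rare tokens (mangled URLs, e-mail addresses, typos, ...) mostly fall
    # below the count threshold and disappear here.
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in tokenized]
```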
You can get the data we used here; it might be easier for you to just use this if you need to reproduce:
https://raw.githubusercontent.com/silviatti/preprocessed_datasets/master/ctm/20news_unprep.txt
https://raw.githubusercontent.com/silviatti/preprocessed_datasets/master/ctm/20news_prep.txt
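A quick way to load those files (assuming one document per line):

```python
import requests

BASE = "https://raw.githubusercontent.com/silviatti/preprocessed_datasets/master/ctm"

# Download both the unpreprocessed and the preprocessed version.
unprep = requests.get(f"{BASE}/20news_unprep.txt").text.splitlines()
prep = requests.get(f"{BASE}/20news_prep.txt").text.splitlines()
print(len(unprep), len(prep))
```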
The reported metrics are computed on the topic lists, so there's currently no validation involved (we support early stopping, but that functionality was not implemented at the time of the paper).
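If it helps, a generic way to score topic word lists is NPMI coherence via gensim; this is just a sketch, not necessarily the exact scoring script used for the paper:

```python
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# `topics` is a list of top-word lists from the trained model,
# `prep` the preprocessed corpus loaded above.
texts = [doc.split() for doc in prep]
dictionary = Dictionary(texts)
npmi = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                      coherence="c_npmi", topn=10).get_coherence()
print(npmi)
```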
Some documents have to be removed after pre-processing because they become empty, which is why the final document count is lower than the raw 18,846.
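That filtering step looks roughly like this (a sketch, reusing `raw` and `preprocess` from the snippets above):

```python
# Drop documents that end up with no tokens after preprocessing;
# this is where the count drops below the raw 18,846.
tokenized_docs = preprocess(raw.data)
non_empty = [toks for toks in tokenized_docs if toks]
print(len(non_empty))
```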
Let me know if you have more questions 😃