Data preprocessing for 20newsgroups dataset

Description

Hi, guys. First of all, thanks for sharing your work with us. I’d like to reproduce the experiments from the CTM paper using the 20newsgroups dataset. I have some questions:

  • Are there any additional preprocessing steps beyond those reported in the paper (“removing digits, punctuation, stopwords, and infrequent words”), e.g. removal of email headers, email addresses, etc.?
  • Is there any data partition (e.g. train/validation), or are the reported metrics calculated on the full training data?
  • Where does the count of 18,173 documents come from? I checked the data from both sklearn and the URL in the paper, and both contain 18,846 docs.

Thanks in advance!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments:10 (3 by maintainers)

Top GitHub Comments

2 reactions
A11en0 commented, Dec 14, 2021

Thanks a lot for your quick response! But if there is no split, would the model overfit the datasets?

Yes, it’s a little bit strange. A topic model differs from an ordinary clustering task in that it can run inference on unseen data. I have read papers that divide evaluation into transductive and inductive settings: the former uses the full dataset, while the latter resembles a classification task, i.e. you first train a model on the training set and then run inference on an unseen test set. See https://ojs.aaai.org/index.php/AAAI/article/view/6152.
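
To make the distinction concrete, here is a minimal sketch of how the two evaluation settings partition a corpus. It is an illustration only: the function names, the 20% test fraction, and the fixed seed are assumptions and come neither from the thread nor from the linked paper.

```python
import random

def transductive_split(docs):
    """Transductive setting: fit and evaluate on the same full corpus."""
    return list(docs), list(docs)

def inductive_split(docs, test_fraction=0.2, seed=0):
    """Inductive setting: fit on one portion, run inference on held-out documents."""
    indices = list(range(len(docs)))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility
    n_test = max(1, int(len(docs) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [d for i, d in enumerate(docs) if i not in test_idx]
    test = [d for i, d in enumerate(docs) if i in test_idx]
    return train, test

docs = [f"doc {i}" for i in range(10)]
fit_docs, eval_docs = transductive_split(docs)  # both hold all 10 documents
train_docs, test_docs = inductive_split(docs)   # disjoint train and test portions
```

In the transductive case the topic model never sees a held-out document, so the reported metrics describe the fitted corpus rather than generalization.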

1 reaction
vinid commented, Oct 19, 2021

Hello @marcospiau!

  1. If you are trying to replicate our results, no: you only need to apply those steps. We used the following setup to remove text that was not useful:

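from sklearn.datasets import fetch_20newsgroups

# `categories` is not defined in the thread; it is a list of newsgroup names, or None for all 20.
data = fetch_20newsgroups(subset='all',
                          remove=('headers', 'footers', 'quotes'),
                          categories=categories)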

When we remove infrequent words, we already remove most things like URLs and addresses.

You can get the data we used here; it might be easier to use these files directly if you need to reproduce the results:

https://raw.githubusercontent.com/silviatti/preprocessed_datasets/master/ctm/20news_unprep.txt
https://raw.githubusercontent.com/silviatti/preprocessed_datasets/master/ctm/20news_prep.txt

  2. Reported metrics are computed on the topic lists, so there is currently no validation involved. (We support early stopping, but that functionality was not implemented at the time of the paper.)

  3. Some documents had to be removed after preprocessing because they became empty, which accounts for the lower document count.
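
As a rough illustration of how the steps named in the paper ("removing digits, punctuation, stopwords, and infrequent words") can leave some documents empty, here is a minimal standard-library sketch. It is not the authors' pipeline: the tiny stopword list, the `min_doc_freq` threshold, and the function names are all assumptions made for the example, since the thread does not state the actual values.

```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "at"}

# Map every punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def tokenize(text):
    """Lowercase, strip digits and punctuation, split on whitespace, drop stopwords."""
    text = re.sub(r"\d+", " ", text.lower())
    text = text.translate(PUNCT_TABLE)
    return [tok for tok in text.split() if tok not in STOPWORDS]

def preprocess(docs, min_doc_freq=2):
    """Remove infrequent words, then drop documents that became empty.

    min_doc_freq is an assumed threshold: a word is kept only if it
    appears in at least that many documents.
    """
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    kept_docs, kept_indices = [], []
    for i, toks in enumerate(tokenized):
        toks = [t for t in toks if doc_freq[t] >= min_doc_freq]
        if toks:  # documents with no remaining tokens are dropped
            kept_docs.append(" ".join(toks))
            kept_indices.append(i)
    return kept_docs, kept_indices

docs = [
    "The GPU driver crashed at 10:30!",
    "GPU driver update fixed the crash.",
    "12345 ??? http://example.com/x",
    "Is the driver stable?",
]
prepped, kept = preprocess(docs)
print(len(docs), "->", len(prepped))  # the third document is dropped
```

In this toy corpus the third document consists only of digits, punctuation, and URL fragments that occur nowhere else, so nothing survives the infrequent-word filter and the document is removed. The same mechanism would explain a corpus shrinking after preprocessing.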

Let me know if you have more questions 😃
