Data preprocessing for 20newsgroups dataset
Description
Hi, guys. First of all, thanks for sharing your work with us. I'd like to reproduce the experiments from the CTM paper on the 20newsgroups
dataset. I have some questions:
- are there any additional preprocessing steps other than those reported in the paper ("removing digits, punctuation, stopwords, and infrequent words"), e.g. removal of email headers, email addresses, etc.?
- is there any data partition (e.g. train/validation), or are the reported metrics calculated on the full training data?
- where does the count of 18,173 documents come from? I checked the data from both sklearn and the URL in the paper and both contain 18,846 docs (quick check below).
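For reference, this is the quick check I ran with scikit-learn:

```python
from sklearn.datasets import fetch_20newsgroups

# Full corpus, train and test splits combined
raw = fetch_20newsgroups(subset="all")
print(len(raw.data))  # 18846
```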
Thanks in advance!
Yes, it's a little bit strange. A topic model is different from a normal clustering task: it can run inference on an unseen dataset. I have read some papers that split the evaluation into transductive and inductive settings; the former uses the full dataset, while the latter works like a classification task, i.e. train the model on a training set and then run inference on an unseen test set. See https://ojs.aaai.org/index.php/AAAI/article/view/6152.
Hello @marcospiau!
When we remove infrequent words, we already remove most things like URLs/addresses.
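If you want to roll your own version of that step, here is a minimal sketch of such a pipeline (the stopword list and the count threshold below are assumptions, not necessarily the exact settings used for the paper):

```python
import string
from collections import Counter

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def preprocess(docs, min_count=5):
    """Remove digits, punctuation, stopwords, and infrequent words."""
    table = str.maketrans("", "", string.punctuation + string.digits)
    tokenized = [
        [w for w in doc.lower().translate(table).split()
         if w not in ENGLISH_STOP_WORDS]
        for doc in docs
    ]
    # Rare tokens (mangled URLs, e-mail addresses, typos, ...) mostly fall
    # below the count threshold and disappear here.
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in tokenized]
```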
You can get the data we used here; it might be easier for you to just use this if you need to reproduce:
https://raw.githubusercontent.com/silviatti/preprocessed_datasets/master/ctm/20news_unprep.txt
https://raw.githubusercontent.com/silviatti/preprocessed_datasets/master/ctm/20news_prep.txt
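A quick way to load those files (assuming one document per line):

```python
import requests

BASE = "https://raw.githubusercontent.com/silviatti/preprocessed_datasets/master/ctm"

# Download both the unpreprocessed and the preprocessed version.
unprep = requests.get(f"{BASE}/20news_unprep.txt").text.splitlines()
prep = requests.get(f"{BASE}/20news_prep.txt").text.splitlines()
print(len(unprep), len(prep))
```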
The reported metrics are computed on the topic lists, so there's currently no validation involved (we support early stopping, but that functionality was not implemented at the time of the paper).
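If it helps, a generic way to score topic word lists is NPMI coherence via gensim; this is just a sketch, not necessarily the exact scoring script used for the paper:

```python
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# `topics` is a list of top-word lists from the trained model,
# `prep` the preprocessed corpus loaded above.
texts = [doc.split() for doc in prep]
dictionary = Dictionary(texts)
npmi = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                      coherence="c_npmi", topn=10).get_coherence()
print(npmi)
```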
Some documents have to be removed after pre-processing because they become empty, which is why the final document count is lower than the raw 18,846.
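That filtering step looks roughly like this (a sketch, reusing `raw` and `preprocess` from the snippets above):

```python
# Drop documents that end up with no tokens after preprocessing;
# this is where the count drops below the raw 18,846.
tokenized_docs = preprocess(raw.data)
non_empty = [toks for toks in tokenized_docs if toks]
print(len(non_empty))
```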
Let me know if you have more questions 😃