Improvements to multilingual stop module

See original GitHub issue

This is a collection post for a number of issues that I ran into while trying out the new language-general stop module on some Classical Chinese data. I’ll probably have a go at many of them myself over the next couple of days, but would be interested to hear the opinion of @diyclassics in particular on whether my proposed solutions are ok!

  • StringStopList does not support some of the new CorpusStopList’s more advanced parameters, especially basis, even though the docs suggest that it might: https://github.com/cltk/cltk/blob/f84d35c20be5a087d9320ecda31ff56554358522/cltk/stop/stop.py#L73 Is there a reason why we would want to keep the StringStopList code rather than just replacing it with a call to CorpusStopList that passes a list containing the single string? Could we scrap the distinction between these two classes altogether by making a SimpleStopList class whose build_stoplist() is sensitive to whether it is passed a single string or a list/collection?
  • Improve support for non-alphabetic writing systems: currently, tokenization in both the StringStopList and the CorpusStopList is based on spaces, which means that for Chinese texts that are not word-tokenized one has to manually insert a space after every character. scikit-learn’s CountVectorizer and TfidfVectorizer both take an analyzer argument which can simply be set to "char" (instead of the default "word"), so for the CorpusStopList at least the best solution would probably be to also take an analyzer argument that gets passed on to the vectorizers.
  • Make remove_punctuation less language-specific: it currently can’t handle Chinese fullwidth punctuation marks such as 。、,!? This might be possible to do elegantly by replacing https://github.com/cltk/cltk/blob/f84d35c20be5a087d9320ecda31ff56554358522/cltk/stop/stop.py#L254 with a regex that matches on Unicode codepoint properties (not with the standard re library, but with another library such as regex; look for \p in its documentation to see what this would look like). If that’s not possible, the documentation should be updated to state that it is the user’s responsibility to clean up punctuation in advance.
  • Make remove_numbers less language-specific, using the same approach as for punctuation.
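A minimal stdlib sketch of the last three bullets (the helper names `tokenize` and `strip_punct_and_digits` are hypothetical, not CLTK API): char-level tokenization for unsegmented text, plus removal of punctuation and numbers based on Unicode general categories via `unicodedata`, which is the standard-library equivalent of the `\p{P}`/`\p{N}` property classes that the third-party regex library offers.

```python
import unicodedata

def tokenize(text, analyzer="word"):
    # "word" splits on whitespace (the current space-based behaviour);
    # "char" treats every character as a token, which is what passing
    # analyzer="char" to scikit-learn's CountVectorizer would do.
    return text.split() if analyzer == "word" else list(text)

def strip_punct_and_digits(text):
    # Drop any codepoint whose Unicode general category starts with
    # P (punctuation) or N (number); this catches fullwidth marks
    # like 。、,? that a hard-coded ASCII punctuation list misses.
    return "".join(ch for ch in text
                   if unicodedata.category(ch)[0] not in ("P", "N"))

sample = "子曰:學而時習之,不亦說乎?"
clean = strip_punct_and_digits(sample)
print(tokenize(clean, analyzer="word"))  # a single whitespace-free token
print(tokenize(clean, analyzer="char"))  # one token per character
```

For the CorpusStopList itself, simply forwarding an analyzer argument to the underlying vectorizers would make this configurable without any custom tokenization code.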

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
kevinstadler commented, Mar 18, 2018

Thanks, that makes more sense now!

  • Regarding the StringStopList, I would just scrap this class altogether, since its functionality is completely subsumed by the CorpusStopList: all of the CorpusStopList calculations can be applied to a single string as well, simply by treating it as a document collection of one document (or am I wrong?). So my suggestion was to rename the CorpusStopList to SimpleStopList (in the same way that Abstract* classes often come with Simple* skeleton implementations in OO paradigms) and have its build_stoplist() function auto-box single strings into a collection of one document under the hood. I understand that there might be concerns about overloading the method signature with a string/collection here, and I’m not sure what the convention for that is throughout the rest of the code base; at the moment, though, there is a lot of duplicate code between the two existing classes.
  • Ah yes, the class hierarchy makes perfect sense for language-specific processing of course; in that case I’ll just subclass the CorpusStopList and override the punctuation methods. I still think it would make sense to expose the analyzer argument at the top level, though, as using the word or the character as the basis is not actually language-specific: for unsegmented Chinese one will want to look at individual characters, but if the text is segmented into words (with extra spaces added around word boundaries), then analyzer="word" makes sense for logographic scripts as well.
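The auto-boxing idea in the first bullet could look roughly like this (a hypothetical SimpleStopList skeleton with a toy frequency count standing in for the real corpus-level calculations; this is not the actual CLTK class):

```python
class SimpleStopList:
    """Hypothetical merged stoplist class: a lone string is boxed
    into a one-document collection, so the corpus-level calculations
    apply uniformly to both inputs."""

    def build_stoplist(self, docs, size=100):
        if isinstance(docs, str):
            docs = [docs]  # treat a single string as a corpus of one
        # Placeholder for the real basis/frequency/tf-idf machinery:
        # here we just rank tokens by raw frequency across the corpus.
        counts = {}
        for doc in docs:
            for tok in doc.split():
                counts[tok] = counts.get(tok, 0) + 1
        return sorted(counts, key=counts.get, reverse=True)[:size]
```

With this shape, `build_stoplist("a b a")` and `build_stoplist(["a b a"])` give the same result, which is the behaviour the auto-boxing proposal is after.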
0 reactions
diyclassics commented, May 23, 2018

“Can we scrap the distinction between these two classes altogether by making a SimpleStopList class whose build_stoplist() is sensitive to whether it is passed a single string or a list/collection?” Testing this out this morning, and I’m more and more inclined to agree, especially since the sklearn vectorizers work with a document list of length 1.
