Improvements to multilingual stop module
This is a collection post for a number of issues that I ran into while trying out the new language-general stop module on some Classical Chinese data. I’ll probably have a go at many of them myself over the next couple of days, but would be interested to hear the opinion of @diyclassics in particular on whether my proposed solutions are OK!
- `StringStopList` does not support some of the new `CorpusStopList`’s more advanced parameters, especially `basis`, even though the docs suggest that it might: https://github.com/cltk/cltk/blob/f84d35c20be5a087d9320ecda31ff56554358522/cltk/stop/stop.py#L73 Is there a reason why we would want to keep the `StringStopList` code and not just replace it with a simple call to `CorpusStopList` passing a list with the single string? Can we scrap the distinction between these two classes altogether by making a `SimpleStopList` class whose `build_stoplist()` is sensitive to whether it is passed a single string or a list/collection?
- improve support for non-alphabetic writing systems: currently, tokenization of both the `StringStopList` and `CorpusStopList` is done based on spaces, which means that for Chinese texts that are not word-tokenized one has to manually put a space after every character. scikit’s `CountVectorizer` and `TfidfVectorizer` both take an `analyzer` argument which can simply be set to `char` (instead of the default `word`), so for the `CorpusStopList` at least the best solution would probably be to also take an `analyzer` argument that gets passed on to the vectorizers.
- make `remove_punctuation` less language specific, for example it can’t currently handle the Chinese fullwidth punctuation markers such as 。、,!? This might be possible to do elegantly by replacing https://github.com/cltk/cltk/blob/f84d35c20be5a087d9320ecda31ff56554358522/cltk/stop/stop.py#L254 with a regex that can match/replace on Unicode codepoint properties (not with the default `re` library, but with another library like `regex`, look for `\p` in the documentation to see what this would look like). If it’s not possible then the documentation should be updated to state that it’s the user’s responsibility to clean up punctuation in advance.
- make `remove_numbers` less language specific. Same approach as for punctuation.
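To make the last two points concrete, here is a minimal sketch of script-independent punctuation and number stripping. It is not the cltk code: the function names `strip_by_category`, `strip_punctuation` and `strip_numbers` are hypothetical, and instead of the third-party `regex` library with `\p{P}` / `\p{N}` it uses the standard library's `unicodedata` module, which exposes the same Unicode general categories.

```python
import unicodedata


def strip_by_category(text, prefixes):
    """Drop every character whose Unicode general category starts with
    one of the given prefixes (e.g. "P" for punctuation, "N" for numbers)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(prefixes)
    )


def strip_punctuation(text):
    # "P" covers all punctuation categories (Po, Ps, Pe, Pd, ...),
    # including fullwidth marks. Symbols such as "$" or "+" are in the
    # "S" categories and are deliberately left alone in this sketch.
    return strip_by_category(text, ("P",))


def strip_numbers(text):
    # "N" covers decimal digits, letter numbers and other numerics
    # (Nd, Nl, No). Chinese numerals such as 一二三 are classified as
    # letters (Lo), so they are NOT removed by this approach.
    return strip_by_category(text, ("N",))


if __name__ == "__main__":
    print(strip_punctuation("學而時習之,不亦說乎?"))
    print(strip_numbers("abc123123"))
```

One caveat with either approach: numerals written with ordinary Han characters are not in a number category, so a language-specific list would still be needed for those.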
Issue Analytics
- Created 6 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
Thanks, that makes more sense now!

- `StringStopList`, I would just scrap this class altogether, since its functionality is completely subsumed by the `CorpusStopList`: the `CorpusStopList` calculations can all be applied to a single string as well, simply by treating it as a document collection of 1 document, or am I wrong? So my suggestion was to rename the `CorpusStopList` to `SimpleStopList` (in the same way that `Abstract*` classes often come with `Simple*` skeleton implementations in OO paradigms) and have its `build_stoplist()` function auto-box single strings into a collection of one document under the hood. I understand that there might be concerns regarding overloading the method signature with a string/collection here, and I’m not sure what the convention for that is through the rest of the code base; it is just that at the moment there is a lot of duplicate code between the existing classes.
- [...] `CorpusStopList` and override the punctuation methods. I still think it would make sense to expose the `analyzer` argument at the top level though, as using the word or character as a basis is not actually language specific. For unsegmented Chinese one will want to look at individual characters; if, however, the text is segmented into words (with extra spaces added around word boundaries), then using `analyzer="word"` would make sense for logographic scripts as well.

“Can we scrap the distinction between these two classes altogether by making a SimpleStopList class whose build_stoplist() is sensitive to whether it is passed a single string or a list/collection?”

Testing this out this morning, and I’m more and more inclined to agree, esp. since the sklearn vectorizers work with a document list of length 1.
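A rough sketch of what the merged class discussed in this thread could look like is given below. Everything in it is an assumption for illustration: the `SimpleStopList` name comes from the proposal, but the `size` parameter, the `_tokenize` helper and the plain frequency count (standing in for the sklearn vectorizers and the `basis` options) are hypothetical and not taken from cltk.

```python
from collections import Counter


class SimpleStopList:
    """Hypothetical merged stop list class; not the cltk implementation."""

    def build_stoplist(self, texts, size=100, analyzer="word"):
        # Auto-box a single string into a collection of one document,
        # so callers can pass either a string or a list of strings.
        if isinstance(texts, str):
            texts = [texts]

        counts = Counter()
        for doc in texts:
            counts.update(self._tokenize(doc, analyzer))

        # Plain frequency ranking, used here as a stand-in for the
        # vectorizer-based calculations.
        return [token for token, _ in counts.most_common(size)]

    @staticmethod
    def _tokenize(doc, analyzer):
        if analyzer == "word":
            # Whitespace-delimited tokens (pre-segmented text).
            return doc.split()
        if analyzer == "char":
            # One token per character, for unsegmented text. Whitespace
            # is dropped here, which is a simplification of this sketch.
            return [ch for ch in doc if not ch.isspace()]
        raise ValueError("analyzer must be 'word' or 'char'")


if __name__ == "__main__":
    stops = SimpleStopList()
    print(stops.build_stoplist("之乎之者之也乎", size=2, analyzer="char"))
    print(stops.build_stoplist(["a b a", "a c b"], size=2))
```

The `isinstance(texts, str)` check is the only place where the string/collection distinction survives, which is what removes the duplicated code between the two existing classes in this sketch.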