Improvements to multilingual stop module

See original GitHub issue

This is a collection post for a number of issues that I ran into while trying out the new language-general stop module on some Classical Chinese data. I’ll probably have a go at many of them myself over the next couple of days, but would be interested to hear the opinion of @diyclassics in particular on whether my proposed solutions are ok!

  • StringStopList does not support some of the new CorpusStopList’s more advanced parameters, especially basis, even though the docs suggest that it might: https://github.com/cltk/cltk/blob/f84d35c20be5a087d9320ecda31ff56554358522/cltk/stop/stop.py#L73 Is there a reason why we would want to keep the StringStopList code rather than just replacing it with a call to CorpusStopList that passes a list containing the single string? Could we scrap the distinction between these two classes altogether by making a SimpleStopList class whose build_stoplist() is sensitive to whether it is passed a single string or a list/collection?
  • Improve support for non-alphabetic writing systems: currently, tokenization in both the StringStopList and the CorpusStopList is based on spaces, which means that for Chinese texts that are not word-tokenized one has to manually insert a space after every character. scikit-learn’s CountVectorizer and TfidfVectorizer both take an analyzer argument which can simply be set to "char" (instead of the default "word"), so for the CorpusStopList at least the best solution would probably be to also take an analyzer argument that gets passed on to the vectorizers.
  • Make remove_punctuation less language-specific: it currently can’t handle Chinese fullwidth punctuation marks such as 。、,!? This might be possible to do elegantly by replacing https://github.com/cltk/cltk/blob/f84d35c20be5a087d9320ecda31ff56554358522/cltk/stop/stop.py#L254 with a regex that matches on Unicode codepoint properties (not with the standard re library, but with another library such as regex; look for \p in its documentation to see what this would look like). If that’s not possible, the documentation should be updated to state that it is the user’s responsibility to clean up punctuation in advance.
  • Make remove_numbers less language-specific, using the same approach as for punctuation.
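A minimal stdlib sketch of the last three bullets (the helper names `tokenize` and `strip_punct_and_digits` are hypothetical, not CLTK API): char-level tokenization for unsegmented text, plus removal of punctuation and numbers based on Unicode general categories via `unicodedata`, which is the standard-library equivalent of the `\p{P}`/`\p{N}` property classes that the third-party regex library offers.

```python
import unicodedata

def tokenize(text, analyzer="word"):
    # "word" splits on whitespace (the current space-based behaviour);
    # "char" treats every character as a token, which is what passing
    # analyzer="char" to scikit-learn's CountVectorizer would do.
    return text.split() if analyzer == "word" else list(text)

def strip_punct_and_digits(text):
    # Drop any codepoint whose Unicode general category starts with
    # P (punctuation) or N (number); this catches fullwidth marks
    # like 。、,? that a hard-coded ASCII punctuation list misses.
    return "".join(ch for ch in text
                   if unicodedata.category(ch)[0] not in ("P", "N"))

sample = "子曰:學而時習之,不亦說乎?"
clean = strip_punct_and_digits(sample)
print(tokenize(clean, analyzer="word"))  # a single whitespace-free token
print(tokenize(clean, analyzer="char"))  # one token per character
```

For the CorpusStopList itself, simply forwarding an analyzer argument to the underlying vectorizers would make this configurable without any custom tokenization code.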

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
kevinstadler commented, Mar 18, 2018

Thanks, that makes more sense now!

  • Regarding the StringStopList, I would just scrap this class altogether, since its functionality is completely subsumed by the CorpusStopList: all of the CorpusStopList calculations can be applied to a single string as well, simply by treating it as a document collection of one document (or am I wrong?). So my suggestion was to rename the CorpusStopList to SimpleStopList (in the same way that Abstract* classes often come with Simple* skeleton implementations in OO paradigms) and have its build_stoplist() function auto-box single strings into a collection of one document under the hood. I understand that there might be concerns about overloading the method signature with a string/collection here, and I’m not sure what the convention for that is throughout the rest of the code base; at the moment, though, there is a lot of duplicate code between the two existing classes.
  • Ah yes, the class hierarchy makes perfect sense for language-specific processing of course; in that case I’ll just subclass the CorpusStopList and override the punctuation methods. I still think it would make sense to expose the analyzer argument at the top level, though, as using the word or the character as the basis is not actually language-specific: for unsegmented Chinese one will want to look at individual characters, but if the text is segmented into words (with extra spaces added around word boundaries), then analyzer="word" makes sense for logographic scripts as well.
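The auto-boxing idea in the first bullet could look roughly like this (a hypothetical SimpleStopList skeleton with a toy frequency count standing in for the real corpus-level calculations; this is not the actual CLTK class):

```python
class SimpleStopList:
    """Hypothetical merged stoplist class: a lone string is boxed
    into a one-document collection, so the corpus-level calculations
    apply uniformly to both inputs."""

    def build_stoplist(self, docs, size=100):
        if isinstance(docs, str):
            docs = [docs]  # treat a single string as a corpus of one
        # Placeholder for the real basis/frequency/tf-idf machinery:
        # here we just rank tokens by raw frequency across the corpus.
        counts = {}
        for doc in docs:
            for tok in doc.split():
                counts[tok] = counts.get(tok, 0) + 1
        return sorted(counts, key=counts.get, reverse=True)[:size]
```

With this shape, `build_stoplist("a b a")` and `build_stoplist(["a b a"])` give the same result, which is the behaviour the auto-boxing proposal is after.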
0 reactions
diyclassics commented, May 23, 2018

“Can we scrap the distinction between these two classes altogether by making a SimpleStopList class whose build_stoplist() is sensitive to whether it is passed a single string or a list/collection?” Testing this out this morning, and I’m more and more inclined to agree, especially since the sklearn vectorizers work with a document list of length 1.
