
Language detection interface


Problem

While the original language checker is absolutely brilliant, it fails on small ciphertexts and on those with high entropy. An AI solution would be cool, but would be overkill for rigid data structures such as JSON or CTF flags.

Solution

We present an interface for defining a custom candidate acceptor:

from abc import ABC, abstractmethod

class LanguageChecker(ABC):
    @staticmethod
    @abstractmethod
    def getArgs(**kwargs) -> dict: pass

    @abstractmethod
    def checkLanguage(self, text: str) -> bool: pass

# A user-supplied module exposes an instance of a concrete subclass:
ciphey_acceptor = LanguageCheckerDerived()

This can be passed as an argument to each cracker. Since a basic chi-squared filtration is applied by the core to the candidates, relatively few should make it through. This allows the checkLanguage method to be a halt condition: if it returns true on any candidate, we stop.
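To make the halt-condition idea concrete, here is a minimal sketch of how a custom acceptor might plug into the interface above. `RegexChecker` and `first_accepted` are illustrative names invented here, not part of Ciphey:

```python
import re
from abc import ABC, abstractmethod

class LanguageChecker(ABC):
    @staticmethod
    @abstractmethod
    def getArgs(**kwargs) -> dict: ...

    @abstractmethod
    def checkLanguage(self, text: str) -> bool: ...

# Hypothetical concrete acceptor: accepts any candidate matching a flag regex.
class RegexChecker(LanguageChecker):
    def __init__(self, pattern: str):
        self.pattern = re.compile(pattern)

    @staticmethod
    def getArgs(**kwargs) -> dict:
        return {"regex": kwargs.get("regex", ".*")}

    def checkLanguage(self, text: str) -> bool:
        return self.pattern.fullmatch(text) is not None

def first_accepted(candidates, checker: LanguageChecker):
    """Halt condition: return the first candidate the checker accepts."""
    for candidate in candidates:
        if checker.checkLanguage(candidate):
            return candidate
    return None

ciphey_acceptor = RegexChecker(r"flag\{[^}]+\}")
```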

However, a user shouldn't be expected to write such code for all text; that would be absurd.

Instead, we provide a few language detection interfaces that can be selected by the user instead of them writing their own. My suggestion would be the following:

Name       Args    Description
brandon    None    Current system, for long natural text
neuralnet  path    A neural net for text with deep structure, or smaller text
regex      regex   A regular expression for very simple structures
user       None    Prompts the user for each candidate

I also put forward the following flags:

Flag                Description
-a <name>           Use the internal language detector <name> (maybe with a configurable path)
-A <path>           Load a Python script/module at <path> containing the ciphey_acceptor variable
-p <param>=<value>  Set the kwarg <param> to <value> when checkLanguage is called
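A sketch of how these proposed flags could be wired up with argparse; the flag names follow the table above, but the destination names and the `parse_params` helper are hypothetical, not an existing Ciphey implementation:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-a", dest="detector", default="brandon",
                    help="internal language detector to use")
parser.add_argument("-A", dest="acceptor_path",
                    help="path to a module exposing ciphey_acceptor")
parser.add_argument("-p", dest="params", action="append", default=[],
                    metavar="PARAM=VALUE",
                    help="kwargs forwarded to the checker")

def parse_params(pairs):
    # Turn ["regex=flag{.*}"] into {"regex": "flag{.*}"}.
    return dict(p.split("=", 1) for p in pairs)

args = parser.parse_args(["-a", "regex", "-p", "regex=flag{.*}"])
kwargs = parse_params(args.params)
```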

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

1 reaction
bee-san commented, Jun 17, 2020

Thoughts on synonyms

The synonyms appear to be ordered, possibly by the frequency of the word itself, which is a feature from Google Books. If the exact synonyms are deterministic and returned in order, I would assume “brilliant” will always map to “good”. This makes my job easier.

But I cannot guarantee this for every possible text.

If I have gotten this wrong and it is in fact non-deterministic and they are not perfect mappings of each other, we could generate a list of synonyms per word and use the dictionary lookup to see if that word appears.

If we have to do that, I would argue we wouldn't need synonym checking at all and should remove it. I will have to explore further to see which is the case and whether or not it is worth it.

It may be possible for me to create a dictionary of synonyms, and store it in cipheyDists. For example, when I convert “brilliant” I can store it as “good” in the dict format {"brilliant": "good"}. That way we have a 1-to-1 mapping of words to synonyms.

To find out if a word is in the dictionary, we would first look up its stored synonym using the dict from the last paragraph, and then search for that value in the dictionary dict.
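The two-step lookup described above can be sketched in a few lines. The words in `synonym_map` and `dictionary` here are toy examples, not data from cipheyDists:

```python
# Hypothetical sketch: first map a word to its stored canonical synonym,
# then check that value against the plaintext dictionary.
synonym_map = {"brilliant": "good", "superb": "good"}  # 1-to-1 word -> synonym
dictionary = {"good", "bad", "cat"}                    # plaintext word list

def in_dictionary(word: str) -> bool:
    canonical = synonym_map.get(word, word)  # fall back to the word itself
    return canonical in dictionary
```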

Point 1

I agree, I messed up - my bad!

Point 2

It is a dictionary, but it also uses spooky NLP magic to make it efficient. I.e. it isn't a simple lookup table; it exploits how the English language is made and formed to provide a more efficient table.

For what it's worth, we're using the smallest model available (the smallest English model), with no plans to expand to the medium or large versions.

Point 3

I am thinking of something like this:

accept = []
if lemmatization(text):
    accept.append(1)
if word_endings(text):   # simple check for "ing", "en" on the end of a word
    accept.append(1)
if synonyms(text):
    accept.append(1)

# phase 2 is entered if len(accept) >= 2

So we have lots of little checks, but together they should give a relatively clear picture on whether it’s worth it to go to phase 2 or not.

Also, I will make it so that as soon as the list hits 2 (or some amount X), it will go to phase 2. I don't want to wait around when we've already accepted that it's going to phase 2. I will use a while loop for this, like:

funcs = [self.lem, word.ends, word.sym]
i = 0
while len(accept) < 2 and i < len(funcs):
    if funcs[i]():
        accept.append(1)
    i += 1

This way we can easily add more small checks if we wanted to change something, while maintaining the structure of the program.

The funcs list is ordered in terms of fastest to slowest too. So if we have plaintext it would quickly go onto the next one.
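The ordered loop above can be made self-contained like this. The three check functions here are simple stand-ins invented for the sketch (the real checks would be lemmatization, word endings, and synonyms):

```python
# Sketch of the ordered-checks loop: run the cheapest check first and stop
# as soon as enough checks pass to commit to phase 2.
def check_length(text: str) -> bool:
    return len(text) > 3

def check_ascii_words(text: str) -> bool:
    return all(w.isalpha() for w in text.split())

def check_endings(text: str) -> bool:
    return any(w.endswith(("ing", "en")) for w in text.split())

def enters_phase_2(text: str, needed: int = 2) -> bool:
    funcs = [check_length, check_ascii_words, check_endings]  # fastest first
    passed = 0
    for func in funcs:
        if func(text):
            passed += 1
        if passed >= needed:
            return True  # stop early: phase 2 is already decided
    return False
```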

Ideas for checks

These are ordered in terms of how much processing power they require.

  • A regex to check for non-alphabetic chars in the middle of a word. So “h3llo” would fail, but so would “hel3a4918!!!lo”. We can run this across each word individually, so when enough words do not match this regex we return True for this one specific test.
  • Chi squared
  • Stop word removal (removing words like “and, an, a, that”)
  • Word endings (“ing, en”)
  • Lemmatization
  • Expand contractions (Don’t -> do not)
  • Part-of-speech tagging (very expensive; basically, what is a noun or adverb in this list)
  • Entity extraction (Microsoft is an entity, and so is Jake.)
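
The first check in the list above can be sketched as a per-word regex with a pass threshold. The pattern and the 0.8 threshold are assumptions for illustration:

```python
import re

# A word passes only if it is purely alphabetic, optionally with trailing
# punctuation, so "h3llo" and "hel3a4918!!!lo" both fail.
WORD_RE = re.compile(r"[A-Za-z]+[.,!?'\"]*")

def mostly_clean_words(text: str, threshold: float = 0.8) -> bool:
    words = text.split()
    if not words:
        return False
    clean = sum(1 for w in words if WORD_RE.fullmatch(w))
    return clean / len(words) >= threshold
```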

Synonyms will be added if I can figure out whether they work in a way that I can use them. Stop word removal isn't needed at all for phase 2, so it is likely I'll place that last. Word endings are efficient to compute in O(n), but again this is kind of the job of lemmatization, so it may be redundant.

Perhaps I will have to have a “phase 0”, where I perform incredibly simple checks with a very low threshold; phase 1, where the checks get a little harder but not too difficult, with a medium threshold; and phase 2, where we can be certain it is plaintext, with a high threshold.

However, I will check the accuracy & speed of just using the pre-processing of phase 2 as a check for phase 1. I want it to be efficient, so testing the speed differences is important 😄

I still lament, however, that Chi Squared was a fantastic metric and very quick to compute. I understand that the crackers in basicEncryptions use the metric, but nothing else does. Perhaps a version where we only check the first 5 or 6 most popular chars?
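The “first 5 or 6 most popular chars” idea could look like this truncated chi-squared. The expected frequencies are rounded reference figures for English, not Ciphey's own table, and the function name is invented here:

```python
from collections import Counter

# Expected proportions of the six most common English letters (rounded).
EXPECTED = {"e": 0.127, "t": 0.091, "a": 0.082,
            "o": 0.075, "i": 0.070, "n": 0.067}

def truncated_chi_squared(text: str) -> float:
    """Chi-squared statistic over only the most common English letters.

    Lower scores mean the text is closer to English letter frequencies.
    """
    letters = [c for c in text.lower() if c.isalpha()]
    n = len(letters)
    if n == 0:
        return float("inf")
    counts = Counter(letters)
    score = 0.0
    for letter, p in EXPECTED.items():
        expected = p * n
        observed = counts.get(letter, 0)
        score += (observed - expected) ** 2 / expected
    return score
```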

If anyone has any ideas for other simple checks we can run, let me know. They have to be extremely fast, but the accuracy doesn’t matter too much.

1 reaction
Cyclic3 commented, Jun 3, 2020

1. Yeah, that sounds good. I'll put it in cipheydists so that it is more maintainable
3. I really don't like seeking paths that only exist on a few OSs. I believe this is more suited for #94
4. By user prompting, I meant pass each candidate to the user, using the best trained neural net I know of (the human brain). It is more annoying, but the best way of making sure nothing slips through for high-entropy plaintexts
