Language detection interface
Problem
Whilst the original language checker is absolutely brilliant, it fails on small ciphertexts, or those with high entropy. An AI solution would be cool, but would be a bit OTT for rigid data structures such as JSON or CTF flags.
Solution
We present an interface for defining a custom candidate acceptor:
```python
from abc import ABC, abstractmethod

class LanguageChecker(ABC):
    @staticmethod
    @abstractmethod
    def getArgs(**kwargs) -> dict: pass

    @abstractmethod
    def checkLanguage(self, text: str) -> bool: pass

# The user supplies a concrete subclass and exposes it under this name:
ciphey_acceptor = LanguageCheckerDerived()
```
This can be passed as an argument to each cracker. Since the core applies a basic chi-squared filter to the candidates, relatively few should make it through. This allows the `is_acceptable` function to be a halt condition: if it returns true on any candidate, we stop.
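For illustration, a derived acceptor might look like the sketch below. The `RegexChecker` class and its flag pattern are hypothetical; only the `LanguageChecker` interface above comes from the proposal.

```python
import re

class RegexChecker(LanguageChecker):
    @staticmethod
    def getArgs(**kwargs) -> dict:
        # Forward the regex supplied by the user (or fall back to a default).
        return {"regex": kwargs.get("regex", r"flag\{.*\}")}

    def __init__(self, regex: str = r"flag\{.*\}"):
        self.pattern = re.compile(regex)

    def checkLanguage(self, text: str) -> bool:
        # Accept any candidate containing a match for the pattern.
        return self.pattern.search(text) is not None

ciphey_acceptor = RegexChecker()
```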
However, a user shouldn't be expected to write such code for all text; that would be absurd.
Instead, we provide a few language detection interfaces that the user can select rather than writing their own. My suggestion would be the following:
| Name | Args | Description |
|---|---|---|
| brandon | None | Current system, for long natural text |
| neuralnet | path | A neural net for text with deep structure, or smaller text |
| regex | regex | A regular expression for very simple structures |
| user | None | Prompts the user for each candidate |
I also put forward the following flags:
| Flag | Description |
|---|---|
| `-a <name>` | The internal language detector `<name>` (maybe with a configurable path) |
| `-A <path>` | A Python script/module at `<path>` containing the `ciphey_acceptor` variable |
| `-p <param>=<value>` | Sets the kwarg `<param>` to `<value>` when the `is_acceptable` method is called |
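As a sketch of how the proposed `-p` pairs could reach `getArgs` (illustrative plumbing only; `parse_params` is a hypothetical helper, not part of the proposal, and `RegexChecker` is the hypothetical acceptor sketched earlier):

```python
def parse_params(pairs: list[str]) -> dict:
    # ["regex=flag\\{.*\\}"] -> {"regex": "flag\\{.*\\}"}
    return dict(pair.split("=", 1) for pair in pairs)

kwargs = parse_params([r"regex=flag\{.*\}"])
args = RegexChecker.getArgs(**kwargs)  # -> {"regex": "flag\\{.*\\}"}
```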
Top GitHub Comments
Thoughts on synonyms
The synonyms appear to be ordered, possibly by the frequency of the word itself, which is a feature from Google Books. If the exact synonyms are deterministic and go by order, I would assume "brilliant" will always equal "good". This makes my job easier, but I cannot guarantee this for every possible text.
If I have gotten this wrong and it is in fact non-deterministic and they are not perfect mappings of each other, we could generate a list of synonyms per word and use the dictionary lookup to see if that word appears.
If we have to do that, I would argue we wouldn't need synonym checking at all and should remove it. I will have to explore further to see which is the case and whether or not it is worth it.
It may be possible for me to create a dictionary of synonyms and store it in cipheyDists. For example, when I convert "brilliant" I can store it as "good" in the dict format `{"brilliant": "good"}`. That way we have a 1-to-1 mapping of words to synonyms. To find out if a word is in the dictionary, we would first look up the stored synonym using that dict, and then search the dictionary dict using the value.
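A minimal sketch of that two-step lookup, with stand-in data (the real maps would come from cipheyDists):

```python
# Hypothetical stand-ins for data that would live in cipheyDists.
SYNONYMS = {"brilliant": "good", "superb": "good"}
DICTIONARY = {"good", "the", "cat", "sat"}

def in_dictionary(word: str) -> bool:
    # Canonicalise through the 1-to-1 synonym map first,
    # then look the result up in the dictionary.
    canonical = SYNONYMS.get(word, word)
    return canonical in DICTIONARY

assert in_dictionary("brilliant")  # maps to "good", which is present
```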
Point 1
I agree, I messed up - my bad!
Point 2
It is a dictionary, but it also uses spooky NLP magic to make it efficient. That is, it isn't a simple lookup table; it exploits how the English language is made and formed to provide a more efficient table.
For what it's worth, we're using the smallest English model available, with no plans to expand to the medium or large versions.
Point 3
I am thinking of something like the sketch below:
We have lots of little checks, but together they should give a relatively clear picture of whether it's worth going to phase 2 or not.
Also, I will make it so that as soon as the passed-check count hits 2 (or some amount X), it goes straight to phase 2. I don't want to wait around when we've already accepted that it's going to phase 2. I will use a while loop for this, like the sketch below.
This way we can easily add more small checks if we want to change something, while maintaining the structure of the program.
The funcs list is ordered from fastest to slowest too, so obvious plaintext would quickly get through.
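A rough sketch of that loop. The individual checks (`has_stop_words`, `sane_letter_frequency`) are hypothetical placeholders for the real checks discussed below:

```python
def has_stop_words(text: str) -> bool:
    return any(w in text.split() for w in ("the", "and", "of", "to"))

def sane_letter_frequency(text: str) -> bool:
    letters = [c for c in text.lower() if c.isalpha()]
    return bool(letters) and letters.count("e") / len(letters) > 0.05

# Ordered fastest to slowest, so obvious plaintext exits early.
funcs = [has_stop_words, sane_letter_frequency]

THRESHOLD = 2  # "as soon as the count hits 2 (or some amount X)"

def worth_phase_2(text: str) -> bool:
    passed, i = 0, 0
    # Stop as soon as enough checks pass; don't run the slower ones.
    while i < len(funcs) and passed < THRESHOLD:
        passed += funcs[i](text)
        i += 1
    return passed >= THRESHOLD
```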
Ideas for checks
These are ordered in terms of how much processing power they require.
Synonyms will be added if I can figure out whether they work in a way that I can use. Stop-word removal isn't needed at all for phase 2, so I will likely place it last. Checking word endings is efficient to compute in O(n), but that is largely the job of lemmatization, so it may be redundant.
Perhaps I will have to have a "phase 0", where I perform incredibly simple checks with a very low threshold; phase 1, where the checks get a little harder but not too difficult, with a medium threshold; and phase 2, where we can be certain it is plaintext, with a high threshold.
However, I will check the accuracy & speed of just using the pre-processing of phase 2 as a check for phase 1. I want it to be efficient, so testing the speed differences is important 😄
I still lament, however, that chi-squared was a fantastic metric and very quick to compute. I understand that the crackers in basicEncryptions use the metric, but nothing else does. Perhaps a version where we only check the 5 or 6 most popular chars?
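A minimal sketch of that truncated chi-squared, assuming approximate English letter frequencies (the real figures would come from cipheydists):

```python
from collections import Counter

# Approximate relative frequencies of the 6 most common English letters.
EXPECTED = {"e": 0.127, "t": 0.091, "a": 0.082,
            "o": 0.075, "i": 0.070, "n": 0.067}

def truncated_chi_squared(text: str) -> float:
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters)
    if total == 0:
        return float("inf")
    counts = Counter(letters)
    # Lower scores mean the top letters look more English-like.
    return sum(
        (counts.get(char, 0) - freq * total) ** 2 / (freq * total)
        for char, freq in EXPECTED.items()
    )
```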
If anyone has any ideas for other simple checks we can run, let me know. They have to be extremely fast, but the accuracy doesn’t matter too much.
1. Yeah, that sounds good. I'll put it in cipheydists so that it is more maintainable.
3. I really don't like seeking paths that only exist on a few OSs. I believe this is more suited for #94.
4. By user prompting, I meant passing each candidate to the user, using the best-trained neural net I know of (the human brain). It is more annoying, but it is the best way of making sure nothing slips through for high-entropy plaintexts.