Method for adding config settings
Look mama, no config files!
I was wrestling with config files for some of the settings when I ran across this Google Groups discussion about using tesseract from Java, and it made my mouth water. Here’s a code snippet from their discussion:
tesseract = new Tesseract();
tesseract.setOcrEngineMode(TessAPI.TessOcrEngineMode.OEM_TESSERACT_ONLY);
tesseract.setPageSegMode(7);
tesseract.setTessVariable("load_system_dawg", "0");
tesseract.setTessVariable("load_freq_dawg", "0");
tesseract.setTessVariable("load_punc_dawg", "0");
tesseract.setTessVariable("load_number_dawg", "0");
At first you may think: well, that’s cool I guess, but you can really do the same thing by defining a long string of config options and reusing it whenever you need it. For example, '--psm 10 --oem 3 -c load_system_dawg=0 -c load_freq_dawg=0 -c load_punc_dawg=0 . . .'
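To make the "long string" idea concrete, here is a minimal sketch of a helper that builds such a string. The name build_config and its parameters are made up for illustration and are not part of pytesseract. As far as I know, the tesseract command line expects every variable behind its own -c flag, so the helper repeats the flag for each one.

```python
def build_config(psm, oem, variables):
    """Build a tesseract option string (hypothetical helper, not a pytesseract API)."""
    parts = [f"--psm {psm}", f"--oem {oem}"]
    # Repeat the -c flag once per variable.
    parts.extend(f"-c {name}={value}" for name, value in variables.items())
    return " ".join(parts)


# Define the string once and reuse it wherever it is needed.
NO_DAWGS = build_config(
    psm=10,
    oem=3,
    variables={
        "load_system_dawg": 0,
        "load_freq_dawg": 0,
        "load_punc_dawg": 0,
    },
)
```

The resulting string could then be passed along on every call, e.g. pytesseract.image_to_string(image, config=NO_DAWGS).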
However, the tesseract documentation mentions that you can’t change ‘init only’ parameters with the tesseract executable option -c. Those ‘init only’ parameters include some of the ones I’ve been messing with. I think most people would agree that it would be nice to set config variables directly in Python using a set_config_variable method instead of having to go make a config file. Since some of the variables set in the code above are in fact ‘init only’, the Java guys must be creating a config file from their Java code (I did not sniff through their code to verify this, however).
I haven’t done it yet because I’m not too familiar with the code inside pytesseract, but right now making a temporary config file and filling it through a set_config_variable method doesn’t seem very hard from my perspective. Here’s the high-level logic I’m thinking about:
- When pytesseract is imported, check the config folder to see if a temp.txt file exists. If so, wipe it clean. If not, create one.
- When someone calls the tsr.set_config_variable method, just write the variable, a space, and the value on a new line in the temp.txt file.
- You could also have a method to delete the variable from the file and thus return tesseract to the default.
- When any of the OCR functions are called, if the user does not manually supply another config file, use the temp.txt as the config file unless it’s empty.
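The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the class TempConfig and its methods (including unset_config_variable and as_config) are hypothetical names, nothing here exists in pytesseract, and the file lives in the system temp directory rather than in the tesseract config folder.

```python
import os
import tempfile


class TempConfig:
    """Hypothetical helper that keeps tesseract variables in a temporary config file."""

    def __init__(self):
        # Start from a fresh, empty file so no stale settings survive.
        fd, self.path = tempfile.mkstemp(prefix="pytesseract_", suffix=".txt")
        os.close(fd)
        self._variables = {}

    def set_config_variable(self, name, value):
        self._variables[name] = str(value)
        self._write()

    def unset_config_variable(self, name):
        # Removing the line returns tesseract to its default for that variable.
        self._variables.pop(name, None)
        self._write()

    def _write(self):
        # One line per variable: the name, a space, and the value.
        with open(self.path, "w", encoding="utf-8") as f:
            for name, value in self._variables.items():
                f.write(f"{name} {value}\n")

    def as_config(self, user_config=""):
        # A config supplied by the user wins; an empty temp file is not used.
        if user_config:
            return user_config
        return self.path if self._variables else ""

    def close(self):
        # Delete the temp file once it is no longer needed.
        if os.path.exists(self.path):
            os.remove(self.path)
```

Usage would look something like cfg = TempConfig(); cfg.set_config_variable("load_system_dawg", 0); pytesseract.image_to_string(image, config=cfg.as_config()). This assumes tesseract accepts a config file given by its path, and a path containing spaces or backslashes may need quoting before it goes into the config string.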
Why this would be a good feature:
- For me and others like me who wrote their first line of code 8 months ago, even little trips to the back end of config files or source code can be confusing and take a lot of time.
- There are a lot of super ridiculously lazy people out there just like me who would rather not know anything about how the programs and libraries they’re using work, but just want to use them to make other interesting applications.
But maybe it’s actually not very easy to implement. Is this actually possible?
Top GitHub Comments
Hi, thank you very much for the proposal. This can be implemented, and in fact you can implement it yourself with custom logic. At the end of the day, you can write your own logic for handling config files and then pass the resulting config file via the config method argument.
As far as integrating this into pytesseract: well, if I have some free time, I will try to implement the logic for this. The only “problematic” part is where to store this temp config.
And btw, we can have this nice python approach:
I was able to supply my own config file by using the following (“words” is the name of my config file):
pytesseract.run_and_get_output(im, extension="txt", config="words")