captcha22 client label should copy files

I came across this project and thought I’d give it a try with a very limited set of 4 captchas.

After downloading and installing using pip I ran:

captcha22 client label --input=captchas

I labeled the images and a data directory was created:

$ ls -lh data/
total 2,0M
-rw-rw-r-- 1 thijs thijs 665K dec  5 22:08 2YREA9.png
-rw-rw-r-- 1 thijs thijs 630K dec  5 22:08 6PA5K7.png
-rw-rw-r-- 1 thijs thijs  11K dec  5 22:03 NTEMYU.png
-rw-rw-r-- 1 thijs thijs 724K dec  5 22:08 XAMK6Q.png

Unfortunately it seems all my original captchas were moved by this script and I mistakenly deleted the data dir. Have to harvest some new ones now 😦 Would be really useful if captcha22 leaves the originals alone or mentions it very clearly in the readme that these files will be moved.

It also seems JPG files are not supported:

INFO:Captcha22 Label Scripts:Executing CAPTCHA Typing Script
INFO:Captcha22 Typer:No png files found

Update: found the --image-type option for this:

captcha22 client label --image-type=jpg --input=captchas

TinusGreen commented, Dec 7, 2020

Hi thijstriemstra,

The reason captcha22 moves the images is that when you are labelling a large set, say 1000 to 2000, the images take up a fairly large amount of space.

To save space, captcha22 moves the images that have already been labelled. This also ensures that if you stop the labelling process at any time, you will not have to start with the first captcha again, as those captchas are now in the output directory. Captcha22 never deletes images, it just moves them.

We can, however, look into introducing a flag that leaves the images in the input directory, but setting this flag would probably mean that if you stop and restart the labelling process, your progress will be lost.
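
Until such a flag exists, one workaround is to snapshot the input directory before labelling, so the moved files can be restored afterwards. This is a minimal sketch using only the Python standard library; the function name `backup_captchas` and the backup directory are made up for illustration and are not part of captcha22.

```python
import shutil
from pathlib import Path


def backup_captchas(input_dir, backup_dir):
    """Copy every regular file in input_dir to backup_dir.

    Hypothetical helper, not a captcha22 feature. Returns the number
    of files copied. Subdirectories are skipped.
    """
    src = Path(input_dir)
    dst = Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)

    copied = 0
    for path in sorted(src.iterdir()):
        if path.is_file():
            # copy2 also preserves timestamps where the platform allows it
            shutil.copy2(path, dst / path.name)
            copied += 1
    return copied
```

Calling `backup_captchas("captchas", "captchas_backup")` before `captcha22 client label --input=captchas` keeps a second copy of the originals, at the cost of the extra disk space mentioned above.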

In terms of your error, ideally you should be using the API to copy the data.zip file, since it would handle the naming convention for you. The idea is to keep your captchas organised, so the file name takes the form

<username>_<captcha_name>_<captcha_version>.zip

So it would be something like: thijstriemstra_testcaptcha_1.zip

You can always rename the data.zip manually to this format before placing it in the Unsorted directory.
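
The manual rename can also be scripted. The sketch below builds the file name from its three parts and copies data.zip under that name. `archive_name` and `stage_archive` are hypothetical helpers, not captcha22 functions, and the location of the Unsorted directory depends on how your server is set up.

```python
import shutil
from pathlib import Path


def archive_name(username, captcha_name, version):
    """Build <username>_<captcha_name>_<captcha_version>.zip."""
    return f"{username}_{captcha_name}_{version}.zip"


def stage_archive(zip_path, unsorted_dir, username, captcha_name, version):
    """Copy zip_path into unsorted_dir under the expected file name.

    unsorted_dir is an assumption: pass the path of the server's
    Unsorted directory on your machine.
    """
    target = Path(unsorted_dir) / archive_name(username, captcha_name, version)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(zip_path, target)
    return target
```

For the example above, `archive_name("thijstriemstra", "testcaptcha", 1)` produces `thijstriemstra_testcaptcha_1.zip`.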

We will be pushing a UI at the end of this year or start of 2021 which will also make it easier to interface with the server.

TinusGreen commented, Jan 14, 2021

> ps. could you publish a new version to pypi? the current release on there is broken.

Published and thanks, forgot to do it last time.

See attached zipfile.

Looking at the captchas, I’d say just adding more solved captchas to train with will be the main thing to improve your solving accuracy.

However, this is also a captcha with static noise, so another approach would be to filter the noise out. You could use a background filter, comparing static values across multiple captcha images, but I think a normal gray filter would do the trick as well.
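
The background-filter idea can be sketched as follows, assuming all captchas have the same size and the noise is identical at each pixel while the text changes from image to image. Under that assumption the per-pixel median over several images approximates the static background, and pixels close to it can be whitened. The function names and the `tolerance` value are illustrative assumptions, not part of captcha22.

```python
import numpy as np


def estimate_background(images):
    """Per-pixel median over a stack of equally sized grayscale images."""
    stack = np.stack(images)
    return np.median(stack, axis=0).astype(np.uint8)


def remove_background(image, background, tolerance=10, fill=255):
    """Whiten pixels that differ from the background by at most tolerance."""
    # Widen the dtype first so the subtraction cannot wrap around in uint8
    diff = np.abs(image.astype(np.int16) - background.astype(np.int16))
    cleaned = image.copy()
    cleaned[diff <= tolerance] = fill
    return cleaned
```

Text pixels whose gray level happens to match the background are whitened too, so `tolerance` needs tuning per captcha, much like the start and end values of a gray filter.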

This is an example of a gray filter we used previously. Of course, the start and end values would have to be tweaked.

import numpy as np
import cv2

def clean_image(imgfile, newname):
    # Gray levels from start to end (inclusive) are left untouched.
    # Every other level below 254 is replaced with 254 (near white).
    # Tweak these two values for your captcha.
    start = 100
    end = 151

    # Load the image and convert it to grayscale
    img = cv2.imread(imgfile)
    if img is None:
        raise FileNotFoundError(imgfile)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # One histogram bin per gray level (0..255).
    # The hist values can be used to determine which colour values we need to filter:
    # look for the largest values and filter those first.
    hist, _ = np.histogram(gray, bins=np.arange(257))
    print(hist)

    # Apply the gray filter in a single vectorised step
    outside = ((gray < start) | (gray > end)) & (gray < 254)
    gray[outside] = 254

    cv2.imwrite(newname, gray)

    # Image is now clean
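
To choose start and end, it can help to list the most frequent gray levels instead of reading the raw histogram. This is a small sketch, assuming an 8-bit grayscale array like the `gray` image in the function above; `dominant_gray_levels` is a hypothetical helper.

```python
import numpy as np


def dominant_gray_levels(gray, top=5):
    """Return the `top` most frequent gray levels as (level, count) pairs."""
    # One counter per possible 8-bit gray level
    counts = np.bincount(gray.ravel(), minlength=256)
    # Sort ascending by count, then reverse to get the most frequent first
    order = np.argsort(counts, kind="stable")[::-1][:top]
    return [(int(level), int(counts[level])) for level in order]
```

With static noise, the noise levels tend to show up near the top of this list in every image, which gives a starting point for the filter range.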

> Would be really nice to have tensorflow 2.x support though, using the trained model with tensorflow 2.x seems impossible, as well as converting it.

We are working on it, but will only be able to get this up and running once we’ve done a full conversion of AOCR, since that is the main learning engine. I’ll create a milestone for it after we finalise the UI update.
