CVAT does not work when annotating PDFs

Hi,

I am trying to annotate PDF documents instead of images with CVAT and noticed a number of problems that I am not able to resolve alone. I am using the develop branch because the CVAT Docker image does not build successfully on the master branch.

I am only able to upload a single PDF document (with many pages), not several PDF documents. The error message explains that I can only upload a single PDF, but it would be helpful to understand the rationale for this: ValueError: Only one video, archive, pdf or many image, directory can be used simultaneously, but 0 image(s), 0 video(s), 0 archive(s), 2 pdf(s), 0 directory(s) found.
The conversion from PDF to image with pdf2image does not work because poppler is missing from the Dockerfile. I fixed it by adding it to the Dockerfile:

# Install poppler for working with pdfs
RUN apt-get update && apt-get install -y poppler-utils
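
A quick way to confirm that the Poppler tools are actually visible inside the container is to look for them on PATH. This is only a minimal sketch: it assumes pdf2image relies on the `pdftoppm` and `pdfinfo` executables shipped in poppler-utils, and the helper name `missing_poppler_tools` is made up for illustration.

```python
import shutil

# Command-line tools from poppler-utils that pdf2image is assumed to call.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools(which=shutil.which):
    """Return the Poppler tools that cannot be found on PATH."""
    return [tool for tool in POPPLER_TOOLS if which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Poppler tools missing from PATH:", ", ".join(missing))
    else:
        print("Poppler tools found on PATH.")
```

Running this inside the container before creating a task should show whether the Dockerfile change took effect.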

After annotating a few items, I attempted to dump the annotations, and it fails no matter which format I use. The error message is below. Note that dumping annotations for PNG images works perfectly, so this seems to be a problem specific to PDFs.

2019-12-07 23:45:12,475 DEBG 'rqworker_default_1' stderr output:
23:45:12 default: cvat.apps.engine.annotation.dump_task_data('5', <SimpleLazyObject: <User: admin>>, '/home/django/data/5/5_IDP.admin.2019_12_07_23_45_12.zip', <AnnotationDumper: AnnotationDumper object (YOLO ZIP 1.0)>, 'http', 'localhost:8080') (admin@/api/v1/tasks/5/annotations/YOLO ZIP 1.0/5_IDP)

2019-12-07 23:45:12,574 DEBG 'rqworker_default_1' stderr output:
23:45:12 cvat.apps.engine.utils.InterpreterError: ValueError at line 308: '.upload' is not in list
Traceback (most recent call last):
  File "/home/django/cvat/apps/engine/utils.py", line 45, in execute_python_code
    exec(source_code, global_vars, local_vars)
  File "<string>", line 1, in <module>
  File "<string>", line 104, in dump
  File "/home/django/cvat/apps/annotation/annotation.py", line 325, in group_by_frame
    _get_frame(annotations, shape).labeled_shapes.append(self._export_labeled_shape(shape))
  File "/home/django/cvat/apps/annotation/annotation.py", line 308, in _get_frame
    rpath = os.path.sep.join(rpath[rpath.index(".upload")+1:])
ValueError: '.upload' is not in list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/rq/worker.py", line 812, in perform_job
    rv = job.perform()
  File "/usr/local/lib/python3.5/dist-packages/rq/job.py", line 588, in perform
    self._result = self._execute()
  File "/usr/local/lib/python3.5/dist-packages/rq/job.py", line 594, in _execute
    return self.func(*self.args, **self.kwargs)
  File "/home/django/cvat/apps/engine/annotation.py", line 135, in dump_task_data
    annotation.dump(filename, dumper, scheme, host)
  File "/home/django/cvat/apps/engine/annotation.py", line 740, in dump
    execute_python_code("{}(file_object, annotations)".format(dumper.handler), global_vars)
  File "/home/django/cvat/apps/engine/utils.py", line 60, in execute_python_code
    raise InterpreterError("{} at line {}: {}".format(error_class, line_number, details))
cvat.apps.engine.utils.InterpreterError: ValueError at line 308: '.upload' is not in list

2019-12-07 23:45:15,528 DEBG 'runserver' stderr output:
[Sat Dec 07 23:45:15.528224 2019] [wsgi:error] [pid 151:tid 139962191009536] [remote 172.19.0.1:33606] [2019-12-07 23:45:15,528] ERROR django.request: Internal Server Error: /api/v1/tasks/5/annotations/5_IDP
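
The traceback shows `rpath.index(".upload")` raising because no component of the frame path equals `.upload`. One plausible reading, which is an assumption and not confirmed in this thread, is that PDF pages are written to a temporary directory rather than under `.upload`. The sketch below illustrates a defensive variant of that lookup. The function name `path_after_marker` and the fallback to the bare file name are hypothetical choices, not CVAT's actual fix.

```python
import os


def path_after_marker(path, marker=".upload"):
    """Return the part of `path` that follows the `marker` directory.

    Falls back to the bare file name when the marker is absent,
    instead of letting list.index() raise ValueError.
    """
    parts = path.split(os.path.sep)
    if marker in parts:
        parts = parts[parts.index(marker) + 1:]
    else:
        # Hypothetical fallback: keep only the file name.
        parts = parts[-1:]
    return os.path.sep.join(parts)
```

Whether the file name alone is an acceptable frame identifier depends on how the dump formats use it, so treat the fallback as a starting point only.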

Top GitHub Comments

philippschw commented, Dec 9, 2019

Thanks for your detailed response.

Unfortunately, the code fails silently. Although it says the task has been created, the task does not show up in the overview, ready for annotation.

The following output from the logs shows that no frames have been created for the task.

2019-12-09 16:49:23,218 DEBG 'rqworker_default_1' stderr output:
16:49:23 default: cvat.apps.engine.task._create_thread(1, {'server_files': [], 'remote_files': [], 'client_files': ['DP_Telekom_Lexware_Unbekannt.pdf', 'DATEV.PDF']}) (/api/v1/tasks/1)

2019-12-09 16:49:23,230 DEBG 'rqworker_default_1' stderr output:
[2019-12-09 16:49:23,230] INFO cvat.server: create task #1

2019-12-09 16:49:23,245 DEBG 'rqworker_default_1' stderr output:
[2019-12-09 16:49:23,245] INFO cvat.server: Founded frames 0 for task #1

What is more, no .jpg file is saved in the data folder when I upload PDFs (projectid 1), but when I upload images (projectid 2) it works as expected:

django@2e82356a9f21:~/data$ ls 1/data/
django@2e82356a9f21:~/data$ ls 2/data/0/0/
0.jpg  1.jpg
django@2e82356a9f21:~/data$

I used your code with only minimal changes, in cvat/cvat/apps/engine/media_extractors.py:

        self._dimensions = []
        count = 0
        for source in source_path:
            for root, _, files in os.walk(source):
                paths = [os.path.join(root, f) for f in files]
                paths = filter(lambda x: get_mime(x) == 'pdf', paths)
                for path in paths:
                    pages = convert_from_path(path)
                    for page in pages:
                        # Note: There's probably a better way to assign a name than using `count`
                        output = os.path.join(self._temp_directory, str(count) + '.jpg')
                        count += 1
                        self._dimensions.append(page.size)
                        page.save(output, 'JPEG')

        self._length = len(os.listdir(self._temp_directory))

    def _get_imagepath(self, k):
        img_path = os.path.join(self._temp_directory, str(k) + '.jpg')
        return img_path

Complete Code: https://github.com/philippschw/cvat
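
One possible explanation for the silent failure, assuming the entries of `source_path` are paths to the uploaded PDF files rather than directories (the `client_files` in the log suggest file names): `os.walk()` yields nothing when given a file path, so the loop body never runs and zero frames are found. The sketch below handles both cases. The helper name `iter_pdf_paths` is hypothetical, and it uses a simple extension check as a stand-in for `get_mime`.

```python
import os


def iter_pdf_paths(sources):
    """Yield PDF paths from a mix of file paths and directories."""
    for source in sources:
        if os.path.isdir(source):
            for root, _, files in os.walk(source):
                for name in sorted(files):
                    # Extension check used here as a stand-in for get_mime().
                    if name.lower().endswith(".pdf"):
                        yield os.path.join(root, name)
        elif source.lower().endswith(".pdf"):
            # A plain file: os.walk(source) would yield nothing for it.
            yield source
```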

benhoff commented, Dec 8, 2019

I can’t speak to the dumper errors.

As for the rationale behind only being able to load a single PDF: I submitted this feature while working on a job for a client. All the client needed was the ability to upload a single PDF per task, and I had many, many other responsibilities 😃.

The upload code can easily be extended to account for your use case.

You would need to wrap lines 92 - 97 in a for loop. Line 92 is linked below:

https://github.com/opencv/cvat/blob/1ec89b5f6a445aaa86854356cb73deb7e070d346/cvat/apps/engine/media_extractors.py#L92

I think the DirectoryExtractor has a somewhat relevant example. The only difference is that the name in file_ = convert_from_path(self._source_path) is a little misleading: I believe file_ is a list with one image per page, each of which needs to be handled.

The relevant section of DirectoryExtractor code is linked below.

https://github.com/opencv/cvat/blob/1ec89b5f6a445aaa86854356cb73deb7e070d346/cvat/apps/engine/media_extractors.py#L129
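
On the return type: by default, pdf2image's `convert_from_path` returns a list of PIL Image objects, one per page, so each page still has to be saved to disk. The following is only a rough sketch of that handling for several PDFs. The function name `save_pdf_pages` and the injectable `convert` parameter are hypothetical and exist so the logic can be tried without Poppler installed.

```python
import os


def save_pdf_pages(pdf_paths, out_dir, convert=None):
    """Save every page of every PDF as a JPEG; return (saved paths, page sizes)."""
    if convert is None:
        # Imported lazily so a stand-in converter can be passed instead.
        from pdf2image import convert_from_path as convert

    saved, sizes = [], []
    for pdf_path in pdf_paths:
        # With pdf2image, this iterates over PIL Image objects, one per page.
        for page in convert(pdf_path):
            # Number frames continuously across all PDFs.
            output = os.path.join(out_dir, "{}.jpg".format(len(saved)))
            page.save(output, "JPEG")
            saved.append(output)
            sizes.append(page.size)
    return saved, sizes
```

Numbering frames continuously keeps a lookup by frame index simple, at the cost of losing the original PDF name in the file name.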

Below is a take on my comments from above.

# Note: The following code would replace the existing code starting at:
# https://github.com/opencv/cvat/blob/1ec89b5f6a445aaa86854356cb73deb7e070d346/cvat/apps/engine/media_extractors.py#L91

self._dimensions = []
count = 0
for source in source_path:
    for root, _, files in os.walk(source):
        paths = [os.path.join(root, f) for f in files]
        paths = filter(lambda x: get_mime(x) == 'pdf', paths)
        for path in paths:
            pages = convert_from_path(path)
            for page in pages:
                # Note: There's probably a better way to assign a name than using `count`
                output = os.path.join(self._temp_directory, str(count) + '.jpg')
                count += 1
                self._dimensions.append(page.size)
                page.save(output, 'JPEG')

self._length = len(os.listdir(self._temp_directory))

# Note: you would need to redefine the below method for `PDFExtractor`
def _get_imagepath(self, k):
    img_path = os.path.join(self._temp_directory, str(k) + '.jpg')
    return img_path

# Note: You would need to change `unique` to be `False` in the following line:
# https://github.com/opencv/cvat/blob/1ec89b5f6a445aaa86854356cb73deb7e070d346/cvat/apps/engine/media_extractors.py#L276

But I don’t have any way to test the above code currently.
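
To illustrate why flipping `unique` matters, here is a small sketch of the kind of check that could produce the "Only one video, archive, pdf or many image, directory" error from the original report. It is an assumption about the validation logic, not a copy of CVAT's code. The `MEDIA_TYPES` table, the function `validate_counts`, and the error text are all hypothetical.

```python
# Hypothetical stand-in for the media-type table; only the `unique` flags matter here.
MEDIA_TYPES = {
    "image": {"unique": False},
    "video": {"unique": True},
    "archive": {"unique": True},
    "pdf": {"unique": True},
    "directory": {"unique": False},
}


def validate_counts(counts, media_types=MEDIA_TYPES):
    """Raise ValueError if more than one 'unique' input is mixed into a task."""
    unique = sum(n for kind, n in counts.items() if media_types[kind]["unique"])
    multiple = sum(n for kind, n in counts.items() if not media_types[kind]["unique"])
    if unique > 1 or (unique == 1 and multiple > 0):
        found = ", ".join("{} {}(s)".format(n, kind) for kind, n in counts.items())
        raise ValueError("Conflicting inputs: " + found)


two_pdfs = {"image": 0, "video": 0, "archive": 0, "pdf": 2, "directory": 0}

# With `pdf` marked unique, two PDFs are rejected.
try:
    validate_counts(two_pdfs)
except ValueError as error:
    print(error)

# With `unique` set to False for `pdf`, the same upload passes the check.
relaxed = dict(MEDIA_TYPES, pdf={"unique": False})
validate_counts(two_pdfs, relaxed)
```

Under this reading, changing the flag only lifts the upload restriction; the extractor still has to be able to handle several PDFs.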