
Annotate rigid objects in 2D image with standard 3D cube


My actions before raising this issue

I have read and searched the official docs and past issues; no one has reported the same problem.

Expected Behaviour

I want to annotate the head orientation of people in a 2D image with a standard 3D cube. Here, the head is treated as a rigid object. A standard cube is defined as follows: the three edges meeting at any vertex are mutually perpendicular, and all twelve edges have equal (unit) length.


After labeling, we get the eight projected vertices of the cube in the 2D image coordinate system. If three Euler angles (pitch, yaw, roll) are used to represent the head orientation, these precise projected points can be converted into the corresponding angles (a sketch of this conversion follows).

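For concreteness, here is a minimal sketch (not CVAT code) of one way this conversion could work: treat the eight annotated vertices as 2D observations of a known unit cube and solve a Perspective-n-Point problem with OpenCV. The camera matrix K, the vertex ordering, the function name, and the ZYX Euler convention are all illustrative assumptions.

```python
import numpy as np
import cv2

# 3D vertices of a unit cube centred at the origin (model coordinates).
# The ordering here is an assumption; it must match the annotation order.
CUBE_3D = np.array([
    [-0.5, -0.5, -0.5], [0.5, -0.5, -0.5], [0.5, 0.5, -0.5], [-0.5, 0.5, -0.5],
    [-0.5, -0.5,  0.5], [0.5, -0.5,  0.5], [0.5, 0.5,  0.5], [-0.5, 0.5,  0.5],
], dtype=np.float64)

def cube_to_euler(pts_2d, K):
    """pts_2d: (8, 2) annotated vertices; K: (3, 3) camera matrix (assumed known)."""
    ok, rvec, _ = cv2.solvePnP(CUBE_3D, np.asarray(pts_2d, dtype=np.float64), K, None)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    # One common ZYX extraction; head pose datasets use varying conventions.
    yaw   = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    pitch = np.degrees(np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2])))
    roll  = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    return pitch, yaw, roll
```

For uncalibrated photos, K is not known, so an approximate focal length would have to be guessed; that guess is part of why a dedicated annotation tool is preferable to post-hoc recovery.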

Current Behaviour

  • Current cuboid annotation. The cuboid annotation currently provided in CVAT is not suitable for rigid objects:

    1. Firstly, it cannot guarantee that the edges at each vertex of the labeled cuboid are perpendicular to each other.
    2. Secondly, the length, width and height of the cuboid are not necessarily equal.
    3. Finally, the side faces of the current cuboid are always vertical and cannot be rotated, so the annotation lacks one rotational degree of freedom. Together, these limitations make the cuboid unusable for marking head orientation. I also think such a cuboid is unsuitable for labeling cars, chairs and other rigid objects.
  • Alternative choice: polyline. As an alternative, I tried to annotate three consecutive non-coplanar edges of the cube with the polyline label; the four points of these three edges can then be used to estimate the Euler angles (a recovery sketch under an orthographic camera assumption follows this list). However, this alternative only solves the third problem listed above; the first and second problems remain. What we actually get are rotated cuboids.

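To make the polyline workaround concrete, here is a hedged sketch of how a rotation could be recovered from the four polyline points under a scaled-orthographic camera assumption. It assumes the polyline traces the cube's x, y and z edges in that exact order; that ordering, like the function name, is an illustrative assumption rather than a CVAT convention.

```python
import numpy as np

def polyline_to_rotation(pts):
    """pts: (4, 2) polyline points tracing three consecutive cube edges.
    Returns an approximate 3x3 rotation matrix (scaled-orthographic model)."""
    edges = np.diff(np.asarray(pts, dtype=np.float64), axis=0)  # (3, 2) edge vectors
    M = edges.T                             # 2x3: columns are the projected x, y, z axes
    s = np.mean(np.linalg.norm(M, axis=1))  # under this model, M = s * (first two rows of R)
    r1, r2 = M[0] / s, M[1] / s             # approximate first two rows of R
    R = np.stack([r1, r2, np.cross(r1, r2)])
    U, _, Vt = np.linalg.svd(R)             # project onto the nearest proper rotation
    return U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt
```

Euler angles could then be read from the returned matrix as in the previous sketch. The orthographic assumption is exactly why this remains an approximation, which matches the subjectivity problem discussed in the comments below.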

Possible Solution

I have three suggestions, or roadmaps, for adding a unit-cube label in a new version of CVAT.

  1. Improve cuboid. The current cuboid is actually oblique, but objects in the real world should be marked with regular cuboids in which the three edges at each vertex are mutually perpendicular. At the same time, we need to release the third rotational dimension of the cuboid and allow it to rotate freely. I don't know how easy this is to implement in TypeScript; Three.js and other open-source packages may be useful references.

  2. Modify cuboid-3d. As far as I know, recent versions of CVAT already support 3D point-cloud annotation. Would it be possible to port the 3D cuboid module to 2D image annotation? I am not very familiar with the point-cloud annotation code, so I cannot offer detailed opinions there.

  3. Add cube. If possible, consider adding a new cube label to the candidate label buttons on the left side of CVAT. Users could then add new 3D cube shapes. A cube instance should support rotation at any angle around all three axes, and the software would automatically record the final Euler angles once the cube's pose is fixed (the projection geometry involved is sketched below).
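To illustrate the geometry such a cube tool would need, here is a minimal sketch that rotates a unit cube by given Euler angles and projects it with a pinhole camera, producing the eight 2D vertices the UI would draw. It reuses CUBE_3D from the first sketch; the focal length, principal point, depth, and ZYX convention are all illustrative assumptions.

```python
import numpy as np

def euler_to_R(pitch, yaw, roll):
    """ZYX convention, angles in degrees; the convention itself is an assumption."""
    p, y, r = np.radians([pitch, yaw, roll])
    Rx = np.array([[1, 0, 0], [0, np.cos(r), -np.sin(r)], [0, np.sin(r), np.cos(r)]])
    Ry = np.array([[np.cos(p), 0, np.sin(p)], [0, 1, 0], [-np.sin(p), 0, np.cos(p)]])
    Rz = np.array([[np.cos(y), -np.sin(y), 0], [np.sin(y), np.cos(y), 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project_cube(pitch, yaw, roll, f=800.0, cx=320.0, cy=240.0, tz=5.0):
    """Return the (8, 2) projected vertices of the rotated unit cube."""
    verts = CUBE_3D @ euler_to_R(pitch, yaw, roll).T + np.array([0.0, 0.0, tz])
    return np.stack([f * verts[:, 0] / verts[:, 2] + cx,
                     f * verts[:, 1] / verts[:, 2] + cy], axis=1)
```

Interactive dragging would then amount to adjusting (pitch, yaw, roll) until the projected wireframe matches the head, at which point the tool can record the angles directly.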

Here are two examples of 3D model interaction. The first is the rotation interaction of a 3D head model in Mayavi; the interaction relies on both mouse and keyboard. The second uses the 3D image editing tool in Windows 10 to place and manipulate 3D models on 2D images; only the mouse is needed.

Example 1: rotating a 3D head model in Mayavi

Example 2: placing 3D models on a 2D image with the Windows 10 3D editing tool

Next steps

Looking forward to your reply. I am willing to do whatever I can to help advance this feature.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

3 reactions
hnuzhy commented, Jul 8, 2021

> @hnuzhy, I agree that we need to improve the functionality. Your explanation is really helpful. Could you please describe your research area and organization? Unfortunately, my team has a huge number of requests and we already have an approximate roadmap for Q3'21 and Q4'21, so I am trying to clarify details that will help me increase the priority of this feature.

@nmanovic Hi, I'm glad you agree with my proposal. I am a PhD student in the computer science department at SJTU (Shanghai Jiao Tong University). My research field is the intersection of AI and education; my specific direction is object detection and pose estimation in computer vision. I would like to explain the motivation behind this request from two aspects.

Aspect one: Academic Value

Recently, I have been studying methods for detecting students' attention in the classroom. Head orientation (head pose estimation) is one of the key factors. However, as far as I know, multi-person head pose estimation in 2D images is not well developed. There are SOTA algorithms for pose estimation of a single, well-cropped head, including FSA-Net (CVPR 2019) and WHE-Net (BMVC 2020), but their performance is not ideal, and they do not extend easily to multiple people in a single image. Most importantly, the datasets these algorithms use are obtained either by 3D head projection (300W-LP and AFLW2000-3D) or from Euler angles collected by depth cameras in controlled scenes (CMU Panoptic Studio Dataset).

Prediction example 1 of FSA-Net (the input can only be a single person's head with a visible face).

Prediction example 2 of FSA-Net (the head bbox of each person is first detected by MTCNN, and then each single head is estimated; therefore, this is not an efficient or truly multi-person head pose estimation algorithm).

Prediction example of WHE-Net (the input can only be a single person's head, but with a wide pose range; the predictable yaw angle of the head is omnidirectional).

Datasets have always been the cornerstone of deep learning algorithms, and head pose estimation is no exception. Therefore, I want to annotate the 3D head orientation, i.e. the three Euler angles of the head, directly in 2D images. As described at the top of this issue, the most accurate annotation scheme hinges on interacting freely with a 3D cube on 2D images. In my opinion, once such a dataset is constructed, it will drive significant progress in the corresponding algorithm research. For example, a bottom-up method could be designed to predict the poses of all heads in an image in a single pass. At the same time, compared with a single cropped head image, the full scene and body context in the original image can support more accurate head pose estimation.

Aspect two: Enhancement Feasibility

After investigation, I did not find any tool with true 3D cube annotation. Fortunately, two close functional options exist in CVAT. The first is Draw new cuboid; however, the newly drawn cuboid lacks rotational freedom. The second is Draw new polyline: by annotating three consecutive non-coplanar edges of a 3D cube aligned with the head orientation, we can deduce approximate Euler angles. Unfortunately, this annotation process carries a large subjective error. We cannot directly see the actual pose of the implied cube unless we annotate interactively with a real 3D cube. If we use the polyline method anyway, the credibility of the final annotations will be questionable.

Here are three examples of rough annotation results with Draw new polyline. The images are all from the public CrowdHuman dataset. The objects we annotate are heads in any orientation, including those with visible, occluded and invisible faces. In many cases, polyline annotation is difficult and inaccurate.


In short, it would be very useful to add interactive annotation of rigid 3D shapes (which can only be rotated, translated and scaled) on 2D images. Beyond head orientation, the new feature could be extended to the annotation of other rigid objects. Once similar datasets of general objects are constructed, we could try to develop simple, direct 3D object pose estimation algorithms based only on 2D images, and we expect such methods could become comparable to estimation algorithms based on RGB-D or 3D point clouds.

Finally, I am not well placed to propose the overall implementation plan for this enhancement, whether in UI design or code, but I am willing to do what I can. I sincerely thank CVAT's main contributors for their work, and I hope you will carefully consider adding this task to the roadmap.

1 reaction
Kucev commented, Apr 7, 2022

I support the request. We also have a need for such functionality.
