question about input data format / using different pose extractors
Hi, I have just started looking into geometric learning and as a first try I want to get the network running in my environment. My issue is that I am not using the joints from OpenPose, so my input is formatted in a different way. I am specifically talking about `N, C, T, V, M = x.size()` from `forward()` and `extract_feature()`. Going by the paper "Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition", I am guessing that N is the number of joints, C is the number of channels of the feature (2 for 2D joint positions), and T is time, as in the number of frames that are processed. For V and M I am at a loss, and now I'm stuck because I can't convert my own pose coordinates into the proper format; I would appreciate any help. I tried installing OpenPose just to explore the data format more, but after endless conflicts caused by Anaconda/CUDA version mismatches I gave up.
tl;dr - what are `N, C, T, V, M = x.size()` for the pose data?
Issue Analytics
- Created 3 years ago
- Reactions: 1
- Comments: 5 (2 by maintainers)
Hi, I just noticed my error with N as well - thanks for reiterating. Unfortunately I can't use the DataLoader because of my RL setup, so right now I'm trying to figure out how the normalization was done (my working theory: subtract half the width/height and then divide by width/height, since the values are distributed between -0.5 and 0.5). As for the score, I was just setting the visible joints to confidence 1, but maybe your idea is better. In case anyone is working on a similar issue and is interested:
`x.size() = [64, 3, 300, 18, 2]`

minibatch size = 64; channels (x, y, score) = 3; T = 300, i.e. twice the temporal window size of 150 as per the config; V = 18 joints for Kinetics as per the paper. For the last dimension I'm not quite sure; the only thing I can say is that the second "person" seems to be missing quite often during training. This is an output of `print(x[0, :, 150, :, 0])`, i.e. one middle frame of the first sample in the minibatch for the first person.

Good luck with your application and thanks for your help!
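The working theory above (subtract half the width/height, then divide by width/height, so values land between -0.5 and 0.5) can be sketched as follows. This is only a guess at the dataset's normalization, not confirmed against the original preprocessing code; `normalize_pose` is a hypothetical helper name:

```python
import numpy as np

def normalize_pose(coords, width, height):
    """Centre pixel coordinates and scale them to roughly [-0.5, 0.5].

    coords: array of shape (..., 2) holding (x, y) in pixels.
    (x - width/2) / width simplifies to x / width - 0.5, and likewise for y.
    """
    out = coords.astype(np.float32).copy()
    out[..., 0] = out[..., 0] / width - 0.5
    out[..., 1] = out[..., 1] / height - 0.5
    return out

# A joint at the image centre maps to (0, 0); image corners map to +/-0.5.
centre = normalize_pose(np.array([[320.0, 240.0]]), width=640, height=480)
corner = normalize_pose(np.array([[0.0, 0.0]]), width=640, height=480)
```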
I meant the dimensions are in the order given:
(id_within_minibatch, channels, frame_num_aka_time, keypoints, person_id)
So:
- N = id_within_minibatch (hint: use a DataLoader to make minibatches in the 1st dimension)
- C = channels: (x, y, score) OR (x, y) – has to match num_channels
- T = frame_num_aka_time
- V = keypoint/joint (probably stands for vertex)
- M = person ID (for when there are multiple people within a frame, I would suppose)
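For anyone converting the output of a different pose extractor, assembling the dimensions above might look like this. A minimal NumPy sketch, assuming your extractor gives you one `(num_people, num_joints, 3)` array of (x, y, score) per frame; the variable names are illustrative, not from the repo:

```python
import numpy as np

# Hypothetical per-frame keypoint arrays from your own pose extractor,
# each of shape (num_people, num_joints, channels) = (M, V, C).
T, V, M, C = 300, 18, 2, 3
frames = [np.random.rand(M, V, C).astype(np.float32) for _ in range(T)]

# Stack frames into (T, M, V, C), then rearrange axes into the
# (C, T, V, M) layout forward() expects, and prepend N for the batch.
x = np.stack(frames)         # (T, M, V, C)
x = x.transpose(3, 0, 2, 1)  # (C, T, V, M)
x = x[np.newaxis]            # (N, C, T, V, M) with N = 1
```

From here, `torch.from_numpy(x)` would give a tensor satisfying `N, C, T, V, M = x.size()`; if only one person is tracked, pad the M axis with zeros.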
By the way, I have been passing just (x, y) without score, since I'm working with images + OpenPose and I think the score might be rather dependent upon camera setup/resolution, so I would prefer to sacrifice in-domain accuracy for generalisation. It's up to you whether you include the score or not.
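Dropping the score is just a slice along the channel axis; the only caveat is that the channel count must match the model's configured number of input channels. A small sketch (the tensor here is a placeholder, and `in_channels` refers generically to whatever the yaml config calls that setting):

```python
import numpy as np

# Placeholder tensor in (N, C, T, V, M) layout with C = 3, i.e. (x, y, score).
x = np.zeros((1, 3, 300, 18, 2), dtype=np.float32)

# Keep only the first two channels (x, y) for a model configured
# with in_channels = 2; C shrinks from 3 to 2, all other axes unchanged.
x_xy = x[:, :2]
```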
I would paste my code, but it isn't working at the moment.
Here’s one of the yaml files used:
https://github.com/open-mmlab/mmskeleton/blob/master/configs/recognition/st_gcn/kinetics-skeleton-from-openpose.yaml