Thursday, 9 April, 2020 UTC


Summary

Project done by Jupiter Zhu, a deep-learning intern at Hyprsense
With the rapid progress of deep-learning-based object detection and segmentation, and the rising trend of virtual celebrities, Hyprsense is experimenting with whole-body tracking for virtual-being developers.
At its core is a body/limb tracking network. This report covers the implementation of a network based on the architecture of the OpenPose model, and how its training can be further accelerated on a cloud platform like GCP.

Here is a Brief Overview of the Project:

  • Goals: 2D body tracking and joint prediction
  • Dataset: COCO dataset, around 56k images for training and 2k for validation
  • Network structure: Part affinity field refining network based on the paper ‘OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields’
  • Training hardware/platform: 2080Ti on local machines & V100 cluster on GCP

And in This Blogpost, We Shall Cover the Following:

  1. PAF: What part affinity fields are and why they are helpful in joint detection.
  2. Implementation of the network and its challenges: The specific network design from OpenPose and the problems one runs into at training time.
  3. Improvements and acceleration: We used a C++ data generator to facilitate the large data generation/augmentation task. We also containerized the program so that it can be trained on GCP easily and in a scalable way.
  4. Results and future directions: We present some sample prediction pictures and videos. We also cover how this network fits into the bigger picture of Hyprsense’s vision for virtual-being development.

PAF:

To predict where the joints are, it helps to know the approximate location and orientation of the limb those joints belong to. After all, once the joint locations are known, the location and orientation of the limb are determined. Thus, the direction and location of whole limbs, rather than just their joints, are natural things for a body-tracking network to consider.
To represent such information, we introduce part affinity fields (PAF). Each limb gets its own channel of a vector field. For example, with 19 limbs and a 100 by 100 output, the output has 19 * 100 * 100 * 2 values, or in a channels-last implementation, a 100 * 100 * 38 tensor.
Consider two joints j1 and j2 (say the right elbow and right wrist, so we are on the right-forearm channel). We can represent the limb’s direction by the unit vector v pointing from j1 to j2.
The PAF on this channel then assigns a vector to each point: points on the limb are assigned v, and points off the limb are assigned the zero vector (in R2). So the question becomes: given a point p in space, how do we decide mathematically whether p is “on” the limb j1-j2?
We pick an arbitrary width, call it w, which becomes a hyperparameter one can tweak later in training. (One can also tune w separately for different limbs.) Two conditions, shown below, then decide membership, and together they carve out a rectangular region around the limb like the one in the picture below.
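Written out (following the OpenPose paper), with u the vector from j1 to the point p, v the unit direction of the limb, and v_perp the unit vector perpendicular to v, the two conditions are:

0 \le v \cdot u \le \lVert j_2 - j_1 \rVert \quad\text{and}\quad \lvert v_\perp \cdot u \rvert \le w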
So if u, the vector from j1 to p, falls within that width (the purple area), we say p is on the limb, and the pixel at p on this channel takes the value v. On a technical note: if multiple limbs of the same type, say several right forearms, overlap in a region, the pixels in that region take the average of the corresponding unit vectors.

The end result is something of this form:

Input photo (left) & the result of PAF with all channels combined (right)
Note that the above picture collapses all PAF channels together for better visualization.
Once the PAF is predicted, one can use it to help match limbs when multiple people appear in the same picture.
Given a heatmap prediction like the one above, say with two elbows and two wrists detected, the limb matching that best agrees with the flow should be picked.
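In the notation of the OpenPose paper, this agreement is scored by a line integral along the candidate limb, with the segment from j1 to j2 parameterized as p(t):

E = \int_{t=0}^{1} L_c\big(p(t)\big) \cdot v \, dt, \qquad p(t) = (1-t)\,j_1 + t\,j_2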
Here Lc is the PAF for limb c, which connects the keypoints j1 and j2, and v stands for the unit direction from j1 to j2. By integrating, i.e. summing, the dot product between v and Lc along the candidate limb, the line integral E, which ranges from -1 to 1, indicates how compatible the assignment is with the underlying PAF.

Implementations:

We implemented a “4+2” structure with PAF and heatmap refinement stages. Each PAF stage outputs a PAF prediction based on its input, and similarly for each heatmap (HM) stage. Every stage takes some previous results along with the original input: each PAF stage refines the result of the previous PAF stage, and after the 4th stage, each HM stage takes in the original input together with the last PAF output.
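As a rough sketch of this wiring (not the exact Hyprsense configuration; the backbone, channel counts, and the per-stage body below are placeholders), the “4+2” stacking could look like this in the Keras functional API:

import tensorflow as tf
from tensorflow.keras import layers

NUM_PAF_CHANNELS = 38  # 19 limbs * 2 vector components
NUM_HM_CHANNELS = 20   # e.g. 19 joints + background

def stage(x, out_channels, name):
    # Placeholder stage body; the real stage internals are described further below.
    for i in range(3):
        x = layers.Conv2D(128, 3, padding="same", activation="relu",
                          name=f"{name}_conv{i}")(x)
    x = layers.Conv2D(128, 1, activation="relu", name=f"{name}_pre")(x)
    return layers.Conv2D(out_channels, 1, name=f"{name}_out")(x)

inputs = tf.keras.Input(shape=(None, None, 3))
# Stand-in for the feature-extraction backbone.
features = layers.Conv2D(128, 3, padding="same", activation="relu",
                         name="backbone")(inputs)

# Four PAF stages: each refines the previous PAF given the original features.
paf = stage(features, NUM_PAF_CHANNELS, "paf_stage1")
outputs = [paf]
for s in range(2, 5):
    paf = stage(layers.Concatenate()([features, paf]),
                NUM_PAF_CHANNELS, f"paf_stage{s}")
    outputs.append(paf)

# Two HM stages: each takes the original features plus the last PAF output.
hm = stage(layers.Concatenate()([features, paf]), NUM_HM_CHANNELS, "hm_stage1")
outputs.append(hm)
hm = stage(layers.Concatenate()([features, paf, hm]), NUM_HM_CHANNELS, "hm_stage2")
outputs.append(hm)

model = tf.keras.Model(inputs, outputs)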
At training time, the PAF and HM labels are generated from the ground truth, which consists of the keypoint (joint) locations. Note that this label generation also depends on certain hyperparameters, such as the decay rate of the Gaussian in the HM or the aforementioned limb-width parameter.
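A rough numpy sketch of the label generation for a single joint and a single limb (array sizes, sigma, and the limb width are illustrative; the real generator also handles multiple people, averaging of overlaps, and output downsampling):

import numpy as np

def gaussian_heatmap(height, width, joint, sigma=7.0):
    # Ground-truth HM for one joint: a Gaussian peak centered at joint = (x, y);
    # sigma plays the role of the decay-rate hyperparameter mentioned above.
    xs = np.arange(width)[None, :]
    ys = np.arange(height)[:, None]
    d2 = (xs - joint[0]) ** 2 + (ys - joint[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def paf_channel(height, width, j1, j2, w=4.0):
    # Ground-truth PAF for one limb: the unit direction v on pixels inside the
    # rectangle of width w around the segment j1-j2, zero elsewhere.
    j1, j2 = np.asarray(j1, float), np.asarray(j2, float)
    v = j2 - j1
    length = np.linalg.norm(v) + 1e-8
    v = v / length
    xs = np.arange(width)[None, :]
    ys = np.arange(height)[:, None]
    ux, uy = xs - j1[0], ys - j1[1]            # u: vector from j1 to each pixel
    along = ux * v[0] + uy * v[1]              # projection onto the limb direction
    across = np.abs(-ux * v[1] + uy * v[0])    # perpendicular distance to the limb
    on_limb = (along >= 0) & (along <= length) & (across <= w)
    field = np.zeros((height, width, 2))
    field[on_limb] = v
    return field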
Each stage output is then compared to its corresponding label under the L2 Euclidean norm. In fact, because one limb usually covers only a small area of a picture, the L2 difference only takes effect on pixels that actually have non-zero vectors in the label.
This is the same as computing the total L2 norm after multiplying by a binary mask that takes the value 1 on annotated points.
We believe this maneuver encourages the model to predict non-trivial results instead of always outputting zero vector fields. The per-stage loss has the form

f = \sum_c \sum_p W_c(p) \odot \lVert L_c(p) - L_c^*(p) \rVert_2^2

where c ranges over all limb types, the circle dot is the pixel-wise product with the binary mask W on channel c, and L can be either the PAF or the HM.
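As a sketch, assuming a batch x height x width x channels tensor layout, this masked objective could be written in TensorFlow as:

import tensorflow as tf

def masked_l2(label, prediction, mask):
    # mask is the binary weight W: 1 on annotated pixels, 0 elsewhere.
    return tf.reduce_sum(mask * tf.square(label - prediction))

def total_loss(stage_outputs, stage_labels, masks):
    # Sum the masked L2 losses of all PAF and HM stages.
    return tf.add_n([masked_l2(l, p, m)
                     for p, l, m in zip(stage_outputs, stage_labels, masks)])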
Above is an illustration of what is inside one stage. It consists of several blocks of conv layers whose highway connections are concatenated at the end of each block. The stage finishes with two 1-by-1 conv layers.
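A possible Keras sketch of one such stage, with each block's intermediate outputs concatenated as the highway connections (the block count and filter sizes are assumptions rather than the exact design):

from tensorflow.keras import layers

def conv_block(x, filters, name):
    # Three 3x3 convs; their outputs are concatenated at the end of the block.
    outs = []
    for i in range(3):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu",
                          name=f"{name}_conv{i}")(x)
        outs.append(x)
    return layers.Concatenate(name=f"{name}_concat")(outs)

def refinement_stage(x, out_channels, name, num_blocks=5, filters=96):
    for b in range(num_blocks):
        x = conv_block(x, filters, f"{name}_block{b}")
    # The stage finishes with two 1-by-1 conv layers.
    x = layers.Conv2D(256, 1, activation="relu", name=f"{name}_1x1_a")(x)
    return layers.Conv2D(out_channels, 1, name=f"{name}_1x1_b")(x)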
The weight mask, along with the label creation, makes the large data generation task during training challenging, and this work is usually handled by the CPU. One should run the CPU and GPU in parallel to avoid GPU downtime.
The above picture demonstrates the CPU and GPU work distribution. One batch is about 20 pictures, and a full epoch takes around 30 minutes on a 2080Ti.
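One simple way to overlap the two is a bounded queue that worker threads keep filled with ready batches while the GPU trains. The sketch below is illustrative rather than the exact Hyprsense pipeline, and pure-Python workers are still limited by the GIL, which partly motivates the C++ generator described in the next section.

import queue
import threading

def prefetching_batches(make_batch, num_workers=4, buffer_size=8):
    # make_batch() performs the CPU-heavy augmentation and PAF/HM label creation.
    q = queue.Queue(maxsize=buffer_size)

    def worker():
        while True:
            q.put(make_batch())  # blocks when the buffer is full

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()

    while True:
        yield q.get()            # the training loop consumes ready batches

# Hypothetical usage:
# for images, labels, masks in prefetching_batches(generate_training_batch):
#     model.train_on_batch(images, labels)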

Improvements and accelerations:

Prof. Hong of the Hyprsense team designed and coded a training framework that utilizes a C++ data generator via Boost.Python and embeds the model built with the Keras API.
The above picture illustrates the use of multi-threading in the CPU data generation tasks. Compared to a pure Python implementation, this further improves both the stability of training and the speed of data generation. This improvement comes in handy when we bring the model training to the Google Cloud Platform (GCP).
One of the major challenges of this model is its large number of hyperparameters. The freedom in label creation, along with the traditional deep learning parameters (learning rate, batch size, augmentation extent, etc.), makes the training and experimentation cycle extremely long. To accelerate this process, one logical solution is to use a scalable cloud computing cluster to run multiple experiments in parallel.
We use Docker to containerize our code and push the image to the registry used by Google Kubernetes Engine. The data is stored in a Persistent Volume Claim (PVC). Multiple experiments with different hyperparameter setups simultaneously read the data from the PVC and store their TensorBoard results in a Google Storage bucket.
One issue is that tf.keras model saving does not accept a gs:// (Google Storage) URL, so we cache the model in the container's ephemeral storage and then write it into a bucket using byte I/O. Though this process is inefficient, it only happens about once per epoch and should not greatly impact the training time.
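A sketch of this workaround (the bucket path is a placeholder): save to the container's local disk first, then copy the bytes into the bucket.

import tensorflow as tf

def save_to_bucket(model, gcs_path, local_path="/tmp/checkpoint.h5"):
    # Save to the container's ephemeral disk, then byte-copy into the bucket.
    model.save(local_path)
    with open(local_path, "rb") as src, tf.io.gfile.GFile(gcs_path, "wb") as dst:
        dst.write(src.read())

# e.g. save_to_bucket(model, "gs://my-training-bucket/checkpoints/epoch_010.h5")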
One thing to keep in mind is that most clusters have CPU and GPU constraints. On a cluster with V100 Tesla GPUs, the CPU is likely to become the bottleneck in data generation. On a P100 GPU cluster, though the CPU is superior, the GPU is inferior to a 2080Ti.
Thus, to fully utilize GCP, one should consider optimizing the CPU parts of the code to avoid GPU downtime. In fact, Prof. Hong has since updated the C++ data generator design so that GPU downtime on the V100 cluster is minimal.

Results

Results from the COCO dataset:
From left to right: original, PAF label, PAF predictions, HM labels, HM predictions
We will also upload our demo in the near future, so stay tuned for the update.

Future direction:

  • Integrate this limb-tracking model with Hyprface to develop an all-in-one whole-body tracking solution.
  • Experiment with depthwise conv2d layers to compress the model.
  • Extend the current model from 2D to 3D.
  • Test various optimization methods to see if the results can be further improved.

Acknowledgments:

I am grateful for the opportunity to work at Hyprsense and to be entrusted with such a significant task. The team and the company as a whole were always passionate about my project and very helpful whenever the work hit a bottleneck. I also want to express my sincere gratitude to my team leader, Dr. Jeong-Mo Hong, for his mentorship.

Reference:

  • OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
  • Blog by Quan Hua: PAF definition.
  • COCO dataset: Common Objects in Context.
Hyprsense develops real-time human sensing technology. Hyprface is our product-ready software fully built in-house to track expression values and remap them into a 3D character in real-time. The SDK supports iOS, Android, Windows, Unity, and Unreal. If you are interested, feel free to ask us for a free 30-day trial SDK.

Implementation of PAF (Openpose) Pose Detection Network & its Training Accelerations on GCP was originally published in AR/VR Journey: Augmented & Virtual Reality Magazine on Medium.