3D HUMAN POSE ESTIMATION IN VIETNAMESE TRADITIONAL MARTIAL ART VIDEOS

Preserving, maintaining, and teaching traditional martial arts are very important activities in social life: they help preserve national culture and provide exercise and self-defense for practitioners. However, traditional martial arts include many different postures, and the activities of the body and body parts are diverse. The problem of estimating the actions of the human body still has many challenges, such as accuracy and occlusion. In this paper, we survey several strong studies of recent years on 3-D human pose estimation. Statistical tables have been compiled over the years, and typical results of these studies on the Human 3.6m dataset have been summarized. We also present a comparative study of 3-D human pose estimation based on methods that use a single image. This study is based on methods that use a Convolutional Neural Network (CNN) for 2-D pose estimation and then map the 2-D results into 3-D space using a 3-D pose library. The CNN models are trained on benchmark datasets such as the MSCOCO Keypoints Challenge dataset [1], Human 3.6m [2], the MPII dataset [3], LSP [4], [5], etc. Finally, we publish a dataset of Vietnamese traditional martial arts in Binh Dinh province for evaluating 3-D human pose estimation. Quantitative results are presented and evaluated.


Introduction
Estimating and predicting the actions of the human body is a well-studied problem in the robotics and computer vision community. 3-D human pose estimation is also applied in many other applications, such as sports analysis, evaluation analysis, playing games with 3-D graphics, or health care and protection. Especially, 3-D human pose estimation produces results from which human actions in the real world can be fully seen, and it addresses cases when human parts are obscured. However, 3-D human pose estimation has many challenges. In 3-D space it is very difficult to extract and train the feature vectors because 3-D data is much more complex than data in 2-D space (image space); other challenges include estimating many people in outdoor environments and noisy data (data missing parts of the human body). There are two methods to recover the 3-D human pose: the first is recovering the 3-D human pose from a single image; the second is recovering the 3-D human pose from a sequence of images [6]. The first method, 3-D human pose estimation using a single image, usually performs 2-D human pose estimation and then maps the result into 3-D space. The second method, using a sequence of images, combines 2-D human pose estimation with geometric transformations (affine transformations)/mapping to build the skeleton of the person in 3-D space [7].
2-D human pose estimation can be addressed with a set of methods, such as analyzing people in images, locating people in images, locating key points on human bodies, and identifying joints at points represented on the body (skeleton). In recent years, studies of these methods have often been based on CNN models. 2-D human pose estimation is usually based on color images and depth images, or it is based on objects and action context [8]. The above studies often use color images, depth images [9], or skeletons [10] obtained from different types of sensors (e.g., Microsoft (MS) Kinect version 1, MS Kinect version 2, Time-of-Flight sensors).
In particular, the Microsoft (MS) Kinect sensor version 1 (v1) is a common and cheap sensor that can collect information such as color images, depth images, skeletons, and acceleration vectors [11].
In this paper, the main contributions are: (1) we survey recent 3-D human pose estimation techniques; (2) we propose a comparative study of 3-D human pose estimation based on methods that use a single image, captured with an MS Kinect sensor v1; (3) we propose evaluation measures and publish a dataset of Vietnamese traditional martial arts in Binh Dinh province. This paper is structured as follows: the first part is the introduction of 3-D human pose estimation (Section 1); the second is the literature review of some studies on 3-D human pose estimation in recent years (Section 2); the third is the comparative study of 3-D human pose estimation on the Vietnamese traditional martial arts dataset (Section 3); the final part presents some conclusions and discussions (Section 5).

Related Works
3-D human pose estimation often uses computer vision techniques. These studies can be based on a single image or a sequence of images. Human pose and action estimation is applied in many applications, such as human interaction (e.g., body language or gesture recognition), human interaction with robots, and video surveillance (used to convey human actions) [6]. To address 3-D human pose estimation from a single image, studies often perform 2-D pose estimation and then map it into 3-D space. The model often applied to estimating the 3-D human pose is shown in Figure 3 of [6]. In this section, we examine in detail the studies that estimate the 3-D human pose following the two above methods. Especially in the last few years, a number of studies on 3-D human pose estimation have been published in many prestigious conferences and journals of computer science and computer vision. This is shown in Fig. 1.
Most studies of 3-D human pose estimation use CNN models to train and estimate the 2-D human pose (first method) (studies by Pavllo et al. [12], Wang et al. [13], etc.) or use the 2-D human pose annotation (second method) (studies by Karim et al. [14], Hossain et al. [15], etc.). These studies use color or depth images as input. The first method projects the 2-D human pose results into 3-D space using a 3-D pose library such as [2] and then finds the most suitable 3-D pose; the second method projects into 3-D space using the parameters of the capture sensors [16] or using a CNN model [17].
3-D human pose estimation results were evaluated with the MPJPE measure, as shown in Tab. 1.

3-D human pose estimation from a single image
As reported in the survey of Sarafianos et al. [6], 3-D human pose estimation from a single image is performed in two steps: 2-D human pose estimation, and then estimating its depth by matching to a library of 3-D poses, as in Fig. 2.
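The matching step above can be sketched as a nearest-neighbor search: pick the library 3-D pose whose projection is closest to the detected 2-D pose. The orthographic projection and the toy pose library below are assumptions for illustration, not the exact matching used by the cited methods.

```python
import numpy as np

def orthographic(pose3d):
    """Project an (n, 3) pose to the image plane by dropping depth."""
    return pose3d[:, :2]

def match_3d_pose(pose2d, library3d, project=orthographic):
    """Return the index of the library pose whose 2-D projection
    best matches the detected 2-D pose."""
    errors = [np.linalg.norm(project(p) - pose2d) for p in library3d]
    return int(np.argmin(errors))

# Toy library of two 3-D "poses" (2 joints each) and a detected 2-D pose.
library = [
    np.array([[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]),
    np.array([[5.0, 5.0, 1.0], [6.0, 6.0, 1.0]]),
]
detected = np.array([[0.1, 0.0], [1.0, 0.9]])
print(match_3d_pose(detected, library))  # 0: the first pose projects closest
```

Real systems refine this search with priors over plausible poses rather than a raw nearest neighbor, but the principle is the same.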

3-D human pose estimation from a sequence of images
Estimating the 3-D skeleton and posture of a human is an essential skill for rebuilding the actual environment and for estimating joints when parts of the human limbs are obscured.

The entries of Tab. 1 can be summarized as follows:
• Pavllo et al. [35] (UP-3D, HumanEva-I, Human3.6m; MPJPE): QuaterNet represents rotations with quaternions, and the loss function performs forward kinematics on a skeleton to penalize absolute position errors instead of angle errors; this reduces error accumulation along the kinematic chain.
• Martinez et al. [29] (Human3.6m; MPJPE): 2-D pose detections using the state-of-the-art stacked hourglass network pre-trained on the MPII dataset; data-hungry algorithms for the 2d-to-3d problem can be trained with large amounts of 3-D mocap data captured in controlled environments.
• Pavlakos et al. [30] (Human3.6m, HumanEva, MPII; MPJPE): for 2-D human pose estimation, the authors discretize the space around the subject and use a ConvNet to predict per-voxel likelihoods for each joint from a single color image; a subsequent optimization step recovers the 3-D pose.
• Tekin et al. [33] (Human3.6m, HumanEva-I, KTH Football II, MPII; MPJPE): the authors employ the stacked hourglass network design, which carries out repeated bottom-up, top-down processing to capture spatial relationships in the image; a discriminative fusion framework simultaneously exploits 2-D joint-location confidence maps and 3-D image cues for 3-D human pose estimation.
• Haque et al. [41] (Human3.6m, HumanEva-I, KTH Football II, LSP; MPJPE): the authors propose a viewpoint-invariant model for 3-D human pose estimation from a single depth image.
3-D Pose Estimation
The activity of the human body is detected, recognized, predicted, and estimated based on parts of the human body. Parts are based on the links between key points. Each part is represented by a vector L_c in 2-D space (image space) in a set of vectors on the human body S, where the set of vectors L = {L_1, L_2, ..., L_C} has C vectors on the human body S. The body S is represented by the key points J, S = {S_1, S_2, ..., S_J}. For an input image of size (w × h) pixels, the position of the key points can be S_j ∈ R^(w×h), j ∈ {1, 2, ..., J}. The CNN architecture is shown in Fig. 5. As can be seen in Fig. 5, this CNN consists of two branches performing two different jobs. From the input data, a set of feature maps F is created by analyzing the image; then confidence maps and affinity fields are detected at the first stage. The key points of the training data are displayed on the confidence maps as shown. These points are trained to estimate key points on color images. The first branch (top branch) is used to estimate key points; the second branch (bottom branch) is used to predict the affinity fields matching joints on many people. The branches in Fig. 5 form the CNN called "CPM - Convolutional Pose Machines" [46], used to estimate the 2-D human pose.
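A minimal sketch of how a ground-truth confidence map S_j is rendered for supervising the key-point branch: a 2-D Gaussian peaked at the annotated key-point location. The map size and sigma below are illustrative assumptions, not the values used by the cited network.

```python
import numpy as np

def confidence_map(h, w, keypoint, sigma=2.0):
    """Return an (h, w) map with a Gaussian peak at keypoint = (x, y)."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - keypoint[0]) ** 2 + (ys - keypoint[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Render a map for a joint annotated at pixel (x=20, y=12).
cmap = confidence_map(46, 46, keypoint=(20, 12))
peak_y, peak_x = np.unravel_index(np.argmax(cmap), cmap.shape)
print(peak_x, peak_y)  # 20 12: the peak sits at the annotated location
```

The network's key-point branch is trained to regress maps of this shape, one per joint.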
The detailed model of training and predicting (Figure 3) of Zhe's study [47] is as follows: the input image at stage 1 is an image with 3 color channels (R, G, B) of size h × w, and features are extracted by convolution with masks of size 9 × 9, 2×, 5×, ...; each subsequent stage takes the heatmaps of the first stage as input.
Therein, each heatmap indicates the location confidence (x, y) of the key points. Therefore, the key points of the training data are displayed on the confidence maps, as shown in Fig. 3. These points are trained to estimate the key points on color images. The first branch (top branch) is used to estimate the key points, and the second branch (bottom branch) is used to predict the affinity fields matching joints.
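Reading a key-point location and its confidence off a predicted heatmap can be sketched as taking the arg-max cell; sub-pixel refinement, used in practice, is omitted here for brevity.

```python
import numpy as np

def heatmap_to_keypoint(heatmap):
    """Return (x, y, confidence) of the strongest heatmap response."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(x), int(y), float(heatmap[y, x])

# Synthetic heatmap with a single response at (x=20, y=12).
hm = np.zeros((46, 46))
hm[12, 20] = 0.9
print(heatmap_to_keypoint(hm))  # (20, 12, 0.9)
```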
In this paper, we conduct a comparative study of 3-D human pose estimation, as shown in Fig. 6, in which the methods are as follows:
• The first method, called "3-D_COCO_Method": 2-D human pose estimation using CPM trained on the MSCOCO Keypoints Challenge dataset [1] + mapping to 3-D space with the 3-D pose library of the Human 3.6m dataset [37].
• The second method, called "3-D_HUMAN3.6_Method": 2-D human pose estimation using CPM trained on Human 3.6m [2] + mapping to 3-D space with the 3-D pose library of the Human 3.6m dataset [37].
• The third method, called "3-D_VNECT_Method". The mapping of the 2-D pose into 3-D space is formulated to minimize the following estimate, as in Eq. (1).
arg min_{R, µ, a, e, σ} Σ_{i=1}^{N} ..., where a_i e = Σ_j a_{i,j} e_j is the tensor analog of a multiplication between a vector and a matrix, ||·||²_2 is the squared Frobenius norm of the matrix, the y axis is assumed to point up, and the rotation matrix R_i is considered to be rotated against the ground plane.
In the comparative study, the third method is based on the method of Mehta et al. [36]. The authors use a regression CNN model to predict the heatmaps following the method of Tompson et al. [49]. In particular, the training of features for learning and predicting the heatmaps is based on the ResNet (Deep Residual Networks) architecture [50], which provides a breakthrough idea for building features and training. The ResNet in [50] is built on the TensorFlow library [51]. The model in this network uses the MPII dataset [3] and LSP [4], [5] for training the estimation of key points on the image. To estimate the 3-D human pose, the authors employ the method of Ionescu et al. [52] with the Human3.6m dataset [2] and MPI-INF-3DHP [53] for projecting the 2-D human pose estimation into 3-D space.

Data collection and evaluation
Traditional martial arts, a very important sport, help people exercise and protect themselves. In many countries around the world, especially in Asia, many traditional martial arts are handed down from generation to generation. With the development of technology, it is important to maintain, preserve, and teach such martial arts [54], [55]. There are also many different types of image sensors that can collect information about martial arts teaching and learning in schools of martial arts. The MS Kinect sensor v1 is the cheapest sensor. This type of sensor can collect a lot of information, such as color images, depth images, skeletons, acceleration vectors, sounds, etc. From the collected data, it is possible to recreate the environment of teaching martial arts in 3-D space. However, in this paper, of the information collected from the MS Kinect sensor, we only use the color images. To obtain data from the sensor environment, the MS Kinect SDK 1.8 is used to connect computers and sensors [56]. To perform data collection on computers, we use a data collection program developed at the MICA Institute [57] with the support of the OpenCV 3.4 libraries [58] and the C++ programming language. The color images, depth images, and skeleton come from different sensors; therefore, a calibration is recommended to align the color images and depth images. In particular, we apply the data calibration of Zhou et al. [59] and Jean et al. [60]. In these two calibration tools, the calibration matrix is used as follows:

3-D Comparative Study
where (c_x, c_y) is the principal point (usually the image center), and f_x and f_y are the focal lengths.
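The pinhole relation behind this calibration matrix can be sketched as follows: a 3-D camera-frame point projects to a pixel via (f_x, f_y, c_x, c_y), and a pixel with known depth back-projects to a 3-D point. The numeric intrinsics are illustrative values for a 640 × 480 sensor, not the calibrated parameters used in this study.

```python
import numpy as np

# Illustrative intrinsics for a 640x480 sensor (assumed values).
FX, FY, CX, CY = 525.0, 525.0, 320.0, 240.0

def project(p):
    """3-D camera-frame point (x, y, z) -> pixel (u, v)."""
    x, y, z = p
    return np.array([FX * x / z + CX, FY * y / z + CY])

def backproject(u, v, z):
    """Pixel (u, v) with depth z -> 3-D camera-frame point."""
    return np.array([(u - CX) * z / FX, (v - CY) * z / FY, z])

p = np.array([0.1, -0.05, 1.0])
u, v = project(p)
print(np.allclose(backproject(u, v, 1.0), p))  # True: the two are inverses
```

The same relation is what allows the marked pixels to be lifted into the point cloud for annotation.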
The matrix H_m (in Nicolas et al. [61]) is calculated as follows. The ground truth of the key points is marked on the data in 3-D space. To do this, we showed the 3-D data (point cloud data) of the scene in the visualization window of a program that we developed in the Visual Studio programming environment with the support of the PCL library [62] and the C++ programming language. Figure 8 illustrates the 3-D human pose data. We marked 17 key points on the human body. In some cases, when the limbs are obscured, we assume that the person's hands or feet are close to the human body, and their points are chosen as in the case where the hand or foot data is visible. Currently, marking points in 3-D space is done manually, considering only the data from one side of the MS Kinect sensor. This study has not looked into cases when the data is obscured or when the actions of people are complicated.
In order to mark data in 3-D space under occlusion, a MOCAP system [63] is often used to calculate the actual coordinates of the human hands and feet. The dataset is collected from an MS Kinect sensor v1, which can collect data at a rate of about 10 frames/s on a low-configuration laptop. The MS Kinect sensor v1 is mounted on a fixed rack; the martial arts instructor performs in a space of about 3 × 3 m, as in Fig. 9, and the dataset is called "VNMA - VietNam Martial Arts".
The obtained images (color images, depth images) are 640 × 480 pixels. The obtained dataset consists of 24 videos of different postures. The output of 3-D human pose estimation based on the method of Tome et al. [37] (the methods "3-D_COCO_Method" and "3-D_HUMAN3.6_Method") is 17 key points, as shown in Fig. 10. The output of 3-D human pose estimation based on the method of Mehta et al. [36] (the method "3-D_VNECT_Method") is 21 key points.
In fact, the 3-D ground truth data follows the MS Kinect coordinate system, while the estimated data is based on the coordinate system of the training data used to estimate the 3-D human pose, i.e., the coordinate system of the Human 3.6m dataset or the MPI-INF-3DHP dataset [53]. These two types of data are not in the same coordinate system, and thus steps must be taken to synchronize the coordinate systems. In this study, we combine finding the rotation and the translation matrix into one process, in which the rotation and translation matrices are represented in 3-D space [65] as in Eq. (4), where P(x, y, z) is the estimated point of the 3-D human pose estimation result, and P'(x', y', z') is the estimated point after the transformation into the same coordinate system as the 3-D ground truth data. Therefore, we have the formulation in Eq. (5).
From the coordinates of the key points in the 3-D human poses of the dataset, we define the coordinates of a 3-D pose including n points as in Eq. (6).
In particular, the rotation matrices and the translation along the x, y, z axes are presented in the order θ_1, θ_2, θ_3, as in Eq. (7).
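Composing the three axis rotations θ_1, θ_2, θ_3 with a translation t into the rigid transform P' = R P + t can be sketched as follows; the z-y-x composition order is an assumption for illustration, as the paper does not state it.

```python
import numpy as np

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rigid_transform(points, theta1, theta2, theta3, t):
    """Apply P' = R P + t to an (n, 3) array of key points."""
    R = rot_z(theta3) @ rot_y(theta2) @ rot_x(theta1)
    return points @ R.T + np.asarray(t)

# A 90-degree turn about the z axis maps (1, 0, 0) to (0, 1, 0).
p = np.array([[1.0, 0.0, 0.0]])
out = rigid_transform(p, 0.0, 0.0, np.pi / 2, [0.0, 0.0, 0.0])
print(np.round(out, 6))  # [[0. 1. 0.]]
```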
The results of the rotation and translation are shown in the vectors X', Y', Z', as in Eq. (8).
where x_i, y_i, z_i are the coordinate values of the 3-D pose ground truth data (the destination coordinate system to which the estimated 3-D human pose is rotated and translated); x_j, y_j, z_j are the coordinates of the key points of the estimated 3-D human pose, which is to be rotated and translated into the same coordinate system as the 3-D human pose ground truth data.
From this, we have a system of linear equations, presented in Eq. (9).
The parameters θ_i are estimated using the Least Squares (LS) method [66], [67], as in Eq. (10).
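A common closed-form sketch of this alignment step is the Kabsch/SVD solution for the R and t minimizing Σ ||R p_j + t - p_i||². The paper solves for the angles θ_i via a linear system; the SVD route below is an assumed alternative shown only for illustration.

```python
import numpy as np

def align_rigid(src, dst):
    """Return R (3x3) and t (3,) that best map src points onto dst
    in the least-squares sense (Kabsch algorithm)."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)          # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_d - R @ mu_s

# Recover a known 30-degree rotation about z plus a translation.
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
src = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]])
dst = src @ R_true.T + np.array([10.0, -5.0, 2.0])
R, t = align_rigid(src, dst)
print(np.allclose(R, R_true), np.allclose(t, [10, -5, 2]))  # True True
```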
The entire source code of the rotation and translation is stored at: https://drive.google.com/file/d/1dIHgal63TcGn0-6_hnTJsEDfh8qkNOsE/view?usp=sharing and explained in detail in Appendix A and Appendix B. Finally, we have the transformation matrix in the form (θ_1; θ_2; θ_3).
The testing process is performed on a workstation with an Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20 GHz, 16 GB RAM, and a GPU GTX 1080 Ti with 12 GB of memory. In this paper, we choose 15 common points between the 3-D ground truth data, the output key points of the method of Tome et al. [37], and the output key points of the method of Mehta et al. [36], as in Fig. 12. We use the MPJPE (Mean Per Joint Position Error) (mm) for evaluating 3-D human pose estimation. This measure is the Euclidean distance between the two corresponding key points of the 3-D ground truth data and the estimated 3-D pose; the distance is calculated as in Eq. (11).
where (x_g, y_g, z_g) are the coordinates of the ground-truth key point p_g in 3-D space, and (x_e, y_e, z_e) are the coordinates of the estimated key point p_e in 3-D space.
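The MPJPE measure of Eq. (11) can be sketched directly: the mean Euclidean distance (in mm) between corresponding ground-truth and estimated key points, computed after both poses are in the same coordinate system.

```python
import numpy as np

def mpjpe(gt, est):
    """gt, est: (n, 3) arrays of key points in mm; returns MPJPE in mm."""
    gt, est = np.asarray(gt, float), np.asarray(est, float)
    return float(np.linalg.norm(gt - est, axis=1).mean())

# Two joints with per-joint errors of 5 mm each.
gt  = np.array([[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]])
est = np.array([[3.0, 4.0, 0.0], [100.0, 0.0, 5.0]])
print(mpjpe(gt, est))  # 5.0
```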
The input data of this study is the color images in the video.The output data is the 3-D human pose estimation results.

Results of estimation and discussion
The results of 3-D human pose estimation on the VNMA dataset are provided in Tab. 5.
Figure 13 shows the error distance distribution when estimating the 3-D human pose on the VNMA dataset with 15 key points. Table 5 and Figure 13 reveal that the first method, "3-D_COCO_Method", has the best estimation results (the average MPJPE is 170.866 mm). These error values are high because the 3-D ground truth data is manually annotated; therefore, it is not as accurate as 3-D ground truth data calculated by a MOCAP system. The third method, "3-D_VNECT_Method", has the worst estimation results (the average MPJPE is 279.4472 mm). During the testing process, we found that the 2-D human pose estimation result of the "3-D_VNECT_Method" is often wrong, as in Fig. 14.
Figure 15 shows several 3-D human pose estimation results on the VNMA dataset with 17 key points.
In particular, the 3-D human pose estimation based on the proposed comparative study has handled the cases when parts are obscured; the 3-D human skeleton is fully restored, as in Fig. 16.

Conclusion and future work
The preservation, storage, and teaching of traditional martial arts are very important for preserving national cultural identities and for training individuals' health and self-defense. However, the actions of the body (torso, arms, legs) of a martial arts instructor are not always clear; there are many hidden joints.
In this paper, we surveyed and summarized the studies on 3-D human pose estimation in two methods: 3-D human pose estimation from a single image or from a sequence of images. Many studies on 3-D human pose estimation used the Human 3.6m dataset for training the estimation models and the MPJPE measure for evaluating the estimation errors. Studies from 2016 to 2018 have errors of about 80-150 mm and can be run with a single GPU. However, studies from 2019 have errors smaller than 80 mm, but the number of GPUs required for training and testing is greater than one.
We proposed a dataset of Vietnamese martial arts called "VNMA" and proposed a comparative study based on methods that use CNN models for estimating the 3-D human pose. In particular, the studies of 3-D human pose estimation restored the full skeleton even when joints are obscured.
In the future, we will build the full body in 3-D space with a mesh technique. From this, we will build 3-D videos of Vietnamese traditional martial arts, serving the storage, preservation, and teaching of martial arts.

A appendix: Source code
The source code for the method "3-D_VNECT_Method" is in the "VNect-tensorflow-master" folder. The description of the entire source code for this method is in the "README.md" file.
The results of the "3-D_COCO_Method" on the VNMA dataset are in the "Result_outdata_Human3._Input_COCO" folder. The results of the "3-D_Human 3.6_Method" on the VNMA dataset are in the "Result_ourdata_human3.6_liting" folder.
The results of the "3-D_VNECT_Method" on the VNMA dataset are in the "Result_ourdataset_VNect" folder.

B appendix: Dataset
The VNMA dataset includes 24 videos stored in the "Data_24_video" folder, where each video includes the color images, depth images, and point cloud data in 3-D space for each frame.
In order to synchronize the coordinate systems of the estimated 3-D human pose and the ground truth data, we have built source code to rotate and translate the estimated 3-D human pose data into the same coordinate system as the ground truth data, using the "calculate_coco.m", "calculate_matrix_14.m", and "estimateCoord_14.m" files in the "rotated_translated_14_points" folder.
"This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium provided the original work is properly cited (CC BY 4.0)."

Fig. 1 :
Fig. 1: Statistics of published studies on 3-D human pose estimation for each year.

Fig. 2 :
Fig. 2: Illustration of the method for 3-D human pose estimation [38]: the input is an RGB image; first, a 2-D pose is estimated, and then its depth is estimated by matching to a library of 3-D poses. The final prediction is given by the colored skeleton based on the 3-D pose library, while the ground truth is shown in gray.

Fig. 4 :
Fig. 4: Illustration of the detailed model to extract the features for training the model and to predict the heatmaps at each stage [48].

Fig. 6 :
Fig. 6: Comparative study for evaluating 2-D human pose estimation in the 3-D space.
In this dataset, we also provided the 3-D pose annotation. The ground truth data of the key points is marked in 3-D space.
© 2019 Journal of Advanced Engineering and Computation (JAEC)

Fig. 10 :
Fig. 10: The output of 3-D human pose estimation based on the method of Tome et al. [37].

Fig. 11 :
Fig. 11: Illustration of finding the rotation and translation matrices in 3-D space.

Fig. 12 :
Fig. 12: Illustration of the 3-D human poses for evaluating 3-D human pose estimation. The blue key points are the ground truth data; the red key points are the estimated data transformed into the same coordinate system.

Fig. 14 :
Fig. 14: The result of 2-D human pose estimation based on the method of Mehta et al. [36], 21 key points are predicted.

Fig. 15 :
Fig. 15: The results of 3-D human pose estimation. Each block is a pair of correspondences between the 3-D pose of the ground truth data (ground truth - original) and the estimated 3-D human pose (estimated). Each pair of frames in a block has been synchronized to the same coordinate system.

Fig. 16 :
Fig. 16: The results of 3-D human pose estimation, when some parts are obscured.
Tab. 5: The results of 3-D human pose estimation on the VNMA dataset with 15 key points, in MPJPE (mm). Columns: the number of key points; CPM trained on COCO; CPM trained on Human 3.6m; VNECT CNN trained on MPII and LSP.