
A multi-channel spatial information feature based human pose estimation algorithm

Abstract

Human pose estimation is an important task in computer vision that detects human key points and recovers skeletal information. At present, human pose estimation is mainly applied to large targets, while effective solutions for small-target detection are still lacking. This paper proposes a multi-channel spatial information feature based human pose estimation (MCSF-Pose) algorithm to address the inaccurate detection of human key points on medium and small targets in scenarios involving occlusion and multiple poses. The MCSF-Pose network is a bottom-up regression network. Firstly, an UP-Focus module is designed to expand the feature information while reducing parameter computation during up-sampling. Then, a channel-splitting strategy is adopted to partition the features, and feature information of multiple dimensions is retained through different convolution groups, which reduces the number of parameters for a lightweight network model and compensates for the feature information lost as the network deepens. Finally, a three-layer PANet structure is designed to reduce the complexity of the model while improving the detection accuracy and anti-interference ability for human key points. The experimental results indicate that the proposed algorithm outperforms YOLO-Pose and other human pose estimation algorithms on the COCO2017 and MPII human pose datasets.

Introduction

With the rapid progress of artificial intelligence technology, many computer vision techniques have become widely used in daily life, including human behavior recognition and human pose estimation. Human pose estimation has become a focus of academia and industry and is extensively applied in biometrics, kinematics and other fields.

Human pose estimation is a method to reconstruct the human body by locating key points in images or videos. While earlier methods used traditional graph structure models and deformable part models, current research focuses on the application of Convolutional Neural Networks (Kim et al. 2014; Jiaxin et al. 2021). Liu et al. (2023) provided an overall classification of human pose estimation networks from 2D and 3D perspectives and concluded that the current mainstream human pose estimation networks are divided into top-down and bottom-up approaches.

The top-down approach consists of two stages: in the first stage, a human target detector is employed to obtain the bounding box of each person, and in the second stage a single-person pose estimation method is utilized to locate the key points within each box. A cascaded pyramid network (Chen et al. 2018) was proposed in the COCO Challenge. This method utilized a feature pyramid (Lin et al. 2017) based object detector as the human bounding box detector, and then employed the proposed cascaded pyramid network for single-person pose estimation. Xiao et al. (2018) proposed a simple baseline network for human pose estimation, which achieved high accuracy. Riza Alp Guler et al. (2018) proposed the DensePose network based on Mask R-CNN (Xie et al. 2019; He et al. 2020), which maps 2D human images onto a 3D body surface by predicting the position and directional vector of each pixel on the surface of the human body. Du et al. (2021) proposed a simple 3D ResNets scaling strategy that combines 2D and 3D convolutional network structures to leverage the 3D ResNets architecture, so that richer spatiotemporal information is captured for recognition.

The bottom-up approach first detects all human key points in the image, and then the key points are assigned to individual bodies through different clustering and grouping strategies. The bottom-up approach does not depend on obtaining human bounding boxes and needs to be executed only once to obtain all the human poses in the image. Insafutdinov et al. (2016) proposed the more accurate and faster DeeperCut network, which introduced a deeper ResNet to improve the accuracy of body part detection. Cao et al. (2021) proposed a real-time multi-person pose estimation system, OpenPose, which utilized a multi-stage iterative convolutional neural network to estimate the poses of all individuals in an image in a single pass. Sun et al. (2019) introduced the High-Resolution Network (HRNet) for human pose estimation at high resolutions. This network comprised multiple stages, each dedicated to extracting low-resolution and high-resolution features that were subsequently fused across stages. Building upon HRNet, Cheng et al. (2020) introduced HigherHRNet, a more complex network structure that incorporated higher-resolution feature maps to improve the precision of key point localization and the accuracy of pose estimation. Yinghao and Bogo (2021) proposed a single-image 3D human pose estimation method based on key point estimation and mesh convolutional neural networks, which enables the estimation of three-dimensional key points from a single image. In contrast to bottom-up methods that require grouping to detect human key points, this network relied on heat-map regression for joint coordinate information, which introduced potential detection errors into its results. Maji et al. (2022) introduced a human pose estimation algorithm called YOLO-Pose, based on the YOLO network architecture (Redmon et al. 2016; Liu et al. 2022). YOLO-Pose diverges from traditional heat-map regression and instead regresses human key points from object detection boxes. However, this approach faces challenges associated with the diversity of human key points and the inherent characteristics of the YOLO network, such as potential missed detections.

In this paper, we analyze the current state of small and medium target detection in human key point detection. In order to further improve the quality of key point detection for small targets, a multi-channel spatial information feature based human pose estimation algorithm is proposed using YOLOv5 as the basic framework (Yang et al. 2020; Qi et al. 2022; Yan et al. 2021). The proposed algorithm enhances the YOLOv5 network structure and has the following advantages: (1) learning a broader range of channel and feature information, (2) reducing convolutional layer parameters to optimize network efficiency, and (3) facilitating faster network convergence. Through experimental comparisons, our proposed convolution groups yield a significant improvement in the accuracy of detecting human key points, particularly in scenarios involving occlusion and various irregular poses in medium to small-sized images.

MCSF-pose network architecture design

Overall network structure design

This paper adopts the YOLOv5s-Pose architecture as the backbone for feature extraction. In order to make the network model better adapt to human key point feature extraction, additional feature inputs are introduced into the backbone network to enhance feature diversity. Subsequently, the convolution groups of CSPNet are improved to enhance the network's capability of representing multi-dimensional features. Finally, detection efficiency is improved through an enhanced PANet (Liu et al. 2018; Ong et al. 2023; Chen et al. 2022) structure. With the above enhancements, a multi-channel spatial information feature based human pose network (MCSF-Pose) is proposed, as illustrated in Fig. 1.

Fig. 1 Multi-channel spatial information features based human posture network structure

Aiming at the weak association between features of different dimensions in the CSPNet convolution groups, a parallel learning model of attention information and spatial information is proposed in this paper. The multi-feature convolution group (Multi-ConvG) module integrates a channel attention module and a spatial information module. The resulting human key point detection network does not depend on an L1 loss; it belongs to the non-heat-map methods and realizes human key point estimation in an end-to-end manner.

Design of UP focus data expansion module

The Focus module in the YOLOv5 network transforms and adjusts the input feature map to better suit the object detection task. Its implementation uses a smaller receptive field and captures lower-level features, which enhances the model's perception of small-sized targets. In YOLOv5, the Focus layer performs a slicing operation on the input RGB image. This operation produces four feature maps of the same size, which are concatenated along the channel dimension. The concatenation increases the channel count by a factor of four, transforming the original three-channel RGB image into a 12-channel feature map. Subsequently, through convolutional operations, a feature map equivalent to a twice down-sampled map is obtained without losing information. This reconstruction from high to low resolution enlarges the receptive field of each pixel, so that the backbone network can extract more detailed feature information through a series of convolution operations and C3 modules composed of convolutions, as illustrated in Fig. 2.

Fig. 2 Focus module structure
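To make the slicing operation concrete, the following is a minimal PyTorch sketch of a Focus-style layer as described above; the channel width and kernel size are illustrative assumptions rather than the exact YOLOv5 configuration.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Focus-style slicing: every 2x2 pixel neighbourhood is split into four
    channel-wise slices, so a 3-channel 640x640 image becomes a 12-channel
    320x320 tensor, and a convolution then mixes the stacked slices."""
    def __init__(self, in_ch=3, out_ch=32, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch * 4, out_ch, k, stride=1, padding=k // 2)

    def forward(self, x):
        # take every other pixel in both spatial directions -> four slices
        sliced = torch.cat([x[..., ::2, ::2],    # top-left pixels
                            x[..., 1::2, ::2],   # bottom-left pixels
                            x[..., ::2, 1::2],   # top-right pixels
                            x[..., 1::2, 1::2]], # bottom-right pixels
                           dim=1)                # channels: 3 -> 12
        return self.conv(sliced)

x = torch.randn(1, 3, 640, 640)
print(Focus()(x).shape)  # torch.Size([1, 32, 320, 320])
```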

In this paper, the Focus module in YOLOv5 is redefined for data augmentation, as depicted in Fig. 3. Firstly, nearest-neighbor interpolation is performed to up-sample the input image, creating an image similar to the original at approximately twice the size. The advantage of this method is that it reduces the parameter computation of neural-network up-sampling while expanding the image features, so that the input data has stronger feature generalization ability. Subsequently, the Focus module slices the up-sampled image and concatenates the slices along the channel dimension. Finally, a 1 × 1 convolutional kernel is utilized to further expand the features, producing the input required by the network. This modification aims to improve the effectiveness of data augmentation in the YOLOv5 Focus module.

Fig. 3 UP-Focus module structure
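The UP-Focus module can be sketched by composing the parameter-free nearest-neighbour up-sampling, the Focus slicing above, and a 1 × 1 convolution. The code below is a hedged illustration of this pipeline under assumed channel widths, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UPFocus(nn.Module):
    """UP-Focus sketch: nearest-neighbour up-sampling (roughly doubling the
    image), Focus-style slicing/concatenation, then a 1x1 convolution that
    expands the features into the input required by the network."""
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        self.expand = nn.Conv2d(in_ch * 4, out_ch, kernel_size=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="nearest")  # parameter-free up-sampling
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.expand(x)

img = torch.randn(1, 3, 320, 320)
print(UPFocus()(img).shape)  # torch.Size([1, 32, 320, 320])
```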

Design of multi feature convolution group (Multi-ConvG) module

Low-dimensional features gradually diminish as network depth increases, yet they contain a large amount of feature information that can be described intuitively. After multiple layers of convolution, the typical features of a small target usually merge with surrounding features, so high-dimensional abstract features can no longer express the original individual features of the small target. This problem can be alleviated by retaining some low-dimensional feature information during the transfer of network parameters.

As illustrated in Fig. 4, CSPNet is a network module designed for deep feature extraction. CSPNet employs a strategy in which the input channels are repeatedly split and stacked: one portion undergoes feature extraction via a shallow convolutional path, while the other performs feature extraction through a deep convolutional path. The module then merges these two sets of features to enhance the network's generalization capability while reducing computation and the number of parameters. However, CSPNet overlooks the correlation of contextual information among features of different dimensions.

Fig. 4 CSP network structure

To address the correlation between multi-dimensional features and to better transmit low-dimensional information as the number of network layers gradually increases, this paper proposes the Multi-ConvG module, as shown in Fig. 5. The module can be divided into two groups: a dense convolution group and a global low-dimensional feature group. The input features are first divided equally into two parts along the channel dimension. One half serves as the input to the global low-dimensional feature group, which retains a large amount of original information through a small number of convolutions. The other half is fed into the dense convolution group. This group again splits its features evenly into two parts along the channel dimension. One part passes through a branch similar to the global low-dimensional feature group, which applies a small number of convolution operations to the input features; this branch is defined as the mid-dimensional feature branch. The other part passes through multiple 1 × 1 dense convolutions, and this branch obtains high-dimensional feature information. By concatenating the three groups of features along the channel dimension, the method effectively preserves a large amount of original information, and the mid-dimensional features serve as context information that associates the low-dimensional features with the high-dimensional abstract features.

Fig. 5 Multi-ConvG structure
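The following PyTorch sketch illustrates one plausible reading of the Multi-ConvG splitting scheme described above; the layer counts, kernel sizes and channel widths are assumptions made for illustration and not the authors' exact design.

```python
import torch
import torch.nn as nn

class MultiConvG(nn.Module):
    """Rough Multi-ConvG sketch: the input is split along the channel
    dimension; one half keeps low-dimensional features with a single light
    convolution, while the other half is split again into a mid-dimensional
    branch and a stack of dense 1x1 convolutions producing high-dimensional
    features. The three groups are concatenated and fused."""
    def __init__(self, channels, dense_layers=3):
        super().__init__()
        half, quarter = channels // 2, channels // 4
        self.low = nn.Conv2d(half, half, 3, padding=1)        # global low-dimensional group
        self.mid = nn.Conv2d(quarter, quarter, 3, padding=1)  # mid-dimensional branch
        self.high = nn.Sequential(*[nn.Conv2d(quarter, quarter, 1)
                                    for _ in range(dense_layers)])  # dense 1x1 group
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        a, b = x.chunk(2, dim=1)      # first channel split
        b1, b2 = b.chunk(2, dim=1)    # second split inside the dense group
        low, mid, high = self.low(a), self.mid(b1), self.high(b2)
        return self.fuse(torch.cat([low, mid, high], dim=1))

x = torch.randn(1, 64, 80, 80)
print(MultiConvG(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```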

The channel splitting method serves to reduce parameter calculations in the network, facilitating the propagation of features with different semantics throughout the network. The computation formulas are outlined in Eqs. (1) to (8). Equations (1) to (4) detail the calculation of parameters during forward propagation, where \(X_{U}\) represents the overall output of the dense convolution group and \(X_{F}\) represents the combined output of the dense convolution group and the global low-dimensional feature group.

$$X_{k} = W_{k} *[X_{0^{\prime\prime}} ,X_{1} ,...,X_{k - 1} ]$$
(1)
$$X_{T} = W_{T} *[X_{0^{\prime\prime}} ,X_{1} ,...,X_{k - 1} ,X_{k} ]$$
(2)
$$X_{U} = W_{U} *[X_{0^{\prime}} ,X_{T} ]$$
(3)
$$X_{F} = W_{F} *[X_{{{\text{low}}}} ,X_{U} ]$$
(4)

Equations (5) to (8) describe the gradient calculations of parameters during the backward propagation process, where variables \({W}_{F}{\prime}\) and \({W}_{U}{\prime}\) do not exhibit any dependency relationship when computing gradients. The Multi-ConvG network structure truncates features during the propagation process, preventing excessive repetition of gradient information and thus reducing the computational load of network parameters.

$$W_{k}^{\prime} = f(W_{k} ,g_{0^{\prime\prime}} ,g_{1} ,g_{2} ,...,g_{k - 1} )$$
(5)
$$W_{T}^{\prime} = f(W_{T} ,g_{0^{\prime\prime}} ,g_{1} ,g_{2} ,...,g_{k} )$$
(6)
$$W_{U}^{\prime} = f(W_{U} ,g_{0^{\prime}} ,g_{T} )$$
(7)
$$W_{F}^{\prime} = f(W_{F} ,g_{low} ,g_{U} )$$
(8)

Design of the improved PANet structure

Path Aggregation Network is a module designed for multi-scale feature fusion, aiming to enhance object detection performance, particularly for small objects. YOLOv5 incorporates a feature pyramid at the top of the network to extract multi-scale semantic information from different levels of feature maps. Subsequently, through cascading and aggregation operations, these multi-scale feature maps are fused to obtain rich contextual information and semantic expression. By leveraging the PANet module, YOLOv5 effectively merges feature maps from different levels, which can enable the network to handle targets of varying scales and extract more semantically informative feature representations. This multi-scale feature fusion contributes to improving the accuracy and robustness of object detection, especially for small object detection and localization. The network structure is illustrated in Fig. 6, where the network performs predictions for targets of four different scales through four down-sampling layers.

Fig. 6 PANet network structure

In the PANet network, each layer corresponds to a detection head. With a 640 × 640 input, four down-sampling layers may compress human features into a single pixel, losing a large amount of detailed feature information and leading to lower detection accuracy and poor network convergence. Therefore, the original four-layer PANet structure is not applied in this paper; a three-layer structure is adopted instead, making the network model lighter by removing one down-sampling layer, as depicted in Fig. 7. The backbone is divided into three stages, and \(n\), \(m\) and \(i\) represent the number of Multi-ConvG modules within each stage. During feature transfer between stages, the up-sampled result from the deeper stage is combined with the current stage's output, and the combined result is then used to calculate the losses for bounding box detection and human key points.

Fig. 7 Stage structure diagram of the improved PANet
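As a rough illustration of the three-stage fusion, the toy module below up-samples each deeper stage and concatenates it with the shallower stage's output before the detection heads; the channel widths and the 1 × 1 reduction convolutions are assumptions made only for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeStageFusion(nn.Module):
    """Toy three-layer top-down fusion: each deeper stage is up-sampled and
    concatenated with the shallower stage before a detection head."""
    def __init__(self, c3=128, c4=256, c5=512):
        super().__init__()
        self.reduce5 = nn.Conv2d(c5, c4, 1)       # match channels before fusing with stage 4
        self.reduce4 = nn.Conv2d(c4 * 2, c3, 1)   # match channels before fusing with stage 3

    def forward(self, p3, p4, p5):
        f5 = p5
        f4 = torch.cat([p4, F.interpolate(self.reduce5(f5), scale_factor=2, mode="nearest")], dim=1)
        f3 = torch.cat([p3, F.interpolate(self.reduce4(f4), scale_factor=2, mode="nearest")], dim=1)
        return f3, f4, f5  # one detection head per fused scale

p3, p4, p5 = (torch.randn(1, 128, 80, 80),
              torch.randn(1, 256, 40, 40),
              torch.randn(1, 512, 20, 20))
print([t.shape for t in ThreeStageFusion()(p3, p4, p5)])
```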

Loss function design

The loss function in this algorithm primarily consists of the position loss \({L}_{\text{box}}\) between target and predicted bounding boxes, the key point loss \({L}_{\text{kpts}}\) for human key points, the classification loss \({L}_{\text{cls}}\), and the total loss \({L}_{\text{total}}\). The CIoU loss function is employed to calculate the position loss between target and predicted boxes, as shown in Eq. (9).

$$L_{{\text{box }}} (s,i,j,k) = \left( {1 - {\text{CIoU}} \left( {{\text{Box}}_{{{\text{gt}}}}^{s,i,j,k} ,{\text{ Box}}_{{{\text{pred}}}}^{{{\text{s}},i,j,k}} \, } \right)} \right)$$
(9)

To enhance the regression speed of the predicted boxes and the convergence speed of the network, the Object Keypoint Similarity (OKS) loss function is used to compute the position loss for human key points. In Eq. (9), \(\text{Box}_{\text{gt}}^{s,i,j,k}\) represents the ground-truth box and \(\text{Box}_{\text{pred}}^{s,i,j,k}\) represents the predicted box at position (i, j). The OKS is defined in Eq. (10).

$$OKS = \frac{{\sum\limits_{n} {\exp \left( { - \frac{{d_{n}^{2} }}{{2s^{2} k_{n}^{2} }}} \right)\delta (v_{n} > 0)} }}{{\sum\limits_{n} {\delta (v_{n} > 0)} }}$$
(10)

The OKS function evaluates the quality of a predicted human pose. It measures the similarity between predicted and ground-truth key points, primarily assessing the Euclidean distance between them together with scale and area factors. Here, \({d}_{n}^{2}\) represents the squared Euclidean distance between the prediction and label of the n-th key point, \({s}^{2}\) is the area of the human detection bounding box, \({k}_{n}^{2}\) is the position weight for this key point in the dataset, and \({v}_{n}>0\) indicates that the key point is labelled as visible.
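A minimal NumPy sketch of the OKS computation in Eq. (10) is given below; the keypoint coordinates, visibility flags and per-keypoint weights in the toy usage are arbitrary illustrative values.

```python
import numpy as np

def oks(pred, gt, visibility, area, k):
    """OKS from Eq. (10): pred and gt are (N, 2) keypoint coordinates,
    visibility holds the v_n flags, area is the bounding-box area s^2,
    and k is the vector of per-keypoint weights k_n."""
    d2 = np.sum((pred - gt) ** 2, axis=1)        # squared Euclidean distances d_n^2
    e = np.exp(-d2 / (2 * area * k ** 2))        # per-keypoint similarity terms
    vis = visibility > 0
    return e[vis].sum() / max(vis.sum(), 1)      # average over labelled keypoints

# toy usage with three keypoints and arbitrary values
pred = np.array([[10.0, 12.0], [30.0, 31.0], [50.0, 49.0]])
gt   = np.array([[10.0, 10.0], [30.0, 30.0], [52.0, 50.0]])
print(oks(pred, gt, np.array([2, 2, 1]), area=900.0, k=np.array([0.079, 0.072, 0.062])))
```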

$$L_{{kpts_{ - } conf}} (s,i,j,k) = \mathop \sum \limits_{n = 1}^{{N_{kpts} }} BCE\left( {\delta \left( {v_{n} > 0} \right),p_{kpts}^{n} } \right)$$
(11)

Equation (11) defines the confidence loss for human key points, where \({p}_{kpts}^{n}\) represents the confidence of the n-th predicted human key point. The expression for the classification loss function is given in Eq. (12).

$$L_{{{\text{cls}}}} (p,y) = - \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} [y\log p + (1 - y)\log (1 - p)]$$
(12)

The total loss is computed by combining the bounding-box loss, the human key point losses and the classification loss, as shown in Eq. (13). Here, \({\lambda }_{cls}\), \({\lambda }_{box}\), \({\lambda }_{kpts}\) and \({\lambda }_{kpt{s}_{-}conf}\) are the weighting coefficients of the individual losses, set to \({\lambda }_{cls}=0.5\), \({\lambda }_{box}=0.05\), \({\lambda }_{kpts}=0.01\) and \({\lambda }_{kpt{s}_{-}conf}=0.5\), respectively.

$$L_{{{\text{total}}}} = \mathop \sum \limits_{s,i,j,k} \left( {\lambda_{cls} {\mathcal{L}}_{cls} + \lambda_{box} {\mathcal{L}}_{{{\text{box}}}} + \lambda_{kpts} {\mathcal{L}}_{kpts} + \lambda_{{kpts_{ - } conf}} {\mathcal{L}}_{{kpts_{ - } conf}} } \right)$$
(13)
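For clarity, the weighted combination in Eq. (13) at a single grid location can be written as a small helper; the loss values in the example call are arbitrary.

```python
def total_loss(l_cls, l_box, l_kpts, l_kpts_conf,
               w_cls=0.5, w_box=0.05, w_kpts=0.01, w_kpts_conf=0.5):
    """Weighted sum of Eq. (13), using the coefficient values quoted above."""
    return (w_cls * l_cls + w_box * l_box
            + w_kpts * l_kpts + w_kpts_conf * l_kpts_conf)

print(total_loss(0.8, 1.2, 2.0, 0.6))  # toy per-location loss values -> 0.78
```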

Experiment and result analysis

Datasets

This paper conducts model training and evaluation on the COCO2017 and MPII datasets. The COCO dataset, originally created by Microsoft for image segmentation and detection research, later added annotations for human key points. It comprises over 200,000 images and 250,000 person instances, as depicted in Fig. 8. The annotations cover 17 key point locations, including the nose, ankles, elbows, shoulders, wrists, knees, hips, ears and eyes. The train2017 training set contains 57,000 images, and the val2017 validation set and test-dev2017 test set contain 5,000 and 20,000 images, respectively. The model is trained on train2017 and evaluated on both val2017 and test-dev2017.

Fig. 8 COCO2017 dataset

The MPII dataset consists of 25,000 images, containing annotations for 40,000 human instances. Each image is labeled with information such as the head, eyes, ears, shoulders, arms, legs, and other body parts, as illustrated in Fig. 9. The test set also includes challenging data with occlusions of body parts and 3D torso annotations.

Fig. 9 MPII dataset

Evaluation metric

This paper primarily evaluates the model using the Average Precision (AP) metric, as calculated in Eq. (14).

$$AP = \frac{{\sum\limits_{n = 1}^{N} {\frac{TP}{{TP + FP}}} }}{N}$$
(14)

In the equation, TP is the number of positive samples correctly predicted as positive, and FP is the number of negative samples incorrectly predicted as positive. The mAP averages the AP values over all predictions, as shown in Eq. (15).

$$mAP = \frac{{\sum\limits_{n = 1}^{N} {AP} }}{N}$$
(15)

The AP threshold T is taken from the interval [0.5, 0.75], and the average value across all APs is used as the mAP. For example, AP75 is the metric calculated with the threshold set to 0.75, and AP50 is calculated analogously with a threshold of 0.5. A key point is only counted as correctly detected when its OKS exceeds the specified threshold; otherwise, it is counted as a missed detection or a false positive. \(A{P}^{M}\) and \(A{P}^{L}\) represent the detection accuracy for key points of medium-sized and large human targets, respectively.
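The thresholding step behind these metrics can be illustrated with a simplified sketch; note that the real COCO evaluation also matches predictions to ground truths and sweeps over recall, which this toy version omits.

```python
import numpy as np

def average_precision(oks_scores, thresholds=(0.5, 0.75)):
    """Count a predicted pose as a true positive when its OKS exceeds the
    threshold, take the precision at each threshold, and average them."""
    oks_scores = np.asarray(oks_scores)
    precisions = [(oks_scores >= t).mean() for t in thresholds]
    return float(np.mean(precisions))

print(average_precision([0.92, 0.81, 0.66, 0.43]))  # 0.625
```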

Experimental environment and parameter settings

Server environment configuration: the operating system is Ubuntu, the deep learning framework is the GPU version of PyTorch 1.9.1, the CPU is an Intel Xeon Platinum 8168, the main memory is 512 GB, and the GPUs are two NVIDIA Tesla V100 cards with 16 GB of memory each.

Training phase: data augmentation strategies are employed to enhance the model's generalization ability. Random scaling and probabilistic flipping are applied to preprocess the training data. Image standardization and normalization are performed, and predictions are produced from the feature maps of the last layer of the network. The Adam optimizer is used with an initial learning rate of 1e-3; after 100 epochs the learning rate is reduced to 1e-4, and after 200 epochs to 1e-5. The batch size is set to 32. By 300 epochs the model has converged, and the refined MCSF-Pose network exhibits faster convergence and higher final accuracy.
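A minimal sketch of the described optimizer and learning-rate schedule is shown below; the stand-in model and the use of MultiStepLR are assumptions made for illustration.

```python
import torch

def make_optimizer(model):
    """Adam with an initial learning rate of 1e-3, reduced to 1e-4 after
    100 epochs and to 1e-5 after 200 epochs, as described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 200], gamma=0.1)
    return optimizer, scheduler

model = torch.nn.Conv2d(3, 16, 3)  # stand-in model for the sketch
optimizer, scheduler = make_optimizer(model)
for epoch in range(300):
    # ... forward pass, loss.backward() and optimizer.step() would go here ...
    scheduler.step()
```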

Testing phase: the dimensions of the input images are first adjusted while maintaining the aspect ratio, and the undersized regions are padded so that all input images have uniform dimensions.
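The aspect-ratio-preserving resize with padding can be sketched as follows; the target size and padding value are assumptions, since the paper does not specify them.

```python
import numpy as np
import cv2

def letterbox(img, size=640, pad_value=114):
    """Resize the longer side to `size` while keeping the aspect ratio,
    then pad the undersized region so every input has uniform dimensions."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.full((size, size, 3), pad_value, dtype=img.dtype)
    canvas[:resized.shape[0], :resized.shape[1]] = resized
    return canvas, scale

img = np.zeros((480, 640, 3), dtype=np.uint8)
padded, scale = letterbox(img)
print(padded.shape, scale)  # (640, 640, 3) 1.0
```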

Experimental results

Several experiments are performed on the COCO2017 and MPII validation datasets to demonstrate the effectiveness of the proposed method. To ensure an objective comparison, the results of the mainstream benchmark algorithms in the tables are taken from the parameters reported in the original papers. Table 1 compares the precision of the proposed algorithm with mainstream algorithms on the COCO2017 dataset.

Table 1 Comparison of AP results between the proposed algorithm and other algorithms on COCO2017 dataset

In terms of AP, our algorithm has clear advantages over existing human pose estimation algorithms. Specifically, compared to YOLOv5s-Pose, our algorithm improves AP by 6.3%; compared to CPN (Chen et al. 2018), it improves AP by 0.6% while reducing the number of parameters by 83%. Compared with AlphaPose (Fang et al. 2022), our algorithm improves AP by 7.4%. Compared with PoseNet-152 (Papandreou et al. 2018), whose deep network structure has large computational overhead and demanding requirements on the experimental environment, our algorithm requires only a three-layer network structure. Compared with DensePose (Guler et al. 2018), our algorithm improves AP by 16.4% and AR by 8.8%. Compared with OpenPifPaf (Kreiss et al. 2021), our algorithm improves APM by 6.8%. In the detection of small and medium-sized targets, the APM accuracy of our algorithm is higher than that of the other mainstream algorithms, and its AR is 1.6% higher than that of YOLOv5s-Pose. On AP75, our algorithm is 3.1% better than YOLOv5s-Pose. The experimental results show that the algorithm significantly improves the accuracy and precision of key point detection.

Table 2 shows the experimental comparison of our algorithm with YOLOv5s-Pose, AlphaPose and DeeperCut for human pose estimation on the MPII dataset. Our algorithm is superior to YOLOv5s-Pose on every AP index. The AP of AlphaPose on the MPII dataset is 76.7%, and the AP of DeeperCut is 70%; the AP of our algorithm is higher than both.

Table 2 Comparison of AP results between the proposed algorithm and YOLOv5s-Pose on dataset MPII

From the above comparisons, we found that the MCSF-Pose network retains more low-dimensional feature information through the dedicated Multi-ConvG module, which describes the features of small and medium-sized targets well. This design makes our algorithm superior to the other algorithms in the task of small and medium target detection. At the same time, compared with PoseNet-152 and other deep networks, our algorithm removes unnecessary network layers and fully extracts low-, mid- and high-dimensional feature information through the Multi-ConvG module, which greatly reduces parameter computation. By operating end-to-end, the Multi-ConvG network avoids the dependence on an L1 loss that non-end-to-end networks have in human pose detection tasks, and the simultaneous detection of human bounding boxes and human key points further reduces the computation in the network and improves training efficiency.

To analyze the influence of each module on the model, we conducted ablation experiments on the UP-Focus module, the Multi-ConvG module and the optimized PANet on the COCO2017 dataset. Three-layer and four-layer PANet configurations were compared, and S denotes the number of iterations of Multi-ConvG modules in each stage, as shown in Table 3. By combining the three groups of modules separately, it can be seen that in experiment 3, when the UP-Focus module is used to expand the image features, the best overall performance is obtained with the three-layer network structure and three iterations of the Multi-ConvG module.

Table 3 Ablation experiments

To validate the real-time performance of the proposed algorithm, the speed of our MCSF-Pose algorithm is compared with other mainstream human key point detection algorithms. The detection speeds are summarized in Table 4.

Table 4 Frame rate comparison between the proposed algorithm and other algorithms

The algorithm proposed in this paper achieves a detection rate of 40 frames per second, representing a significant improvement compared to YOLOv5l-Pose. In comparison to YOLOv5s-Pose, our algorithm lags by only two frames per second in terms of recognition speed, while exhibiting a noticeable enhancement in accuracy based on the previous experimental comparisons.

The experimental results of our proposed algorithm and YOLOv5s-Pose are illustrated in Figs. 10 and 11, respectively. Figure 10 depicts the human pose estimation results of the two algorithms on the COCO2017 validation set; the first row displays the detection results of our algorithm and the second row shows the results of YOLOv5s-Pose. In Fig. 10a, our algorithm detects accurately even in low-light conditions and avoids misidentifications. Figure 10b shows that our algorithm successfully detects all targets, while Fig. 10c highlights the higher accuracy of our algorithm compared to YOLOv5s-Pose.

Fig. 10 Comparison of detection results between the proposed algorithm and the YOLOv5s-Pose algorithm on the COCO2017 dataset: (a) comparative results under dim conditions, (b) comparative results of key point detection, and (c) comparative results of accuracy

Fig. 11 Comparison of detection results between the proposed algorithm and the YOLOv5s-Pose algorithm on the MPII dataset: (a) comparative results of accuracy, (b) comparative results of occlusion key point detection, and (c) comparative results of key point detection accuracy

Figure 11 presents the validation results for human pose estimation on the MPII dataset, with the first row showing the results of our algorithm and the second row displaying the results of YOLOv5s-Pose. As depicted in Fig. 11a, our algorithm exhibits higher recognition accuracy than YOLOv5s-Pose. In Fig. 11b, our algorithm successfully identifies the positions of individuals' eyes. As shown in Fig. 11c, YOLOv5s-Pose shows significant deviation in recognizing the key points on the left side of the human body, whereas our algorithm identifies these key points accurately. The Multi-ConvG module in our network retains more low-dimensional information, ensuring that both low- and high-dimensional information participate in the loss calculation during key point regression. These comparative experiments on both datasets confirm that our algorithm produces superior results, with better handling of details and lower false positive rates.

To further compare the proposed algorithm with YOLO-Pose, we ran multiple groups of experiments in different scenes and selected two representative groups for analysis. The first group tests whether the algorithm can accurately identify human key points when the body is severely occluded. The second group tests whether human key points can be accurately identified when the edges of the human body are similar to the background. We detect the human key points in the two groups of videos and save the detection results one frame apart.

The comparison of the first group of experiments is shown in Fig. 12, where the first row shows the MCSF-Pose detection results and the second row shows the YOLO-Pose detection results. It can be seen that the YOLO-Pose algorithm produces erroneous detections in three consecutive frames, and in the first two frames it fails to detect the correct human key points.

Fig. 12 Key point detection of human body under occlusion state

The comparison of the second group of experiments is shown in Figs. 13 and 14. In Fig. 13, the first row shows the MCSF-Pose detection results and the second row shows the YOLO-Pose detection results. The algorithm in this paper accurately detects the key points of medium and small human targets.

Fig. 13 Key point detection of small and medium-sized target human body

Fig. 14 Human key point detection under the condition of blurred human edge

To show more clearly how the retention of multi-dimensional features by the Multi-ConvG module benefits our algorithm, Fig. 14 presents a scene in which the color of the human body on the left side of the image is similar to the background; the first row shows the MCSF-Pose detection results and the second row shows the YOLO-Pose detection results. Our algorithm identifies the human key points more quickly, whereas YOLO-Pose fails to correctly detect the key points on the left side of the body. From the above comparisons, the detection accuracy and detection speed of our algorithm are better than those of YOLO-Pose.

Results analysis

We compare the algorithm proposed in this paper with current mainstream algorithms in several groups of experiments. The experimental results show clear improvements on the COCO2017 and MPII datasets, with remarkable progress especially in detection accuracy for small targets. In addition, the three-layer network structure of MCSF-Pose reduces the complexity of the model and makes it more lightweight, with far fewer parameters than other algorithms.

The ablation experiments show that the improvement in human key point detection accuracy of the MCSF-Pose network comes from the UP-Focus module and the Multi-ConvG module. The UP-Focus module expands the input information, enabling the network to learn more feature information and making the model more generalizable. The Multi-ConvG module obtains more low-dimensional feature information and captures rich information about human posture more comprehensively, which helps to address the difficulty that single-dimensional features in traditional methods cannot cover the diversity and complexity of poses. Moreover, the multi-dimensional feature convolution group improves the correlation between features of different dimensions and effectively alleviates the loss of low-dimensional information caused by stacking layers in a convolutional neural network. It therefore performs well in detecting small human key points.

Because the MCSF-Pose network uses Multi-ConvG modules, its FPS is slightly lower than that of YOLOv5s-Pose, but the optimized network structure still makes MCSF-Pose faster than YOLOv5l-Pose and other network models.

Conclusion

This paper introduces a multi-channel spatial information feature based human pose estimation algorithm, which achieves significant performance advantages in human pose estimation through innovative designs such as feature fusion and the multi-dimensional feature convolution group. Firstly, feature fusion enables the network to acquire more low-dimensional feature information and to comprehensively capture rich information about human poses, addressing the difficulty that single-dimensional features in traditional methods cannot cover the diversity and complexity of poses. Secondly, the multi-dimensional feature convolution group improves the correlation between features of different dimensions, effectively alleviating the loss of low-dimensional information caused by stacking layers in convolutional neural networks. Furthermore, optimization of parameter memory usage reduces model complexity, promotes a lightweight model and enhances robustness. The proposed algorithm not only strengthens the ability to model poses but also improves the accuracy and robustness of pose estimation. Finally, comparison experiments between the proposed model and traditional models demonstrate that the MCSF-Pose model achieves significant performance improvements, particularly in addressing challenges related to small targets and detection accuracy.

Availability of data and materials

All data generated or analysed during this study are included in this published article. The COCO2017 data have been deposited in the COCO Keypoint Detection Task [https://cocodataset.org/#keypoints-2017]. The MPII data have been deposited in the MPII Human Pose Dataset [http://human-pose.mpi-inf.mpg.de/#overview]. Requests for material should be made to the corresponding authors.

References

  • Cao Z, Hidalgo G, Simon T, Wei S-E, Sheikh Y (2021) OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans Pattern Anal Mach Intell 43(1):172–186. https://doi.org/10.1109/TPAMI.2019.2929257


  • Chen X, Zhao Y, Qin Y et al (2022) PANet: perspective-aware network with dynamic receptive fields and self-distilling supervision for crowd counting. SSRN Electron J. https://doi.org/10.2139/ssrn.4194723


  • Chen Y, Wang Z, Peng Y, et al. (2018) Cascaded pyramid network for multi-person pose estimation. In: IEEE/CVF conference on computer vision and pattern recognition. https://doi.org/10.48550/arXiv.1711.07319

  • Cheng B, Xiao B, Wang J, et al. (2020) HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), Seattle, WA, USA. 2020. https://doi.org/10.1109/cvpr42600.2020.00543.

  • Du X, Li Y, Cui Y, et al. (2021) Revisiting 3D ResNets for video recognition. Comput Vis Pattern Recogn

  • Fang HS, Li J, Tang H et al (2022) Alphapose: whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans Pattern Anal Mach Intell 45:7157


  • Guler RA, Neverova N, Kokkinos I (2018) DensePose: dense human pose estimation in the wild. In: 2018 IEEE/CVF conference on computer vision and pattern recognition, Salt Lake City, UT, USA. https://doi.org/10.1109/cvpr.2018.00762

  • He K, Gkioxari G, Dollar P et al (2020) Mask R-CNN. IEEE Trans Pattern Anal Mach Intell 2020:386–397. https://doi.org/10.1109/tpami.2018.2844175


  • Yinghao H, Bogo F, Lassner C (2021) Single image 3D human pose estimation via keypoint estimation and mesh convolutional neural networks

  • Insafutdinov E, Pishchulin L, Andres B et al. (2016) Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI 14. Springer International Publishing, pp 34–50

  • Jiaxin Y, Fang W, Jieru Y (2021) A review of action recognition based on convolutional neural network[J/OL]. J Phys: Conf Ser 1827(1):012138. https://doi.org/10.1088/1742-6596/1827/1/012138


  • Kim Y (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882

  • Kreiss S, Bertoni L, Alahi A (2021) Openpifpaf: Composite fields for semantic keypoint detection and spatio-temporal association. IEEE Trans Intell Transp Syst 23(8):13498–13511


  • Lin TY, Dollar P, Girshick R, et al. (2017) Feature pyramid networks for object detection. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, HI. https://doi.org/10.1109/cvpr.2017.106

  • Liu W, Qian Bao Y, Sun TM (2023) Recent advances of monocular 2D and 3D human pose estimation: a deep learning perspective. ACM Comput Surv 55(4):1–41. https://doi.org/10.1145/3524497


  • Liu S, Qi L, Qin H, et al. (2018) Path Aggregation network for instance segmentation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT. https://doi.org/10.1109/cvpr.2018.00913.

  • Liu W, Ren G, Yu R, et al. (2022) Image-Adaptive YOLO for object detection in adverse weather conditions. In: Proceedings of the AAAI conference on artificial intelligence, pp 1792–1800. https://ojs.aaai.org/index.php/AAAI/article/view/20072. DOI:https://doi.org/10.1609/aaai.v36i2.20072.

  • Maji D, Nagori S, Mathew M, et al. (2022) YOLO-Pose: enhancing YOLO for multi person pose estimation using object keypoint similarity loss. DOI:https://doi.org/10.48550/arXiv.2204.06806.

  • Ong JC, Lau SL, Ismadi MZ et al (2023) Feature pyramid network with self-guided attention refinement module for crack segmentation. Struct Health Monitor 2023:672–688. https://doi.org/10.1177/14759217221089571


  • Papandreou G, Zhu T, Chen L C, et al. (2018) PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model. In: European conference on computer vision. Springer, Cham, DOI:https://doi.org/10.1007/978-3-030-01264-9_17.

  • Qi J, Liu X, Liu K et al (2022) An improved YOLOv5 model based on visual attention mechanism: application to recognition of tomato virus disease. Comput Electron Agric 2022:106780. https://doi.org/10.1016/j.compag.2022.106780


  • Redmon J, Divvala S, Girshick R, et al. (2016) You only look once: unified, real-time object detection. In: Computer Vision & Pattern Recognition. IEEE, DOI:https://doi.org/10.1109/CVPR.2016.91

  • Sun K, Xiao B, Liu D, et al. (2019) Deep high-resolution representation learning for human pose estimation. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), Long Beach, CA, USA. https://doi.org/10.1109/cvpr.2019.00584

  • Xiao B, Wu H, Wei Y (2018) Simple baselines for human pose estimation and tracking. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 466–481

  • Xie E, Zang Y, Shao S, Gang Y, Yao C, Li G (2019) Scene text detection with supervised pyramid context network. Proc AAAI Conf AI 33(01):9038–9045. https://doi.org/10.1609/aaai.v33i01.33019038


  • Yan B, Fan P, Lei X et al (2021) A real-time apple targets detection method for picking robot based on improved YOLOv5. Remote Sens 2021:1619. https://doi.org/10.3390/rs13091619


  • Yang G, Feng W, Jin J, et al. (2020) Face mask recognition system with YOLOV5 based on image recognition. In: 2020 IEEE 6th international conference on computer and communications (ICCC), Chengdu, China. 2020. https://doi.org/10.1109/iccc51575.2020.9345042.


Acknowledgements

Thanks to my mentor, Professor Yinghong Xie, for giving me endless inspiration. In the process of writing the paper, she helped me develop my research ideas and carefully taught me to overcome the problems encountered in the simulation process. Thanks to Professor Xiaowei Han for reviewing relevant literature for me. Thanks to Professor Qiang Gao for organizing the relevant data of the program for me. Thanks to Biao Yin for organizing relevant data and analyzing program results for me. Thanks again to everyone on the team.

Use of AI tools declaration

The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

Funding

This work was supported by the Liaoning Provincial Science and Technology Plan Project 2023JH2/101300205 and the Shenyang Science and Technology Plan Project 23-407-3-33.

Author information


Contributions

Yinghong Xie: Algorithm design, Algorithm feasibility analysis, Part of the thesis chapter writing, Paper review, Paper revision. Yan Hao: Algorithm design, Algorithmic programming, Part of the thesis chapter writing, Paper revision. Xiaowei Han: Literature search, Collate datasets. Qiang Gao: Data sorting, Literature search. Biao Yin: Image rendering, Data sorting.

Corresponding author

Correspondence to Yan Hao.

Ethics declarations

Competing interests

The authors declare there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Xie, Y., Hao, Y., Han, X. et al. A multi-channel spatial information feature based human pose estimation algorithm. Cybersecurity 7, 49 (2024). https://doi.org/10.1186/s42400-024-00248-2

