Ting LIU, Dongsheng LI
Institute of Modern Agricultural Equipment, Xihua University, Chengdu 610039, China
Abstract [Objectives] To explore a rapid detection method for sweet cherry fruits in a natural environment. [Methods] The cutting-edge YOLOv4 deep learning model was used. The YOLOv4 detection model was built on the CSPDarknet53 framework. A mosaic data enhancement method was used to expand the image dataset, and the model was evaluated on three occlusion situations (no occlusion, branch and leaf occlusion, and fruit overlap occlusion) and on images containing different numbers of sweet cherry fruits. [Results] In the three occlusion cases, the mean average precision (mAP) of the YOLOv4 algorithm was 95.40%, 95.23%, and 92.73%, respectively. Different numbers of sweet cherry fruits were detected and identified, and the average value of mAP was 81.00%. To verify the detection performance of the YOLOv4 model for sweet cherry fruits, the model was compared with YOLOv3, SSD, and Faster-RCNN. The mAP of the YOLOv4 model was 90.89% and the detection speed was 22.86 f/s. The mAP was 0.66, 1.97, and 12.46 percentage points higher than those of the other three algorithms, respectively. The detection speed met the actual production needs. [Conclusions] The YOLOv4 model is valuable for picking and identifying sweet cherry fruits.
Key words YOLOv4, Deep learning, Target detection, Sweet cherry
With the constant increase in the scale, intensification, and precision of China’s agricultural production, the demand for intelligent and automated agricultural equipment is also increasing rapidly[1]. However, most fruit harvesting still depends on manual picking, which is costly and labor intensive. Sweet cherries, also known as cherries or big cherries, are favored by consumers because of their taste. According to the Food and Agriculture Organization, China’s sweet cherry cultivation area in 2019 was about 268 800 ha, with an annual output of 2.196 million t, which brought considerable economic benefit to China[2-4]. However, because sweet cherry fruits have a short maturity period, a large amount of labor is required for picking. With the acceleration of urban development in China, the number of people engaged in agriculture is decreasing, and sweet cherry harvesting faces labor shortages and increasing picking costs. Automated picking can help to solve these problems, and target detection is one of its key challenges. A fast and efficient target detection method can therefore improve the efficiency of automated cherry picking, in line with the automation of China’s agricultural production and the development need for intelligent agricultural equipment.
The fast and accurate detection of sweet cherry fruits in a natural environment is the key to automatic cherry picking. The recognition of cherry fruits against a complex background of fruit trees is affected by many factors: overlap and occlusion between fruits, occlusion by leaves, and complex changes in natural light and environment. At present, there are two main types of automated target detection methods: traditional methods and deep learning methods[5].
The traditional target detection method combines image processing and machine learning. It generally involves the following steps: (i) the image is pre-processed, (ii) the target features are extracted, and (iii) a classifier trained on these features is used to achieve target detection. Extensive studies have been carried out on traditional approaches to the target detection of crops, including apples, citrus fruits, kiwis, lychees, tomatoes, and strawberries[6-13]. Cui Yongjie et al.[8] used the Canny operator to extract edges and the Hough transform to identify kiwis, achieving a recognition rate of 96.9%. Xiong Juntao et al.[9] used fuzzy C-means clustering to identify lychees, with a recognition rate of 95.5%. Si Yongsheng et al.[11] used genetic algorithms to identify apples, with a recognition rate of 97%. Tests of an apple detection system[14] showed that it could enable robotic apple picking. Fu Longsheng et al.[15] used RGB-D (Red, Green, Blue-Depth Map) sensors to detect and locate fruits to improve the positioning accuracy of robotic picking. However, traditional target detection depends strongly on human subjectivity, and its universality and robustness in natural environments are poor.
Deep learning technology has been proposed for target detection in recent years. Its application has advanced the key technologies relevant to automatic fruit harvesting and has greatly improved fruit recognition[16]. Deep learning methods first train a model on labeled target images and then recognize and detect targets in unlabeled images. There have been extensive studies on the use of deep learning methods to detect apples, citrus fruits, grapefruits, dragon fruits, tomatoes, kiwis, and mangoes[17-33]. Liu Fang et al.[19] used an improved YOLOv3 model to identify strawberries in complex environments; the mAP (mean average precision) was 87.51%, and the recognition speed was about 34.99 ms/frame[21]. Chen Yan et al.[22] improved the YOLOv3 model to recognize lychees. The mAP was 0.94%, and the average detection speed reached 22.11 frames/s. Xiong Juntao et al.[33] used the Faster-RCNN model to identify green citrus, and achieved an mAP of 85.49% and a recognition speed of 0.4 s/frame. To find a new cherry grading method to replace the traditional method, Momeny M et al.[34] used a convolutional neural network (CNN) to detect the appearance of cherries to facilitate their classification and improve their exportability. Compared with traditional target detection, deep learning technology can better obtain target identification features and has good universality and robustness, which makes it promising for sweet cherry fruit detection.
The You Only Look Once (YOLO) network is a general one-stage target detection algorithm. Using a single convolutional neural network to process images, it can classify multiple targets and predict their position coordinates. It runs fast and offers real-time detection[35-36]. The most widely used YOLO models are YOLOv3 and YOLOv4. The YOLO model has not been extensively applied to fruit recognition. An improved YOLOv3 performed better than the original YOLOv3 algorithm for tomato detection[37]. Kuznetsova et al.[38] used pre-processing and post-processing to adapt the YOLOv3 algorithm to apple detection; the approach met the detection time requirement of the robot and greatly improved the convenience of robot harvesting. Experiments also showed that the algorithm could reduce the average detection time, and the error rate was low. The emergence of the YOLOv4 model improved the accuracy of the recognition of occluded objects[39]. The YOLOv4 model performs better than Faster R-CNN, YOLOv2, YOLOv3, and other algorithms, and can achieve accurate and real-time detection of apple flowers.
In cherry identification, the occlusion of fruits by foliage and by other fruits, together with the presence of multiple fruits, results in relatively low recognition accuracy. To address this issue, we used the YOLOv4 model to detect sweet cherry fruits in a natural environment and to provide technical support for the picking of high-quality cherries. The performance of three deep learning algorithms (YOLOv3, SSD (Single Shot MultiBox Detector), and Faster-RCNN) was compared with that of the YOLOv4 model to explore the practicality of the YOLOv4 model for use in the natural environment.
2.1 Image acquisition
We captured images of sweet cherries of the "Hongdeng" variety on May 9, 2021 and May 11, 2021 at a sweet cherry plantation in Qingxi Town, Ya’an City, Sichuan Province, using a Meizu mobile phone. The images were saved as 3 000×4 000 pixel RGB images. The images were taken under sunny, cloudy, and rainy conditions, from many different angles, including horizontally, upwards, and overhead. The pictures covered various complicated situations such as occlusion resulting from branches and leaves, overlapping fruits, and multiple fruits. After discarding blurry and overexposed images, 740 experimental images were retained, of which 500 images were used as a training set and 240 images were used as a test set.
2.2 Image enhancement
To enrich the background of the target object and obtain enough training images, we randomly selected images in the training set for data enhancement, and the final training set was increased to 800 images. The enhancement methods included mirroring, rotation, contrast enhancement, color enhancement, brightness change, and random transformation of color, as shown in Fig.1. To prevent overfitting and ensure the reliability of the training results, we used some sample images from the training set as the verification set. Ultimately, we divided the 800 images into verification and training sets at a ratio of 1∶9.
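As an illustration of the enhancement and splitting steps described above, the following is a minimal sketch in plain Python. It is not the authors' pipeline: the image is represented as nested lists of RGB tuples, only two of the listed transforms are shown, and the function names, file names, brightness factor, and random seed are all hypothetical.

```python
import random

def mirror_horizontal(image):
    """Flip an image (list of rows of (r, g, b) tuples) left to right."""
    return [list(reversed(row)) for row in image]

def adjust_brightness(image, factor):
    """Scale every channel by `factor`, clamping the result to 0-255."""
    return [[tuple(min(255, max(0, round(c * factor))) for c in px) for px in row]
            for row in image]

def split_train_val(items, val_fraction=0.1, seed=0):
    """Shuffle the items and split them into (training, verification) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# Tiny 1x2 "image" used only to demonstrate the transforms.
tiny = [[(10, 20, 30), (200, 210, 220)]]
mirrored = mirror_horizontal(tiny)
brighter = adjust_brightness(tiny, 1.5)  # 1.5 is an assumed factor

# 800 hypothetical file names standing in for the enlarged training set.
names = [f"img_{i:04d}.jpg" for i in range(800)]
train_names, val_names = split_train_val(names)
print(len(train_names), len(val_names))
```

With a 1∶9 ratio, the 800 images yield 80 verification images and 720 training images.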
Note: a. original image, b. brightness enhancement, c. color enhancement, d. contrast enhancement, e. horizontal mirroring, f. random color change, and g. image rotation.
2.3 Test set division
The cherry growth environment and the different presentations of the fruits during picking are challenges for automatic identification. To test the target detection ability of the YOLOv4 model, we divided the 240 pictures in the test set according to their different characteristics. Table 1 shows the results of the grouping. First, 150 pictures were classed as the occlusion group, including 50 unoccluded images, 50 occluded images featuring branches and leaves, and 50 overlapped occluded images of fruits. These images were used to test the model’s ability to recognize blocked targets. The remaining 90 pictures were classified as the number group. Thirty of the images had 1-2 fruits, 30 images had 3-4 fruits, and 30 images had 5-6 fruits. These images were used to test the model’s ability to recognize different numbers of fruits. Finally, to test the comprehensive recognition ability of the different models for sweet cherry fruits, the comprehensive group involved a sample of 200 images randomly selected from the test set.
Table 1 Division of the test set
3.1 YOLOv4 model
YOLOv4 is an improved version of YOLOv3 (Fig.2). To enhance the model’s detection ability, the backbone feature extraction network of YOLOv3, Darknet-53, was replaced with CSPDarknet53. In addition, the feature pyramid network (FPN) that serves as the enhanced feature extraction network in YOLOv3 was replaced with spatial pyramid pooling (SPP) combined with a path aggregation network (PANet).
Fig.2 YOLOv4 structure
In this study, the sweet cherry fruit detection and recognition model is based on the YOLOv4 algorithm and consists of the following four modules. (i) The cross stage partial (CSP) module divides the original stack of residual blocks into two parts: one continues to stack the residual blocks, and the other is connected to the output after only a small amount of processing, an approach which can improve the learning ability of the network. (ii) The convolution, batch normalization, and Mish (CBM) module is used to extract input image features, and includes convolution, normalization, and Mish activation function processing. (iii) The convolution, batch normalization, and Leaky-ReLU (CBL) module is also used to extract image features. Unlike the CBM module, the activation function of this module is Leaky-ReLU. (iv) The SPP module applies maximum pooling at different scales to the input feature layer and then stacks the results, which can greatly increase the receptive field.
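The CBM and CBL modules differ only in their activation function. The sketch below compares the two functions on scalar inputs in plain Python, using the standard definition Mish(x) = x·tanh(ln(1 + e^x)). The negative slope of 0.1 for Leaky-ReLU is an assumption (it is the value commonly used in Darknet-style networks); the text does not state the value used.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

def leaky_relu(x, negative_slope=0.1):
    """Leaky-ReLU; the 0.1 slope is an assumed value, not taken from the paper."""
    return x if x >= 0 else negative_slope * x

for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  mish={mish(x):+.4f}  leaky_relu={leaky_relu(x):+.4f}")
```

Unlike Leaky-ReLU, Mish is smooth everywhere and lets small negative values pass through in a nonlinear way, which is the usual motivation given for using it in the backbone.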
3.1.1 Backbone feature extraction network. CSPDarknet53 has two main improvements over Darknet-53. First, the activation function of the convolution block (CBL) is changed from Leaky-ReLU to Mish, generating a new convolution block (CBM). Second, the cross stage partial network (CSPNet) structure is used to modify the residual module (Resblock-body), as shown in Fig.3. Together, the two changes enhance the feature extraction capability of the backbone feature extraction network.
Note: a. Before improvement; b. After improvement.
3.1.2 Enhanced feature extraction network. The enhanced feature extraction network of the YOLOv3 model uses the FPN structure (Fig.4). In contrast, the YOLOv4 model performs a convolution operation on the 13×13 feature layer, feeds the result to the SPP module, and then convolves the pooled 13×13 feature layer again before passing it to PANet, which repeatedly extracts features.
Fig.4 The FPN in YOLOv3
In SPP, the incoming feature layer is subjected to maximum pooling with kernel sizes of 13×13, 9×9, 5×5, and 1×1, and the pooling results are then stacked to deliver a new 13×13 feature layer. In PANet, the target features are first extracted from the bottom up and then from the top down. After repeatedly acquiring feature information, the network outputs stronger feature layers with sizes of 13×13, 26×26, and 52×52. The application of SPP and PANet in the YOLOv4 deep learning model expands the receptive field of the model, reduces the loss of information in the feature extraction process, and further enhances the feature extraction ability of the model.
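The SPP step can be sketched in plain Python as follows. This is an illustrative toy, not the model code: it assumes stride 1 and "same" padding so that every pooled map keeps its 13×13 size (which is what allows the results to be stacked), and it represents each channel as a nested list. The function names are hypothetical.

```python
def max_pool_same(fmap, k):
    """Max-pool a 2D grid with a k x k window, stride 1, and 'same' padding.

    Out-of-bounds positions are ignored, which is equivalent to padding
    with negative infinity.
    """
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    return [[max(fmap[yy][xx]
                 for yy in range(max(0, y - r), min(h, y + r + 1))
                 for xx in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)]
            for y in range(h)]

def spp(channels, kernel_sizes=(13, 9, 5, 1)):
    """Pool every channel at each kernel size and stack the results channel-wise."""
    stacked = []
    for k in kernel_sizes:
        stacked.extend(max_pool_same(ch, k) for ch in channels)
    return stacked

# One hypothetical 13x13 channel with a single strong activation in the centre.
grid = [[0.0] * 13 for _ in range(13)]
grid[6][6] = 1.0
out = spp([grid])
print(len(out), len(out[0]), len(out[0][0]))
```

The output has four times as many channels as the input but the same spatial size. The larger kernels spread the central activation over a wider area, which is how the module enlarges the receptive field.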
3.2 Model training
3.2.1 Experimental operation platform. To train the detection model, a computer with an Intel Core i7-10750H processor (base frequency 2.60 GHz), 16 GB of memory, and an NVIDIA RTX2060 GPU was used. The operating system was Windows 10, the CUDA version was 10.0, and the cuDNN version was 7.4.1.5. Anaconda 3 was used to configure the virtual environment, and Visual Studio Code was used as the editor. The software environment comprised Python 3.6.13, PyTorch 1.2.0, and Torchvision 0.4.0, with PyCharm used to run the scripts.
3.2.2 Training parameter settings. The model was trained with a transfer learning strategy and the Adam optimizer for a total of 200 epochs (training generations). The backbone feature extraction network was frozen for the first 100 epochs, and all parameters were trained for the last 100 epochs. During the frozen stage, the initial learning rate was $lr = 10^{-3}$, and CosineAnnealingLR (the cosine annealing method) was used to adjust the learning rate, with $T_{max} = 5$, $\eta_{min} = 10^{-5}$, and a batch size of 16. After unfreezing, the initial learning rate was $lr = 10^{-4}$, and CosineAnnealingLR was again used to adjust the learning rate, with $T_{max} = 5$, $\eta_{min} = 10^{-5}$, and a batch size of 2.
3.2.3 Training results and evaluation. In this study, it took about 5 h to train the YOLOv4 model. The model was trained for 200 epochs using 800 images, and the changing trend of its loss function (Loss) is shown in Fig.5.
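To make the learning rate schedule concrete, the sketch below evaluates the closed-form cosine annealing expression, $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\pi t / T_{max}))$, for the two training stages. It is a plain-Python illustration rather than the training script; in particular, counting epochs from zero at the start of each stage is an assumption, and the function name is hypothetical.

```python
import math

def cosine_annealing_lr(epoch, base_lr, eta_min, t_max):
    """Closed-form cosine annealing between base_lr and eta_min."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

# Stage settings taken from the text; the per-stage epoch counter is assumed.
frozen = dict(base_lr=1e-3, eta_min=1e-5, t_max=5)
unfrozen = dict(base_lr=1e-4, eta_min=1e-5, t_max=5)

frozen_schedule = [cosine_annealing_lr(e, **frozen) for e in range(11)]
unfrozen_schedule = [cosine_annealing_lr(e, **unfrozen) for e in range(11)]

for e, (a, b) in enumerate(zip(frozen_schedule, unfrozen_schedule)):
    print(f"epoch {e:2d}: frozen lr={a:.2e}  unfrozen lr={b:.2e}")
```

With $T_{max} = 5$, the learning rate falls from its initial value to $\eta_{min}$ over five epochs and rises back over the next five, so the schedule oscillates with a period of ten epochs throughout each 100-epoch stage.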
Fig.5 Curve of Loss value changes
The Loss of the model dropped rapidly in the first 20 training epochs, from 1 858 to 10, and then declined more slowly until it stabilized at about 7 before the backbone was unfrozen. After unfreezing, the Loss value slowly decreased again and, after a further 40 epochs, stabilized at around 3.6 until the end of the training. The cherry detection experiment used the training results of the 193rd epoch, where TrainLoss = 3.638 6 and ValLoss = 4.022 0.
Precision, recall, and mAP were used to evaluate the model, with mAP as the core evaluation index; the confidence threshold was set to 0.5. The metrics were computed from the numbers of true positive (TP), false positive (FP), and false negative (FN) detections.
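$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$
$$\mathrm{mAP} = \frac{1}{C} \sum_{i=1}^{C} AP_i \tag{3}$$
where $C$ is the number of categories and $AP_i$ is the average precision of category $i$.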
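The following sketch shows one way these metrics can be computed in plain Python. The precision and recall functions follow Eqs. (1) and (2) directly. The average precision routine uses the area under the interpolated precision-recall curve, which is a common convention but an assumption here, since the text does not specify the authors' exact protocol. The detection list and function names are hypothetical.

```python
def precision(tp, fp):
    """Eq. (1): fraction of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Eq. (2): fraction of labelled fruits that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def average_precision(detections, n_ground_truth):
    """Area under the interpolated precision-recall curve.

    `detections` is a list of (confidence, is_true_positive) pairs.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) after each ranked detection
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_ground_truth, tp / (tp + fp)))
    # Replace each precision with the best precision at any higher recall.
    for i in range(len(points) - 2, -1, -1):
        points[i] = (points[i][0], max(points[i][1], points[i + 1][1]))
    ap, prev_recall = 0.0, 0.0
    for r, p in points:
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(ap_per_class):
    """Mean of the per-category AP values over C categories."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical detections for a single "cherry" class with 4 labelled fruits.
dets = [(0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False)]
ap = average_precision(dets, n_ground_truth=4)
print(precision(3, 2), recall(3, 1), ap, mean_average_precision([ap]))
```

When there is only one category, as with a single cherry class, mAP equals the AP of that category.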
4.1 Overcoming occlusion problems
Many factors can interfere with the recognition of sweet cherry fruits in a natural environment, including the occlusion of fruits by natural features (Fig.6). The YOLOv4 model was used to detect sweet cherry fruits under different occlusion conditions, and the robustness of the model was tested. The experimental results are shown in Table 2.
Note: a. Images with no occlusion; b. Images of branch and leaf occlusion; c. Images of fruit overlap occlusion.
Table 2 Results of fruit recognition by the YOLOv4 model in different occlusion situations
The mAP values of the YOLOv4 model under the three conditions of no occlusion, branch and leaf occlusion, and fruit overlap occlusion were 95.40%, 95.23%, and 92.73%, respectively. The mAP value remained above 90% under all three occlusion conditions. Therefore, the model handled occlusion effectively and could detect occluded sweet cherry fruits under natural conditions.
4.2 Recognition of different numbers of fruits
As the number of sweet cherry fruits in an image increases, adjacent and overlapping fruits become more common, which can affect fruit identification. Therefore, the ability of the YOLOv4 model to detect different numbers of fruits is of great significance. The sweet cherry fruit pictures were divided into three groups, containing 1-2, 3-4, or 5-6 fruits, as shown in Fig.7. The experimental results are shown in Table 3.
Note: a. 1-2 fruits; b. 3-4 fruits; c. 5-6 fruits.
Table 3 shows that the average mAP value for different numbers of sweet cherry fruits detected by the YOLOv4 model was 81.00%, which meets the identification requirements for different numbers of fruits in natural environments. When the number of sweet cherry fruits increased from 1-2 to 3-4, the mAP value decreased by 7.45 percentage points, and when the number increased from 3-4 to 5-6, the mAP value decreased by 9.91 percentage points. We concluded that the detection capacity of the model decreased as the number of cherries increased. The reason is as follows: when the number of fruits increases, adjacent and overlapping fruits become more frequent and the probability of fruits being blocked by branches and leaves increases, and hence the difficulty of identification increases.
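The figures above can be checked with a few lines of Python. The per-group mAP values used here (89.27%, 81.82%, and 71.91%) are those reported in the conclusions of this paper; the variable names are illustrative.

```python
# Per-group mAP values (%) as reported in the paper's conclusions.
group_map = {"1-2 fruits": 89.27, "3-4 fruits": 81.82, "5-6 fruits": 71.91}
values = list(group_map.values())

# Average mAP across the three groups.
mean_map = round(sum(values) / len(values), 2)

# Drop in mAP (percentage points) between consecutive groups.
drops = [round(a - b, 2) for a, b in zip(values, values[1:])]

print(mean_map, drops)
```

The second drop is larger than the first, which is consistent with the observation that detection becomes progressively harder as more fruits appear in an image.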
Table 3 Recognition of different numbers of fruits by the YOLOv4 model
4.3 Performance of different models
To investigate the value of the YOLOv4 model for detecting sweet cherry fruits, we compared it with three common deep learning models: YOLOv3, SSD, and Faster-RCNN. The results are shown in Table 4.
Table 4 Comparison of the recognition performance of different models for sweet cherry fruits
As shown in Table 4, the mAP value of YOLOv4 was 0.66, 1.97, and 12.46 percentage points higher than those of YOLOv3, SSD, and Faster-RCNN, respectively. The precision of YOLOv4 was the highest among the four models, although its recall was slightly lower. Therefore, the YOLOv4 model performed well at recognizing sweet cherry fruits, and its detection speed of 22.86 f/s met the actual production needs.
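As a quick check of the comparison, the sketch below recomputes the mAP gaps from the per-model values reported in the conclusions and converts the reported YOLOv4 detection speed of 22.86 f/s into a per-frame time. The variable names are illustrative.

```python
# mAP values (%) for each model, as reported in the paper's conclusions.
model_map = {"YOLOv4": 90.89, "YOLOv3": 90.23, "SSD": 88.92, "Faster-RCNN": 78.43}

# Gap between YOLOv4 and each other model, in percentage points.
gaps = {name: round(model_map["YOLOv4"] - value, 2)
        for name, value in model_map.items() if name != "YOLOv4"}

# Convert the reported detection speed (frames per second) to ms per frame.
fps = 22.86
ms_per_frame = 1000 / fps

print(gaps, f"{ms_per_frame:.1f} ms/frame")
```

A speed of 22.86 f/s corresponds to roughly 44 ms per image.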
In this study, we used the YOLOv4 model to detect sweet cherry fruits in a natural environment. Using images representing different occlusion conditions and different numbers of cherries, we evaluated the model and compared its detection performance with that of other models, and reached the following conclusions. (i) The performance of YOLOv4 was tested under various occlusion conditions. The mAP values of the model under conditions of no occlusion, branch and leaf occlusion, and fruit overlap occlusion all remained above 90%, indicating that the YOLOv4 model can effectively identify sweet cherry fruits that are obscured in a natural environment. (ii) We tested the detection ability of YOLOv4 on images containing 1-2, 3-4, and 5-6 fruits. The average mAP of detection was 81.00%, and the mAP values of the three experimental groups were 89.27%, 81.82%, and 71.91%, respectively. The model meets the detection requirements for different numbers of sweet cherry fruits in a natural environment, but its detection ability decreased as the number of fruits increased, and further improvement in multi-fruit recognition is needed. (iii) A comparison of the detection capabilities of YOLOv4 and three commonly used deep learning models for sweet cherry fruits found that the mAP values of YOLOv4, YOLOv3, SSD, and Faster-RCNN were 90.89%, 90.23%, 88.92%, and 78.43%, respectively. The mAP value of YOLOv4 was higher than those of the other three models. The YOLOv4 model had practical advantages in recognizing sweet cherry fruits, and its detection speed meets actual production needs. In summary, the YOLOv4 deep learning model can effectively detect sweet cherries in a natural environment, and provides technical support for automatic sweet cherry picking.
Asian Agricultural Research, 2022, Issue 1