MA Xiao (馬 驍), WANG Shaoyu (王紹宇), YE Shaoping (葉少萍), FAN Jingyi (樊靜宜), XU An (徐 安), XIA Xiaoling (夏小玲)
School of Computer Science and Technology, Donghua University, Shanghai 201620, China
Abstract: In recent years, with the rapid development of e-commerce, people need to classify the wide variety and large number of clothing images appearing on e-commerce platforms. To address the long processing time and unsatisfactory classification accuracy involved in classifying large numbers of clothing images, researchers have begun to exploit deep learning techniques instead of traditional learning methods. This paper explores the use of convolutional neural networks (CNNs) for feature learning and enhances global feature information interactions by adding an improved hybrid attention mechanism (HAM) that fully utilizes feature weights in three dimensions: channel, height, and width. Moreover, the improved pooling layer not only captures local feature information but also fuses global and local information, mitigating the misclassification that occurs between similar categories. Experiments on the Fashion-MNIST and DeepFashion datasets show that the proposed method significantly improves the accuracy of clothing classification (93.62% and 67.9%, respectively) compared with the residual network (ResNet) and the convolutional block attention module (CBAM).
Key words: clothing classification; convolutional neural network (CNN); residual network (ResNet); attention mechanism; narrow pooling
In recent years, with the rapid rise of e-commerce, the use of cell phones and shopping apps has become more and more common. Owing to the convenience of online shopping, more and more people prefer buying clothing on shopping sites in their daily lives. When people search for clothing on e-commerce platforms, they usually have two options. The first is searching for clothing through a text input engine, which requires sellers to post photographs manually and classify the clothing in their stores beforehand. The other is uploading clothing images to a platform that detects the relevant attributes of these images. In both cases, clothing classification is an indispensable step, which not only facilitates the store's management of its products but also helps users narrow the search scope. This task differs from other image classification tasks in that clothing products have many attributes, the number of images is large, and some classes are highly similar.
The essence of clothing classification is to determine the classes by extracting image features and designing classification models. Traditional classification models mainly include support vector machine (SVM), extreme learning machine (ELM), random forest, transfer forest, and so on. Salem and Nasari [1] applied SVM to clothing research. Pan et al. [2] proposed to use the back propagation (BP) neural network for the discrimination of knitted fabrics. Bossard et al. [3] extracted features with the histogram of oriented gradients (HOG) and local binary patterns (LBP) and fed them into a clothing classification system containing classifiers such as SVM, random forest, and transfer forest; the accuracies obtained were 35.05%, 38.29%, and 41.36%, respectively. Thewsuwan and Horio [4] proposed a clothing classification method using two texture features (LBP and Gabor filters) and obtained an average accuracy of 80.27% on a five-category dataset. Zhang et al. [5] added HOG to clothing classification, achieved strong robustness to light, and obtained an accuracy of 73.60% on the Tmall buyer show dataset. Pan et al. [6] proposed an ethnic clothing classification algorithm based on scale-invariant feature transform (SIFT), HOG, and color features, and obtained an average accuracy of 87.6%. Surakarin and Chongstitvatana [7] used texture features and speeded-up robust features (SURF), an improvement of SIFT, for clothing classification.
Since the rise of deep learning, breakthroughs in the application and improvement of deep learning networks have been made in various fields, and clothing classification is no exception [8]. A large number of deep learning-based clothing classification algorithms have emerged. Nawaz et al. [9] proposed an ethnic clothing classification algorithm based on the inception model; they designed a convolutional neural network (CNN) architecture and added the inception module, which improved classification accuracy. Liu et al. [10] proposed a clothing classification method based on global convolutional features and local key points, improving the classification results by adding local information. Zhang et al. [11] proposed a clothing classification method based on the residual network (ResNet), which used ResNet as the baseline and achieved good results. The above studies investigated and improved deep learning for clothing classification, but did not exploit interactions between global and local information, and the influence of clothing background information remains.
The contributions of this paper to the above problems are as follows.
(1) We incorporate a hybrid attention mechanism (HAM) that reduces global information loss and attends to three-dimensional (3D) information interaction.
(2) We use narrow pooling layers to not only acquire and fuse global and local information but also reduce the interference of irrelevant background information.
(3) We enhance the interactions among feature information in three dimensions and reduce the difficulty of distinguishing similar categories with a novel pooling layer.
CNN is a very common deep learning network nowadays, and its principle is derived from the biological vision mechanism. The most basic CNN architecture consists of a convolutional layer, a pooling layer, and a fully connected layer, in which feature learning and the classifier are integrated.
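As a simple illustration of this basic structure (not the network used in this paper), a minimal PyTorch sketch is given below; the channel counts, kernel size, and input resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MinimalCNN(nn.Module):
    """Minimal CNN: convolutional layer -> pooling layer -> fully connected classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer (feature learning)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                             # pooling layer (downsampling)
        )
        self.classifier = nn.Linear(16 * 14 * 14, num_classes)  # fully connected layer (classifier)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

# Example: a batch of eight 28 x 28 grayscale images (Fashion-MNIST size)
logits = MinimalCNN()(torch.randn(8, 1, 28, 28))  # logits has shape (8, 10)
```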
Attention models have been widely used in various deep learning tasks such as natural language processing, image recognition, and speech recognition, and they are among the core techniques in deep learning that deserve attention and in-depth understanding. A neural network trained without an attention mechanism processes all features of an image equally. Although the network learns image features for classification, these features are not differentiated in the "eyes" of the network, so the network does not pay particular attention to any region. The spatial-domain attention proposed by Jaderberg et al. [12] transformed the spatial information in the image into another space and retained the useful information to obtain better robustness. Hu et al. [13] proposed the squeeze-and-excitation network (SENet) to extract the global information of each channel and obtain channel-dimension features, giving more attention to the more informative features and suppressing the irrelevant ones. Woo et al. [14] proposed the convolutional block attention module (CBAM) to combine channel attention with spatial attention, which was able to compensate for the loss caused by upsampling.
The commonly used attention mechanisms are the channel attention mechanism and the spatial attention mechanism. In both mechanisms, maximum pooling and average pooling are used in parallel to compress the feature map. The difference is that the channel attention mechanism sends the two pooled results separately to a multi-layer perceptron (MLP) and then sums them, while the spatial attention mechanism concatenates the results after the maximum and average pooling.
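As a reference point for the modifications described later, a minimal PyTorch sketch of these two standard mechanisms, following the CBAM formulation [14], is given below; the reduction ratio and the 7×7 kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Max- and average-pooled channel descriptors pass through a shared MLP and are summed."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))               # average pooling over the spatial dimensions
        mx = self.mlp(x.amax(dim=(2, 3)))                # maximum pooling over the spatial dimensions
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)  # channel attention map

class SpatialAttention(nn.Module):
    """Max- and average-pooled maps are concatenated and convolved into a spatial map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                # average pooling along the channels
        mx = x.amax(dim=1, keepdim=True)                 # maximum pooling along the channels
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # spatial attention map
```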
Fig. 1 HAM architecture
Given an intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$ as input, the overall attention process can be summarized as
$$F_{c} = M_{c}(F) \otimes F, \tag{1}$$
$$F_{s} = M_{s}(F_{c}) \otimes F_{c}, \tag{2}$$
where $M_{c}$ is the channel attention map and $M_{s}$ is the spatial attention map; $\otimes$ denotes element-wise multiplication, during which the channel attention values are broadcast along the spatial dimension and vice versa; $F_{c}$ is the output of the channel attention block, and $F_{s}$ is the final refined output.
We found that the pooling operation reduced the use of feature information and had a negative impact on our attention module. To further preserve the feature mapping, we removed the maximum pooling and average pooling from the two attention mechanisms.
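A minimal sketch of how the two attention blocks compose according to Eqs. (1) and (2) is shown below. The internals of the channel and spatial branches are left abstract, since this section only specifies that the pooling operations are removed from them; the module interfaces are assumptions.

```python
import torch.nn as nn

class HAM(nn.Module):
    """Hybrid attention: F_c = M_c(F) * F, then F_s = M_s(F_c) * F_c (Eqs. (1)-(2)).
    channel_att and spatial_att stand for the modified, pooling-free attention blocks;
    their internal structure is not reproduced here."""
    def __init__(self, channel_att: nn.Module, spatial_att: nn.Module):
        super().__init__()
        self.channel_att = channel_att
        self.spatial_att = spatial_att

    def forward(self, f):
        f_c = self.channel_att(f) * f      # Eq. (1): channel map broadcast along H x W
        f_s = self.spatial_att(f_c) * f_c  # Eq. (2): spatial map broadcast along the channels
        return f_s
```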
The pooling operation is a very efficient way to obtain a large receptive field in pixel-wise prediction tasks.
Average pooling is used in traditional clothing classification tasks, and a square kernel of $N \times N$ is generally adopted for feature extraction. We define the input two-dimensional tensor as $x$ with dimension $H \times W$, and we have
$$y_{i,j} = \frac{1}{N \times N} \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x_{iN+m,\, jN+n}, \tag{3}$$
where $N$ is the kernel size of the average pooling, $0 \le i < H/N$, and $0 \le j < W/N$.
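For later comparison with the narrow pooling layer, the square average pooling of Eq. (3) corresponds directly to PyTorch's built-in operator; the tensor sizes below are illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)              # feature map of shape C x H x W = 64 x 32 x 32
square_pool = nn.AvgPool2d(kernel_size=2)   # N x N kernel with N = 2, as in Eq. (3)
print(square_pool(x).shape)                 # torch.Size([1, 64, 16, 16])
```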
Considering that clothing images are usually regular and rectangular, square pooling kernels can damage the feature information and may collect useless background information; we therefore propose a new narrow pooling layer. This method considers both global and local information by using average pooling layers with $H \times 1$ and $1 \times W$ kernels in the vertical and horizontal directions, respectively. The narrow pooling operation is defined as
$$y^{h}_{i} = \frac{1}{W} \sum_{j=0}^{W-1} x_{i,j}, \quad 0 \le i < H, \tag{4}$$
$$y^{v}_{j} = \frac{1}{H} \sum_{i=0}^{H-1} x_{i,j}, \quad 0 \le j < W. \tag{5}$$
Parallel horizontal and vertical pooling branches encode the global horizontal or vertical information present in the obtained feature maps, and then assign weights to the feature information for optimization. Our pooling operation captures feature information in two ways. On the one hand, a kernel whose size equals the height or width of the feature map can collect global information more efficiently. On the other hand, the narrow pooling operation keeps local information while discarding extraneous information, which helps to distinguish some highly similar clothing types more precisely. The architecture embedding the narrow pooling module (NPM) in the bottleneck of ResNet [15] is shown in Fig. 2.
Fig. 2 Bottleneck structure of ResNet which has narrow pooling
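A possible PyTorch sketch of the NPM is given below: the two branches implement the $1 \times W$ and $H \times 1$ average pooling of Eqs. (4) and (5), and the pooled statistics are expanded back to $H \times W$ and turned into a sigmoid gate that re-weights the input features. The 1D convolutions and the fusion by addition are assumptions made for illustration, since the exact bottleneck wiring of Fig. 2 is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NarrowPooling(nn.Module):
    """Narrow pooling: strip-shaped average pooling in the horizontal and vertical
    directions, used to gate the input feature map with global context."""
    def __init__(self, channels):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # 1 x W kernel per row -> output H x 1, Eq. (4)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # H x 1 kernel per column -> output 1 x W, Eq. (5)
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        self.conv_w = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        _, _, h, w = x.shape
        xh = F.interpolate(self.conv_h(self.pool_h(x)), size=(h, w))  # expand the H x 1 branch
        xw = F.interpolate(self.conv_w(self.pool_w(x)), size=(h, w))  # expand the 1 x W branch
        weight = torch.sigmoid(self.fuse(xh + xw))  # fuse global horizontal and vertical context
        return x * weight                           # re-weight local features with global context
```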
Our clothing classification model is based on ResNet, with the attention mechanism and narrow pooling embedded in the base network. The model has two main modules: an HAM based on 3D global information interaction, and an NPM for global and local information fusion. We added the HAM to the first and fourth of the four main layers of ResNet for global information interaction, and the NPM to the last residual block of each layer for global and local information fusion.
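Assuming that the backbone is a standard torchvision ResNet-50, the assembly described above could be sketched as follows. HAM and NarrowPooling refer to the sketches given earlier; appending the NPM after the last block of each layer is a simplification of the embedding shown in Fig. 2, and the factory functions are assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet50

def build_model(num_classes, make_ham, make_npm):
    """Attach the HAM to layers 1 and 4 and an NPM to the end of every main layer.
    make_ham(channels) and make_npm(channels) create the modules sketched above."""
    net = resnet50(weights=None)
    layer_channels = [256, 512, 1024, 2048]  # output channels of layer1-layer4 in ResNet-50

    for idx, channels in enumerate(layer_channels, start=1):
        layer = getattr(net, f"layer{idx}")
        blocks = list(layer.children())
        blocks.append(make_npm(channels))   # NPM after the last residual block of the layer
        if idx in (1, 4):                   # HAM on the first and fourth main layers
            blocks.append(make_ham(channels))
        setattr(net, f"layer{idx}", nn.Sequential(*blocks))

    net.fc = nn.Linear(net.fc.in_features, num_classes)  # classification head
    return net
```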
In this section, we evaluated the proposed method on popular clothing datasets, including Fashion-MNIST [16] and DeepFashion [10], with classification benchmarking and ablation studies.
The Fashion-MNIST training dataset contains 6 000 samples per category, and the test dataset contains 1 000 samples per category. There are 10 categories in total; the training dataset has 60 000 samples and the test dataset has 10 000 samples. As shown in Fig. 3, each grayscale clothing image is a 28×28 pixel array, where each pixel value is an 8-bit unsigned integer (uint8) between 0 and 255; the images are stored as a 3D array whose last dimension indicates the number of channels.
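For reference, the dataset can be loaded directly with torchvision; the root directory below is an assumption.

```python
import torchvision
import torchvision.transforms as T

transform = T.ToTensor()  # 28 x 28 uint8 pixels -> float tensor of shape (1, 28, 28) in [0, 1]
train_set = torchvision.datasets.FashionMNIST(root="./data", train=True,
                                              download=True, transform=transform)
test_set = torchvision.datasets.FashionMNIST(root="./data", train=False,
                                             download=True, transform=transform)
print(len(train_set), len(test_set))  # 60000 10000, spread over 10 categories
```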
Fig. 3 Example images of the Fashion-MNIST dataset
DeepFashion is a large-scale dataset released by The Chinese University of Hong Kong, China. As shown in Fig. 4, it contains 800 000 images, including images from different angles, different scenes, buyer shows, etc. We selected the subset of DeepFashion used for classification, the category and attribute prediction benchmark, which contains 289 222 images in 46 categories, all in JPG format.
Fig. 4 Example images of DeepFashion dataset
Fig. 5 Confusion matrix of baseline
Fig. 6 Confusion matrix of the proposed method
The experiments used a graphics processing unit (GPU) to speed up the training of the model, and the Adam optimizer was used to speed up its convergence. In the training phase, the number of training epochs was set to 50, and the number of images per batch was 32. Because of the uneven size of the DeepFashion dataset, its images were augmented by random panning and horizontal or vertical flipping, and the image size was uniformly adjusted to 224×224 pixels to enhance the generalization ability of the model during training.
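A sketch of this training configuration is given below; the learning rate and the exact augmentation parameters are assumptions, as the text does not specify them.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Augmentation for DeepFashion: random panning (translation), flips, and resizing to 224 x 224
train_transform = T.Compose([
    T.Resize((224, 224)),
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # random panning
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.ToTensor(),
])

def train(model, train_set, epochs=50, batch_size=32, lr=1e-3):
    device = "cuda" if torch.cuda.is_available() else "cpu"  # use the GPU when available
    model = model.to(device)
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # Adam for faster convergence
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```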
We first performed our experiments on the Fashion-MNIST dataset. As shown in Table 1, each part of the proposed method is useful and yields a clear improvement compared with the ResNet and CBAM baselines.
Table 1 Results on Fashion-MNIST
From Table 1, we can see that adding HAM, adding NPM, and using both methods together improve the performance over the baseline by 0.33%, 0.05%, and 0.55%, respectively. In addition, the proposed method outperforms CBAM by 0.40%.
To visualize the improvement in accuracy of the improved model, we compared the confusion matrix generated by the baseline with that obtained by the proposed method, as shown in Figs. 5 and 6. The vertical axis (y_true) is the ground-truth label, the horizontal axis (y_pred) is the predicted label, and each number is the count of test samples. We find that most of the misclassified cases have been improved to some extent.
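Confusion matrices like those in Figs. 5 and 6 can be computed, for example, with scikit-learn once the test-set predictions have been collected; the helper below is an illustrative assumption, not the authors' evaluation code.

```python
import torch
from sklearn.metrics import confusion_matrix

@torch.no_grad()
def evaluate_confusion(model, loader, device="cpu"):
    """Collect ground-truth and predicted labels and build the confusion matrix."""
    y_true, y_pred = [], []
    model.eval()
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        y_true.extend(labels.tolist())
        y_pred.extend(preds.tolist())
    return confusion_matrix(y_true, y_pred)  # rows: y_true (ground truth); columns: y_pred
```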
We also obtain excellent results on the DeepFashion dataset, as shown in Table 2. The proposed method shows significant improvements in accuracy, average precision, and average recall of 1.14%, 4.36%, and 2.49%, respectively, over the baseline.
Table 2 Results on DeepFashion
In this paper, a clothing classification model based on an attention mechanism is proposed. The model first obtains global information and enables feature interactions through an improved hybrid attention mechanism; a narrow pooling layer is then added to the convolutional layers to enhance the use of global and local features; finally, clothing images are classified after feature fusion. The proposed method can help industry managers and researchers to perform fast and effective automatic classification of clothing images. In addition, the model can also help to build image classification models and systems for other scenes.