
Text-Independent Algorithm for Source Printer Identification Based on Ensemble Learning

2022-11-10 02:31:20 Naglaa El Abady, Mohamed Taha and Hala Zayed

Computers, Materials & Continua, 2022, Issue 10

Naglaa F. El Abady, Mohamed Taha and Hala H. Zayed,2

1 Department of Computer Science, Faculty of Computers and Artificial Intelligence, Benha University, 13518, Egypt

2 School of Information Technology and Computer Science (ITCS), Nile University, 12677, Egypt

Abstract: Because of the widespread availability of low-cost printers and scanners, document forgery has become extremely common. Watermarks or signatures are used to protect important papers such as certificates, passports, and identification cards. Identifying the origins of printed documents is helpful for criminal investigations and also for authenticating digital versions of a document. Source printer identification (SPI) has become increasingly popular for identifying fraud in printed documents. This paper proposes an algorithm for identifying the source printer and categorizing the questioned document into one of the printer classes. The algorithm achieved significant identification results on a dataset of 1200 documents from 20 distinct printers (13 laser and 7 inkjet). It is based on global features such as the Histogram of Oriented Gradients (HOG) and local features such as Local Binary Pattern (LBP) descriptors. For classification, Decision Trees (DT), k-Nearest Neighbors (k-NN), Random Forests, Aggregate bootstrapping (bagging), Adaptive boosting (boosting), Support Vector Machine (SVM), and mixtures of these classifiers have been employed. The proposed algorithm can accurately classify the questioned documents into their appropriate printer classes. The adaptive boosting classifier attained 96% accuracy. The proposed algorithm is compared to four recently published algorithms that used the same dataset and gives better classification accuracy.

Keywords: Document forensics; source printer identification (SPI); HOG; LBP; principal component analysis (PCA); bagging; AdaBoost

1 Introduction

In the modern era, documents in digital format have become more common due to the fast development of advanced and sophisticated technologies. Nowadays, avoiding them is almost impossible. Official contract images, invoices, contracts, bills, checks, and scientific literature are all digital documents. These documents are insecure because they lack the necessary security measures, and this limitation has made document manipulation more accessible. Such operations are simple to carry out with the help of efficient technologies such as printers and scanners: after the original document is scanned, the scanned image can readily be tampered with. As a result, before relying on a document, it is necessary to verify its authenticity. In most cases, active approaches, such as a watermark or signature, are used to authenticate documents. These approaches are extensively used to protect digital documents [1,2]. They add extrinsic fingerprints to the document, which can be easily traced if disturbed. However, because such technology is costly and time-consuming, it is impossible to use it for all publications. The other strategy is passive and is based on the intrinsic properties of document images. The fingerprints of the hardware could be used as intrinsic features to prove the authenticity of the document. Knowing the source printer can be extremely beneficial when looking for modifications in printed documents. Each printer has a distinct printing style, and this characteristic can be used to inspect the printed document and trace it back to the printer that produced it.

When a printed document is scanned, identifying its source becomes a traditional pattern recognition problem with feature extraction and classification [3]. Traditional approaches use chemical or microscopic techniques, which are time-consuming and can harm or even destroy the investigated documents. In contrast, all that is required for digital approaches is a scanner and a computer. Several methodologies [4-7], including examination-based and machine learning-based approaches, have been developed in the relevant literature. The techniques proposed in the literature fall into two primary categories: text-dependent and text-independent approaches. The majority of text-dependent approaches depend on character- or word-level imperfections introduced by printers. Despite their effectiveness, such procedures necessarily involve the comparison of semantically similar units (characters or words). They require either pre-divided characters (or words) or the integration of an Optical Character Recognition (OCR) system that allows identical characters or words to be compared. Text-independent approaches, on the other hand, are not content-specific and often rely on statistical features acquired from a large number of observations (paragraphs or images). They are more relevant to real-world applications, although they need a large amount of training data to model the differences between printers.

This paper proposes a feature-based classification of source printers based on scanned images of printed papers. The following are the main contributions:

• Detect forged documents with high accuracy using source printer identification.

• Identify the source printer and categorize the questioned document into one of the printer classes.

• Investigate the global and local characteristics of entire printed documents without using pre-divided characters (or words) or an OCR system.

• Propose and construct an efficient document classifier capable of identifying a foreign document, printed on a separate printer, within a set of questioned documents.

The rest of the paper is structured as follows: Section 2 highlights related work, while Section 3 discusses the details of the proposed method. Section 4 describes the results of the conducted experiments, along with a detailed discussion of these results and a comparison with related work reported in the literature. Section 5 concludes the paper.

2 Related Works

Detecting document tampering can be done in a variety of ways. The majority of these approaches detect the source of the variations to determine the likelihood of alteration. Other approaches search for the source printer of document images to authenticate the documents. This section goes over the most common methods for authenticating a document and confirming that it was printed by a legitimate printer. These methods are classified into two types: text-dependent (local features are examined) and text-independent (global features are examined). Tab. 1 summarizes Source Printer Identification (SPI) techniques based on printed documents.

Table 1: Summary of SPI techniques based on printed documents

2.1 Text-Dependent Approaches

Text-dependent approaches typically depend on character- or word-level imperfections introduced by printers. Such approaches require the comparison of semantically related components (characters or words). Generally, this requires either pre-divided words (or characters) or an OCR system that permits the comparison of identical words or characters. Mikkilineni et al. [3] proposed a method for detecting the source of a document based on texture feature descriptors. It examines the document’s connected components (CCs), or characters, as well as the statistics of specific, frequently occurring characters such as “e” or “a”, for indications of alteration. Text documents scanned at a resolution of 2400 dpi were considered, and all “e” letters were used in the experiment. The Gray-Level Co-occurrence Matrix (GLCM) was applied to extract 22 statistical features per character to create a feature vector. Each feature vector is classified individually using a 5-Nearest-Neighbor (5NN) classifier. Different texture feature extraction methods, such as the Discrete Wavelet Transform (DWT) and GLCM, are used in [8] to examine the source of Chinese printed documents and determine the impact of different output devices. When 12 printers were examined, an identification accuracy of 98.4% was achieved. Kong [9] made the first attempt to differentiate documents produced by an inkjet printer, a copier, and a laser printer based on attributes obtained from unique characters in the documents. The signatures left on the document by the device(s) used to produce it are evaluated. The experimental results showed that the accuracy reached 90% for all the inkjet printers and most laser printers and copiers. Ferreira et al. [10] proposed three solutions for identifying laser printers. In these solutions, low-resolution scanned documents were employed. The first technique applied two descriptors based on multi-directional and multi-scale textural features of micro-patterns. Letters or areas of interest were used to create these descriptors. As a second descriptor, the Convolution Texture Gradient Filter (CTGF) was proposed. The third method had the advantage of identifying a document’s printing source even if portions of it were unavailable. For frames, characters, and documents, the accuracy of the first method was 98.38%, 97.60%, and 88.58%, respectively. The accuracy rates for frames and documents were 94.19% and 88.45%, respectively.

In the system proposed in [11], all of the printed letters were used at the same time to identify the source printer from scanned images of printed documents. A single classifier, together with features based on local texture patterns, is used to classify all printed letters. Letters are extracted from the scanned images. Each character is separated into a flat region and an edge region, and local binary patterns for these two regions are calculated individually. The method was tested on a public dataset of 10 printers as well as a new dataset of 18 printers, scanned at 600 and 300 dpi and printed in four different fonts. The system can deal with all the printed letters simultaneously using a single classifier, and it outperforms existing methods based on hand-crafted features. In [12], the authors proposed a solution for the printer attribution problem that can learn discriminative features directly from available training data. The solution uses convolutional neural networks trained with back-propagation. The method is based on artifacts extracted from various letters of texts in various languages. The authors achieved 98% accuracy by employing various representations of raw data as input to a Convolutional Neural Network (CNN). Tsai et al. [13] proposed a four-layered CNN architecture for SPI from documents and compared the results to hand-crafted features. The authors of [14] proposed a deep learning approach to address this difficult image classification problem. Using a 7-layered CNN, textual documents are classified with an accuracy of 98.4%, while natural image-based scanned documents are classified with an accuracy of 99.96%. After raising the number of layers to 13, the authors reported accuracies of 97.37% and 97.7% for textual and image-based documents, respectively.

2.2 Text-Independent Approaches

Text-independent strategies look at the entire document at once. The algorithms in this category examine statistical properties, such as noise across the document, to detect modifications. The number of studies in this category is relatively small. In [15], source printers are identified automatically from common-resolution scans (400 dpi). The proposed system is based on the printer’s unique noise, and the overall classification accuracy was 76.75%. A text-independent method for describing source printers using deep visual features was implemented in [16]. Using transfer learning on a pre-trained CNN, the system classified 1200 documents from 20 different printers (13 laser and 7 inkjet). In [17], the authors presented a passive technique for identifying a document’s source printer. The feature extraction techniques used include Key Printer Noise Features (KPNF), Speeded Up Robust Features (SURF), and Oriented FAST and Rotated BRIEF (ORB). For classification, three classifiers are considered (k-NN, random forest, and DT), along with the majority vote of these three. The system achieved its best accuracy of 95.1% by combining KPNF, ORB, and SURF with a random forest classifier and the adaptive boosting technique. For printer attribution, a novel technique based on SURF and Oriented FAST and Rotated BRIEF (ORB) feature descriptors is proposed in [18]. Random Forest, Naive Bayes, k-NN, and combinations of these classifiers were employed for classification. The proposed model is capable of accurately assigning the questioned documents to the appropriate printer. The accuracy was 86.5% using a combination of Naive Bayes, k-NN, and random forest classifiers, together with a simple majority voting scheme and adaptive boosting. In [19], the authors proposed a system for distinguishing inkjet-printed pages from laser-printed pages based on differences in edge roughness. In the first step, appropriate intrinsic features are extracted from the document image. In the second step, the extracted features are compared to identify documents that were not produced with the same printing technique as most of the documents. The key advantage of this technique is that it does not require any prior knowledge of genuine documents.

3 The Proposed Algorithm

Every printer leaves fingerprints on each page it prints, and every printer has its own set of fingerprints. These fingerprints are a printer’s distinguishing feature. This research provides an algorithm for identifying the source printer and categorizing the questioned document into one of the printer classes. The proposed algorithm, depicted in Fig. 1, consists of two phases. The training phase includes preprocessing, feature extraction, and classification. The testing phase follows the same steps and adds prediction.

The printed document image is divided, starting from the top of the image, into three equal images (top, middle, and bottom). Each image in every collection is then cropped to 1024 × 1024 pixels. For each image in the collection, global feature descriptor vectors are extracted using HOG features, and local feature descriptor vectors are extracted using LBP features. For training, the HOG and LBP feature vectors are concatenated. To create the trained models, the proposed algorithm is trained using DT, k-NN, SVM, combinations of them, bagging, boosting, and random forest classifiers. For testing, the HOG and LBP feature vectors are concatenated in the same way. Given the HOG and LBP features of the questioned documents, the trained models are used to predict the class of each document.

Figure 1: The proposed algorithm diagram

3.1 Preprocessing

Because the input image is very large, applying the proposed algorithm to it directly takes a long time. Reducing the size of the input document image decreases feature extraction time. Therefore, each document image is divided, starting from the top of the image, into three equal images (top, middle, and bottom). Each image in every collection is then cropped to 1024 × 1024 pixels, as shown in Fig. 2. These cropped images are then used to extract features. An additional benefit of cropping is that it isolates the part of the page that contains the fingerprints left by the printer.
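The cropping step can be sketched as a small routine that computes the three crop boxes. This is only an illustration: the function name `crop_boxes`, the example page size, and the choice of taking each 1024 × 1024 crop from the top-left corner of its band are assumptions, since the text does not say where inside each third the crop is taken.

```python
def crop_boxes(width, height, crop=1024):
    """Return (left, top, right, bottom) boxes for the top, middle and bottom crops."""
    band = height // 3  # height of each of the three equal horizontal bands
    if width < crop or band < crop:
        raise ValueError("page is too small for the requested crop size")
    boxes = {}
    for i, name in enumerate(("top", "middle", "bottom")):
        top = i * band
        # Assumption: the crop starts at the top-left corner of each band.
        boxes[name] = (0, top, crop, top + crop)
    return boxes


# Hypothetical scan size in pixels, chosen only for illustration.
boxes = crop_boxes(width=3300, height=4650)
print(boxes)
```

Each box can then be passed to any image library's crop routine before feature extraction.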

Figure 2: Samples of document images used in the proposed algorithm. (a) at the top (b) at the middle (c) at the bottom

3.2 Feature Extraction

Two feature extraction strategies are used in the proposed algorithm. The first is HOG, a robust feature descriptor based on dense feature extraction: it computes features from all locations in an image’s regions of interest. HOG extracts object structure from the gradient information in an image [16]. Feature extraction using HOG consists of preprocessing, calculating the gradient direction and gradient magnitude from Eqs. (1) and (2), generating a histogram for each block from the gradient values, and normalizing the histograms [20]. A HOG feature vector is generated by combining the gradient calculations of each pixel, as shown in Fig. 3.

where Gx and Gy are the gradient components in the x and y directions.
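In their standard textbook form, the two quantities referred to as Eqs. (1) and (2) are written as follows. This is the usual HOG formulation and is assumed, not confirmed, to match the authors' notation; the symbols G and θ are introduced here for illustration.

```latex
G = \sqrt{G_x^{2} + G_y^{2}}, \qquad
\theta = \arctan\!\left(\frac{G_y}{G_x}\right)
```

Here G is the gradient magnitude and θ the gradient direction at a pixel.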

Figure 3: Gradient directions (left), gradient magnitude (right)
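As a rough illustration of how gradient magnitude and direction feed an orientation histogram, the sketch below computes central-difference gradients on a tiny image and accumulates them into 9 unsigned orientation bins. It is a simplification under stated assumptions: real HOG implementations typically interpolate votes between neighboring bins and normalize over blocks, and the function names are hypothetical.

```python
import math


def gradients(img):
    """Central-difference gradient magnitude and direction for interior pixels."""
    h, w = len(img), len(img[0])
    mags, angs = [], []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            mags.append(math.hypot(gx, gy))
            # Unsigned orientation, folded into [0, 180) degrees.
            angs.append(math.degrees(math.atan2(gy, gx)) % 180.0)
    return mags, angs


def orientation_histogram(mags, angs, bins=9):
    """Magnitude-weighted histogram with hard (non-interpolated) binning."""
    width = 180.0 / bins
    hist = [0.0] * bins
    for m, a in zip(mags, angs):
        hist[min(int(a // width), bins - 1)] += m
    return hist


# A 4 x 4 image containing a single vertical edge.
img = [[0, 0, 10, 10] for _ in range(4)]
mags, angs = gradients(img)
hist = orientation_histogram(mags, angs)
print(hist)
```

For the vertical edge, every interior pixel has a purely horizontal gradient, so all of the magnitude lands in the first bin.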

In the proposed system, MATLAB’s extractHOGFeatures function is used with its default settings and a cell size of 128 × 128. Fig. 4 shows an input image and a visualization of its HOG features.

Figure 4: Input image and visual HOG feature extraction

The second strategy is the Local Binary Pattern (LBP), an operator for extracting texture characteristics that is used in [21-23]. It measures the image’s local contrast. The original LBP is defined on the eight-pixel neighborhood around a center pixel and compares each neighbor with the center pixel’s gray value. The LBP is easy to use and has low computational complexity, as indicated in Eq. (3).

where gc is the intensity value of the central pixel, gp (p = 0, 1, ..., P-1) are the intensity values of the neighboring pixels, and P denotes the number of neighboring pixels. The calculation process of the original LBP is shown in Fig. 5.
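For reference, the LBP operator referred to as Eq. (3) is usually written in the following standard form. The thresholding function s and the radius subscript R are part of the common textbook definition and are assumed here rather than taken from the source.

```latex
LBP_{P,R} = \sum_{p=0}^{P-1} s\left(g_p - g_c\right) 2^{p},
\qquad
s(x) =
\begin{cases}
1, & x \ge 0 \\
0, & x < 0
\end{cases}
```

Each neighbor contributes one bit, so with P = 8 the code of a pixel is an integer between 0 and 255.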

The default LBP settings are used in the proposed algorithm, with 8 neighbors, a radius of 1, and a cell size of 256 × 256.
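The cell sizes given above determine the length of the feature vectors. The sketch below works through the arithmetic for a 1024 × 1024 crop, assuming the MATLAB defaults that the text does not spell out: 2 × 2-cell HOG blocks with a one-cell overlap and 9 orientation bins, and 59 uniform-pattern bins per LBP cell for 8 neighbors without rotation invariance. The function names are hypothetical. Under these assumptions the concatenated length equals the 2708 features quoted in Section 4.4.

```python
def hog_length(image_size, cell_size, block_size=2, num_bins=9):
    """Length of a HOG vector for a square image, with a block stride of one cell."""
    cells = image_size // cell_size
    blocks = cells - block_size + 1  # blocks per row (and per column)
    return blocks * blocks * block_size * block_size * num_bins


def lbp_length(image_size, cell_size, neighbors=8):
    """Length of a cell-wise LBP histogram vector for a square image."""
    cells = image_size // cell_size
    bins = neighbors * (neighbors - 1) + 3  # 59 uniform-pattern bins for 8 neighbors
    return cells * cells * bins


hog = hog_length(1024, 128)  # 7 * 7 blocks, 4 cells per block, 9 bins
lbp = lbp_length(1024, 256)  # 4 * 4 cells, 59 bins per cell
print(hog, lbp, hog + lbp)   # 1764 944 2708
```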

3.3 Classification

A classifier is a model-training algorithm that takes a feature set as input. Once it has been trained on the training dataset, the classifier produces a model, which is then used to classify the test data. Depending on the problem, multi-class or binary classifiers may be used. There are two types of classifiers: single and ensemble [24]. Single classifiers include the decision tree (DT) [25,26], K-Nearest Neighbors (K-NN) [27], and Support Vector Machine (SVM) [28]. Ensemble classifiers include Random Forests (RF) [29], Adaptive Boosting (boosting) [30], and Aggregate Bootstrapping (bagging) [31,32]. In this paper, both types are used to generate trained models, which are stored and used later for prediction in the testing phase.

3.4 Trained Models and Prediction

Following the classification technique outlined in the previous section, a group of trained models is generated: the DT, KNN, SVM, DT-KNN, DT-SVM, KNN-SVM, RF, boosting, and bagging models. During the testing phase, these trained models are used to predict the printer type, and the model with the highest accuracy is selected.
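The text does not state how the outputs of DT, SVM, and KNN are combined in the mixed models. Majority voting, which the related work in Section 2.2 uses, is one plausible scheme; the sketch below illustrates it under that assumption. The printer labels and the tie-breaking rule are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier labels for one document by simple majority.

    Ties are broken in favour of the classifier listed first.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:  # first label that reaches the top count wins
        if counts[label] == best:
            return label


# Hypothetical predicted printer labels from DT, SVM and KNN for three documents.
dt = ["P1", "P2", "P3"]
svm = ["P1", "P4", "P5"]
knn = ["P2", "P4", "P6"]
combined = [majority_vote(votes) for votes in zip(dt, svm, knn)]
print(combined)
```

With three voters, a full disagreement falls back to the first classifier, which is why the ordering of the classifiers matters in this sketch.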

3.5 Principal Component Analysis(PCA)

PCA is one of the most widely used approaches for reducing data dimensionality. It reduces the number of dimensions of multivariate data while preserving the relationships in the data as much as possible. PCA is an unsupervised learning method: it uses the input data without regard to the target output. To reduce the dimension of a feature vector, PCA uses four steps [33]: normalize the image, calculate the covariance matrix, compute the eigenvectors and their eigenvalues, and transform the original data into the new reduced feature vector.
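The four steps can be illustrated on a toy two-dimensional example, where the eigen-decomposition of the 2 × 2 covariance matrix has a closed form. This is a minimal sketch rather than the authors' implementation: normalization is reduced to mean-centering, and the function name and sample points are hypothetical.

```python
import math


def pca_first_component(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]   # step 1: center the data
    a = sum(x * x for x, _ in centered) / (n - 1)      # step 2: covariance matrix
    b = sum(x * y for x, y in centered) / (n - 1)      #         [[a, b], [b, c]]
    c = sum(y * y for _, y in centered) / (n - 1)
    mean, half = (a + c) / 2, math.hypot((a - c) / 2, b)
    eigenvalues = (mean + half, mean - half)           # step 3: eigenvalues
    if b != 0:
        vx, vy = eigenvalues[0] - c, b                 # eigenvector of the largest one
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    axis = (vx / norm, vy / norm)
    scores = [x * axis[0] + y * axis[1] for x, y in centered]  # step 4: project
    return eigenvalues, axis, scores


# Points lying exactly on the line y = 2x: one component carries all the variance.
eigenvalues, axis, scores = pca_first_component([(1, 2), (2, 4), (3, 6)])
print(eigenvalues, axis, scores)
```

Because the sample points are collinear, the second eigenvalue is zero and a single score per point describes the data without loss.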

The experimental results based on the classifiers stated above and their combinations are discussed in the next section.

4 Experimental Results and Discussion

The proposed algorithm was implemented in MATLAB R2019b and was run and verified on a DELL PC with the following configuration: Intel(R) Core(TM) i5-2430M CPU @ 2.40 GHz, 12.00 GB of RAM, and 64-bit Windows 10. Several experiments were carried out to evaluate the proposed algorithm’s performance. Section 4.1 describes the dataset used to train and test the proposed algorithm. The experimental setup is provided in Section 4.2, and the evaluation measures are presented in Section 4.3. Section 4.4 discusses the results. Finally, Section 4.5 compares the proposed algorithm with other techniques.

4.1 Datasets Description

The experimental results for the proposed algorithm were obtained using the public dataset of Khanna et al. [34]. The documents in this collection were printed on 13 laser printers and 7 inkjet printers. Fifty documents are considered for each printer, and all of a printer’s documents are distinct. The dataset contains documents from three categories: contracts, invoices, and scientific papers. The contracts contain only text, in different font types and sizes; they never contain pictures, lines, or diagrams. The invoices feature different font sizes and logos composed of a small picture and colored text. The contract and invoice documents were created artificially. The scientific literature consists of real-world examples, all of which were originally released under a license that allows reuse. The printer models in the dataset used in this paper are listed in Tab. 2.

Table 2: Printer models in the dataset


4.2 The Experiment’s Setup

The entire dataset is divided into two parts: 80% of the data is used as a training dataset, while the remaining 20% is used as a testing dataset. The effectiveness of the proposed system is also evaluated using 10-fold cross-validation. To classify the data, this work considers six classifiers (DT, SVM, KNN, random forest, bagging, and boosting) as well as combinations of DT, SVM, and KNN.
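The two evaluation protocols can be sketched with index-based splits. This is an illustration only: the helper names, the fixed random seed, and the use of plain (non-stratified) shuffling are assumptions, since the text does not say whether the splits were stratified by printer. The count of 1200 is the document count quoted in the abstract.

```python
import random


def train_test_split(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and hold out a fraction of them for testing."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(n, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


train, test = train_test_split(1200)
print(len(train), len(test))

fold_sizes = [(len(tr), len(te)) for tr, te in k_fold(1200)]
print(fold_sizes[0])
```

In the hold-out split every document is used once, either for training or for testing, whereas in 10-fold cross-validation every document is tested exactly once and the reported accuracy is averaged over the folds.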

4.3 Evaluation Metrics

The performance of the proposed algorithm is assessed using a variety of evaluation metrics, including accuracy, recall, precision, and F-measure [15,35,36].

The accuracy is calculated using the formula shown in Eq. (4) and is defined as the percentage of documents that are correctly identified.

where TP stands for the number of correctly classified samples, FP for the number of wrongly classified samples, TN for the number of correctly rejected samples, and FN for the number of wrongly rejected samples. Recall is the percentage of actual positive instances that are correctly classified. It is also called the true positive rate (TPR), and it is calculated with Eq. (5):

Precision is also known as the positive predictive value, and it can be calculated using the formula:

The F-measure is calculated as follows [37]:
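In terms of the TP, FP, TN, and FN counts defined above, the standard definitions of the four metrics are as follows. These are the usual textbook forms and are assumed to correspond to the equations referenced in this section; only the numbers of Eqs. (4) and (5) are confirmed by the text.

```latex
\begin{align}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \tag{4} \\
\text{Recall} &= \frac{TP}{TP + FN} \tag{5} \\
\text{Precision} &= \frac{TP}{TP + FP} \notag \\
F\text{-measure} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \notag
\end{align}
```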

Tab. 3 reports the recall, precision, and F-score of the proposed algorithm.

Table 3: Recall, precision, and F-score of the proposed algorithm
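In a multi-class setting such as printer identification, per-class recall, precision, and F-score are typically derived from a confusion matrix by treating each printer in turn as the positive class. The sketch below shows that computation on a hypothetical three-printer matrix; the numbers are invented for illustration and are not taken from the paper.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of class-i documents predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    metrics = []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                        # class i, predicted as others
        fp = sum(confusion[r][i] for r in range(n)) - tp   # others, predicted as class i
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((recall, precision, f1))
    return correct / total, metrics


# Hypothetical 3-printer confusion matrix, not taken from the paper.
cm = [[8, 2, 0],
      [1, 9, 0],
      [0, 0, 10]]
accuracy, metrics = per_class_metrics(cm)
print(accuracy)
for recall, precision, f1 in metrics:
    print(round(recall, 3), round(precision, 3), round(f1, 3))
```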

4.4 Discussion

In this algorithm, the image is partitioned into three parts: top, middle, and bottom. The algorithm was trained and tested on all three parts. The top part yielded the best results because it contains most of the printer’s fingerprints (see Tabs. 4-6). With the dividing technique, the bagging and boosting classifiers attained recognition rates of 94.5% and 96%, respectively. Figs. 6 and 7 and Tab. 4 show that, with 10-fold cross-validation, the bagging and boosting classifiers attained recognition rates of 90.5% and 92.5%, respectively. The confusion matrix for the AdaBoost classifier with HOG+LBP features and the dividing technique is shown in Tab. 7. When PCA is applied for dimensionality reduction, the length of the feature vector decreases from 2708 to 1000. With PCA, the bagging and boosting classifiers both attained a 90% recognition rate with the dividing technique. With 10-fold cross-validation, they attained recognition rates of 90.5% and 92.5%, respectively, as shown in Tab. 8 and Figs. 8 and 9. The confusion matrix for the AdaBoost classifier with HOG + LBP + PCA features and the dividing technique is shown in Tab. 9. Overall, the experiments show that the proposed algorithm achieves an accuracy of 96% by combining HOG and LBP with a boosting classifier.

Table 4: Accuracy achieved using the dividing technique and using 10-fold cross-validation (Top part)

Classifier | Dividing: HOG (%) | Dividing: LBP (%) | Dividing: HOG+LBP (%) | 10-fold CV: HOG (%) | 10-fold CV: LBP (%) | 10-fold CV: HOG+LBP (%)
Bagging | 91.5 | 89 | 94.5 | 89 | 84.3 | 90.5
Boosting | 94 | 90.5 | 96 | 90.5 | 89.1 | 92.5
Random forest | 90 | 86.5 | 90.5 | 86.3 | 84.7 | 87.9

Classifier | Dividing: HOG (%) | Dividing: LBP (%) | Dividing: HOG+LBP (%) | 10-fold CV: HOG (%) | 10-fold CV: LBP (%) | 10-fold CV: HOG+LBP (%)
Decision Tree (DT) | 51.7 | 57.1 | 56.3 | 47.9 | 51.8 | 56.3
SVM | 66.7 | 22.9 | 66.7 | 64.1 | 20.3 | 66.7
KNN1 | 45 | 43.3 | 46.7 | 45.7 | 41.6 | 46.7
KNN3 | 40.2 | 40.8 | 42.1 | 41.7 | 39.3 | 42.1
KNN5 | 41.7 | 37.1 | 42.9 | 42.5 | 36 | 42.9
DT+SVM | 55 | 44.2 | 47.9 | 76.3 | 74.1 | 80.4
DT+KNN | 45.8 | 45.4 | 47.9 | 79.3 | 75.4 | 86.7
SVM+KNN | 45.8 | 41.7 | 50 | 72.3 | 68.5 | 77.9
DT+SVM+KNN | 48.3 | 47.1 | 53.3 | 50.8 | 43.4 | 53.3
Bagging | 81.3 | 75.4 | 80.4 | 53.5 | 39.5 | 50
Boosting | 83.3 | 79.17 | 86.7 | 45.1 | 42.6 | 47.9
Random forest | 75.8 | 75 | 77.9 | 47 | 40.2 | 47.9

Table 6: Accuracy achieved using the dividing technique and using 10-fold cross-validation (Bottom part)

Figure 6: Accuracy achieved using the dividing technique (Top part)

Figure 7: Accuracy achieved using 10-fold cross-validation (Top part)

Table 7: Confusion matrix of the AdaBoost methodology for HOG+LBP with the dividing technique

Table 8: Accuracy achieved using the dividing technique and using 10-fold cross-validation with PCA (Top part)

Figure 8: Accuracy achieved using the dividing technique and PCA (Top part)

Figure 9: Accuracy achieved using a 10-fold cross-validation technique (Top part)

Table 9: Confusion matrix of the AdaBoost methodology for HOG + LBP + PCA with the dividing technique (Top part)

4.5 Comparison with Other Techniques

Although much research on SPI has been published, the methods have been evaluated using different datasets and experimental setups. As previously mentioned, many studies employ individual characters in a text-dependent framework for experimental purposes. The proposed technique is compared to four recent algorithms: Elkasrawi et al. [15], CNN [16], KPNF+SURF+ORB [17], and SURF and ORB with AdaBoost [18]. The comparison with related work on the dataset of 20 printers is given in Tab. 10. The proposed algorithm, employing HOG and LBP with AdaBoost, outperforms [15-18], which rely on textural and deep learned features, as shown in Fig. 10. The proposed algorithm, which uses both HOG and LBP and does not depend on the document’s text, thus outperforms the other four algorithms previously reported in the literature.

Figure 10: Comparison with related work on the dataset of 20 printers

Table 10: Comparison with related work on the dataset of 20 printers

5 Conclusion

This paper proposes a text-independent algorithm for detecting document forgeries based on source printer identification (SPI). The classifier’s goal is to determine the printer that produced the printed documents. The document classifier can single out a foreign document from a set of questioned documents. In this research, the image is partitioned into three parts: top, middle, and bottom. HOG and LBP are used for feature extraction. For printer identification, classifiers such as decision tree, k-NN, SVM, random forest, bagging, and boosting are considered. A public dataset of printed documents from various printers is used to validate the results. Several experiments with multiple classifiers were performed, and the most accurate classifier was chosen. The algorithm was trained and tested on all three parts, and the top part yielded the best results because it contains most of the printer’s fingerprints. The AdaBoost classifier achieves the highest classification accuracy (96%) with the proposed algorithm. The proposed algorithm is compared to four recently published algorithms that used the same dataset and gives better classification accuracy.

Funding Statement: The authors received no specific funding for this study.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
