Stavros Ntalampiras¹ | Ilyas Potamitis²
¹Department of Computer Science, University of Milan, Milan, Italy
²Department of Music Technology and Acoustics, Hellenic Mediterranean University, Rethymno, Greece
Abstract Computational bioacoustics is a relatively young research area, yet it has received increasing attention over the last decade because it can serve a wide range of applications in a cost-effective manner. This work focuses on the problem of detecting novel bird calls and songs associated with various species and individual birds. To this end, variational autoencoders, consisting of deep encoding–decoding networks, are employed. The encoder encompasses a series of convolutional layers leading to a smooth high-level abstraction of the log-Mel spectrograms that characterise bird vocalisations. The decoder operates on this latent representation to reconstruct each respective original observation. Novel species/individual detection is carried out by monitoring and thresholding the expected reconstruction probability. We thoroughly evaluate the proposed method on two different data sets, including the vocalisations of 11 North American bird species and 16 Athene noctua individuals.
Acoustic monitoring of bird activity is vital for various environmental, research, and scientific goals [1]. Most existing works focus on detecting birds by their sounds, which is the primary step in a series of applications: biodiversity monitoring, detection of endangered species etc. [2, 3]. Indeed, the area of computational bioacoustics has gained attention in recent years, especially after the mass diffusion of automated recording units, that is, devices that can record, store, and potentially transmit audio recorded in the wild to remote locations where further processing is usually carried out. These devices have become popular owing to the audio modality, which is attractive and suitable for bird monitoring—many bird species are much easier to detect by their audio patterns than by data captured through other modalities, such as video. Importantly, the audio modality is not affected by occlusions, lighting conditions etc., placing it in a unique position for monitoring the activities of bird species.
In general, applications of computational bioacoustics concern the analysis of a habitat's health, including population densities/trends of target species, migration patterns, protection of endangered species etc. For example, seabird colonies are currently being analysed using bird call activity [4, 5], general animal population density is tracked acoustically [6], calls of the black-rumped flameback are identified in real time etc.
Let us define the set S, which includes all species a priori known to reside in a specific habitat. To the best of our knowledge, existing research has analysed species while assuming complete knowledge of S, that is, of its size and composition [7, 8]. Thus, the outcome is the presence/absence of species included in S. However, this assumption might not always hold, as both new species (even if only for migration purposes, i.e. for a limited time) and new individuals of an existing species may appear. Changes in S alter species diversity, and their analysis could be useful to track migratory movements, record seasonal changes, detect invasive species etc. [9], with the overall goal being the preservation of habitat quality and balance. The consequences depend on the characteristics of the new species, and they range from biodiversity loss to the normal continued operation of the altered ecosystem [10]. This could be particularly useful in long-term monitoring scenarios, where automatic recording units can be employed in a standardised manner under potentially harsh environmental conditions to deliver accurate biodiversity indices [11].
In such a scenario (see Figure 1), the first step in processing new species/individuals is detecting them. This work concentrates on this exact problem—learning in non-stationary environments where S is not static during system operation—that is, its composition and cardinality are subject to change. Such a problem is addressed in the literature on novelty and concept drift detection [12], while solutions addressing the field of audio signal processing are limited. The case most similar to this work is presented in [13], where a typical home environment is considered. There, the solution is based on a change detection test consisting of a hidden Markov model (HMM) that characterises the available set of classes by operating in the feature space formed by Mel-frequency cepstral coefficients.
This work proposes the use of variational autoencoders (VAEs) for detecting changes in S, as VAEs have proved effective in similar tasks such as network attack detection [14], anomaly detection in energy time series [15], images [16], machine acoustics [17, 18] etc. VAEs are a family of powerful generative statistical modelling techniques that fit the current problem's specifications. Importantly, they encompass two principal processing stages [19, 20]:
(a) an encoder able to learn a non-linear projection of the input signal space—the so-called latent space (encoding)—that is typically characterised by a relatively small number of dimensions, and
(b) a decoder for the inverse non-linear transformation of the latent coefficients into the original signal space.
Keeping in mind the considerations expressed in [21], we used suitably normalised log-magnitude spectra to characterise the audio signal space. In sequence, the encoder and decoder consist of a series of convolutional and transposed convolutional layers. The network is trained to minimise the reconstruction loss between the input and output log-spectrograms along with the Kullback–Leibler (KL) divergence. Detection of novel audio signals is carried out by monitoring the respective reconstruction probability against a threshold determined on a validation set during the training phase.
The proposed solution is thoroughly evaluated on two data sets: (a) the first comprises 11 North American bird species [22], while (b) the second comprises 16 individuals of the little owl (Athene noctua) species [23]. The experimental protocol follows a leave-one-species-out/leave-one-individual-out logic, and we report excellent results in terms of false positive and false negative rates, thus improving on the current HMM-based state of the art. We also present an analysis of the latent spaces learned by the constructed VAE.
The following section formalises the present problem, while section 3 describes the VAE, the change detection process, and the employed representation of the audio signal. Section 4 details the experimental set-up, including a brief description of the data sets and the parameterisation of the approaches, as well as the analysis of the results. Finally, we draw our conclusions in section 5.
This work assumes the availability of a data set including monophonic audio signals, denoted as y_t. We further assume a single dominant sound source at each time t, leaving the problem of composite sound scenes for future work. The sources come from a known but unbounded set of classes S = {S_1, …, S_m}, where S_i denotes the i-th class, with i ∈ N+.
In contrast to the limits described in the vast majority of related literature, S is unbounded for this system, meaning that new classes may appear during system operation. These may correspond to either new species or individuals belonging to an a priori known species. In this case, y_t becomes y_t′, that is,

$$ y'_t \sim \begin{cases} S, & t < t^{\ast} \\ S \cup \{S_{m+1}\}, & t \geq t^{\ast}, \end{cases} \qquad (1) $$
where t* is the starting time instant of the manifestation of a new sound class. In order to address such an increasingly complex auditory scene, the current analysis method should be adapted, or a new one should be designed from scratch.
This formulation assumes the availability of an initial training sequence, TS = {y_t}, t ∈ [1, T_0], where the involved classes are known via labelled pairs (y_t, S_i), i ∈ [1, m]. No assumptions are made about the number of new classes or the properties of the associated probability density functions. The overall goal is to detect changes in S with the smallest false positive and false negative rates.
Data modelling algorithms can be broadly divided into two categories, namely, discriminative and non-discriminative. The first aims to discover the boundaries separating the classes that exist within the available data, while the second typically models the characteristics of each class independently of the rest. Generative models form an attractive solution for estimating such characteristics in the form of a probability density, P(x), describing a given data set. This density estimation typically includes an additional set of random variables, the so-called latent variables, denoted as z. Such a high-level representation operating in the data domain can also be employed to suitably control the data generation process.
Generative models are formalised using a joint probability function, such as

$$ p(x, z) = p(x \mid z)\, p(z), \qquad (2) $$
where p(z) represents the Bayesian prior in the latent space. When the data-generating process is positioned in the latent space z, it corresponds to a probability density function, p(x|z), in the data domain. At the same time, we wish to compute the posterior distribution p(z|x) of the latent variables associated with a sample x in the data space. Bayesian classification frameworks that calculate this posterior distribution during their operation form a rigorous inference framework. Unfortunately, complex, highly non-linear data distributions are typically intractable without strong assumptions about their occupancy, shape etc. To overcome this obstacle, a variational inference (VI) framework transforms the distribution estimation problem into one of optimisation [24]. To this end, VI starts from a distribution formalised as q(z|x) and parameterised in z. Such a distribution is freely constructed and updated so as to approach the true posterior p(z|x). During this process, the following bound is respected:

$$ \log p(x) \;\geq\; \mathbb{E}_{q(z \mid x)}\!\left[\log p(x \mid z)\right] - D\!\left(q(z \mid x)\,\middle\|\,p(z)\right), \qquad (3) $$
with D being the KL divergence. In this inequality (Equation 3), our model p(x) is intrinsically optimised via maximisation of the evidence. The bound defined above, the so-called evidence lower bound (ELBO), essentially combines the expected log-likelihood of p(x|z) with the KL divergence imposed on the estimated distribution q(z|x) to shift it towards the prior p(z).
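For completeness, the standard VI identity behind this bound, stated here for the reader's convenience, decomposes the log-evidence as

$$ \log p(x) = \underbrace{\mathbb{E}_{q(z \mid x)}\!\left[\log p(x \mid z)\right] - D\!\left(q(z \mid x)\,\middle\|\,p(z)\right)}_{\text{ELBO}} \;+\; \underbrace{D\!\left(q(z \mid x)\,\middle\|\,p(z \mid x)\right)}_{\geq\, 0}. $$

Since the last term is non-negative, maximising the ELBO simultaneously tightens the bound on log p(x) and drives q(z|x) towards the true posterior p(z|x).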
Based on the above-described VI framework, both the generative and the inference models can be formulated as normal distributions, that is,

$$ q_{\phi}(z \mid x) = \mathcal{N}\!\left(z;\, \mu_{\phi}(x),\, \sigma^{2}_{\phi}(x)\, I\right), \qquad p_{\theta}(x \mid z) = \mathcal{N}\!\left(x;\, \mu_{\theta}(z),\, \sigma^{2}_{\theta}(z)\, I\right), \qquad (4) $$

where the means and variances are produced by the encoder and decoder networks, parameterised by φ and θ, respectively.
F I G U R E 3 Log-Mel spectrograms of the considered bird species
After the encoder and the decoder models have been defined, an anomaly detection framework is needed to identify deviations from the normal/known patterns available in the training set. This logic is fundamentally different from the way neural networks are typically used, that is, to achieve a specific outcome given a specific input, as in supervised classification. A straightforward solution would be to assess an observation x by evaluating p_θ(x), potentially using Monte Carlo methods, that is, p_θ(x) = E_{p_θ(z)}[p_θ(x|z)]. However, as described in [28], sampling the prior distribution is not a practical solution.
Following the line of reasoning explained in [29], this work proposes to use the reconstruction probability, defined as E_{q_φ(z|x)}[log p_θ(x|z)], for assessing the stationarity of the audio stream. Novel audio classes are expected to bias the mapped z, exhibiting low reconstruction probabilities. Based on the findings reported in [30], each recording x in the validation set is given to the model for reconstruction by the encoding–decoding process. There, we compute the largest reconstruction error, which makes up the anomaly detection threshold. During testing, each recording is reconstructed, and the produced error is checked against the threshold. If the error surpasses the threshold, an anomaly is signalled; otherwise, the test recording is considered to be generated by the normal (known) distribution. The encoding–decoding process is demonstrated in Figure 2, while the change detection algorithm is formalised next.
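Before that, as a minimal illustration (in Python/PyTorch, which the paper does not prescribe), the reconstruction probability can be estimated with Monte Carlo samples from the encoder; the vae interface used here (encoder, fc_mu, fc_logvar, fc_dec, decoder, matching the architecture sketched in the parameterisation section below) and the unit-variance Gaussian decoder are assumptions:

```python
import torch

def reconstruction_probability(vae, x, n_samples=16):
    """Monte Carlo estimate of E_{q(z|x)}[log p(x|z)] for a spectrogram batch x."""
    h = vae.encoder(x)
    mu, logvar = vae.fc_mu(h), vae.fc_logvar(h)
    log_px = 0.0
    for _ in range(n_samples):
        # Sample z from the approximate posterior q(z|x) via reparameterisation.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        recon = vae.decoder(vae.fc_dec(z))
        # Unit-variance Gaussian decoder: log p(x|z) = -0.5*||x - recon||^2 + const.
        log_px = log_px - 0.5 * torch.sum((x - recon) ** 2)
    return log_px / n_samples  # low values indicate novel audio classes
```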
Algorithm 1 Proposed variational autoencoder-based detection of new bird species/individuals

The proposed VAE-based change detection algorithm is outlined in Algorithm 1. Its inputs are the set S and a test audio signal y_t, while its output is whether a new or a known species/individual is detected in y_t. Initially, the algorithm divides the data in S into training sets, TS, and validation sets, VS (line 1, Algorithm 1). Then, the log-Mel spectrograms are extracted (line 2, Algorithm 1), and a VAE V approximates the distribution exhibited in F_TS (line 3, Algorithm 1). Subsequently, V is applied on F_VS, and the corresponding reconstruction errors are computed (lines 5–7, Algorithm 1). The maximum mean squared error is set as the detection threshold Th (line 8, Algorithm 1).
F I G U R E 4 Log‐Mel spectrograms of the considered individuals (little owl species)
Then, to check the test audio signal y_t, we first extract its log-Mel spectrogram (line 9, Algorithm 1) and feed it to V (line 10, Algorithm 1). The reconstruction error is computed (line 11, Algorithm 1) and compared against Th (lines 12–15, Algorithm 1). If the reconstruction error is larger than Th, the algorithm signals the detection of a new species/individual. Otherwise, y_t is deemed to contain a species/individual already existing in S.
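For concreteness, a minimal Python sketch of Algorithm 1 follows. It is a sketch under stated assumptions: extract_logmel, train_vae, and reconstruction_error are hypothetical helpers standing in for the steps described above (one possible extract_logmel is given in the next section), and the random split and validation fraction are assumptions not fixed by the paper.

```python
import numpy as np

def detect_novel_class(S_recordings, y_t, extract_logmel, train_vae,
                       reconstruction_error, val_fraction=0.2, seed=0):
    """Sketch of Algorithm 1: returns True if y_t is flagged as a new class."""
    # Line 1: split the labelled recordings in S into training/validation sets.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(S_recordings))
    split = int(len(S_recordings) * (1 - val_fraction))
    TS = [S_recordings[i] for i in idx[:split]]
    VS = [S_recordings[i] for i in idx[split:]]

    # Line 2: extract log-Mel spectrograms.
    F_TS = [extract_logmel(y) for y in TS]
    F_VS = [extract_logmel(y) for y in VS]

    # Line 3: train the VAE V on the training spectrograms.
    V = train_vae(F_TS)

    # Lines 5-8: reconstruct the validation set; the largest mean squared
    # error becomes the detection threshold Th.
    Th = max(reconstruction_error(V, f) for f in F_VS)

    # Lines 9-11: process the test signal through the same pipeline.
    err = reconstruction_error(V, extract_logmel(y_t))

    # Lines 12-15: an error above Th signals a new species/individual.
    return err > Th
```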
This section briefly describes the considered feature set, which is a simplification of the Mel-frequency cepstral coefficients wherein the final dimensionality reduction step based on the discrete cosine transform is omitted, as is typical in deep learning solutions targeting audio signals [31, 32]. Initially, the audio signal is windowed using the Hamming function, and the short-time Fourier transform (STFT) is computed for each frame. The outcome of the STFT passes through a triangular Mel filterbank of 23 filters. Subsequently, we take the logarithm to adequately space the data and derive a vector of 23 log-energies per frame [33].
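One possible implementation of the extract_logmel helper used in the sketch above is given below, using librosa (which the paper does not mention). The filterbank size and Hamming window follow the text, and the 512-sample STFT and 50% overlap follow the experimental set-up reported later; tying the window length to the FFT size, the flooring constant, and loading at the native sampling rate are assumptions.

```python
import librosa
import numpy as np

def extract_logmel(path, n_mels=23, n_fft=512):
    """Log-Mel spectrogram: Hamming-windowed STFT -> 23 triangular Mel filters -> log."""
    y, sr = librosa.load(path, sr=None)   # keep the recording's native rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=n_fft // 2,  # 50% overlap
        window="hamming", n_mels=n_mels)
    return np.log(mel + 1e-10)            # shape: (23, n_frames)
```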
Representative log-Mel spectrograms extracted from the considered species and individuals are shown in Figures 3 and 4, respectively. In Figure 3, it is worth noting the inter-class variance in terms of frequency content (e.g. the differences between the American crow and the house finch) and time evolution (e.g. the differences between the common yellowthroat and the American yellow warbler). By contrast, log-Mel spectrograms at the individual level are not as diverse (see Figure 4) when examining both time and frequency content. From this point of view, we can expect change detection in the first case to be easier, because large differences are anticipated between the known and unknown species, while the second task is expected to exhibit a higher degree of difficulty. Importantly, the use of such a standardised feature extraction mechanism removes the need to conceptualise and implement handcrafted features specifically designed for a given problem.
This section includes details about (a) the data sets employed, (b) the parameterisation of the proposed and contrasted approaches, and (c) the analysis of the obtained results.
To validate the proposed method, we employed two data sets satisfying the requirements presented in section 1. The first serves new species detection, as it includes the following 11 North American bird species: blue jay, song sparrow, great blue heron, American crow, cedar waxwing, house finch, indigo bunting, marsh wren, common yellowthroat, chipping sparrow, and American yellow warbler. There are 2762 bird acoustic events adequately distributed among the available bird species, with the audio signals sampled at 32 kHz. More information is available in [8, 22], and the data set can be downloaded at https://zenodo.org/record/1250690#.XfOmzOhKhww.
The second data set serves new individual bird detection, as it encompasses 16 individuals of the little owl (Athene noctua) species. It consists of 952 bird acoustic events evenly distributed among the individuals, with the audio signals sampled at 44.1 kHz. More information is available in [23], and the data set can be downloaded at https://zenodo.org/record/1413495#.XfOnMehKhww.
To assess the performance of the proposed and contrasted approaches, we employed the following two figures of merit:
- False positive index (FPI): counts the times the test signals a novel acoustic event when none is present (expressed as a percentage).
- False negative index (FNI): counts the times a genuinely novel acoustic event is not detected as such (expressed as a percentage).
The feature set was extracted using an STFT size of 512 samples, while the signals were windowed in frames of 30 ms with 50% overlap.
The VAE encoder comprises three convolutional layers with 32, 64, and 64 filters, while the decoder network is symmetric. The kernel size is 3 × 3 with a stride equal to 2. The remaining parameters are (a) latent dimension, 40; (b) number of epochs, 50; (c) batch size, 10; and (d) learning rate, 0.001.
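A minimal PyTorch sketch of such a network follows; the framework, the padded input size of (1, 32, 128), the ReLU activations, and the padding scheme are assumptions not stated in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvVAE(nn.Module):
    """Sketch of the described VAE; log-Mel inputs assumed padded to (1, 32, 128)."""
    def __init__(self, latent_dim=40):
        super().__init__()
        # Encoder: three conv layers with 32, 64, and 64 filters, 3x3 kernels, stride 2.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),   # -> (32, 16, 64)
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # -> (64, 8, 32)
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),  # -> (64, 4, 16)
            nn.Flatten())
        self.fc_mu = nn.Linear(64 * 4 * 16, latent_dim)
        self.fc_logvar = nn.Linear(64 * 4 * 16, latent_dim)
        # Decoder: symmetric, built from transposed convolutions.
        self.fc_dec = nn.Linear(latent_dim, 64 * 4 * 16)
        self.decoder = nn.Sequential(
            nn.Unflatten(1, (64, 4, 16)),
            nn.ConvTranspose2d(64, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        return self.decoder(self.fc_dec(z)), mu, logvar

def vae_loss(recon, x, mu, logvar):
    """Reconstruction term plus KL divergence to the standard normal prior."""
    rec = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```

Training with the reported settings would then run for 50 epochs with batches of 10 spectrograms and a learning rate of 0.001, e.g. via torch.optim.Adam (the optimiser choice itself is an assumption).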
The first contrasted approach is based on an HMM trained, validated, and tested on identical sets of data. The explored number of states ranges from two to seven, while the number of Gaussian functions composing each state comes from the following set: {2, 4, 8, 16, 32, 64}. The probability threshold between subsequent iterations of the Baum–Welch algorithm is 0.001, with a limit of 50 iterations. The combination providing the most accurate modelling in terms of log-likelihood was chosen [13]. The second contrasted approach is a Gaussian mixture model (GMM), and the third a one-class support vector machine (SVM) with a radial basis function kernel [34].
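For instance, the one-class SVM baseline could be set up as follows with scikit-learn; this is a hedged sketch, as the paper specifies only the RBF kernel, so nu and gamma, as well as the feature matrices train_features and test_features, are assumptions.

```python
from sklearn.svm import OneClassSVM

# Fit on features of the known classes only.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(train_features)

# OneClassSVM labels outliers as -1, i.e. candidate novel sounds.
is_novel = ocsvm.predict(test_features) == -1
```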
TA B L E 1 Mean and standard deviation of FPI and FNI offered by the proposed and contrasted approaches
F I G U R E 9 Two examples of original and generated log-Mel spectrograms of a bird species and an individual
We present the following experimental results: (a) comparison with the contrasted approaches (HMM, GMM, and one-class SVM), (b) FPI and FNI rates per bird species/individual class, (c) log-Mel spectrogram generation examples, and (d) latent space visualisation. All results were obtained following the leave-one-species-out/leave-one-individual-out experimental protocol.
Table 1 tabulates the FPI and FNI rates achieved by the VAE and the contrasted approaches for bird species/individual detection. The best indices are shown in bold. As shown, the proposed method outperforms the contrasted ones in both cases and for both types of rates. The VAE is able to encode and decode from a structured, compact latent representation of the input log-Mel spectrograms, providing a low reconstruction error/high reconstruction probability. As such, it is able to provide reliable results in terms of FPI and FNI rates. At the same time, the HMM modelling the temporal evolution of the Mel-frequency cepstral coefficients cannot identify novel acoustic patterns with the same degree of accuracy. We also observe improved performance at the species level, which was expected because differences are more evident between species than between individuals. Finally, the SVM offers rates that are slightly worse than those of the HMM, which may result from its inability to capture the existing temporal patterns. However, the SVM performs better than the GMM-based solution.
F I G U R E 7 Visualisation of the latent space learned by the VAE trained on bird species vocalisations
F I G U R E 8 Visualisation of the latent space learned by the VAE trained on individual bird vocalisations
Figures 5 and 6 present the FPI and FNI rates achieved at the species and individual levels, respectively. In Figure 5, we observe a relatively large FPI for the cedar waxwing, which may be due to the similarities it exhibits with the rest of the North American bird species. At the same time, the FNI rates show a more consistent behaviour across species. Finally, it is worth mentioning that the great blue heron is detected quite effectively, with very low FPI and FNI rates.
Figure 6 depicts the results at the individual level. We observe that the individuals VEV, DUB, and KME are associated with high FPI rates; indeed, to a human listener, all individuals sound acoustically similar. The remaining individuals are characterised by consistent FPI/FNI rates.
Subsequently, we visualised the latent space by capturing the mean and variance encodings (each with a dimension of 40) extracted by feeding the encoder network with the test log-Mel spectrograms. Principal component analysis was carried out on the matrices comprising the encodings of the log-Mel spectrograms associated with each bird species and individual. Finally, the latent space defined by the means and variances is visualised in the first two principal component dimensions [35]. Figures 7 and 8 show the latent spaces produced for the corresponding cases of bird species and individuals. We see that both latent spaces offer a compact representation of the features associated with the classes of interest.
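A hedged sketch of this visualisation step is given below; encode_all, vae, test_spectrograms, and class_labels are hypothetical placeholders, as the paper fixes only the 40-dimensional encodings and the two-component PCA.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Collect the mean and variance encodings of all test spectrograms, (N, 40) each.
mu, logvar = encode_all(vae, test_spectrograms)

# Project the concatenated encodings onto the first two principal components.
coords = PCA(n_components=2).fit_transform(np.hstack([mu, logvar]))

# One colour per known species/individual.
plt.scatter(coords[:, 0], coords[:, 1], c=class_labels, cmap="tab20", s=8)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.show()
```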
TA B L E 2 List of notations
Figure 9 demonstrates generated log-Mel spectrograms of test recordings alongside their corresponding originals. The top row refers to the blue jay and the bottom row to a little owl individual. We observe that the log-Mel space has been smoothed, providing a good basis for analysing reconstruction probabilities. In our future work, we intend to elaborate on the generated spectrograms in an effort to invert them back into the audio signal space while focusing on interpretable sound intelligibility.
A list of notations used in the data modelling algorithm is provided in Table 2.
This work formalised the problem of detecting novel bird calls/songs by means of a VAE-based change detection algorithm. Log-Mel spectrograms of test sounds are reconstructed by the proposed encoding–decoding networks, and their novelty is assessed in terms of reconstruction error. The superiority over the contrasted HMM-based change detection algorithm was demonstrated in two use cases: detecting (1) new bird species and (2) new individuals.
Interestingly, the VAE latent space can be sampled to generate new bird vocalisations that follow the distribution exhibited by the training data. Currently, such generated vocalisations present perceptible artefacts. This will be a focus of our future work, that is, user-defined and adjustable generation of a realistic bird repertoire [36]. At the same time, we intend to analyse the present algorithms from the point of view of computational complexity, with the aim of evaluating hardware requirements for real-time applications. Another fruitful path would be the exploration of few-shot learning techniques [37], so that new sound classes can be learned and incorporated into the class dictionary on the fly as soon as a change is detected.
ACKNOWLEDGMENTS
This work was carried out within the project automatIc aNalySis of comPlex evolvIng auditoRy scEnes (INSPIRE), funded by the Piano Sostegno alla Ricerca of the University of Milan. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.
ORCID
Stavros Ntalampiras https://orcid.org/0000-0003-3482-9215