Huiyu Sun and Suzanne McIntosh
Abstract: The majority of big data analytics applied to transportation datasets suffer from being too domain-specific, that is, they draw conclusions for a dataset based on analytics on the same dataset. As a result, models trained on one domain (e.g., taxi data) transfer poorly to a different domain (e.g., Uber data). To achieve accurate analyses on a new domain, substantial amounts of data must be available, which limits practical applications. To remedy this, we propose to use semi-supervised and active learning of big data to accomplish the domain adaptation task: selectively choosing a small number of datapoints from a new domain while achieving performance comparable to using all the datapoints. We choose the New York City (NYC) taxi and Uber transportation data as our dataset, simulating different domains with 90% as the source domain for training and the remaining 10% as the target domain for evaluation. We propose semi-supervised and active learning strategies and apply them to select informative datapoints. Experimental results show that our adaptation achieves performance comparable to using all datapoints while using only a fraction of them, substantially reducing the amount of data required. Our approach has two major advantages: it enables accurate analytics and predictions when big datasets are not available, and even when big datasets are available, it chooses the most informative datapoints out of the dataset, making the process much more efficient without having to process huge amounts of data.
Keywords: Big data, taxi and Uber, domain adaptation, active learning, semi-supervised learning.
New York City (NYC) transportation data is available to the public for a number of different domains: taxi, Uber, etc., and many works [Deri, Franchetti and Moura (2016); Freire, Silva, Vo et al. (2014); Sun and McIntosh (2016); Poulsen, Dekkers, Wagennar et al. (2016); Aslam, Lim and Pan (2012); Cramer and Krueger (2016); Sun, Hu, McIntosh et al. (2018)] have been proposed to take advantage of it, performing various analytics and tasks such as predicting traffic, classifying datasets and offering services. Most works, however, rely on data from the same domain to draw conclusions, and when the analytics and models are applied to a different domain, the results are often not promising. This is due to variations between datasets leading to differences in the trained models; it is difficult to accurately adapt a model trained on one dataset (domain) to a different dataset without training instances from that domain.
Many works investigated learning approaches on big data using a machine learning or neural network based model [Kasun, Zhou, Huang et al. (2013); Najafabadi, Villanustre, Khoshgoftaar (2015); Chen and Lin (2014); Wu, Zhu, Wu et al. (2014); Suthaharan (2014); Belkin, Matveeva and Niyogi (2004); Lv, Duan, Kang et al. (2015)]. Most of these analytics are a form of supervised learning, which requires large quantities of data.
Figure 1: Processing pipeline of our semi-supervised and active learning approach
By contrast, active learning methods [Sun, Grishman and Wang (2017); Akbik, Thilo and Christoph (2014); Sun and Grishman (2018); Sun, Grishman and Wang (2016)] aim to achieve high performance with as little labeling effort as possible. Semi-supervised [Deri, Franchetti and Moura (2016)] and active learning allow us to selectively choose informative datapoints on the new domain, aiming, with the selected data, to come as close as possible to the performance achieved using all the datapoints on the target domain.
The features for our model are the data entry fields of a given datapoint: VendorID, pick-up date time, drop-off date time, passenger count, trip distance, pick-up longitude, pick-up latitude, RateCodeID, store and forward flag, drop-off longitude, drop-off latitude, payment type, fare amount, extra, MTA tax, tip amount, tolls amount, improvement surcharge, and total amount. Entries common to the Uber and taxi datasets are used as features. We found that the variations between different datasets are distinguishable, so these features are sufficient to tell the datasets apart. The label assigned to each datapoint is simply which transportation dataset it belongs to: yellow taxi, green taxi, or Uber. We investigate domain adaptation on NYC transportation data: yellow taxi, green taxi and Uber.
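As a rough sketch of this setup, the snippet below assembles a labeled datapoint from fields shared by the taxi and Uber schemas; the exact field names and string representation are illustrative assumptions, not the schema used in our experiments.

```python
# Sketch: represent a trip record as a labeled datapoint. The common fields
# are kept in a fixed order and later treated as a character string for the
# edit distance comparisons. Field names and label values are illustrative.
COMMON_FIELDS = [
    "pickup_datetime", "dropoff_datetime", "passenger_count", "trip_distance",
    "pickup_longitude", "pickup_latitude", "dropoff_longitude",
    "dropoff_latitude", "fare_amount", "tip_amount", "total_amount",
]
LABELS = ("yellow taxi", "green taxi", "uber")

def to_datapoint(record, source):
    """record: dict of raw trip fields; source: which dataset it came from."""
    assert source in LABELS
    text = ",".join(str(record.get(field, "")) for field in COMMON_FIELDS)
    return text, source  # the string form feeds the edit distance methods
```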
We use a semi-supervised bootstrapper to classify taxi and Uber data. Bootstrapping has been extensively studied and applied in prior work. Active learning is then used to select informative datapoints on the target domain. First, we take a semi-supervised approach by investigating ways to bootstrap new data. Then, once the best-performing bootstrapping system is obtained, we turn our system into an active learner that selects positive and negative datapoints at each iteration. The selected datapoints are given a label (yellow taxi, green taxi or Uber) and are used to classify the target domain data. The pipeline of our system is shown in Fig. 1.
For semi-supervised learning, we use a bootstrapper that starts out with a set of seed datapoints. It then selects new data based on different sampling strategies. We first consider the edit distance strategy. Each taxi and Uber data entry can be viewed essentially as a string of characters containing time, location, fare amount, etc. The minimum edit distance [Levenshtein (1966)] between two strings (datapoints) is the minimum number of editing operations required to transform one string into the other. The generalized Levenshtein algorithm [Damerau (1964); Bard (2007); Brill and Moore (2000)] with weights and backtrace is used to compute the edit distances. The edit distance scores are normalized to fall in the range [0, 1], with smaller scores indicating more similar datapoints. Given a set of seed datapoints, we compute the edit distances from a candidate datapoint to each of the seed datapoints. This set of edit distances forms an edit distance vector, E, for the candidate datapoint. Then Eq. (1) is applied to E to give a final score, B(x), for the candidate datapoint. This is computed for each datapoint. The datapoints with the smallest scores are selected and added to the seed data set for the next iteration.
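As an illustration, the sketch below performs one bootstrapping iteration over string-form datapoints. The aggregation defined by Eq. (1) is not reproduced here; averaging the entries of E into B(x) is an assumption standing in for it.

```python
# Sketch of one bootstrapping iteration: score every candidate by its
# normalized edit distances to the seed set and keep the closest ones.
# Averaging E into B(x) is an assumption standing in for Eq. (1).
def bootstrap_step(seeds, candidates, edit_distance, batch_size=1000):
    def B(x):
        E = [edit_distance(x, s) for s in seeds]   # each score in [0, 1]
        return sum(E) / len(E)                     # smaller = more similar
    ranked = sorted(candidates, key=B)
    selected, remaining = ranked[:batch_size], ranked[batch_size:]
    return seeds + selected, remaining             # grow the seed set
```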
This edit distance approach is used when bootstrapping. However, another approach is used for comparison: the measure of entropy. Entropy has been used in many works [Zhu, Wang, Yao et al. (2008); Becker, Hachey, Alex et al. (2005)] to measure uncertainty [Lewis and William (1994)]. When negative data are introduced, there are two labels for a datapoint: '+' if it is an example of the relation, and '-' if it is not. The entropy of a datapoint can then be computed by Eq. (2), where P(+) is the probability of a datapoint being labeled positive and P(-) is the probability of it being labeled negative.
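For reference, Eq. (2) as described is the standard binary entropy. A direct implementation is sketched below; the base-2 logarithm is an assumption, since the base is not stated.

```python
import math

def entropy(p_pos):
    """Binary entropy of a datapoint given P(+); P(-) = 1 - P(+)."""
    h = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0.0:               # 0 * log 0 is treated as 0
            h -= p * math.log2(p)
    return h
```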
For bootstrapping, we pick the candidate data with the smallest entropy, as they are the most reliable data. For active learning, on the other hand, candidate data with the largest entropy are picked since they represent the highest uncertainty.
An active learner starts out with a set of seed data. It then selects new seed data based on different sampling strategies. The basic idea is to select informative examples that maximize the likelihood of reaching optimal performance. We investigate several different active learning strategies for sample selection.
The active learning algorithm is shown in Fig. 2. We start with a few seed data. The active learning strategy is then applied to a set of candidate data and those satisfying the strategy are selected. The iteration continues until a stopping criterion is met. We randomly select 90% of the taxi and Uber dataset as the source domain for training and the remaining 10% as the target domain for evaluating our learner. Besides active learning, different search, ranking and information extraction methods [Sun and McIntosh (2016); Sun, McIntosh and Li (2017); Fu, Ren, Shu et al. (2015); Zhou, Yang, Chen et al. (2016); Fu, Sun, Liu et al. (2015); Yuan, Xia, Sun et al. (2017); Xia, Wang, Sun et al. (2016); Xia, Wang, Sun et al. (2014)] have been proposed.
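A compact sketch of this loop is shown below. Here `strategy` scores a candidate (higher meaning more informative under the chosen sampling method) and `assign_label` stands in for labeling a selected datapoint as yellow taxi, green taxi or Uber; both names, the fixed iteration budget, and the batch handling are assumptions rather than the exact procedure of Fig. 2.

```python
# Sketch of the active learning loop: start from seeds, repeatedly select
# the batch of candidates that best satisfies the sampling strategy, label
# them, and add them to the training set until the stopping criterion is met.
def active_learning(seeds, candidates, strategy, assign_label,
                    batch_size=1000, max_iterations=10):
    labeled = [(x, assign_label(x)) for x in seeds]
    for _ in range(max_iterations):        # stopping criterion: fixed budget
        if not candidates:
            break
        ranked = sorted(candidates, key=strategy, reverse=True)
        batch, candidates = ranked[:batch_size], ranked[batch_size:]
        labeled.extend((x, assign_label(x)) for x in batch)
    return labeled
```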
One approach to quantifying similarity is the edit distance [Levenshtein (1966)]. The minimum edit distance between two strings is the minimum number of editing operations (insertion, deletion and replacement) required to transform one string into the other. It is used in speech recognition, computational biology and other fields [Damerau (1964); Bard (2007); Brill and Moore (2000)].
For a datapoint X of length n and Y of length m, we define D(i, j) as the edit distance between X[1..i] and Y[1..j]. The edit distance between X and Y is thus D(n, m). We use the generalized Levenshtein algorithm [Levenshtein (1966); Damerau (1964)] to produce alignments and alignment scores between two datapoints. It uses a backtrace to keep track of alignments. Weights on the edit operations are also considered, since some operations are more likely than others. The algorithm is shown in Fig. 3. With the edit distance approach, the more similar a candidate datapoint is to a seed datapoint, the smaller the edit distance will be. This allows us to rank each candidate datapoint accordingly.
Figure 2: Active learning algorithm for selecting new datapoints on the target domain
Figure 3: Generalized Levenshtein algorithm with weights and backtrace for edit distance
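A minimal implementation of the algorithm in Fig. 3 is sketched below; the unit operation weights are placeholders for whatever weighting is actually used.

```python
# Generalized Levenshtein distance with operation weights and a backtrace
# matrix for recovering the alignment. D[i][j] is the cost of transforming
# x[:i] into y[:j]; the weights here are illustrative defaults.
def weighted_edit_distance(x, y, w_ins=1.0, w_del=1.0, w_sub=1.0):
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], back[i][0] = i * w_del, "del"
    for j in range(1, m + 1):
        D[0][j], back[0][j] = j * w_ins, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (0.0 if x[i - 1] == y[j - 1] else w_sub)
            delete = D[i - 1][j] + w_del
            insert = D[i][j - 1] + w_ins
            D[i][j] = min(sub, delete, insert)
            back[i][j] = ("sub", "del", "ins")[(sub, delete, insert).index(D[i][j])]
    return D[n][m], back   # distance D(n, m) plus the backtrace matrix
```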
We observe a limitation of our current active learning methods: any particular seed datapoint is limited in that it can only pick out a small set of the available data. Using our active learning methods to quantify similarity between a seed datapoint and potential candidates picks out different ways to describe a specific datapoint, but it sometimes fails to introduce new instances. Here we combine our edit distance methods with entropy to counter this limitation. Given a set of candidate datapoints, we first calculate a set of edit distances. Each edit distance is then divided by the length of the datapoint; this way, long strings (datapoints) result in an even lower value. Next we calculate the entropy of the candidate set in the same way as in semi-supervised learning, using Eq. (2).
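The sketch below (reusing the `entropy` helper above) is one way to realize this combination; how the similarity and uncertainty terms are actually weighted against each other is not specified, so the simple difference used here is an assumption.

```python
# Sketch of the combined edit distance + entropy strategy: length-normalized
# distance to the closest seed measures similarity, entropy measures
# uncertainty. Subtracting the distance from the entropy (higher = selected)
# is an assumption about how the two terms are combined.
def combined_score(candidate, seeds, edit_distance, p_pos):
    distances = [edit_distance(candidate, s) for s in seeds]
    norm_dist = min(distances) / max(len(candidate), 1)  # divide by length
    return entropy(p_pos(candidate)) - norm_dist
```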
Figure 4: F1 curves for different active learning methods for 10 iterations on the target domain
Entropy has been used in many works to measure uncertainty. Here, we use it as an additional metric in combination with the scores calculated using the edit distance method. The uncertainty part might seem counter-intuitive at first, since we are mainly trying to discover similar datapoints, but it has the benefit of introducing completely different datapoints.
We start with a few seed datapoints. The default batch size we use is 1000, meaning that in each active learning iteration 1000 new datapoints are selected and added to the data set. We compare performance in terms of the F-measure, specifically the F1 score. It combines recall and precision and is calculated by Eq. (3).
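Eq. (3) is the usual F1 score, the harmonic mean of precision and recall:

```python
def f1_score(precision, recall):
    """F1 (Eq. (3)): harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```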
We compare the performance of the edit distance and entropy based active learning methods introduced in the previous section. We also use a random selection strategy for comparison. Fig. 4 shows the F1 curves for the different methods. Using the 90% split of the source domain, we get an F1 of 48.62% on the target domain, which is iteration 0. The F1 score then improves with each iteration. We performed 10 iterations for each method. Randomly selecting datapoints clearly gives much worse performance than the other, selective sampling methods, which shows that there is much room for improvement in the sampling method. The edit distance method, the simplest sampling strategy, beats random selection by a large margin. The F1 scores after the 10 iterations for the four methods are shown in Tab. 1.
Table 1: F1 scores for random selection and different active learning methods after 10 iterations. Supervised score is also shown
The entropy active learner has performance similar to edit distance in the first few iterations, but after the 4th iteration it clearly outperforms edit distance and continues to do so in later iterations. This shows that as more and more datapoints are selected, entropy is the better method.
As expected, our combination of entropy with the edit distance active learner achieved the best performance of the four, with a better F1 in each iteration of active learning. After 10 iterations, it beats the next best (entropy) by 5% and beats random selection by 18%. This shows the advantage of leveraging both similarity and uncertainty in sample selection. Using the supervised method, i.e., all the datapoints from the 10% target domain, we get an F1 of 82.37%. That is the upper bound against which we compare our active learning results. With the edit distance + entropy active learner, we achieved an F1 of 71.26% after 10 iterations, which is 86.52% of the supervised score. We used only a fraction of the datapoints from the new domain, which contains millions of data entries; in terms of data reduction, we used less than 0.01% of the data and achieved a score comparable to the supervised method (86.52% of the supervised score).
To get an overall picture of our active learner's performance, we need to take the split into account. In our experiments, we did random 90%/10% splits of the train and test data. We want to see the effect that different splits have on the results, as some splits might be more or less representative. We did 20 such random splits and recorded the scores on the test set. The frequency distributions of F1 scores on the test set for the entropy and edit distance + entropy learners are shown in Fig. 5. The variation in scores roughly conforms to a normal distribution in both cases.
An overlooked aspect in these big data analytics is the substantial variance in F1 score across different random splits of the dataset, as seen in the figure. The standard deviation (about 2.5% F-measure) is comparable to the difference between systems. To avoid this problem it is necessary either to completely specify the split, eliminating the variance, or to take the mean of n splits (thus reducing the standard deviation by a factor of √n). In the active learning experiments, we took a split that achieved a score close to the average of the 20 random splits to minimize the variance in scores.
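A sketch of this split-variance check is given below; `evaluate` is a placeholder for training on the 90% split and scoring F1 on the 10% split.

```python
import random
import statistics

# Sketch: repeat the random 90%/10% split n times and report the mean and
# standard deviation of the resulting F1 scores. `evaluate` stands in for
# training on the source split and scoring on the target split.
def split_variance(data, evaluate, n=20, train_frac=0.9, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(train_frac * len(shuffled))
        scores.append(evaluate(shuffled[:cut], shuffled[cut:]))
    return statistics.mean(scores), statistics.stdev(scores)
```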
Figure 5: Frequency distributions of F1 scores after 20 random 90%/10% splits on train and test with entropy (left) and edit distance + entropy (right) active learners
In this paper, we have analyzed the possibility of domain adaptation on big datasets and found that accurate adaptation between the domains of NYC transportation datasets (yellow taxi, green taxi and Uber) is feasible. We used semi-supervised learning (bootstrapping) to train our classification model and active learning to adapt it to a new domain. Turning the semi-supervised approach into an active learner further increased the performance of our system, and the active learner combining edit distance and entropy performed the best. Using only a fraction of the labeling effort, we achieved performance similar to using all datapoints on the new domain: the process of bootstrapping followed by active learning achieved an F1 score of 71.26% (86.52% of the supervised score) using only a tiny fraction of the dataset, showing that domain adaptation of big datasets can yield performance comparable to supervised methods.