Lasheng Yu and Xiaopeng Zheng
(School of Computer Science and Engineering, Central South University, Changsha 410083, China)
Abstract: Individualizing instruction through computer-aided education systems provides students with personalized learning, so that each student can obtain the knowledge he or she needs. A large number of intelligent tutoring systems are now in use. These systems track students' learning actions in real time and accumulate a large amount of data, from which personalized education suited to each student can be mined. To improve the quality of education, models for predicting students' next practice have been produced, such as Bayesian Knowledge Tracing (BKT), Performance Factors Analysis (PFA), and, with the development of deep learning, Deep Knowledge Tracing (DKT). However, these models consider only the knowledge component and the correctness of the problem, ignoring the breadth of other characteristics collected by the intelligent tutoring system, such as the lag time since the previous interaction, the number of past attempts on a problem, and situations in which students have forgotten knowledge. Although some studies consider forgetting and rich feature information when modeling student knowledge, they often ignore student learning sequences. The contribution of this paper is twofold. One is to transform the input into a compact feature vector by introducing an autoencoder network layer and to evaluate multiple feature combinations. The other is to consider the repeated time interval, the sequence time interval, and the number of attempts to model forgetting behavior. This paper proposes an adaptation of the original DKT model: a stacked autoencoder network reduces the input dimension to half of the original while retaining the original features, and forgetting behavior is modeled according to the time sequence of students' learning. The proposed model was evaluated on two public data sets and improves on the original accuracy.
Keywords: LSTM; knowledge tracing; DKT; stacked autoencoder; forgetting behavior; feature information
Computer-assisted education opens access to world-class teaching resources and reduces the increasing cost of learning. Knowledge tracing models students over time to accurately predict their next performance and capture their needs. Based on these predictions, learning resources can be suggested to students according to their personal needs, and exercises predicted to be too simple or too difficult can be skipped or postponed.
The learning process of students is often represented by a latent learning model. According to students' responses to new problems, the model's estimate of their learning is updated. The model then uses the series of questions a student has tried before, and the correctness of each, to judge the student's performance on a new problem. Two well-known models are Bayesian Knowledge Tracing (BKT)[1] and Performance Factors Analysis (PFA)[2]. Because they capture students' progress with reliable accuracy, these models have shown success in prediction and have been extensively explored and applied in tutoring systems[3-5]. BKT uses Bayesian networks to learn four parameters per skill component, while the PFA model uses logistic regression to make predictions.
Nowadays, deep learning has achieved success in many fields of machine learning[6]. However, the features extracted by deep learning are largely difficult to interpret due to their complexity, unlike BKT and PFA, which try to combine interpretability with estimation. Based on deep learning methods, the Deep Knowledge Tracing (DKT) model[7] was proposed. The DKT model uses the latent variable of a recurrent neural network (RNN), a low-dimensional dense vector, to represent the knowledge state of students. The DKT model can express students' knowledge more richly than the BKT model, thereby improving the performance of knowledge tracing. In the field of knowledge tracing, there are few studies on forgetting[8-10], and other studies use richer student information[11]. These studies simplify the forgetting information that depends on the past sequence. Extended DKT models that consider forgetting take into account only the number of attempts[8] or the delay time between two interactions[9].
In this article, information collected by an Intelligent Tutoring System (ITS), which records a large number of features such as teaching aids used, time spent on individual tasks, and the number of attempts, is combined with forgetting-related information, and an extended DKT model is proposed.
The main contributions of this work include:
1) Multiple features are used to explicitly model forgetting behavior while taking learning sequences into account, and it is studied how combinations of various kinds of past information affect performance.
2) More features are incorporated to improve the accuracy of DKT model predictions, including skill tags, response time, number of attempts, and first practice.
The rest of this paper is organized as follows. Section 1 reviews related work on student modeling techniques. Section 2 presents the proposed DKT model combining rich features and forgetting features. Section 3 describes the datasets used in our experiments. The last section summarizes the experiments.
Traditional knowledge tracing models such as BKT and PFA have been extensively explored and applied in actual tutoring systems. BKT uses a Hidden Markov Model (HMM) to model students' knowledge. A student's state is expressed as a set of binary variables indicating whether the student has mastered each skill or not. PFA calculates the probability of a correct answer based on the numbers of successes and failures in the data. PFA considers only interactions on the same knowledge component, not the entire interaction sequence.
Since deep learning models perform better than traditional models in areas such as pattern recognition, the DKT model can simulate students' knowledge more accurately than the traditional models mentioned above. DKT uses an RNN to simulate students' knowledge, which is well suited to time-series tasks. Like BKT, the DKT model is a sequence model, but DKT uses the hidden variables of an RNN, a low-dimensional dense vector, to represent the student's knowledge state, whereas the BKT model represents a student's knowledge state as a binary variable (knowledge point mastered or not mastered). In this way, the DKT model can express students' knowledge more expressively than the BKT model, thereby improving the accuracy of predicting students' future performance. Following the DKT model, there has been more and more research on knowledge tracing models based on deep learning[10, 12-16]. Although these methods extend the DKT model, they neither use feature engineering to process the extra information nor pay attention to forgetting behavior across time. The DKT model uses a recurrent neural network to represent the latent state and its time dynamics. When a student completes a task, the model uses the information from previous time steps or questions to make better predictions of future performance.
There are several entry points for improving knowledge tracing[17], and considering forgetting behavior is one of them. An early study on forgetting[18] showed that memory retention decreases exponentially with time, and that forgetting can be counteracted by increasing the number of repeated attempts. In knowledge tracing, it is clearly unreasonable to ignore the fact that students' knowledge states change over time.
The original BKT model does not consider forgetting behavior. Khajah et al.[8] extended the BKT model with the number of attempts a student made in the past, but their algorithm does not consider interval times. In research based on the PFA model, there are two models of forgetting behavior: Pelánek added a time-decay factor and used it to determine the probability of a correct answer[9]; Settles and Meeder[14] extended the model by using a half-life forgetting curve. Although these methods consider forgetting-related issues, they ignore the sequence of interactions with other knowledge components.
The method proposed here extends the DKT model by adding feature engineering with dimensionality reduction while still considering the entire interaction sequence of students.
Existing public data sets such as ASSISTments include clickstream records and student profiles. Whenever a student answers a question in the ASSISTments system, a clickstream record is generated, and the student profile describes the student's background information and summarizes his or her clickstream data. Each clickstream record contains 64 attributes. In the BKT model[1], the student's knowledge state is estimated; every time a question is answered, the knowledge state of the student's cognitive skills is recalculated. In addition, the records can also reflect students' affective behaviors[19], such as confusion, frustration, boredom, and concentration.
The student profile contains 11 attributes; it provides students' educational background information and summarizes students' use of the above-mentioned assistance and learning indicators, including student ability, emotional state, etc. Although clickstream data provide a large amount of interactive learning information, analyzing the relationships among their temporal changes and choices is extremely complicated. Therefore, our model analyzes the students' profile data rather than the clickstream data.
Since the data distinguish STEM and non-STEM students very well, our main task is to comprehensively extract the state of each indicator of the student instead of extracting only average indicators. To fully understand a student's knowledge state, BKT is not a good choice, because it can only evaluate the state of the skills the student has already answered; for skills a student has not yet answered, BKT cannot be applied. Therefore, if BKT is used to extract the student's knowledge state for each skill, missing values result. Hence a deep knowledge tracing model (DKT) was used, which can estimate the state of all skills at the same time and, even without many extra features, outperforms many traditional KT models such as BKT and PFA. The DKT knowledge state and the student profile were combined as the feature input of the prediction model.
Our model is based on DKT, which underlies all current deep knowledge tracing models and is scalable enough to easily merge multiple input sources and determine their interrelationships.
The traditional DKT task can be expressed as a supervised learning problem: given a student's past interactions x_0, …, x_t, predict the student's performance on the next interaction. An interaction x_t = (q_t, a_t) is defined as a tuple containing the skill id q_t of the question the student attempts at time t and a label a_t indicating whether the answer is correct. In our model, the first response time (the time before attempting or asking for help in the ASSISTments system), the number of attempts required for each item, and whether hints were received were added to the features. Feature selection and dimensionality reduction were then applied. After these features were converted into categorical data, they were expressed as one-hot codes, i.e., sparse vectors. Autoencoders were used to convert the high-dimensional data into low-dimensional representative codes that can reconstruct the high-dimensional input, reducing the dimensionality without losing too much important information.
Feature engineering is a very creative part of the pipeline, as it can extract effective information. Since the input must be consistent with the representation in the DKT model, features were transformed into categorical data, simplifying the input without losing too much information. After a feature was converted into categorical data, it was represented by a one-hot code, a sparse vector, and all features were concatenated to construct the input vector. Since DKT is an RNN, training time increases greatly if the dimension is too high. Therefore, stacked autoencoders were adopted to solve this problem without loss of performance. In this way, the dimension was reduced and feature extraction was completed layer by layer without losing too much important information, making the final features more representative. Each hidden layer was used as the input to the next layer, and layers were stacked in this way. In the experiment, the dimension was reduced to half of the input size; the structure is shown in Fig. 1.
Fig. 1 Representation of a stacked autoencoder that compresses features into smaller dimensions
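To make the compression step concrete, the sketch below shows a stacked autoencoder of the kind depicted in Fig. 1, written in PyTorch (an assumption; the paper does not name a framework). The layer sizes and the end-to-end reconstruction loss are illustrative; the paper pretrains the layers one at a time.

```python
import torch
import torch.nn as nn

class StackedAutoencoder(nn.Module):
    """Compresses the sparse one-hot input to half its size (cf. Fig. 1).

    Two encoder layers are stacked so feature extraction happens layer
    by layer; the mirrored decoder reconstructs the input, which is the
    pretraining objective. Layer widths here are illustrative guesses.
    """
    def __init__(self, input_dim: int):
        super().__init__()
        hidden = 3 * input_dim // 4          # intermediate width (assumed)
        code = input_dim // 2                # final code: half the input
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, code), nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code, hidden), nn.Sigmoid(),
            nn.Linear(hidden, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

# Reconstruction loss on a batch of binary feature vectors; after
# pretraining, only the encoder's code vector is fed to the DKT LSTM.
sae = StackedAutoencoder(input_dim=220)
x = torch.rand(32, 220).round()
recon, code = sae(x)
loss = nn.functional.binary_cross_entropy(recon, x)
```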
Forgetting behavior was introduced into the DKT model, and the embedding vector and state vector of the original DKT model were modified according to the forgetting-related features of each interaction. If a student answers correctly after a long time interval, the model updates the corresponding skill toward greater mastery.
The new DKT LSTM model (Fig. 2) incorporates more features and the autoencoder representation: it connects the encoded feature O to the input layer and adds the forgetting feature o_t, which quantifies the previously related information: the repeated time interval, the sequence time interval, and the number of past attempts, combined by an integration component. y_t represents the next prediction result.
Fig. 2 The new DKT LSTM model
DKT uses RNN variants such as the Elman RNN[20] and long short-term memory (LSTM)[21] to simulate the evolution of students' knowledge. The student's knowledge state is represented by the hidden state h_t of the RNN. Modeling can be divided into two processes: modeling students' knowledge and predicting students' performance.
The input vector was constructed by one-hot encoding of the individual features, where v_t represents the input vector of each student's interaction result, e_t is the exercise tag, and c_t represents the correctness:
C(e_t, c_t) = e_t + (max(e) + 1) · c_t
(1)
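Since max(e) + 1 equals the number of distinct exercise tags, Eq. (1) simply gives each (tag, correctness) pair its own categorical index: wrong answers occupy the first max(e) + 1 slots and correct answers the next. A small illustration with hypothetical values:

```python
def cross_feature(e_t: int, c_t: int, num_tags: int) -> int:
    """Compound index from Eq. (1), with num_tags = max(e) + 1:
    wrong answers map to 0..num_tags-1, correct answers to
    num_tags..2*num_tags-1, so each (tag, correctness) pair gets
    its own one-hot slot."""
    return e_t + num_tags * c_t

# e.g., with 110 skill tags: skill 7 answered correctly vs. incorrectly
assert cross_feature(7, 1, num_tags=110) == 117
assert cross_feature(7, 0, num_tags=110) == 7
```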
The model adds previous information. To model the student's process, a trainable embedding matrix A is used to calculate v_t instead of assigning random values. The forgetting vector o_t is a multi-hot vector combining three features, each expressed as a one-hot vector and then spliced together. Before entering the RNN module, the embedding vector v_t is first integrated with the additional information o_t through an integration function θ_in:
v'_t = θ_in(v_t, o_t)
(2)
and the hidden knowledge state is then updated by the LSTM:
h_t = LSTM(v'_t, h_{t-1})
(3)
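A minimal sketch of how such a multi-hot forgetting vector might be built and integrated, again assuming PyTorch; the bin counts and the gating form of θ_in are illustrative assumptions, since the integration operator is not spelled out above:

```python
import torch
import torch.nn as nn

def one_hot(index: int, size: int) -> torch.Tensor:
    v = torch.zeros(size)
    v[index] = 1.0
    return v

# Each forgetting-related feature is discretized (here into 8 bins,
# a hypothetical choice) and one-hot encoded, then spliced into the
# multi-hot vector o_t.
repeat_gap_bin, seq_gap_bin, attempts_bin = 3, 1, 2   # hypothetical values
o_t = torch.cat([one_hot(repeat_gap_bin, 8),          # repeated time interval
                 one_hot(seq_gap_bin, 8),             # sequence time interval
                 one_hot(attempts_bin, 8)])           # number of past attempts

# One plausible theta_in: project o_t and gate the interaction
# embedding elementwise before it enters the LSTM (Eq. (2)).
emb_dim = 100
v_t = torch.randn(emb_dim)                 # embedded interaction A[x_t]
theta_in = nn.Linear(24, emb_dim)
v_t_prime = v_t * torch.sigmoid(theta_in(o_t))
```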
Therefore, taking into account the student's response to the skill and the forgetting-related information, the student's knowledge state is updated. If a student answers questions correctly, the model moves toward a better estimated mastery.
The training details are as follows. The stacked autoencoder was pretrained layer by layer in advance. The performance on every skill was predicted at each step, but only one prediction was supervised, because there is only one label per time step. A single LSTM layer was used with 200 hidden nodes, and the dropout probability was 0.4 during training. In the processes of modeling students' knowledge and predicting students' performance, the integration function was added to the model to account for forgetting behavior.
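A sketch of a model matching these training details (one LSTM layer, 200 hidden nodes, dropout 0.4), assuming PyTorch; the input would be the autoencoded feature vector integrated with the forgetting information:

```python
import torch
import torch.nn as nn

class DKTForget(nn.Module):
    """One LSTM layer with 200 hidden nodes and dropout 0.4, as in the
    training details above. input_dim is the (autoencoded) feature
    vector integrated with forgetting information; the output layer
    gives a correctness probability for every skill at every step."""
    def __init__(self, input_dim: int, num_skills: int):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, 200, batch_first=True)
        self.dropout = nn.Dropout(0.4)
        self.out = nn.Linear(200, num_skills)

    def forward(self, x):                  # x: (batch, time, input_dim)
        h, _ = self.lstm(x)                # h_t holds the knowledge state
        return torch.sigmoid(self.out(self.dropout(h)))   # y_t per skill
```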
The learned parameters of our model are the embedding matrix A of the interaction x_t, the weights of the RNN module, the output weight matrix, and other parameters. They are learned jointly by minimizing the standard cross-entropy loss between the predicted probability of answering the next question q_{t+1} correctly and the true label a_{t+1}:
L = -Σ_t [ a_{t+1} · log(y_t · δ(q_{t+1})) + (1 - a_{t+1}) · log(1 - y_t · δ(q_{t+1})) ]
(4)
where a_{t+1} is the true correctness label of the next interaction and δ denotes the one-hot encoding of the next question's skill id.
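In code, the selection y_t · δ(q_{t+1}) reduces to a gather along the skill dimension. A hedged sketch of the loss, with padding masks omitted for brevity:

```python
import torch

def dkt_loss(y, next_q, next_a):
    """Eq. (4): cross-entropy between the prediction for the next
    question's skill and the true label.
    y:      (batch, T, num_skills) per-skill predictions y_t
    next_q: (batch, T) int64 skill ids q_{t+1}
    next_a: (batch, T) true labels a_{t+1} in {0, 1}
    """
    p = torch.gather(y, dim=2, index=next_q.unsqueeze(-1)).squeeze(-1)
    return torch.nn.functional.binary_cross_entropy(p, next_a.float())
```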
The two public data sets used, ASSISTments and Open Learning Initiative (OLI), contain data recorded by learning platforms during the learning process. Table 1 shows the statistics of these two data sets. The ASSISTments 2009-2010 data set is the skill-builder data set, constructed around skill mastery. In the initially reported results, three problems were found that inadvertently exaggerated the performance of DKT, so a newer version was used here. The OLI Statics F2011 data set comes from a college engineering statics course. Since this is a time-series algorithm, students whose records contain fewer than two interactions were excluded.
Table 1 Statistics of the ASSISTments 2009-2010 and Statics 2011 data sets
Five-fold cross-validation was used, and results were evaluated by AUC. There are many possible combinations of features, but only some of the most effective ones are discussed here. On the two data sets in Table 1, merging features or introducing forgetting behavior is superior to the original DKT model and the BKT model. On the ASSISTments 2009-2010 data set, the AUC value increased from 0.7235 to 0.7309 when forgetting behavior was added, and to 0.7306 when the input features time/correctness, action (time of the first attempt), and attempt (whether help was requested or not) were autoencoded. Training directly on one-hot codes was also attempted; the results showed that time efficiency was extremely low, or training was even infeasible, once the three features were introduced simultaneously.
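The evaluation protocol can be sketched as follows, with synthetic stand-ins for the model's per-interaction predictions; in the actual experiments the folds are split at the student level and the model is retrained per fold:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for true labels a_{t+1} and model outputs y_t.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = np.clip(0.3 * labels + 0.7 * rng.random(1000), 0.0, 1.0)

aucs = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(labels):
    # In the real experiments the model is retrained on each training
    # fold (with folds split by student); here only the held-out fold
    # is scored to illustrate the metric.
    aucs.append(roc_auc_score(labels[test_idx], scores[test_idx]))
print(f"mean AUC over 5 folds: {np.mean(aucs):.4f}")
```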
In this article, the traditional DKT model was extended, and a new DKT model was proposed that fully considers forgetting characteristics and multiple features. Student-level cross-validation was used, with AUC and R2 as metrics. Multiple combinations of features were evaluated, and the forgetting features were added on top of the combined features. On the two data sets in Table 2, the model with combined features is better than the original DKT model, and using autoencoders shows even higher performance, which supports the dimensionality reduction. Without autoencoders, combining all features is almost infeasible.
Table 2 Test results
Compared with the previous DKT model, our model is more consistent with the actual learning process; with more effective feature information and forgetting features, it improves the accuracy of the algorithm.