How can an educational chatbot’s feedback influence human attention?

Educational chatbots have been shown to be useful assistants in computer-supported learning settings. However, how does the feedback of an educational chatbot affect the learner's attention? This paper therefore proposes a study to measure changes in human attention while learning with the educational chatbot Liza, which is intended to improve human reasoning ability. In total, 18 participants took part in the study and had a conversation with Liza. During the interaction with Liza, the attention of the study participants was measured using a mobile electroencephalogram (EEG) device. Three findings were determined based on statistical methods. First, a significant attention effect occurred in 54% of the cases in which the educational chatbot gave feedback and the attention measurement spanned the length of a task. Second, when differentiating the type of feedback, positive feedback had a significant effect in 71 of these 137 cases (51.82%) and negative feedback in 66 of the 137 cases (48.18%). Third, the statistical results showed no significant difference in attention at the significance level of 0.05 between the 10 seconds before and the 10 seconds after positive feedback was received; the same holds for negative feedback.


Introduction
In 2020, the COVID-19 pandemic disrupted the world. A majority of communication suddenly had to take place digitally, whether over e-mail, phone calls or video calls, reducing in-person contact to stop the spread of the virus. This also meant that secondary and tertiary education had to switch from a traditional face-to-face setting to distance learning. Teachers had to adjust to the new challenges that came with online learning. Students, on the other hand, also struggled with the new learning setting, facing more distractions in the comfort of their own home, as well as the overall deterioration of performance that comes with digital learning environments (Junco & Cotton, 2012). Teachers therefore need to ensure that students pay full attention to the online lesson, since attention has been identified as one of the key factors for learning performance, aiding the learning process by helping students to focus on important information (Al'Omari & Balushi, 2015, p. 685).
How can teachers in an online learning setting determine the learner's attention? To determine whether a learner is attentive, there have been successful attempts to measure students' physiological parameters with an electroencephalogram (EEG) and derive the level of attention from them (Chiang et al., 2018). While it is certainly possible to integrate such an attention measurement device into an educational chatbot, measuring attention takes multiple steps and can be rather costly, making it an inadequate choice for an educational environment. Therefore, another useful instrument to "wake up" the attention, and thus the learning performance, of a student could be giving adaptive feedback during learning.
Teachers give feedback during a lesson, knowing the importance of feedback in education, since it can double the average student performance (Hattie & Timperley, 2007, p. 83). From the technical point of view, an educational chatbot may also adopt the pedagogical strategy of teachers and should give feedback at appropriate points in time. A research gap is whether an educational chatbot's feedback can influence the attention of learners and thus their learning performance. This paper therefore poses the research question: How can an educational chatbot's feedback influence human attention?
For this study, an experiment is conducted in which participants talk to an educational chatbot that has been developed to improve human reasoning. During the entire conversation, the participant solves numerous tasks while their attention is monitored with a mobile EEG device. After each solution, the educational chatbot responds with feedback, letting the participants know whether they answered correctly or incorrectly.

State of the art

The concept of attention
The definition of attention may have first been coined by William James in 1890 as follows: "It is the taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneously possible objects or trains […] of thought. Focalization, concentration, of consciousness are of its essence. It implies withdrawal from some things in order to deal effectively with others, and is a condition which has a real opposite in the confused, dazed, scatterbrained state […]" (James, 1890).
Later on, the definition of attention was differentiated into four main types (McDowd et al., 1991): sustained attention, selective attention, alternating attention and divided attention. Sustained attention, also known as "vigilance" or "tonic alertness", pertains to maintaining focus with a moderate level of mental effort over an extended period of time (Oken et al., 2006; McDowd, 2007).
Selective attention is the process of actively focusing on one stimulus, whether from the external environment or from internal sources, while filtering out others (Johnston & Dark, 1986). A famous example of selective attention is the cocktail party effect, described by Colin Cherry in 1953: the ability to focus on one conversation while multiple others are happening around (Lindsay, 2020, p. 2).
Alternating attention is the ability to switch back and forth between tasks that require different cognitive processes (Sohlberg & Mateer, 1987).
Finally, divided attention, commonly known as "multi-tasking", is the activity of processing more than one stimulus at a time or reacting to multiple stimuli simultaneously. According to these definitions, sustained attention is the type most closely related to learning, because learners need to maintain their focus on a specific learning context over a period of time (Esterman & Rothlein, 2019, p. 174).

Attention in online learning
Maintaining attention is crucial to learning. Various studies have demonstrated that attention has an impact on achieving better learning performance and is thus essential in a learning environment (Al'Omari & Balushi, 2015; O'Connell et al., 2009). However, whereas computer-supported learning technologies come with a considerable number of benefits, it is undeniable that they also come with disadvantages. The downside is the sheer number of possible distractions while a student is using a computer, e.g., a notification of a new e-mail appearing on the screen. Through an online survey, Junco and Cotton (2012) showed that the more college students engage with digital learning environments, the worse their performance becomes. In a meta-analysis, Delgado et al. (2018) reviewed research comparing reading on paper to reading on digital screens: the consensus was a clear advantage for paper, the authors arguing that digital environments are not suitable for deep comprehension and learning. Since most computer-supported learning environments rely on quick interactions, students might find it hard to maintain their attention for reading comprehension on a digital device (Delgado et al., 2018, p. 20). Lodge and Harrison (2019) also analyze the role of attention in the online learning environment and claim that the easy accessibility of information negatively impacts the capacity to learn.
Although the lack of attention has been identified as a side-effect of online learning, and thus negatively impacts learning performance, both students and teachers perceive online learning environments as useful (Liaw et al., 2007).

Attention and feedback
Feedback is a kind of information we receive after an action has been performed (Nelson & Schunn, 2008). In a learning setting, feedback can be given by a teacher or a classmate.
Feedback can be divided into two categories: positive and negative. Positive feedback is received when an action has been performed correctly; negative feedback is given when it has not. That feedback results in learning can be explained by behaviorism. One of the pioneers of the behaviorist learning theory was Thorndike (1927), who asked subjects in an experiment to draw a four-inch line while blindfolded. He discovered that subjects who did not receive any sort of feedback did not improve and therefore had not learned, whereas subjects who received feedback (simply stating whether their drawing was right or wrong) gained considerable accuracy in drawing lines.
Furthermore, research has focused on the impact of positive or negative feedback on learning performance. Brackbill and O'Hara (1958) successfully tested their hypothesis that humans learn faster when receiving both reward (positive feedback) and punishment (negative feedback) rather than just reward. However, this study did not include a punishment-only control group. Meyer and Offenbach (1962) found that a punishment-only (negative feedback only) group led to a faster improvement in learning performance.
While some studies (Brackbill & O'Hara, 1958; Meyer & Offenbach, 1962) show the effect of feedback on learning to be an overall improvement, with negative feedback having the greater impact, Kluger and DeNisi (1996) criticize this assumption. In their meta-analysis, Kluger and DeNisi (1996) find that feedback has highly variable effects, depending on the type of learning and the continuity of the feedback intervention.

Human attention using educational technologies
Research studies on human attention using educational technologies pursue mainly two goals (Table 1): 1) supporting the teacher with an attention recognition system, and 2) observing changes of attention during a lecture.
Arbel and colleagues (2020) conduct a study with twenty young adults who had to perform a variety of learning tasks while their attention was measured via eye-tracking.
Any change of attention was calculated through the two main eye-gaze measures: proportion of fixation time and fixation probability. If a task was solved correctly, the subject received visual positive feedback in the form of three green checks; if solved incorrectly, three red Xs appeared. The authors concluded that the attention of the subjects changed when feedback occurred, with the change being greater when negative feedback was received (Arbel et al., 2020, p. 6).
Liu and colleagues (2013) build a system supporting the teacher in recognizing the level of students' attention in distance learning, using data from a mobile EEG device. Zalatelj and Košir (2017) pursue similar research, however in traditional classroom settings, estimating a student's attention level by monitoring body posture and gaze during a lecture with the help of a motion sensor. Even though different measurement methods of attention were used, both recognition systems (Liu et al., 2013; Zalatelj & Košir, 2017) achieve a similar accuracy in identifying attentive states.
Other researchers attempt to compare students' attention in computer-supported learning against text-based learning. Two research groups find attention to be higher in teaching methods that include a tool such as PowerPoint or digital maps (Sezer et al., 2015) or a chemical demonstration (Bunce et al., 2010). In contrast, Ni et al. (2020) suggest that attention is greater when students learn with text-only material compared to material containing videos or graphs.
Only one relevant study that measures attention values while a participant is talking to a chatbot was found. Bitner and Le (2021) validate the algorithm of a mobile EEG device for determining the attention of a subject by means of an experiment. In the study, 27 participants were instructed to interact with the pedagogical agent Synja, which teaches the user concepts of the programming language Java, while wearing the mobile EEG device. The authors report only weak evidence, with only 5 participants showing a higher attention value when answering correctly. In addition, the study suggests that the average attention score leading up to a correctly answered question is higher than for an incorrectly answered one (Bitner & Le, 2021).

Methods to measure attention
In a traditional classroom, attention is often differentiated by on-topic and off-topic activities (Keller et al., 2020). Teachers may recognize whether a student's attention is on-topic or off-topic by observing the body language of their students. While there are some clear signals that point to on-topic attention, such as a student nodding in response to a teacher's presentation, other signals may indicate both on-topic and off-topic behavior. For example, a student staring out of the window can either mean that their attention is off-topic, or that they are thinking about a question the teacher just asked (Keller et al., 2020).
It is thus up to the teacher to interpret these signals correctly. When an educational chatbot takes the role of a teacher, different measurement methods can be used to predict the attention level of a student.
There are qualitative and quantitative measurement methods for attention. The focus of this paper is on the quantitative methods, which can be categorized into indirect and direct types. Indirect measurement relies on observing an individual, e.g., body movement and eye-tracking. Eye-tracking is the "most widely used tool for measuring visual attention" (Mancas & Ferrera, 2016, p. 22). Direct measurement of attention is based on physiological responses of individuals, e.g., brain signals. The human brain runs our body functions and our thinking. When the brain tells our body what to do, millions of neurons (brain cells) are activated, producing a current flow (Teplan, 2002). To measure this electrical brain activity, an electroencephalogram (EEG) can be applied: multiple electrodes are placed on the scalp of a subject to record the electric flows through the brain. There are five widely recognized brain waves, differentiated by their unique frequency ranges (Teplan, 2002): Delta δ (0.5-4 Hz), Theta θ (4-8 Hz), Alpha α (8-13 Hz), Beta β (13-30 Hz), and Gamma γ (>30 Hz).
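As a minimal sketch, these frequency ranges can be expressed as a lookup table; the helper `band_of` below is purely illustrative and not part of any EEG toolkit:

```python
# Mapping an EEG frequency (Hz) to its conventional band name,
# using the ranges given above (Teplan, 2002).
BANDS = [
    ("Delta", 0.5, 4.0),
    ("Theta", 4.0, 8.0),
    ("Alpha", 8.0, 13.0),
    ("Beta", 13.0, 30.0),
]

def band_of(freq_hz: float) -> str:
    """Return the conventional EEG band for a frequency in Hz."""
    for name, lo, hi in BANDS:
        if lo <= freq_hz < hi:
            return name
    # Everything above 30 Hz is Gamma; below 0.5 Hz is outside the listed bands.
    return "Gamma" if freq_hz >= 30.0 else "sub-Delta"

print(band_of(10.0))  # Alpha
```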
Studies have demonstrated a correlation between sustained attention and a decrease of Alpha wave activity. Ray and Cole (1985) conduct an experiment with college students, who are asked to complete several tasks over the course of three days. The study reports higher parietal Alpha in tasks that do not require attention compared to those that do, i.e., a decrease of Alpha during attentional tasks. O'Connell and colleagues (2009) also conclude that an increase in Alpha band activity corresponds to a decrease of attention. In a more recent study, this widely accepted correlation of Alpha waves and attention was investigated again, the authors rightfully noting that "correlation is not causality" (Bagherzadeh et al., 2020, p. 577). In that study, the subjects were divided into two groups, one learning how to manipulate the Alpha power in the left hemisphere and the other in the right hemisphere. After completing a task, the researchers found that higher Alpha power in the left-hemisphere group led to higher attention than in the right-hemisphere group, supporting a causal relationship between attention and Alpha synchrony.
While Alpha band power is agreed to correlate with attention, findings on the relation of other EEG band powers to attention are controversial. Aftanas and Golocheikines (2001) suggest a link between attention and Theta activity. Observing the EEG during different meditation phases, they found an absence of Theta activity in subjects with little experience in meditation, possibly because these subjects felt anxious about not maintaining a meditative state and thus had a narrow attentional focus. Pavlygina et al. (2011) report that Delta activity increases during problem-solving, indicating an increase of attention. Contrary to the previous finding regarding Theta activity, Oken et al. (2006) find two measures that often correlate with poor task performance: increasing Theta and decreasing Beta. The result that Beta activity decreases during inattentive behavior is in accordance with the report of Linden et al. (1996).

Methodology
Since there is a research gap concerning the effect of feedback given by a learning system on the learner's attention, the research question "How can an educational chatbot's feedback influence human attention?" is of high relevance. Regarding the possible effect of feedback on attention, the following hypotheses will be tested: H0: Neither positive nor negative feedback has an impact on human attention.
H1: Positive feedback will lower the level of attention while negative feedback will improve it.
The H1 hypothesis is supported by the research of Arbel et al. (2020) in psychology. To reject either the H0 or the H1 hypothesis, it is mandatory to measure two variables: the feedback of the educational chatbot and the attention of the person talking to the chatbot.
To measure these two variables, an experiment must be carried out.

Materials
The experiment consists of a conversation between the educational chatbot Liza and the study participants in a laboratory environment, while each participant wears NeuroSky's MindWave Mobile 2. The attention levels (10, 20, 30, 40, 50, 60, 70, 80, 90, 100) of the user are provided by NeuroSky's prebuilt Attention Meter algorithm.
Educational chatbot: LIZA

Wartschinski et al. (2017, p. 249) developed Liza because teaching human reasoning skills requires repeated explanations and examples; since the resources of a human teacher are limited and this topic is not usually offered at school, an educational chatbot can take over this task for teachers.
The web-based educational chatbot Liza was programmed in Java. The knowledge base, referred to as the "Story Store", contains the human reasoning tasks, categorized into seven topics: Bayesian reasoning, the Law of Large Numbers, the Gambler's Fallacy, Wason's Selection Task, Covariance Detection, the Sunk Cost Fallacy, and Belief Bias in Syllogistic Reasoning. The tasks were taken from well-known psychology studies that have shown validity in the literature (Wartschinski et al., 2017).
A lesson begins with Liza greeting the user and introducing herself and her aim. Afterwards she starts the training phase by announcing one of the seven topics with a short explanation (if the user wants one), followed by an example case the user has to solve (see Figure 1).
If the task has been solved correctly, Liza gives the user a short, positive, and enthusiastic response (Figure 2). However, if it was solved incorrectly, Liza responds in a concerned way, nonetheless still encouraging the user (Figure 3).
At the end of the conversation, after 14 human reasoning tasks have been solved (two for each reasoning topic), Liza gives the user a summary of their conversation and final feedback on the user's abilities. Liza has been thoroughly designed and programmed to overcome the mentioned common design challenges, mainly working task-oriented while using common social behavior cues (Wartschinski et al., 2017, p. 251). Upon the user's request, Liza gives hints for solving a task. Liza can show emotions like pride, joy, and curiosity, as well as disappointment, sadness, and insecurity (Wartschinski et al., 2017, p. 252). An evaluation study of Liza was conducted in which the participants were divided into a treatment and a control group; the treatment group talked to the agent, whereas the control group only read short texts about each topic. The results showed that the treatment group had an overall better performance than the control group; the participants also noted that they liked working with Liza and had a positive experience with her. For her success in teaching, her high usability and her implemented feedback algorithm, Liza was chosen as the educational chatbot for this paper.
The Sunk Cost Fallacy describes the tendency to continue investing time or money in a bad course of action because of what has already been invested. For example, Liza might create a situation where the user accidentally bought tickets to two different plays for the same day; not being able to return the tickets, the user must decide which play to see: the one that costs more but is less entertaining, or the cheaper one that sounds more promising. The right decision is the second one, since the money has already been spent and cannot be retrieved. However, since it is a fallacy, most people would choose to go to the more expensive play.
The Gambler's Fallacy occurs when people observe a series of events and think that the events that occurred more frequently will appear less frequently in the future. To train the users' skills, Liza explains a situation of past events and asks the user what the next event might be. To make this less abstract: Liza describes a situation in which a coin was tossed a few times and showed the same side each time. For the next toss, Liza asks the user which side (heads or tails) will most likely appear. According to the Gambler's Fallacy, the user will say tails, since heads has been leading in the previous events. The right answer, however, is 50:50, because past events cannot change future events.
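A short simulation (purely illustrative, not part of the study) confirms the 50:50 answer: even conditioned on a streak of five heads, the following toss is still heads about half the time.

```python
import random

random.seed(42)

def next_after_streak(trials: int = 5_000, streak: int = 5) -> float:
    """Fraction of heads on the toss immediately following `streak` heads."""
    heads_next = 0
    found = 0
    while found < trials:
        # Toss streak+1 fair coins; True means heads.
        run = [random.random() < 0.5 for _ in range(streak + 1)]
        if all(run[:streak]):          # observed `streak` heads in a row
            found += 1
            heads_next += run[streak]  # outcome of the following toss
    return heads_next / trials

print(round(next_after_streak(), 2))  # close to 0.5
```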
Regression to the Mean is the effect an extraordinary sample has on the understanding of a whole population: an unusually large or small sample is typically followed by a sample that is closer to the average of the population. For example, there are two hospitals, one having more births per day than the other (100 vs. 10). Liza then asks the user which hospital is more likely to have a higher rate of male births on a specific day. Most people would assume the answer to be the hospital with 100 births per day, even though the right answer is the hospital with fewer births, since with more births a great deviation from the average is less likely to happen.
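A small simulation (again illustrative, not from the study) makes the sample-size effect concrete: days with more than 60% male births occur far more often in the small hospital.

```python
import random

random.seed(0)

def share_of_deviant_days(births_per_day: int, days: int = 10_000) -> float:
    """Fraction of days on which more than 60% of births are male."""
    deviant = 0
    for _ in range(days):
        # Each birth is male with probability 0.5.
        males = sum(random.random() < 0.5 for _ in range(births_per_day))
        if males / births_per_day > 0.6:
            deviant += 1
    return deviant / days

small = share_of_deviant_days(10)    # 10 births per day
large = share_of_deviant_days(100)   # 100 births per day
print(small > large)  # True: the smaller hospital deviates more often
```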
The last reasoning topic is the Base Rate Fallacy, which occurs in problems with conditional probabilities. To train this complex reasoning topic, the users calculate the chances of a certain event happening. For example, the user must solve a problem in which they determine the chance of a person having a specific disease given a positive test result. Liza gives the user numbers on the occurrence of the disease in the population and the false-positive/false-negative rates to help the user solve the problem. Most people tend to find these sorts of tasks rather difficult and therefore solve them incorrectly, because they ignore parts of the statistical information that confuse them.
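A numerical sketch of such a task via Bayes' theorem; the prevalence and error rates below are hypothetical, not taken from Liza's actual tasks:

```python
def posterior_given_positive(prevalence: float, sensitivity: float,
                             false_positive_rate: float) -> float:
    """P(disease | positive test) via Bayes' theorem."""
    # Total probability of a positive test (true positives + false positives).
    p_pos = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    return sensitivity * prevalence / p_pos

# Hypothetical numbers: 1% prevalence, 90% sensitivity, 5% false-positive rate.
p = posterior_given_positive(0.01, 0.90, 0.05)
print(round(p, 3))  # 0.154 -- far lower than most people's intuition
```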
In the version implemented for this study, four reasoning tasks were generated for each of four reasoning topics; Liza thus gives 16 tasks in total. The four tasks of each reasoning topic are not asked successively but in alternation: a Sunk Cost Fallacy task is followed by a Gambler's Fallacy task, which is followed by a Regression to the Mean task, which is followed by a Base Rate Fallacy task. This cycle is then repeated three more times.
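The described ordering amounts to a round-robin over the four topics; a minimal sketch (the task labels are illustrative, only the topic names come from the text):

```python
from itertools import chain

topics = ["Sunk Cost Fallacy", "Gambler's Fallacy",
          "Regression to the Mean", "Base Rate Fallacy"]

# Four cycles, one task per topic per cycle -> 16 tasks in alternation.
order = list(chain.from_iterable(
    [f"{t} (task {cycle + 1})" for t in topics] for cycle in range(4)))

print(len(order))   # 16
print(order[:4])    # first cycle: one task per topic
```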
The language used by Liza was also translated into German for the new version (see Figure 4), because the study participants would be native German speakers and the focus of attention should be on the learning content, not the language. This is also supported by a study conducted by Liu and colleagues (2013), in which the attention of students during a lecture was recorded. The students were reading a text in English, which was not their native language, and the study found that Delta waves were more active while the reading took place (Liu et al., 2013). Since an increase of Delta activity positively correlates with increasing attention, any factors that could lead to a misinterpretation of results were removed.
The research question can only be properly evaluated if the subject receives both positive and negative feedback. Since the four reasoning topics vary in their difficulty, it can be assumed that every participant will receive both positive and negative feedback, which makes Liza an ideal educational chatbot for this study.

NeuroSky's MindWave Mobile 2
Due to its simple usability and accuracy, the MindWave Mobile 2 headset was used to record the attention of the study participants. The headset's reference and ground electrodes are on the ear clip, and the EEG electrode is on the sensor arm. The EEG electrode is placed on the forehead above the eye, i.e., the FP1 position of the frontal lobe (Teplan, 2002).
The MindWave Mobile 2 uses Bluetooth to connect to other devices, in this case a laptop.
To collect the actual physiological data, the OpenVibe program was used to record the EEG signal.

Participants
Human reasoning concerns everyone and is not limited to a specific group of students or persons in general. Therefore, almost anyone can participate in this experiment. When it comes to attention and age, there is a proven decline of attention with higher age (Zanto et al., 2011). It was therefore necessary to set an age limit. While various studies find a decline in sustained attention when comparing very young people to old people, e.g., Zanto et al. (2011), a recent study took a deeper look at more specific age groups (Lufi & Haimov, 2018). Lufi and Haimov (2018) recruited 496 participants, divided into eight age groups spanning ten years each, with the youngest group starting at 12 years. The participants had to complete a computerized math test that assessed attention. The results showed an increase of attention across the first three age groups (12-39.99) and a steady decline after that, with the peak attention level in the age group 30-40 (Lufi & Haimov, 2018). Therefore, the age limit for this experiment was set at 40. To cover this age range, we invited, in addition to university students, friends and family relatives who desired to improve their reasoning skills.
Another important aspect when it comes to attention is possible disorders that might affect it. The best known is attention deficit hyperactivity disorder (ADHD), which comes with a "pattern of inattention […] that interferes with functioning or development" (Battle, 2013, p. 59). To identify whether a person suffers from ADHD, the Diagnostic and Statistical Manual of Mental Disorders (Battle, 2013) describes nine diagnostic symptoms; an adult who meets at least five of these is very likely to have ADHD. Thus, all participants had to fill out a questionnaire to identify a possible attention disorder.

Experiment procedure
The experiment took place in an office room in 12043 Berlin (Figure 6). Over the course of two weeks, the participants were asked to come to the building to undergo the experiment. Before each experiment, the MindWave Mobile 2 device was wiped clean with a damp cloth and a new battery was inserted to ensure that it would last for the entire session. The educational chatbot was running and ready to start a new conversation.
Additionally, the room was aired for 10 minutes and anything the participants might touch was disinfected to reduce the spread of COVID-19. The experiment itself consisted of four phases.
Introduction Phase. In this phase, the participants were introduced to the experiment and its goal; however, the types of attention were not explained to them. The participants had to read and sign a consent form agreeing to the use of their data, as well as answer a pre-questionnaire. The pre-questionnaire consisted of two parts. In the first part, the participants had to state their age, degree of education (completion of secondary education, bachelor's degree, master's degree or higher, or other) as well as their current employment.
Following this, the participants had to answer questions to identify a possible attention disorder. This part consisted of six questions in accordance with the Diagnostic and Statistical Manual of Mental Disorders (Battle, 2013). On a Likert scale (never, rarely, sometimes, often, very often), participants had to rate how often they have difficulties concentrating on a task or conversation. After completing the questionnaire, the participants were further instructed on the conversation with Liza: what exactly Liza teaches and that they should respond by simply typing the answer into the intended text field. Furthermore, since some of Liza's tasks involve mathematical problems, the participants were given a calculator as well as pen and paper to write down any thoughts they had while solving a task. In total, this phase lasted about 10 minutes.
Resting Phase. After introducing the participants to the experiment, the MindWave Mobile 2 device was placed on their head. Then, the OpenVibe program for recording EEG data was started. Once it was ensured that everything was running smoothly, the participants were allowed to rest for 5 minutes. We chose 5 minutes because short breaks for sustained attention tasks commonly last between 3 and 10 minutes. This was mainly done so the participants could get used to the new environment and relax (Schumann et al., 2022), so that brain activity was calm before the conversation with Liza. Overall, this phase lasted about 10 minutes.
Experimental Phase. In this phase, the participants had their conversation with Liza to solve the reasoning tasks. Based on previous experiments with Liza (Le & Wartschinski, 2018), we estimated at most 60 minutes for solving the 16 reasoning tasks. The real time, however, depended on how quickly or slowly the participants solved the tasks.
Post-Experiment Phase. After the conversation with the educational chatbot Liza, the participant was allowed to take the MindWave Mobile 2 device off. The participant then had to fill in a post-questionnaire that consists of similar questions as the pre-questionnaire, plus questions estimating their own attention level (very low, low, average, high, very high). The recorded csv file contained redundant entries; since the attention information is calculated by the eSense algorithm once per second, the redundant data was eliminated so that the file contains one data entry per one-second interval. After that, the file was enriched with conversation data and time stamps, such as when a new task was introduced or when feedback was given.
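The per-second de-duplication can be sketched as follows; the data layout (timestamp/attention pairs) is an assumption about the csv structure, not taken from the source:

```python
# (seconds since start, eSense attention value); the value is constant
# within each second, so only one entry per second needs to be kept.
raw = [
    (0.1, 40), (0.4, 40), (0.9, 40),
    (1.2, 50), (1.8, 50),
    (2.5, 60),
]

per_second = {}
for t, attention in raw:
    # setdefault keeps the first value seen for each whole second.
    per_second.setdefault(int(t), attention)

print(sorted(per_second.items()))  # [(0, 40), (1, 50), (2, 60)]
```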

Data collection and analysis methods
The data analysis consists of multiple parts: 1) the changes of attention directly before and after feedback has been received; 2) the changes in attention over an entire task after receiving feedback; and 3) the difference between the pre-test and post-test questionnaires. A two-tailed t-test will be applied in these three analyses, since the hypothesis of this work is undirected, meaning attention can either decrease or increase after receiving feedback.
Changes in attention directly before and after receiving feedback. Feedback is given sixteen times during the entire conversation (once for each task solved). Consequently, it is necessary to determine how many data points will be evaluated each time feedback is received. The following points were considered when choosing the right number of data points: when a task is introduced, the participant must first read and understand the question, and then solve it. After the user gives an answer, Liza either responds directly with the feedback or asks the user to estimate their chances of having answered the task correctly. Both reading and thinking affect the level of attention, overwriting any effect Liza's feedback might have. Therefore, for the direct comparison, a timeframe of 10 seconds before and after receiving feedback is considered. Within this timeframe, the mean attention level, both before and after receiving feedback, was calculated and stored in two separate tables: one for positive feedback and one for negative feedback. Then, a paired t-test was conducted for each table. A paired t-test, or correlated t-test, is used when the mean difference within one subject is measured for a before-and-after comparison and the two means are therefore dependent (Xu et al., 2017).
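As a minimal sketch of this paired comparison (the attention means below are invented; in the study there would be one before/after pair per feedback event):

```python
import math
from statistics import mean, stdev

# Mean attention in the 10 s before vs. the 10 s after feedback,
# one pair per feedback event (values are illustrative only).
before = [52.0, 47.5, 60.1, 55.3, 49.9, 58.2]
after  = [55.4, 50.2, 58.7, 60.0, 53.1, 61.5]

# Paired t-statistic: mean of per-pair differences over its standard error.
diffs = [a - b for a, b in zip(after, before)]
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Compare |t_stat| against the two-tailed critical value for df = n - 1.
print(round(t_stat, 2))  # 3.1
```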
Changes in attention over a task after feedback is received. The changes in attention during task-solving were also analyzed, to observe whether a chatbot's feedback has an impact beyond the initial 10 seconds after receiving it. Q represents the moment Liza asks the question, and F the feedback after the question has been solved, with "Qn →" describing the entire task-solving process for the nth task. One comparison element then runs from when feedback Fn is received up to when feedback Fn+1 is received, with the entire task-solving of Qn+1 in between. For each t-test, this before and after will be compared. The recording of each set spans multiple seconds, from Liza introducing the task to the participant answering it. In almost all cases, the recordings of two consecutive sets are not the same length, and thus do not have the same number of data points. Therefore, an independent t-test was used for each comparison.
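Because the two compared segments have different lengths, the pooled-variance independent t-test can be sketched as follows, again with the standard library only and invented per-second attention values rather than the study's recordings.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(x, y):
    """Pooled-variance independent t-test statistic and degrees of
    freedom for two samples of (possibly) different lengths."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2

# Per-second attention over two consecutive task segments of
# different length (illustrative data).
segment_a = [60, 62, 58, 61, 59, 60]
segment_b = [50, 52, 49, 51, 50, 48, 52, 48]
t, df = independent_t(segment_a, segment_b)
print(f"t = {t:.2f}, df = {df}")
```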
Pre-test and post-test questionnaire. Both the pre-test and the post-test questionnaire provide a subjective rating of the learner's attention and can be compared with the recorded data. The participants had to estimate their attention level for each question they solved on another Likert scale (very low = 1, low = 2, average = 3, high = 4, very high = 5). The actual level of attention, recorded with the MindWave Mobile 2 device during task-solving, was also assigned scores ("10-20" = 1, "20-40" = 2, "40-60" = 3, "60-80" = 4, "80-100" = 5). Then, a paired t-test was conducted to determine whether the self-estimated level of attention correlated with the recorded level of attention.
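The mapping of recorded attention values onto the 1-5 scale can be sketched as a simple binning; how exact boundary values (e.g., 60) are assigned is an assumption here, since the paper does not specify it.

```python
from bisect import bisect_right

# Upper edges of the score bins described above.
BIN_EDGES = [20, 40, 60, 80]

def attention_score(value):
    """Map a recorded eSense attention value (roughly 10-100) to the
    1-5 scale used for comparison with the self-reports."""
    return bisect_right(BIN_EDGES, value) + 1

print([attention_score(v) for v in (15, 55, 60, 85)])  # [1, 3, 4, 5]
```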

Results
During the recording of participant 10, there was a significant amount of time in which no measurement took place (16 minutes in total). In the case of participant 6, there was a significant loss of information during 6 questions. The data of participants 6 and 10 were therefore discarded entirely. Participants 13, 14 and 15 also had partially corrupted data. Participants 13 and 14 each answered one question with "I do not know", so Liza did not give feedback but instead asked the participants if they would like to know the answer. Participant 15 had a significant time of no measurement during two questions. For these three participants, only the data regarding the affected questions were discarded, resulting in a total analysis of 16 recordings.

Changes in attention
10-second comparison. The results for positive and negative feedback in the 10-second comparison are presented in two separate tables: positive feedback in Table 2 and negative feedback in Table 3. n indicates the number of positive/negative feedback instances received, x is the mean attention of the 10 seconds before feedback is received, y is the mean of the 10 seconds after feedback is received, and x − y is the mean difference.
The p-value was not smaller than the significance level of 0.05 in any of the 16 recordings, indicating no significant difference in attention between the 10 seconds before and the 10 seconds after positive feedback is received. Thus, the H0 hypothesis cannot be rejected.
Table 3 shows the results of the paired t-test for incorrectly solved tasks, followed by negative feedback.
Similar to positive feedback, the H0 hypothesis cannot be rejected in any of the 16 recordings, because none of the differences is significant at a significance level of 0.05.
Comparing changes in attention over the length of a task. Independent t-tests were calculated to compare the changes in attention over two consecutive tasks after feedback was received. Since the independent t-test was done for each task of each participant, this resulted in a total of 252 t-tests (see Appendix Table 5).
In total, the H0 was rejected in 137 of the 252 cases (54.36%), meaning that in these cases the p-value was smaller than 0.05. Differentiating by the type of feedback, positive feedback had a significant effect in 71 of these 137 cases (51.82%) and negative feedback in 66 of the 137 cases (48.18%). The H1 hypothesis can be accepted in 88 of these 137 cases: the predicted effect of positive feedback was found in 52 cases, and the predicted effect of negative feedback in 36 cases. However, these instances are not evenly distributed over the participants. With respect to the H0 hypothesis, it was rejected in 5 cases for participant 1, in 6 cases for participant 13, in 7 cases for participants 8 and 9, in 8 cases for participants 11, 15, 16 and 17, in 9 cases for participants 4, 5 and 7, in 10 cases for participants 2, 3 and 14, and lastly in 12 cases for participants 12 and 18. The H1 hypothesis reached a support rate of more than 50% in 15 participants, with the highest rate of 80% for participants 1, 2 and 3. Participant 4 could only support the H1 hypothesis in 40% of the cases.
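Tallying the rejections per participant and per feedback type amounts to a simple count over the t-test results; the triples below are invented stand-ins for the 252 task-comparison tests, not the study's actual p-values.

```python
from collections import Counter

# (participant, feedback_type, p_value) triples (illustrative only).
results = [
    (1, "p", 0.03), (1, "n", 0.20), (2, "p", 0.01),
    (2, "n", 0.04), (2, "p", 0.60), (3, "n", 0.02),
]
# H0 is rejected wherever p < 0.05.
rejected = [(pid, kind) for pid, kind, p in results if p < 0.05]
per_participant = Counter(pid for pid, _ in rejected)
by_type = Counter(kind for _, kind in rejected)
print(f"{len(rejected)}/{len(results)} rejected; "
      f"per participant: {dict(per_participant)}; by type: {dict(by_type)}")
```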

Pre- and post-questionnaire
Table 4 shows the results of the t-test for the comparison between pre- and post-questionnaire.
"Attention disorder" describes the paired t-test of the six pre- and post-questionnaire questions that are used to identify a possible attention deficit, whereas "self-report" shows the results of the self-estimated attention level compared to the recorded attention level.
For the first part, the "attention disorder", there was no indication of a significant difference at a significance level of 0.05. However, for the comparison of self-reported and recorded data, several participants (2, 3, 5, 8, 9 and 15) showed a significant difference (i.e., p-values smaller than 0.05) between the self-reported and recorded attention levels. A deeper analysis was required for the self-reports of participants 1, 4, 7, 11, 12, 13, 14, 16, 17 and 18, whose difference between self-reported and recorded attention levels was non-significant. Of those participants, only participant 14 had a matching low self-reported and recorded attention level, in task 12. However, the p-value of that task (and of the following task) was higher than 0.05, and thus the match of low attention levels was not significant.

Discussion and limitations
To observe the influence a chatbot's feedback has on human attention, an experiment was conducted in which human attention during a conversation with an educational chatbot was measured. In general, the results of the experiment were inconclusive, and neither hypothesis can be fully accepted nor rejected. When looking for a reason why a hypothesis might have been rejected, it is important to keep two main findings of current research in mind: sometimes feedback simply has no effect, and human attention has no fixed state, as it can fluctuate at any time (Bunce et al., 2010; Kluger & DeNisi, 1996).
The hypothesis H0 can neither be fully rejected nor accepted, since the 10-second and the task comparison show contradicting results. In the 10-second comparison, there is no significant increase or decrease in attention after receiving any kind of feedback, supporting the claim of the H0 hypothesis that neither positive nor negative feedback influences human attention.
There are a few possible reasons that could explain the absence of an effect. In the case of negative feedback, when a task has been solved incorrectly, Liza offers the participant the correct solution. If the participant accepts, they will read the solution during those 10 seconds; otherwise Liza introduces a new task. Except for one case, participants always accepted the offer of a solution. Therefore, negative feedback might not have had an impact within the 10 seconds, since the participants were busy reading the correct solution before solving a new task. In the case of positive feedback, Liza goes straight to a new task. The lack of influence could be explained by the fact that the participant did not have enough time to process the feedback, since a new task on a different reasoning topic was immediately introduced. Another possible reason for the lack of an effect of both positive and negative feedback might be that the timeframe of 10 seconds was simply too small to see a difference, which becomes apparent when comparing the attention over two tasks. In the task comparison, the H0 hypothesis is rejected multiple times, in roughly half of all comparisons. These results are consistent with the research regarding the effect of feedback, namely that feedback can be both influential (e.g., Arbel et al. (2020); Hattie & Timperley (2007)) and non-influential (Kluger & DeNisi, 1996). The simplest explanation for this phenomenon is that the feedback of a chatbot does not have an equal influence on the attention of every human, as humans do not react uniformly to stimuli. This explanation is further supported by the results of this work: participant 1, for example, had a significant reaction to 5 of the 16 pieces of feedback received, whereas participant 12 had a reaction to 12 of the 16.
Regarding the H1 hypothesis, the study likewise shows no conclusive result for its rejection or acceptance. In 64% of the significant comparisons, the predicted effect occurred: positive feedback was followed by a decrease in attention, and negative feedback by an increase in attention. The predicted influence of positive feedback is more dominant, with a decrease in attention occurring in 73.24% of the cases in which a participant received positive feedback. This suggests that the positive feedback of an educational chatbot can decrease the attention of a human and should therefore only be used when necessary.
Nonetheless, the H1 hypothesis could not be accepted in every case. This might be due to the variance in difficulty across the reasoning topics. The Base Rate Fallacy was rated the most difficult one, and participants often did not want to solve these kinds of tasks.
Receiving negative feedback in such situations, where the participant already expected to fail, will not have the same influence as when the participant is unsure whether they have solved the task correctly. Conversely, receiving positive feedback in those situations could lead to an increase in attention, as the participant did not expect it and therefore wants to pay more attention to the following tasks. In contrast, the Sunk Cost Fallacy was found to be the easiest reasoning topic. These tasks use "either/or" questions: if participants answered the first Sunk Cost Fallacy question incorrectly, they could easily work out the answer for the next task. Here, negative feedback might increase attention, as the participant would have expected their answer to be incorrect. However, this instance works in favor of the H1 hypothesis.
This study had several limitations, and thus its results require careful consideration. First, due to the small number of participants, it was not possible to implement a control group. All participants received feedback, and while there was a significant influence of feedback in some cases, it is not sufficient to say that this effect comes only from the feedback of a chatbot. Second, in some cases, Liza was not able to understand the answer of the participant. Furthermore, Liza sometimes interpreted the participant's input incorrectly, e.g., when the participant asked for help while solving a task, Liza interpreted "help" as an answer to the task. Third, many participants noted that Liza did not give them enough time to solve the task. While a participant is thinking about a solution, Liza waits for input. If one minute passes without an answer, Liza interprets the silence as the learner being stuck, lets them know that they can take as much time as they need, and offers help. She does this repeatedly, until the participant answers. As commented by one participant, this led to frustration and a disruption of the thinking process. These technical issues call for improving Liza's ability to interpret users' answers more accurately.
The recorded and analyzed data rely solely on the accuracy of NeuroSky's algorithm, which is a black box; we do not know how the algorithm was developed. A concurrent validation of this algorithm would therefore help to verify the results it provides.

Conclusion and future work
This paper proposed a study to answer the research question "How can a chatbot's feedback influence human attention?". In the state of the art, feedback was found to generally improve learning performance. To analyze the relationship of positive and negative feedback to learners' attention, an experiment was carried out: 18 study participants had a conversation with an educational chatbot for improving human reasoning, while their attention was continuously measured with a mobile EEG device.
This study makes three contributions. First, a significant attention effect occurred in 54% of the cases when the attention measurement after the educational chatbot's feedback took place over the length of a task. Second, when differentiating the type of feedback, positive feedback had a significant effect in 71 of these 137 cases (51.82%) and negative feedback in 66 of the 137 cases (48.18%). Third, statistical results showed no significant difference in attention at the significance level of 0.05 between the 10 seconds before and the 10 seconds after positive feedback is received; the same holds for negative feedback. These results suggest measuring attention over a longer period.
While this research has produced inconclusive results, that does not mean the question of how educational chatbots affect human attention is unanswerable. The results of this paper point in two directions, which are both equally plausible: 1) the feedback of an educational chatbot has no effect; 2) the feedback of an educational chatbot can influence human attention, with positive feedback being a slightly larger driver. Despite the lack of a control group and the inconclusive results, this paper provides a first step towards understanding how educational chatbots can influence human attention.
If an educational chatbot is used to ease the workload of a teacher, developers must accept that feedback will not have the desired effect on every student equally. Until the research question can be fully answered, it is recommended that an educational chatbot not rely solely on feedback as a tool to control a student's attention.
Based on this work, further questions have been raised: Is the feedback of an educational chatbot really the cause of the change, or is there just a correlation? Are there other methods a chatbot can use to influence human attention that do not involve the constant measurement of physiological data?
Education is an important factor in everyone's life. It is crucial that all teachers, human or computer, take the role of teaching seriously and know about the effects of their teaching tools. Although this work was not able to find a clear answer to the proposed research question, future research in this field is still necessary.

Appendix: Results of the task comparison
The following table contains the entirety of results of the task comparison. The column "Means of Feedback" contains the means of the before and after, as described in the data analysis in Section "Data Collection and Analysis Methods". "Type" stands for the type of feedback, either positive (p) or negative (n). "Results of the t-test" contains the t-statistic, degrees of freedom (df) and the p-value. If H0 is marked with an "X", the H0 hypothesis was rejected for this feedback; if H1 is marked with an "X", the H1 hypothesis was accepted. Fn stands for the nth feedback given, with n = {1, 2, ..., 16}. If an entire row is marked with a minus (-), there was not enough data to be analyzed.

Fig. The educational chatbot Liza

For the study being presented, a different version of Liza has been developed. This version was programmed in Python with the help of RASA, an open-source machine learning framework for conversational chatbots. Instead of seven topics, the simpler version of Liza for this study only provides four (Sunk Cost Fallacy, Gamblers Fallacy, Regression to the Mean and Base Rate Fallacy). The following examples and explanations are taken directly from Liza's knowledge base and illustrate what exactly Liza teaches and what the participants had to solve. All reasoning tasks have been validated in psychology studies and are based on the rational thinking composite by Toplak et al. For recording the brain signals, OpenVibe has been used. OpenVibe is an open-source software for Brain-Computer Interfaces and can be used to acquire, filter, and analyse brain signals in real time. The signal acquisition consists of multiple steps. After connecting the MindWave Mobile 2 with the laptop via Bluetooth, it needs to be selected by the OpenVibe Acquisition Server. Here one can choose the age and gender of the subject, as well as which physiological parameter should be measured, e.g., attention. The whole EEG power spectrum (low and high Alpha; low and high Beta; Gamma and mid Gamma; Delta and Theta) is recorded. The MindWave Mobile 2 collects data at a frequency of 512 Hz; however, both the eSense Attention Meter algorithm and the power spectrum are only calculated at a frequency of 1 Hz. Once this is done and the recording is started, the OpenVibe Designer (Figure 5) comes into play. The OpenVibe Designer is where the processing pipeline (e.g., filtering or visualisation) of the brain signals takes place. For this study, the following components were used: an acquisition client, that

Table 1
Student's attention in educational environments

Table 2
Results of paired t-test for the direct influence of positive feedback

Table 3
Results of paired t-test for the direct influence of negative feedback

Table 4
Results of paired t-tests for pre- and post-questionnaire

Table 5
Complete results of task comparison