1 Introduction
How people allocate attention is a crucial aspect of human behavior. It dictates the degree to which different information is weighted in guiding behavior. Attention is sometimes measured indirectly, by inferring it from choice data or response times (RT), but increasingly it has been measured directly using eye-tracking. Eye-tracking relies on the eye-mind hypothesis: people generally look at the information that they are thinking about (Just & Carpenter, 1984), though not always.
Eye-tracking has become an important tool in decision science, and behavioral science more generally, as it provides a detailed representation of the decision process (Mormann et al., 2020; Wedel & Pieters, 2007). It has been used to understand the accumulation of evidence in sequential sampling models of choice (Krajbich, 2019), context effects in multi-attribute choice (Noguchi & Stewart, 2014), strategic sophistication in games (Polonio et al., 2015), selfish vs. pro-social tendencies in altruistic choice (Teoh et al., 2020), truth-telling and deception behavior (Wang et al., 2010), and simplification strategies in multi-attribute and multi-alternative choice (Arieli et al., 2011; Fiedler & Glöckner, 2012; Payne et al., 1988; Reeck et al., 2017; Reutskaja et al., 2011; Russo & Dosher, 1983; Russo & Rosen, 1975; Shi et al., 2013). In addition to applications in decision research, eye-tracking is widely used in other areas of psychology such as emotion recognition (Pell & Kotz, 2011) and reading (Rayner, 2009), as well as areas outside of psychology such as advertising (Pieters & Wedel, 2004) and driving behavior (Nakayasu et al., 2011).
A challenge to the continued growth of eye-tracking research is the shift of behavioral research from brick-and-mortar labs to the internet (Reference Goodman and PaolacciGoodman & Paolacci, 2017). This shift has been accelerated dramatically during the COVID-19 pandemic. While online data collection has many advantages (e.g., speed, affordability), it has so far not been used to collect eye-tracking data in behavioral research.
However, there is reason for hope. Eye-tracking has garnered a lot of interest in the domain of human-computer interaction. For example, gaze-aware games can improve the gaming experience by providing timely effects at the gazed location (Majaranta et al., 2019). Consequently, researchers in computer science have been working to improve the algorithms that determine gaze location (e.g., WebGazer, Papoutsaki et al., 2016; smartphone eye-tracking, Valliappan et al., 2020; TurkerGaze, Xu et al., 2015).
Here, we capitalize on these recent advances to investigate the possibility of bringing eye-tracking research online. We start with WebGazer, a JavaScript toolbox that was developed to monitor people’s eye movements while on the internet (Papoutsaki et al., 2016). Until now, it has not been used in behavioral research, except in one methods article demonstrating some basic gaze properties (Semmelmann & Weigelt, 2018). In that article, the authors used an extensive calibration and validation procedure that occupied approximately 50% of the study time. That article also found that WebGazer’s temporal resolution is relatively low and inconsistent, but left unclear what caused these problems and whether they could be solved. Here, we show that these temporal aspects of WebGazer can indeed be substantially improved.
Another set of issues with online eye-tracking concerns the requirements on the user/subject’s side. In the lab, researchers control the computer and camera quality, the lighting, the subject’s positioning, etc. Online, researchers have little control over these things. Therefore, we seek to establish basic requirements and develop simple procedures for subjects to follow in order to maximize data quality. It is also important that subjects understand that they are not being recorded: the webcam images and video never leave the subject’s computer, so there are no privacy violations.
An advantage of online eye-tracking is that it lowers the bar for researchers to use eye-tracking in their own work. To further improve accessibility, we seek to ease the programming requirements for using WebGazer in behavioral experiments. To that end, we integrate WebGazer into a user-friendly, open-source psychology toolbox called jsPsych (de Leeuw, 2015). jsPsych is built on JavaScript, includes a library of commands for behavioral experiments, and also allows for integration of JavaScript-based libraries such as WebGazer. This addresses potential concerns about the difficulty of incorporating WebGazer into existing behavioral paradigms.
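To give a sense of what this integration looks like in practice, below is a configuration sketch of a jsPsych timeline with eye-tracking (this is not our exact experiment code; plugin and extension names follow the WebGazer extension distributed with recent jsPsych releases, so verify them against the jsPsych version you use):

```javascript
// Configuration sketch (illustrative, not our experiment code):
// a jsPsych timeline that initializes the webcam, calibrates and
// validates WebGazer, then runs a gaze-tracked choice trial.
const points = [[25, 25], [75, 25], [50, 50], [25, 75], [75, 75]];

const timeline = [
  { type: 'webgazer-init-camera' }, // webcam access + positioning feed
  { type: 'webgazer-calibrate', calibration_points: points },
  { type: 'webgazer-validate', validation_points: points },
  {
    type: 'html-keyboard-response',
    stimulus: '<div id="left">snack A</div><div id="right">snack B</div>',
    choices: ['arrowleft', 'arrowright'],
    // Record gaze samples during this trial, relative to the two targets.
    extensions: [{ type: 'webgazer', params: { targets: ['#left', '#right'] } }],
  },
];

jsPsych.init({
  timeline: timeline,
  extensions: [{ type: 'webgazer' }],
});
```

The gaze samples are stored alongside the trial's behavioral data, which is what allows the eye-tracking and choice data to be retrieved together in one format.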
To illustrate these issues and our solutions, we conducted a simple online value-based experiment on Amazon Mechanical Turk (MTurk). We aimed to replicate the robust links between gaze and choice that have been documented in the literature (e.g., Amasino et al., 2019; Ashby et al., 2015; Fisher, 2017; Ghaffari & Fiedler, 2018; Gluth et al., 2020; Krajbich et al., 2010; Pärnamets et al., 2015; Sepulveda et al., 2020; Sheng et al., 2020; Shimojo et al., 2003; Teoh et al., 2020). In particular, we used the same experimental paradigm as Krajbich et al. (2010), and replicated empirical findings about the role of gaze in value-based decisions. To our knowledge, we are the first to replicate an eye-tracking decision-making task online. Notably, this experiment took just a couple of days to run, in contrast to standard eye-tracking experiments, which typically take several weeks. In the supplementary material we provide a template experiment and our experimental materials.
We also note that online eye-tracking is potentially a useful tool for all online researchers, as it can be used to ensure that study subjects are humans and not computer algorithms, i.e., “bots” (Buchanan & Scofield, 2018; Buhrmester et al., 2011). We hope that this work will help facilitate the continued growth of both eye-tracking and online behavioral research.
2 Method
2.1 Subjects
125 subjects from Amazon MTurk participated in this study. Of these, 49 successfully passed the initial calibration + validation and completed the study. We required subjects to be located in the United States and have a 95% or higher HIT approval rate. In addition, we required subjects to have a laptop with a webcam.
2.2 Privacy
Given that WebGazer uses subjects’ webcams to monitor their gaze location, privacy concerns naturally arise. Therefore, it is important to note, and to highlight for subjects, that the webcam images are processed locally and never leave the subjects’ computers. What leaves their computer is the output of the WebGazer algorithm, namely horizontal (x) and vertical (y) coordinates of where WebGazer thinks the subject is looking at a given point in time. In this study, subjects saw themselves live on screen prior to the calibration procedure. This was to help them position their heads optimally. Researchers who are concerned that their subjects may be wary of privacy could disable this feature by turning off the “showVideo” option but leaving the “show face overlay” and “show face feedback box” on when they implement the calibration function. That might somewhat impede calibration, but it might reduce subjects’ apprehension as they start the experiment.
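The suggestion above can be expressed as a small settings helper. This is a minimal sketch assuming WebGazer's chainable display setters (showVideo, showFaceOverlay, and showFaceFeedbackBox are part of WebGazer's public API; the `gazer` argument stands in for the real `webgazer` global in the browser, and the wrapper function itself is our own illustration):

```javascript
// A sketch: hide the live video feed to ease privacy concerns, while
// keeping the face overlay and feedback box so subjects can still
// position themselves for calibration.
function applyPrivacySettings(gazer) {
  return gazer
    .showVideo(false)           // no live video of the subject on screen
    .showFaceOverlay(true)      // keep the face mesh for positioning
    .showFaceFeedbackBox(true); // keep the green positioning box
}
```

As noted, hiding the video may make positioning somewhat harder, so this is a trade-off between calibration quality and subject comfort.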
2.3 Experimental software/materials
The experiment was programmed in JavaScript, based on the jsPsych and WebGazer libraries. To improve WebGazer’s temporal resolution we removed some seemingly unnecessary computations that occur in each animation frame of a webpage. The original code calls the getPrediction() function at every animation frame to load the measured gaze location. This step is necessary when providing gaze-contingent feedback, but otherwise just consumes computational resources. These extra computations appear to gradually degrade WebGazer’s temporal resolution.
To deal with this, we modified the loop() function that runs on each animation frame so that it skips the getPrediction() call when possible: when face-tracking data are needed only to draw the face overlay, the CLM tracker is called separately, and likewise for the pupil features needed by the face feedback box. In addition, we used the recently added threaded ridge regression method, which further reduces computational demands.
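The idea behind this modification can be sketched as follows (this is illustrative pseudostructure, not WebGazer's actual source; all names here are our own):

```javascript
// Sketch of the per-frame gating logic: only run the full gaze
// prediction pipeline when something actually consumes it; otherwise
// run just the cheaper face-tracking step needed for on-screen aids.
function runFrame(state) {
  if (state.needsPrediction) {
    // Gaze-contingent feedback (or a registered gaze listener)
    // requires the full prediction each frame.
    state.getPrediction();
  } else if (state.showFaceOverlay || state.showFaceFeedbackBox) {
    // Only face/pupil features are needed to draw the overlay and
    // feedback box, so call the tracker directly and skip regression.
    state.trackFace();
  }
  // In the browser this would reschedule itself:
  // requestAnimationFrame(() => runFrame(state));
}
```

Skipping the regression step on most frames is what keeps the sampling interval from ballooning as the experiment runs.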
We used Heroku (a cloud platform; https://www.heroku.com) as our server-side support for the experiment.
2.4 Task
2.4.1 Recruitment and initial preparations
We asked subjects to close any unnecessary programs or applications on their computers before they began. Also, we asked them to close any browser tabs that could produce popups or alerts that would interfere with the study (see Fig. S3). Once the study began, subjects entered into full-screen mode.
Before subjects began the calibration/validation process, we provided detailed instructions about how to position themselves. We first showed them instructions from Semmelmann & Weigelt (2018); for example, subjects should sit directly facing the webcam to ensure full visibility of their face. We also added several tips learned from our pilot study. Specifically, we asked subjects to (1) look around the screen with their eyes and avoid moving their head; (2) keep lights in front of them rather than behind them so that the webcam could clearly see their face; and (3) avoid sitting with a window behind them (Fig. S2).
After reading the instructions, subjects saw a screen where they could position themselves appropriately using the live feed from their webcam. Once they were properly positioned, they could advance to the calibration and validation stage.
2.4.2 Calibration + validation
Subjects next had to pass an initial calibration + validation task (Fig. 1A). At the beginning of the calibration, a video feed appeared in the top left corner of the screen. Subjects could use this feed to adjust their position so that their face was centered in the green box in the video display. Once properly positioned, they could press the space bar to advance to the next step.
Next, subjects saw a sequence of 13 calibration dots appear on the screen, each for three seconds (Semmelmann & Weigelt, 2018). The task was simply to stare directly at each dot until it disappeared.
Next, subjects entered the validation procedure. The validation procedure was essentially identical to the calibration procedure, except for the following differences. Each validation dot lasted for two seconds, during which WebGazer made 100 measurements (one every 20ms). Measurements within the first 500ms were discarded to account for gaze transitions. Each remaining measurement was labeled a hit if it was within X pixels of the center of the dot (X increased with each failed calibration/validation attempt, see below). If at least 80% of the measurements were hits, we labeled the dot as valid and it turned green; otherwise, it turned yellow (in the validation instructions, we told subjects to try to make every dot turn green). If the proportion of valid dots (out of 13) was at least Y, the experiment proceeded.
Subjects had three chances to pass this initial calibration + validation task. With each new attempt, we raised the pixel threshold (X) for a hit and lowered the valid-dot threshold (Y). In particular, the pixel thresholds (X) were: 130px, 165px, and 200px; the valid-dot thresholds were: 80%, 70%, and 60%. If a subject failed the calibration + validation three times, we compensated them with 50 cents and ended the experiment.
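The validation rules above can be summarized in code. This is an illustrative sketch of the described logic, not the experiment's actual implementation (function and field names are our own):

```javascript
// Thresholds loosen with each failed attempt, as described above.
const ATTEMPT_THRESHOLDS = [
  { pixels: 130, validDotProportion: 0.80 }, // attempt 1
  { pixels: 165, validDotProportion: 0.70 }, // attempt 2
  { pixels: 200, validDotProportion: 0.60 }, // attempt 3
];

// A measurement is a hit if it falls within the pixel threshold
// of the dot's center (Euclidean distance).
function isHit(gaze, dot, pixelThreshold) {
  return Math.hypot(gaze.x - dot.x, gaze.y - dot.y) <= pixelThreshold;
}

// A dot is valid if at least 80% of its measurements (after
// discarding the first 500ms for the gaze transition) are hits.
function isDotValid(measurements, dot, pixelThreshold, hitProportion = 0.8) {
  const settled = measurements.filter((m) => m.t >= 500);
  const hits = settled.filter((m) => isHit(m, dot, pixelThreshold)).length;
  return hits / settled.length >= hitProportion;
}

// The attempt passes if enough of the 13 dots were valid.
function passesValidation(validDots, totalDots, attempt) {
  return validDots / totalDots >= ATTEMPT_THRESHOLDS[attempt].validDotProportion;
}
```

For example, on the first attempt 11 valid dots out of 13 (about 85%) passes the 80% threshold, while 10 out of 13 (about 77%) does not.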
We adopted this procedure to give poorly calibrated subjects a chance to reposition themselves and try again, while also acknowledging that some subjects might not be able to sufficiently improve their setup to pass the most stringent requirements. This also allowed us to assess if initial calibration attempt(s) predicted any of the later results (see Supplementary Note 1).
2.4.3 Hypothetical food choice task
After passing the initial calibration and validation, subjects proceeded to the choice task (Fig. 1B). This paradigm was initially used in Krajbich et al. (2010) to study how gaze influences value-based decisions. Subjects first rated their desire for 70 snack food items on a discrete scale from 0 to 10. Subjects were told that 0 means indifference towards the snack, while 10 indicates extreme liking of the snack. They could also click a “dislike” button if they didn’t like a food item. Subjects used the mouse to click on the rating scale.
After the rating task, subjects were recalibrated and validated. They were eye-tracked for the remainder of the study.
Next, subjects began the binary choice task. We randomly generated 100 trials using pairs of the rated items, excluding the disliked items. Subjects were told to choose their preferred food in each trial, selecting the left option with the left arrow key and the right option with the right arrow key.
Between trials, subjects were either presented with a fixation cross at the center of the screen or, every ten trials, with a sequence of three red validation dots. In the latter case, the first two validation dots appeared randomly at one of 12 possible positions, while the last dot always appeared at the center of the screen. For each of these validation dots, the pixel threshold was 130px, the hit-proportion threshold was 70%, and the presentation time was two seconds. A recalibration was triggered if subjects failed more than four validation dots across two successive intertrial validations.
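The recalibration trigger can be written as a small check. This is an illustrative sketch of the rule as described (each intertrial validation contributes a failed-dot count from 0 to 3; the function name is our own):

```javascript
// Recalibration fires when more than four dots fail across any two
// successive intertrial validations (i.e., 5 or 6 of the 6 dots).
function needsRecalibration(failedDotCounts) {
  for (let i = 1; i < failedDotCounts.length; i++) {
    if (failedDotCounts[i - 1] + failedDotCounts[i] > 4) return true;
  }
  return false;
}
```

So failing all three dots twice in a row (3 + 3) or all three then two (3 + 2) triggers recalibration, whereas 2 + 2 does not.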
After 50 trials, subjects were given the option to take a short break. After the break, they were recalibrated and validated.
2.5 Data cleaning
Out of 49 subjects, 48 subjects’ data were fully received. One subject’s data were only partially received, with 32 choice trials.
To ensure good data quality for the analyses linking gaze to behavior, we checked the intertrial validation pass rate and excluded subjects who failed too many validations. As mentioned above, each validation dot used a pixel threshold of 130px and a hit-proportion threshold of 70%. A subject’s pass rate was their fraction of valid intertrial dots. The mean pass rate was 0.6 (SD = 0.26). There were 35 subjects with pass rates higher than 0.45 (M = 0.73, SD = 0.16), eight subjects with pass rates between 0.3 and 0.4 (M = 0.36, SD = 0.04), and six subjects with pass rates below 0.2 (M = 0.15, SD = 0.08). For the subjects with pass rates between 0.3 and 0.4, we identified the longest intervals that did not include two consecutive complete validation failures (six consecutive missed dots). If those intervals contained at least 20 behavioral trials, we included those trials in the analysis (see Fig. S5). In particular, we included 50, 40, and 20 trials from three additional subjects. Our initial analysis plan would have included only the 35 subjects with pass rates higher than 0.45, but to better match the sample size of the 2010 study, we decided to also include these three subjects. Thus, 38 subjects were included in total.
We also excluded individual trials based on RT and dwell times. We removed trials with RTs shorter than 0.4s or longer than 10s, as well as trials with potentially problematic fixation data, namely trials in which (1) the gaze measurements were always at the center of the screen, or (2) the sampling interval was longer than 200ms (10 times longer than expected). After these exclusions, the mean number of trials per subject was 80 (SD = 27).
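These trial-level exclusions amount to a simple filter. The sketch below is illustrative (field names such as `rt`, `gaze`, `t` are our own, not the experiment's actual data format):

```javascript
// Keep a trial only if its RT is in range, its gaze is not stuck at
// the screen center, and no gap between samples exceeds 200ms.
function keepTrial(trial, center) {
  if (trial.rt < 0.4 || trial.rt > 10) return false; // RT in seconds
  const stuckAtCenter = trial.gaze.every(
    (g) => g.x === center.x && g.y === center.y
  );
  if (stuckAtCenter) return false; // degenerate gaze stream
  for (let i = 1; i < trial.gaze.length; i++) {
    if (trial.gaze[i].t - trial.gaze[i - 1].t > 200) return false; // dropout
  }
  return true;
}
```

Applied as `trials.filter((tr) => keepTrial(tr, screenCenter))`, this removes the trials with unreliable timing or degenerate gaze data before the gaze-choice analyses.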
2.6 Stimuli and AOI definition
Each food image was 450px by 320px. We defined AOIs in terms of percentages of the screen size. Gaze within 25 to 75 percent of the screen height and 5 to 45 percent of the screen width was assigned to the left AOI, while gaze within 25 to 75 percent of the screen height and 55 to 95 percent of the screen width was assigned to the right AOI. These AOI definitions were chosen before analyzing the data.
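The AOI assignment can be expressed as a single classification function (an illustrative sketch; gaze coordinates in pixels, boundaries as fractions of the screen):

```javascript
// Map a gaze sample to the left AOI, the right AOI, or neither,
// using the screen-percentage boundaries described above.
function classifyAOI(x, y, screenWidth, screenHeight) {
  const fx = x / screenWidth;
  const fy = y / screenHeight;
  if (fy < 0.25 || fy > 0.75) return null;      // outside vertical band
  if (fx >= 0.05 && fx <= 0.45) return 'left';  // left AOI
  if (fx >= 0.55 && fx <= 0.95) return 'right'; // right AOI
  return null;                                  // central gap / margins
}
```

Note the 45–55% horizontal gap between the AOIs, which keeps ambiguous central gaze out of both dwell-time totals.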
As a robustness check, we also tried defining AOIs in pixels, adding 90px horizontal buffers and 54px vertical buffers to the edges of the images. There were no qualitative differences using this alternative AOI definition.
2.7 Computer resolution/browser usage
Subjects’ screen widths ranged from 1280px to 2560px and screen heights ranged from 719px to 1440px. Out of 49 subjects who passed the initial calibration, 45 of them used Chrome (33 used version 85; 10 used version 84; 1 used version 77; 1 used version 75), and 4 of them used Firefox (version 80).
3 Results
3.1 Basic setup and data quality
To begin, it is worth briefly describing a standard eye-tracking procedure in the brick-and-mortar lab. Typically, the eye-tracking camera is situated either below or above the computer screen, between the screen and the subject (Schulte-Mecklenbeck et al., 2019). The subject is seated, often with their head immobilized in a chinrest (though not always). Subjects are instructed to try to keep their heads still during the experiment. Before the experiment begins, subjects go through a calibration procedure in which they stare at a sequence of dots that appear at different locations on the screen (Fig. 1A). A subsequent validation procedure has the subject look at another sequence of dots, to establish how well the eye-tracker’s estimate of the gaze location aligns with where the subject is supposed to be looking (i.e., the dots). During the experiment, validation can be repeated (to varying degrees) to ensure that the eye-tracker is still accurate.
With WebGazer we used a similar procedure, with some qualifications. First, before signing up for the experiment, we required subjects to be using a laptop with a webcam, and to be using an appropriate web browser (see Methods). We also asked them to close any applications that might produce popups. We had no control over the subject’s environment and we could not immobilize their head, but we did provide them with a number of suggestions for how to optimize performance, including keeping their heads still, avoiding sitting near windows, keeping light sources above or in front of them rather than behind them, etc. (see Methods). Subjects had three chances to pass the calibration and validation procedure, otherwise the experiment was terminated, and they received a minimal “showup” fee (see Methods).
During the experiment, we incorporated a small number of validation points into the inter-trial intervals, rather than periodically having a full procedure with many validation points. This step allowed us to evaluate data quality over time; in future experiments this step could be skipped or replaced with ongoing calibration points. We did recalibrate halfway through the choice task. The time interval between the calibration at the beginning of the choice task and the second calibration was 5.39 minutes on average (SD = 2.66 mins).
Prior work has documented the spatial resolution of WebGazer (Semmelmann & Weigelt, 2018). That work established that, shortly after calibration and validation, online precision is comparable to, though slightly worse than, in-lab precision (online: 18% of screen size, 207px offset; in-lab: 15% of screen size, 172px offset). However, an unresolved issue is whether that spatial resolution persists over time.
To assess spatial resolution over time, we examined the hit ratio for validation dots as the experiment went on. For each measurement, we calculated the Euclidean distance (in pixels) between the recorded gaze location and the center of the validation dot. If this distance was below a critical threshold (see Methods), we labeled the measurement a hit, otherwise we labeled it a miss. The hit ratio is simply the proportion of hits out of all the validation measurements (see Methods). Aside from an initial drop shortly after each calibration/validation, the hit ratio remained quite steady over time (Fig. 2A; mean hit ratio as a function of trial number: β = −0.00048, se(β) = 0.00021, p = 0.028). Table S3 shows the mean/median hit ratios for every intertrial validation.
A second, potentially more serious issue is temporal resolution over time. Eye-tracking setups often come with dedicated computer hardware due to the required computations. With online eye-tracking, there is no second computer and we have little control over subjects’ hardware. If the computations overwhelm the subjects’ hardware, the temporal resolution may suffer dramatically.
To assess temporal resolution over time, we examined the average time interval between gaze estimates made by WebGazer as the experiment went on. As we feared, an earlier pilot experiment revealed that the time interval between estimates increases dramatically over time, from 95ms (SD = 13ms) in the first ten trials, to 680 ms (SD = 64ms) by the halfway point (13.20 min (SD = 3.55 min)). This decreased back to 99ms (SD = 12ms) after recalibration but then increased to 972ms (SD = 107ms) by the end of the experiment. This kind of time resolution is unacceptable for most behavioral work.
However, with some modifications to the WebGazer code (see Methods) we were able to reduce computational demands. As a result, the time interval between estimates in our main experiment remained steady at 24.85ms on average (SD = 12.08ms) throughout the experiment (Fig. 2B). This time resolution is comparable to many in-lab eye-trackers currently on the market and in scientific use (Reference Carter and LukeCarter & Luke, 2020).
To further quantify spatial resolution, we also examined the initial validation data from another WebGazer study using the same calibration and validation procedure (N = 83, details reported elsewhere). Here, we summarize the sample mean and standard deviation of the offset for each validation dot (Table 1). We found offsets in the range of 181.20–263.70px. We also calculated a confusion matrix to examine how often WebGazer estimated the incorrect validation dot (Fig. 3). These results indicate that the spatial precision is mostly consistent across the validation dots, with some exceptions at the corners of the screen (as is also common in the lab). In particular, the validation dots at the corners of the screen had significantly larger offsets than the other dots (mixed-effects regression of offsets on validation dot position, corner vs. not corner: β = 40.51, p = 0.014).
3.2 Analysis of the dataset
To verify the quality of online eye-tracking, we sought to replicate the robust links between gaze and choice that have been documented in the literature (e.g., Krajbich et al., 2010; Krajbich & Rangel, 2011; Shimojo et al., 2003). We used Krajbich et al. (2010)’s binary choice experiment as a basis for comparison (Fig. 1B). That experiment was originally run with an eye-tracker with a comparable temporal resolution (one sample every 20ms). In that version, subjects first rated 70 snack foods, then in 100 trials decided which of two snack foods they would prefer to eat. Our online version of that experiment was identical except for the particular stimuli, the number of trials, and the fact that the decisions were hypothetical.
In the original experiment, accuracy rates for rating differences of {1, 2, 3, 4, 5} were {0.65, 0.76, 0.84, 0.91, 0.94}; in the MTurk study they were {0.65, 0.79, 0.87, 0.90, 0.92}. Thus, despite being hypothetical, decisions in the MTurk study were very similar in quality. Response times (RT) in the original study declined with absolute value difference from 2.55s to 1.71s. Similarly, RTs in the MTurk study declined from 1.42s to 1.17s, though they were significantly shorter than in the original study, as indicated by a mixed-effects regression of log(RT) on absolute value difference and a dummy variable for the online study (β = −0.91, se(β) = 0.03, two-sided p = 10⁻¹⁶). While MTurk respondents were considerably faster in their decisions, they still exhibited the expected relationship between difficulty and RTs (mixed-effects regression of log(RT) on absolute value difference: β = −0.026, se(β) = 0.004, two-sided p = 10⁻⁹). Other behavioral analyses can be found in Supplementary Note 2.
Next, we turn to the eye-tracking data. Key relationships that we sought to replicate here include: 1) correlations between dwell times and choice: subjects will be biased towards choosing the option they have looked at more; 2) the effects of individual dwell: the duration of the first dwell will be positively correlated with choosing the first-seen item; 3) last fixation bias: subjects will be more likely to choose the last-seen option.
The first analysis models the choice (left vs. right) as a function of rating difference (left−right) and total dwell time difference (left−right) over the course of the trial, using a mixed-effects logistic regression. We found a strong significant effect of relative dwell time (β = 0.57, se(β) = 0.14, two-sided p = 10⁻⁵), even after accounting for item ratings (Fig. 4A). This result is highly consistent with the original study.
We also examined heterogeneity in this relationship using individual-level logistic regressions. Twenty-six (68%) subjects exhibited positive coefficients (12 significant at two-sided p < 0.1). This is comparable to, though somewhat less consistent than, the original in-lab dataset (Fig. 5).
The second analysis examines the effect of individual dwells. Here we model the choice (first-seen vs. other) as a function of the rating difference (first−other) and the duration of the first dwell, again with a mixed-effects logistic regression. We again find a significant effect of the initial dwell time (β = 0.43, se(β) = 0.22, two-sided p = 0.04), even after accounting for the item ratings (Fig. 4B). Again, this result aligns well with the original study.
The third analysis examines the effect of the final fixation location. Here we model the choice (last seen vs. other) as a function of the rating difference (last seen−other), again with a mixed-effects logistic regression. We find a strong significant intercept term (β = 0.24, se(β) = 0.06, two-sided p = 10⁻⁵), indicating a bias to choose the last-seen item (Fig. 4C). However, this last-fixation effect is smaller in this dataset compared to the original dataset.
One noticeable difference between this dataset and the original in-lab results (Krajbich et al., 2010) is the duration of the average dwell (lab: 576ms, SD = 380ms; MTurk: 380ms, SD = 291ms). However, this may simply reflect the considerably shorter RTs in this experiment. The average dwell time, as a fraction of RT, was comparable between the lab (M = 0.25, SD = 0.15) and MTurk (M = 0.29, SD = 0.19) experiments.
4 General discussion
We have presented an attempt at online eye-tracking in behavioral research. Online data collection is increasingly common, especially during the COVID-19 pandemic, and this shift should not be a barrier to studying visual attention.
Although there are some options available for online eye-tracking, none have been adopted by behavioral researchers. Some software (e.g., TurkerGaze) requires extensive programming knowledge. Other software such as Realeye (https://www.realeye.io) is not open source and can be very expensive to use. In general, when trying to build an online eye-tracking experiment there are several features to consider: 1) The flexibility of stimulus presentation (is it possible to adjust the paradigm/software for different experiments?) 2) The difficulty of the experimental programming (does the implementation of the paradigm/software require extra expertise?) 3) The retrieval of the eye-tracking data (can the data be retrieved and stored in a useable format?) 4) The accessibility of the resources (is the software/paradigm open-source?).
Our toolbox performs well on all four dimensions: it provides total flexibility, is integrated into the user-friendly jsPsych library, stores the eye-tracking data together with the other behavioral measures, and is open-source.
An important issue that we addressed in this study is the amount of calibration and validation required to run a successful experiment. In prior work, calibration and validation have taken up to 50% of the experiment time (Semmelmann & Weigelt, 2018). However, with our modifications, we found that it is possible to get by with less, as there appears to be little to no degradation in spatial or temporal precision over time, at least on the time scale of our experiment. In our study, subjects spent on average 40% of their time in calibration or validation, but we likely could have gone lower. Moreover, most subjects passed the initial calibration on their first attempt, minimizing the time spent on calibration and validation (see Supplementary Note 1). Going forward, we would suggest a single calibration + validation phase at the beginning of the study (to screen out unusable subjects). Occasional inter-trial validation dots may also be useful as a measure of data quality, or alternatively inter-trial calibration dots may be useful to improve data quality. Of course, the amount of calibration should depend on the spatial precision required: if there are more areas of interest (AOIs), more calibration may be necessary.
Along those lines, one unresolved issue is how many distinct AOIs can be effectively used online. Here we used a simple design with two AOIs. Based on WebGazer’s spatial precision, we estimate that one could use four to six AOIs without any degradation in data quality (Table 1). The average distance between the true and measured gaze locations is ∼200 pixels (or ∼20% of the screen size), which means that with more AOIs, gaze in one AOI might start to register in another AOI. This is certainly worse than what one would get for a typical subject in the lab, but we believe it can still be useful for many applications. Presumably, better data analysis methods could be used to filter out spurious observations, if one needed more AOIs.
Another issue is how far the time resolution can be pushed. Here we went with 50 Hz, which seemed to work well. Most common webcams have a sampling rate of around 50Hz (e.g., Logitech C922 camera with 60Hz sampling rate) and so that is likely the limit on temporal resolution. For studies requiring better temporal resolution, in-lab eye tracking is still likely necessary.
Notably, visual angle, one common measure reported in eye-tracking studies, is not available with the current toolbox. However, WebGazer does detect users’ faces using the clmtrackr library (a face-fitting library; Mathias et al., 2014) and then extracts the eye features (Robal et al., 2018). It should therefore be possible to estimate a subject’s distance to the screen and, from that, the visual angle. Future research should attempt to address this issue.
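If a distance estimate were available, converting on-screen size to visual angle would follow the standard formula. In the sketch below, the pixels-per-centimeter value and the viewing distance are assumed inputs that would themselves need to be estimated (e.g., from the tracked face size); neither is provided by WebGazer today.

```python
import math

def visual_angle_deg(stimulus_px, px_per_cm, viewing_distance_cm):
    """Standard visual-angle formula: theta = 2 * atan(size / (2 * distance))."""
    size_cm = stimulus_px / px_per_cm  # on-screen stimulus size in centimeters
    theta_rad = 2 * math.atan(size_cm / (2 * viewing_distance_cm))
    return math.degrees(theta_rad)

# e.g., a 100 px stimulus at ~38 px/cm, viewed from 60 cm
print(round(visual_angle_deg(100, 38.0, 60.0), 2))  # → 2.51
```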
We validated the toolbox by replicating the in-lab study of Krajbich et al. (2010). We replicated important links between gaze and choice. However, it is important to note that the extent to which the online data aligned with the original data varied across hypotheses.
Though we found the last-fixation effect (i.e., subjects were more likely to choose the last-seen option), the difference in the size of the effect relative to the original data (and subsequent replications; see Smith & Krajbich, 2018, 2021) is substantial and does warrant future investigation. However, we doubt that the eye-tracking technology is responsible for this difference. Notably, our MTurk subjects made their decisions much faster (1.3 s) than subjects in the lab (2.2 s). Additionally, in a follow-up study (Table 1), we investigated attentional effects in a slightly different domain (political choice) and found a much higher rate of looking at the chosen option last (∼70–75%, very much in line with prior in-lab results). Therefore, we suspect that the difference in the last-fixation effect that we observed is a product of the subject population rather than the toolbox. Going forward, it will be important to use this toolbox to compare eye-tracking results across subject populations.
Previous research has documented the advantages and disadvantages of conducting behavioral research online (e.g., Mason & Suri, 2012). We would like to highlight several benefits of online eye-tracking compared to in-lab eye-tracking. First, tasks on MTurk allow many subjects to participate in the study simultaneously. In contrast, in-lab eye-tracking studies are typically one-on-one sessions, with one subject and one experimenter in the laboratory (but see Hausfeld et al., 2021), making data collection in the lab time- and labor-intensive. We completed data collection in three days, while it would take weeks to collect the same amount of data in the lab. Second, the low cost of online eye-tracking is another distinct advantage: it requires no special hardware on the experimenter’s side, and the software involved is all free and open access.
On the other hand, there are some limitations to the online approach (e.g., Ahler et al., 2019). One issue is the number of subject exclusions. In a typical lab study, only a small number of subjects are excluded; for example, our in-lab comparison study (Krajbich et al., 2010) excluded only 1 subject out of 40. In the online study, by contrast, we excluded over half of the subjects. However, this comparison is somewhat misleading. Online, most exclusions occurred before the experiment even began; subjects could not begin the experiment until they passed the hardware checks and then the calibration/validation. In the lab, subjects who cannot be calibrated, or who simply fail to show up to their scheduled session, would normally not be counted as “exclusions”; they would simply not be mentioned. So, while we might be concerned about potential selection effects, in that we are only studying people who are less concerned about privacy, have good laptops, are able to position themselves properly, follow directions, and have eyes that are easily detected by the algorithm, there are similar concerns in lab experiments, where we study only college students who are motivated enough to sign up for a study, show up to their session, and follow directions. While selection biases are obviously not ideal, the biases here are probably no different from those of a typical MTurk study, and the sample is certainly more representative than that of a typical study with university students (Smart, 1966).
Additionally, online studies in general suffer from higher rates of attrition. Researchers have found that up to 25% of MTurk respondents are suspicious or fraudulent, e.g., bots (Ahler et al., 2019). Given that we cannot observe our subjects or control their environment and hardware (aside from requiring a laptop with a webcam), it is not surprising that we had many exclusions. We would argue that what matters is the final number of subjects, rather than the fraction of recruited subjects retained.
On a related point, one common issue with online studies is ensuring that subjects are humans and not computer “bots”. Researchers have developed ways to filter out bot data after the fact (Dupuis et al., 2019; Permut et al., 2019) or to use extra items to screen out bots during the study (Buchanan & Scofield, 2018). The problem with the former approach is that it requires assumptions about how these bots will respond; savvy MTurk users might be able to program bots that violate those assumptions. The latter approach is more similar to ours, but it typically requires subjects to exert extra effort that is irrelevant to the task, and these extra measures may also be defeated by savvy programmers. WebGazer provides a simple way to ensure that subjects are human beings, without any additional questions or statistical tests. While it is surely not impenetrable, faking eye-tracking data would be no small feat.
In summary, we see a lot of promise for online eye-tracking, even beyond the COVID pandemic. While it is by no means perfect, it provides a fast, accessible, and potentially more representative way to study visual attention in behavioral research. We look forward to seeing the ways in which researchers take advantage of this opportunity.