‘AI beats radiologists’ and ‘Algorithms outperform doctors’ are examples of the excited headlines that promise that artificial intelligence is drastically changing healthcare. Today, when advocates of deep learning aim to illustrate the powers of AI for image and pattern recognition in big data sets, advances in medical imaging are often a key example. These developments have also turned the field of radiology into a central site to study configurations of humans and machines in the development of new norms and forms of automation and artificial intelligence. As I will show, since the 1950s radiologists have been at the forefront of research into technologies and computational techniques projected to fundamentally alter the work of trained, highly skilled medical professionals. Such promises of radical transformation through technological innovations are long-standing, but also equivocal. Evidence from experimental, laboratory settings shows that AI-supported image recognition in X-rays may help to detect a suspicious area in a scan more quickly and can classify abnormalities with more precision than experts.Footnote 1 However, as yet there is no conclusive proof of the efficacy of these tools in clinical practice.Footnote 2 The extent to which radiologists will actually benefit from integrating new image recognition software into their everyday work routines remains unclear. Contemporary studies of algorithms trained on databases of chest X-ray images found that such models may exacerbate existing gender and racial biases and lead to more disparities in care.Footnote 3 Moreover, recent attempts to use deep-learning algorithms to detect COVID-19 in chest X-rays have failed to deliver impactful results.Footnote 4 According to some authors this technological overpromise has caused a ‘credibility crisis’ for machine learning in medicine.Footnote 5
Despite these uncertainties about future benefits, both popular news reports and professional discussions about AI and medical imaging abound with bombastic metaphors expressing the extraordinary promise of big-data analytics, machine learning and deep learning to improve healthcare. The notion of ‘augmented medicine’, for example, envisions data technologies as extensions of human medical professionals to improve clinical practice.Footnote 6 ‘Deep medicine’ conjures the dream of a ‘total archive’ of health data to detect valuable patterns: correlations beyond the capacities of human perception and cognition.Footnote 7 Artificial intelligence is said to give rise to the ‘robot radiologist’, a figure that represents (anxieties about) fully automated medical image recognition, replacing human radiologists in the foreseeable future.Footnote 8 The trope of the ‘centaur radiologist’ conjures a synergistic image of human plus computer harmoniously combining human skilfulness with the newest AI technologies.Footnote 9 Collectively, these currently pervasive imaginaries bind together intersecting promises. In radiology, AI is thought to be able to make inferences about medical data that go beyond human interpretations, speed up routine tasks and free up time for meaningful patient contact, relieve radiologists of tedious and repetitive work, help radiologists cope with a data deluge of records and medical images, and make care more affordable by being more cost-effective. Perhaps most importantly, however, artificial intelligence is envisioned as a technological aid to radiologists, who need this assistance because their capacities for perceiving and interpreting images are imperfect – human professionals inevitably make mistakes. AI's central promise in medicine is to reduce, or even eliminate, human error.
This article traces the historical emergence of the idea of the fallible radiologist, a conception shaped by attention to new techniques and technologies that also prepared the ground for promises of automation and artificial intelligence in medicine. My genealogical account foregrounds this oscillation or interplay between ideas about human capacities and the potential of technological aids: histories of artificial intelligence, I emphasize, are also histories of imaginaries of human (in)competences. New conceptions of flawed radiologists created space for solutions by new computational techniques and technologies. Focusing on claims about improving the reading of X-rays, I analyse professional discourses that were initiated in the US but also occurred in dozens of other countries between the mid-1940s and the early 1960s. My genealogy of the flawed expert radiologist begins with an analysis of reports on ‘observer variability’ or ‘observer error’ in mass screening campaigns for tuberculosis and pneumoconiosis (coalminer's ‘black lung disease’). These reports alarmed researchers by revealing high levels of disagreement and inconsistency in the assessments of X-ray images, and also stimulated a search for solutions – often, but not always, mixing technological aids with human expertise. I show first how these debates led to the articulation of the radiologist as a fallible observer and then how, in the 1950s and 1960s, statistical and logical analyses of diagnosis, and the design of many technological ‘aids’ and techniques to assist diagnosis, reframed the fallible observer as a suboptimal decision maker. The final section considers one particular suggested solution. Numerous reports published in the 1950s provided evidence that a ‘double reading’ or ‘dual reading’ of X-rays by the same or another radiologist could diminish the number of overlooked anomalies. To understand why this human solution to the ‘human factor’ of error in radiology was hardly implemented, while promises of technological aid persisted, I will argue that we need to consider social and professional status in a field of changing labour and contested expertise.
‘The hard fact of their own unreliability’: radiologists discover observer variability
‘Why not X-ray before marriage?’ an American health officer wondered in a medical journal in 1950, considering premarital chest X-ray films as a routine procedure to guard against the spread of tuberculosis.Footnote 10 With cheaper, mobile diagnostic imaging facilities, X-raying had become ubiquitous. By the mid-1940s, mass miniature radiography services that produced small-sized photofluorograms had emerged around the world to serve mass health surveys (predominantly to catch cases of tuberculosis), including pre-employment screenings (for medical personnel, food handlers and schoolteachers, for example) and population-wide medical examinations.Footnote 11 Public institutions were amassing immense numbers of images of citizens’ chests. The Veterans Administration, for example, kept a minimum of two on file for each US soldier – one on entry and one on leaving. Giant databases of images and medical records were growing at impressive speed.
With an eye to the rapidly rising number of X-ray images taken during the Second World War, the Veterans Administration sought to determine which type of X-ray technology was diagnostically the most efficient. This was particularly important because diagnostic decisions were usually made on the basis of a single miniature X-ray. While patients in clinics received multiple tests, in mass screening programmes diagnosis depended on very limited observations in a population of subjects, many of whom were symptom-free. Did a routine small-sized photofluorogram, a stereo-photofluorogram, a roentgenogram negative on paper or a celluloid film perform best?Footnote 12 Diagnostic efficiency was defined by the lowest rate of under-reading (missing X-rays with evidence of tuberculosis) together with the lowest rate of over-reading (finding false positives).Footnote 13 A team of radiologists, lung experts and a ‘biostatistician’ (a specialty that rose to prominence in the 1930s), Jacob Yerushalmy, set out to compare the various machines and procedures. The answers to their questions were published in the Journal of the American Medical Association in 1947 and left the scientific community bewildered.
Not one X-ray method, not even the relatively expensive celluloid X-ray image routinely used in hospital clinics, appeared to perform better than the others in finding tuberculosis cases. A far more significant result emerged: when hundreds of image evaluations made by five readers were compared, these experts appeared to have a very high degree of disagreement and inconsistency. In 1949, a follow-up report authored by the radiologist Henry Garland from the University of California, San Francisco, demonstrated beyond doubt that ‘reading the shadows’ was a fickle affair.Footnote 14 Together, the reports presented proof of inter-individual and intra-individual variability – in about 30 and 20 percent of the cases respectively, readers differed from other readers or from their own previous evaluations. Researchers immediately realized the dangerous implications: ‘every day many persons throughout the country are being informed that their chests are free from disease when, in point of fact, they probably are not (and vice versa). This results in false security on the one hand and needless alarm on the other’.Footnote 15
While there had been ‘a tendency to assume that roentgenology is an exact science and that the objectivity of the medium defied error’, the Journal of the American Medical Association editors commented that these ‘astonishing’ reports pointed to a remarkable degree of inaccuracy among trained X-ray observers.Footnote 16 These sensational findings on ‘observer error’ or ‘observer variability’, as researchers started calling these discrepancies, spurred a flurry of similar X-ray error investigations in a number of countries, including immediate replication studies in Denmark and the Netherlands.Footnote 17 Through the reports by Garland and others, it now appeared that radiologists had ‘blind spots’. They had affinities for detecting particular types of lesions, needed (individually varying) eye-resting periods to be able to discern shadows, and could not agree whether a lesion should be classified as ‘soft’ or ‘hard’. It even appeared that the ‘attitude’ of the observer (their ‘optimistic’ or ‘pessimistic’ outlook) might influence the interpretation. Many reports found similar percentages of variability, results that were of a ‘disturbing magnitude’, ‘extremely disappointing’, ‘disheartening’ and ‘shocking’ to researchers when they realized the ubiquity of what could now simply be called the ‘error problem’.Footnote 18 It also appeared that there was no easy solution: when a group of highly experienced radiologists read and reread a set of survey films, they kept on arriving at the same degree of variability – a ‘baffling’ result.Footnote 19
Healthcare professionals undertaking occupational radiological surveys were especially keen to assess reliability in reading thousands of images. A 1949 study in the British Journal of Industrial Medicine revealed serious incongruities in the examination of survey images of coal miners’ lungs in south Wales.Footnote 20 Readers could not agree which images should be regarded as normal, and which showed ‘certifiable’ pneumoconiosis (also known as black lung disease), which would merit compensation under the Workmen's Compensation Acts. Similarly, researchers assessing coal workers in France, Belgium, the Netherlands and the UK found many reader divergences and proposed a joint meeting to start an international system of standardized classification of chest X-rays.Footnote 21 In 1958, the International Congress on Medical Radiophotography included a separate section on observation errors, including presentations from Brazil, Finland, Romania and Poland on ‘the human factor’.Footnote 22
Through the 1950s, radiology became well known for its study of observer error. This was not because errors of inconsistency and disagreement were unique to X-ray interpretation, but because, as radiology researchers emphasized on various occasions, radiology lent itself more readily to quantitative evaluation of the degree of error, and because more precise data were available.Footnote 23 X-ray images provided suitably stable records – as one researcher remarked, not as ‘flexible’ as records of patient history, not as ‘evanescent’ as actual examination of the body – that could be subjected to multiple readings.Footnote 24 While the results were disconcerting, the project of quantifying error and proposing standards also afforded the discipline a certain objectivity. Researchers commended Garland for ‘trying to lay a scientific foundation under roentgen diagnosis’.Footnote 25 Nevertheless, radiologists were often incredulous at the statistical evidence of their mistakes. Garland noted, ‘One has to test oneself on a study of this kind to become fully aware of his own fallibility in this regard’.Footnote 26 As another researcher put it, ‘only those who have themselves made duplicate readings of a series of films can come to appreciate the hard fact of their own unreliability’.Footnote 27
This study of ‘inherent error’ in radiological observations spurred new scrutiny of the enduring problem of the ‘personal equation’ in scientific observation, or the ‘human factor’, as it was now more commonly called, pointing to a broader and long-standing epistemic problem of objectivity in medical science.Footnote 28 Inconsistencies in diagnostic observations were not limited to radiology but were made acutely visible through this particular disciplinary lens. Looking back at a decade of observer error research in 1959, Garland sketched a longer lineage of studies on error from the 1930s, which showed that medical professionals made mistakes in diagnosing emphysema, for example, or in the level of malnutrition in children. They erred in interpreting electrocardiograms and histologic readings, in recording patients’ medical histories and in counting red blood cells.Footnote 29 When more than fifty clinical laboratories were asked to test the same standard solutions they came back with different results.Footnote 30 A decade of recording errors by radiologists reframed these accounts from the past two decades as part of a common and pressing problem of observer variability, for which definite solutions had yet to emerge.
Variabilities in scientific observation had previously drawn the attention of philosophers and sociologists of science, notably Ludwik Fleck and Michael Polanyi. In 1935, Fleck, who was trained in microbiology, famously noted variabilities and uncertainties in scientific observation (puzzling views through the microscope, for example) to argue that individual observers were conditioned by a socially mediated thought style.Footnote 31 What could be observed depended upon observers’ membership of particular collectives of researchers – ‘thought collectives’, or cultural constellations. Polanyi in turn emphasized the role of tacit knowledge as an integral part of scientific knowledge formation, a learned understanding based on intuitive apprehensions that could not easily be articulated or formalized. The reading of chest X-rays aptly illustrated this latent dimension for Polanyi, who gained experience in evaluating such images as a medical officer during the First World War. Trained observers, like himself, could not but ‘make sense’ of such pictures, he argued; reading had become a form of ‘personal knowledge’.Footnote 32 For Fleck and Polanyi, variability in observations could be understood by attending to social and individual learning processes that influenced processes of perception integral to scientific knowledge.Footnote 33 Yet the framework of ‘observer error’ emphasized in the 1950s foregrounded a subtly different epistemic attitude to scientific observation, focused on increasing accuracy by protocolizing, standardizing and formalizing X-ray reading. Radiologists framed observer variability as a problem – to which researchers proposed new technological and human solutions in the 1950s.
Reducing and taming errors: the imperfect radiologist as intuitive statistician
How were medical professionals to mitigate these inevitable errors in observation? One approach to the ‘error problem’ in medical practice that several mentioned was to start cultivating a greater attentiveness to the issue, for example by teaching courses on the ‘factors affecting our judgement’ to radiology students.Footnote 34 Yet from the mid-1950s onwards, other solutions started to come to the fore. The problem of observer error changed shape with and through developments in statistical theory, operations research, cognitive psychology and computer research, across a number of research sites and communities. Two Americans played a key role: Lee Lusted, a radiologist and radar specialist, and Robert Ledley, an engineer (specializing in dental prosthetics) and computer expert. Their work transformed the fallible trained observer, especially the radiologist, into a suboptimal medical decision maker who could be assisted by technology.
In the early 1950s, Lusted witnessed the ‘error problem’ at first hand. As a radiologist in training at the University of California in San Francisco, he was one of many volunteer X-ray readers in one of Garland and Yerushalmy's early 1950s observer variability investigations.Footnote 35 As he would later recount, these studies prompted him to investigate the possibility of technological solutions to improve interpretive accuracy in medical diagnosis. Lusted sought to bring to medicine his interest in computing and his wartime engineering expertise with radar communication technologies. Historian Joseph November's incisive historical analysis of biomedical computing recounts how the engineer Ledley and the radiologist Lusted started collaborating on a project of computerizing diagnosis, influenced by a shared background and interest in operations research, an applied science that brought a ‘procedural rationality’ to a range of disciplines from military missions to production processes and the development of early computer programs.Footnote 36 Both researchers believed that medical practitioners would greatly benefit from assistance by computers, for example in automatic data processing or calculating diagnostic probabilities. Yet in order to start building computers that could assist doctors, first the activities of doctors needed to be formalized; that is, redescribed in a potentially computable language. Ledley and Lusted modelled doctors’ actions with future automation in mind.
The first and arguably most ambitious project they undertook was a formalization of the diagnostic reasoning process, published in the 1959 Science article ‘Reasoning foundations of medical diagnosis’, which would become widely cited and discussed.Footnote 37 This study provided mathematical descriptions of the complex reasoning process of diagnosis, a process at the basis of a doctor's ‘feeling about the case’.Footnote 38 Should a five-week-old infant with throat tumours receive X-ray therapy or surgery? The authors divided this problem into separate parts, describing it in the language of logical equations, probability computations and statistics, as well as calculations they drew from decision analysis (referencing work on game theory and decision making by authors such as John von Neumann, Oskar Morgenstern, Duncan Luce and Howard Raiffa).Footnote 39 Applying this novel combination of mathematical techniques allowed for a separation of the ‘strategy problem’ (arriving at a medical diagnosis and the optimum strategy for treatment on the basis of probability calculations and statistics) and the ‘values judgement problem’ (calculating the trade-offs given certain moral, ethical, social and economic considerations). These techniques were meant to aid the physician, though the authors conceded that they also added new learning responsibilities. Physicians’ tasks would become more complicated; they would have to study more and would need to be assisted by computers. Yet computers could never take over physicians’ duties, the authors assured; they would simply make diagnosis more rigorous, i.e. more scientific.
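The flavour of these probability computations can be reconstructed with a small worked example. The notation is modern and the numbers are invented for illustration – this is a sketch of the Bayesian reasoning the article drew on, not Ledley and Lusted's own worked case. By Bayes' theorem, the probability of disease $D$ given an observed symptom complex $S$ is

$$P(D \mid S) = \frac{P(S \mid D)\,P(D)}{P(S \mid D)\,P(D) + P(S \mid \neg D)\,P(\neg D)}.$$

If a disease has a prevalence of 1 per cent, and the symptom complex occurs in 90 per cent of diseased but 5 per cent of healthy patients, then $P(D \mid S) = (0.9 \times 0.01)/(0.9 \times 0.01 + 0.05 \times 0.99) \approx 0.15$ – the kind of counterintuitive figure that, on Ledley and Lusted's account, physicians estimate ‘subconsciously’, and often suboptimally.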
Lusted and Ledley's ‘logical analysis of medical diagnosis’ redescribed the fallible observer as a medical professional involved in a complex diagnostic reasoning process. The minds of doctors were thought to work somewhat analogously to this proposed logical model: doctors seemed to perform computational tasks ‘subconsciously’ or on an ‘intuitive’ level.Footnote 40 However, this intuitive diagnostic reasoning was also thought to be suboptimal; doctors were underachieving in their logical capacities and in need of computational assistance.Footnote 41 In turn, the same mathematical techniques (calculations for probability, statistics and utility) thought to be at the foundation of the physician's mind were also embodied in the (electronic) computational tools proposed to help them. Ledley and Lusted mentioned various (proto)types of diagnostic slide rule, mechanical correlator and punch card system, invented in France, the UK and the US since the mid-1950s, ‘to assist the logical faculties’ of doctors.Footnote 42 The historian Gerd Gigerenzer has described this analogical zigzag movement between tools (inferential statistical techniques as well as computational devices) and theories (of the workings of the (medical) mind) as a ‘tools-to-theories’ movement, characteristic of the period since the 1940s.Footnote 43 Statistical tools for testing hypotheses in a variety of inferential statistics approaches were considered ‘in a new light as theories of cognitive processes in themselves’.Footnote 44 Looking at the doctor described by Ledley and Lusted in 1959 shows the important influence of this vision of the mind as ‘intuitive statistician’ in the 1950s.Footnote 45
Radiologists, too, were thought to be intuitively reasoning experts. Developing logical principles for medical practice, Lusted proposed that radiology could serve as a ‘testing ground’ to further theorize and formalize various steps of this decision-making process and ultimately decrease observer error.Footnote 46 To do so, Lusted redescribed the actions of radiologists as a step-wise process: producing information on X-ray film, seeing the film (a physiological process), perceiving relevant aspects of the film, and diagnostic decision making. Each step could be formalized and also improved, at least hypothetically, by automation procedures for which Lusted mentioned some early prototypes. Radiologists’ systematic decision making, for example, could be linked to the digital coding of X-rays. If the pattern of a tumour could be encoded in a binary ‘1’ and ‘0’ code, visualized by black and white squares, this tumour profile could be read by someone who did not have any medical training, but merely needed to ‘understand the code’.Footnote 47 In this pattern-reading example (based on an early prototype by radiologist Gwilym Lodwick in 1954), a ‘coded analysis’ was thought to enable an interpretation of a roentgenogram with the fewest errors, regardless of an observer's expertise.
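The principle behind such ‘coded analysis’ can be sketched in a few lines of present-day code. The sketch below is a toy reconstruction under stated assumptions – the grid, the stored profiles and all names are hypothetical, and Lodwick's prototype was of course not software – but it shows why reading becomes a mechanical matching operation once a pattern is a code:

```python
# Toy sketch of 'coded analysis': a hypothetical reconstruction, not
# Lodwick's actual 1954 system. A region of an X-ray is reduced to a
# binary grid (1 = opaque/suspicious area, 0 = clear), and the resulting
# code string is matched against stored tumour profiles.

# Hypothetical reference profiles, keyed by arbitrary labels.
REFERENCE_PROFILES = {
    "profile_A": "110011001",
    "profile_B": "000111000",
}

def encode_grid(grid):
    """Flatten a 2D grid of 0/1 cells into a binary code string."""
    return "".join(str(cell) for row in grid for cell in row)

def closest_profile(code, profiles):
    """Return the label of the stored profile that differs from the
    observed code in the fewest positions (Hamming distance)."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(profiles, key=lambda label: hamming(code, profiles[label]))

observed = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
code = encode_grid(observed)                      # -> '110011001'
print(closest_profile(code, REFERENCE_PROFILES))  # -> 'profile_A'
```

A reader running this comparison needs no medical training, only an understanding of the code – which is precisely what made the scheme attractive as a hedge against expert observer error.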
Lusted was enthusiastic about these technological approaches to aid observation, even if evidence of increased accuracy was not yet in. Discussing a discontinued 1956 investigation into a screening device for mass miniature films – a pattern recognition apparatus to scan chest images and separate the normal from the abnormal – he noted that it was not the resulting machine that interested him but rather how the process of making it would require an understanding of the logic and probability principles underlying chest film interpretation. While an interest in new devices and procedures had brought Lusted to biomedical computing, ultimately the prospect of creating standards and thereby also getting a ‘firm grasp of the principles’ behind observing patterns on X-ray images was most important.Footnote 48
In Lusted's approach, which I have described as being at the forefront of the development of medical decision making as a field, we can discern two main and intersecting approaches to improving observer error. One was the aim to ‘reduce error’. Even recognizing that not all mistakes could be eradicated, experimenters worked on designing more precise X-ray technologies and improving aspects of faulty diagnostic reasoning. Second, radiology researchers also approached diagnosis by what could be described as ‘taming error’, a strategy that regarded radiologists’ false negative and false positive findings as unavoidable and aimed to monitor their relative occurrence.Footnote 49 The scanning device described by Lusted allowed researchers to calculate and determine an optimal ‘operating point’ on a statistical curve between too many false positives (warnings for lungs that were in fact normal) and too many false negatives (missed abnormalities).Footnote 50 Ultimately, a focus on ‘error’, both in reducing and in taming error, shaped a considerably positivist view of X-ray reading: the search for an optimal procedure to extract ‘truth’ from an image.Footnote 51
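The logic of such an ‘operating point’ can be made explicit in the vocabulary of signal detection that Lusted would later help bring into radiology – the notation and the cost ratio below are illustrative assumptions, not figures reported for the 1956 device. If, at decision threshold $t$, the screening device produces false positives at rate $\alpha(t)$ and misses at rate $\beta(t)$, and abnormal films occur with prior probability $\pi$, then the expected cost per film is

$$\mathbb{E}[\text{cost}(t)] = C_{FP}\,\alpha(t)\,(1-\pi) + C_{FN}\,\beta(t)\,\pi,$$

and the optimal operating point is the threshold that minimizes this sum. Pricing a missed abnormality at, say, ten times a false alarm pushes the optimum towards a ‘pessimistic’ setting that tolerates more false positives: error is not eliminated but budgeted and monitored.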
These two intersecting approaches – reducing and taming error – also corresponded with two interconnected approaches to the radiology observer. First, the figure of the radiologist as ‘intuitive statistician’ seemed to rest on an individualizing approach: a single mind calculating probabilities and trade-offs. Yet on second view, an individual doctor's observations went beyond the singular, since in Ledley and Lusted's 1959 vision each probability calculation of an individual case would feed into a data collection of the most current statistics. Beyond the diagnostic punch card aids of the mid-1950s, Ledley and Lusted now imagined something much bigger: a ‘central health computing and records service’. They envisioned a data-sharing network between local hospital computers and a central research computer, through which the central node would continuously be fed with new statistics and automatically drop older ones, allowing for calculations on the basis of the most current trends – the computer would ‘learn by experience’.Footnote 52 With this vision of a networked diagnostic calculation model, Ledley and Lusted had connected individual patient diagnosis to a population scale.
Although widely noted, Ledley and Lusted's logical and technological model had hardly solved the ever-present problem of variability in interpreting X-ray images. Around 1960, new studies showed that observer variability remained a problem, especially in mass survey work.Footnote 53 Beyond ‘logical analysis’, researchers involved in mass X-ray imaging were looking for concrete measures to improve the accuracy of their procedures for reading hundreds of thousands of images. At the scientific department of the UK National Coal Board, for example, researchers aimed to derive a ‘quantitative measure of the accuracy of the reading process’.Footnote 54 Attempting a precise mathematical description of the process of recognition and interpretation of the X-ray images would be ‘impracticable’, they explained.Footnote 55 Instead, they were looking for tangible directions: when could a common reading be taken as ‘definitive’? Would a second or even a third reading increase diagnostic accuracy? Was it possible to devise a practical, human solution to the ‘human factor’?
Doubling the fallible trained observer: dual X-ray reading and negotiating expertise
The problem of observation variability proved tenacious. Even if expert observers had plenty of time for perceiving and interpreting, and were well-rested and provided with the most precise images, a considerable number of disagreements and inconsistencies persisted. Garland's first report had already suggested a simple potential solution. Following up in 1950, Yerushalmy examined ‘dual reading’ as a way to decrease error.Footnote 56 Also called ‘double reading’, this meant performing a second independent interpretation of an X-ray completely separate in time from the first reading, either by a second observer or by the same observer on a second occasion. While there was a danger that second opinions would merely multiply the errors (more false positives and false negatives), Yerushalmy's research demonstrated that dual reading decreased errors. It was also cost-efficient, he reasoned, because the expenses of a missed case would be much greater than the costs of multiple readings.Footnote 57 Subsequent studies in various countries predominantly agreed with these findings.Footnote 58 However, the procedure did not seem to be implemented widely. Writing in The Lancet in 1955, two UK radiologists lamented, ‘Nearly nine years have elapsed since the presentation of the first paper on this problem, and the chief conclusion – the importance of a second reading – is still neglected in this country’.Footnote 59
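The statistical appeal of dual reading is easy to reconstruct; the figures below are illustrative rather than Yerushalmy's own, and they rest on the model's central assumption that the two readings err independently. If each reading misses an abnormality with probability $p$ and falsely flags a normal film with probability $q$, and a film is recalled whenever either reading flags it, then

$$p_{\text{miss}}^{\text{dual}} = p^{2}, \qquad q_{\text{fp}}^{\text{dual}} = 1 - (1 - q)^{2} \approx 2q.$$

With $p = 0.2$, for instance, independent dual reading would cut misses from 20 to 4 per cent at the price of roughly doubling false positives – a trade that could be judged worthwhile if, as Yerushalmy reasoned, a missed case cost far more than an extra reading. As will become clear below, however, the assumption of independent readings was the model's weak point.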
The reasons for the puzzling failure to take up what seemed a sensible human solution to the fallible observer are multifaceted, and I want to speculate on a number of them. First, the prospect of increasing – possibly almost doubling – the observer workload may have discouraged many professionals for logistical and economic reasons.Footnote 60 Moreover, increasing the number of workers also spotlighted the thorny issue of expertise in the X-raying workforce. Because even expert radiologists now appeared to be fallible observers, the ‘error problem’ drew heightened attention to the evaluation of individual performance and also to the demarcation of radiological expertise. Reports on observer error frequently presented scores of unreliability between individual trained radiologists or groups of experts (for example, a study by Garland and A.L. Cochrane compared American and British experts) as well as between experts and other, non-radiology-trained readers, such as chest experts or swiftly trained-up mass-survey readers.Footnote 61
While some researchers regarded lower-skilled readers as too unreliable, others emphasized that with adequate training, the proper level of experience could be obtained quite easily. Dutch tuberculosis researcher W.R. Griep argued in 1955 that inexperienced chest experts could reach a ‘maximum reliability’ after about three years of training.Footnote 62 Perhaps such readers-in-training could even start to work almost immediately: his research showed that a procedure of dual reading with two inexperienced readers could bring observer errors to a level equal to two experienced readers. Maybe, Griep speculated, it was not experience that mattered most, but the reader's character. Drawing conclusions from a test of (merely) five observers, he wondered whether the numbers of over-reading (erroneously calling an image suspicious, i.e. false positive) might have to do with the character of the female specialists tested for this investigation. While the male reader ‘does dare to decide if an affection he sees is important or not’, he noted, the two women exhibited an attitude that ‘you never can tell’.Footnote 63 Though Griep's invocation of a gendered aspect in male courage versus female caution is just a fleeting remark, it is telling of a broader negotiation of (X-ray) reading expertise in light of non-physician and women workers entering new (or shifting divisions of) tasks in medical and laboratory fields.
While women were already ubiquitously employed in mass X-ray survey work and radiology clinics as clerks and radiographers (operators who worked with patients and machines to produce the X-ray images), in the mid-twentieth century non-physician readers, including women, were increasingly considered a potential source of cheap labour to serve expanding survey facilities.Footnote 64 Outside radiology, other disciplines similarly grappled with a demand for workers who could process and read massive numbers of images. For example, in novel population-wide campaigns to detect cervical cancer, a new division of mostly female ‘screeners’ was educated to catch suspicious pap smears among thousands of microscopic slides.Footnote 65 In physics research, ‘scanning girls’ were trained to handle and interpret large numbers of particle track images.Footnote 66 In both examples, reading work was modelled (and discursively framed, i.e. ‘feminized’) to align with an affordable female workforce by restructuring work processes and continuous performance evaluation.Footnote 67 In contrast, in the field of (survey) radiology, the rise of non-physician readers remained contentious. Radiologists continued to protect their professional ownership of diagnosis against replacement by paramedical reading personnel. While the need for more and possibly cheaper non-physician personnel was frequently voiced – especially in the context of a call for double reading – the practice of interpreting X-rays was demarcated as the task of a medical specialist.Footnote 68
Economic and sociocultural considerations about the cost-efficiency of medicine and the esteem of non-physician (including women) workers thus influenced how solutions to observer variability were conceived and realized, and dual reading could not be aligned with principles and practices in radiology and medicine in the 1950s. Yet I want to suggest that there was another, arguably more fundamental, reason why the practice was not taken up. Dual reading emerged as a solution from an experimental and statistical framing of radiological practice, shaped by a ‘laboratory’ imitation of what scientific (medical) observers do. In the 1950s, the experiments by Garland, Yerushalmy and other investigators simulated X-ray viewing practices under experimental conditions, and emphasized recording individual readers’ interpretations, providing numbers that would feed into statistics of variability. This experimental frame could not fully describe, however, the way mass X-ray workers viewed and interpreted images in messy real-world situations.
Dual-reading research makes this discrepancy – between model and real world, between laboratory investigations and actual readers looking at pictures of lungs – sharply visible. In 1960, researchers working at the National Coal Board aimed to outline uniform procedures for taking and reading radiographs.Footnote 69 With mass X-ray examinations of workers at no fewer than twenty-five coalmines, standardized methods were necessary to produce accurate data to investigate the progression of black lung disease.Footnote 70 Attempting to keep observer variability in check, the researchers proposed a triple reading process (dual reading by one medical officer, and a third reading by another). While the statistical analysis of ‘dual reading’ required that a first and second reading be wholly independent, the model could not accommodate the fact that doctors tended to remember a previously seen image – especially the ‘doubtful’ films – even several months after the fact. ‘Double reading’ hardly matched actual reading practice. Reflecting on their experimental model, the researchers commented, ‘it is apparent that it does not provide an entirely realistic representation of the reading process on any particular film’.Footnote 71 To reduce observer error, researchers started investigating models of ‘joint discussion’, which appeared to reduce inconsistencies better than separate dual readings between different observers.Footnote 72 Gradually, variants of collegial joint discussion about uncertain images were strengthened and reframed as forms of ‘conference reading’ taking place in ‘referee conferences’ and became more explicitly implemented in working routines.Footnote 73 Such practices did not amount to the fully fledged programme of dual reading proposed as a solution to the problem of ‘observer error’. Instead, practices mitigating uncertainty had already evolved outside the investigative experimental sites of radiology research from which the paradigm of the ‘error problem’ emerged.
The recurring discovery of the error problem
Through the ‘error problem’, as it was shaped in the field of radiology in the mid-twentieth century, a new epistemic position was foregrounded: the expert as a fallible trained observer. This expert needed to be monitored and assisted to reduce mistakes and stay within a statistical range of acceptable inconsistencies and disagreements, prompting a search for different measures and aids to discipline error. My analysis of these faulty X-ray readers contributes to historical research on scientific observers and historical epistemology in the mid-twentieth century. Historians Lorraine Daston and Peter Galison have pointed to the emergence of a new epistemological position in the 1930s: a focus on ‘trained judgement’ in the interpretation of visual records in scientific practice by the ‘trained expert’ who has developed a capacity ‘to synthesize, highlight, and grasp relationships in ways that were not reducible to mechanical procedure’.Footnote 74 In Daston and Galison's account, the early decades of the century are characterized by a shift from a focus on creating scientific graphs and images conceived as having a self-evidential nature according to an ideal of ‘mechanical objectivity’ towards an emphasis on the necessity of the trained eyes of experts to identify and judge the characteristics of these records. Daston and Galison describe the ‘trained expert’ as someone who ‘embraced instruments, along with shareable data and images, as the infrastructure on which judgment would rest’.Footnote 75 My genealogy of the imperfect radiologist and the fallible trained observer shows that some trained experts were viewed as inevitably in need of help, and notions of judgement were themselves increasingly shaped in terms of statistical analyses that counted variations between observers, beyond the individual trained expert.Footnote 76
Around 1950, the notion of the imperfect trained observer took shape with and through the statistical monitoring of observer errors in experimental (investigative) set-ups. I have argued that this framework also shaped how solutions to the error problem, such as technological aids for automated image recognition and procedural changes such as ‘dual reading’, could be imagined. Comments on inter-observer variability were not new in the 1940s, but refracted earlier reflections on scientific observation by the philosophers and sociologists of science Ludwik Fleck and Michael Polanyi, who considered skilful, intersubjective, intuitive and unaccountable elements as integral dimensions of scientific observation. In contrast, I have shown that the framework of the fallible observer prompted a statistical and experimental solution: a second reader could potentially decrease error. Yet research in dual reading hardly modelled the reality of mass X-ray observations and radiological practice, obscuring a more complex reality in which readers could jointly discuss what they saw on the image and, as Fleck and Polanyi suggested, draw on a complex process of collective training and experience. My analysis also suggests that despite the repeated lament that dual reading had not been implemented, there were already collaborative practices in radiology, hiding in the shadows.
Today, the figure of the fallible expert observer has regained significance. Advocates of AI in radiology are refocusing attention on the persistent issue of reader variability and propose artificial intelligence as a promising and fitting technological solution to this issue of human error.Footnote 77 On second view, however, this ‘rediscovery’ of radiology's error problem is not new but is another instance of its continuous return. Each decade, so it seems, a group of researchers revisits the ‘Achilles heel’ of the discipline, to point back to the first reports around 1950 and conclude that the problem of human error in radiological observation is multifaceted, complex and persistent.Footnote 78 Over the past seventy-five years or so, the recurring recognition of the error problem is closely tied to returning suggestions for reducing error.
My historical analysis shows how, in the decade after the first reports on inter-observer variability, technological solutions to automate X-ray reading may have received as much attention as – if not more than – procedural solutions to mitigate error, solutions that implied increasing and restructuring human X-ray-reading labour. After the 1950s, novel computer-assisted procedures for X-ray reading (based on processes of standardization and automated pattern recognition) have continued to be much publicized, from computer-aided analysis of X-ray images in the 1960s to computer-aided detection (CADe), computer-aided diagnosis (CADx) and the use of artificial neural networks from the 1970s to the 2000s. By looking back to the early days of the debate on improving X-ray image reading, my analysis helps us to understand how the promise of a technological solution to a pressing error problem could be sustained vis-à-vis (an allegedly) daunting and overly expensive human fix. Considerably less publicized is the associated recurring discovery of dual reading. A recent (2018) meta-review again suggests that double reading in radiology is beneficial, but also adds a familiar caveat: ‘the benefit of double reading must be balanced by the considerable number of working hours a systematic double reading scheme requires’.Footnote 79
Historicizing the paradigm of the ‘error problem’ reveals the rhetorical negotiation between two poles – an up-and-coming image-reading technology in relation to a suboptimal human. My analysis also opens up a view of the way this pairing has taken a new turn: in the 1990s, computer-aided detection started to be proposed as a way to realize double-reading procedures, particularly in the context of mass screening. At that point, CAD became envisioned as a ‘viable cost-effective alternative to double reading by radiologists’.Footnote 80 With strategic modesty, technology was positioned as the not-yet-perfected but feasible assistant to the inevitably imperfect X-ray-reading expert. ‘AI as second reader’ is the present-day version of this diplomatic conceptual emplacement, which has helped to sustain interest in computer-aided and AI-supported solutions at a time when evidence that such methods improve clinical accuracy and efficacy is still pending.Footnote 81 This vision of AI as a potentially cheaper double reader, I have shown, has long been in the making.
My historical account of observer errors in the field of X-ray reading helps to contextualize the bombastic contemporary trope of the ‘centaur radiologist’, mentioned at the beginning of this article. Half-man and half-AI, the centaur radiologist conveys the image of an effective doctor–warrior that harmoniously combines human skilfulness with the newest AI technologies to ‘find patterns in data that are beyond humans’ abilities’.Footnote 82 As I show, this heroic image obscures a more mundane and modest application of AI as an allegedly more affordable aid in optimizing the reduction of errors. At the centre of this development, starting in the late 1940s, is the imagination of a fallible radiological expert.
Acknowledgements
I would like to thank the editor and two anonymous referees of the BJHS for their insightful reviews of earlier drafts. Special thanks to Richard Staley and the editors of this issue on Histories of AI for their expert conceptual and editorial advice and for welcoming me at the Mellon Sawyer Seminar on Histories of AI at the University of Cambridge in 2021. This article has also benefited from generous reflections on a draft by members of the Maastricht University Science, Technology and Society Studies research group – special thanks to Joeri Bruyninckx for his careful comments. Additionally, I would like to thank the radiologists at the Trefpunt Medische Geschiedenis Nederland, in particular Kees Simon and Frans Zonneveld, who responded kindly, swiftly and extensively to my questions and provided vital professional reflections. I thank Eddy Houwaart for important feedback in early stages of developing this text. Thank you Alex Campolo for perceptive exchange on notions of optimization and error. My research was generously supported by the Dutch Research Council (NWO) as part of the RAIDIO research project grant (number 406.DI.19.089). Special thanks to RAIDIO team member Sally Wyatt for important advice on drafts, as well as to Annelien Bredenoord, Jojanneke Drogt, Karin Jongsma, Megan Milota and Shoko Vos.