The Body Audible: From Vocal Biomarkers to a Phrenology of the Throat

How do mental illnesses sound? What are the stakes of using machines to render the signs of psychiatric suffering audible? These questions drive the teams of psychiatric and engineering professionals I study. They also animate my own ethnographic inquiries into listening as a gendered, racialized form of labor and care, and the politics of framing mental health care in the United States as a technoscientific – and semiotic – problem. 

Located across U.S.-based universities and supported by federal, academic, and philanthropic funders, these research teams are building automated voice analysis technologies for psychiatric screening, a form of clinical judgment prior to diagnosis and treatment. They marshal artificial intelligence (AI)-enabled pattern recognition to seek out “vocal biomarkers” of mental illness: subtle, sonic signatures of psychopathology supposedly expressed in speech but undetectable by even the most expert of human ears. Like their peers and competitors, the teams’ long-term goal is to make screening more accurate and efficient by shifting the interpretive focus from content (what people say) to form (how they say it). Contrary to my own assumptions about the seamlessness of automated listening devices—like the body-less “assistant” of the Amazon Echo that carries out commands without a touchable interface—my fieldwork revealed that the embodied, interactional dimensions of listening are central to automating psychiatric screening.  

My interlocutors are part of a growing community of mental health care researchers in the U.S. who eschew the hypothesis-driven methods of biomedicine in favor of data-driven approaches from computer science and engineering. Champions of this movement, called computational psychiatry, reject their field’s attachment to the Diagnostic and Statistical Manual of Mental Disorders (DSM), citing the absence of biological indicators of mental illness. The DSM’s criteria are arrived at through statistical analyses of patient-clinician interviews and symptom reports, rather than through laboratory-based assays. For these reasons, computational psychiatrists insist that the DSM groups people together in imprecise ways. As a neuropsychiatric researcher at Massachusetts General Hospital recently put it, “[i]n psychiatry we don’t even have a stethoscope…It’s 45 minutes of talking with a patient and then making a diagnosis on the basis of that conversation. There are no objective measures. No numbers” (Adam 2020). 

By contrast, proponents of computational psychiatry propose gathering as much observable data as possible from patients, regardless of that data’s relationship to conventionalized diagnostic criteria: anything from gait to galvanic skin response. With enough data and computational acumen, they argue, novel connections between behavior and the onset of mental illnesses will flicker into view, leading to more reliable diagnoses and treatments grounded in biological, etiological mechanisms.

Some extend this argument–and the linkage between quantification, biology, and accuracy–further. Echoing former National Institute of Mental Health (NIMH) director Thomas Insel (2017), they assert that the expertise and ethos of Silicon Valley are uniquely capable of addressing U.S. psychiatry’s public health challenges: too many sick and underserved people, and nowhere near enough time, money, or personnel to support them. In my book manuscript, I delve into how computational psychiatry and prior attempts at computerization take for granted that diagnoses also operate as vectors of social control.

Here, however, I want to focus on the perceived epistemic and public health deficits of contemporary U.S. psychiatry that catalyze researchers to try their hand at automating psychiatric screening, rather than diagnosis. Under conditions of late capitalism, screening plays a key role in sorting the well from the unwell, determining who is deserving of professional, clinical attention, detention (in either a psychiatric hospital or a prison), child custody, and more. While the suggestion that AI might make these choices easier is seductive, the historical and ethnographic record demonstrates that automation in the name of efficacy tends to deepen, not mitigate, inequities that fall upon racialized, gendered, and colonial fault lines (Hinton 2016; Eubanks 2018; Hicks 2018; Benjamin 2019; McIlwain 2019; Glabau 2020). My research suggests that automating the sensory, decision-making work in the administration of psychiatric medicine risks replicating similar inequities.

The notion of a psychiatric vocal biomarker–that the voice conveys biologically meaningful signs–likewise has a sociocultural backstory, converging around models of language from North American mental health care and engineering. In the context of U.S. mental health care, according to what E. Summerson Carr calls “the ideology of inner reference” (2010: 4), speakers can agentively express and render transparent the contents of their mind through semantic meaning. This also means that speakers can jam the mind-meaning channel through lying, concealment, or miscommunication. The desire for vocal biomarkers upholds the ideology of inner reference, taking it seriously as a dilemma to overcome. 

Signal processing, a subfield of engineering indebted to information theory, telephone engineering, and D/deaf education (Mills 2011; Sterne and Rodgers 2011), offers a resolution, reorienting the locus of meaning from linguistic significance to acoustic form, and from immaterial mind to material brain. Like some of my interlocutors, engineers trained in signal processing treat speech as a biomechanistic signal that can be mathematically analyzed to reveal information about the neuronal source that produced it.

In theory, because vocal biomarkers index the faulty neural circuitry of mental illness, they are agnostic to language difference, speaker intentionality, and semantic, sociocultural meaning. Neurobiological essentialism and language universalism collide. Together, these models situate the body, via “the grain of the voice” (Barthes 1977), as a locus of truth beyond conscious control and apprehensible through mechanical mediation (Lempert 2019; see also Inoue 2018). This is the same cultural imaginary as that of the polygraph: the notion that psychic ephemera can be converted into something knowable once transferred from one medium (bodily) into another (numerical) (Dror 2001).

In the messiness of day-to-day practice, however, idealized models of mind, body, language, and machine unwind. Joining the interdisciplinary teams as a research assistant, I learned that researchers must first amass a stockpile of hand-labeled speech data in order to begin developing a voice analysis system. This entails a chain of delicate, time-consuming communicative events. Team members must recruit and screen research participants willing to have their voices recorded and studied by people they may never meet. They must draft and adhere to institutionally mandated ethical protocols for conducting research on human subjects. They must craft interactional settings in which the participant feels comfortable producing speech, drawing on sociocultural scripts for establishing rapport, encouraging trust, and performing receptive empathy. When brain scans are conducted, they must help participants into the scanner bed, tucking them in with a blanket and coaxing them to lie still and speak clearly. And finally, they must store, listen to, judge, and tag thousands of individual audio files. Whatever the eventual algorithmic system “recognizes” in speech depends upon these aggregated, subjective, and never-neutral choices (Lawrence 2018; Thakor 2018; Noble 2018; Benjamin 2019). The end product is an amalgamation of human and machine, constructed just as much of tacit knowledge and gut feelings as bits, wires, and silica.

Working alongside my interlocutors attuned my ethnographic listening to the self-reflexive critiques they made of their own projects and the techno-optimism surrounding AI and mental health care. They taught me to be vigilant about the depoliticizing tendencies of computational psychiatry (Semel n.d.), which locates illness in individual bodies. Guided by scholarly and activist work on the close coupling of social services and the U.S. prison industrial complex (Roberts 2001; Hatch 2019; Akua Mensah and Kaufman-Mthimkhulu 2020), and by trenchant calls for liberation by the Black Lives Matter protests of spring 2020, my research currently investigates how voice analysis technologies sustain the U.S. mental health care system’s carceral logics of capture and containment, disproportionately harming BIPOC, queer, trans, disabled, unsheltered, and non-U.S. citizen individuals.

In this same vein, those who use, develop, and benefit from voice analysis technologies beyond psychiatry must scrutinize the unintended byproducts of voice analysis technologies produced alongside their use-cases. Parsing through some of the patents for “COVID-cough detection” software, for instance, I find the claim that COVID-19 has distinctive aural contours nested within claims that gender, ability, race, ethnicity, and country of origin are also detectable in the voice. 

If these associations are not untangled, the makers and stewards of even the most benign-seeming voice analysis technologies run the risk of legitimizing a phrenology of the throat: the reproduction of scientific racism and other modes of domination through the materiality of the voice. The invocation of phrenology is not so much an accusation as it is an invitation. It is a call to foster what Simone Browne (2015, 116) calls “critical biometric consciousness”: generative but critical excavations of the histories, political economies, lives, and schemes of classification that line the bottom of algorithmic systems.

Beth Semel is a Postdoctoral Associate in Anthropology at the Massachusetts Institute of Technology, where she serves as the Associate Director of the Language and Technology Lab. Her research explores the intersection of communication sciences, computing, biomedicine, care, and control in the contemporary U.S. Her current project traces the development of machine listening technologies for psychiatric and other forensic applications. She received her PhD in History, Anthropology, Science, Technology and Society at MIT. Twitter: @bethmsemel


Adam, David. 2020. “Machines can spot mental health issues–if you hand over your personal data.” MIT Technology Review, August 13. Accessed August 17, 2020.

Akua Mensah, Stellah and Stephanie Lyn Kaufman-Mthimkhulu. 2020. “Abolition Must Include Psychiatry.” Disability Visibility Project, July 22. Accessed August 25, 2020.

Barthes, Roland. 1977. Image–Music–Text. Stephen Heath, trans. New York: Hill and Wang.

Benjamin, Ruha. 2019. Race After Technology: Abolitionist Tools for the New Jim Code. Cambridge, UK: Polity Press.

Browne, Simone. 2015. Dark Matters: On the Surveillance of Blackness. Durham: Duke University Press.

Carr, E. Summerson. 2010. Scripting Addiction: The Politics of Therapeutic Talk and American Sobriety. Princeton: Princeton University Press.

Dror, Otniel. 2001. “Counting the Affects: Discoursing in Numbers.” Social Research 68 (2): 357–378.

Eubanks, Virginia. 2018. Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. New York: St. Martin’s Press.

Glabau, Danya. 2020. “The Dark Matter of Digital Health.” Public Books, April 14. Accessed June 30, 2020.

Hatch, Anthony Ryan. 2019. Silent Cells: The Secret Drugging of Captive America. Minneapolis: University of Minnesota Press.

Hicks, Mar. 2018. Programmed Inequality: How Britain Discarded Women Technologists and Lost Its Edge in Computing. Cambridge, MA: MIT Press.

Hinton, Elizabeth. 2016. From the War on Poverty to the War on Crime. Cambridge, MA: Harvard University Press. 

Inoue, Miyako. 2018. “Word-for-Word: Verbatim as Political Technologies.” Annual Review of Anthropology 47: 217-232.

Insel, Thomas. 2017. “Join the disruptors of health science.” Nature 551: 23-26.

Lawrence, Halcyon. 2018. “Siri Practices Accent Bias.” Conference Presentation, Your Computer is on Fire: Critical Perspectives on Computing and New Media, Stanford University, February 9.

Lempert, Michael. 2019. “Fine-Grained Analysis: Talk Therapy, Media, and the Microscopic Science of the Face-to-Face.” Isis 110 (1): 24-47.

McIlwain, Charlton. 2019. Black Software: The Internet & Racial Justice, from the AfroNet to Black Lives Matter. Oxford, UK: Oxford University Press.

Mills, Mara. 2011. “On Disability and Cybernetics: Helen Keller, Norbert Wiener, and the Hearing Glove.” differences 22 (2-3): 74-111.

Noble, Safiya Umoja. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. New York: NYU Press.

Roberts, Dorothy. 2001. Shattered Bonds: The Color of Child Welfare. New York: Civitas Books.

Semel, Beth. n.d. “Listening Like a Computer: Attentional tensions and mechanized care in psychiatric digital phenotyping.” 

Sterne, Jonathan and Tara Rodgers. 2011. “The Poetics of Signal Processing.” differences 22 (2-3): 31-53.

Thakor, Mitali. 2018. “Digital Apprehensions: Policing, Child Pornography, and the Algorithmic Management of Innocence.” Catalyst 46 (1): 1-16. 
