Researchers use text analytics to uncover the lives of 19th-century women

Erin O'Hare
April 24, 2019

About 20 years ago, a certain type of book kept popping up in Alison Booth’s research.

The English professor, who studies narrative theory and the Victorian era, noticed that biographies of women were being written long before feminist critics thought they were. These biographies go into detail about women’s lives, “making the case for women’s importance, women’s variety, women’s historicity,” she says.

“For centuries, biographers have even kept records of rather ordinary women who mattered in their own community without being famous,” Booth says, “yet we often learn about history as if women rarely had a part in it.”

Calamity Jane and the Lady Wildcats is a biography that is part of the CBW collection and the BESS schema created by the DSI.

Booth located about 1,200 of these books—collective biographies of, say, nurses, adventurers, assassins, writers—via online catalogs and in 2004 published a book, How to Make It as a Woman: Collective Biographical History from Victoria to the Present, on her findings. She then worked with UVA’s library to create a digital finding aid for these books hiding in plain sight on many libraries’ open shelves, to shed light on women’s stories that have been historically kept in the dark.

“As I worked on it further, I wanted to go past having a two-dimensional resource for finding out about these books, to thinking about what is really in them,” says Booth, who is also co-director of the Scholars’ Lab, a digital humanities center at UVA.

Booth is working with the Data Science Institute (DSI) and Institute for Advanced Technology in the Humanities (IATH) on the Collective Biographies of Women (CBW) Project, which applies data science methods to glean new information from these books.

DSI Master of Science in Data Science (MSDS) students Sakshi Jawarani, Murugesan Ramakrishnan, and Varshini Sriram, advised by MSDS Program Director and professor Rafael Alvarado, are working on a capstone project with Booth to build the Biographical Elements and Structure Schema (BESS). The BESS relies on human editors to identify selected features, such as word choice (e.g., “crying” or “traveling”), in each paragraph of the books in CBW’s database. The students are working out how to apply machine learning methods that can aid both in editing the corpus and in interpreting the results of the BESS annotations.

“Text problems are very prevalent in data science these days,” says Ramakrishnan, “and they are really interesting, really complex problems to solve. There is so much textual data in the world, and not just in the Humanities, but in areas like business, finance, and social media—from conducting sentiment analysis of Amazon reviews, to studying patterns of hate speech on Twitter.”

Booth’s collection, in effect, is a labeled corpus of textual data of the kind that data scientists analyze with natural language processing and text mining techniques for a variety of projects.

One task on which the CBW capstone project is focused is to build a classifier that takes advantage of existing CBW markup in order to predict the annotations that might be applied to other paragraphs from the large number of as yet unlabeled biographies. Another task is to use unsupervised methods to surface semantic patterns in the corpus as a whole that might not be apparent to the human reader.
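To make the first task concrete, here is a minimal sketch of a paragraph classifier trained on labeled examples. It is an illustration only, not the capstone team's method: it assumes a simple naive Bayes model, and the training sentences, the two labels, and the function names `tokenize`, `train`, and `predict` are all invented for the example rather than drawn from the CBW corpus.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and keep alphabetic tokens only; a real pipeline would do more.
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()

def train(labeled_paragraphs):
    """Fit a multinomial naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of paragraphs
    vocab = set()
    for text, label in labeled_paragraphs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the most probable label for an unlabeled paragraph."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, invented training data (hypothetical; not from the CBW corpus).
labeled = [
    ("She wept at the bedside of the wounded soldiers.", "crying"),
    ("The audience was moved to tears by her speech.", "crying"),
    ("She sailed to Scutari and journeyed inland by road.", "traveling"),
    ("After a long voyage she arrived at the port.", "traveling"),
]
model = train(labeled)
prediction = predict(model, "Her tears fell as she wept for the soldiers.")
print(prediction)  # crying
```

With real annotations, the same pattern would apply at a much larger scale: paragraphs already marked up by editors serve as training data, and the model proposes labels for paragraphs no editor has read yet.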

Florence Nightingale, founder of modern nursing, cared for soldiers during the Crimean War. Nightingale is featured in four books in the CBW collection.

For example, an editor might read three versions of a Florence Nightingale biography. In the first version, the event “crying” might occur in the fourth paragraph; in the second version, “crying” might occur in the fifth and tenth paragraphs; in the third version, Nightingale may never cry or move her audience to tears. The editor tracks this data in a stand-aside XML document (vetted later by a second editor), which allows for a comparison of what different authors choose to include or exclude in a narrative of Nightingale’s life and documents how the persona of Nightingale changes across versions over time. The extensive controlled vocabulary, not only for events but also for narrative technique, goes beyond word search and sentiment analysis, which might always consider “crying” to be negative, whereas it can sometimes signal empathy or joy.
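The Nightingale comparison described above can be sketched in a few lines of code. The XML layout here is an assumption made for illustration: the element and attribute names (`annotations`, `version`, `paragraph`, `event`, `type`, `n`) are hypothetical and are not the actual BESS schema, and the “traveling” entries are invented filler. The “crying” entries mirror the example in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-aside annotations (NOT the real BESS format):
# one <version> per biography, one <paragraph> per annotated paragraph,
# and <event> tags whose type comes from a controlled vocabulary.
ANNOTATIONS = """
<annotations subject="Florence Nightingale">
  <version id="v1">
    <paragraph n="4"><event type="crying"/></paragraph>
  </version>
  <version id="v2">
    <paragraph n="5"><event type="crying"/></paragraph>
    <paragraph n="7"><event type="traveling"/></paragraph>
    <paragraph n="10"><event type="crying"/></paragraph>
  </version>
  <version id="v3">
    <paragraph n="2"><event type="traveling"/></paragraph>
  </version>
</annotations>
"""

def paragraphs_with_event(xml_text, event_type):
    """Map each version id to the paragraph numbers tagged with event_type."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for version in root.findall("version"):
        hits = [
            int(p.get("n"))
            for p in version.findall("paragraph")
            if any(e.get("type") == event_type for e in p.findall("event"))
        ]
        # Versions without the event keep an empty list, so omissions stay visible.
        result[version.get("id")] = hits
    return result

crying = paragraphs_with_event(ANNOTATIONS, "crying")
print(crying)  # {'v1': [4], 'v2': [5, 10], 'v3': []}
```

Keeping the empty list for the third version matters for this kind of comparison, since what an author leaves out of Nightingale's story is as informative as what is included.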

For her part, Booth says, “I’m excited to see what they come up with,” and to see how this type of mid-range reading, as she calls it, can lend new insights into literary—and data—analysis.