Written by Hilde De Weerdt.

The datafication of everything we do while we are online, carry our phones, fill out forms, make payments, or simply pass by traffic or security cameras is reshaping how governments and businesses make decisions and how all aspects of our lives, including health care, education, sports, and housing, are organized. These changes did not come about as a result of digitization, or the mere conversion of analogue information into binary code. They are now becoming visible and debatable as the outcome of new uses of data, often individually generated and personal, gathered by different organizations, and of new ways of combining and analyzing such data.

To take a relatively innocuous example, by correlating different datasets on search terms and the spread of the flu, and by establishing what search terms are used frequently in times and places where the flu spread in recent years, data scientists have been able to predict when and where the flu is spreading and ensure faster intervention. The more chilling effects of big data on privacy, liberty, and equality are equally well documented.

The impact of big data and the transformative effect of digital media on contemporary Chinese politics and society have received increasing attention in modern China Studies, as shown in Daniela Stockmann’s recent survey of the field. With the digital turn in China Studies has also come a problematization of how to conduct research with digital methods, addressing such issues as access, the ethics of data use and publication, assessing the scope and nature of datasets, and collaboration in data analysis.

For the study of pre-twentieth-century Chinese humanities, the issues are to a large extent different. Access to the extensive and long written and material record has been facilitated by large-scale digitization projects in the 1990s and 2000s. These were undertaken by government institutions and companies in China, Hong Kong, and Taiwan as well as libraries, museums, archives, religious organizations, and some unique individuals in East Asia and throughout the world. Access is by no means complete, universal, or affordable, but, overall, tens of thousands of texts can be read or searched in full-text format, relatively free of ethical concerns.

Moreover, the philological bent of Chinese humanities research has also led to the development of platforms in which new modes of analysis coming out of the digital revolution are adapted to the conditions and interests of students and researchers working on Chinese history, literature, art, medicine, philosophy, or religions. I propose below that such interlinked infrastructures can enhance the strengths of humanities scholarship and also provide an avenue for experimenting with new kinds of questions at smaller and larger scales.

The Limits of Digitization

In the 1990s, the first large textual databases became available. The twenty-four dynastic histories and the electronic edition of the Siku quanshu were first installed on a small number of individual desktop computers which could only be consulted in the reading rooms of select university libraries. Even though the number of such textual databases has grown significantly since then, their user interfaces and functionality have remained almost unchanged. Digital Heritage Publishing and Beijing Erudition Digital Research Center, the two leading commercial publishers of databases of pre-twentieth-century Chinese materials, provide rather basic search functionality, outdated results management, limited and poor reference tools, and unacceptable limitations on the export of search results and the texts included. (The lack of academic standards in documentation concerning bibliographic information, upgrades, or advanced searching techniques is also staggering.)

At a time when researchers should be able to collect and work with materials across vast and expanding textual archives, database design is still steering them to limit their search by genre, author, title, or, in the case of local gazetteers, by place. Publishers have made no effort to provide data discovery, visualization, or text analysis tools that would convert unmanageable numbers of hits into structured, faceted results. The packaging of text collections in separate databases has furthermore led to the proliferation of databases and suggests that the vendors of these products are still operating in the age of CD-ROMs [Figure 1]. Libraries are now paying substantial subscription fees for our equivalent of big data, but have accepted severe restrictions on its use.

[Figure 1] The proliferation of databases. 北京爱如生数字化技术研究中心, Beijing Erudition Digital Research Center, 2001 –

Modeling Research Practices

Until recently then, the digitization rush has led to a research ecology in which databases are neatly put on shelves, side by side, without integration or linkage of their contents. Some researchers, meanwhile, have developed datasets, tools, and platforms, and linked them in ways that both serve to allow intelligent access across large and diverse textual repertoires and aim to remain close to the reading processes and research flows of students and researchers. Let me use MARKUS, the platform I co-designed with Brent Ho and with the support of European Research Council and Digging Into Data grants, as an example.

MARKUS is a reading and text analysis platform with a wide range of functionality, including: automated tagging and identification of personal and place names, official titles, and time references in classical Chinese; manual and batch tagging of user-supplied keyword lists in all languages and creation of custom tags; generation of keywords based on text analysis (keyword clipper); flexible filtering of tagged content; linking to a range of online reference tools, including geographical and biographical databases and language- and domain-specific dictionaries for online reading; online note-taking; export to a wide range of formats, including HTML and TEI, to ensure interoperability; automatic export of tagged content and linked data from the China Biographical Database to visualization platforms for exploration and analysis of tagged content in the associated VISUS visualization interface (maps, network graphs, tables, timelines, pie charts, tag clouds); linking to textual databases such as Donald Sturgeon’s Chinese Text Project for easy import of a broad range of texts; and machine learning to improve accuracy and recall for large corpora.
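To make the idea of batch tagging from a user-supplied keyword list concrete, here is a minimal Python sketch. It is not MARKUS's own code or output format: the function name `tag_keywords`, the `span` markup, and the tag type `personName` are all assumptions chosen for illustration. It only shows the general technique of wrapping every occurrence of a listed keyword in inline markup, preferring the longest match.

```python
import re
from html import escape


def tag_keywords(text, keywords, tag_type):
    """Wrap each occurrence of a keyword in a <span> labelled with tag_type.

    Hypothetical helper for illustration; the markup is an assumption,
    not the format any particular platform uses.
    """
    # Drop empty strings and try longer keywords first, so that a longer
    # name wins over a shorter keyword contained in it.
    ordered = sorted({k for k in keywords if k}, key=len, reverse=True)
    if not ordered:
        return escape(text)
    pattern = re.compile("|".join(re.escape(k) for k in ordered))

    parts = []
    last = 0
    for match in pattern.finditer(text):
        parts.append(escape(text[last:match.start()]))  # untagged text
        parts.append(
            f'<span class="{tag_type}">{escape(match.group())}</span>'
        )
        last = match.end()
    parts.append(escape(text[last:]))  # remainder after the last match
    return "".join(parts)


sample = "蘇軾與司馬光同遊洛陽。"
tagged = tag_keywords(sample, ["蘇軾", "司馬光"], "personName")
print(tagged)
```

Because the result is ordinary inline markup, the tagged names can later be filtered, counted, or exported without altering the underlying text, which is the property that lets reading and analysis alternate freely.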

MARKUS grew out of a methodology devised to examine communication networks in imperial Chinese history as recorded in notebooks and correspondence. In order to examine the social makeup, geographic scope, and intellectual orientation of such networks, I first manually tagged all informants, linking their appearance in the text to the China Biographical Database edited by Peter Bol, Michael Fuller, et al. On the basis of the data I had thus created from my sources, alongside data obtained from biographical and geographical databases, I argued that twelfth- and thirteenth-century authors claimed cultural and political status in information networks that connected hundreds of contemporaries spread across the Southern Song territories and that increasingly gave voice to lower-level elites. The articulation of such networks occurred at a time during which a structural transformation took place in the dissemination of information relating to the state.

We took the further step of generalizing this methodology and simplifying the laborious steps of extracting, merging, and visualizing data in separate packages into a linked platform in which a large part of the annotation and visualization can be undertaken automatically. MARKUS is designed and continues to be developed to model existing research flows, allowing for flexible switching between markup, reading, exploration, analysis, and annotation. As shown in Digital Perspectives on Imperial Chinese Political History, this works for humanities research at various scales. Recent work by Margaret Wan, Michael Stanley-Baker, Chu Ming-kin, Xiong Hueilan and Hsu Yahuei, who have used MARKUS in research on Chinese fiction, medical history, private correspondence, urban architecture, and art catalogues, has further shown how the included digital methods can be fruitful across disciplines.

[Figure 2] Automated and manual tagging in MARKUS.

[Figure 3] Map and timeline view of tagged content (direct link from MARKUS and VISUS to Palladio)

In Conclusion: Towards a Linked Research Infrastructure for East Asian Studies

The large Chinese text databases created in the 1990s and 2000s and the prosopographical and geographic databases that originated during the heyday of quantitative social history have so far mostly led parallel lives, and have by and large remained isolated from innovations in digital scholarship. This has prevented scholars in all disciplines from taking advantage of the big data of the past. MARKUS is but one step in the integration of different kinds of databases for scholarly use, but it suggests that, through collaboration across disciplinary and professional boundaries, researchers can participate in the design of digital tools that best suit their interests and also gain access to methods that are otherwise limited to small numbers of experts only.

Despite the attempt to customize tools as much as possible to researcher feedback, challenges surely remain. These include the dearth of humanities-specific visualization tools and mistaken expectations of researchers about the functionality and capability of digital tools. Working digitally also requires an adjustment in scholarly habits: a tolerance for experimentation and failure, for instance, and the acceptance of a certain measure of inaccuracy and messiness when working at elevated scales. The benefits to the philological scholarship and hermeneutic traditions that have been at the core of the humanities since early modern times are by now well known; the blogposts gathered here further testify to the innovative research that digital methods are making possible, predominantly by examining groups of people, sets of texts, or stylistic figures across large corpora of texts.

[Figure 4] Digital methods such as tagging allow us to examine political parties or other collectives across digital text corpora. This network graph shows Yuanyou party members as they were associated in texts authored by their contemporaries. They are clustered according to the density of ties within subgroups. The graph is based on co-occurrence data obtained about 309 persons in the work of 2,231 authors in 56,969 documents. Only those pairs who co-occur at least 9 times are shown.
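The co-occurrence counting and thresholding behind a graph like this can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure actually used for the figure: the function name `cooccurrence_edges` and the toy data are hypothetical, each pair is assumed to be counted at most once per document, and the clustering of subgroups is left out.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(documents, min_count):
    """Count how often two tagged persons appear in the same document.

    documents: iterable of lists of person identifiers, one list per document.
    Returns a dict mapping (person_a, person_b) to the number of documents
    in which both appear, keeping only pairs at or above min_count.
    """
    counts = Counter()
    for persons in documents:
        # Deduplicate within a document and sort so that each unordered
        # pair is counted once per document under a single key.
        for pair in combinations(sorted(set(persons)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Toy data: each inner list stands for the persons tagged in one document.
docs = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "C"],
    ["A", "B", "B"],
]
edges = cooccurrence_edges(docs, min_count=2)
print(edges)
```

Raising `min_count` prunes weak ties, which is what keeps a graph drawn from tens of thousands of documents legible; the threshold of 9 in the figure plays that role.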

Hilde De Weerdt is Professor of Chinese History at the Leiden Institute for Area Studies. She is the author of an intellectual history of the civil service examinations, titled, “Competition over Content: Negotiating Standards for the Civil Service Examinations in Imperial China (1127-1276)” (Harvard University Asia Center, 2007). Her research focuses on the question of how social networks shaped Chinese politics. Her interests in intellectual and political history, information technologies, social networks, and digital research methods have also led to her involvement in several comparative and digital humanities projects including “Communication and Empire: Chinese Empires in Comparative Perspective” (funded by the European Research Council, 2012-17) and “DID-ACTE: Digging into Data: Automating Chinese Text Extraction” (funded by the UK Arts and Humanities Research Council, the Joint Information Systems Committee, and the National Endowment for the Humanities, 2014-2016). She is the co-editor of “Knowledge and Text Production in an Age of Print–China, Tenth-Fourteenth Centuries” (Brill, 2011). Her most recent book, “Information, Territory, and Networks: The Crisis and Maintenance of Empire in Song China” (Harvard University Asia Center, 2015), takes a fresh look at the question of how the ideal of the unified territorial state took hold in Chinese society. Image credit: Hilde De Weerdt.