Written by Paul Vierthaler.

The digital humanities (DH) encompass a variety of methodological innovations that have become increasingly popular among humanists. Although DH is proving quite effective as a means of studying human culture, scholars of imperial Chinese studies in the Western academy are still coming to grips with what mass digitization and the prospect of quantitative analysis mean for the field at large.

Historians of China were the first in the field to realize the potential of computer-aided research, and they began creating large databases of historical information amenable to quantitative analysis very early. In the 1970s, Robert Hartwell was among the pioneers when he began storing prosopographical data in a database. Among literary scholars, the first digital efforts focused on digitizing texts, but the transformed documents were often not as clearly amenable to digitally aided analysis.

In recent years, historians of imperial China have readily engaged with both social network analysis and GIS (geographic information systems). The value of conceptually and geographically mapping historical information was clear at an early stage, and sophisticated software has made this analysis easy to perform. Combined with accessible geographical and social data, available through places such as the University of Michigan’s China Data Center and Harvard University’s Center for Geographic Analysis, or derived from databases like the China Biographical Database Project, this software has supported a flowering of projects that effectively leverage these methods.

Although DH has already reached a fairly high degree of sophistication within imperial Chinese historical studies, it has yet to significantly permeate imperial Chinese literary studies. There are many reasons why this might be the case. One is that there is currently no universally adopted character encoding for Chinese. Unicode is emerging as the standard encoding, but any given document might instead be encoded in a simplified character standard such as GB18030 or GB2312, or a traditional character standard such as Big5. This makes applying software designed for Western languages exceedingly difficult.
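The encoding problem can be illustrated with a minimal Python sketch. It assumes only the codecs that ship with Python's standard library; the function name `decode_candidates` and the candidate list are hypothetical choices made for this illustration, not part of any tool mentioned here. The function tries each candidate encoding and keeps every one that decodes without error.

```python
# Candidate encodings named in the paragraph above, plus UTF-8 for Unicode text.
CANDIDATES = ("utf-8", "gb18030", "gb2312", "big5")

def decode_candidates(data, candidates=CANDIDATES):
    """Return {encoding: decoded_text} for every candidate that decodes cleanly.

    This is a naive illustration, not a reliable detector: the same bytes
    can decode without error under several encodings, and only one of the
    results will be readable Chinese.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding; skip it.
            continue
    return results

if __name__ == "__main__":
    raw = "紅樓夢".encode("big5")  # bytes as they might sit in a Big5 file
    for encoding, text in decode_candidates(raw).items():
        print(encoding, repr(text))
```

Because more than one candidate may succeed, the output still has to be checked by eye or by a statistical detector, which is part of why mixed encodings remain a practical obstacle.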

However, the largest technical difficulty relates to processing natural language documents. Many quantitative methodologies that show promise in literary studies in other linguistic contexts are hampered by the lack of an accurate parser for texts written in classical Chinese. There is a fundamental computational difficulty in automatically detecting “words”. In Western languages, words are usually delimited by white space or punctuation. The lack of punctuation and white space in imperial Chinese texts represents a significant barrier to entry. Computer scientists and linguists have already developed reasonably accurate methods to parse modern Chinese texts into words, often relying on supervised machine learning algorithms (see, for example, the Stanford Word Segmenter), but these methods do not work well for classical texts. Quite recently, scholars in China have begun to crack the secrets of parsing classical works using unsupervised machine learning algorithms.[1] In the meantime, most researchers depend on n-gram-based models.
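A character n-gram model sidesteps word segmentation by counting every run of n adjacent characters. The following Python sketch shows the basic idea under stated assumptions: the function name `char_ngram_counts` is hypothetical, only characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are counted, and n-grams are not allowed to span punctuation or white space. Real projects may make different choices on each point.

```python
import re
from collections import Counter

# Runs of characters from the basic CJK Unified Ideographs block.
# Rarer characters in the extension blocks are ignored in this sketch.
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")

def char_ngram_counts(text, n):
    """Count overlapping character n-grams within each unbroken run of Han characters."""
    counts = Counter()
    for run in HAN_RUN.findall(text):
        # Slide a window of width n across the run, one character at a time.
        for i in range(len(run) - n + 1):
            counts[run[i:i + n]] += 1
    return counts

if __name__ == "__main__":
    line = "道可道，非常道"
    print(char_ngram_counts(line, 1).most_common(3))
    print(char_ngram_counts(line, 2).most_common())
```

Counts like these can feed frequency comparisons between texts without any decision about where one word ends and the next begins.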

Additionally, many projects, both historical and literary, depend on access to fully digitized Chinese texts. The digitization of large historical archives and literary corpora is a critical part of the digital humanities. Such projects have been ongoing for many years, particularly in Asia. The Scripta Sinica database dates back to the 1980s. The electronic version of the Siku quanshu (Complete Library in Four Branches of Literature) was first published in the late 1990s. These digital archives allow scholars to use large corpora of imperial texts for very basic analysis, such as keyword searches. These projects are incredibly valuable, but their structure often hampers more creative digital projects. The texts within these databases largely exist behind paywalls, making many intensive quantitative analyses difficult to perform.[2]

Easy and free access to out-of-copyright materials is a necessary condition for wider adoption of the digital humanities in imperial Chinese studies. Fortunately, the field is trending toward open access, and digital copies of out-of-copyright works are becoming freely available. While early endeavors, such as the University of Virginia’s Chinese Text Initiative, digitized a limited selection of classical texts, many projects now make very large corpora easily accessible. Some of these projects are based in China (wenxian.fanren8.com), while others are based in the Western world (www.ctext.org). Still others, such as the Ten Thousand Rooms project, are in development and promise to further democratize the creation of, and access to, high-quality digital copies of Chinese literary and historical works.

Increased access to resources is a key component of the current growth of DH in imperial Chinese studies, but other factors play a similarly important role. The development of analytical tools that make research faster and more accurate, and that sometimes even open up new avenues of research, is an important trend in the digital humanities in all fields, and imperial Chinese studies is no exception. These new tools are expanding research possibilities and offering new insights into questions that have been under active investigation for many years.

Dictionaries are the most widespread of these tools, and electronic editions have rapidly developed. For example, Pleco allows users to easily search many dictionaries, including the Hanyu da cidian, using traditional and handwritten inputs. It also supports searches by optical character recognition through the phone’s camera and includes text-to-speech capabilities.

Some of these tools perform dual functions as reference and analytical tools. The open-source China Biographical Database (CBDB), which developed out of Hartwell’s earlier databases, is a prime example. Although it contains biographical information on hundreds of thousands of historical figures in China, and can be used as a simple reference work, it also enables fundamentally new modes of analysis. For example, it can export historical social networks into network analysis software, allowing scholars to study the shifts and nuances of imperial social and cultural ties at scales impossible in the past.

Other tools combine functions that appeal to both historians and literary scholars. The MARKUS project, centered at Leiden University, represents an important example of this type of work. This automatic markup tool allows users to extract names, geographical locations, and other features from imperial Chinese texts. It also links to other digital tools such as the CBDB and ctext.org. In doing so, it provides an easy-to-use platform for scholars new to the digital humanities.

While DH in imperial Chinese studies has lagged behind other fields in the past, it is now rapidly developing. Social network analysis and GIS digital mapping technologies already occupy an established space within imperial digital humanities. More recently, text analysis of large imperial textual corpora has entered the picture. In spite of the technological difficulties, scholars are now starting to use advanced natural language processing and text mining techniques, such as topic modeling and stylometric analysis, to leverage large sets of texts, many of which are available on the internet.

The digital humanities have a bright future in imperial Chinese studies. More scholars are becoming conversant in the variety of analytical possibilities these technological developments make available. We are slowly seeing more original research that applies digital analysis in dissertations and articles. Soon it will find an established place among more traditional modes of scholarly analysis.

Paul Vierthaler is currently the 2015-2016 Digital Humanities Postdoctoral Fellow and a Visiting Assistant Professor at Boston College. He specializes in Ming and Qing dynasty literature and the technical digital humanities. His current research focuses on late Ming and early Qing literary representations of recent events, late Imperial print culture and history, genre analysis, and authorship studies. Image credit: CC by yuyasekiguchi/Flickr.

Note: a previous version of this article mistakenly named Charles Hartwell as the creator of the database. It was actually Robert M. Hartwell (1932 – 1996).


[1] A paper on this new method is in revision. Ke Deng, e-mail message to author, December 14, 2015.

[2] Scripta Sinica allows free access to portions of its contents.