Online Encyclopedias, Distributed Oversight and Identifying Censorship

Written by Jason Q. Ng.

In 2008, Baidu’s chief scientist William Chang said, “There’s, in fact, no reason for China to use Wikipedia . . . It’s very natural for China to make its own products.” Today Hudong and Baidu Baike greatly eclipse the Chinese-language version of Wikipedia (referred to here as Wikipedia China) despite (or because of) the censorship known to take place on the sites. However, identifying outright instances or patterns of censorship can be difficult because the content is (mostly) generated and overseen by users. To get around this, a project has begun that performs a large-scale comparison of the three services, matching thousands of Chinese-language Wikipedia articles with their in-China counterparts, in order to identify the “content gaps” in the two baike (Chinese for “encyclopedia,” which we use to refer to Hudong’s and Baidu’s online encyclopedias). Censorship, or at the very least anomalies in the generation of content, might be identified by articles that don’t exist, by “protected” articles that are not editable by regular users, and by articles that are much shorter than those on Wikipedia China. “Might” is emphasized because of the distributed nature of oversight on these online encyclopedias, where not only governments but also companies and users get to play the role of content gatekeeper. This decentralization makes it harder to attribute responsibility for apparent censorship, a topic this report will explore in detail by examining how it functions in these online encyclopedias.

As William Chang of Baidu foretold, mainland Chinese netizens have gravitated toward local products such as Hudong and Baidu Baike, leaving Wikipedia China to be edited and read primarily by users in Taiwan, Hong Kong, and the rest of the Chinese diaspora. Today, in terms of raw visitors and article count, Hudong and Baidu Baike dwarf Wikipedia China, which has roughly 700,000 articles versus over 5 million in each of the two baike. While China’s sporadic blocking of access to Wikipedia at various points over the past ten years has certainly been a factor in limiting Wikipedia China’s growth among mainland users, Baidu and Hudong’s dominance may owe more to Baidu’s entrenched position as the dominant search engine in China (thus allowing for cross-site “partnerships” and synergies) and to Hudong’s bevy of features built into its custom wiki and social networking platform.

This project began with a question: everyone “knows” Hudong and Baidu Baike, like all Chinese websites, have to restrict certain kinds of content on their websites, but is it possible to empirically prove that censorship is taking place on the sites? Furthermore, what would we consider signs of censorship on Hudong and Baidu Baike?

The most obvious would be the lack of articles on topics that are thought to be notable. Thus, using Wikipedia China as a control, we can propose that if an article on, say, 上海帮 (The Shanghai Gang) exists on Wikipedia China, then, barring censorship, it should exist on Hudong and Baidu Baike, especially since Hudong and Baidu Baike have a much larger library of entries. However, one can’t always attribute the lack of an entry to censorship, since Wikipedia China itself isn’t a perfect control. Though entries in Wikipedia China are assumed to be of interest to Hudong and Baidu Baike users, and thus should have articles in those encyclopedias, one must keep in mind that Wikipedia China does tend to have a Taiwanese bent due to its userbase. Still, using missing articles as a potential indicator of censorship, especially when the missing article is a long one on Wikipedia China, is a reasonable start.
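The missing-article check can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the function name `find_missing`, the `min_length` parameter, and the placeholder titles and lengths are all hypothetical, and the sketch assumes titles have already been matched across the encyclopedias.

```python
def find_missing(wikipedia_lengths, baike_titles, min_length=0):
    """Return (title, length) pairs for Wikipedia articles with no baike counterpart.

    wikipedia_lengths: dict mapping article title -> body length in characters
    baike_titles: set of titles that exist on the baike being checked
    min_length: ignore Wikipedia articles shorter than this (hypothetical knob)
    """
    missing = [
        (title, length)
        for title, length in wikipedia_lengths.items()
        if title not in baike_titles and length >= min_length
    ]
    # Longest missing articles first: a long article with no counterpart
    # is a stronger signal than a missing stub.
    return sorted(missing, key=lambda pair: pair[1], reverse=True)


# Invented placeholder data, for illustration only.
wikipedia_lengths = {"条目甲": 12000, "条目乙": 800, "条目丙": 5000}
baidu_titles = {"条目乙"}

missing_from_baidu = find_missing(wikipedia_lengths, baidu_titles)
```

A missing title flagged this way is only a candidate for closer reading, since Wikipedia China is itself an imperfect control.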

Second, comparisons could also be made between the lengths of the articles in the different encyclopedias. An article might exist on all three services, but the Hudong and Baidu Baike versions might be drastically shorter than their Wikipedia counterpart. For instance, the main body of the Wikipedia entry for 艾未未 (Ai Weiwei) is over 20,000 characters long (spaces removed), while the Baidu entry for him clocks in at 2,000 characters and the Hudong one at 3,500. This is despite the fact that the Baidu articles sampled in this project are on the whole longer than Wikipedia’s. Discrepancies of the sort seen in the Ai Weiwei article might simply reflect greater interest in the topic outside mainland China than within, or, again, they might be another potential indicator of censorship.
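The length comparison can be illustrated with a short Python sketch. The function names and the 0.25 threshold are assumptions made for this example, not values from the project; the character counts in the usage lines are the rounded Ai Weiwei figures quoted above, represented by filler text.

```python
def body_length(text):
    """Count characters after removing all whitespace ("spaces removed")."""
    return len("".join(text.split()))


def length_ratio(baike_text, wikipedia_text):
    """Baike body length as a fraction of the Wikipedia body length."""
    wiki_len = body_length(wikipedia_text)
    if wiki_len == 0:
        return None  # no meaningful ratio against an empty article
    return body_length(baike_text) / wiki_len


def is_much_shorter(baike_text, wikipedia_text, threshold=0.25):
    """Flag a baike article whose ratio falls below an arbitrary threshold."""
    ratio = length_ratio(baike_text, wikipedia_text)
    return ratio is not None and ratio < threshold


# Filler strings standing in for article bodies of the quoted sizes.
wikipedia_body = "文" * 20000
baidu_body = "文" * 2000
hudong_body = "文" * 3500

baidu_ratio = length_ratio(baidu_body, wikipedia_body)    # about 0.1
hudong_ratio = length_ratio(hudong_body, wikipedia_body)  # about 0.175
```

Because the sampled Baidu articles are on the whole longer than Wikipedia's, a per-article ratio well below 1 stands out more than it would if the baike were uniformly shorter.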

Third, some especially sensitive or controversial articles cannot be edited except by users with much higher privileges than ordinary members. Being unable to change an article is not in and of itself a sign of censorship; for instance, Wikipedia “protects” certain articles to prevent vandalism and pointless back-and-forth “edit wars.” Baidu and Hudong no doubt have similar intentions in mind as well, but what matters here is a matter of transparency and authority. On transparency: while Wikipedia publishes a list of all protected pages, as far as we can tell, no corresponding list exists for Hudong and Baidu Baike. On authority: was it the choice of editors and users at Hudong and Baidu to classify certain articles as locked, or was the decision made higher up? Was there an open discussion of such matters, or was a list handed down from somewhere above? Interviewing longtime editors of Baidu Baike and Hudong might provide insight into such questions, but for now, we have some data to start with.

Finally, there are more subtle ways to disrupt access to information, many of which we will no doubt uncover as we continue to sift through the data. One that we’ve noticed is the “failure” by Hudong and Baidu to redirect certain article titles the same way that Wikipedia does. For instance, a search for 艾神, a laudatory nickname for Ai Weiwei meaning “God Ai,” properly redirects to the Wikipedia article for him. Hudong and Baidu Baike don’t perform such redirects. Again, whether this is a conscious decision or merely an inadvertent one cannot be answered by looking at this one example. However, by looking at such cases in the aggregate, one might be able to make a more legitimate claim that something is going on.
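Looking at redirects “in the aggregate” could be approached along the following lines. This Python sketch is hypothetical: the function `redirect_coverage` and the dictionary layout are assumptions, and every entry in the sample data other than the 艾神 example is an invented placeholder.

```python
def redirect_coverage(wikipedia_redirects, baike_redirects):
    """Measure how many Wikipedia redirects a baike reproduces.

    Both arguments map an alias (search term) to the title it redirects to.
    Returns (share of Wikipedia redirects mirrored, sorted unmatched aliases).
    """
    if not wikipedia_redirects:
        return None, []
    unmatched = sorted(
        alias
        for alias, target in wikipedia_redirects.items()
        if baike_redirects.get(alias) != target
    )
    coverage = 1 - len(unmatched) / len(wikipedia_redirects)
    return coverage, unmatched


# 艾神 -> 艾未未 is the redirect described above; the rest is invented.
wikipedia_redirects = {"艾神": "艾未未", "别名甲": "条目甲"}
baike_redirects = {"别名甲": "条目甲"}

coverage, unmatched = redirect_coverage(wikipedia_redirects, baike_redirects)
```

A single unmatched alias says little on its own; it is the pattern across many unmatched aliases, for example whether they cluster around particular topics, that would support a stronger claim.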

Many new media outlets such as Sina Weibo give users the ability to generate the content that goes on the website: in essence, to be not only their own programmer or broadcaster but also the producer. However, because such websites host their users’ content, they are also in charge of regulating that content and ensuring that it complies with all Chinese laws, regardless of how vague such regulations might be. Thus it can be unclear who is responsible for censorship that takes place on these sites. Is it the government that mandated certain topics are off-limits, or is it the company that restricts the content? This ambiguity is an intentional feature of the decentralized system of information control that Chinese authorities have developed.

Censorship is further distributed on the baike because users are now not only their own programmer and producer but also serve in an oversight capacity as editors. Unlike on Wikipedia, unregistered users cannot edit or create articles, but for the most part, registered users can edit most general articles, and as they engage with the site longer, they gain greater and greater ability to edit and oversee the website. Thus, there are always at least three potential sources behind an article that doesn’t exist, is shorter, or is locked on Hudong or Baidu Baike: government entities, private companies, or users themselves. Judging whether these cases are genuine instances of governmental censorship or due to explainable, organic reasons can be quite tricky. Because of the multiple layers of oversight, what may appear to be outright censorship may be a less malicious (though no less pernicious) case of self-censorship.

By looking not only at what content and data don’t exist on the baike, but also at the content that does, this project will investigate what knowledge and information is deemed fit for public display. If articles are shorter on Hudong and Baidu, what information do they carry? Does this information reveal anything about the authors’ intentions? By examining the topics and articles that are left visible in these baike and considering the motivations of those who seek out, view, edit, and approve of these articles, this project hopes to offer a more nuanced view than the typical narratives about censorship in China. Trying to understand what sorts of expressions netizens are making via these online encyclopedias, despite whatever censorship might be taking place, is as interesting as the potential censorship itself. This project will hopefully push us to once again consider the many complexities of information control in environments where oversight of content has been decentralized to companies and users, environments in which it is increasingly hard to identify traditional instances of censorship.

Jason Q. Ng is Research Fellow at The Citizen Lab, University of Toronto. He is the author of a recent, highly recommended, book, Blocked on Weibo.

