The “fragility” of our knowledge base has resurfaced as a profound concern in China Studies.Footnote 1 Restrictions on on-the-ground research have been steadily increasing over the past years.Footnote 2 Online access has also witnessed increasing controlsFootnote 3 and a decline in the publication of, for example, court verdicts.Footnote 4 Consequently, researchers and scholars are now confronted with a full spectrum of access challenges, which affects both research conducted within China and research conducted from afar using the internet. Restrictions now extend well beyond just a few sources dealing with sensitive topics such as human rights.
This paper focuses on one specific aspect of this “fragility”: policy documents.Footnote 5 China's political system is “text-centred,”Footnote 6 and official documents are the tools that transform abstract ruling ideology into daily politics.Footnote 7 Perhaps because of the relative ease of obtaining policy documents, there is a rapidly emerging field that researches policy change by analysing textual changes in official documents. In recent issues of The China Quarterly, for instance, Abbey Heffer and Gunter Schubert analyse the introductions to policy documents taken from PKULaw, the Peking University database of laws and policies (Beida fabao 北大法宝) to illustrate the increasing use of policy experimentation in contemporary China,Footnote 8 and Yuen Yuen Ang mines central documents to research changes in policy communication.Footnote 9 Scholars also quantitatively use policy documents to examine the implementation of policy in specific domains, such as aging policiesFootnote 10 and the Belt and Road Initiative, among others.Footnote 11
Regrettably, as this paper discusses in more detail below, these studies seldom discuss whether their findings could be affected by changes in the availability of data over time, as opposed to actual changes in policy. The urgency of mitigating missingness has already been demonstrated in other contexts related to official documents from China, such as court judgments,Footnote 12 but similar insights are non-existent in the field of policy.
This paper discusses how researchers can manage variation in data availability when analysing official documents from China. To illustrate the need to reflect on missingness, it first reviews how the existing literature uses policy documents as data. It then discusses how the implementation of China's Open Government Information (OGI) framework can affect availability of data, before explaining the methods it uses to identify and analyse variation. The paper continues by offering empirical evidence of three types of variation. It concludes with methodological strategies and best practices to mitigate these. Altogether, it presents a word of caution when using these data in studies of Chinese policy and politics.
The Importance of Reflecting on Missingness in Policy Studies
Superficially, the emergence of access restrictions in China may seem not to apply to the study of policy documents. PKULaw remains available, without major restrictions, although it requires a subscription to access its full contents and has taken extensive measures to prevent the scraping of content. Government websites, like the State Council's database of central and ministerial-level policies, are mostly available without restrictions, too. This can make policy databases a highly convenient dataset for many researchers.
Perhaps because of a lack of overt restrictions, academic use of policy documents is rarely accompanied by a discussion of data limitations. Few papers (albeit with exceptions) explicitly discuss whether observations could be influenced by the varying availability of data, as opposed to actual changes in the documents. For example, while Yan Nan and colleagues highlight that “the number of aging policies in China increased rapidly since 2000” and suggest that this reflects changing pressures on the government,Footnote 13 they do not discuss the possibility that the Chinese government has increased not the number of policy documents it formulates but only the number it actually releases to the public.Footnote 14 Since China's OGI framework only began to take shape in the early 2000s and was formalized nationally in 2008, this is a realistic concern: there are currently nearly 15,000 official central-level documents available on PKULaw that were originally published in 2008, versus only 2,610 for 1990.
Although some researchers attempt to mitigate variation in data availability by using normalized data when studying changes over time,Footnote 15 normalization as a sole mitigating approach requires that variation is randomly distributed and that there is no variation in transparency for the specific categories they measure. Unfortunately, these papers neither explicitly articulate these limitations nor discuss the motivations for their mitigating strategies.
This implicit assumption is a risky one, as observable patterns would dramatically change if authorities, from one day to the next, decided to improve or restrict publication of documents in these categories. While established statistical strategies exist for mitigating randomly distributed variation, this is not the case for non-random variation.Footnote 16 Yet, in the context of archival censorship, Glenn Tiffert finds that omitted articles were not distributed randomly.Footnote 17 In the study of court judgments, scholars have highlighted that missingness affects particular categories more than others.Footnote 18 Although no studies to date have reflected on this in the field of policy, authorities do formulate annual guidelines on what information should be prioritized for (non-)disclosure. In the case of policy experimentation, for instance, some State Council documents call for an increase in publicity of certain government pilots.Footnote 19 This indicates that transparency may fluctuate for this category, which creates a risk that such variation is conflated with actual policy change.
This point is not to imply that any current findings are invalid. Heffer and Schubert, for instance, triangulate their findings with qualitative case studies and interviews with officials. Ang notes explicitly that her work should be seen as a “pilot” for what might be possible with these data.Footnote 20 The point is that it is crucial to discuss missingness and develop best practices to mitigate it.
Understanding Variation through the Lens of Open Government Information
To understand how variation might occur, it is crucial to consider the context in which government documents are disseminated – the OGI framework. While myriad studies have focused on the OGI as an object of study, few have examined it in light of the opportunities for analysis granted by information published under the OGI. This section argues researchers need to carefully consider two factors. First, the framework of the OGI ensures that the availability of data has never been consistent. Second, in recent years, the central government has increasingly raised security concerns related to the OGI, which escalates the urgency of research into data missingness.
Authorities principally regard the OGI, which was formalized nationally in 2008, as a means to an end.Footnote 21 This reflects how law in China remains narrowly purpose oriented.Footnote 22 The ends it serves include resolving principal-agent dilemmas in policy implementation,Footnote 23 fighting corruptionFootnote 24 and informing citizens and businesses about the regulations they need to comply with. As a result, while an increasing number of regulations have institutionalized disclosure of information deemed essential to the greater public, which in many regards falls in line with international practices, there are a great number of broad exemptions from disclosure and little judicial elaboration on their meaning.Footnote 25
The implication is that authorities may interpret vague guidelines or legal norms in accordance with their (shifting) priorities or institutional constraints. In the Chinese-language literature, some scholars have highlighted “diametrically opposed” administrative practices that are rooted in different applications of exemptions related to “work secrets” and “internal affairs.”Footnote 26 Others refer to the general inability and unwillingness of some government agencies to implement the OGIFootnote 27 and the persistence of large gaps between various departments.Footnote 28 In the context of this study, the result is a variation in data availability, which must be mitigated.
Since the 2020s, the central authorities have increasingly expressed heightened security concerns, which may further compound the variation in data availability. Although concerns with the political risks related to the OGI are not new, two notable changes in policy discourse since the 2020s hint at changing priorities. First, recent documents are downplaying commitments to transparency. In 2023, the State Council amended its Work Regulations and removed two tifa 提法Footnote 29 that had featured in almost all high-level documents relating to the OGI since 2014: to make transparency a fundamental principle of government work and to make disclosure the norm.Footnote 30 Moreover, the regulations now emphasize the goals of and considerations for disclosure (“to disclose according to law”) over disclosure as a principle for its own sake. This changes the nature of the effort. While “transparency” implies a higher principle, the new language emphasizes the instrumentalist nature of the OGI.
Second, documents are expressing heightened security concerns over disclosure. For instance, the emphasis of the 2022 version of the annual “Work priorities for open government affairs” was on improving the OGI confidentiality review system, strictly conducting confidentiality reviews, preventing leaks not just of state secrets but also of “sensitive information” and preventing risks caused by data aggregation.Footnote 31 This was the first time a State Council-level document had mentioned these types of risk. Furthermore, it demanded that authorities “comprehensively consider the purpose, effect, and subsequent impact of disclosure.”Footnote 32 Finally, the document encouraged the development of “scientific and rational” ways to determine the scope of publication, clarifying that authorities should consider disclosing some information to selected stakeholders only.
It is impossible to say, at this time, how these changes in policy discourse will be interpreted and implemented by state agencies. However, when seen in the broader context of information sources disappearing, it appears highly unlikely that they will be completely ignored. In fact, concrete indicators of change can already be seen. Most strikingly, the State Council did not promulgate or publish the 2023 version of its “Work priorities for open government affairs.” This is the first time since 2012 that it has not done so. These annual publications are important calls to action and failure to publish them marks a significant departure in transparency practices. More indicators of this change are discussed below in the results section.
Methods
The remainder of this paper empirically identifies and discusses variation in policy transparency, drawing from earlier scholarly precedents in missingness analysis.Footnote 33 Between 2021 and 2023, custom-made web scrapers retrieved around 310,000 policy and policy-adjacent documents from over 80 official websites of national and provincial Party and state organs.Footnote 34 This section provides a brief overview of the methods used; a more detailed discussion can be found in the Appendix.Footnote 35
Variation across time
To analyse variation across time, this paper analyses the serial numbers (fawen zihao 发文字号) of Chinese policy documents.Footnote 36 Because these numbers run sequentially, gaps point to unpublished documents: we may, for instance, have access to documents numbered 1–5 and 7–10, while number 6 is unavailable to the public. I first applied the estimator from the “German tank problem” to estimate the actual total number of documents issued (i.e. including any documents issued after the highest known number), before mapping these numbers to find patterns in transparency over time.Footnote 37
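The estimation step can be sketched as follows. This is a minimal illustration rather than the code used for this paper: it assumes the standard frequentist form of the German tank estimator, N = m + m/k - 1 (with m the highest serial number observed and k the number of distinct serial numbers observed), and it assumes that each series restarts at 1 every year. The function names are hypothetical.

```python
def estimate_total_issued(serials):
    """Estimate how many documents were issued in a series, published or not.

    Assumes the standard frequentist German tank estimator N = m + m/k - 1,
    where m is the highest serial number observed and k is the number of
    distinct serial numbers observed.
    """
    observed = set(serials)
    if not observed:
        raise ValueError("at least one serial number is required")
    m, k = max(observed), len(observed)
    return m + m / k - 1


def transparency_rate(serials):
    """Share of the estimated total that is publicly available."""
    observed = set(serials)
    return len(observed) / estimate_total_issued(observed)


# The example from the text: numbers 1-5 and 7-10 are public, number 6 is not.
published = [1, 2, 3, 4, 5, 7, 8, 9, 10]
estimated_total = estimate_total_issued(published)
rate = transparency_rate(published)
print(f"estimated total: {estimated_total:.2f}, transparency: {rate:.1%}")
```

With nine of an estimated ten or so documents observed, the sketch yields a transparency rate of roughly 89 per cent. For very small series the estimate is noisy, which is one reason to restrict the analysis to series with representative data.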
Nationally, this analysis covers the four principal types of state documents: the guofa 国发 and guobanfa 国办发, which represent the high-authority documents issued by the State Council and its General Office, and the guohan 国函 and guobanhan 国办函, which generally include organizational documents. Provincial documents mirror this structure, although not all consistently provide the document numbers of their policies. Hence, I only include provinces with representative data in this analysis.
Variation across policy types and content
To analyse variation across policy types and policy content, this paper analyses policy referrals. Policy documents in China regularly refer to other policies, either to signal alignment with higher-level directives or to indicate future policy releases, even if the higher-level directive has never been made public. Custom scripts parsed these titles from the dataset and cross-referenced them with all published official documents. After tokenizing the titles, the “Fightin’ words” algorithm identified discriminating words for public and non-public documents.Footnote 38 Afterwards, I manually selected the most context-relevant terms for analysis.Footnote 39
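For readers unfamiliar with the “Fightin’ words” approach, the sketch below shows one common formulation: a log-odds ratio with a Dirichlet prior, standardized into a z-score. It is an illustration under stated assumptions, not this paper's implementation. In particular, the choice of prior (pooled word counts multiplied by a hypothetical `prior_scale` parameter) and the English placeholder tokens are assumptions made for the example. Positive scores indicate association with public documents and negative scores with non-public ones.

```python
import math
from collections import Counter


def fightin_words(public_tokens, nonpublic_tokens, prior_scale=1.0):
    """Return a z-scored log-odds ratio per word (positive = more 'public').

    Assumption: the Dirichlet prior for each word is its pooled count across
    both groups times prior_scale. Requires at least two distinct words.
    """
    y_pub, y_non = Counter(public_tokens), Counter(nonpublic_tokens)
    vocab = set(y_pub) | set(y_non)
    alpha = {w: prior_scale * (y_pub[w] + y_non[w]) for w in vocab}
    alpha_0 = sum(alpha.values())
    n_pub, n_non = sum(y_pub.values()), sum(y_non.values())

    scores = {}
    for w in vocab:
        # Log-odds of the word within each group, smoothed by the prior.
        lo_pub = math.log((y_pub[w] + alpha[w])
                          / (n_pub + alpha_0 - y_pub[w] - alpha[w]))
        lo_non = math.log((y_non[w] + alpha[w])
                          / (n_non + alpha_0 - y_non[w] - alpha[w]))
        # Approximate variance of the difference in log-odds.
        variance = 1 / (y_pub[w] + alpha[w]) + 1 / (y_non[w] + alpha[w])
        scores[w] = (lo_pub - lo_non) / math.sqrt(variance)
    return scores


# Placeholder tokens standing in for segmented words from document titles.
public = ["plan", "plan", "opinion", "work"]
nonpublic = ["report", "report", "request", "work"]
scores = fightin_words(public, nonpublic)
for word, z in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {z:+.2f}")
```

Words that occur equally often in both groups, such as “work” in this toy example, score zero, which is why generic terms need to be screened out manually afterwards.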
Although it is impossible to identify the actual date of publication for those documents of which we do not have a full text, I used the date a policy was first mentioned elsewhere as a proxy timestamp. I applied a dictionary method, whereby each “topic” is defined by a series of keywords, to map patterns over time for different topics.
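A dictionary method of this kind can be sketched in a few lines. The keyword lists, titles and function names below are hypothetical placeholders chosen for illustration; they are not the lists used for this paper. The sketch assumes each record carries a title, a proxy year and a flag indicating whether the full text is public.

```python
from collections import defaultdict

# Hypothetical keyword lists, for illustration only.
TOPIC_KEYWORDS = {
    "environment": ["生态", "环境"],
    "education": ["教育"],
    "science and technology": ["科技", "科学技术"],
}


def code_topics(title, topic_keywords=TOPIC_KEYWORDS):
    """Return every topic for which at least one keyword occurs in the title."""
    return sorted(
        topic
        for topic, keywords in topic_keywords.items()
        if any(keyword in title for keyword in keywords)
    )


def transparency_by_topic_year(docs):
    """docs: iterable of (title, proxy_year, is_public) tuples.

    Returns {(topic, year): share of coded documents that are public}.
    """
    counts = defaultdict(lambda: [0, 0])  # (topic, year) -> [public, total]
    for title, year, is_public in docs:
        for topic in code_topics(title):
            counts[(topic, year)][1] += 1
            if is_public:
                counts[(topic, year)][0] += 1
    return {key: public / total for key, (public, total) in counts.items()}


# Made-up titles; a title can be coded with more than one topic.
docs = [
    ("关于加强生态环境保护的意见", 2021, True),
    ("科技创新专项规划", 2021, False),
    ("关于推进生态环境领域科技创新的通知", 2021, True),
]
rates = transparency_by_topic_year(docs)
print(rates)
```

Because matching is done on substrings, short keywords can produce false positives, so keyword lists of this kind need manual checking against a sample of coded titles.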
Variation owing to (dis)appearing documents
To assess the severity of disappearing or deleted documents, another custom script randomly sampled 50 links for each of the source websites (over 4,000 links in total) and verified whether the full content was still available. The sample consists of documents from 2021 exclusively, as this was when scraping started and when data should not be affected by deletion. For each unavailable document, I used the Wayback Machine to determine whether the link was unavailable because of website updates or because the document was individually deleted from the website.
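The sampling and tallying steps can be sketched as below. This is a simplified illustration, not the script used for this paper: the status labels (`available`, `site_error`, `site_update`, `deleted`) and function names are hypothetical, and the step that actually checks each link, including the Wayback Machine lookup, is left out because it depends on network access.

```python
import random
from collections import Counter


def sample_links(links_by_site, per_site=50, seed=0):
    """Draw up to per_site links at random, without replacement, for each site."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return {
        site: rng.sample(list(links), min(per_site, len(links)))
        for site, links in links_by_site.items()
    }


def status_shares(statuses):
    """Turn a list of per-link status labels into shares of the total."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return {status: count / total for status, count in counts.items()}


# Toy input: one site with many links, one with fewer than the sample size.
links_by_site = {
    "site_a": [f"site_a/doc/{i}" for i in range(120)],
    "site_b": [f"site_b/doc/{i}" for i in range(10)],
}
sample = sample_links(links_by_site)
print({site: len(links) for site, links in sample.items()})

# Hypothetical outcome of checking ten links by hand or by script.
shares = status_shares(["available"] * 8 + ["deleted"] * 2)
print(shares)
```

Sampling a fixed number of links per site, rather than in proportion to site size, keeps small websites visible in the results but means the pooled shares are not weighted by the number of documents each site hosts.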
Alongside making documents disappear, authorities can also make documents appear by retroactively publishing documents. To assess this, I automatically calculated the number of days between the issuance of a document, which is when it is formalized but not necessarily released to the public, and its publication. As not all government agencies consistently display issuing dates vis-à-vis publishing dates, I again relied on a subsection of agencies with relatively complete data.
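The delay calculation amounts to simple date arithmetic, as the sketch below shows. It assumes that each record provides an issuing date and a publication date, and it groups delays by year of issuance; that grouping and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def mean_delay_by_year(records):
    """records: iterable of (issued, published) date pairs.

    Returns {year of issuance: mean number of days until publication}.
    """
    delays = defaultdict(list)
    for issued, published in records:
        delays[issued.year].append((published - issued).days)
    return {year: mean(days) for year, days in sorted(delays.items())}


# Made-up dates: one long retroactive publication and two routine ones.
records = [
    (date(2017, 3, 1), date(2019, 3, 1)),
    (date(2021, 5, 1), date(2021, 5, 11)),
    (date(2021, 6, 1), date(2021, 6, 21)),
]
delays = mean_delay_by_year(records)
print(delays)
```

Means are sensitive to a handful of very late publications, so reporting the median or the maximum delay alongside the mean can help to flag isolated cases of retroactive publication.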
Results
This section discusses the results of the analysis for each type of variation in turn.
Policy transparency is in decline at the top, yet effects are not uniform
While transparency increased significantly in the early-to-mid 2010s, there have been significant steps backwards in more recent years. Figure 1 displays the transparency rates of State Council documents from 2008 to 2022. It shows that disclosure levels of the top-level guofa and guobanfa documents follow an inverted parabolic shape: increasing from 2008 to the mid-2010s but decreasing thereafter. In 2018, disclosure of guofa documents peaked at 88 per cent and then declined to 54.5 per cent in 2022. For guobanfa documents, disclosure decreased from 88 per cent in 2020 to 75 per cent in 2022. The disclosure of lower-level documents is consistently inconsistent (discussed below).
Figure 2 repeats this analysis for provincial documents and shows that availability there varies greatly, too. Some provinces consistently report high policy transparency rates; the rates for other provinces only started to climb in more recent years. Similarly, while some provinces issue some han 函 documents to the public, others do not. Furthermore, figures for some of the provinces assessed here show significant decreases in more recent years. In 2022, transparency figures for top-level documents from Henan (yuzheng 豫政), Shanghai (huzhengfa 沪政发), Hubei (ezhengfa 鄂政发) and Guangdong (yuefu 粤府) all dropped to their lowest levels in eight or more years. Nevertheless, these decreases remain minor in comparison with the increase in transparency since 2008.
The key determinant of transparency is a policy's relationship to citizens’ daily lives
Another reason patterns are far from uniform is the variation between policy fields and types. Table 1 shows that regulations, plans and guiding documents are associated with disclosure. Meanwhile, internal policy processes such as reports, requests for approvals and evaluations are associated with non-disclosure. This aligns with relevant provisions that require the proactive disclosure of documents that are immediately relevant to citizens’ daily lives but which also contain exceptions for internal processes.
This pattern continues, as shown in Table 2, which shows that topics closely related to people's daily lives are typically more transparent than those related to internal processes, security, the Party and strategy. Moreover, “science and technology,” a topic closely related to ongoing US–China tensions, is also associated with non-disclosure. This is not an artefact in keyword selection: similar terms that are also associated with non-disclosure include “science” (kexue 科学: -2.3), “information technology” (xinxi jishu 信息技术: -2.1) and the Ministry of Science and Technology itself (kexue jishu bu 科学技术部: -2.3).
The most high-profile example is the 14th Five-Year Plan on Science, Technology and Innovation, which has not been released to the public. However, local policy documents since 2021 confirm its existence.Footnote 40 The ongoing technological competition between the US and China is a key driver behind keeping this document out of the public domain. The plan covers many technologies that are subject to geopolitical competition. Furthermore, earlier strategy documents in this field, such as the Made in China 2025 plan, triggered concerns in advanced economies about China's technical capabilities.Footnote 41
In addition to this static picture, Figure 3 presents the transparency levels of different topics over time and demonstrates that the static patterns persist. Topics closely related to people's lives (for example, the environment, education, socioeconomic policy) have witnessed increasing policy transparency rates; however, transparency for topics further removed from daily life (for instance, science and technology, state-owned resources, cadre management, international affairs) has decreased since 2014.Footnote 42
The (dis)appearance of policy documents creates variation in document availability
Figure 4 shows the availability of links to policy documents two years after the date they were originally retrieved. Only 80.2 per cent of links were still available two years later; a further 10 per cent were unavailable owing to issues loading the websites.Footnote 43 The remainder of the links were inaccessible as the documents had actually disappeared.
For 7.7 per cent of all links, the websites had undergone updates to their infrastructure, causing the links to break. Technically, these documents could have been migrated elsewhere, but this is not always the case. One example is a website update by the Ministry of Housing and Urban-Rural Development between autumn 2021 and early 2022. Prior to the update, the website had hosted an extensive archive of local policy documents; however, the website managers did not transfer this archive to the new environment. This led to the disappearance of many documents, such as some of the initial local plans for the social credit system.Footnote 44
Authorities may also delete information on the grounds that it is outdated and no longer relevant to ongoing policy. Policy documents referring to the OGI repeatedly emphasize “cleaning up” outdated and expired content.Footnote 45 While PKULaw has a segment that hosts “cancelled” documents, it is unclear how comprehensive that segment is. Furthermore, it is certain that not all government websites do the same. Where they do not, such documents then disappear from the government websites altogether. Finally, government initiatives can become controversial after publication, which can lead to authorities cancelling the initiative and then attempting to erase all trace of it. For instance, following an online backlash, local authorities scrambled to take down documents that authorized the blacklisting of Chinese citizens who failed to get Covid-tested during the pandemic.Footnote 46 Both types of disappearance relate to information that is consciously deleted from a website and make up about 1.9 per cent of the total links tested here.
The focus on disappearing documents, however, invites a discussion of appearing documents, i.e. those that are released to the public a long time after the policy has been issued internally or come into effect. Figure 5 below displays the mean number of days from issuance of a document to its publication for seven government websites. In some cases, such as for the State Council and the Sichuan government's website, there used to be average delays of one to two years between issuance and publication. However, this delay has since become more standardized, at around 10 to 20 days. This indicates that while retroactive publication used to be a major source of variation in the earlier years of the OGI initiative, it is unlikely to significantly distort findings in more recent years that are based on very large datasets.Footnote 47 Nevertheless, retroactive publication following a long delay still occurs on a small scale. Henan's provincial government, for instance, delayed publication of four documents, which were originally issued in 2017, by two to three years. Hence, it remains important to reflect on the missingness problem in this domain, too, especially for studies using smaller subsets of data.
Mitigating Variation
As this paper demonstrates, there is significant variation in policy transparency and document availability over time. Transparency originally improved between 2008 and the mid-2010s. Today, however, transparency is in decline in several fields, especially in fields where there are related geopolitical tensions. There is also significant variation among types of documents, with top-level policies seeing significantly higher disclosure rates than lower-level documents. Variation among topics appears primarily in the extent to which a topic is related to national security or citizens’ daily lives. Finally, disappearance of documents is a real challenge for research. Thus, studies working with policy data must be open about how they mitigate missingness.
This paper's findings offer several guidelines. The low transparency rates for 2008 across the board indicate that any pre-2008 data are spotty at best; this was also the first year that the OGI regulations were implemented nationwide. Hence, there are fundamental questions surrounding the validity of (quantitative) causal inferences based on policy texts that go further back than this date. For some localities, data are only somewhat representative starting from the mid to late 2010s.
While normalization is a key approach to mitigate missingness across time, variation is not randomly distributed: transparency has increased for certain topics yet decreased for others. Therefore, normalization alone is typically insufficient. Dealing with non-random variation can include controlling for policy type. By selecting only policy types for which disclosure is more standardized (for example, opinions, regulations and plans, instead of notices or reports), there are better chances that findings are not affected by external variation. Similarly, researchers might use this paper's findings on topical or local variation when selecting appropriate case studies. These strategies align with practices used for the study of court judgments, where scholars have recommended avoiding case types that suffer the greatest missingness and where officials have the greatest incentives for selective disclosure.Footnote 48 Another best practice is to combine quantitative inquiry with qualitative research.Footnote 49 Finally, the scholarly community needs to ensure sources remain available despite deletion or access challenges – for instance, by archiving sources through online tools or even creating entirely new archives that are hosted outside China.Footnote 50
This paper's findings have broader implications. First, missingness can be indicative of internal government logics.Footnote 51 Thus, the findings in this paper double as a window into the internal government logics pertaining to policy transparency. More research can be done to add more depth to these findings and further leverage missingness in this and other fields. Second, missingness is not just an issue for policy documents; it affects virtually every study that relies on information sources curated by Chinese authorities. Many of the approaches developed here can also act as a basis for best practices in other fields.
This paper invites broader reflection on the fragility of our knowledge base and the use of convenient datasets in China studies. Policy documents are not propaganda, yet the fact that all these data are available to “us” also suggests that their availability serves a political purpose. The developments highlighted throughout this paper suggest that this curation of information sources is only likely to intensify. Understanding the context in which these sources are produced and what can – and, more importantly, cannot – be learned from them is crucial. While this paper focuses on variation and missingness, it is important to triangulate findings from policy documents (the paper reality) with actual lived experiences. More critical reflection on this is needed.
Acknowledgements
The author is grateful to Rogier Creemers and Florian Schneider for their feedback on earlier versions of this manuscript, as well as to the members of the “Puzzle Group” for their comments. This research is funded by the Dutch Research Council (NWO) under grant No. 406.22.CTW.013.
Competing interests
None.
Appendix: Data and Methods
Data Sources
Data were scraped from over 80 different portals, using the “Open Government Affairs” (zhengwu gongkai 政务公开) subsections of official websites. These include:
It should be noted that the dataset used for this paper is different from PKULaw, which is used in many of the studies cited throughout this paper. Unfortunately, PKULaw has restrictions on automatic retrieval of policies and, in addition, requires a licence to view the full content. This means large-scale analyses of its content are extremely difficult and it was not possible to replicate this paper's analysis on PKULaw. A brief review of data availability between the two data sources suggests that the differences are minor, i.e. in the range of 1–5 per cent. For instance:
Some government websites have also started to implement restrictions on automated retrieval. For instance, the Ministry of Foreign Affairs Spokesperson database only allows retrieval of the last 1,000 results for any query.Footnote 52 This could create additional variation between data sources.
Distinctive Word Analysis
To conduct the distinctive word analysis, this paper relied on the Jieba software to automatically tokenize the titles. Subsequently, it used Jieba to restrict the results to nouns, verbs and adjectives of two or more characters. Finally, as noted, the keywords displayed in the distinctive word analysis were categorized and selected manually, for three reasons. First, not all keywords are informative. For instance, the keyword most strongly associated with public disclosure is “soliciting opinions” (zhengqiu yijian 征求意见). This is not particularly informative because this practice is, by its very nature, public. Other keywords that are not as informative include terms like “work” (gongzuo 工作), “issuance” (yinfa 印发), “to perfect” (wanshan 完善), etc. Second, some keywords are highly distinctive potentially because of data limitations. One highly distinctive word is “National Tax Administration” (guojia shuiwu zongju 国家税务总局), but this is most likely because this agency is not included in the web scraper. Third, keywords may be related to different functions. For instance, the keyword “opinion” (yijian 意见) specifically refers to the rubric of a document, not to policymaking on opinions.
In selecting keywords for the tables, I followed three guidelines:
1. The keyword must inform the reader about a clear topic or category that is associated with (non)transparency.
2. It must be verifiable that the keyword selected is not an artefact caused by limited data or by the word segmentation tools used.
3. There must be other, similar, keywords that show similar distinctiveness scores.
The full list of keywords and their distinctiveness scores are available in the GitHub repository of this project and can be independently verified.Footnote 53
Dictionary Method
The dictionary method measures topics by the presence of keywords. For this paper, automated scripts coded each document according to whether its title contained one or more keywords related to a topic. In this way, a document could be coded with multiple topics. This is a logical approach, given that many documents are lengthy and can discuss many different topics within their contents. The table below provides examples of the keywords used to code each document. The full code and keyword lists can be found in the GitHub repository for the project.
Vincent BRUSSEE is a PhD candidate at Leiden University. He specializes in the application of natural language processing for contemporary Chinese policy analysis and is the author of Social Credit: The Warring States of China's Emerging Data Empire (2023, Palgrave Macmillan). Previously, he was an analyst at the Mercator Institute for China Studies (MERICS) in Berlin.