Page Protection Software and Dataset

biology_screenshot.png

Example of the English Wikipedia article on Biology, which has been protected for long periods of time. Note the "View Source" button in place of "Edit" and the small lock icon signaling that the page is protected.

Page protection is a feature of MediaWiki software that allows administrators to restrict contributions to particular pages. For example, a page can be “protected” so that only administrators or logged-in editors with a history of good editing can edit, move, or create it.

Protection might involve “full protection” where a page can only be edited by administrators (i.e., “sysops”) or “semi-protection” where a page can only be edited by accounts with a history of good edits (i.e., “autoconfirmed” users).
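To make the two levels concrete, here is a minimal Python sketch of the access rule described above. It is a simplified model for illustration, not MediaWiki's actual permission system: the names `ALLOWED_GROUPS` and `can_edit`, and the representation of a user as a set of group names, are assumptions. Only the level names "sysop" and "autoconfirmed" come from the text.

```python
from typing import Optional, Set

# Hypothetical lookup table (an assumption for this sketch):
# protection level -> user groups that may still edit the page.
ALLOWED_GROUPS = {
    "sysop": {"sysop"},                           # "full protection"
    "autoconfirmed": {"autoconfirmed", "sysop"},  # "semi-protection"
}


def can_edit(protection_level: Optional[str], user_groups: Set[str]) -> bool:
    """Return True if a user in `user_groups` may edit a page.

    `protection_level` is None for an unprotected page, which this
    simplified model treats as editable by anyone.
    """
    if protection_level is None:
        return True
    allowed = ALLOWED_GROUPS.get(protection_level)
    if allowed is None:
        raise ValueError(f"unknown protection level: {protection_level!r}")
    # The user may edit if they belong to at least one allowed group.
    return bool(allowed & user_groups)
```

Under this model, an anonymous reader (an empty set of groups) can edit an unprotected page but neither kind of protected page, which is the situation the "View Source" button reflects.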

Although largely hidden, page protection profoundly shapes activity on the site. For example, page protection is an important tool used to manage access and participation in situations where vandalism or interpersonal conflict threatens to undermine content quality. While protection affects only a small portion of pages in English Wikipedia, many of the most highly viewed pages are protected. For example, the “Main Page” in English Wikipedia has been protected since February 2006, and all Featured Articles are protected at the time they appear on the site’s main page. Millions of viewers may never edit Wikipedia because they never see an edit button.

Despite page protection's widespread use and influence, very little quantitative research on Wikipedia has taken it into account systematically. This page contains software and data to help Wikipedia researchers do exactly this in their work.

Because a page's protection status can change over time, the snapshots of page protection data stored by Wikimedia and published by the Wikimedia Foundation as dumps are incomplete. As a result, taking protection into account involves looking at several different sources of data.
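The following Python sketch illustrates why a single snapshot is not enough: a page's status at any given moment depends on the history of protection changes before that moment. This is an illustration under stated assumptions, not the method implemented in our software. The `ProtectionEvent` record and the `level_at` function are hypothetical names, and the sketch ignores protection expiry dates and other details that real data would need to handle.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class ProtectionEvent:
    """Hypothetical record of one change to a page's protection status."""
    timestamp: datetime
    level: Optional[str]  # e.g. "sysop" or "autoconfirmed"; None = unprotected


def level_at(events: List[ProtectionEvent], when: datetime) -> Optional[str]:
    """Return the protection level in force at time `when`.

    The most recent event at or before `when` determines the status.
    A page with no earlier event is treated as unprotected.
    """
    ordered = sorted(events, key=lambda e: e.timestamp)
    times = [e.timestamp for e in ordered]
    idx = bisect_right(times, when)  # number of events at or before `when`
    if idx == 0:
        return None
    return ordered[idx - 1].level
```

For example, a page that was semi-protected in 2010 and unprotected in 2012 appears unprotected in any later snapshot, even though it was protected for two years. Only the event history recovers that interval.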

Much more detail can be found in our paper, Page Protection: Another Missing Dimension of Wikipedia Research. If you use this software or these data, we would appreciate it if you cited the paper:

Hill, Benjamin Mako & Shaw, Aaron. (2015) “Page Protection: Another Missing Dimension of Wikipedia Research.” In Proceedings of the 11th International Symposium on Open Collaboration (OpenSym 2015). ACM Press. doi: 10.1145/2788993.2789846

Page Protection Software

Building page protection data is a multi-step and labor-intensive process. We have publicly released software in Python and R to carry out these steps under the GNU GPL version 3. The software is designed for people who are already comfortable working with MediaWiki XML dumps and the tools needed to process them.

You can download the software from our git repository as follows:

git clone git://projects.mako.cc/protection-tools

Detailed documentation on how to use the software is available in our README file.

Page Protection Data

protections_over_time.png

Count of pages protected from editing in English Wikipedia over time for all pages and for the article namespace only.

In our paper, we present an analysis of page protection data from English Wikipedia in the dump created in January 2015. You can download the dump files we used from the Wikimedia Foundation dataset archive at the URLs detailed in the README. Because processing these dumps can be computationally intensive, we have published the output of the software above as run on this dump.

You can download the dataset in the following formats:

More Information

For details about the dataset, why it is important, and examples of how it can be used to produce better findings in Wikipedia research, please read the companion paper.

If you notice issues or bugs in our data or code, contact Benjamin Mako Hill or Aaron Shaw.

Patches and improvements are welcome! Details on how to produce and send a patch using git are online.


ⓒ Copyright Benjamin Mako Hill and Aaron Shaw :: Creative Commons BY-SA :: Updated: Sun Dec 11 17:04:42 PST 2016