Outreachy Internship – Wikimedia Community and Project

In this blog post, I would like to explain what I am doing for my internship at Wikimedia. If you are a free software contributor or a newcomer who wants to learn more about Wikimedia, or an Outreachy applicant thinking about applying to Wikimedia, this is the right place. 🙂

If you have read my previous blog post, Design Specification for Citation Detective, you probably found it hard to get a clear picture of my project, since I did not explain many of the terms and concepts I learned in the community.

So, let’s start from scratch.

Wikimedia Community

To put it simply, three kinds of people participate in Wikimedia (especially Wikipedia) in different ways: editors, tool developers, and researchers.


Editor

Most people know that anyone can edit Wikipedia, but they probably don’t know that Wikipedia has more than 38 million registered editors! Here is an interesting article that explores why people volunteer to edit articles on Wikipedia.

Tool developer

Tool developers are people who voluntarily create tools that help editors and other volunteers in their work and provide value to the Wikimedia movement. Toolforge, a hosting environment for developers in Wikimedia, currently has 2507 hosted tools and 1889 tool maintainers.

One such tool is ClueBot NG, an anti-vandal bot created to detect and quickly revert vandalism on Wikipedia. “Bots” in the context of Wikipedia are autonomous computer programs that run to keep the encyclopedia in order. A good introduction to Wikipedia bots can be found here. Developers also create web services like PetScan to help people easily search for an article category in Wikipedia.


Researcher

Researchers at Wikimedia have clear goals and missions. They use data to understand and empower millions of readers and contributors. Their goals include building a positive community culture, improving sources and citations, and understanding reader behavior.

The Wikimedia Research team hosts a live showcase on YouTube every month. In each showcase, they invite two speakers to share interesting research, with topics ranging from “Protecting Wikipedia from Disinformation” to “Characterizing Wikipedia Reader Demographics and Interests”. It is a good way to learn more about their focus and ongoing projects.

About My Project

Now let’s talk about my project, which I named Citation Detective.

An image to show the workflow in my project.

There are a few important concepts in the workflow from left to right:

  • The MediaWiki API is a web service that gives developers access to wiki features. In this project, I use the MediaWiki API to fetch millions of pages and retrieve the article text from Wikipedia by sending requests to the API.
  • Text processing prepares the data for the machine learning model. For example, it identifies whether a sentence already has a citation, since we want to run the model only on unsourced statements rather than on the whole article.
  • The core of Citation Detective is the Citation Needed model created by the Wikimedia Research team. Using a Recurrent Neural Network with an attention mechanism, the model learned from high-quality, well-sourced featured articles on Wikipedia. It can correctly classify sentences in need of a citation with an accuracy of up to 90%. A blog post with the technical details can be found here.
  • The data dump is the last step: it releases the prediction output to a database with enough information, such as the sentence itself in raw wikitext format, the section title, and the revision ID, so that tool developers can easily find and locate the sentence in the correct revision.
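
To make the first two steps more concrete, here is a minimal sketch in Python. It is not the actual Citation Detective code. It assumes the English Wikipedia endpoint, uses a made-up placeholder (REF_MARKER), and splits sentences with a deliberately naive regular expression that will stumble on abbreviations such as "e.g.". The function names build_wikitext_query and unsourced_sentences are hypothetical.

```python
import re
from urllib.parse import urlencode

# Assumption: the public API endpoint of English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# Hypothetical placeholder that stands in for a removed <ref> tag.
REF_MARKER = "@@REF@@"

# Matches self-closing refs (<ref name="x" />) and paired refs (<ref>...</ref>).
REF_PATTERN = re.compile(
    r"<ref\b[^>]*/>|<ref\b[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE
)

# Naive sentence: text up to closing punctuation, plus any markers right after it.
SENTENCE_PATTERN = re.compile(
    r"[^.!?]+[.!?]+(?:" + re.escape(REF_MARKER) + r")*"
)


def build_wikitext_query(title):
    """Build a URL that asks the MediaWiki API for a page's latest wikitext."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "ids|content",  # revision ID plus the wikitext itself
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def unsourced_sentences(wikitext):
    """Return the sentences that are not accompanied by a <ref> tag."""
    # Replace every reference with a marker so its content cannot break a sentence.
    marked = REF_PATTERN.sub(REF_MARKER, wikitext)
    result = []
    for chunk in SENTENCE_PATTERN.findall(marked):
        if REF_MARKER in chunk:
            continue  # already sourced, no need to run the model on it
        sentence = chunk.strip()
        if sentence:
            result.append(sentence)
    return result


if __name__ == "__main__":
    sample = (
        "Paris is the capital of France.<ref>Some source</ref> "
        "It is a large city. "
        'It hosts the Louvre.<ref name="a" />'
    )
    print(build_wikitext_query("Alan Turing"))
    print(unsourced_sentences(sample))
```

Only the sentences returned by unsourced_sentences would then be passed on to the model. Real wikitext also contains templates, tables and links, so a real pipeline needs a proper wikitext parser rather than two regular expressions.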

How does the project fit into the community?

My project fits into the community in an interesting way: it acts as a bridge between researchers and tool developers. The end users are neither readers nor editors; they are tool developers. I’m trying to provide a database, based on research work, that tool developers can use to augment or create tools, bots, and other systems for improving Wikipedia’s reliability.
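
As a rough illustration of how a tool developer might consume such a database, here is a small Python sketch. Everything in it is an assumption made for the example: the table name sentences, the column names, the sample rows, and the use of an in-memory SQLite database (chosen only so the sketch runs anywhere). The real schema and database engine may look different.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical 'sentences' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sentences ("
        " sentence TEXT,"    # the sentence in raw wikitext format
        " section TEXT,"     # title of the section it appears in
        " rev_id INTEGER)"   # revision ID of the article it was taken from
    )
    # Made-up rows, for illustration only.
    conn.executemany(
        "INSERT INTO sentences VALUES (?, ?, ?)",
        [
            ("It is a [[city|large city]].", "Geography", 1001),
            ("The museum opened long ago.", "History", 1001),
            ("The river is very long.", "Geography", 2002),
        ],
    )
    return conn


def sentences_for_revision(conn, rev_id):
    """Return (section, sentence) pairs flagged for one article revision."""
    cursor = conn.execute(
        "SELECT section, sentence FROM sentences"
        " WHERE rev_id = ? ORDER BY section",
        (rev_id,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    db = make_demo_db()
    for section, sentence in sentences_for_revision(db, 1001):
        print(section, "|", sentence)
```

Because each row carries the revision ID and the section title, a tool can look up the exact revision and find the sentence in its raw wikitext.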

Another goal of my project is to integrate the data dump into Citation Hunt. Citation Hunt is a tool for browsing snippets of Wikipedia articles that carry a citation needed tag, as shown above. The tool is available in 20 languages and is used by many editors in the course of their work; it is also frequently used in contribution campaigns such as 1lib1ref.

To make the data dump compatible with Citation Hunt, I need to understand the architecture and workflows of Citation Hunt, especially how it processes Wikipedia’s articles and updates its database.

Collaborating with my mentors makes me really excited about my project. One of my mentors, Surlycyborg, is an experienced volunteer tool developer and the author of Citation Hunt. His experience with Toolforge gives me direction in accomplishing the tasks and milestones of my project. In addition, his point of view reflects a user’s needs, since he is also a tool developer and Citation Hunt is the primary use case of my project. That helps me clear up points I misunderstood and keeps me from implementing features that would be inconvenient for users.

The most exciting thing was seeing the surfaced “citation needed” sentences in Citation Hunt when I finished an end-to-end pipeline this week! ❤ So far it is only running on my own Citation Hunt replica for testing and troubleshooting, but I’m looking forward to Citation Detective supporting more and more tools in the future. 💡