Outreachy Internship – Design Specification for Citation Detective

In this design specification, we focus on a minimum viable product (v0) for Citation Detective.

Overview

We want to build a system that generates a data export of sentences that may need a citation on Wikipedia. The system processes Wikipedia articles and produces a dataset based on the predictions of a machine-learning model that gives each sentence a “citation needed” score. The Citation Need model comes from Identification of Unsourced Statements, a research project conducted by the Wikimedia Foundation Research team. The data export contains sentences and metadata for use in developing tools that help Wikipedia editors recognize unsourced sentences and further improve the reliability of content on Wikipedia.

Applications

The data export will support a variety of use cases: bots, MediaWiki UI, web applications, and integration with existing tools. For example:

  • Bots: a bot that adds citation needed templates to articles based on the model’s predictions.
  • MediaWiki UI: a VE add-on that highlights newly added sentences in the editing interface, helping contributors prioritize where to add citations.
  • Web applications: a testbed application that lets anyone paste a sentence into a web form and get a citation needed score for it.
  • Existing tools: the most important use case is supporting Citation Hunt integration.

More use cases are described in the API design research on Citation Need model output.

Requirements

The input to the system is a sample of Wikipedia articles, and the system makes predictions on sentences both with and without a citation needed template ({{cn}}). For output, we export all sentences to see what the resulting data size looks like.
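
As an illustration of the input side, here is a minimal sketch of how sentences already tagged with {{cn}} could be detected in wikitext. It assumes the mwparserfromhell library, which is commonly used in Wikimedia tooling; the function name is our own:

  import mwparserfromhell

  def has_citation_needed(wikitext):
      """Return True if the wikitext contains a {{cn}} or {{citation needed}} template."""
      for template in mwparserfromhell.parse(wikitext).filter_templates():
          if str(template.name).strip().lower() in ("cn", "citation needed"):
              return True
      return False

  print(has_citation_needed("Paris is the largest city in Europe.{{cn}}"))  # True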

We don’t have strict data freshness requirements: the system can make predictions based on article data a few days old.

The system is able to list sentences and their scores and, for each sentence, to identify the paragraph and section heading it belongs to and the revision ID it came from. Citation Hunt can use the database replicas to turn that revision ID into a page_id and do its own processing of the article later, if needed.
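
For illustration, resolving a revision ID to a page_id is a single query against the revision table on the replicas; this sketch assumes a pymysql-style connection object (the function name is ours, the table and column names are standard MediaWiki schema):

  def page_id_for_revision(conn, rev_id):
      """Resolve a revision ID to the page_id it belongs to, via the replicas."""
      with conn.cursor() as cursor:
          cursor.execute("SELECT rev_page FROM revision WHERE rev_id = %s", (rev_id,))
          row = cursor.fetchone()
          return row[0] if row else None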

Design

The inputs to our system are X % of all articles. The system consumes these inputs by querying the database replicas (page table) via SQL to generate a list of page_ids, and then calls the MediaWiki API to get the page content for parsing.
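
The sketch below shows one way this could look, assuming a pymysql connection to the enwiki replica and the standard MediaWiki query API; the sample size is a placeholder, and ORDER BY RAND() is only for illustration (it is slow on a table this large):

  import pymysql
  import requests

  # On Toolforge, a replica connection might look like this
  # (credentials come from the tool's replica.my.cnf):
  #   conn = pymysql.connect(
  #       host="enwiki.analytics.db.svc.wikimedia.cloud",
  #       db="enwiki_p",
  #       read_default_file="~/replica.my.cnf",
  #   )

  def sample_page_ids(conn, sample_size=1000):
      """Sample article page_ids from the page table on the replicas."""
      with conn.cursor() as cursor:
          cursor.execute(
              "SELECT page_id FROM page "
              "WHERE page_namespace = 0 AND page_is_redirect = 0 "
              "ORDER BY RAND() LIMIT %s",
              (sample_size,),
          )
          return [row[0] for row in cursor.fetchall()]

  def fetch_wikitext(page_id):
      """Fetch the latest revision ID and wikitext of a page via the MediaWiki API."""
      response = requests.get(
          "https://en.wikipedia.org/w/api.php",
          params={
              "action": "query", "pageids": page_id,
              "prop": "revisions", "rvprop": "ids|content",
              "rvslots": "main", "format": "json", "formatversion": "2",
          },
          headers={"User-Agent": "citation-detective-sketch"},
      )
      revision = response.json()["query"]["pages"][0]["revisions"][0]
      return revision["revid"], revision["slots"]["main"]["content"]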

The output is a database accessible to tool developers in Toolforge; it will not be on the list of public Wikimedia datasets for this version.

Output schema (a sketch of a corresponding table definition follows the list):

  • sentence_id: primary key
  • sentence_text: wikitext of the sentence
  • sentence_context: wikitext of the paragraph that contains sentence_text
  • sentence_score: the “citation needed” score predicted by the model
  • section_heading: heading of the section the sentence belongs to
  • revision_id: ID of the article revision the sentence was extracted from
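
As a concrete sketch, the schema above could translate into a MySQL table like the following; only the column names come from the spec, the types are assumptions:

  SENTENCES_TABLE = """
  CREATE TABLE IF NOT EXISTS sentences (
      sentence_id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
      sentence_text    TEXT NOT NULL,          -- wikitext of the sentence
      sentence_context TEXT NOT NULL,          -- wikitext of the enclosing paragraph
      sentence_score   FLOAT NOT NULL,         -- citation needed score from the model
      section_heading  VARCHAR(255),           -- heading of the enclosing section
      revision_id      INT UNSIGNED NOT NULL   -- revision the sentence came from
  )
  """

  def create_schema(conn):
      """Create the output table if it does not already exist."""
      with conn.cursor() as cursor:
          cursor.execute(SENTENCES_TABLE)
      conn.commit()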

The system will be a scheduled batch job deployed in Toolforge, and it will export a database readable by any other tool.
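
Tying the pieces above together, one run of the batch job could look roughly like this; split_into_sentences and score_sentence are hypothetical stand-ins for the parsing step and the Citation Need model, neither of which is defined by this spec:

  def run_batch_job(conn):
      """One scheduled run: sample articles, score their sentences, write the export."""
      for page_id in sample_page_ids(conn):
          revision_id, wikitext = fetch_wikitext(page_id)
          for sentence in split_into_sentences(wikitext):  # hypothetical parser
              score = score_sentence(sentence.text)        # hypothetical model call
              with conn.cursor() as cursor:
                  cursor.execute(
                      "INSERT INTO sentences (sentence_text, sentence_context, "
                      "sentence_score, section_heading, revision_id) "
                      "VALUES (%s, %s, %s, %s, %s)",
                      (sentence.text, sentence.context, score,
                       sentence.section_heading, revision_id),
                  )
      conn.commit()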

Validation

We will use internal metrics in Citation Hunt, which provides statistics on which snippets got fixed.

Tracking and communication

I created a GitHub repository to host my code!

We’ll post about this on labs-l (the mailing list for Toolforge developers) and the Village Pump when we’re ready. 🙂
