Outreachy Internship – Wrapping Everything Up

My Outreachy internship has come to an end! I really appreciate this life-changing experience. Looking back, when I applied to Outreachy I just wanted a breather from my research thesis. I had been working on the thesis for quite a long time and was getting less and less motivated, and I realized I should do something about it. Luckily, I got this Outreachy internship, and it turned out to be so much more than I expected!

In this final post of the internship, I would like to share how the Outreachy internship helped me grow my skills, how my mentors helped me along the way, what I achieved and what the next steps are, and the amazing things that happened during the internship.

Let’s get started! 🙂


What I have learned and improved

More than Machine Learning

The goal of my project is to productionize a machine learning model, the Citation Need model, to make it accessible to tool developers, not only to machine learning researchers and practitioners. The ML-related part includes importing a trained TensorFlow model into the system, preparing data, and getting prediction results. On top of this, I developed an end-to-end pipeline, which involved writing Python scripts to interact with the Wikipedia replicas and databases, parsing wikitext to extract information, writing a Python multiprocessing script to improve efficiency, and so on.
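To make the ML part concrete, here is a minimal sketch under my own assumptions, not the project's actual code: a tiny untrained Keras model stands in for the trained Citation Need model, and sentences are assumed to have already been converted to fixed-length word-id sequences by the preprocessing step.

```python
# Minimal sketch of the ML-related part: import (here: build a stand-in
# for) a trained model and score a batch of preprocessed sentences.
import numpy as np
import tensorflow as tf

# In the real system a trained model would be imported, e.g. with
# tf.keras.models.load_model(...); this untrained stand-in keeps the
# sketch runnable.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=16),
    tf.keras.layers.GRU(16),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # "citation needed" score
])

# Placeholder batch: 2 sentences, each a sequence of 30 word ids.
batch = np.random.randint(0, 10000, size=(2, 30))

scores = model.predict(batch)
print(scores)  # one score in [0, 1] per sentence
```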

Learn from Citation Hunt codebase

Getting acquainted with the codebase of Citation Hunt gave me a lot of ideas on how to build a system from 0 to 1. My project and Citation Hunt are similar in that they both parse Wikipedia articles, generate data, and store that data in databases; the difference lies in their outputs and core algorithms. In the course of understanding the codebase, I gradually formed a clear picture of the workflow for my project. In addition, the well-organized, concise codebase inspired me to write code with high readability and maintainability.

Learn from Toolforge documentation and community

I also learned about Toolforge, a Wikimedia hosting environment where tool developers create and run tools. The Toolforge documentation is a great resource from which I learned almost everything I needed for my project, including the bastion hosts, the grid engine, etc. At the same time, I improved my Unix/Linux command-line skills by using Toolforge frequently.

Although the documentation is useful most of the time, I sometimes got stuck on problems in Toolforge. Asking the community for help is a great idea: I once filed an issue on Phabricator and tagged Toolforge, and the problem was resolved soon after. People in the community love to help others, so don't worry; just make sure you explain the problem and provide enough information.

The best way to learn Git: contribute to FOSS

Besides the Python language and the Toolforge environment, I became more familiar with Git, a version control system, during the internship. I would say the best way to learn Git is by contributing to free and open source software (FOSS). From the middle of the internship, I started to upload my code to the GitHub repository and create pull requests (PRs). I learned to inspect the repository status, work with branches, make changes, synchronize, etc., and finally figured out how Git works.

Learn collaboration with Git

In addition, Git is a good way to learn collaboration. One means of collaboration between my mentor and me was PR comments on GitHub: my mentor reviews my code and leaves comments in the PR, and we keep the conversation going until we both agree the changes can be merged into the master branch.

The following are the PRs I made during the internship:

73 comments in one PR!

I can't thank my mentors enough. We had video calls once a week, which were really helpful for keeping on track with the goals, making constant progress, and reminding me whenever I deviated slightly from our aims. Every time I asked a question in our project group chat, I got a response quickly. Regarding communication skills, I was relatively passive in the past. During the internship, I gradually changed my mindset and became more proactive about sharing my progress and my ideas, which made the project run more smoothly.

Project outcomes and future work

Here are the results I achieved:

  • Developed a tool and public dataset, Citation Detective, containing sentences identified as needing a citation by a machine-learning-based classifier;
  • Created a proof of concept for integrating Citation Detective and Citation Hunt.

In the future, we hope to expand Citation Detective

  • to Wikipedia projects in more languages;
  • to a greater number of articles;
  • to include more data fields, such as the particular words the model paid attention to;
  • to be more accessible, for example through the Quarry database querying service.

On the other hand, before the prototype can be merged into the real Citation Hunt, it

  • requires more UX work;
  • requires a more mature Citation Detective (see the future work above).

A new journey begins

After the game is before the game.

Outreachy not only helped me feel more confident about making open source and free software contributions but also about working with a diverse team remotely. I'm now more courageous about asking questions and conveying ideas in a team. I also came to know better what my interests are, what my strengths are, and how to prepare for my career and find opportunities.

All this would not have been possible without the help of my mentors. Guilherme Gonçalves, Miriam Redi, Sam Walton: thank you for all the time you spent selflessly guiding me, leading me, encouraging me, and affirming me throughout the journey. All of this makes me more confident to face the challenges ahead.

Getting to know Wikimedia and becoming part of the community was one of the amazing things about the internship. I still remember how the vision of Wikimedia, to help everyone share in the sum of all knowledge, inspired me at the very beginning. It is meaningful and fulfilling to have contributed a little to the Wikimedia movement. Although the internship has ended, it is a new beginning for me as a contributor to FOSS.

Cheers! 😀

Outreachy Internship – Submitting to the Wiki Workshop 2020!

Time flies! We are already in the last week of the Outreachy internship :/

* * *

Recently I’ve learnt something about:

  • Python unittest framework to perform unit tests for small components in an application.
  • multiprocessing.Pool to achieve data parallelism.
  • multiprocessing.Manager to distribute the execution of a TensorFlow model prediction across processes.

Neither unit testing nor multiprocessing is a new concept to me; both were taught more or less at school. However, this time I learned how to write them in Python and how to solve the kinds of issues you face in practice in production-ready code. It was a really good learning experience.
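As an illustration of the multiprocessing.Pool pattern mentioned above, here is a minimal sketch under my own assumptions: it loads a dummy stand-in model once per worker via an initializer, which is one common way to parallelize predictions (the actual Manager-based approach in the project differs).

```python
# Data-parallel sentence scoring with multiprocessing.Pool (sketch).
import multiprocessing

_model = None  # per-process model handle

class DummyModel:
    """Stand-in for the real TensorFlow Citation Need model."""
    def predict(self, sentence):
        # Toy heuristic in place of a real prediction.
        return 0.9 if "citation" in sentence.lower() else 0.1

def init_worker():
    # Load the model once per worker process; a real TensorFlow model
    # is expensive to create and not trivially shared across processes.
    global _model
    _model = DummyModel()

def score(sentence):
    return sentence, _model.predict(sentence)

if __name__ == "__main__":
    sentences = ["This claim needs a citation.", "The sky is blue."]
    with multiprocessing.Pool(processes=2, initializer=init_worker) as pool:
        for sentence, prob in pool.imap_unordered(score, sentences):
            print(f"{prob:.2f}  {sentence}")
```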

Besides, I feel I'm becoming more and more familiar with Git, as my PRs in the repository have been merged one by one over the past few weeks. 😀

* * *

One of the most cheerful things is that I just submitted a paper about my internship project to the Wiki Workshop 2020 ⭐

Wiki Workshop is a forum bringing together researchers exploring all aspects of Wikipedia, Wikidata, and other Wikimedia projects.

Wiki Workshop enters its 7th edition this year. Every year, Wiki Workshop is held in conjunction with The Web Conference (formerly known as the WWW conference), a yearly international conference on the topic of the World Wide Web, which took place in Lyon, France in 2018 and in San Francisco, USA in 2019. This year the conference will be held in Taipei, Taiwan (my country!) in April 2020.

Working with my mentors to complete a paper submission was a wonderful experience. To be honest, I'm not very good at writing, especially writing in English. Writing a blog post always takes me a lot of time, not to mention academic writing.

But this time, under the guidance of my mentors, the paper had a clear structure and a summary overview in each section; I just wrote down what I had done and what I had learned. Another key point is that I had a lot of material to help me write, such as

  • meeting notes from our weekly project syncs,
  • comments on GitHub where I discussed code with my mentors,
  • communication records from the group chat with my mentors.

These materials detail the difficulties I encountered in pursuing the project goal, the feasible solutions and their reasoning, and, in the end, how I solved them. With these materials, writing did not seem as hard as before.

I'm so happy to have finished a paper submission in collaboration with my mentors! Looking forward to The Web Conference 2020 in April and to meeting folks in the Wikimedia community from all over the world. How exciting! ❤

Outreachy Internship – Progress and Reflection

In the previous post, I mentioned that my internship project has two goals. One is to deliver a database of Wikipedia sentences detected by the “Citation Needed” machine learning model. The other is to integrate that data into the web application “Citation Hunt” for browsing Wikipedia snippets that lack a reliable source.

My original project timeline was:

  1. Design research and code development for Citation Detective (Weeks 1-5)
  2. Testing and setting up the regular job for Citation Detective (Weeks 6-8)
  3. Integration with Citation Hunt (Weeks 9-13)

Progress

The timeline was modified a bit after I completed the code development for Citation Detective and got it merged during Week 6. According to the original timeline, I was then going to spend two weeks on comprehensive testing, including a test on a larger sampled article set, handling possible edge cases, and evaluating system efficiency.

However, my mentor suggested I start looking at ingesting some of the data into Citation Hunt to see what problems we would encounter. In other words, we swapped the tasks for Weeks 6-8 with the tasks for Weeks 9-13.

It turned out to be the right decision. When integrating with Citation Hunt, I ran into some issues. As a consequence, I made a few significant changes in Citation Detective, as follows:

  • Save the unprocessed, raw wikitext of sentences in the database
  • Run the model on all individual sentences in a given article

So far, I have completed an initial version of Citation Detective, tested on roughly 10k sampled Wikipedia articles. Also, a PR for the integration with Citation Hunt is about to be merged.

A snippet with a highlighted sentence auto-detected by Citation Detective

Reflection

Working on ingesting data into Citation Hunt took longer than I expected. I spent two weeks on it and am finally about to get a version merged.

The reason this goal took so much time relates to two facts I realized afterward:

  • I should guarantee that Citation Hunt and other tools can consume the data in a friendly way.
  • It is hard to determine from wikitext how many sentences a reference tag or citation needed tag applies to.

In fact, I could have realized these two important facts earlier and sped up the code development if I had discussed things in more depth with my mentors.

What I have learned is that, before putting effort into a primary task, it is important to write down my thoughts and discuss them with my mentors to make sure we have the same picture of:

  • What the goal of the PR is
  • How you will achieve it (pseudocode is a good way to explain your logic)
  • Why you decided to do it this way

Sometimes we misunderstand a concept or make a wrong assumption, and if we don't notice it before we start coding, it becomes a big problem: we may, in the end, produce hundreds of useless lines of code. Only through sufficient communication and understanding can we find our blind spots in advance.

So... that's all for this blog post: a progress report of what I've done and a reflection on the first half of my internship. For the second half, I hope to spend more time proactively expressing my ideas and communicating with my mentors. 🙂

Outreachy Internship – Wikimedia Community and Project

In this blog post, I would like to explain what I am doing for my internship at Wikimedia. If you are a free software contributor or a newcomer who wants to learn more about Wikimedia, or an Outreachy applicant thinking about applying to Wikimedia, this is the right place. 🙂

If you have read my previous blog post, Design Specification for Citation Detective, I suspect it was hard to get a clear picture of my project, since I did not explain much about the terms and concepts I learned in the community.

So, let’s start from scratch.

Wikimedia Community

To put it simply, there are three kinds of people participating in Wikimedia (especially Wikipedia) in different ways: editors, tool developers, and researchers.

Editor

Most people know that anyone can edit Wikipedia, but they probably don't know that Wikipedia has more than 38 million registered editors! Here is an interesting article that explores why people volunteer to edit articles on Wikipedia.

Tool developer

Tool developers are a group of people who voluntarily create tools to help editors and other volunteers in their work and provide value to the Wikimedia movement. Toolforge, a hosting environment for developers in Wikimedia, currently hosts 2,507 tools with 1,889 tool maintainers.

One example is ClueBot NG, an anti-vandal bot created to detect and revert vandalism quickly on Wikipedia. “Bots” in the context of Wikipedia are autonomous computer programs that run to keep the encyclopedia in order; a good introductory article about Wikipedia bots can be found here. Developers also create web services like PetScan to help people easily search for articles by category on Wikipedia.

Researcher

Researchers in Wikimedia have clear goals and missions. They use data to understand and empower millions of readers and contributors. Their goals include building a positive community culture, improving sources and citations, understanding reader behavior, and more.

The Wikimedia Research team hosts a live showcase on its YouTube channel every month. In the showcase, they invite two speakers to share interesting research, with topics ranging from “Protecting Wikipedia from Disinformation” to “Characterizing Wikipedia Reader Demographics and Interests”. It is a good way to learn more about their focus and ongoing projects.

About My Project

Now let's talk about my project, which I named Citation Detective.

An image showing the workflow of my project.

There are a few important concepts in the workflow from left to right:

  • The MediaWiki API is a web service that allows developers to access wiki features. In this project, I use the MediaWiki API to list millions of pages and retrieve the content text of Wikipedia articles by sending requests to the API (a minimal sketch follows this list).
  • Text processing in my project aims to prepare the data for the machine learning model's prediction, for example by identifying whether a sentence already has a citation, since we don't want to run the model on the whole article but only on unsourced statements.
  • The core of Citation Detective is the Citation Need model created by the Wikimedia Research team. Built on a recurrent neural network with an attention mechanism, the model learned from high-quality, well-sourced featured articles on Wikipedia, and it can correctly classify sentences in need of a citation with an accuracy of up to 90%. A blog post with the technical details can be found here.
  • The data dump is the last step: releasing the prediction output to a database with enough information, such as the sentence itself in raw wikitext format, the section title, and the revision ID, so that tool developers can easily search for and locate the sentence in the correct revision.
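Here is a minimal sketch of that first step, fetching an article's current wikitext and revision ID through the MediaWiki API. This is my own illustration (the article title is arbitrary), not the project's actual code:

```python
# Fetch the current wikitext and revision id of one article via the
# MediaWiki API (illustration only).
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_wikitext(title):
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content|ids",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    data = requests.get(API_URL, params=params).json()
    page = data["query"]["pages"][0]
    rev = page["revisions"][0]
    return rev["revid"], rev["slots"]["main"]["content"]

revid, wikitext = fetch_wikitext("Earth")
print(revid, wikitext[:200])
```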

How does the project fit into the community?

My project fits into the community in an interesting way: it acts as a bridge between researchers and tool developers. The end users are neither readers nor editors; they are tool developers. I'm trying to provide a database, based on research work, that tool developers can use to augment or create tools, bots, and other systems for improving Wikipedia's reliability.

Another goal of my project is to integrate the data dump into Citation Hunt. Citation Hunt is a tool for browsing snippets of Wikipedia articles that carry a citation needed tag, as shown above. The tool is available in 20 languages and used by many editors in the course of their work, and it is also frequently used in contribution campaigns like 1lib1ref.

To make the data dump compatible with Citation Hunt, I need to understand the architecture and workflows of Citation Hunt, especially those for processing Wikipedia articles and updating the database.

Collaborating with my mentors makes me really excited about my project. One of my mentors, Surlycyborg, is an experienced volunteer tool developer and the author of Citation Hunt. His experience with Toolforge gives me direction for accomplishing the tasks and milestones in my project. In addition, his point of view reflects a user's needs, since he is also a tool developer and Citation Hunt is the primary use case of my project. That helps me clarify points I had misunderstood and keeps me from implementing features that would be inconvenient for users.

The most exciting thing was seeing the surfaced “citation needed” sentences in Citation Hunt when I finished the end-to-end pipeline this week! ❤ So far it is only running on my own Citation Hunt replica for testing and troubleshooting, but I'm looking forward to Citation Detective supporting more and more tools in the future. 💡

Outreachy Internship – Design Specification for Citation Detective

In this design specification, we focus on a minimum viable product (v0) for Citation Detective.

Overview

We want to build a system that generates a data export of sentences on Wikipedia that may need a citation. The system processes Wikipedia articles and produces a dataset based on the prediction results of a machine-learning model that gives each sentence a “citation needed” score. The Citation Need model comes from a research project, Identification of Unsourced Statements, conducted by the Wikimedia Foundation Research team. The data export contains sentences and metadata for use in developing tools that help Wikipedia editors recognize unsourced sentences and further improve the reliability of the content on Wikipedia.

Applications

The data export will support a variety of use cases: bots, the MediaWiki UI, web applications, and integration with existing tools. For example:

  • Bots: a bot that adds citation needed templates to articles based on model predictions.
  • MediaWiki UI: a VE add-on that highlights sentences added within the editing interface to help contributors prioritize where to add citations.
  • Web applications: a testbed application that allows anyone to paste a sentence into a web form and get a citation needed score for that sentence.
  • Existing tools: the most important use case is supporting the Citation Hunt integration.

More use cases are described in the API design research on the Citation Need model output.

Requirements

The input of the system is a sample of Wikipedia articles, and the system makes predictions on sentences both with and without a citation needed template {{cn}}. For output, we export all sentences to see what the resulting data size looks like.
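For illustration, here is one way to find sentences that already carry a citation needed template, sketched with the mwparserfromhell library; this is my assumption of a plausible approach, not necessarily the system's own parsing code:

```python
# Detect "citation needed" templates in wikitext using mwparserfromhell
# (an assumption for illustration; the system's text processing may differ).
import mwparserfromhell

wikitext = (
    "The sky is blue.{{cn}} "
    "Water is wet.{{Citation needed|date=May 2020}} "
    "This sentence cites a source.<ref>Some source</ref>"
)

code = mwparserfromhell.parse(wikitext)
for template in code.filter_templates():
    name = str(template.name).strip().lower()
    if name in ("cn", "citation needed"):
        print("tagged as needing a citation:", template)
```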

We don’t have strict data freshness requirements: the system can make predictions based on article data a few days old.

The system is able to list sentences and their scores and, for each sentence, find the paragraph and section heading it belongs to and the revision ID it came from. Citation Hunt can use the database replicas to turn that revision ID into a page_id and do its own processing of the article later, if needed.

Design

The inputs to our system are X % of all articles. The system consumes these inputs by querying the database replicas (the page table) via SQL to generate a list of page_ids, and then calls the MediaWiki API to get the page content for parsing.
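As a sketch of that sampling step, one could query the page table like this; the replica host name and credential path are placeholders, and this is my illustration rather than the project's actual query:

```python
# Sample article page_ids from the page table on a database replica
# (placeholder host and paths; not the project's actual code).
import os
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",  # placeholder replica host
    db="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),  # Toolforge credentials
)
with conn.cursor() as cur:
    # Filtering on page_random keeps sampling cheap compared to
    # ORDER BY RAND() over the whole table.
    cur.execute(
        """
        SELECT page_id, page_title
        FROM page
        WHERE page_namespace = 0
          AND page_is_redirect = 0
          AND page_random > RAND()
        ORDER BY page_random
        LIMIT 100
        """
    )
    sampled = cur.fetchall()

print(sampled[:5])
```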

The output is a database accessible to tool developers in Toolforge; it will not be on the list of public Wikimedia datasets for this version.

Output schema (a SQL sketch follows the list):

  • sentence_id: primary key
  • sentence_text: wikitext of the sentence
  • sentence_context: wikitext of the paragraph that contains sentence_text
  • sentence_score: the model's “citation needed” score for the sentence
  • section_heading: the heading of the section the sentence belongs to
  • revision_id: the revision the sentence was extracted from
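To make the schema concrete, here is a small sketch expressing it as a SQL table, with one example row; the column types are my assumptions for illustration, not the project's final definition:

```python
# Sketch of the output schema as a SQL table (types are assumptions).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sentences (
        sentence_id INTEGER PRIMARY KEY,
        sentence_text TEXT,      -- wikitext of the sentence
        sentence_context TEXT,   -- wikitext of the containing paragraph
        sentence_score REAL,     -- the model's "citation needed" score
        section_heading TEXT,
        revision_id INTEGER
    )
    """
)
conn.execute(
    "INSERT INTO sentences VALUES (?, ?, ?, ?, ?, ?)",
    (1, "The sky is blue.", "The sky is blue. It is vast.", 0.87,
     "Appearance", 123456789),
)
print(conn.execute("SELECT * FROM sentences").fetchall())
```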

The system will be a scheduled batch job deployed in Toolforge, exporting a database readable by any other tool.

Validation

We will use Citation Hunt's internal metrics, which provide statistics on which snippets got fixed.

Tracking and communication

I created a GitHub repository to host my code!

We'll post about this on labs-l (the mailing list for Toolforge developers) and the Village Pump when we're ready. 🙂