Outreachy Internship – Submitting to the Wiki Workshop 2020!

Time flies! It's already the last week of the Outreachy internship :/

Recently I've been learning about:

  • Python's unittest framework, for unit-testing small components of an application.
  • multiprocessing.Pool, for data parallelism.
  • multiprocessing.Manager, for sharing state across processes when distributing TensorFlow model predictions.
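
To make the multiprocessing part concrete, here's a minimal sketch of how a Pool and a Manager can work together. This is my own toy example, not the project's actual code, and the "model" is just a stand-in function:

```python
import multiprocessing as mp

def predict(sentence):
    # Stand-in for the real model call; in the project this would be
    # a TensorFlow prediction. Here it's just a toy length-based score.
    return len(sentence) / 100.0

def score_one(shared_scores, sentence):
    # Each worker writes its result into a Manager-backed dict,
    # which can be safely shared across processes.
    shared_scores[sentence] = predict(sentence)

def score_sentences(sentences, workers=4):
    with mp.Manager() as manager:
        shared_scores = manager.dict()
        with mp.Pool(processes=workers) as pool:
            pool.starmap(score_one, [(shared_scores, s) for s in sentences])
        return dict(shared_scores)
```

The Pool handles the data parallelism (each sentence is scored in a worker process), while the Manager provides a dictionary that all the workers can write to.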

Neither unit testing nor multiprocessing is a new concept to me; both were taught, more or less, at school. This time, however, I learned how to write them in Python and how to solve the issues you face in practice with production-ready code. It's been a really good learning experience.
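
As a small illustration of the kind of unit test I mean (the function here is hypothetical, not a real component of the project), a unittest test case looks like this:

```python
import unittest

def trim_snippet(text, limit=20):
    # Hypothetical small component: truncate a snippet to `limit` characters.
    return text if len(text) <= limit else text[:limit].rstrip() + "..."

class TrimSnippetTest(unittest.TestCase):
    def test_short_text_unchanged(self):
        self.assertEqual(trim_snippet("short"), "short")

    def test_long_text_truncated(self):
        self.assertTrue(trim_snippet("x" * 50).endswith("..."))
```

Running `python -m unittest` discovers and runs every `test_*` method automatically.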

Besides, I feel I'm becoming more and more familiar with Git, having had my PRs in the repository merged one by one over the past few weeks. 😀

One of the most cheerful things is that I just submitted a paper about my internship project to the Wiki Workshop 2020!

Wiki Workshop is a forum bringing together researchers exploring all aspects of Wikipedia, Wikidata, and other Wikimedia projects.

Wiki Workshop has entered its 7th year. Every year it is held in conjunction with The Web Conference (formerly known as the WWW conference), an annual international conference on the World Wide Web, which was held in Lyon, France in 2018 and in San Francisco, USA in 2019. This year the conference will be held in Taipei, Taiwan (my country!) in April 2020.

Working with my mentors to complete a paper submission was a wonderful experience. To be honest, I'm not very good at writing, especially in English. Writing a blog post already takes me a lot of time, not to mention academic writing.

But this time, under the guidance of my mentors, the paper had a clear structure and a summary overview for each section, so I just wrote down what I had done and what I had learned. Another key point is that I had a lot of materials to help me write, such as:

  • meeting notes from our weekly project syncs,
  • comments on GitHub where I discussed code with my mentors,
  • chat logs from the group chat with my mentors.

These materials detail the difficulties I encountered in reaching the project goals, the feasible solutions and the reasoning behind them, and, in the end, how I solved them. With these materials, writing no longer seemed as hard as before.

I'm so happy to have finished a paper submission in collaboration with my mentors! Looking forward to The Web Conference 2020 in April and to meeting folks from the Wikimedia community from all over the world. How exciting! ❤

Outreachy Internship – Progress and Reflection

In the previous post, I mentioned that my internship project has two goals. One is to deliver a database of Wikipedia sentences detected by the "Citation Needed" machine learning model. The other is to integrate with the web application Citation Hunt for browsing Wikipedia snippets that lack a reliable source.

My original project timeline was:

  1. Design research and code development for Citation Detective (Weeks 1-5)
  2. Testing and regular job work for Citation Detective (Weeks 6-8)
  3. Integration with Citation Hunt (Weeks 9-13)


The timeline was modified a bit after I completed code development for Citation Detective and got it merged during Week 6. According to the original timeline, I was going to spend two weeks doing comprehensive testing, including testing on a larger sampled article set, handling possible edge cases, and evaluating system efficiency.

However, my mentor suggested I start looking at ingesting some of the data into Citation Hunt to see what problems we would encounter. In other words, we swapped the tasks for Weeks 6-8 with those for Weeks 9-13.

It turned out to be the right decision. While integrating with Citation Hunt, I ran into some issues, and as a consequence I made a few significant changes to Citation Detective:

  • Save the unprocessed, raw wikitext of sentences in the database
  • Run the model on all individual sentences in a given article

So far, I have completed an initial version of Citation Detective, tested on roughly 10k sampled Wikipedia articles. Also, a PR for the Citation Hunt integration is about to be merged.

A snippet with a highlighted sentence auto-detected by Citation Detective


Working on ingesting data into Citation Hunt took longer than I expected. I spent two weeks on it and am finally about to get a version merged.

This goal took so much time because of two facts I realized only afterward:

  • I should guarantee that Citation Hunt and other tools can consume the data in a friendly way.
  • It is hard to determine from wikitext how many sentences a reference tag or a citation-needed tag applies to.
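
To illustrate the second point with a deliberately naive sketch (the wikitext and the splitting logic here are toy examples of mine, far simpler than what real wikitext requires):

```python
import re

WIKITEXT = (
    "The tower was built in 1887. It was the tallest structure for "
    "41 years.<ref>Smith 2001</ref> It is repainted every seven years."
)

def split_sentences(text):
    # Strip reference tags, then split naively on sentence-ending
    # punctuation. Real wikitext needs much more care (templates,
    # links, abbreviations, nested markup).
    text = re.sub(r"<ref>.*?</ref>", "", text)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

The `<ref>` tag sits right after the second sentence, but it may well be the source for the first sentence too; nothing in the markup says how many of the preceding sentences it covers.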

In fact, I could have realized these two important facts earlier and sped up the code development if I had discussed things more in depth with my mentors.

What I have learned is that, before putting effort into a primary task, it is important to write down my thoughts and discuss them with my mentors to make sure we have the same picture of:

  • What the goal of this PR is
  • How you will achieve it (pseudocode is a good way to explain your logic)
  • Why you decided to do it this way

Sometimes we misunderstand a concept or make a wrong assumption, and if we don't realize it before we start coding, it becomes a big problem: we may end up producing hundreds of useless lines of code. Only through sufficient communication and understanding can we find our blind spots in advance.

So… that's all for this blog post: a progress report of what I've done and my reflections on the first half of the internship. For the second half, I hope I can spend more time proactively expressing my ideas and communicating with my mentors. 🙂