Being a Research Engineer at Wikimedia, part 1

Before officially joining the Wikimedia Foundation, I worked as a research engineer contractor on the research team for about a year. At the foundation, contractors are recruited by different teams to work on specific projects, and the length of the contract depends on the project. In my case, I first got a five-month contract, and then another one-year contract. Where does the foundation find suitable contractors? Open source internship programs like GSoC and Outreachy are good sources of talent. Recently, more and more former Outreachy interns have become contractors at the foundation.

During my year as a research engineer, I worked on various interesting projects that helped me gain practical experience. I decided to write down the things I learned from these projects on the research team. πŸ™‚

1st project – Image Recommendations

“Add an image” feature on mobile (source)

The project aims to automatically recommend images for Wikipedia articles to new editors. With the help of the recommendation system, newcomers can easily make an edit – adding an image to a Wikipedia article. The idea is to encourage newcomers to make edits and increase their engagement on Wikipedia.

The project is a cross-team effort. I remember my first meeting with a product manager from the growth team, where he enthusiastically told me what he expected and imagined for the project and what they were currently doing. The Android team, another team involved, aims to develop a minimum viable product (MVP) on the Wikipedia Android app so as to collect feedback from the community. The research team is responsible for developing the recommendation algorithm behind the application. The platform engineering team is responsible for productionizing the algorithm and developing data pipelines.

My role was to work with data engineers to help productionize the algorithm and to handle on-demand tasks related to the algorithm from other teams.

What I learned

After I started working on the project, I learned about the data platform maintained by the data engineering team, including a data lake, an ingestion system, processing pipelines, and several systems for exploring and visualizing the data. It is the behind-the-scenes platform for Wikipedia and its sister wiki projects, and is different from Toolforge, which I used during my Outreachy internship.

I first had to understand the infrastructure. For this project, I needed to become familiar with query and storage systems like HDFS and Hive (i.e. what data is stored there? how do research scientists get the data?), as well as processing systems like Hadoop clusters, YARN, and Spark (i.e. how are jobs executed at scale?). I was excited to have the opportunity to learn about the production system behind Wikipedia – after all, it is one of the most visited sites in the world.

I also learned the recommendation algorithm used in the project, developed by a research scientist on the research team, including its data sources and processing logic. It is not a machine-learning-based approach but a simple, efficient heuristic, which makes it well suited for production. To fully understand it, I had to get familiar with data processing modules like Spark SQL and PySpark. If you have worked with SQL and Python pandas before, it is not difficult to learn these Spark modules, because they have similar syntax and processing logic; only the mechanics behind them differ, because of the scale of data they are designed to process.
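As a rough illustration of that similarity (not the project's actual code – the table and column names here are made up), here is a typical pandas transformation, with the near-identical PySpark DataFrame version sketched in comments, assuming an existing `SparkSession` named `spark`:

```python
import pandas as pd

# Hypothetical data: articles and whether they already have an image.
df = pd.DataFrame({
    "wiki": ["arwiki", "arwiki", "cswiki", "bnwiki"],
    "article": ["A", "B", "C", "D"],
    "has_image": [False, True, False, False],
})

# pandas: keep unillustrated articles and count them per wiki.
counts = (
    df[~df["has_image"]]
    .groupby("wiki")
    .size()
    .reset_index(name="n_unillustrated")
)

# The PySpark DataFrame API reads almost the same (sketch only):
#   spark_df = spark.table("articles")        # e.g. a Hive table
#   counts = (spark_df
#             .filter(~spark_df.has_image)
#             .groupBy("wiki")
#             .count())
# The difference is the mechanics: Spark builds the query lazily and
# executes it distributed across the cluster, while pandas computes
# eagerly in local memory.
print(counts)
```

The same filter-then-aggregate shape carries over almost one-to-one, which is why prior pandas or SQL experience transfers so well.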

What I’ve done and Current status

Currently, the project is in Iteration 1 and was deployed to 40% of mobile accounts on Arabic, Czech, and Bengali Wikipedia in the last quarter. Evaluation is ongoing. Looking back at the tasks I did leading up to Iteration 1: some were related to the algorithm logic, like adding logic to remove placeholder images and evaluating the algorithm's coverage. Others were related to data analysis: I analyzed the Android proof-of-concept (POC) data and made several charts to help decision-making. I am glad to have been part of the process, which helped me familiarize myself with the production environment and build a foundation for the following projects.

Check out the project meta page if you are interested in learning more about this project! In the next post, I will share another exciting project I was working on at that time – “An end-to-end image classification pipeline” πŸš€
