In the previous post, I mentioned that my internship project has two goals. One is to deliver a database of the Wikipedia sentences detected by the “Citation Needed” machine learning model. The other is to integrate that database into the web application “Citation Hunt”, which lets people browse Wikipedia snippets that lack a reliable source.
My original project timeline was:
- Design research and code development for Citation Detective (Week 1-5)
- Testing and regular job work for Citation Detective (Week 6-8)
- Integration with Citation Hunt (Week 9-13)
The timeline changed a bit after I finished the code development for Citation Detective and got it merged during Week 6. According to the original timeline, I was going to spend two weeks on comprehensive testing: running on a larger sample of articles, dealing with possible edge cases, and evaluating system efficiency.
However, my mentor suggested I start looking at ingesting some of the data into Citation Hunt and see what problems we encounter. In other words, we swapped the tasks for Week 6-8 with the tasks for Week 9-13.
It turned out to be the right decision. While integrating with Citation Hunt, I ran into some issues. As a consequence, I made a few significant changes to Citation Detective, as follows:
- Save unprocessed, raw Wikitext of sentences in the database
- Run all individual sentences in a given article
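To illustrate the two changes above, here is a minimal Python sketch of storing every sentence of an article together with its raw, unprocessed Wikitext. The table name, column names, and the `store_article` helper are hypothetical and chosen only for this example; they are not Citation Detective's actual schema.

```python
import sqlite3

# Hypothetical schema: one row per sentence, keeping the raw Wikitext as-is.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences ("
    "article_id INTEGER, position INTEGER, raw_wikitext TEXT)"
)

def store_article(conn, article_id, sentences):
    """Store every sentence of one article, in order, without cleaning the markup."""
    rows = [(article_id, i, s) for i, s in enumerate(sentences)]
    conn.executemany("INSERT INTO sentences VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

# Made-up sentences that still contain links, bold markup, and templates.
stored = store_article(
    conn,
    1,
    ["'''Foo''' is a [[bar]].", "It opened in 1932.{{citation needed}}"],
)
```

Keeping the markup intact means a consumer such as Citation Hunt can decide for itself how to render or strip it, which is presumably easier than trying to recover markup that was thrown away.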
So far, I have completed an initial version of Citation Detective that runs on roughly 10k sampled Wikipedia articles. Also, a PR for the Citation Hunt integration is about to be merged.
Working on ingesting data into Citation Hunt took longer than I expected. I have spent two weeks on it and am finally about to get a version merged.
It took this long because of two facts I only realized afterward:
- I need to make sure that Citation Hunt and other tools can consume the data easily.
- It is hard to determine from Wikitext how many sentences a reference tag or citation needed tag is applied to.
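To make the second point concrete, here is a minimal Python sketch. The sample Wikitext, the naive sentence splitter, and the helper name `sentences_before_tag` are all made up for illustration and are not taken from Citation Detective. It shows that a trailing `{{citation needed}}` tag tells us where the tag sits, but not whether it covers only the last sentence or the whole run of sentences before it.

```python
import re

# Hypothetical Wikitext: a single tag at the end of a three-sentence paragraph.
WIKITEXT = (
    "The bridge opened in 1932. It was the longest of its kind. "
    "It carries six lanes of traffic.{{citation needed}}"
)

# Matches {{citation needed}} with optional parameters, case-insensitively.
TAG = re.compile(r"\{\{\s*citation needed[^}]*\}\}", re.IGNORECASE)

def sentences_before_tag(wikitext):
    """Return the sentences that precede the first citation needed tag."""
    match = TAG.search(wikitext)
    if match is None:
        return []
    text = wikitext[:match.start()]
    # Naive split: sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

candidates = sentences_before_tag(WIKITEXT)

# Two equally plausible readings; the markup alone cannot pick one.
only_last = candidates[-1:]   # the tag covers just the final sentence
whole_run = candidates        # the tag covers every sentence before it
```

Both `only_last` and `whole_run` are consistent with the same markup, which is why the scope of a tag has to be decided by some convention rather than read off the Wikitext.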
In fact, I could have realized these two important facts earlier and sped up the code development if I had discussed things in more depth with my mentors.
What I have learned is that, before putting effort into a major task, it is important to write down my thoughts and discuss them with my mentors to make sure we share the same picture of:
- What is the goal of this PR
- How will you achieve it (pseudocode is a good way to explain your logic)
- Why did you decide to do it this way
Sometimes we misunderstand a concept or make a wrong assumption, and that becomes a big problem once we start coding if we have not noticed it. We may, in the end, produce hundreds of useless lines of code. Only through sufficient communication and understanding can we find our blind spots in advance.
So... that’s all for this blog post: a progress report on what I’ve done and a reflection on the first half of my internship. For the second half of the internship, I hope I can spend more time proactively expressing my ideas and communicating with my mentors. 🙂