If you’re not already familiar, Rwanda is a country in east-central Africa that is roughly the size of Maryland. Rwanda has one large urban area, Kigali City, but otherwise it is composed of small villages scattered around the country, many of which are surrounded by difficult terrain. I knew none of this a couple of months ago. For the last two months I have been working with an NGO called Bridges to Prosperity (BTP) to build additional infrastructure and bring safer transport to Rwanda.
The representative from BTP gave us two datasets, followed by a seemingly straightforward request: match the village IDs in the Rwandan government’s dataset to BTP’s own dataset, which had recently been created by their field researchers in Rwanda. Easy, right? Nope.
It would be great if every dataset looked like a Kaggle dataset. Unfortunately, with real-life datasets this is rarely the case. The BTP researchers collected this data in a largely undeveloped country, likely entering unfamiliar names into some kind of tablet or laptop. Add to this the fact that the BTP workers were recording bridge sites, not village sites, so a good portion of the entries included locally named areas that were not recognized in the government dataset. Village names were also sometimes entered in parentheses in the cell column, often as something like “Cell 1(Between village 1 and village 2)”. The point is, there was plenty of misspelling and no consistent formatting, a living hell for any data scientist. I became very familiar with writing lengthy conditional functions to handle the various edge cases, and luckily found the FuzzyWuzzy library, which allowed me to use Levenshtein distance calculations to identify the majority of misspelled words.
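To make the cleaning step concrete, here is a minimal sketch of the two ideas described above: pulling a parenthetical note out of a cell entry, and using Levenshtein distance to map a misspelled name onto a known one. This is not the project's actual code. It uses a hand-written edit-distance function instead of FuzzyWuzzy so that it runs without extra installs, and the function names, the 0.8 threshold, and the sample names are all assumptions made for illustration.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale edit distance to 0..1, ignoring case and outer whitespace."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


def best_match(name, candidates, threshold=0.8):
    """Return the closest candidate, or None if nothing clears the
    (hypothetical) threshold."""
    best = max(candidates, key=lambda c: similarity(name, c), default=None)
    if best is None or similarity(name, best) < threshold:
        return None
    return best


# Matches entries shaped like "Cell 1(Between village 1 and village 2)".
PAREN = re.compile(r"^(?P<cell>[^()]*)\((?P<note>[^()]*)\)\s*$")


def split_cell_entry(raw: str):
    """Split a cell entry into (cell name, parenthetical note or None)."""
    m = PAREN.match(raw)
    if not m:
        return raw.strip(), None
    return m.group("cell").strip(), m.group("note").strip()
```

For example, `best_match("Kigalli", ["Kigali", "Example Village"])` picks `"Kigali"`, because one edit across seven characters still clears the threshold. A name with no close candidate returns `None` rather than a bad guess.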
Once the cleaning was done, we started matching the village IDs to the BTP dataset. This proved equally difficult, as there were multiple village entries that had the same name but were actually located in different districts. The same problem applied to cells, the Rwandan equivalent of a county. This was the first of many issues we ran into while matching this data. I have attached a figure below to outline the general process that we took, but within that process there were several exceptions that required further investigation into other features within the dataset. I’d also like to emphasize that if we were not able to choose an ID with a high degree of confidence, we left that ID empty so as to avoid falsely reporting IDs. In the end we were able to match 1306 village IDs out of the 1424 unique values, roughly 92%.
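The matching rule described above (narrow same-named villages by district and cell, and leave the ID empty when the choice is still ambiguous) could be sketched as follows. This is an illustration under assumptions rather than the pipeline we actually ran: the field names `village`, `cell`, `district`, and `village_id`, the helper names, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def normalize(value):
    """Lowercase and trim so trivially different spellings compare equal."""
    return (value or "").strip().lower()


def build_index(gov_rows):
    """Map each village name to every (district, cell, id) that uses it."""
    index = defaultdict(set)
    for row in gov_rows:
        index[normalize(row["village"])].add(
            (normalize(row["district"]), normalize(row["cell"]), row["village_id"])
        )
    return index


def match_village_id(site, index):
    """Return a village ID only when exactly one candidate survives."""
    candidates = index.get(normalize(site.get("village")), set())
    # Narrow by district, then by cell, whenever the site record has them.
    for field, position in (("district", 0), ("cell", 1)):
        value = normalize(site.get(field))
        if value:
            candidates = {c for c in candidates if c[position] == value}
    ids = {c[2] for c in candidates}
    # Zero or several remaining IDs means low confidence: leave it empty.
    return ids.pop() if len(ids) == 1 else None


# Hypothetical government rows: "Village A" exists in two districts.
gov_rows = [
    {"village_id": "V001", "village": "Village A", "cell": "Cell 1", "district": "District X"},
    {"village_id": "V002", "village": "Village A", "cell": "Cell 2", "district": "District Y"},
    {"village_id": "V003", "village": "Village B", "cell": "Cell 1", "district": "District X"},
]
index = build_index(gov_rows)

with_district = match_village_id({"village": "Village A", "district": "District Y"}, index)
without_district = match_village_id({"village": "Village A"}, index)
```

Here `with_district` resolves to `"V002"`, while `without_district` stays `None` because the name alone matches two villages. Returning `None` instead of the first hit is what keeps a wrong ID from being reported as if it were correct.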
Once this part of the project was completed, we exposed the data through a few FastAPI endpoints so that the web developers could display it on the website. They did a great job of displaying our data intuitively for the BTP team. You can view the DS GitHub repository here. We believe this work will advance the mission that Bridges to Prosperity has undertaken, and we hope for many more footbridges to come!