An internship on BelHisFirm

Blog by intern Heike Bekaert

During the summer I contributed to the BelHisFirm project, where, with a little help from the GhentCDH team, I tested and trained a Named Entity Recognition (NER) model that recognizes organisations, people, locations and more. This student job was an exciting opportunity to elevate my Digital Humanities skills. In this blog post, I will share what I worked on.

What is BelHisFirm?

The BelHisFirm project aims to build a database of Belgian companies based on information from the Moniteur Belge. This source contains enormous amounts of information, which has never been digitized. Manually entering all these records would be a slow and tedious process. That’s where Artificial Intelligence techniques, such as Named Entity Recognition (NER), can lend a hand.

The process of training a custom NER model

Using an existing NLP framework called FLAIR and documentation from GhentCDH researcher Tess Dejaeghere, I set out to create my very own NER model. My main task was to create and train a model that could automatically recognize a variety of entities. With little prior experience in programming or large language models, this was a challenging but rewarding assignment.

Labelling the training data

As every student knows, quality data is key to a good final result, and the same applies to NER models. Both students and NER models need their text to be prepared. For students this preparation takes the form of notes, and for my model it takes the form of ‘labels’. These labels are the ground truth on which my model is trained. The first step, therefore, was to manually label several pages from the Moniteur Belge with the appropriate tags. Although the process was sometimes tedious, creating a ground-truth dataset made me familiar with the source material and helped me understand potential pitfalls of the model.
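To give an idea of what such labels can look like, here is a minimal Python sketch that turns labelled spans into token-level tags. It assumes the widely used BIO scheme (B = beginning of an entity, I = inside, O = outside). The example sentence, the company name and the label names PER, JOB, ORG and LOC are invented for illustration and are not the project's actual data or tag set.

```python
def to_bio(tokens, spans):
    """Convert labelled spans into one BIO tag per token.

    tokens: list of strings
    spans:  list of (start, end, label) token positions, with end exclusive
    """
    tags = ["O"] * len(tokens)  # every token starts as "outside any entity"
    for start, end, label in spans:
        tags[start] = f"B-{label}"        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # remaining tokens of the entity
    return list(zip(tokens, tags))


# Invented example sentence; labels are hypothetical.
tokens = ["Jan", "Peeters", ",", "director", "of",
          "Brouwerij", "De", "Ster", ",", "Gent"]
spans = [(0, 2, "PER"), (3, 4, "JOB"), (5, 8, "ORG"), (9, 10, "LOC")]

rows = to_bio(tokens, spans)
for token, tag in rows:
    print(f"{token} {tag}")  # one "token tag" pair per line
```

Printing one token and its tag per line gives a simple column layout, which is a common way to store training data for sequence labelling.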

At the same time, I worked through an introductory Python course. With plenty of frustration and a few thrilling eureka moments when I finally spotted that one missing bracket, I gradually became comfortable with some basic programming.

Training the model

Once the ground-truth data and coding skills were in place, I started training my NER model. I fed the training data to the system and started experimenting with different code configurations. As in any programming project, this included a fair share of errors and bug hunting. Fortunately, thanks to the team’s guidance, I learned a lot from the process and was able to complete the training of my NER model.

By the end of my student job, I had obtained very promising results. The model could almost perfectly extract information about people, their jobs, and their corporate titles. Extracting company data and legal events (e.g. the registration of meeting minutes) proved more difficult because the training set did not include enough examples. While there is still plenty of room for improvement, I think the results create a solid baseline for future work.
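A comparison per entity type like the one above is often expressed as precision, recall and F1 for each label. The sketch below shows one common way to compute these in Python by matching predicted entities exactly against the ground truth. The spans and labels are toy values made up for illustration, not the project's results, and exact span matching is an assumption rather than a description of how the model was actually evaluated.

```python
from collections import Counter


def per_label_scores(gold, predicted):
    """Compute precision, recall and F1 per label using exact span matching.

    gold, predicted: lists of (start, end, label) tuples.
    """
    gold_set, pred_set = set(gold), set(predicted)
    # A prediction counts as correct only if span and label both match.
    correct = Counter(label for _, _, label in gold_set & pred_set)
    n_gold = Counter(label for _, _, label in gold_set)
    n_pred = Counter(label for _, _, label in pred_set)

    scores = {}
    for label in sorted(set(n_gold) | set(n_pred)):
        p = correct[label] / n_pred[label] if n_pred[label] else 0.0
        r = correct[label] / n_gold[label] if n_gold[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1,
                         "support": n_gold[label]}
    return scores


# Toy data: the ORG prediction misses the last token of the company name.
gold = [(0, 2, "PER"), (3, 4, "JOB"), (5, 8, "ORG"), (12, 14, "PER")]
predicted = [(0, 2, "PER"), (3, 4, "JOB"), (5, 7, "ORG"), (12, 14, "PER")]

scores = per_label_scores(gold, predicted)
for label, s in scores.items():
    print(label, s)
```

The "support" value is the number of ground-truth examples per label, which makes it easy to spot entity types that are underrepresented in the labelled data.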

Reflection

At the end of my four-week student job, I presented my results during a final meeting with the BelHisFirm partners in Antwerp: a great way to round off an enriching month. I look back very fondly on this experience at GhentCDH. I was welcomed into a warm and supportive environment full of generous people eager to share their knowledge.

I especially want to thank Bas Vercruysse and Vincent Ducatteeuw for their guidance, patience, and trust throughout the process. It was a truly inspiring introduction to research in the Digital Humanities and made me eager to keep exploring this field!

Related projects

BelHisFirm: long-term firm-level data for the social sciences