Erasmus Projects 2003
Background
This project is part of the Erasmus course on language processing given in the Master's programme in Knowledge and Information Management at the University of Ghent, Belgium. The responsible professor is Fernand Vandamme.
Extracting information from newspaper clippings
The objective of the project is to extract tabulated information from texts so that it can be stored in a database-compatible format.
You may use any programming language you like, typically Perl, Java, or Prolog.
A project report written by LTH students on a similar topic is available here: http://www.cs.lth.se/Education/Courses/EDA170/Reports2003/erik_johan.pdf
Detailed steps of the project
- Select a development corpus of 10 to 20 press clippings from Flemish newspapers describing a football match. You may also choose another language and another sport.
- Write regular expressions and/or phrase-structure rules to extract the score, the teams, and the winner.
- Try to write expressions or rules that are general enough and not overfitted to the development corpus.
- Your program may include a small database of names but also rules to guess team names.
- You may also use inference rules to guess part of the information you want to extract.
- Develop your system, test it, and improve it with the development corpus. Once you have decided the program is finished, you should not modify it.
- Select a new corpus -- the test corpus -- of 10 to 20 new press clippings and test the performance of your system.
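The extraction steps above can be sketched as follows. This is only a minimal illustration in Python (the project itself may be written in Perl, Java, or Prolog). The team list `KNOWN_TEAMS`, the English match verbs in `GUESS_RE`, the function names, and the sample sentences are all assumptions made for the example. The winner is inferred under the further assumption that the first number of the score belongs to the first team mentioned, which will not hold for every clipping.

```python
import re

# Hypothetical small database of team names (an assumption for this sketch).
KNOWN_TEAMS = ["Club Brugge", "Anderlecht", "Genk"]

# A score such as "2-1" or "3 - 0".
SCORE_RE = re.compile(r"\b(\d{1,2})\s*-\s*(\d{1,2})\b")

# Rule to guess unknown team names: two capitalized names joined by a match verb.
# The English verbs are placeholders; a Flemish corpus would need its own list.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
GUESS_RE = re.compile(rf"({NAME}) (?:beat|defeated|drew with|lost to) ({NAME})")


def find_teams(text):
    """Return the two teams, using the name database first and the guess rule second."""
    hits = sorted((text.find(t), t) for t in KNOWN_TEAMS if t in text)
    teams = [name for _, name in hits]
    if len(teams) >= 2:
        return teams[0], teams[1]
    match = GUESS_RE.search(text)
    if match:
        return match.group(1), match.group(2)
    return None


def extract_match(text):
    """Extract teams, score, and winner as a flat record, or None if something is missing."""
    score = SCORE_RE.search(text)
    teams = find_teams(text)
    if score is None or teams is None:
        return None
    goals1, goals2 = int(score.group(1)), int(score.group(2))
    # Inference rule (assumption): the first number belongs to the first team mentioned.
    if goals1 > goals2:
        winner = teams[0]
    elif goals2 > goals1:
        winner = teams[1]
    else:
        winner = None  # draw
    return {"team1": teams[0], "team2": teams[1],
            "score1": goals1, "score2": goals2, "winner": winner}


if __name__ == "__main__":
    print(extract_match("Anderlecht beat Genk 2-1 on Saturday."))
    print(extract_match("Lokeren drew with Sint Truiden 0-0."))
```

Each result is a flat dictionary with fixed keys, so it maps directly onto one row of a database table. Keeping the name database and the guess rule separate makes it easier to see which of the two is responsible when a team is missed on the test corpus.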
Report
Write a one- to two-page summary, in Flemish, English, or French, that describes the general structure of your system and its performance.
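One possible way to quantify the performance reported in the summary is field-level precision and recall on the test corpus. The assignment does not prescribe a metric, so this choice, the function name `score_system`, and the field names below are assumptions made for illustration. The sketch expects one hand-annotated reference record per clipping and either a record or `None` (nothing extracted) from the system.

```python
FIELDS = ("team1", "team2", "score1", "score2", "winner")


def score_system(predicted, gold, fields=FIELDS):
    """Return (precision, recall) computed over individual fields.

    predicted: list of dicts, or None where the system extracted nothing
    gold:      list of hand-annotated dicts, one per clipping
    """
    correct = extracted = expected = 0
    for pred, ref in zip(predicted, gold):
        expected += len(fields)          # every reference field should be found
        if pred is None:
            continue                     # nothing extracted for this clipping
        extracted += len(fields)
        correct += sum(1 for f in fields if pred.get(f) == ref[f])
    precision = correct / extracted if extracted else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = [
        {"team1": "Anderlecht", "team2": "Genk", "score1": 2, "score2": 1,
         "winner": "Anderlecht"},
        {"team1": "Lokeren", "team2": "Gent", "score1": 0, "score2": 0,
         "winner": None},
    ]
    predicted = [dict(gold[0]), None]    # second clipping was missed entirely
    print(score_system(predicted, gold))
```

Precision shows how often an extracted field is right, and recall shows how much of the reference information was found. Reporting both makes it visible when a system is accurate only because it skips difficult clippings.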
Project Reports
Students who completed the project:
- Liqing Zhang [report] [program]
- Pamela Kalle [report] [program]