
Computer Science

Faculty of Engineering, LTH


Erasmus Projects 2003


Background

This project is part of the Erasmus course on language processing given in the Master's programme in Knowledge and Information Management at the University of Ghent, Belgium. The responsible professor is Fernand Vandamme.

Extracting information from newspaper clippings

The objective of the project is to extract tabular information from texts so that it can be stored in a database-compatible format.

You may use any programming language you like; typical choices are Perl, Java, and Prolog.
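To make "database-compatible format" concrete, here is a minimal sketch in Python, chosen only for illustration. It assumes one table row per match report. The table name `matches`, its columns, and the helper `store_match` are hypothetical and not prescribed by the project.

```python
import sqlite3

# Hypothetical schema: one row per clipping; winner is NULL for a draw.
SCHEMA = """
CREATE TABLE IF NOT EXISTS matches (
    home_team  TEXT NOT NULL,
    away_team  TEXT NOT NULL,
    home_goals INTEGER NOT NULL,
    away_goals INTEGER NOT NULL,
    winner     TEXT
)
"""


def store_match(conn, record):
    """Insert one extracted record (a dict keyed by column name)."""
    conn.execute(
        "INSERT INTO matches VALUES "
        "(:home_team, :away_team, :home_goals, :away_goals, :winner)",
        record,
    )
    conn.commit()


# In-memory database and an invented record, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
store_match(conn, {
    "home_team": "Team A",
    "away_team": "Team B",
    "home_goals": 2,
    "away_goals": 1,
    "winner": "Team A",
})
rows = conn.execute(
    "SELECT home_team, away_team, home_goals, away_goals, winner FROM matches"
).fetchall()
print(rows)
```

A CSV file with the same columns would serve equally well; the point is that each clipping is reduced to one fixed-width record.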

A project report by LTH students on a similar topic is available here: http://www.cs.lth.se/Education/Courses/EDA170/Reports2003/erik_johan.pdf

Detailed steps of the project

  • Select a development corpus of 10 to 20 press clippings from Flemish newspapers, each describing a football match. You may also choose another language and another sport.
  • Write regular expressions and/or phrase-structure rules to extract the score, the teams, and the winner.
    • Try to write expressions or rules that are general enough and not overfitted to the development corpus.
    • Your program may include a small database of names as well as rules to guess team names.
    • You may also use inference rules to guess part of the information you want to extract.
  • Develop your system, test it, and improve it with the development corpus. Once you have decided the program is finished, you should not modify it.
  • Select a new corpus -- the test corpus -- of 10 to 20 new press clippings and test the performance of your system.
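The extraction step above can be sketched as follows, again in Python for illustration only. The sketch assumes a headline-like layout of the form "home - away goals-goals" and guesses that a team name is one to three capitalised tokens. Both assumptions, the names `TEAM`, `RESULT`, and `extract`, and the example sentence with its fictional teams are invented here; real clippings will need more patterns. The winner is not read from the text but inferred from the score, as a simple instance of an inference rule.

```python
import re

# Assumed guess rule: a team name is one to three capitalised tokens.
TEAM = r"[A-Z][\w.]*(?: [A-Z][\w.]*){0,2}"

# Assumed layout: "<home> - <away> <home goals>-<away goals>".
RESULT = re.compile(
    rf"(?P<home>{TEAM})\s*-\s*(?P<away>{TEAM})\s+(?P<hg>\d+)\s*-\s*(?P<ag>\d+)"
)


def extract(text):
    """Return teams, score, and winner, or None if no result is found."""
    m = RESULT.search(text)
    if m is None:
        return None
    home, away = m["home"], m["away"]
    hg, ag = int(m["hg"]), int(m["ag"])
    # Inference rule: the winner follows from the score; None means a draw.
    if hg > ag:
        winner = home
    elif ag > hg:
        winner = away
    else:
        winner = None
    return {"home": home, "away": away, "score": (hg, ag), "winner": winner}


# Invented example sentence with fictional team names.
example = "Noordstad - FC Zuidhaven 2-1: de thuisploeg wint na een late goal."
print(extract(example))
```

A single pattern like this is easy to overfit to the development corpus. Running it unchanged on the test corpus shows how well it generalises.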

Report

Write a summary of one to two pages, in Flemish, English, or French, that describes the general structure of your system and its performance.
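One possible way to quantify performance on the test corpus is per-field accuracy: the fraction of clippings for which each field was extracted correctly. The sketch below is an assumption about how this could be measured, not a required metric. The function name `field_accuracy`, the field names, and the toy data are hypothetical.

```python
def field_accuracy(predictions, gold,
                   fields=("home", "away", "score", "winner")):
    """Per-field fraction of clippings extracted correctly.

    predictions: list of dicts, or None where the system found nothing.
    gold: list of hand-annotated dicts, aligned with predictions.
    """
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and aligned")
    correct = {f: 0 for f in fields}
    for pred, ref in zip(predictions, gold):
        if pred is None:
            continue  # a missed clipping counts as wrong for every field
        for f in fields:
            if pred.get(f) == ref[f]:
                correct[f] += 1
    return {f: correct[f] / len(gold) for f in fields}


# Toy data, invented for illustration: one correct extraction, one miss.
gold = [
    {"home": "A", "away": "B", "score": (2, 1), "winner": "A"},
    {"home": "C", "away": "D", "score": (0, 0), "winner": None},
]
predictions = [
    {"home": "A", "away": "B", "score": (2, 1), "winner": "A"},
    None,
]
accuracy = field_accuracy(predictions, gold)
print(accuracy)
```

Reporting the figures for the development corpus and the test corpus side by side makes any overfitting visible.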

Project Reports

List of students that completed the project:
