Computer Science

Faculty of Engineering, LTH

This page is provisional. It will be updated in due time.

I will live stream the lectures. The Zoom link is:
Enter the password: 75012

I opened a chat room so that students can discuss topics regarding the course and the labs. The link is:
Enter the password: 75019
It is open only to students registered at Lund University.


^Chapter 1: An overview of language processing (30/08/2021) [pdf] [first ed. pdf]

^Chapter 2: Corpus processing tools (30/08/2021 and 2/09/2021) [pdf] [first ed. pdf]

^Chapter 3: Encoding and annotation schemes (3/09/2020) [pdf] [first ed. pdf]

^Chapter 4: Topics in information theory and machine learning (7/09/2020) [pdf] [first ed. pdf]

  • Contents:
    • Topics in information theory
    • Using scikit-learn, a popular machine learning toolkit
  • Lecture slides: [pdf].
  • Resources:
    • Machine-learning software:
      • scikit-learn, an excellent data mining library for Python
      • C4.5, ID3's successor, by Ross Quinlan
      • Weka, comprehensive data mining software in Java
      • LIBSVM, an efficient implementation of support vector machines
      • LIBLINEAR, a library for large linear classification
    • Courses on machine learning:

^Chapter 5: Counting words (7, 10, and 14/09/2020) [pdf] [first ed. pdf]

  • Contents:
    • Tokenization
    • N-grams
    • Counting words and N-grams
    • Probability of a word sequence
    • Smoothing
    • Collocations and other statistics
    • Embeddings
  • Lecture slides: Three parts: [pdf], [pdf], [pdf].
  • Python programs:
    • The notebooks of this chapter: [1], [2], and [3]
    • Simple tokenizers [1a], [1b] and a more complex one by Gregory Grefenstette [2]
    • Another popular tokenizer by Robert MacIntyre, original version in sed [3] and its translation into Perl [4]
    • Counting unigrams [5] and bigrams [6]
    • Mutual information [7], t-scores [8], and the log-likelihood ratio [9]
  • Java programs to tokenize text and count words and bigrams [Java]. Run them on your corpus. You can also count the words in the output of the tokenization program using the Unix sort and uniq commands.
  • Demonstrations:
    • A collocation demo from the Corpus Linguistics group at FAU Erlangen-Nürnberg.
  • Software and resources:
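
The counting step mentioned above (tokenize, then count with sort and uniq) can also be sketched in a few lines of Python. This is only an illustration and not one of the course programs: the function names, the letter-only tokenization rule, and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens.

    Assumption: a token is a maximal sequence of Unicode letters;
    digits and punctuation are simply dropped.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def count_unigrams(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def count_bigrams(tokens):
    """Count how often each pair of adjacent tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical sample text, used only to show the calls.
sample = "The cat sat on the mat. The cat slept."
tokens = tokenize(sample)
unigrams = count_unigrams(tokens)
bigrams = count_bigrams(tokens)

# Print the counts by decreasing frequency, like `sort | uniq -c | sort -rn`.
for word, freq in unigrams.most_common():
    print(freq, word)
for pair, freq in bigrams.most_common():
    print(freq, " ".join(pair))
```

On a real corpus, you would replace the sample string with the contents of a file and possibly use one of the tokenizers listed above instead of the regular expression.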

^Chapter 6: Words, parts of speech, and morphology (14/09/2020) [pdf] [first ed. pdf]

^Chapter 7: Part-of-speech tagging using rules (17/09/2020) [pdf] [first ed. pdf]

  • Contents:
    • Part-of-speech tagging with symbolic rules
    • Annotation standards for parts of speech (tagsets)
  • Lecture slides: [pdf].
  • Annotation manuals and corpora:
    • Universal Dependencies: multilingual annotated corpora
    • BNC, the British National Corpus, an annotated corpus in English following the Text Encoding Initiative (TEI)
    • SUC, the Stockholm-Umeå Corpus, an annotated corpus in Swedish
    • Negra, an annotated corpus in German
    • An inventory of available corpora compiled by a group at Stanford
  • Software:

^Chapter 8: Part-of-speech tagging using stochastic techniques (17/09/2020) [pdf] [first ed. pdf]

^Chapter 9: Phrase-structure grammars in Prolog (not taught in 2020) [pdf] [first ed. pdf]

  • Contents:
    • Constituents, trees
    • Using Prolog to do natural language analysis, DCG rules, variables
    • Getting the syntactic structure
    • Compositional analysis to get the semantic structure
  • Lecture slides: [pdf]
  • Prolog programs:
    • Two small DCG grammars [1] [2]
    • A tokenizer using Prolog clauses [3] and another one using DCG rules [4].
    • A small interpreter of regular expressions in Prolog by Robert Cameron [5].
  • Application examples:
    • The grammar checker in MS Word whose parser uses phrase-structure rules.
    • The natural language group at Microsoft Research.

^Chapter 10: Partial parsing (17 and 24/09/2020) [pdf] [first ed. pdf]

  • Contents:
    • ELIZA: word spotting and pattern matching
    • Multiwords and named entities
    • Noun groups and verb groups
    • Partial parsing: multiword and group detection in Prolog
    • Partial parsing: statistical techniques
    • Information extraction
    • Precision, recall, and F-measure (harmonic mean)
  • Lecture slides: [pdf]
  • Prolog programs:
    • Prolog predicates to write local DCG grammars with simple noun group and verb group rules [1].
  • Documents:
    • Many interesting papers on partial parsing by Steven Abney.
    • An application example of information extraction: the FASTUS system from SRI.
    • Carsim, a system to generate animated 3D scenes from text that uses information extraction techniques.
  • Annotated corpora and evaluation resources:
  • Demonstrations:
    • CiceroLite, a system to extract named entities
    • AlchemyAPI, a system to identify people, organizations, locations, and categorize text
    • Calais, an information extraction system
    • Visualizing and monitoring events and disasters on a map at EMM labs, part of the Europe Media Monitor. The information extraction part of the event detector. A key to the symbols used is available from this page. See also their name explorer.
  • Software:
  • Annotation resources:
    • The MUC site
    • PEAS, a group annotation scheme for French
    • TüPP-D/Z, Tübingen Partially Parsed Corpus of Written German
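
The evaluation measures listed in the contents of this chapter (precision, recall, and the F-measure as their harmonic mean) can be illustrated with a short Python sketch. The function name, the representation of an entity as a (start, end, label) tuple, and the example sets are assumptions made for the illustration; they do not come from the course material.

```python
def precision_recall_f1(predicted, gold):
    """Compare two collections of items and return (precision, recall, F1).

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold items
    F1        = harmonic mean of precision and recall
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical named entities, each encoded as (start, end, label).
gold_entities = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG"), (12, 13, "LOC")}
predicted_entities = {(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG")}

p, r, f = precision_recall_f1(predicted_entities, gold_entities)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Here an entity counts as correct only if both its span and its label match the gold annotation exactly; other matching criteria are possible and would change the scores.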

^Chapter 11: Syntactic formalisms (24/09 and 01/10/2020) [pdf] [first ed. pdf]

  • Contents:
    • Constituency and dependency
    • Phrase categories
    • Unification-based grammars
    • Dependency grammars
    • Valence and subcategorization frames
    • Functions
  • Lecture slides: [pdf]
  • Prolog programs:
    • Some simple DCG rules for German noun phrases [1]
    • The generalized unification [2].
    • Detection of nonprojective links in a dependency tree [3] and examples of graphs [4].
    • A program to convert the CoNLL-X file format into a Prolog clause [5]. Useful with the nonprojectivity detection.
  • Corpus and programming resources:
    • More than 60 annotated corpora in multiple languages from the Universal Dependencies site.
    • Four freely available annotated dependency corpora, Danish, Dutch, Portuguese, and Swedish, and links to seven others from the CoNLL-X shared task.
    • The Susanne corpus, a free treebank for English.
    • A French treebank from Université Paris VII (available with a license).
    • Tables lexique-grammaire, subcategorization frames in French available from Université de Marne-la-Vallée.
    • The LTH converter to convert constituent trees using the Penn Treebank annotation into dependency graphs.
  • Lexical and grammar resources:
  • Annotation resources:
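
The detection of nonprojective links mentioned among the Prolog programs above can be sketched in Python as well. This is not a translation of the course program: it is a minimal illustration that assumes one common definition (a link is projective when its head dominates every word lying between the head and the dependent), a tree stored as a list of head indices with 0 for the root, and a hand-annotated example sentence.

```python
def dominates(ancestor, node, heads):
    """Return True if `ancestor` is `node` or is reached by following heads."""
    seen = set()
    while node not in seen:
        if node == ancestor:
            return True
        if node == 0:  # reached the artificial root without meeting ancestor
            return False
        seen.add(node)
        node = heads[node]
    return False  # a cycle: the input is not a well-formed tree


def nonprojective_arcs(heads):
    """Return the (head, dependent) links that are not projective.

    heads[i] is the index of the head of word i; words are numbered from 1,
    heads[0] is unused, and a head of 0 denotes the root.
    """
    arcs = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        low, high = min(head, dep), max(head, dep)
        if any(not dominates(head, k, heads) for k in range(low + 1, high)):
            arcs.append((head, dep))
    return arcs


# Illustrative annotation (an assumption of this sketch) of:
# 1 A  2 hearing  3 is  4 scheduled  5 on  6 the  7 issue  8 today
heads = [None, 2, 3, 0, 3, 2, 7, 5, 4]
print(nonprojective_arcs(heads))
```

With this annotation, the links hearing → on and scheduled → today are reported, since each spans words that its head does not dominate.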

^Chapter 12: Constituent parsing (not taught in 2020) [pdf] [first ed. pdf]

  • Contents:
    • Top-down and bottom-up strategies
    • The shift-reduce algorithm
    • Earley's algorithm
    • Statistical parsing and PCFG
  • Lecture slides: [pdf]
  • Prolog programs:
    • A shift-reduce parser [1]
    • Earley's parser [2]
  • Corpus resources:
  • Parsers resources:
  • On-line parsers:

^Chapter 13: Dependency parsing (01/10/2020) [pdf] [first ed. pdf]

  • Contents:
    • Dependency parsing
    • Nivre's parser
  • Lecture slides: [pdf]
  • Prolog programs:
    • Joakim Nivre's dependency parser [3].
    • Updates to the book:
      • Nivre's parser to parse an annotated corpus (gold standard parsing) [4] and an improved version of Nivre's parser [5].
      • Utilities to parse a CoNLL 2006 or 2007 corpus [6] [7] [8].
      • The Swedish corpus used in CoNLL 2006 and formatted as a Prolog clause. Training set [9] and test set [10].
  • Corpus resources:
    • More than 60 annotated corpora in multiple languages from the Universal Dependencies site.
    • Four freely available annotated dependency corpora, Danish, Dutch, Portuguese, and Swedish, and links to seven others from the CoNLL-X shared task. Seven more corpora with the same annotation, Basque, Catalan, Chinese, Greek, Hungarian, Italian, and Turkish, from the CoNLL 2007 shared task.
  • Parsers resources:
  • On-line parsers:

^Chapter 14: Semantics and predicate logic (08/10/2020) [pdf] [first ed. pdf]

^Chapter 15: Lexical semantics (08/10/2020) [pdf] [first ed. pdf]

  • Contents:
    • Words and meaning
    • Lexical semantics
    • Lexical networks
    • Word sense disambiguation
    • Case grammars
    • Frame semantics and semantic roles
    • Semantic grammars
  • Lecture slides: [pdf]. Anders Björkelund's presentation of his thesis on semantic role labeling [pdf].
  • Resources:
    • Lexical databases:
    • Sense identification:
      • SemCor, the Brown corpus tagged with WordNet senses. The tagging was originally done at Princeton with WordNet 1.6. Since then, the WordNet team has reorganized the sense nomenclature, and the different corpora are mappings to the successive versions of the WordNet senses.
    • Semantic role labeling:
    • Semantic role labeling software:

^Chapter 16: Discourse (15/10/2020) [pdf] [first ed. pdf]

  • Contents:
    • Definition of discourse
    • Discourse entities
    • Reference and anaphora
    • Rhetorical structure theory (RST)
    • Parsing a text
    • Machine learning to discover RST relations
    • TimeML
  • Lecture slides: [pdf]
  • Annotation and evaluation resources:
    • The coreference annotation manual used in MUC-7 by Hirschman and Chinchor.
    • A paper on coreference evaluation by Vilain et al. (1995).
    • An annotation manual for Rhetorical structure theory from the University of Southern California's Information Sciences Institute.
    • Another annotation manual for the Penn Discourse Treebank.
    • TimeML, a markup language for temporal and event expressions.
  • Corpus resources:
  • Demonstrations:

^Chapter 17: Dialogue (15/10/2020) [pdf] [first ed. pdf]

  • Contents:
    • Dialogue automata
    • Pairs
    • Speech acts
    • Speech act recognition
  • Lecture slides: [pdf]
  • Resources:
    • DAMSL, a dialogue markup scheme from the University of Rochester.
    • Dialogue acts in Verbmobil and Verbmobil-2 [1] [2].
    • The TRAINS corpus and annotated files from the University of Rochester.
  • VoiceXML, a markup framework to develop dialogue applications:
  • Application examples:
    • A train information system in Swedish from SJ. Call 0046 771-75-75-75.
    • A paper by Johan Boye, Mats Wirén, Manny Rayner, Ian Lewin, David Carter, and Ralph Becket, "Language-Processing Strategies and Mixed-Initiative Dialogues", IJCAI-99 Workshop on Knowledge and Reasoning in Practical Dialogue Systems, July 1999.

^Complement: Speech synthesis (15/10/2020)

^Complement: Speech recognition (15/10/2020)
