lunduniversity.lu.se

Computer Science

Faculty of Engineering, LTH

Lectures

This page is provisional. It will be updated in due time.

Contents

^Chapter 1: An overview of language processing (29/08/2022) [pdf] [first ed. pdf]

^Chapter 2: Corpus processing tools (29/08/2022 and 1/09/2022) [pdf] [first ed. pdf]

^Chapter 3: Encoding and annotation schemes (1/09/2022) [pdf] [first ed. pdf]

^Chapter 4: Topics in information theory and machine learning (1/09/2022) [pdf] [first ed. pdf]

  • Contents:
    • Topics in information theory
    • Using scikit-learn, a popular machine-learning toolkit
  • Lecture slides: [pdf].
  • Resources:
    • Machine-learning software:
      • scikit-learn, a widely used machine-learning library for Python
      • C4.5, ID3's successor, by Ross Quinlan
      • Weka, a comprehensive data mining toolkit in Java
      • LIBSVM, an efficient implementation of support vector machines
      • LIBLINEAR, a library for large-scale linear classification
    • Courses on machine learning:
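
As a small illustration of the information-theory topics listed for this chapter, the sketch below computes the Shannon entropy of the empirical distribution of a sequence of symbols. The function name entropy and the sample strings are assumptions made for this example; they are not taken from the lecture material.

```python
import math
from collections import Counter


def entropy(symbols):
    """Return the Shannon entropy, in bits, of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # H = sum over symbols of p * log2(1 / p), with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(entropy("abab"))  # two equiprobable symbols
print(entropy("aaaa"))  # a single repeated symbol
```

A sequence with a single repeated symbol carries no uncertainty, while a uniform distribution over k symbols gives log2(k) bits.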

^Chapter 5: Counting words (5/09/2022) [pdf] [first ed. pdf]

  • Contents:
    • Tokenization
    • N-grams
    • Counting words and N-grams
    • Probability of a word sequence
    • Smoothing
    • Collocations and other statistics
    • Embeddings
  • Lecture slides in three parts: [pdf], [pdf], [pdf].
  • Python programs:
    • The notebooks of this chapter: [1], [2], and [3]
    • Simple tokenizers [1a], [1b] and a more complex one by Gregory Grefenstette [2]
    • Another popular tokenizer by Robert MacIntyre, original version in sed [3] and its translation into Perl [4]
    • Counting unigrams [5] and bigrams [6]
    • Mutual information [7], t-scores [8], and the log-likelihood ratio [9]
  • Java programs to tokenize text and to count words and bigrams [Java]. Run them on your corpus. You can also count the words in the output of the tokenization program with the Unix sort and uniq commands.
  • Demonstrations:
    • A collocation demo from the Corpus Linguistics group at FAU Erlangen-Nürnberg.
  • Software and resources:
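
The word-counting step of this chapter can also be sketched in a few lines of Python, with a Counter playing the role of the Unix sort and uniq commands. This is a minimal sketch, not one of the course programs: the regex tokenizer, the function names, and the sample sentence are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple regex tokenizer)."""
    return re.findall(r"\w+", text.lower())


def count_unigrams(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def count_bigrams(tokens):
    """Count how often each pair of adjacent tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))


text = "The cat sat on the mat. The cat slept."
tokens = tokenize(text)

# Print the most frequent words first, one "frequency word" pair per line
for word, freq in count_unigrams(tokens).most_common(3):
    print(freq, word)

for bigram, freq in count_bigrams(tokens).most_common(1):
    print(freq, bigram)
```

The bigram counts, together with the unigram counts, are the raw material for the collocation measures listed above, such as mutual information and t-scores.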

^Chapter 6: Words, parts of speech, and morphology (8 and 15/09/2022) [pdf] [first ed. pdf]

^Chapter 7: Part-of-speech tagging using rules (not taught in 2022) [pdf] [first ed. pdf]

  • Contents:
    • Part-of-speech tagging with symbolic rules
    • Annotation standards for parts of speech (tagsets)
  • Lecture slides: [pdf].
  • Annotation manuals and corpora:
    • Universal Dependencies: multilingual annotated corpora
    • BNC, the British National Corpus, an annotated corpus of English that follows the Text Encoding Initiative (TEI) guidelines
    • SUC, the Stockholm-Umeå corpus, an annotated corpus in Swedish
    • Negra, an annotated corpus in German
    • An inventory of available corpora compiled by a group at Stanford.
  • Software:

^Chapter 8: Part-of-speech tagging using machine-learning techniques (15/09/2022) [pdf] [first ed. pdf]

^Chapter 9: Phrase-structure grammars in Prolog (not taught in 2022) [pdf] [first ed. pdf]

  • Contents:
    • Constituents, trees
    • Using Prolog to do natural language analysis, DCG rules, variables
    • Getting the syntactic structure
    • Compositional analysis to get the semantic structure
  • Lecture slides: [pdf]
  • Prolog programs:
    • Two small DCG grammars [1] [2]
    • A tokenizer using Prolog clauses [3] and another one using DCG rules [4].
    • A small interpreter of regular expressions in Prolog by Robert Cameron [5].
  • Application examples:
    • The grammar checker in MS Word whose parser uses phrase-structure rules.
    • The natural language group at Microsoft Research.

^Chapter 10: Techniques for sequence predictions (22/09/2022) [pdf] [first ed. pdf]

  • Contents:
    • ELIZA: word spotting and pattern matching
    • Multiwords and named entities
    • Noun groups and verb groups
    • Partial parsing: statistical techniques
    • Information extraction
    • Precision, recall, and F-measure (harmonic mean)
  • Lecture slides: [pdf]
  • Prolog programs:
    • Prolog predicates to write local DCG grammars with simple noun group and verb group rules [1].
  • Documents:
    • Many interesting papers on partial parsing by Steven Abney.
    • An application example of information extraction: the FASTUS system from SRI.
    • Carsim, a system to generate animated 3D scenes from text that uses information extraction techniques.
  • Annotated corpora and evaluation resources:
  • Demonstrations:
    • CiceroLite, a system to extract named entities
    • AlchemyAPI, a system to identify people, organizations, locations, and categorize text
    • Calais, an information extraction system
    • Visualizing and monitoring events and disasters on a map at EMM Labs, part of the Europe Media Monitor. The information extraction part of the event detector. A key to the symbols used is available from this page. See also their name explorer.
  • Software:
  • Annotation resources:
    • The MUC site
    • PEAS, a group annotation scheme for French
    • TüPP-D/Z, Tübingen Partially Parsed Corpus of Written German
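
Precision, recall, and the F-measure listed in the contents of this chapter can be computed from a set of predicted items and a set of gold-standard items. The sketch below is illustrative only: the function name and the representation of items as hashable values (for instance, named-entity spans) are assumptions. It uses the balanced F-measure (F1), the harmonic mean of precision and recall.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of hashable items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical entity spans encoded as (start, end, label) tuples
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG"), (14, 15, "LOC")}
predicted = {(0, 2, "PER"), (5, 6, "LOC")}
print(precision_recall_f1(predicted, gold))
```

In this hypothetical example, every predicted span is correct but half of the gold spans are missed, so precision is higher than recall and the F1 score lies between the two, closer to the lower value.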

^Chapter 11: Syntactic formalisms (not taught in 2022) [pdf] [first ed. pdf]

  • Contents:
    • Constituency and dependency
    • Phrase categories
    • Unification-based grammars
    • Dependency grammars
    • Valence and subcategorization frames
    • Functions
  • Lecture slides: [pdf]
  • Prolog programs:
    • Some simple DCG rules for German noun phrases [1]
    • The generalized unification [2].
    • Detection of nonprojective links in a dependency tree [3] and examples of graphs [4].
    • A program to convert the CoNLL-X file format into a Prolog clause [5]. Useful with the nonprojectivity detection.
  • Corpus and programming resources:
    • More than 60 annotated corpora in multiple languages from the Universal Dependencies site.
    • Four freely available annotated dependency corpora, Danish, Dutch, Portuguese, and Swedish, and links to seven others from the CoNLL-X shared task.
    • The Susanne corpus, a free treebank for English.
    • A French treebank from Université Paris VII (available with a license).
    • Tables lexique-grammaire, subcategorization frames in French available from Université de Marne-la-Vallée.
    • The LTH converter to convert constituent trees using the Penn Treebank annotation into dependency graphs.
  • Lexical and grammar resources:
  • Annotation resources:
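
The course programs for detecting nonprojective links are written in Prolog. As a rough Python counterpart, the sketch below applies one common definition: an arc from a head to a dependent is nonprojective if some word lying between them is not a descendant of the head. The encoding of the tree as a list of head indices, with index 0 reserved for an artificial root, and the function names are assumptions for this example. The input is assumed to be a well-formed tree without cycles.

```python
def is_descendant(heads, node, ancestor):
    """Return True if ancestor is reached from node by following head links."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def nonprojective_arcs(heads):
    """Return the (head, dependent) arcs that violate projectivity.

    heads[i] is the index of the head of word i; heads[0] is an unused
    placeholder for the artificial root, and a head of 0 marks the root word.
    """
    arcs = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        low, high = min(head, dep), max(head, dep)
        # Every word strictly between head and dependent must descend from head
        if any(not is_descendant(heads, k, head) for k in range(low + 1, high)):
            arcs.append((head, dep))
    return arcs


# Word 2 is the root, word 3 depends on word 2, and word 1 depends on word 3
print(nonprojective_arcs([0, 3, 0, 2]))
```

In the example, the arc from word 3 to word 1 spans word 2, which is not a descendant of word 3, so that arc is reported.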

^Chapter 12: Transformers (22/09/2022) [pdf] [first ed. pdf]

  • Contents:
    • Contextual embeddings
    • Attention
    • Masked language models
  • Lecture slides: [pdf]
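
To make the attention item in the contents above more concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python, without any learned projection matrices. It follows the standard formulation softmax(QK^T / sqrt(d_k))V; the function names and the toy vectors are assumptions, and the lecture slides may present the computation differently.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of the query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


queries = [[1.0, 0.0]]
keys = [[1.0, 1.0], [1.0, 1.0]]
values = [[0.0, 2.0], [4.0, 6.0]]
print(attention(queries, keys, values))
```

Because the two keys in this toy example are identical, the query attends to both equally and the output is the plain average of the two value vectors.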