
Lectures

This page is provisional. It will be updated in due time.

Ch. 1: An overview of language processing
Ch. 2: Corpus processing tools
Ch. 3: Encoding and annotation schemes
Ch. 4: Topics in information theory and machine learning
Ch. 5: Counting words
Ch. 6: Words, parts of speech, and morphology
Ch. 7: Part-of-speech tagging using rules
Ch. 8: Part-of-speech tagging using stochastic techniques
Ch. 9: Phrase-structure grammars in Prolog

Ch. 10: Partial parsing
Ch. 11: Syntactic formalisms
Ch. 12: Constituent parsing
Ch. 13: Dependency parsing
Ch. 14: Semantics and predicate logic
Ch. 15: Lexical semantics
Ch. 16: Discourse
Ch. 17: Dialogue
Compl.: Speech synthesis
Compl.: Speech recognition

^Chapter 1: An overview of language processing (29/08/2022) [pdf] [first ed. pdf]

Contents: Presentation of language processing, applications, disciplines of linguistics
Lecture slides: [pdf].
Application examples:
- Watson from IBM: Question answering on Jeopardy!, footage from the show, and an overview.
- Carsim from LTH
- Direkt Profil from Lund university
- The Persona project from Microsoft Research
- A video of Higgins
- A video of Ulysse
General resources:
- Wikipedia
- ACL anthology
Research opportunities:
- Companies: Microsoft Research, Research at Google, IBM Research, Yahoo Research
- Lists: Corpora, ELSNET, LN
Associations: ACL, ATALA, GSCL.

^Chapter 2: Corpus processing tools (29/08/2022 and 1/09/2022) [pdf] [first ed. pdf]

Contents:
- Regular expressions
- Automata
- An introduction to Python
- Concordances
- Approximate string matching
Lecture slides [pdf].
Programs:
1. Python
  - Short programs to illustrate regular expressions and pattern matching [2]. They include a Jupyter notebook, where you can run regular expressions interactively.
  - Concordances [10]
  - Minimum edit distance [11]
  - A concise and elegant spelling corrector in Python by Peter Norvig and a variation of it in Prolog: Spelling corrector in Prolog.
2. Prolog
  - An elementary automaton in Prolog [1]
  - Searching edits in Prolog [12]
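As an illustration of the minimum edit distance item listed above, here is a minimal Python sketch. It is not the course program; the function name `edit_distance` and the unit costs for insertion, deletion, and substitution are assumptions made for this example (other cost schemes, such as a substitution cost of 2, are also common).

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum edit distance with unit costs (an assumption of this sketch)."""
    # previous[j] is the distance between the processed prefix of source
    # and target[:j]; initially the prefix is empty, so j insertions.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning source[:i] into the empty string takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("language", "lineage"))
```

The sketch keeps only two rows of the dynamic programming table, which is enough when only the distance, and not the alignment, is needed.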
Corpora:
- Corpus thomisticum, the first electronic corpus compiled by Roberto Busa.
- A modern concordance to the Clementine Vulgate.
- The Oxford text archive, Centre National des Ressources Textuelles et Lexicales, Project Gutenberg, the Internet archive, the Runeberg project, Gallica.
Demonstrations:
- Regex 101, an online regex tester.
- Concordances and collocations:
  - Corpus thomisticum
  - Many corpora from Brigham Young University, such as the Corpus of Contemporary American English, Språkbanken, CNRTL.
  - Google, one of the largest concordancers to date.
Software:
- OpenFst, a library for constructing weighted finite-state transducers in C++ with bindings in Python
- FSA, finite state automata utilities in Prolog
Documents:
- Interesting tutorials by Ken Church
- Another interesting paper on an algorithm to align words for historical comparison by Michael Covington

^Chapter 3: Encoding and annotation schemes (1/09/2022) [pdf] [first ed. pdf]

Contents:
- Character sets and Unicode
- Mark-up languages and XML
Lecture slides: [pdf].
Resources:
- Unicode: the Unicode Consortium and International Components for Unicode
- XML: the XML site at W3C
- XML in text processing: the Text Encoding Initiative, DocBook, the International Digital Publishing Forum.
Programs:
- The programs of this chapter [1]

^Chapter 4: Topics in information theory and machine learning (1/09/2022) [pdf] [first ed. pdf]

Contents:
- Topics in information theory
- Using scikit-learn, a popular machine learning toolkit
Lecture slides: [pdf].
Resources:
- Machine-learning software:
  - scikit-learn, an excellent machine-learning library for Python
  - C4.5, ID3's successor, by Ross Quinlan
  - Weka, a comprehensive data mining package in Java
  - LIBSVM, an efficient implementation of support vector machines.
  - LIBLINEAR, a library for large linear classification.
- Courses on machine learning:
  - At Stanford: CS229
  - At Carnegie Mellon: 10-701
  - An interesting blog: Mechanistician
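As an illustration of the information theory topics listed in the contents of this chapter, here is a minimal Python sketch that computes the entropy, in bits, of the symbol distribution observed in a sequence. The function name `entropy` and the use of relative frequencies as probability estimates are assumptions of this example, not the course's own code.

```python
import math
from collections import Counter


def entropy(symbols) -> float:
    """Entropy in bits of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # H = -sum p(x) log2 p(x), with p(x) estimated by relative frequency.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    print(entropy("abracadabra"))
```

A sequence with a single repeated symbol has an entropy of zero, and a uniform distribution over four symbols has an entropy of two bits.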

^Chapter 5: Counting words (5/09/2022) [pdf] [first ed. pdf]

Contents:
- Tokenization
- N-grams
- Counting words and N-grams
- Probability of a word sequence
- Smoothing
- Collocations and other statistics
- Embeddings
Lecture slides: Three parts: [pdf], [pdf], [pdf].
Python programs:
- The notebooks of this chapter: [1], [2], and [3]
- Simple tokenizers [1a], [1b] and a more complex one by Gregory Grefenstette [2]
- Another popular tokenizer by Robert MacIntyre, original version in sed [3] and its translation in Perl [4]
- Counting unigrams [5] and bigrams [6]
- Mutual information [7], t-scores [8], and the log-likelihood ratio [9].
Java programs to tokenize text, count words and bigrams [Java]. Run them on your corpus. You can count the words from the output of the tokenization program using the Unix sort and uniq commands.
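The word counts obtained with the Unix sort and uniq commands can also be computed in Python. The following is a minimal sketch, assuming the tokenizer writes one token per line; the function names `count_tokens` and `count_bigrams` are hypothetical and are not the course programs.

```python
from collections import Counter


def count_tokens(lines):
    """Count tokens given one token per line, in the spirit of `sort | uniq -c`."""
    counts = Counter()
    for line in lines:
        token = line.strip()
        if token:  # skip empty lines
            counts[token] += 1
    return counts


def count_bigrams(tokens):
    """Count pairs of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    for token, freq in count_tokens(tokens).most_common():
        print(freq, token)
    for bigram, freq in count_bigrams(tokens).most_common():
        print(freq, *bigram)
```

To run it on a file of tokens, pass the open file object to `count_tokens`, since iterating over a file yields its lines.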
Demonstrations:
- A collocation demo from the Corpus Linguistics group at FAU Erlangen-Nürnberg.
Software and resources:
- N-grams at Google Research and N-grams at Microsoft Research.
- A journalist's account from wired.com on how Google uses bigrams in its search engine.
- The SRI language modeling toolkit
- The CMU-Cambridge statistical language modeling toolkit

^Chapter 6: Words, parts of speech, and morphology (8 and 15/09/2022) [pdf] [first ed. pdf]

Contents:
- Dictionaries
- Morphology
- Transducers
Lecture slides: [pdf]
Additional slides on the Prolog language [pdf].
Prolog programs:
- Building and searching a letter tree (trie) [1]
- A transducer modeling the future tense of regular French verbs [2].
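To illustrate the letter tree (trie) listed above, here is a minimal Python sketch using nested dictionaries. The course program is written in Prolog; the function names `insert` and `contains` and the end-of-word marker `END` are assumptions of this example.

```python
END = "$"  # hypothetical end-of-word marker; assumes "$" never occurs in a word


def insert(trie: dict, word: str) -> None:
    """Add `word` to the trie, creating one nested dictionary per letter."""
    node = trie
    for letter in word:
        node = node.setdefault(letter, {})
    node[END] = True


def contains(trie: dict, word: str) -> bool:
    """Return True if `word` was inserted as a complete word."""
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return END in node


if __name__ == "__main__":
    trie = {}
    for word in ["bin", "bind", "bit"]:
        insert(trie, word)
    print(contains(trie, "bind"), contains(trie, "bi"))
```

Words sharing a prefix share the same path from the root, so the lookup time depends on the length of the word and not on the size of the dictionary.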
Grammar resources and history:
- Technè, the first grammar of Greek, by Dionysius Thrax, who created concepts we still use today
- De partibus orationis ars minor, by Aelius Donatus, the most popular grammar in the West in the Middle Ages
- An introduction to the grammar of English from University College London.
Software:
- PC-Kimmo, a morphological parser from the Summer Institute of Linguistics.
- The Helsinki Finite-State Transducer software, a toolkit to implement morphological parsers based on weighted and unweighted finite-state transducers.
- Unitex, a corpus processing system using automata and transducers from Université de Marne-la-Vallée
Demonstrations:
- The Xerox site on multilingual content analysis.
- The Swedish morphological parser from Lingsoft
- The German morphological parser from Canoo.

^Chapter 7: Part-of-speech tagging using rules (not taught in 2022) [pdf] [first ed. pdf]

Contents:
- Part-of-speech tagging with symbolic rules
- Annotation standards for parts of speech (tagsets)
Lecture slides: [pdf].
Annotation manuals and corpora:
- Universal Dependencies: Multilingual annotated corpora
- BNC, the British National Corpus, an annotated corpus in English following the Text Encoding Initiative (TEI).
- SUC, the Stockholm-Umeå corpus, an annotated corpus in Swedish
- Negra, an annotated corpus in German
- An inventory of available corpora compiled by a group at Stanford.
Software:
- The historical Brill's tagger in Lisp.
- An implementation of Brill's tagger in C++ by Radu Florian.

^Chapter 8: Part-of-speech tagging using machine-learning techniques (15/09/2022) [pdf] [first ed. pdf]

Contents:
- Stochastic tagging
- Markov models
- Tagging with decision trees
- Application: Language models for machine translation
Lecture slides: [pdf].
Demonstrations:
- The Xerox site on multilingual content analysis.
- Demonstrations from Universitat politècnica de Catalunya.
- GRIM from KTH.
Software:
- The historical Xerox tagger based on hidden Markov models in Lisp.
- TreeTagger, a multilingual tagger using decision trees from Helmut Schmid.
- MXPOST, an efficient tagger from Adwait Ratnaparkhi.
- SVMTool, a tagger using support vector machines from Universitat politècnica de Catalunya.
- A part-of-speech tagger and other tools for Swedish from KTH.
- Stagger: another part-of-speech tagger for Swedish.
- GIZA++, software to train translation models from Franz Josef Och.

^Chapter 9: Phrase-structure grammars in Prolog (not taught in 2022) [pdf] [first ed. pdf]

Contents:
- Constituents, trees
- Using Prolog to do natural language analysis, DCG rules, variables
- Getting the syntactic structure
- Compositional analysis to get the semantic structure
Lecture slides: [pdf]
Prolog programs:
- Two small DCG grammars [1] [2]
- A tokenizer using Prolog clauses [3] and another one using DCG rules [4].
- A small interpreter of regular expressions in Prolog by Robert Cameron [5].
Application examples:
- The grammar checker in MS Word whose parser uses phrase-structure rules.
- The natural language group at Microsoft Research.

^Chapter 10: Techniques for sequence predictions (22/09/2022) [pdf] [first ed. pdf]

Contents:
- ELIZA: word spotting and pattern matching
- Multiwords and named entities
- Noun groups and verb groups
- Partial parsing: statistical techniques
- Information extraction
- Precision, recall, and F-measure (harmonic mean)
Lecture slides: [pdf]
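As an illustration of the precision, recall, and F-measure item in the contents above, here is a minimal Python sketch. It assumes the system output and the gold standard are given as sets of items, for instance (start, end, label) chunks; the function name `precision_recall_f1` and the convention of returning 0.0 for empty sets are assumptions of this example.

```python
def precision_recall_f1(predicted: set, gold: set):
    """Return (precision, recall, F1) for a predicted set against a gold set."""
    correct = len(predicted & gold)
    # Convention of this sketch: a score is 0.0 when its denominator is empty.
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    predicted = {(0, 2, "NP"), (3, 4, "VP"), (5, 7, "NP"), (8, 9, "PP")}
    gold = {(3, 4, "VP"), (5, 7, "NP"), (8, 10, "PP")}
    print(precision_recall_f1(predicted, gold))
```

The harmonic mean penalizes a large gap between precision and recall more than the arithmetic mean would.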
Prolog programs:
- Prolog predicates to write local DCG grammars with simple noun group and verb group rules [1].
Documents:
- Many interesting papers on partial parsing by Steven Abney.
- An application example of information extraction: the FASTUS system from SRI.
- Carsim, a system to generate animated 3D scenes from text that uses information extraction techniques.
Annotated corpora and evaluation resources:
- CoNLL-2002 and CoNLL-2003 on language-independent named entity recognition: Spanish, Dutch, English, and German.
- CoNLL-2000 on chunking and CoNLL-1999 on noun phrase chunking
- CoNLL-2001 on clause identification
Demonstrations:
- CiceroLite, a system to extract named entities
- AlchemyAPI, a system to identify people, organizations, locations, and categorize text
- Calais, an information extraction system
- Visualizing and monitoring events and disasters on a map at EMM labs, part of the Europe media monitor. The information extraction part of the event detector. A key to the symbols used is available from this page. See also their name explorer.
Software:
- Yamcha, an efficient chunker
- The Stanford named entity recognizer from Stanford University
- The Illinois named entity tagger from the University of Illinois
- The Langforia multilingual pipelines from Lund University
Annotation resources:
- The MUC site
- PEAS, a group annotation scheme for French
- TüPP-D/Z, Tübingen Partially Parsed Corpus of Written German

^Chapter 11: Syntactic formalisms (not taught in 2022) [pdf] [first ed. pdf]

Contents:
- Constituency and dependency
- Phrase categories
- Unification-based grammars
- Dependency grammars
- Valence and subcategorization frames
- Functions
Lecture slides: [pdf]
Prolog programs:
- Some simple DCG rules for German noun phrases [1]
- The generalized unification [2].
- Detection of nonprojective links in a dependency tree [3] and examples of graphs [4].
- A program to convert the CONLL-X file format into a Prolog clause [5]. Useful with the nonprojectivity detection.
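The detection of nonprojective links mentioned above can be sketched in Python. The course programs are in Prolog; this example is an independent illustration. It assumes the tree is given as a list `heads`, where `heads[i]` is the index of the head of word i, words are numbered from 1, and 0 denotes the root; index 0 of the list is an unused placeholder. A link is reported as nonprojective when a word lying between the head and the dependent is not a descendant of the head. The function names are hypothetical.

```python
def is_descendant(heads, node, ancestor):
    """True if `ancestor` dominates `node`, following head links toward the root."""
    # The loop bound guards against cycles in malformed input.
    for _ in range(len(heads)):
        if node == 0:
            return False
        node = heads[node]
        if node == ancestor:
            return True
    return False


def nonprojective_arcs(heads):
    """Return the (head, dependent) links that are nonprojective."""
    arcs = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        low, high = min(head, dep), max(head, dep)
        for between in range(low + 1, high):
            if not is_descendant(heads, between, head):
                arcs.append((head, dep))
                break
    return arcs


if __name__ == "__main__":
    # Word 2 is the root; word 1 depends on word 3, crossing over word 2.
    print(nonprojective_arcs([0, 3, 0, 2, 2]))
```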
Corpus and programming resources:
- More than 60 annotated corpora in multiple languages from the Universal dependencies site.
- Four freely available annotated dependency corpora, Danish, Dutch, Portuguese, and Swedish, and links to seven others from the CONLL-X shared task.
- The Susanne corpus, a free treebank for English.
- A French treebank from Université Paris VII (available with a license).
- Tables lexique-grammaire, subcategorization frames in French available from Université de Marne-la-Vallée.
- The LTH converter to convert constituent trees using the Penn Treebank annotation into dependency graphs.
Lexical and grammar resources:
- The Oxford Advanced Learner's Dictionary, a dictionary listing valence patterns of English verbs.
Annotation resources:
- A dependency annotated corpus in Swedish from Joakim Nivre
- What's wrong with my NLP, a visualizer of dependency graphs using the CoNLL formats.
- A guide to annotating dependencies for Danish from Handelshøjskolen i København (Copenhagen Business School).

^Chapter 12: Transformers (22/09/2022) [pdf] [first ed. pdf]

Contents:
- Contextual embeddings
- Attention
- Masked language models
Lecture slides: [pdf]

Computer Science

Faculty of Engineering, LTH
