Lectures
This page is provisional. It will be updated in due time.
Contents
Chapter 1: An overview of language processing (29/08/2022) [pdf] [first ed. pdf]
- Contents: Presentation of language processing, applications, disciplines of linguistics
- Lecture slides: [pdf].
- Application examples:
- Watson from IBM: Question answering on Jeopardy!, footage from the show, and an overview.
- Carsim from LTH
- Direkt Profil from Lund University
- The Persona project from Microsoft Research
- A video of Higgins
- A video of Ulysse.
- General resources:
- Research opportunities:
- Companies: Microsoft Research, Research at Google, IBM Research, Yahoo Research.
- Lists: Corpora, ELSNET, LN
- Associations: ACL, ATALA, GSCL.
Chapter 2: Corpus processing tools (29/08/2022 and 1/09/2022) [pdf] [first ed. pdf]
- Contents:
- Regular expressions
- Automata
- An introduction to Python
- Concordances
- Approximate string matching
- Lecture slides: [pdf].
- Programs:
- Python
- Short programs to illustrate regular expressions and pattern matching [2]. They include a Jupyter notebook, where you can run regular expressions interactively.
- Concordances [10]
- Minimum edit distance [11]
- A concise and elegant spelling corrector in Python by Peter Norvig and a variation of it in Prolog: Spelling corrector in Prolog.
- Corpora:
- Corpus thomisticum, the first electronic corpus, compiled by Roberto Busa.
- A modern concordance to the Clementine Vulgate.
- The Oxford Text Archive, Centre National des Ressources Textuelles et Lexicales, Project Gutenberg, the Internet Archive, the Runeberg project, Gallica.
- Demonstrations:
- Regex 101, an online regex tester.
- Concordances and collocations:
- Corpus thomisticum
- Many corpora from Brigham Young University, such as the Corpus of Contemporary American English; Språkbanken; CNRTL.
- Google, one of the largest concordancers to date.
- Software:
- Documents:
- Interesting tutorials by Ken Church
- An interesting paper by Michael Covington on an algorithm to align words for historical comparison
Chapter 3: Encoding and annotation schemes (1/09/2022) [pdf] [first ed. pdf]
- Contents:
- Character sets and Unicode
- Mark-up languages and XML
- Lecture slides: [pdf].
- Resources:
- Unicode: the Unicode Consortium and International Components for Unicode
- XML: the XML site at W3C
- XML in text processing: the Text Encoding Initiative, DocBook, the International Digital Publishing Forum.
- Programs:
- The programs of this chapter [1]
Chapter 4: Topics in information theory and machine learning (1/09/2022) [pdf] [first ed. pdf]
- Contents:
- Topics in information theory
- Using scikit-learn, a popular machine learning toolkit
- Lecture slides: [pdf].
- Resources:
- Machine-learning software:
- scikit-learn, an excellent data mining library for Python
- C4.5, ID3's successor, by Ross Quinlan
- Weka, a comprehensive data mining toolkit in Java
- LIBSVM, an efficient implementation of support vector machines.
- LIBLINEAR, a library for large linear classification.
- Courses on machine learning:
- At Stanford: CS229
- At Carnegie Mellon: 10-701
- An interesting blog: Mechanistician
Chapter 5: Counting words (5/09/2022) [pdf] [first ed. pdf]
- Contents:
- Tokenization
- N-grams
- Counting words and N-grams
- Probability of a word sequence
- Smoothing
- Collocations and other statistics
- Embeddings
- Lecture slides in three parts: [pdf], [pdf], [pdf].
- Python programs:
- The notebooks of this chapter: [1], [2], and [3]
- Simple tokenizers [1a], [1b] and a more complex one by Gregory Grefenstette [2]
- Another popular tokenizer by Robert MacIntyre, original version in sed [3] and its translation in Perl [4]
- Counting unigrams [5] and bigrams [6]
- Mutual information [7], t-scores [8], and the log-likelihood ratio [9].
- Java programs to tokenize text, count words and bigrams [Java]. Run them on your corpus. You can count the words from the output of the tokenization program using the Unix sort and uniq commands.
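The counting step done with the Unix sort and uniq commands can also be sketched in Python. This is a minimal illustration, not one of the course programs: the function name `count_words` and the input format (one token per line, as a tokenizer might print it) are assumptions.

```python
from collections import Counter


def count_words(lines):
    """Count tokens, assuming the input holds one token per line."""
    # Strip surrounding whitespace and ignore empty lines
    tokens = (line.strip() for line in lines)
    return Counter(token for token in tokens if token)


if __name__ == "__main__":
    # Hypothetical tokenizer output, one token per line
    sample = ["the", "cat", "sat", "on", "the", "mat"]
    counts = count_words(sample)
    # most_common() lists the words by decreasing frequency
    for word, freq in counts.most_common():
        print(freq, word)
```

To run it on real tokenizer output, the lines of a file can be passed instead of the sample list, for example `count_words(open("tokens.txt", encoding="utf-8"))`, where `tokens.txt` is a placeholder file name.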
- Demonstrations:
- A collocation demo from the Corpus Linguistics group at FAU Erlangen-Nürnberg.
- Software and resources:
- N-grams at Google Research and N-grams at Microsoft Research.
- A journalist's account from wired.com on how Google uses bigrams in its search engine.
- The SRI language modeling toolkit
- The CMU-Cambridge statistical language modeling toolkit
Chapter 6: Words, parts of speech, and morphology (8 and 15/09/2022) [pdf] [first ed. pdf]
- Contents:
- Dictionaries
- Morphology
- Transducers
- Lecture slides: [pdf]
- Additional slides on the Prolog language [pdf].
- Prolog programs:
- Grammar resources and history:
- Tecknè, the first grammar of Greek, by Dionysius Thrax, who created concepts we still use today
- De partibus orationis ars minor, the most popular grammar in the West in the Middle Ages, by Aelius Donatus
- An introduction to the grammar of English from University College London.
- Software:
- PC-Kimmo, a morphological parser from the Summer Institute of Linguistics.
- The Helsinki Finite-State Transducer software, a toolkit to implement morphological parsers based on weighted and unweighted finite-state transducers.
- Unitex, a corpus processing system using automata and transducers from Université de Marne-la-Vallée
- Demonstrations:
- The Xerox site on multilingual content analysis.
- The Swedish morphological parser from Lingsoft
- The German morphological parser from Canoo.
Chapter 7: Part-of-speech tagging using rules (not taught in 2022) [pdf] [first ed. pdf]
- Contents:
- Part-of-speech tagging with symbolic rules
- Annotation standards for parts of speech (tagsets)
- Lecture slides: [pdf].
- Annotation manuals and corpora:
- Universal Dependencies: multilingual annotated corpora
- BNC, the British National Corpus, an annotated corpus in English following the Text Encoding Initiative (TEI).
- SUC, the Stockholm-Umeå corpus, an annotated corpus in Swedish
- Negra, an annotated corpus in German
- An inventory of available corpora compiled by a group at Stanford.
- Software:
- The historical Brill's tagger in Lisp.
- An implementation of Brill's tagger in C++ by Radu Florian.
Chapter 8: Part-of-speech tagging using machine-learning techniques (15/09/2022) [pdf] [first ed. pdf]
- Contents:
- Stochastic tagging
- Markov models
- Tagging with decision trees
- Application: Language models for machine translation
- Lecture slides: [pdf].
- Demonstrations:
- The Xerox site on multilingual content analysis.
- Demonstrations from Universitat Politècnica de Catalunya.
- GRIM from KTH.
- Software:
- The historical Xerox tagger based on hidden Markov models in Lisp.
- TreeTagger, a multilingual tagger using decision trees from Helmut Schmid.
- MXPOST, an efficient tagger from Adwait Ratnaparkhi.
- SVMTool, a tagger using support vector machines from Universitat Politècnica de Catalunya.
- A part-of-speech tagger and other tools for Swedish from KTH.
- Stagger: another part-of-speech tagger for Swedish.
- GIZA++, software to train translation models, by Franz Josef Och.
Chapter 9: Phrase-structure grammars in Prolog (not taught in 2022) [pdf] [first ed. pdf]
- Contents:
- Constituents, trees
- Using Prolog to do natural language analysis, DCG rules, variables
- Getting the syntactic structure
- Compositional analysis to get the semantic structure
- Lecture slides: [pdf]
- Prolog programs:
- Application examples:
- The grammar checker in MS Word whose parser uses phrase-structure rules.
- The natural language group at Microsoft Research.
Chapter 10: Techniques for sequence predictions (22/09/2022) [pdf] [first ed. pdf]
- Contents:
- ELIZA: word spotting and pattern matching
- Multiwords and named entities
- Noun groups and verb groups
- Partial parsing: statistical techniques
- Information extraction
- Precision, recall, and F-measure (harmonic mean)
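As a rough illustration of the last item, precision, recall, and the F-measure (here the balanced F1, the harmonic mean of precision and recall) could be computed as follows. This is a minimal sketch, not the course's evaluation script: the function name `precision_recall_f1` and the representation of annotations as sets of (text, label) pairs are assumptions.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted items with gold-standard items (both iterables of hashable items)."""
    predicted, gold = set(predicted), set(gold)
    # Items that are both predicted and in the gold standard
    true_positives = len(predicted & gold)
    # Precision: share of the predicted items that are correct
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the gold items that were found
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall, set to 0 when both are 0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical named entity annotations as (text, label) pairs
    predicted = {("Lund", "LOC"), ("IBM", "ORG"), ("Watson", "PER")}
    gold = {("Lund", "LOC"), ("IBM", "ORG"), ("Jeopardy", "MISC"), ("Microsoft", "ORG")}
    p, r, f = precision_recall_f1(predicted, gold)
    print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

In this toy example, two of the three predicted entities are correct and two of the four gold entities are found, so precision is 2/3 and recall is 1/2.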
- Lecture slides: [pdf]
- Prolog programs:
- Prolog predicates to write local DCG grammars with simple noun group and verb group rules [1].
- Documents:
- Many interesting papers on partial parsing by Steven Abney.
- An application example of information extraction: the FASTUS system from SRI.
- Carsim, a system to generate animated 3D scenes from text that uses information extraction techniques.
- Annotated corpora and evaluation resources:
- CoNLL-2002 and CoNLL-2003 on language-independent named entity recognition: Spanish, Dutch, English, and German.
- CoNLL-2000 on chunking and CoNLL-1999 on noun phrase chunking
- CoNLL-2001 on clause identification
- Demonstrations:
- CiceroLite, a system to extract named entities
- AlchemyAPI, a system to identify people, organizations, locations, and categorize text
- Calais, an information extraction system
- Visualizing and monitoring events and disasters on a map at EMM labs, part of the Europe Media Monitor. The information extraction part of the event detector. A key to the symbols used is available from this page. See also their name explorer.
- Software:
- Yamcha, an efficient chunker
- The Stanford named entity recognizer from Stanford University
- The Illinois named entity tagger from the University of Illinois
- The Langforia multilingual pipelines from Lund University
- Annotation resources:
Chapter 11: Syntactic formalisms (not taught in 2022) [pdf] [first ed. pdf]
- Contents:
- Constituency and dependency
- Phrase categories
- Unification-based grammars
- Dependency grammars
- Valence and subcategorization frames
- Functions
- Lecture slides: [pdf]
- Prolog programs:
- Corpus and programming resources:
- More than 60 annotated corpora in multiple languages from the Universal Dependencies site.
- Four freely available annotated dependency corpora, Danish, Dutch, Portuguese, and Swedish, and links to seven others from the CoNLL-X shared task.
- The Susanne corpus, a free treebank for English.
- A French treebank from Université Paris VII (available with a license).
- Tables lexique-grammaire, subcategorization frames in French available from Université de Marne-la-Vallée.
- The LTH converter to convert constituent trees using the Penn Treebank annotation into dependency graphs.
- Lexical and grammar resources:
- The Oxford Advanced Learner's Dictionary, a dictionary listing valence patterns of English verbs.
- Annotation resources:
- A dependency annotated corpus in Swedish from Joakim Nivre
- What's wrong with my NLP, a visualizer of dependency graphs using the CoNLL formats.
- A guide to annotate dependencies for Danish from Handelshøjskolen i København (Copenhagen Business School).
Chapter 12: Transformers (22/09/2022) [pdf] [first ed. pdf]
- Contents:
- Contextual embeddings
- Attention
- Masked language models
- Lecture slides: [pdf]