Lectures
This page is provisional. It will be updated in due time.
I will live stream the lectures. The Zoom link is:
https://lu-se.zoom.us/j/67450590401?pwd=K2hmMXpIMG1Zb0ZWUE96Mzd5Mnp1UT09
Enter the password: 75012
I opened a chat room so that students can discuss topics regarding the course and the labs.
The link is:
https://lu-se.zoom.us/j/64335139506?pwd=WDJaeUtBcnJsQ2c2K2tMVG9jcUJ1UT09
Enter the password: 75019
It is only open to registered Lund University students.
Contents
^Chapter 1: An overview of language processing (30/08/2021) [pdf] [first ed. pdf]
- Contents: Presentation of language processing, applications, disciplines of linguistics
- Lecture slides: [pdf].
- Application examples:
- Watson from IBM: Question answering on Jeopardy!, footage from the show, and an overview.
- Carsim from LTH
- Direkt Profil from Lund University
- The Persona project from Microsoft Research
- A video of Higgins
- A video of Ulysse.
- General resources:
- Research opportunities:
- Companies: Microsoft Research, Research at Google, IBM Research, Yahoo Research.
- Lists: Corpora, ELSNET, LN
- Associations: ACL, ATALA, GSCL.
^Chapter 2: Corpus processing tools (30/08/2021 and 2/09/2021) [pdf] [first ed. pdf]
- Contents:
- Regular expressions
- Automata
- An introduction to Python
- Concordances
- Approximate string matching
- Lecture slides [pdf].
- Programs:
- Python
- Short programs to illustrate regular expressions and pattern matching [2]. They include a Jupyter notebook, where you can run regular expressions interactively.
- Concordances [10]
- Minimum edit distance [11]
- A concise and elegant spelling corrector in Python by Peter Norvig and a variation of it in Prolog: Spelling corrector in Prolog.
- Prolog
- Python
- Corpora:
- Corpus thomisticum, the first electronic corpus compiled by Roberto Busa.
- A modern concordance to the Clementine Vulgate.
- The Oxford text archive, Centre National des Ressources Textuelles et Lexicales, Project Gutenberg, the Internet archive, the Runeberg project, Gallica.
- Demonstrations:
- Regex 101, an online regex tester.
- Concordances and collocations:
- Corpus thomisticum
- Many corpora from Brigham Young University, such as the Corpus of Contemporary American English; Språkbanken; CNRTL.
- Google, one of the largest concordancers to date.
- Software:
- Documents:
- Interesting tutorials by Ken Church
- Another interesting paper on an algorithm to align words for historical comparison by Michael Covington
^Chapter 3: Encoding and annotation schemes (3/09/2020) [pdf] [first ed. pdf]
- Contents:
- Character sets and Unicode
- Mark-up languages and XML
- Lecture slides: [pdf].
- Resources:
- Unicode: the Unicode consortium and international components for Unicode
- XML: the XML site at W3C
- XML in text processing: The Text encoding initiative, DocBook, the International Digital Publishing Forum.
- Programs:
- The programs of this chapter [1]
^Chapter 4: Topics in information theory and machine learning (7/09/2020) [pdf] [first ed. pdf]
- Contents:
- Topics in information theory
- Using scikit-learn, a popular machine learning toolkit
- Lecture slides: [pdf].
- Resources:
- Machine-learning software:
- scikit-learn, an excellent data mining library for Python
- C4.5, ID3's successor, by Ross Quinlan
- Weka, comprehensive data mining software in Java
- LIBSVM, an efficient implementation of support vector machines.
- LIBLINEAR, a library for large linear classification.
- Courses on machine learning:
- At Stanford: CS229
- At Carnegie Mellon: 10-701
- An interesting blog: Mechanistician
^Chapter 5: Counting words (7, 10, and 14/09/2020) [pdf] [first ed. pdf]
- Contents:
- Tokenization
- N-grams
- Counting words and N-grams
- Probability of a word sequence
- Smoothing
- Collocations and other statistics
- Embeddings
- Lecture slides: Three parts: [pdf], [pdf], [pdf].
- Python programs:
- The notebooks of this chapter: [1], [2], and [3]
- Simple tokenizers [1a], [1b] and a more complex one by Gregory Grefenstette [2]
- Another popular tokenizer by Robert MacIntyre, original version in sed [3] and its translation into Perl [4]
- Counting unigrams [5] and bigrams [6]
- Mutual information [7], t-scores [8], and the log-likelihood ratio [9].
- Java programs to tokenize text, count words and bigrams [Java]. Run them on your corpus. You can count the words from the output of the tokenization program using the Unix sort and uniq commands.
- Demonstrations:
- A collocation demo from the Corpus Linguistics group at FAU Erlangen-Nürnberg.
- Software and resources:
- N-grams at Google Research and N-grams at Microsoft Research.
- A journalist's account from wired.com on how Google uses bigrams in its search engine.
- The SRI language modeling toolkit
- The CMU-Cambridge statistical language modeling toolkit
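The counting step mentioned in the program list of this chapter (tokenizer output passed through sort and uniq) can also be sketched in Python. This is only an illustration, not one of the course programs: it assumes the tokenizer writes one token per line, and the function names `count_tokens` and `count_bigrams` are hypothetical.

```python
from collections import Counter


def count_tokens(lines):
    """Count tokens, assuming one token per input line (blank lines are skipped)."""
    tokens = (line.strip() for line in lines)
    return Counter(tok for tok in tokens if tok)


def count_bigrams(tokens):
    """Count pairs of adjacent tokens in a token sequence."""
    tokens = list(tokens)
    return Counter(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    # Sample data standing in for the output of a tokenizer (an assumption).
    sample = ["the", "cat", "sat", "on", "the", "mat"]
    # Roughly equivalent to: sort | uniq -c | sort -rn
    for token, freq in count_tokens(sample).most_common():
        print(freq, token)
    for bigram, freq in count_bigrams(sample).most_common():
        print(freq, " ".join(bigram))
```

`Counter.most_common()` returns the items by decreasing frequency, which plays the role of the final `sort -rn` in the Unix pipeline.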
^Chapter 6: Words, parts of speech, and morphology (14/09/2020) [pdf] [first ed. pdf]
- Contents:
- Dictionaries
- Morphology
- Transducers
- Lecture slides: [pdf]
- Additional slides on the Prolog language [pdf].
- Prolog programs:
- Grammar resources and history:
- Tecknè, the first grammar of Greek, by Dionysius Thrax, who created concepts we still use today
- De partibus orationis ars minor, the most popular grammar in the West in the Middle Ages, by Aelius Donatus
- An introduction to the grammar of English from University College London.
- Software:
- PC-Kimmo, a morphological parser from the Summer Institute of Linguistics.
- The Helsinki Finite-State Transducer software, a toolkit to implement morphological parsers based on weighted and unweighted finite-state transducers.
- Unitex, a corpus processing system using automata and transducers from Université de Marne-la-Vallée
- Demonstrations:
- The Xerox site on multilingual content analysis.
- The Swedish morphological parser from Lingsoft
- The German morphological parser from Canoo.
^Chapter 7: Part-of-speech tagging using rules (17/09/2020) [pdf] [first ed. pdf]
- Contents:
- Part-of-speech tagging with symbolic rules
- Annotation standards for parts of speech (tagsets)
- Lecture slides: [pdf].
- Annotation manuals and corpora:
- The universal dependencies: Multilingual annotated corpora
- BNC, the British national corpus, an annotated corpus in English following the text encoding initiative (TEI).
- SUC, the Stockholm-Umeå corpus, an annotated corpus in Swedish
- Negra, an annotated corpus in German
- An inventory of available corpora compiled by a group at Stanford.
- Software:
- The historical Brill's tagger in Lisp.
- An implementation of Brill's tagger in C++ by Radu Florian.
^Chapter 8: Part-of-speech tagging using stochastic techniques (17/09/2020) [pdf] [first ed. pdf]
- Contents:
- Stochastic tagging
- Markov models
- Tagging with decision trees
- Application: Language models for machine translation
- Lecture slides: [pdf].
- Demonstrations:
- The Xerox site on multilingual content analysis.
- Demonstrations from Universitat politècnica de Catalunya.
- GRIM from the KTH.
- Software:
- The historical Xerox tagger based on hidden Markov models in Lisp.
- TreeTagger, a multilingual tagger using decision trees from Helmut Schmid.
- MXPOST, an efficient tagger from Adwait Ratnaparkhi.
- SVMTool, a tagger using support vector machines from Universitat politècnica de Catalunya.
- A part-of-speech tagger and other tools for Swedish from KTH.
- Stagger: another part-of-speech tagger for Swedish.
- GIZA++, software to train translation models, from Franz Josef Och.
^Chapter 9: Phrase-structure grammars in Prolog (not taught in 2020) [pdf] [first ed. pdf]
- Contents:
- Constituents, trees
- Using Prolog to do natural language analysis, DCG rules, variables
- Getting the syntactic structure
- Compositional analysis to get the semantic structure
- Lecture slides: [pdf]
- Prolog programs:
- Application examples:
- The grammar checker in MS Word whose parser uses phrase-structure rules.
- The natural language group at Microsoft Research.
^Chapter 10: Partial parsing (17 and 24/09/2020) [pdf] [first ed. pdf]
- Contents:
- ELIZA: word spotting and pattern matching
- Multiwords and named entities
- Noun groups and verb groups
- Partial parsing: multiword and group detection in Prolog
- Partial parsing: statistical techniques
- Information extraction
- Precision, recall, and F-measure (harmonic mean)
- Lecture slides: [pdf]
- Prolog programs:
- Prolog predicates to write local DCG grammars with simple noun group and verb group rules [1].
- Documents:
- Many interesting papers on partial parsing by Steven Abney.
- An application example of information extraction: the FASTUS system from SRI.
- Carsim, a system to generate animated 3D scenes from text that uses information extraction techniques.
- Annotated corpora and evaluation resources:
- CoNLL-2002 and CoNLL-2003 on language-independent named entity recognition: Spanish, Dutch, English, and German.
- CoNLL-2000 on chunking and CoNLL-1999 on noun phrase chunking
- CoNLL-2001 on clause identification
- Demonstrations:
- CiceroLite, a system to extract named entities
- AlchemyAPI, a system to identify people, organizations, locations, and categorize text
- Calais, an information extraction system
- Visualizing and monitoring events and disasters on a map at EMM labs, part of the Europe media monitor. The information extraction part of the event detector. A key to the symbols used is available from this page. See also their name explorer.
- Software:
- Yamcha, an efficient chunker
- The Stanford named entity recognizer from Stanford University
- The Illinois named entity tagger from the University of Illinois
- The Langforia multilingual pipelines from Lund University
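The precision, recall, and F-measure listed in the contents of this chapter can be illustrated with a short sketch. It is not taken from the course material: the function name `precision_recall_f` and the chunk representation as (start, end, label) tuples are assumptions made for the example, and a predicted chunk only counts as correct when it matches a gold chunk exactly.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Return (precision, recall, F) for two collections of items.

    With beta = 1, F is the harmonic mean of precision and recall.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Hypothetical chunks: (start index, end index, label)
    gold = [(0, 2, "NP"), (3, 4, "VP"), (5, 7, "NP")]
    predicted = [(0, 2, "NP"), (3, 4, "NP"), (5, 7, "NP"), (8, 9, "PP")]
    p, r, f = precision_recall_f(gold, predicted)
    print(f"P = {p:.3f}, R = {r:.3f}, F1 = {f:.3f}")
```

In this example, two of the four predicted chunks are correct and two of the three gold chunks are found, so precision is 1/2, recall is 2/3, and F1 is 4/7.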
- Annotation resources:
^Chapter 11: Syntactic formalisms (24/09 and 01/10/2020) [pdf] [first ed. pdf]
- Contents:
- Constituency and dependency
- Phrase categories
- Unification-based grammars
- Dependency grammars
- Valence and subcategorization frames
- Functions
- Lecture slides: [pdf]
- Prolog programs:
- Corpus and programming resources:
- More than 60 annotated corpora in multiple languages from the Universal dependencies site.
- Four freely available annotated dependency corpora, Danish, Dutch, Portuguese, and Swedish, and links to seven others from the CoNLL-X shared task.
- The Susanne corpus, a free treebank for English.
- A French treebank from Université Paris VII (available with a license).
- Tables lexique-grammaire, subcategorization frames in French available from Université de Marne-la-Vallée.
- The LTH converter to convert constituent trees using the Penn Treebank annotation into dependency graphs.
- Lexical and grammar resources:
- The Oxford Advanced Learner's Dictionary, a dictionary listing valence patterns of English verbs.
- Annotation resources:
- A dependency annotated corpus in Swedish from Joakim Nivre
- What's wrong with my NLP, a visualizer of dependency graphs using the CoNLL formats.
- A guide to annotating dependencies for Danish from Handelshøjskolen i København (Copenhagen Business School).
^Chapter 12: Constituent parsing (not taught in 2020) [pdf] [first ed. pdf]
- Contents:
- Top-down and bottom-up strategies
- The shift-reduce algorithm
- Earley's algorithm
- Statistical parsing and PCFG
- Lecture slides: [pdf]
- Prolog programs:
- Corpus resources:
- The Susanne corpus, a free treebank for English
- A French treebank from Université Paris VII (available with a license)
- Parsers resources:
- The Charniak parser (From Eugene Charniak's web page)
- The Collins parser (Michael Collins' web page)
- On-line parsers:
^Chapter 13: Dependency parsing (01/10/2020) [pdf] [first ed. pdf]
- Contents:
- Dependency parsing
- Nivre's parser
- Lecture slides: [pdf]
- Prolog programs:
- Joakim Nivre's dependency parser [3].
- Updates to the book:
- Corpus resources:
- More than 60 annotated corpora in multiple languages from the Universal dependencies site.
- Four freely available annotated dependency corpora, Danish, Dutch, Portuguese, and Swedish, and links to 7 others from the CoNLL-X shared task. Seven other corpora with the same annotation, Basque, Catalan, Chinese, Greek, Hungarian, Italian, and Turkish, from the CoNLL 2007 shared task.
- Parsers resources:
- Joakim Nivre's web page and the Malt parser
- Google's parser: Parsey McParseface, the most accurate in the world according to Google.
- Ryan McDonald's web page
- The CoNLL-X and CoNLL-2007 shared tasks on dependency parsing covering a total of 19 languages.
- On-line parsers:
- Connexor
- Lingsoft
- Link grammar
- Stanford CoreNLP or here: corenlp run
- Multilingual parsers from Lund
^Chapter 14: Semantics and predicate logic (08/10/2020) [pdf] [first ed. pdf]
- Contents:
- Formal semantics
- λ-calculus
- Compositionality: nouns, verbs, determiners
- Lecture slides: [pdf]
- Prolog programs:
- A small grammar embedding compositionality [1]
- Corpus resources:
- A corpus of logical forms from the natural language group at Microsoft Research
- Application examples:
- Semantic interpretation for speech recognition (SISR): A W3C recommendation to embed semantic annotation into grammar rules.
- Translation projects by the natural language group at Microsoft Research.
- SPARQL endpoints:
^Chapter 15: Lexical semantics (08/10/2020) [pdf] [first ed. pdf]
- Contents:
- Words and meaning
- Lexical semantics
- Lexical networks
- Word sense disambiguation
- Case grammars
- Frame semantics and semantic roles
- Semantic grammars
- Lecture slides: [pdf]. Anders Björkelund's presentation of his thesis on semantic role labeling [pdf].
- Resources:
- Lexical databases:
- WordNet from Princeton.
- Alexandria from Memodata.
- Sense identification:
- SemCor, the Brown corpus tagged with WordNet senses. The tagging was originally done at Princeton with WordNet 1.6. The WordNet team has since reorganized the sense nomenclature, and the different corpora are mappings to the corresponding WordNet sense versions.
- Semantic role labeling:
- FrameNet from Berkeley.
- The ACE project and the PropBank annotation guidelines.
- The Unified verb index merging FrameNet, VerbNet, and PropBank from the University of Colorado.
- CoNLL-2004 and CoNLL-2005 on semantic role labeling.
- CoNLL-2008 and CoNLL-2009 on joint learning of syntactic and semantic dependencies.
- Semantic role labeling software:
- A demonstration of the LTH semantic parser and its source code. (CoNLL 2009 version).
- The LTH semantic parser code with PropBank and NomBank predicates from Richard Johansson (CoNLL 2008 version).
- The LTH semantic parser with the FrameNet paradigm from Richard Johansson.
- The ASSERT Automatic Statistical SEmantic Role Tagger from Sameer Pradhan.
- Semantic role labeling by the University of Illinois at Urbana-Champaign.
- Open information extraction, a system to extract predicate-argument structures from web pages.
- The Senna semantic role-labeling tool from the NEC Laboratories America.
^Chapter 16: Discourse (15/10/2020) [pdf] [first ed. pdf]
- Contents:
- Discourse definition
- Discourse entities
- Reference and anaphora
- Rhetorical structure theory (RST)
- Parsing a text
- Machine learning to discover RST relations
- TimeML
- Lecture slides: [pdf]
- Annotation and evaluation resources:
- The coreference annotation manual used in MUC-7 by Hirschman and Chinchor.
- A paper on coreference evaluation by Vilain et al. (1995).
- An annotation manual for Rhetorical structure theory from the University of Southern California's Information Sciences Institute.
- Another annotation manual for the Penn Discourse Treebank.
- TimeML, a markup language for temporal and event expressions.
- Corpus resources:
- Entity databases: Freebase, DBpedia, and Yago
- CoNLL-2011 and CoNLL-2012 on modeling unrestricted coreference in OntoNotes.
- An RST-annotated corpus in German from the University of Potsdam. Available on request.
- TimeBank, a TimeML annotated corpus.
- Demonstrations:
- Entity disambiguation and linking with AIDA.
- Coreference resolution using Stanford CoreNLP.
- HERD: Entity disambiguation for Swedish
- A parser for discourse relations using the Penn Discourse Treebank annotations.
- rstWeb, an annotation platform
^Chapter 17: Dialogue (15/10/2020) [pdf] [first ed. pdf]
- Contents:
- Dialogue automata
- Pairs
- Speech acts
- Speech act recognition
- Lecture slides: [pdf]
- Resources:
- DAMSL, a dialogue markup scheme from the University of Rochester.
- Dialogue acts in Verbmobil and Verbmobil-2 [1] [2].
- The TRAINS corpus and annotated files from the University of Rochester.
- VoiceXML, a markup framework to develop dialogue applications:
- The VoiceXML official page
- Java VoiceXML, an open source implementation of VoiceXML.
- Application examples:
- TRAINS, TRIPS.
- A train information system in Swedish from SJ. Call 0046 771-75-75-75.
- A paper by Johan Boye, Mats Wirén, Manny Rayner, Ian Lewin, David Carter, and Ralph Becket, "Language-Processing Strategies and Mixed-Initiative Dialogues", IJCAI-99 Workshop on Knowledge and Reasoning in Practical Dialogue Systems, July 1999.
^Complement: Speech synthesis (15/10/2020)
- Contents:
- Some concepts in signal processing
- Some basics in phonetics
- Speech synthesis
- Lecture slides: [pdf].
- Software resources:
- Application examples:
- Multilingual speech synthesis from Acapela.
- CRISCO speech synthesis in French.
- Other links on synthesis.
- ATT speech synthesis.
^Complement: Speech recognition (15/10/2020)
- Contents:
- Markov models
- Speech recognition
- Lecture slides: [pdf].
- Prolog programs:
- Software resources:
- The HTK speech group at Cambridge.
- Sphinx, a speech recognition program, and other open source resources from the speech group at Carnegie Mellon.
- Evaluation:
- Application examples:
- LIMSI speech recognition and Voxalead, an audio indexing and transcription application. See also Quaero.
- An example of real-time speech recognition.
- Commercial companies: