Asialex Workshop Proposal: 

Practical Techniques in Using Digital Corpora for Collocational Analysis in Lexicography

Ramon Guillermo

Department of Filipino and Philippine Literature

University of the Philippines


The workshop introduces techniques for collocational analysis of digital text corpora for the purposes of lexicographic research. Several types of lexical collocation will be covered, from simple two-word co-occurrence through collocational clustering to collocational network analysis. Examples will be drawn from the political lexicon of the Filipino language, using different texts and authors. The objective is to demonstrate how these techniques can be used in practical lexicographic research.
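As an illustration of the simplest case mentioned above, two-word co-occurrence, the following is a minimal sketch and not the workshop's actual method. It counts unordered word pairs within a small window and scores them with pointwise mutual information (PMI), one common association measure chosen here as an assumption. The function name `cooccurrence_pmi`, its parameters, and the toy token list are hypothetical and are not taken from any corpus.

```python
import math
from collections import Counter


def cooccurrence_pmi(tokens, window=1, min_count=2):
    """Count unordered word pairs within `window` tokens and score them by PMI.

    Hypothetical helper for illustration only.
    Returns (pair_freq, scores); scores only include pairs seen >= min_count times.
    """
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, left in enumerate(tokens):
        # Look ahead at most `window` tokens to the right of position i.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pair_freq[tuple(sorted((left, right)))] += 1

    total_words = len(tokens)
    total_pairs = sum(pair_freq.values())
    scores = {}
    for (a, b), n in pair_freq.items():
        if n < min_count:
            continue  # rare pairs give unreliable PMI values
        p_pair = n / total_pairs
        p_a = word_freq[a] / total_words
        p_b = word_freq[b] / total_words
        scores[(a, b)] = math.log2(p_pair / (p_a * p_b))
    return pair_freq, scores


# Toy token list, invented for this sketch (not real corpus data).
toy_tokens = "ang bayan ay malaya ang bayan ay lumalaban ang bayan".split()
pair_freq, scores = cooccurrence_pmi(toy_tokens, window=1, min_count=2)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, pair_freq[pair], round(score, 2))
```

On a real corpus one would normally tokenize and normalize the text first, and raise `min_count`, since PMI tends to overrate pairs that occur only a few times.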


Sketch Engine is a leading corpus tool, used for lexicography at Oxford University Press, Cambridge University Press, Collins, Macmillan, and national language institutes in nine countries, and for language research and teaching at several hundred universities worldwide. It provides corpora in over 80 languages and helps experts address all kinds of questions about how words behave in context: from retrieving example sentences from a corpus, through investigating the collocational behaviour of words in particular grammatical relations using word sketches or the distributional thesaurus, to integration with dictionary writing systems via the Sketch Engine API. In addition, the system provides corpus-building functions that allow users to convert their own texts into an annotated corpus or easily build domain corpora from the web, and then search those corpora or use them for keyword or term extraction.

The workshop will provide an introduction to Sketch Engine and cover some of its more advanced features together with their use cases in lexicography. The workshop will be hands-on. Participants are encouraged to bring laptops or tablets, as well as any data sets they would like to work with. The workshop programme may be adapted to individual participants' needs as they arise. Every participant will be entitled to a free three-month Sketch Engine trial account.

The workshop will be led by Miloš Jakubíček and Iztok Kosem.


Miloš Jakubíček is an NLP researcher and software engineer. His research focuses mainly on two fields: effective processing of very large text corpora and parsing of morphologically rich languages. Since 2008, he has been involved in the development of the Sketch Engine corpus management suite on behalf of Lexical Computing, a small research company working at the intersection of corpus and computational linguistics. Since 2011, he has been the director of the Czech branch of Lexical Computing, leading the local Sketch Engine development team, and in 2014 he became CEO of Lexical Computing. He is a fellow of the NLP Centre at Masaryk University, where his interests lie mainly in syntactic analysis and its practical applications.