Korpusomat is a simple web application for creating morphosyntactically tagged text corpora, with the MTAS search engine built in. It integrates a selection of natural language processing tools developed by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.

The main tools used for processing are Morfeusz, Concraft, Liner 2, TermoPL, and MTAS.

The first two tools, Morfeusz and Concraft, are continually developed and updated. Liner 2 offers many features, such as recognition of temporal expressions, action descriptions, and named entities; Korpusomat currently uses only its named entity recognition (NER). TermoPL extracts terminology, which is shown in the corpus statistics view. MTAS is a corpus search engine developed by the Meertens Instituut as part of the CLARIN project.

Korpusomat processes plain text files (txt) and most other formats used to store text (e.g. epub, mobi, doc, rtf, pdf); the full list of supported formats is available at http://tika.apache.org/1.17/formats.html. All texts are converted to UTF-8 encoding for processing.
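The conversion to UTF-8 can be illustrated with a minimal Python sketch. This is not Korpusomat's actual code: the function name to_utf8 and the list of candidate encodings (UTF-8 first, then cp1250, a legacy encoding often found in Polish texts) are assumptions made only for illustration.

```python
def to_utf8(raw: bytes, candidates=("utf-8", "cp1250")) -> bytes:
    """Return `raw` re-encoded as UTF-8, trying each candidate encoding in order.

    The candidate list is an assumption for this sketch, not Korpusomat's setting.
    """
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
        return text.encode("utf-8")
    # No candidate fit: decode as UTF-8 and replace undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# Usage: a Polish sample stored in the legacy cp1250 encoding.
legacy_bytes = "zażółć gęślą jaźń".encode("cp1250")
print(to_utf8(legacy_bytes).decode("utf-8"))
```

Trying UTF-8 first is a deliberate choice in this sketch: bytes that are valid UTF-8 are rarely meaningful text in a legacy single-byte encoding, so a successful UTF-8 decode is a fairly safe signal.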

Korpusomat also allows adding articles from web pages. Each added URL is processed by the newspaper library.
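As a rough illustration of this step, the sketch below assumes that the library in question is the Python newspaper3k package and uses its Article class (download, parse, title, text). The function fetch_article_text and its article_factory parameter are hypothetical names introduced here; how Korpusomat actually calls the library is not described in this text.

```python
from typing import Callable, Optional


def fetch_article_text(url: str, article_factory: Optional[Callable] = None) -> dict:
    """Download a web page and return its title and main text.

    `article_factory` is a hypothetical hook that lets a stand-in class be
    supplied; by default the Article class from newspaper3k is used.
    """
    if article_factory is None:
        # Imported lazily so the module loads without the package installed.
        from newspaper import Article  # assumption: pip install newspaper3k
        article_factory = Article

    article = article_factory(url)
    article.download()  # fetch the raw HTML of the page
    article.parse()     # extract the title and the main article text
    return {"url": url, "title": article.title, "text": article.text}
```

A call such as fetch_article_text("https://example.org/some-article") would then return a dictionary whose "text" field could be passed on to the same processing pipeline as an uploaded file; the URL here is only a placeholder.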


The features are described in detail in the materials below (in Polish).


Tools used

The included tools are described in, for example: