Account creation

You need an account to use Korpusomat. To register, provide your e-mail address and a password. The account is used to manage your corpora.

To sign up, click on the link or the button in the right corner of the navigation bar.

Corpus creation

After signing in, click the "New corpus" button (1) to create a new corpus.

Provide a name for your new corpus (2) and select text annotation layers (6). You can also modify the metadata schema for texts in the new corpus (4), (5) and add a description (3). Finally, click the "create" button (7).

After the corpus is created, you are redirected to the "My corpora" view. To add texts to the new corpus, click its name. In this view you can also delete, modify and create corpora.

To add a new text to the corpus, click the "+" button (11) in the bottom right corner.

Then, you are redirected to the text upload view. A list of compatible file formats (e.g. pdf, txt, epub) is presented in the overview. There are two ways of adding text.

First, you can add local files: click the "+ Add files" button (12) and select one or more files from your device.

Second, you can select the "provide URL" option, enter a URL and click "Download". Korpusomat will then fetch the content from the internet, e.g. an article from a website.

After the files are processed, you can edit document metadata (16), add new texts (15), and edit document content. Metadata is extracted automatically, but it can be updated afterwards.

Before confirmation, you can modify metadata manually.

To select additional files click "Add" (15). The process is the same for the subsequent documents.

After adding all files and updating the metadata, click "Add" (17) at the bottom. The documents will be added to the corpus and processed.

After adding files, you will be redirected to the corpus view. Statuses of the corpus and all files will change according to the processing stage. Annotation of long documents can take from several minutes to half an hour.

After all files have been processed correctly (21), the status of the corpus changes to "Ready" (20) and the corpus is ready to use:

  • Corpus query - button (22)
  • Corpus statistics - button (23)
  • Download of processed files - button (24)
  • Sharing the corpus with other users - button (25)
  • Editing the corpus - button (26)

Clicking the button (23) opens the statistics view of the corpus and starts generating the statistics. Descriptions of the individual parts of the view are presented here.

Clicking the button (24) downloads an archive of the processed corpus in XML format (CCL).
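The files from the downloaded archive can be read with any XML library. The sketch below is only an illustration: it assumes the files follow the common CCL layout, in which tok elements contain an orth element (the word form) and lex elements with base (lemma) and ctag (morphosyntactic tag). The sample document and the read_tokens helper are made up for this example and are not taken from Korpusomat itself, so the actual export may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed CCL layout (not real Korpusomat output).
SAMPLE_CCL = """<chunkList>
  <chunk id="ch1" type="p">
    <sentence id="s1">
      <tok>
        <orth>Ala</orth>
        <lex disamb="1"><base>Ala</base><ctag>subst:sg:nom:f</ctag></lex>
      </tok>
      <tok>
        <orth>ma</orth>
        <lex disamb="1"><base>mieć</base><ctag>fin:sg:ter:imperf</ctag></lex>
      </tok>
      <tok>
        <orth>kota</orth>
        <lex disamb="1"><base>kot</base><ctag>subst:sg:acc:m2</ctag></lex>
      </tok>
    </sentence>
  </chunk>
</chunkList>"""


def read_tokens(xml_text):
    """Return (word form, lemma, tag) triples for every token in the document."""
    root = ET.fromstring(xml_text)
    tokens = []
    for tok in root.iter("tok"):
        orth = tok.findtext("orth")
        # Take the first <lex> entry; a token without one yields None values.
        lex = tok.find("lex")
        base = lex.findtext("base") if lex is not None else None
        ctag = lex.findtext("ctag") if lex is not None else None
        tokens.append((orth, base, ctag))
    return tokens


if __name__ == "__main__":
    for orth, base, ctag in read_tokens(SAMPLE_CCL):
        print(orth, base, ctag, sep="\t")
```

For a file extracted from the archive, the same function can be applied to the file's text content.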

After adding new documents or editing the corpus, Korpusomat will reprocess the corpus.

Corpus usage

Corpus query

Clicking the button (22) opens the query view. Enter your query in the "query" field (27) and click the "Search" button (31). The corpus query language is described in the guide. The button (28) opens the interactive query builder, the button (29) restricts the query by document metadata, and the button (30) generates a frequency list based on the corpus query.

The button (28) shows the query builder, which makes it easier to create a query. However, it does not cover all features of the query language.

Clicking the button (32) shows the metadata list (33), which can be used to restrict search results by specific criteria.

Clicking the button (34) shows an extended statistics menu (35). Query results can be aggregated as a frequency list grouped by a linguistic property or as a graph grouped by document metadata.

After the search, the results view is shown. Each result is displayed with additional information such as its context, and clicking a result shows more details. The list of results can be downloaded in CSV format. To display the dependency parsing visualization of a result sentence (if the dependency parsing option was selected), click the icon on the right side of the table (37).

Text metadata autocomplete

When a text is added to the corpus, the following metadata values are extracted from the name of the added file:

  • title
  • author
  • place of publication
  • year of publication

These values are filled in only if the corresponding metadata fields have been selected in the corpus settings. The file names should have the following format:

<author>-<title>(<place of publication>,<year of publication>)

Each of the metadata fields listed accepts only certain types of characters:

  • author - letters and digits
  • title - all characters except "("
  • place of publication - letters
  • year of publication - digits

However, it is possible to omit some metadata.

Note: The "⎵" sign means a space character or any number of its repetitions.

Other examples of accepted formats:

  • omitting author:
  • <title>(<place of publication>,<year of publication>)

  • omitting year of publication:
  • <author>-<title>(<place of publication>,)

  • omitting place of publication:
  • <author>-<title>(<year of publication>)

  • only title:
  • <title>()
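The naming rules above can be approximated with a regular expression. The following Python sketch is not Korpusomat's actual implementation: the function name parse_file_name, the returned dictionary keys, and the tolerance for optional spaces around the separators are assumptions made for illustration, and the sample file names are invented.

```python
import re

LETTERS = r"[^\W\d_]+"   # one or more letters (Unicode-aware)
ALNUM = r"[^\W_]+"       # one or more letters or digits

# Assumption: optional whitespace is tolerated around "-", "(", "," and ")".
PATTERN = re.compile(
    rf"""
    ^\s*
    (?:(?P<author>{ALNUM})\s*-\s*)?     # optional author followed by a hyphen
    (?P<title>[^(]*?)\s*                # title: any characters except an opening bracket
    \(\s*
    (?:
        (?P<place>{LETTERS})\s*,\s*(?P<year>[0-9]+)?   # place, then an optional year
      | (?P<year_only>[0-9]+)                          # year without a place
    )?
    \s*\)\s*$
    """,
    re.VERBOSE,
)


def parse_file_name(stem):
    """Parse a file name (without extension); return None if it does not fit."""
    match = PATTERN.match(stem)
    if match is None:
        return None
    groups = match.groupdict()
    return {
        "author": groups["author"],
        "title": groups["title"].strip() or None,
        "place_of_publication": groups["place"],
        "year_of_publication": groups["year"] or groups["year_only"],
    }


if __name__ == "__main__":
    for name in (
        "Mickiewicz-Pan Tadeusz(Paryż,1834)",
        "Pan Tadeusz(Paryż,1834)",
        "Mickiewicz-Pan Tadeusz(Paryż,)",
        "Mickiewicz-Pan Tadeusz(1834)",
        "Pan Tadeusz()",
    ):
        print(name, "->", parse_file_name(name))
```

Because the author part is optional and a title may itself contain a hyphen, a pattern like this treats a leading word followed by a hyphen as the author. Whether Korpusomat resolves that ambiguity the same way is not stated here.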

Examples of file names (without the extension):