Example corpora

Two example corpora consisting entirely of made-up data are available in PCT. The first is a very small corpus called the “example” corpus; the second is a slightly larger corpus called “Lemurian.” They serve as “practice” corpora, allowing the user to become familiar with PCT while working with an unfamiliar language.

The example corpus

The example corpus was included in the earliest versions of PCT and has a few useful patterns for testing out analysis functions. For practical purposes, it is superseded by the more complex Lemurian corpus, but we continue to include it because many of the “help” examples are based on its contents. The corpus uses the following sound inventory:


Stops: /t/

Nasals: /m, n/

Fricatives: /s, ʃ/

Vowels: /i, e, u, o, a/

Phonological restrictions:

  1. [e] and [o] are allophones of [i] and [u], respectively, and occur only immediately before a nasal consonant.
  2. Non-low vowel harmony with blocking nasals: there are no sequences of a non-low front vowel followed (at any distance) by a non-low back vowel, or vice versa, without an intervening nasal consonant.
  3. Left-spreading ʃ-dominant sibilant harmony: there are no sequences of [s] followed (at any distance) by [ʃ].
  4. Purely CV syllable structure.
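
As an illustration, the four restrictions can be sketched as a small checker over a word given as a list of segments. This is only a sketch and is not part of PCT: the function name `violations` is hypothetical, the vowel harmony restriction is read here as a front/back contrast among the non-low vowels (an assumption), and restriction 1 is checked in one direction only (that [e] and [o] appear before a nasal).

```python
NASALS = {"m", "n"}
CONSONANTS = {"t", "m", "n", "s", "ʃ"}
VOWELS = {"i", "e", "u", "o", "a"}
FRONT = {"i", "e"}  # non-low front vowels
BACK = {"u", "o"}   # non-low back vowels (assumed reading of restriction 2)


def violations(word):
    """Return the sorted numbers of the restrictions that `word` breaks.

    `word` is a list of single-segment strings, e.g. ["t", "i", "n", "u"].
    """
    broken = set()

    # Restriction 1: [e] and [o] occur only immediately before a nasal.
    for i, seg in enumerate(word):
        if seg in {"e", "o"}:
            if i + 1 >= len(word) or word[i + 1] not in NASALS:
                broken.add(1)

    # Restriction 2: front and back non-low vowels may not co-occur
    # unless a nasal intervenes. The low vowel [a] is transparent.
    last = None  # class of the latest non-low vowel since the last nasal
    for seg in word:
        if seg in NASALS:
            last = None
        elif seg in FRONT or seg in BACK:
            current = "front" if seg in FRONT else "back"
            if last is not None and current != last:
                broken.add(2)
            last = current

    # Restriction 3: no [s] followed at any distance by [ʃ].
    seen_s = False
    for seg in word:
        if seg == "s":
            seen_s = True
        elif seg == "ʃ" and seen_s:
            broken.add(3)

    # Restriction 4: strictly alternating consonant-vowel, ending in a vowel.
    if len(word) % 2 != 0:
        broken.add(4)
    for i, seg in enumerate(word):
        expected = CONSONANTS if i % 2 == 0 else VOWELS
        if seg not in expected:
            broken.add(4)

    return sorted(broken)
```

For instance, `violations(["t", "i", "t", "u"])` reports restriction 2, while `violations(["t", "i", "n", "u"])` reports nothing because the nasal blocks harmony.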

The Lemurian corpus

The Lemurian corpus is generated by a Python script so that it contains patterns that are easily detected with the analysis functions in PCT. The script can generate a corpus of any size; the specific instantiation available in PCT was generated with 30 words. It is a bit larger than the example corpus and has a few specific characteristics that users may find useful. Not all of the following may be visible in this particular sample of Lemurian, but these are the guidelines along which the corpus is built:


Stops: /p, b, t, d, k/

Nasals: /m, n/

Fricatives: /f, s, x/

Liquids: /l, r/

Glides: /j, w/

Vowels: /i, e, u, o, a/


Words in the corpus are 1 to 5 syllables long. The maximal Lemurian syllable is C1C2VC3, with the following phonotactics:

  1. Codas and onsets are always optional.
  2. C1 can be any consonant in the inventory.
  3. C2 can be a glide, a stop, or a nasal. Glides can occur after any consonant.
  4. C2 can only be a stop or a nasal if C1 is a fricative.
  5. C3, if present, must be a nasal.
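
The phonotactics above can be sketched as a syllable validator. This is a minimal illustration under stated assumptions, not the script that generates the corpus: the name `is_valid_syllable` is hypothetical, each segment is assumed to be a single character from the phonemic inventory listed above, and only the five numbered rules are checked.

```python
STOPS = set("pbtdk")
NASALS = set("mn")
FRICATIVES = set("fsx")
LIQUIDS = set("lr")
GLIDES = set("jw")
VOWELS = set("ieuoa")
CONSONANTS = STOPS | NASALS | FRICATIVES | LIQUIDS | GLIDES


def is_valid_syllable(syl):
    """Check a syllable string against the (C1)(C2)V(C3) template."""
    # A syllable has exactly one vowel.
    vowel_positions = [i for i, seg in enumerate(syl) if seg in VOWELS]
    if len(vowel_positions) != 1:
        return False
    v = vowel_positions[0]
    onset, coda = syl[:v], syl[v + 1:]

    # Everything outside the nucleus must be a consonant of the inventory.
    if any(seg not in CONSONANTS for seg in onset + coda):
        return False

    # At most two onset consonants and one coda consonant (rule 1 makes
    # both optional; rule 2 lets C1 be any consonant).
    if len(onset) > 2 or len(coda) > 1:
        return False

    if len(onset) == 2:
        c1, c2 = onset
        if c2 in GLIDES:
            pass  # rule 3: glides may follow any consonant
        elif c2 in STOPS or c2 in NASALS:
            if c1 not in FRICATIVES:
                return False  # rule 4: stop/nasal C2 needs a fricative C1
        else:
            return False  # rule 3: C2 is never a fricative or liquid

    # Rule 5: a coda, if present, must be a nasal.
    if coda and coda not in NASALS:
        return False
    return True
```

Under these assumptions, `is_valid_syllable("xwan")` and `is_valid_syllable("spa")` hold, while `is_valid_syllable("pta")` and `is_valid_syllable("pat")` do not.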

Lemurian has front/back vowel harmony: a word can only contain vowels from one of those categories. The front vowels are /i, e/ and the back vowels are /u, o/. The vowel /a/ is neutral and can appear with vowels from either set.
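
A harmony check of this kind might look as follows. The function name `obeys_harmony` is hypothetical, and the word is assumed to be a string of single-character segments.

```python
FRONT = set("ie")
BACK = set("uo")


def obeys_harmony(word):
    """Return True if `word` does not mix front and back vowels.

    The neutral vowel /a/ belongs to neither set, so it never
    causes a violation.
    """
    has_front = any(seg in FRONT for seg in word)
    has_back = any(seg in BACK for seg in word)
    return not (has_front and has_back)
```

For example, `obeys_harmony("piku")` is `False` because /i/ and /u/ belong to different sets, whereas `obeys_harmony("pika")` is `True`.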

Other phonological patterns:

  1. The sound [z] occurs only as an allophone of /s/ between vowels, i.e., s -> z / V_V.
  2. Voiced and voiceless stops only contrast in the C1 position of a syllable. If a stop appears in C2 (following a fricative), then it is necessarily a voiceless stop.
  3. Coronals have a high functional load.
  4. Word frequencies are randomly generated, so there is no guarantee about any sound or sound sequence being more or less common.
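
The allophonic rule in item 1 can be sketched as a function from a phonemic string to a surface string. This is an illustration only: the name `apply_s_voicing` is hypothetical, and segments are assumed to be single characters.

```python
VOWELS = set("ieuoa")


def apply_s_voicing(phonemic):
    """Apply s -> z / V_V to a string of single-character segments."""
    surface = []
    for i, seg in enumerate(phonemic):
        between_vowels = (
            0 < i < len(phonemic) - 1
            and phonemic[i - 1] in VOWELS
            and phonemic[i + 1] in VOWELS
        )
        # Only an /s/ flanked by vowels on both sides surfaces as [z].
        surface.append("z" if seg == "s" and between_vowels else seg)
    return "".join(surface)
```

Here `apply_s_voicing("pasisa")` yields `"paziza"`, while word-initial and cluster /s/ are left alone, as in `apply_s_voicing("sa")` and `apply_s_voicing("aspa")`.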

Orthography vs. transcription:

The Lemurian corpus contains a number of intentional mismatches between the spelling system and the transcription system. This allows users to test out the differences between selecting spelling and transcription for some of the analysis functions, e.g. string similarity.

  1. All of the transcription symbols match the orthographic symbols, except for /x/, which is written as “h”.
  2. Coda nasals are not distinguished in writing. Both use the symbol ‘N’. This obscures some minimal pairs.
  3. If a syllable starts with /ju/ or /wu/, the glide is not written.
  4. The allophone [z] is written as “s”.

For example, the word [junxwa] would be spelled “uNhwa”.
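
The four spelling rules can be sketched as a small conversion function. This is not how PCT or the generating script stores spellings; the name `spell` is hypothetical, and the input is assumed to be a list of already syllabified transcriptions, since the coda and glide rules depend on syllable boundaries.

```python
VOWELS = set("ieuoa")


def spell(syllables):
    """Convert a list of transcribed syllables to a Lemurian spelling."""
    written = []
    for syl in syllables:
        # Rule 3: a syllable-initial glide before /u/ is not written.
        if syl[:2] in ("ju", "wu"):
            syl = syl[1:]
        # Rule 2: a coda nasal (a final nasal after the vowel) is written N.
        if len(syl) > 1 and syl[-1] in "mn" and syl[-2] in VOWELS:
            syl = syl[:-1] + "N"
        # Rules 1 and 4: /x/ is written h and [z] is written s.
        syl = syl.replace("x", "h").replace("z", "s")
        written.append(syl)
    return "".join(written)
```

With the example above, `spell(["jun", "xwa"])` gives `"uNhwa"`. The same function also shows how minimal pairs are obscured: `spell(["kam"])` and `spell(["kan"])` both give `"kaN"`.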