API Reference

Lexicon classes

lexicon.Attribute(name, att_type[, ...])

Attributes are for collecting summary information about attributes of Words or WordTokens, with different types of attributes allowing for different behaviour

lexicon.Corpus(name[, update])

Lexicon to store information about Words, such as transcriptions, spellings and frequencies


Inventories contain information about a Corpus' segmental inventory. This class exists mainly for the purposes

lexicon.FeatureMatrix(name, feature_entries)

An object that stores feature values for segments

lexicon.Segment(symbol[, features])

Class for segment symbols


Transcription object, sequence of symbols


An object representing a word in a corpus

lexicon.EnvironmentFilter(middle_segments[, ...])

Filter to use for searching words to generate Environments that match

lexicon.Environment(middle, position[, lhs, rhs])

Specific sequence of segments that was a match for an EnvironmentFilter

Speech corpus classes


Discourse objects are collections of linear text with word tokens

spontaneous.Speaker(name, **kwargs)

Speaker objects contain information about the producers of WordTokens or Discourses

spontaneous.SpontaneousSpeechCorpus(name, ...)

SpontaneousSpeechCorpus objects are collections of Discourse objects, together with Corpus objects for frequency information.


WordToken objects are individual productions of Words

Corpus context managers

contextmanagers.BaseCorpusContext(corpus, ...)

Abstract Corpus context class that all other contexts inherit from.


Corpus context that uses canonical forms for transcriptions and tiers


Corpus context that uses the most frequent pronunciation variants for transcriptions and tiers


Corpus context that treats pronunciation variants as separate types for transcriptions and tiers


Corpus context that weights frequency of pronunciation variants by the number of variants or the token frequency for transcriptions and tiers

Corpus IO functions

Corpus binaries

binary.download_binary(name, path[, call_back])

Download a binary file of example corpora and feature matrices.


Unpickle a binary file

binary.save_binary(obj, path)

Pickle a Corpus or FeatureMatrix object for later loading
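
As a rough illustration of what pickling an object for later loading involves, the sketch below round-trips a plain dictionary through Python's standard pickle module. The helper names save_object and load_object, the file name, and the stand-in dictionary are assumptions made for illustration. They are not PCT functions, and the actual binary module may behave differently.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Pickle obj to path (hypothetical helper, not part of PCT)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_object(path):
    """Unpickle and return the object stored at path (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A plain dict stands in for a Corpus or FeatureMatrix object.
toy_corpus = {"name": "toy", "words": {"cat": ["k", "ae", "t"]}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.corpus")
    save_object(toy_corpus, path)
    restored = load_object(path)
```

Unpickling can execute arbitrary code, so binary files should only be loaded from trusted sources.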

Loading from CSV

csv.load_corpus_csv(corpus_name, path, delimiter)

Load a corpus from a column-delimited text file

csv.load_feature_matrix_csv(name, path, ...)

Load a FeatureMatrix from a column-delimited text file

Export to CSV

csv.export_corpus_csv(corpus, path[, ...])

Save a corpus as a column-delimited text file

csv.export_feature_matrix_csv(...[, delimiter])

Save a FeatureMatrix as a column-delimited text file

TextGrids

Generate a list of AnnotationTypes for a specified TextGrid file

pct_textgrid.load_discourse_textgrid(...[, ...])

Load a discourse from a TextGrid file

pct_textgrid.load_directory_textgrid(...[, ...])

Loads a directory of TextGrid files

Running text


Generate a list of AnnotationTypes for a specified text file for parsing it as an orthographic text

text_spelling.load_discourse_spelling(...[, ...])

Load a discourse from a text file containing running text of orthography

text_spelling.load_directory_spelling(...[, ...])

Loads a directory of orthographic texts


Export an orthography discourse to a text file


Generate a list of AnnotationTypes for a specified text file for parsing it as a transcribed text


Load a discourse from a text file containing running transcribed text


Loads a directory of transcribed texts.


Export a transcribed discourse to a text file

Interlinear gloss text

text_ilg.inspect_discourse_ilg(path[, number])

Generate a list of AnnotationTypes for a specified text file for parsing it as an interlinear gloss text file

text_ilg.load_discourse_ilg(corpus_name, ...)

Load a discourse from a text file containing interlinear glosses

text_ilg.load_directory_ilg(corpus_name, ...)

Loads a directory of interlinear gloss text files

text_ilg.export_discourse_ilg(discourse, path)

Export a discourse to an interlinear gloss text file, with a maximal line size of 10 words

Other standards


Generate a list of AnnotationTypes for a specified dialect


Load a discourse from a text file containing interlinear glosses


Loads a directory of corpus standard files (separated into words files and phones files)

Analysis functions

Frequency of alternation

Frequency of alternation is currently not supported in PCT.

freq_of_alt.calc_freq_of_alt(corpus_context, ...)

Returns a double that is a measure of the frequency of alternation of two sounds in a given corpus

Functional load

Kullback-Leibler divergence

kl.KullbackLeibler(corpus_context, seg1, ...)

Calculates KL distances between two Phoneme objects in some context, either the left or right-hand side.
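
To make the quantity concrete, here is a minimal sketch of Kullback-Leibler divergence computed over context counts for two segments. It is not the KullbackLeibler function itself: the function name kl_divergence, the add-one smoothing, and the toy counts are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def kl_divergence(p_counts, q_counts, smoothing=1.0):
    """Return D_KL(P || Q) in bits for two context-count mappings.

    Hypothetical helper: additive smoothing over the union of contexts
    is an assumption made here to avoid division by zero.
    """
    contexts = set(p_counts) | set(q_counts)
    if not contexts:
        return 0.0
    p_total = sum(p_counts.values()) + smoothing * len(contexts)
    q_total = sum(q_counts.values()) + smoothing * len(contexts)
    total = 0.0
    for context in contexts:
        p = (p_counts.get(context, 0) + smoothing) / p_total
        q = (q_counts.get(context, 0) + smoothing) / q_total
        total += p * math.log2(p / q)
    return total


# Toy counts of the segment found to the right of each target segment.
after_s = Counter({"a": 3, "i": 1})
after_sh = Counter({"a": 1, "i": 3})

divergence = kl_divergence(after_s, after_sh)
same = kl_divergence(after_s, after_s)
```

Identical distributions give a divergence of zero, and the value grows as the two segments occur in increasingly different contexts.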

Mutual information

mutual_information.pointwise_mi(...[, ...])

Calculate the mutual information for a bigram.
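
The standard definition of pointwise mutual information for a bigram (x, y) is log2(p(x, y) / (p(x) p(y))). The sketch below computes it from a toy list of segment sequences. The function name pointwise_mi_sketch, its arguments, and the type-based counting are assumptions for illustration and do not reflect the signature or options of mutual_information.pointwise_mi.

```python
import math
from collections import Counter


def pointwise_mi_sketch(words, first, second):
    """Return PMI in bits for the bigram (first, second).

    Hypothetical helper: words is an iterable of segment sequences,
    and every word counts once regardless of its frequency.
    """
    unigrams = Counter()
    bigrams = Counter()
    for segments in words:
        unigrams.update(segments)
        bigrams.update(zip(segments, segments[1:]))
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())
    if n_bigrams == 0:
        raise ValueError("no bigrams found in the word list")
    p_xy = bigrams[(first, second)] / n_bigrams
    if p_xy == 0:
        # An unattested bigram has PMI of negative infinity.
        return float("-inf")
    p_x = unigrams[first] / n_unigrams
    p_y = unigrams[second] / n_unigrams
    return math.log2(p_xy / (p_x * p_y))


toy_words = [["k", "a", "t"], ["t", "a", "k"], ["k", "a"]]
pmi_ka = pointwise_mi_sketch(toy_words, "k", "a")
pmi_kt = pointwise_mi_sketch(toy_words, "k", "t")
```

A positive value means the two segments co-occur more often than their individual frequencies would predict.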

Transitional probability

Neighborhood density


Calculate the neighborhood density of a particular word in the corpus.


Find all minimal pairs of the query word based only on segment mutations (not deletions/insertions)
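
The substitution-only search described above can be sketched as follows: two words count as a minimal pair when they have the same length and differ in exactly one segment. The helper names and the toy lexicon are assumptions for illustration, not PCT functions. Neighborhood density is often defined more broadly, also counting words one insertion or deletion away.

```python
def is_substitution_neighbor(word1, word2):
    """True if the sequences differ by exactly one substituted segment."""
    if len(word1) != len(word2):
        return False
    return sum(a != b for a, b in zip(word1, word2)) == 1


def find_substitution_neighbors(query, lexicon):
    """Return all words in lexicon one substitution away from query.

    Hypothetical helper: lexicon is an iterable of segment tuples.
    """
    return [word for word in lexicon if is_substitution_neighbor(query, word)]


toy_lexicon = [
    ("k", "a", "t"),
    ("b", "a", "t"),
    ("k", "i", "t"),
    ("k", "a", "t", "s"),  # one insertion away, so excluded here
    ("d", "o", "g"),
]

neighbors = find_substitution_neighbors(("k", "a", "t"), toy_lexicon)
density = len(neighbors)
```

Under this restricted definition, the density of a word is simply the number of neighbors found.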

Phonotactic probability


Calculate the phonotactic probability of a particular word using the Vitevitch & Luce algorithm

Predictability of distribution

pred_of_dist.calc_prod_all_envs(...[, ...])

Main function for calculating predictability of distribution for two segments over a corpus, regardless of environment.

pred_of_dist.calc_prod(corpus_context, envs)

Main function for calculating predictability of distribution for two segments over specified environments in a corpus.
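
One way to picture the calculation is as an entropy over the choice between the two segments in each environment, averaged with weights proportional to how often each environment occurs. The sketch below assumes that entropy-based formulation. The function names and toy counts are hypothetical, and the actual calc_prod options and return values are not shown here.

```python
import math


def environment_entropy(count1, count2):
    """Entropy in bits of the choice between two segments in one environment."""
    total = count1 + count2
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in (count1, count2):
        if count:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def weighted_entropy(env_counts):
    """Average per-environment entropy, weighted by environment frequency.

    Hypothetical helper: env_counts maps an environment label to a
    (count of segment 1, count of segment 2) pair.
    """
    grand_total = sum(c1 + c2 for c1, c2 in env_counts.values())
    if grand_total == 0:
        return 0.0
    return sum(
        (c1 + c2) / grand_total * environment_entropy(c1, c2)
        for c1, c2 in env_counts.values()
    )


# Complementary distribution: each segment is confined to one environment.
complementary = {"_a": (10, 0), "_i": (0, 10)}
# Fully overlapping distribution: both segments are equally likely everywhere.
overlapping = {"_a": (5, 5), "_i": (5, 5)}

h_complementary = weighted_entropy(complementary)
h_overlapping = weighted_entropy(overlapping)
```

An entropy of 0 corresponds to a perfectly predictable (allophonic-like) distribution, and an entropy of 1 to a fully unpredictable (contrastive-like) one.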

Symbol similarity


This function computes similarity of pairs of words across a corpus.

edit_distance.edit_distance(word1, word2, ...)

Returns the Levenshtein edit distance between the strings of two words, word1 and word2. Code drawn from http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python.
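
For readers unfamiliar with the measure, the following is a minimal standalone sketch of Levenshtein distance using the usual two-row dynamic programming table. The name levenshtein is an assumption for illustration. It is not the edit_distance function, which takes further arguments not shown in the signature above.

```python
def levenshtein(seq1, seq2):
    """Return the minimum number of insertions, deletions and
    substitutions needed to turn seq1 into seq2.

    Works on strings or on lists of segment symbols.
    """
    # Keep the shorter sequence as the row to limit memory use.
    if len(seq1) < len(seq2):
        seq1, seq2 = seq2, seq1
    previous = list(range(len(seq2) + 1))
    for i, a in enumerate(seq1, start=1):
        current = [i]
        for j, b in enumerate(seq2, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (a != b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
segment_distance = levenshtein(["k", "a", "t"], ["k", "a", "t", "s"])
```

Because the function only compares elements for equality, it applies equally to spellings and to transcriptions stored as lists of segments.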

khorsi.khorsi(word1, word2, freq_base, ...)

Calculate the string similarity of two words given a set of characters and their frequencies in a corpus based on Khorsi (2012)


Returns an analogue to Levenshtein edit distance that uses phonological features instead of characters