Corpus

class corpustools.corpus.classes.lexicon.Corpus(name, update=False)

Lexicon to store information about Words, such as transcriptions, spellings and frequencies

Parameters
name : string

Name to identify Corpus

Attributes
name : str

Name of the corpus, used only for ease of reference

attributes : list of Attributes

List of Attributes that Words in the Corpus have

wordlist : dict

Dictionary where every key is a unique string representing a word in the corpus, and each value is a Word object

words : list of strings

All the keys for the wordlist of the Corpus

specifier : FeatureSpecifier

See the FeatureSpecifier object

inventory : Inventory

Inventory that contains information about segments in the Corpus
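
As a rough illustration of how the wordlist and words attributes relate, the sketch below uses a plain dict and a hypothetical StandInWord dataclass. It does not import corpustools, and the field names on StandInWord are assumptions made for the example, not the real Word class.

```python
from dataclasses import dataclass, field


@dataclass
class StandInWord:
    # Hypothetical stand-in for a Word object; field names are assumptions.
    spelling: str
    transcription: list = field(default_factory=list)
    frequency: float = 0.0


# wordlist: each key is a unique string for a word, each value is a word object.
wordlist = {
    "cat": StandInWord("cat", ["k", "ae", "t"], 12.0),
    "dog": StandInWord("dog", ["d", "o", "g"], 7.0),
}

# words: all the keys of the wordlist.
words = list(wordlist.keys())
```

Looking a word up by its key then returns the stored object, for example wordlist["cat"].frequency.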

Methods

__init__(name[, update])

add_abstract_tier(attribute, spec)

Add an abstract tier (currently primarily for generating CV skeletons from tiers).

add_attribute(attribute[, initialize_defaults])

Add an Attribute of any type to the Corpus or replace an existing Attribute.

add_count_attribute(attribute, ...)

Add a Numeric Attribute that is a count of the segments in a tier that match a given specification.

add_tier(attribute, spec)

Add a Tier Attribute based on the transcription of words as a new Attribute that includes all segments that match the specification.

add_word(word[, allow_duplicates])

Add a word to the Corpus.

check_coverage()

Checks the coverage of the specifier (FeatureMatrix) of the Corpus over the inventory of the Corpus

features_to_segments(feature_description)

Given a feature description, return the segments in the inventory that match that feature description

find(word[, ignore_case])

Search for a Word in the corpus

find_all(spelling)

Find all Word objects with the specified spelling

generate_alternative_inventories()

get_features()

Get a list of the features used to describe Segments

get_or_create_word(**kwargs)

Get a Word object that has the spelling and transcription specified or create that Word, add it to the Corpus and return it.

get_random_subset(size[, new_corpus_name])

Get a new corpus consisting of a random selection from the current corpus

initDefaults()

iter_sort()

Sorts the keys in the corpus dictionary, then yields the values in that order

iter_words()

Sorts the keys in the corpus dictionary, then yields the values in that order

key(word)

keys()

random_word()

Return a randomly selected Word

remove_attribute(attribute)

Remove an Attribute from the Corpus and from all its Word objects.

remove_word(word_key)

Remove a Word from the Corpus using its identifier in the Corpus.

retranscribe(segmap)

segment_to_features(seg)

Given a segment, return the features for that segment.

set_default_representations()

set_feature_matrix(matrix)

Set the feature system to be used by the corpus and make sure every word is using it too.

subset(filters, mode)

Generate a subset of the corpus based on filters.

symbol_to_segment(symbol)

update(old_corpus)

update_features()

update_inventory(transcription)

Update the inventory of the Corpus to ensure it contains all the segments in the given transcription

update_wordlist(new_wordlist)
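
The behaviour described for a few of the methods above can be sketched with plain Python data structures. This is an illustration under stated assumptions, not the corpustools implementation: the function names, the seed parameter, and the dict-based feature table and feature description format are all hypothetical.

```python
import random


def iter_sorted_values(wordlist):
    """Sort the keys of the dictionary, then yield the values in that order
    (the behaviour described for iter_sort)."""
    for key in sorted(wordlist):
        yield wordlist[key]


def random_subset(wordlist, size, seed=None):
    """Return a new dict holding a random selection of `size` entries
    (the idea behind get_random_subset; `seed` is an added assumption)."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(wordlist), size)
    return {key: wordlist[key] for key in chosen}


def segments_matching(feature_table, description):
    """Return the segments whose features match every entry in `description`
    (the idea behind features_to_segments; the dict format is hypothetical)."""
    return [
        segment
        for segment, features in feature_table.items()
        if all(features.get(name) == value for name, value in description.items())
    ]


# Hypothetical toy data for the sketch.
toy_wordlist = {"b": 2, "a": 1, "c": 3}
toy_features = {
    "b": {"voice": "+", "nasal": "-"},
    "p": {"voice": "-", "nasal": "-"},
    "m": {"voice": "+", "nasal": "+"},
}
```

With this toy data, list(iter_sorted_values(toy_wordlist)) visits the values in key order, and segments_matching(toy_features, {"voice": "+"}) selects the voiced segments.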