Corpus

class corpustools.corpus.classes.lexicon.Corpus(name)

Lexicon to store information about Words, such as transcriptions, spellings and frequencies

Parameters:

name : string

Name to identify Corpus

Attributes

name (str) Name of the corpus, used only for ease of reference
attributes (list of Attributes) List of Attributes that Words in the Corpus have
wordlist (dict) Dictionary where every key is a unique string representing a word in a corpus, and each entry is a Word object
words (list of strings) All the keys for the wordlist of the Corpus
specifier (FeatureSpecifier) See the FeatureSpecifier object
inventory (Inventory) Inventory that contains information about segments in the Corpus

Methods

__init__(name)
add_abstract_tier(attribute, spec) Add an abstract tier (currently primarily for generating CV skeletons from tiers).
add_attribute(attribute[, initialize_defaults]) Add an Attribute of any type to the Corpus or replace an existing Attribute.
add_count_attribute(attribute, ...) Add a Numeric Attribute that is a count of the segments in a tier that match a given specification.
add_tier(attribute, spec) Add a Tier Attribute based on the transcription of words as a new Attribute that includes all segments that match the specification.
add_word(word[, allow_duplicates]) Add a word to the Corpus.
check_coverage() Checks the coverage of the specifier (FeatureMatrix) of the Corpus over the inventory of the Corpus
features_to_segments(feature_description) Given a feature description, return the segments in the inventory that match that feature description
find(word[, keyerror, ignore_case]) Search for a Word in the corpus
find_all(spelling) Find all Word objects with the specified spelling
get_features() Get a list of the features used to describe Segments
get_or_create_word(**kwargs) Get a Word object that has the spelling and transcription specified or create that Word, add it to the Corpus and return it.
get_random_subset(size[, new_corpus_name]) Get a new corpus consisting of a random selection from the current corpus
iter_sort() Sorts the keys in the corpus dictionary, then yields the values in that order
iter_words() Sorts the keys in the corpus dictionary, then yields the values in that order
key(word)
keys()
random_word() Return a randomly selected Word
remove_attribute(attribute) Remove an Attribute from the Corpus and from all its Word objects.
remove_word(word_key) Remove a Word from the Corpus using its identifier in the Corpus.
segment_to_features(seg) Given a segment, return the features for that segment.
set_feature_matrix(matrix) Set the feature system to be used by the corpus and make sure every word is using it too.
subset(filters) Generate a subset of the corpus based on filters.
update_inventory(transcription) Update the inventory of the Corpus to ensure it contains all the segments in the given transcription

add_abstract_tier(attribute, spec)

Add an abstract tier (currently primarily for generating CV skeletons from tiers).

Specifiers for abstract tiers should be dictionaries whose keys are the abstract symbols (such as ‘C’ or ‘V’) and whose values are iterables of the segments that should count as that abstract symbol (such as all consonants or all vowels).

Currently only operates on the transcription of words.

Parameters:

attribute : Attribute

Attribute to add/replace

spec : dict

Mapping for creating abstract tier
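
The spec format described above can be illustrated with a small stand-in written in plain Python. This is only a sketch of the mapping idea, not the corpustools implementation: the helper name `abstract_tier` is hypothetical, and skipping segments that the spec does not cover is an assumption.

```python
def abstract_tier(transcription, spec):
    """Map each segment to its abstract symbol (hypothetical helper)."""
    # Invert the spec so each segment symbol points at its abstract symbol.
    seg_to_symbol = {seg: symbol for symbol, segs in spec.items() for seg in segs}
    # Assumption: segments that the spec does not mention are left out.
    return [seg_to_symbol[seg] for seg in transcription if seg in seg_to_symbol]


# Keys are abstract symbols, values are iterables of segments.
spec = {"C": ["k", "t", "s"], "V": ["a", "i"]}
skeleton = abstract_tier(["k", "a", "t", "s"], spec)
print("".join(skeleton))  # CVCC
```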

add_attribute(attribute, initialize_defaults=False)

Add an Attribute of any type to the Corpus or replace an existing Attribute.

Parameters:

attribute : Attribute

Attribute to add or replace

initialize_defaults : boolean

If True, words will have this attribute set to the default_value of the attribute, defaults to False

add_count_attribute(attribute, sequence_type, spec)

Add a Numeric Attribute that is a count of the segments in a tier that match a given specification.

The specification should be either a list of segments or a string of the format ‘+feature1,-feature2’ that specifies the set of segments.

Parameters:

attribute : Attribute

Attribute to add or replace

sequence_type : string

Specifies whether to use ‘spelling’, ‘transcription’ or the name of a transcription tier for comparisons

spec : list or str

Specification of what segments should be counted
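
For the list form of the specification, the value stored per word amounts to a count of matching segments. The sketch below shows that idea in plain Python; `count_matching` is a hypothetical helper and does not call into corpustools.

```python
def count_matching(sequence, segments):
    """Count the items of sequence that appear in segments (hypothetical helper)."""
    wanted = set(segments)  # set membership keeps the scan linear
    return sum(1 for seg in sequence if seg in wanted)


vowels = ["a", "i", "u"]
n_vowels = count_matching(["k", "a", "t", "a"], vowels)
print(n_vowels)  # 2
```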

add_tier(attribute, spec)

Add a Tier Attribute based on the transcription of words as a new Attribute that includes all segments that match the specification.

The specification should be either a list of segments or a string of the format ‘+feature1,-feature2’ that specifies the set of segments.

Parameters:

attribute : Attribute

Attribute to add or replace

spec : list or str

Specification of which segments should be included in the tier
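
A tier built from a list specification can be thought of as the transcription with every non-matching segment removed, order preserved. The following is a plain-Python sketch under that reading; `build_tier` is a hypothetical name, not a corpustools function.

```python
def build_tier(transcription, segments):
    """Keep only the segments named in the specification, in order (hypothetical helper)."""
    wanted = set(segments)
    return [seg for seg in transcription if seg in wanted]


vowel_tier = build_tier(["k", "a", "t", "i", "s"], ["a", "i", "u"])
print(vowel_tier)  # ['a', 'i']
```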

add_word(word, allow_duplicates=True)

Add a word to the Corpus. If allow_duplicates is True, then words with identical spelling can be added. They are kept separate by adding a “silent” number to them which is never displayed to the user. If allow_duplicates is False, then duplicates are simply ignored.

Parameters:

word : Word

Word object to be added

allow_duplicates : bool

If False, duplicate Words with the same spelling as an existing word in the corpus will not be added
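
The duplicate handling described above can be sketched with an ordinary dictionary. The key format used here for the “silent” number (`"spelling (n)"`) is an assumption made for illustration, as is the helper name `add_entry`; the actual keys used by the Corpus may differ.

```python
def add_entry(wordlist, spelling, word, allow_duplicates=True):
    """Store word under spelling; return the key used, or None if ignored."""
    if spelling not in wordlist:
        wordlist[spelling] = word
        return spelling
    if not allow_duplicates:
        return None  # duplicate spelling is simply ignored
    # Assumed key scheme: append a counter that the user never sees.
    n = 1
    key = "{} ({})".format(spelling, n)
    while key in wordlist:
        n += 1
        key = "{} ({})".format(spelling, n)
    wordlist[key] = word
    return key


wordlist = {}
first = add_entry(wordlist, "lead", "metal")
second = add_entry(wordlist, "lead", "to guide")
ignored = add_entry(wordlist, "lead", "third", allow_duplicates=False)
print(sorted(wordlist))
```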

check_coverage()

Checks the coverage of the specifier (FeatureMatrix) of the Corpus over the inventory of the Corpus

Returns:

list

List of segments in the inventory that are not in the specifier
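
Conceptually, the coverage check is a difference between two collections of segment symbols. A minimal sketch, assuming the inventory is a list of symbols and the specifier behaves like a mapping from symbol to features (both are stand-ins, not the corpustools classes):

```python
def uncovered_segments(inventory, specifier):
    """Return inventory segments that have no entry in the specifier."""
    return [seg for seg in inventory if seg not in specifier]


inventory = ["p", "a", "ʔ"]
specifier = {"p": {"voice": "-"}, "a": {"voice": "+"}}
missing = uncovered_segments(inventory, specifier)
print(missing)  # ['ʔ']
```

An empty list means every segment in the inventory has a feature specification.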

features_to_segments(feature_description)

Given a feature description, return the segments in the inventory that match that feature description

Feature descriptions should be either lists, such as [‘+feature1’, ‘-feature2’] or strings that can be separated into lists by ‘,’, such as ‘+feature1,-feature2’.

Parameters:

feature_description : string or list

Feature values that specify the segments, see above for format

Returns:

list of Segments

Segments that match the feature description
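
Both accepted formats can be normalised to the same list before matching. The sketch below works on a toy feature table held in a plain dictionary; the table, the feature names and the helper names (`parse_description`, `matching_segments`) are all assumptions made for illustration, not part of corpustools.

```python
def parse_description(feature_description):
    """Turn '+f1,-f2' or ['+f1', '-f2'] into [(feature, value), ...]."""
    if isinstance(feature_description, str):
        feature_description = feature_description.split(",")
    pairs = []
    for item in feature_description:
        item = item.strip()
        pairs.append((item[1:], item[0]))  # ('voice', '+')
    return pairs


def matching_segments(feature_description, matrix):
    """Return the segments whose features satisfy every requested value."""
    wanted = parse_description(feature_description)
    return [seg for seg, feats in matrix.items()
            if all(feats.get(name) == value for name, value in wanted)]


# Toy feature table (assumed values, for illustration only).
matrix = {
    "p": {"voice": "-", "syllabic": "-"},
    "b": {"voice": "+", "syllabic": "-"},
    "a": {"voice": "+", "syllabic": "+"},
}
print(matching_segments("+voice,-syllabic", matrix))  # ['b']
print(matching_segments(["+voice"], matrix))          # ['b', 'a']
```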

find(word, keyerror=True, ignore_case=False)

Search for a Word in the corpus. If keyerror is True, a KeyError is raised when the word is not found; if keyerror is False, an EmptyWord is returned instead.

Parameters:

word : str

String representing the spelling of the word (not transcription)

keyerror : bool

Set whether a KeyError should be raised if a word is not found

Returns:

Word

Word that matches the spelling specified

Raises:

KeyError

If keyerror == True and word is not found
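
The two failure modes can be sketched with a dictionary lookup. `EmptyWordStub` and `lookup` are hypothetical stand-ins for illustration; they are not the corpustools EmptyWord class or the find method itself.

```python
class EmptyWordStub:
    """Placeholder returned when a spelling is not found (stand-in only)."""

    def __init__(self, spelling):
        self.spelling = spelling


def lookup(wordlist, spelling, keyerror=True):
    """Return the stored word, or raise / return a placeholder when missing."""
    try:
        return wordlist[spelling]
    except KeyError:
        if keyerror:
            raise  # caller asked for an exception on a miss
        return EmptyWordStub(spelling)


words = {"cat": "CAT"}
print(lookup(words, "cat"))
placeholder = lookup(words, "dog", keyerror=False)
print(type(placeholder).__name__)
```

Passing keyerror=False is convenient when a miss is expected and the caller wants to test the result rather than wrap the call in try/except.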

find_all(spelling)

Find all Word objects with the specified spelling

Parameters:

spelling : string

Spelling to look up

Returns:

list of Words

Words that have the specified spelling

get_features()

Get a list of the features used to describe Segments

Returns:

list of str

get_or_create_word(**kwargs)

Get a Word object that has the spelling and transcription specified or create that Word, add it to the Corpus and return it.

Parameters:

spelling : string

Spelling to search for

transcription : list

Transcription to search for

Returns:

Word

Existing or newly created Word with the spelling and transcription specified
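
The get-or-create pattern can be sketched as a search followed by a fallback insert. Words are modelled here as plain dictionaries and `get_or_create` is a hypothetical helper; matching on both spelling and transcription follows the description above.

```python
def get_or_create(words, spelling, transcription):
    """Return the entry matching both fields, creating it if absent."""
    for word in words:
        if word["spelling"] == spelling and word["transcription"] == transcription:
            return word
    word = {"spelling": spelling, "transcription": list(transcription)}
    words.append(word)
    return word


words = []
created = get_or_create(words, "cat", ["k", "a", "t"])
fetched = get_or_create(words, "cat", ["k", "a", "t"])
print(created is fetched, len(words))  # True 1
```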

get_random_subset(size, new_corpus_name='randomly_generated')

Get a new corpus consisting of a random selection from the current corpus

Parameters:

size : int

Size of new corpus

new_corpus_name : str

Name of the new corpus, defaults to 'randomly_generated'

Returns:

new_corpus : Corpus

New corpus object with len(new_corpus) == size
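
Since the result has exactly `size` entries, one way to picture the selection is sampling without replacement. The sketch below uses the standard library on a plain dictionary; sampling without replacement and the `seed` parameter are assumptions, and `random_subset` is a hypothetical helper.

```python
import random


def random_subset(wordlist, size, seed=None):
    """Pick size distinct entries from a wordlist dictionary."""
    rng = random.Random(seed)  # seed is only here to make runs repeatable
    # random.sample raises ValueError if size exceeds the number of entries.
    keys = rng.sample(sorted(wordlist), size)
    return {key: wordlist[key] for key in keys}


wordlist = {"cat": 1, "dog": 2, "run": 3, "sit": 4}
sample = random_subset(wordlist, 2, seed=0)
print(len(sample))  # 2
```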

iter_sort()

Sorts the keys in the corpus dictionary, then yields the values in that order

Returns:

generator

Sorted Words in the corpus
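
Sorting the keys and yielding the values in that order can be written as a short generator. This is a sketch over a plain dictionary, with `iter_sorted` as a hypothetical name:

```python
def iter_sorted(wordlist):
    """Yield the values of wordlist in sorted-key order."""
    for key in sorted(wordlist):
        yield wordlist[key]


wordlist = {"dog": "DOG", "cat": "CAT", "ant": "ANT"}
ordered = list(iter_sorted(wordlist))
print(ordered)  # ['ANT', 'CAT', 'DOG']
```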

iter_words()

Sorts the keys in the corpus dictionary, then yields the values in that order

Returns:

generator

Sorted Words in the corpus

random_word()

Return a randomly selected Word

Returns:

Word

Random Word

remove_attribute(attribute)

Remove an Attribute from the Corpus and from all its Word objects.

Parameters:

attribute : Attribute

Attribute to remove

remove_word(word_key)

Remove a Word from the Corpus using its identifier in the Corpus.

If the identifier is not found, nothing happens.

Parameters:

word_key : string

Identifier to use to remove the Word

segment_to_features(seg)

Given a segment, return the features for that segment.

Parameters:

seg : string or Segment

Segment or Segment symbol to look up

Returns:

dict

Dictionary with keys as features and values as feature values

set_feature_matrix(matrix)

Set the feature system to be used by the corpus and make sure every word is using it too.

Parameters:

matrix : FeatureMatrix

New feature system to use in the corpus

subset(filters)

Generate a subset of the corpus based on filters.

Filters for Numeric Attributes should be tuples of an Attribute (of the Corpus), a comparison callable (__eq__, __ne__, __gt__, __ge__, __lt__, or __le__) and a value to compare all such attributes in the Corpus to.

Filters for Factor Attributes should be tuples of an Attribute, and a set of levels for inclusion in the subset.

Other attribute types cannot currently be the basis for filters.

Parameters:

filters : list of tuples

See above for format

Returns:

Corpus

Subset of the corpus that matches the filter conditions
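
The two filter shapes can be sketched with the standard `operator` module. This stand-in models words as dictionaries and identifies attributes by name rather than by Attribute object, which is a simplification for illustration; `matches` and `filter_words` are hypothetical helpers.

```python
import operator


def matches(word, filters):
    """Check one word against every filter tuple."""
    for flt in filters:
        if len(flt) == 3:
            # Numeric filter: (attribute name, comparison callable, value)
            name, compare, value = flt
            if not compare(word[name], value):
                return False
        else:
            # Factor filter: (attribute name, set of allowed levels)
            name, levels = flt
            if word[name] not in levels:
                return False
    return True


def filter_words(words, filters):
    """Return the words that satisfy all filters."""
    return [word for word in words if matches(word, filters)]


words = [
    {"spelling": "cat", "frequency": 120, "pos": "N"},
    {"spelling": "run", "frequency": 300, "pos": "V"},
    {"spelling": "dog", "frequency": 40, "pos": "N"},
]
filters = [("frequency", operator.ge, 100), ("pos", {"N"})]
kept = [word["spelling"] for word in filter_words(words, filters)]
print(kept)  # ['cat']
```

All filters are combined conjunctively in this sketch, so a word must pass every tuple to be kept.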

update_inventory(transcription)

Update the inventory of the Corpus to ensure it contains all the segments in the given transcription

Parameters:

transcription : list

Segment symbols to add to the inventory if needed