Corpus

class corpustools.corpus.classes.lexicon.Corpus(name, update=False)

Lexicon to store information about Words, such as transcriptions, spellings and frequencies.

Parameters:

name : string: Name to identify Corpus

Attributes:

name : str: Name of the corpus, used only for ease of reference
attributes : list of Attributes: List of Attributes that Words in the Corpus have
wordlist : dict: Dictionary where every key is a unique string representing a word in a corpus, and each entry is a Word object
words : list of strings: All the keys for the wordlist of the Corpus
specifier : FeatureSpecifier: See the FeatureSpecifier object
inventory : Inventory: Inventory that contains information about segments in the Corpus

Methods

`__init__`(name[, update])	Initialize self.
`add_abstract_tier`(attribute, spec)	Add an abstract tier (currently primarily for generating CV skeletons from tiers).
`add_attribute`(attribute[, initialize_defaults])	Add an Attribute of any type to the Corpus or replace an existing Attribute.
`add_count_attribute`(attribute, …)	Add a Numeric Attribute that is a count of the segments in a tier that match a given specification.
`add_tier`(attribute, spec)	Add a Tier Attribute based on the transcription of words as a new Attribute that includes all segments that match the specification.
`add_word`(word[, allow_duplicates])	Add a word to the Corpus.
`check_coverage`()	Checks the coverage of the specifier (FeatureMatrix) of the Corpus over the inventory of the Corpus
`features_to_segments`(feature_description)	Given a feature description, return the segments in the inventory that match that feature description
`find`(word[, ignore_case])	Search for a Word in the corpus
`find_all`(spelling)	Find all Word objects with the specified spelling
`generate_alternative_inventories`()
`get_features`()	Get a list of the features used to describe Segments
`get_or_create_word`(**kwargs)	Get a Word object that has the spelling and transcription specified or create that Word, add it to the Corpus and return it.
`get_random_subset`(size[, new_corpus_name])	Get a new corpus consisting of a random selection from the current corpus
`initDefaults`()
`iter_sort`()	Sorts the keys in the corpus dictionary, then yields the values in that order
`iter_words`()	Sorts the keys in the corpus dictionary, then yields the values in that order
`key`(word)
`keys`()
`random_word`()	Return a randomly selected Word
`remove_attribute`(attribute)	Remove an Attribute from the Corpus and from all its Word objects.
`remove_word`(word_key)	Remove a Word from the Corpus using its identifier in the Corpus.
`retranscribe`(segmap)
`segment_to_features`(seg)	Given a segment, return the features for that segment.
`set_default_representations`()
`set_feature_matrix`(matrix)	Set the feature system to be used by the corpus and make sure every word is using it too.
`subset`(filters)	Generate a subset of the corpus based on filters.
`symbol_to_segment`(symbol)
`update`(old_corpus)
`update_features`()
`update_inventory`(transcription)	Update the inventory of the Corpus to ensure it contains all the segments in the given transcription
`update_wordlist`(new_wordlist)

add_abstract_tier(attribute, spec)

Add an abstract tier (currently primarily for generating CV skeletons from tiers).

Specifiers for abstract tiers should be dictionaries whose keys are the abstract symbols (such as ‘C’ or ‘V’) and whose values are iterables of the segments that should count as that abstract symbol (such as all consonants or all vowels).

Currently only operates on the transcription of words.

Parameters:

attribute : Attribute: Attribute to add or replace
spec : dict: Mapping for creating the abstract tier
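
As a rough illustration of the spec format, the sketch below builds a CV skeleton from a transcription in plain Python. It does not call the library: the helper name `abstract_skeleton`, the sample segments, and the choice to skip uncovered segments are all assumptions made for this example.

```python
def abstract_skeleton(transcription, spec):
    """Map each segment to the abstract symbol whose iterable contains it.

    Hypothetical helper, not part of corpustools. Segments that no symbol
    covers are skipped here; the library may treat them differently.
    """
    skeleton = []
    for seg in transcription:
        for symbol, segments in spec.items():
            if seg in segments:
                skeleton.append(symbol)
                break
    return skeleton


# Spec in the documented shape: abstract symbol -> iterable of segments.
cv_spec = {'C': {'k', 't', 's'}, 'V': {'a', 'i'}}
skeleton = abstract_skeleton(['k', 'a', 't', 's'], cv_spec)
print(''.join(skeleton))  # CVCC
```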

add_attribute(attribute, initialize_defaults=False)

Add an Attribute of any type to the Corpus or replace an existing Attribute.

Parameters:

attribute : Attribute: Attribute to add or replace
initialize_defaults : boolean: If True, words will have this attribute set to the `default_value` of the attribute. Defaults to False.

add_count_attribute(attribute, sequence_type, spec)

Add a Numeric Attribute that is a count of the segments in a tier that match a given specification.

The specification should be either a list of segments or a string of the format ‘+feature1,-feature2’ that specifies the set of segments.

Parameters:

attribute : Attribute: Attribute to add or replace
sequence_type : string: Specifies whether to use ‘spelling’, ‘transcription’ or the name of a transcription tier for comparisons
spec : list or str: Specification of which segments should be counted
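
The counting idea can be sketched without the library for the list form of the specification. The function `count_matching` and the sample transcription are hypothetical; the feature-string form is not handled here because it needs a feature system to resolve.

```python
def count_matching(sequence, spec):
    """Count the items of `sequence` that appear in `spec` (a list of segments).

    Hypothetical stand-in for the per-word count, not the library's code.
    """
    targets = set(spec)
    return sum(1 for seg in sequence if seg in targets)


transcription = ['s', 't', 'a', 't', 's']
n_t = count_matching(transcription, ['t'])
n_st = count_matching(transcription, ['s', 't'])
print(n_t, n_st)  # 2 4
```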

add_tier(attribute, spec)

Add a Tier Attribute based on the transcription of words as a new Attribute that includes all segments that match the specification.

The specification should be either a list of segments or a string of the format ‘+feature1,-feature2’ that specifies the set of segments.

Parameters:

attribute : Attribute: Attribute to add or replace
spec : list or str: Specification of which segments should be included in the tier

add_word(word, allow_duplicates=False)

Add a word to the Corpus. If allow_duplicates is True, then words with identical spelling can be added. They are kept separate by adding a “silent” number to them, which is never displayed to the user. If allow_duplicates is False, then duplicates are simply ignored.

Parameters:

word : Word: Word object to be added
allow_duplicates : bool: If False, duplicate Words with the same spelling as an existing word in the corpus will not be added
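
The duplicate handling described above can be modelled with a plain dictionary. This is a toy sketch only: the function `add_spelling` is hypothetical, and the `'spelling (n)'` key format is an assumption standing in for whatever “silent” number the library actually uses.

```python
def add_spelling(wordlist, spelling, allow_duplicates=False):
    """Toy model of duplicate handling; returns the key used, or None.

    `wordlist` maps keys to spellings. The 'spelling (n)' key format is an
    assumption for illustration, not the library's actual scheme.
    """
    if spelling not in wordlist:
        wordlist[spelling] = spelling
        return spelling
    if not allow_duplicates:
        return None  # duplicate is ignored
    n = 1
    while f'{spelling} ({n})' in wordlist:
        n += 1
    key = f'{spelling} ({n})'
    wordlist[key] = spelling
    return key


toy_wordlist = {}
first = add_spelling(toy_wordlist, 'cat')
ignored = add_spelling(toy_wordlist, 'cat')
second = add_spelling(toy_wordlist, 'cat', allow_duplicates=True)
third = add_spelling(toy_wordlist, 'cat', allow_duplicates=True)
```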

check_coverage()

Checks the coverage of the specifier (FeatureMatrix) of the Corpus over the inventory of the Corpus.

Returns:

list: List of segments in the inventory that are not in the specifier
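
Conceptually, the check is a difference between two collections of segments. The sketch below shows that idea with built-in types; `uncovered_segments`, the sample data, and the sorted return order are assumptions for the example rather than the library's behaviour.

```python
def uncovered_segments(inventory, specifier):
    """Return inventory segments that have no entry in the specifier.

    Hypothetical helper: `inventory` is any iterable of segment symbols and
    `specifier` is a mapping from symbol to feature values.
    """
    return sorted(seg for seg in inventory if seg not in specifier)


toy_inventory = {'p', 't', 'k', 'ʔ'}
toy_specifier = {'p': {'voice': '-'}, 't': {'voice': '-'}, 'k': {'voice': '-'}}
missing = uncovered_segments(toy_inventory, toy_specifier)
print(missing)  # ['ʔ']
```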

features_to_segments(feature_description)

Given a feature description, return the segments in the inventory that match that feature description.

Feature descriptions should be either lists, such as [‘+feature1’, ‘-feature2’], or strings that can be separated into lists by ‘,’, such as ‘+feature1,-feature2’.

Parameters:

feature_description : string or list: Feature values that specify the segments, see above for format

Returns:

list of Segments: Segments that match the feature description
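
To make the two accepted formats concrete, here is a small standalone sketch that normalises a description to a list and matches it against a toy feature table. The names `parse_feature_description`, `matching_segments` and `toy_table` are hypothetical, and the table layout is an assumption, not the library's FeatureMatrix.

```python
def parse_feature_description(feature_description):
    """Normalise '+f1,-f2' or ['+f1', '-f2'] to a list of strings."""
    if isinstance(feature_description, str):
        return [f.strip() for f in feature_description.split(',') if f.strip()]
    return list(feature_description)


def matching_segments(feature_description, feature_table):
    """Return symbols whose features satisfy every requested value."""
    wanted = parse_feature_description(feature_description)
    matches = []
    for seg, features in feature_table.items():
        # '+voice' is read as: the 'voice' feature must have the value '+'.
        if all(features.get(f[1:]) == f[0] for f in wanted):
            matches.append(seg)
    return matches


toy_table = {
    'p': {'voice': '-', 'nasal': '-'},
    'b': {'voice': '+', 'nasal': '-'},
    'm': {'voice': '+', 'nasal': '+'},
}
from_string = matching_segments('+voice,-nasal', toy_table)
from_list = matching_segments(['+voice', '-nasal'], toy_table)
print(from_string, from_list)  # ['b'] ['b']
```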

find(word, ignore_case=False)

Search for a Word in the corpus.

Parameters:

word : str: String representing the spelling of the word (not transcription)

Returns:

Word: Word that matches the spelling specified

Raises:

KeyError: If word is not found

find_all(spelling)

Find all Word objects with the specified spelling.

Parameters:

spelling : string: Spelling to look up

Returns:

list of Words: Words that have the specified spelling

get_features()

Get a list of the features used to describe Segments.

Returns:

list of str

get_or_create_word(**kwargs)

Get a Word object that has the spelling and transcription specified or create that Word, add it to the Corpus and return it.

Parameters:

spelling : string: Spelling to search for
transcription : list: Transcription to search for

Returns:

Word: Existing or newly created Word with the spelling and transcription specified

get_random_subset(size, new_corpus_name='randomly_generated')

Get a new corpus consisting of a random selection from the current corpus.

Parameters:

size : int: Size of new corpus
new_corpus_name : str

Returns:

new_corpus : Corpus: New corpus object with len(new_corpus) == size
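
The sampling step can be pictured with the standard library alone. This sketch draws keys from a toy dictionary; `random_subset`, its `seed` parameter and the sample data are hypothetical and are not taken from the library.

```python
import random


def random_subset(wordlist, size, seed=None):
    """Pick `size` distinct entries from a dict-like wordlist.

    Hypothetical helper. `seed` is added here only to make runs repeatable.
    random.sample raises ValueError if `size` exceeds the number of entries.
    """
    rng = random.Random(seed)
    keys = rng.sample(sorted(wordlist), size)
    return {key: wordlist[key] for key in keys}


toy_wordlist = {'cat': 1, 'dog': 2, 'eel': 3, 'fox': 4, 'gnu': 5}
subset = random_subset(toy_wordlist, 3, seed=0)
print(len(subset))  # 3
```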

iter_sort()

Sorts the keys in the corpus dictionary, then yields the values in that order.

Returns:

generator: Sorted Words in the corpus
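
The behaviour described, sorting keys and then yielding values in that order, looks roughly like the generator below. It operates on a plain dictionary; `iter_sorted_values` and the placeholder values are assumptions for illustration.

```python
def iter_sorted_values(wordlist):
    """Yield the values of `wordlist` in sorted-key order (toy sketch)."""
    for key in sorted(wordlist):
        yield wordlist[key]


toy_words = {'dog': 'Word(dog)', 'ant': 'Word(ant)', 'cat': 'Word(cat)'}
ordered = list(iter_sorted_values(toy_words))
print(ordered)  # ['Word(ant)', 'Word(cat)', 'Word(dog)']
```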

iter_words()

Sorts the keys in the corpus dictionary, then yields the values in that order.

Returns:

generator: Sorted Words in the corpus

random_word()

Return a randomly selected Word.

Returns:

Word: Random Word

remove_attribute(attribute)

Remove an Attribute from the Corpus and from all its Word objects.

Parameters:

attribute : Attribute: Attribute to remove

remove_word(word_key)

Remove a Word from the Corpus using its identifier in the Corpus.

If the identifier is not found, nothing happens.

Parameters:

word_key : string: Identifier to use to remove the Word

segment_to_features(seg)

Given a segment, return the features for that segment.

Parameters:

seg : string or Segment: Segment or Segment symbol to look up

Returns:

dict: Dictionary with features as keys and feature values as values

set_feature_matrix(matrix)

Set the feature system to be used by the corpus and make sure every word is using it too.

Parameters:

matrix : FeatureMatrix: New feature system to use in the corpus

subset(filters)

Generate a subset of the corpus based on filters.

Filters for Numeric Attributes should be tuples of an Attribute (of the Corpus), a comparison callable (__eq__, __ne__, __gt__, __ge__, __lt__, or __le__) and a value against which that attribute is compared for every word in the Corpus.

Filters for Factor Attributes should be tuples of an Attribute, and a set of levels for inclusion in the subset.

Other attribute types cannot currently be the basis for filters.

Parameters:

filters : list of tuples: See above for format

Returns:

Corpus: Subset of the corpus that matches the filter conditions
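
The two filter shapes can be illustrated on plain dictionaries. In this sketch, attribute names stand in for Attribute objects and the callables come from Python's `operator` module; both choices, along with the helper `matches` and the sample words, are assumptions and not the library's own code.

```python
import operator


def matches(word, numeric_filters, factor_filters):
    """Return True if `word` passes every filter (toy sketch).

    numeric_filters: (attribute name, comparison callable, value) tuples.
    factor_filters: (attribute name, set of allowed levels) tuples.
    """
    for name, compare, value in numeric_filters:
        if not compare(word[name], value):
            return False
    for name, levels in factor_filters:
        if word[name] not in levels:
            return False
    return True


words = [
    {'spelling': 'cat', 'frequency': 120, 'pos': 'noun'},
    {'spelling': 'run', 'frequency': 300, 'pos': 'verb'},
    {'spelling': 'zyzzyva', 'frequency': 1, 'pos': 'noun'},
]
numeric_filters = [('frequency', operator.gt, 10)]
factor_filters = [('pos', {'noun'})]
kept = [w['spelling'] for w in words
        if matches(w, numeric_filters, factor_filters)]
print(kept)  # ['cat']
```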

update_inventory(transcription)

Update the inventory of the Corpus to ensure it contains all the segments in the given transcription.

Parameters:

transcription : list: Segment symbols to add to the inventory if needed
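
In outline, updating an inventory means adding only the symbols it does not already hold. The sketch below uses a set as a stand-in for the Inventory object; `add_missing_segments` and its return value are hypothetical.

```python
def add_missing_segments(inventory, transcription):
    """Add unseen symbols from `transcription` to `inventory` (a set).

    Toy sketch: returns the newly added symbols in first-seen order.
    """
    added = []
    for symbol in transcription:
        if symbol not in inventory:
            inventory.add(symbol)
            added.append(symbol)
    return added


toy_inventory = {'k', 'a'}
new_symbols = add_missing_segments(toy_inventory, ['k', 'a', 't', 'a'])
print(new_symbols)  # ['t']
```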