Corpus
-
class corpustools.corpus.classes.lexicon.Corpus(name)
Lexicon to store information about Words, such as transcriptions, spellings and frequencies
Parameters: name : string
Name to identify Corpus
Attributes
name (str): Name of the corpus, used only for ease of reference
attributes (list of Attributes): List of Attributes that Words in the Corpus have
wordlist (dict): Dictionary where every key is a unique string representing a word in the corpus, and each entry is a Word object
words (list of strings): All the keys for the wordlist of the Corpus
specifier (FeatureSpecifier): See the FeatureSpecifier object
inventory (Inventory): Inventory that contains information about segments in the Corpus
Methods
__init__(name)
add_abstract_tier(attribute, spec): Add an abstract tier (currently primarily for generating CV skeletons from tiers).
add_attribute(attribute[, initialize_defaults]): Add an Attribute of any type to the Corpus or replace an existing Attribute.
add_count_attribute(attribute, ...): Add a Numeric Attribute that is a count of the segments in a tier that match a given specification.
add_tier(attribute, spec): Add a Tier Attribute based on the transcription of words as a new Attribute that includes all segments that match the specification.
add_word(word[, allow_duplicates]): Add a word to the Corpus.
check_coverage(): Checks the coverage of the specifier (FeatureMatrix) of the Corpus over the inventory of the Corpus
features_to_segments(feature_description): Given a feature description, return the segments in the inventory that match that feature description
find(word[, keyerror, ignore_case]): Search for a Word in the corpus
find_all(spelling): Find all Word objects with the specified spelling
get_features(): Get a list of the features used to describe Segments
get_or_create_word(**kwargs): Get a Word object that has the spelling and transcription specified or create that Word, add it to the Corpus and return it.
get_random_subset(size[, new_corpus_name]): Get a new corpus consisting of a random selection from the current corpus
iter_sort(): Sorts the keys in the corpus dictionary, then yields the values in that order
iter_words(): Sorts the keys in the corpus dictionary, then yields the values in that order
key(word)
keys()
random_word(): Return a randomly selected Word
remove_attribute(attribute): Remove an Attribute from the Corpus and from all its Word objects.
remove_word(word_key): Remove a Word from the Corpus using its identifier in the Corpus.
segment_to_features(seg): Given a segment, return the features for that segment.
set_feature_matrix(matrix): Set the feature system to be used by the corpus and make sure every word is using it too.
subset(filters): Generate a subset of the corpus based on filters.
update_inventory(transcription): Update the inventory of the Corpus to ensure it contains all
-
add_abstract_tier(attribute, spec)
Add an abstract tier (currently primarily for generating CV skeletons from tiers).
Specifiers for abstract tiers should be dictionaries whose keys are the abstract symbols (such as ‘C’ or ‘V’) and whose values are iterables of the segments that should count as that abstract symbol (such as all consonants or all vowels).
Currently only operates on the transcription of words.
Parameters: attribute : Attribute
Attribute to add/replace
spec : dict
Mapping for creating abstract tier
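The shape of such a spec dictionary can be illustrated with a minimal standalone sketch. The helper `cv_skeleton` below is hypothetical and is not part of corpustools; it only assumes that a transcription can be treated as a list of segment symbols.

```python
# Hypothetical helper, not part of corpustools: maps each segment of a
# transcription to its abstract symbol using a spec dictionary.
def cv_skeleton(transcription, spec):
    # Invert the spec so that every segment points at its abstract symbol.
    symbol_for = {seg: symbol for symbol, segs in spec.items() for seg in segs}
    # Segments that the spec does not cover are skipped in this sketch.
    return [symbol_for[seg] for seg in transcription if seg in symbol_for]

# Keys are abstract symbols, values are iterables of segments.
spec = {'C': {'k', 't', 's'}, 'V': {'a', 'i'}}
skeleton = cv_skeleton(['k', 'a', 't', 's'], spec)
print(skeleton)
```

How the library itself treats segments that are missing from the spec is not stated here, so skipping them is only an assumption of this sketch.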
-
add_attribute(attribute, initialize_defaults=False)
Add an Attribute of any type to the Corpus or replace an existing Attribute.
Parameters: attribute : Attribute
Attribute to add or replace
initialize_defaults : boolean
If True, words will have this attribute set to the default_value of the attribute; defaults to False
-
add_count_attribute(attribute, sequence_type, spec)
Add a Numeric Attribute that is a count of the segments in a tier that match a given specification.
The specification should be either a list of segments or a string of the format ‘+feature1,-feature2’ that specifies the set of segments.
Parameters: attribute : Attribute
Attribute to add or replace
sequence_type : string
Specifies whether to use ‘spelling’, ‘transcription’ or the name of a transcription tier to use for comparisons
spec : list or str
Specification of what segments should be counted
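As a rough illustration of what such a count amounts to when the specification is a list of segments, consider the standalone sketch below. The function `count_matching` is a hypothetical name, not a corpustools call, and the sequence is assumed to be a plain list of segment symbols.

```python
# Hypothetical helper, not part of corpustools: counts the segments in a
# sequence that appear in a list-style specification.
def count_matching(sequence, segments):
    wanted = set(segments)
    return sum(1 for seg in sequence if seg in wanted)

# Count the occurrences of 'a' in a toy transcription.
count = count_matching(['b', 'a', 'n', 'a', 'n', 'a'], ['a'])
print(count)
```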
-
add_tier(attribute, spec)
Add a Tier Attribute based on the transcription of words as a new Attribute that includes all segments that match the specification.
The specification should be either a list of segments or a string of the format ‘+feature1,-feature2’ that specifies the set of segments.
Parameters: attribute : Attribute
Attribute to add or replace
spec : list or str
Specification of which segments should be included in the tier
-
add_word(word, allow_duplicates=True)
Add a word to the Corpus. If allow_duplicates is True, then words with identical spelling can be added. They are kept separate by adding a “silent” number to them, which is never displayed to the user. If allow_duplicates is False, then duplicates are simply ignored.
Parameters: word : Word
Word object to be added
allow_duplicates : bool
If False, duplicate Words with the same spelling as an existing word in the corpus will not be added
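The duplicate handling described above can be pictured with a small standalone sketch over a plain dictionary. Everything here is an assumption made for illustration: the function is a stand-in rather than the library's method, and the key format `'spelling (n)'` is a guess at what a “silent” number could look like.

```python
# Stand-in for the documented behaviour, not the corpustools implementation.
def add_word(wordlist, spelling, word, allow_duplicates=True):
    """Store word under spelling; return the key used, or None if ignored."""
    if spelling not in wordlist:
        wordlist[spelling] = word
        return spelling
    if not allow_duplicates:
        # Duplicates are simply ignored.
        return None
    # Assumed key format: append a number that the user never sees.
    n = 1
    while '{} ({})'.format(spelling, n) in wordlist:
        n += 1
    key = '{} ({})'.format(spelling, n)
    wordlist[key] = word
    return key

wordlist = {}
first = add_word(wordlist, 'lead', 'word-1')
second = add_word(wordlist, 'lead', 'word-2')
ignored = add_word(wordlist, 'lead', 'word-3', allow_duplicates=False)
print(sorted(wordlist))
```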
-
check_coverage()
Checks the coverage of the specifier (FeatureMatrix) of the Corpus over the inventory of the Corpus
Returns: list
List of segments in the inventory that are not in the specifier
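Conceptually, the returned list is the part of the inventory that the specifier does not describe. A minimal sketch with plain Python containers, where `uncovered_segments` is a hypothetical name and both arguments are toy stand-ins for the Inventory and FeatureMatrix objects:

```python
# Hypothetical helper, not part of corpustools.
def uncovered_segments(inventory, specifier):
    # Keep inventory order; report segments that have no feature entry.
    return [seg for seg in inventory if seg not in specifier]

inventory = ['p', 'b', 'm', 'ʃ']
specifier = {'p': {}, 'b': {}, 'm': {}}  # toy feature entries
missing = uncovered_segments(inventory, specifier)
print(missing)
```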
-
features_to_segments(feature_description)
Given a feature description, return the segments in the inventory that match that feature description
Feature descriptions should be either lists, such as [‘+feature1’, ‘-feature2’] or strings that can be separated into lists by ‘,’, such as ‘+feature1,-feature2’.
Parameters: feature_description : string or list
Feature values that specify the segments, see above for format
Returns: list of Segments
Segments that match the feature description
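The two accepted formats can be illustrated with a standalone sketch that normalizes a description and matches it against a toy feature table. The names `parse_description` and `match_segments` and the dictionary-based feature table are assumptions made for this example; they do not reflect how the library stores features.

```python
# Hypothetical helpers, not part of corpustools.
def parse_description(description):
    # Accept either '+feature1,-feature2' or ['+feature1', '-feature2'].
    if isinstance(description, str):
        description = description.split(',')
    items = [item.strip() for item in description]
    # Split each item into its sign and its feature name.
    return [(item[0], item[1:]) for item in items]

def match_segments(matrix, description):
    wanted = parse_description(description)
    return [seg for seg, feats in matrix.items()
            if all(feats.get(name) == sign for sign, name in wanted)]

# Toy feature table: segment -> {feature name: value}.
matrix = {
    'p': {'voice': '-', 'nasal': '-'},
    'b': {'voice': '+', 'nasal': '-'},
    'm': {'voice': '+', 'nasal': '+'},
}
from_string = match_segments(matrix, '+voice,-nasal')
from_list = match_segments(matrix, ['+voice'])
print(from_string, from_list)
```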
-
find(word, keyerror=True, ignore_case=False)
Search for a Word in the corpus.
If keyerror == True, then raise a KeyError if the word is not found.
If keyerror == False, then return an EmptyWord if the word is not found.
Parameters: word : str
String representing the spelling of the word (not transcription)
keyerror : bool
Set whether a KeyError should be raised if a word is not found
Returns: Word
Word that matches the spelling specified
Raises: KeyError
If keyerror == True and word is not found
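The two failure modes can be sketched with a plain dictionary lookup. The `EMPTY` sentinel below merely stands in for the EmptyWord mentioned above, and `find_word` is a hypothetical function rather than the library's method.

```python
# Stand-in sentinel for the EmptyWord mentioned in the description.
EMPTY = object()

# Hypothetical lookup, not the corpustools implementation.
def find_word(wordlist, spelling, keyerror=True):
    try:
        return wordlist[spelling]
    except KeyError:
        if keyerror:
            raise  # propagate the KeyError to the caller
        return EMPTY

wordlist = {'cat': 'word-object-for-cat'}
found = find_word(wordlist, 'cat')
fallback = find_word(wordlist, 'dog', keyerror=False)

try:
    find_word(wordlist, 'dog')
    raised = False
except KeyError:
    raised = True
print(found, fallback is EMPTY, raised)
```

Passing keyerror=False is convenient when a missing word is an expected outcome and the caller prefers a check over exception handling.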
-
find_all(spelling)
Find all Word objects with the specified spelling
Parameters: spelling : string
Spelling to look up
Returns: list of Words
Words that have the specified spelling
-
get_or_create_word(**kwargs)
Get a Word object that has the spelling and transcription specified or create that Word, add it to the Corpus and return it.
Parameters: spelling : string
Spelling to search for
transcription : list
Transcription to search for
Returns: Word
Existing or newly created Word with the spelling and transcription specified
-
get_random_subset(size, new_corpus_name='randomly_generated')
Get a new corpus consisting of a random selection from the current corpus
Parameters: size : int
Size of new corpus
new_corpus_name : str
Name of the new corpus, defaults to ‘randomly_generated’
Returns: new_corpus : Corpus
New corpus object with len(new_corpus) == size
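The size guarantee can be pictured with the standard library's `random.sample`, which draws without replacement. This is a sketch over a plain dictionary; `random_subset` and its `seed` parameter are hypothetical and say nothing about how the library performs the selection.

```python
import random

# Hypothetical helper, not part of corpustools.
def random_subset(wordlist, size, seed=None):
    rng = random.Random(seed)
    # sample() draws distinct keys, so the result has exactly `size` entries.
    keys = rng.sample(sorted(wordlist), size)
    return {key: wordlist[key] for key in keys}

wordlist = {'cat': 1, 'dog': 2, 'bird': 3, 'fish': 4}
subset = random_subset(wordlist, 2, seed=0)
print(len(subset))
```

Note that `random.sample` raises a ValueError when the requested size exceeds the number of available keys.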
-
iter_sort()
Sorts the keys in the corpus dictionary, then yields the values in that order
Returns: generator
Sorted Words in the corpus
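The sort-then-yield pattern described here can be sketched as a generator over a plain dictionary. The function `iter_sorted` is a hypothetical stand-in, and the values are toy placeholders for Word objects.

```python
# Hypothetical generator, not the corpustools implementation.
def iter_sorted(wordlist):
    # Sort the keys first, then yield the values in key order.
    for key in sorted(wordlist):
        yield wordlist[key]

wordlist = {'b': 2, 'a': 1, 'c': 3}
ordered = list(iter_sorted(wordlist))
print(ordered)
```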
-
iter_words()
Sorts the keys in the corpus dictionary, then yields the values in that order
Returns: generator
Sorted Words in the corpus
-
remove_attribute(attribute)
Remove an Attribute from the Corpus and from all its Word objects.
Parameters: attribute : Attribute
Attribute to remove
-
remove_word(word_key)
Remove a Word from the Corpus using its identifier in the Corpus.
If the identifier is not found, nothing happens.
Parameters: word_key : string
Identifier to use to remove the Word
-
segment_to_features(seg)
Given a segment, return the features for that segment.
Parameters: seg : string or Segment
Segment or Segment symbol to look up
Returns: dict
Dictionary with keys as features and values as feature values
-
set_feature_matrix(matrix)
Set the feature system to be used by the corpus and make sure every word is using it too.
Parameters: matrix : FeatureMatrix
New feature system to use in the corpus
-
subset(filters)
Generate a subset of the corpus based on filters.
Filters for Numeric Attributes should be tuples of an Attribute (of the Corpus), a comparison callable (__eq__, __neq__, __gt__, __gte__, __lt__, or __lte__) and a value to compare all such attributes in the Corpus to.
Filters for Factor Attributes should be tuples of an Attribute and a set of levels for inclusion in the subset.
Other attribute types cannot currently be the basis for filters.
Parameters: filters : list of tuples
See above for format
Returns: Corpus
Subset of the corpus that matches the filter conditions
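The two filter shapes can be illustrated with a standalone sketch. Several things here are assumptions made for the example: words are plain dictionaries, attributes are referred to by name instead of by Attribute object, and the comparison callables come from Python's `operator` module. Which callables the library itself accepts is not confirmed by this sketch.

```python
import operator

# Hypothetical matcher, not the corpustools implementation.
def matches(word, filters):
    for f in filters:
        if len(f) == 3:
            # Numeric filter: (attribute, comparison callable, value).
            name, compare, value = f
            if not compare(word[name], value):
                return False
        else:
            # Factor filter: (attribute, set of allowed levels).
            name, levels = f
            if word[name] not in levels:
                return False
    return True

words = [
    {'spelling': 'cat', 'frequency': 120, 'pos': 'N'},
    {'spelling': 'run', 'frequency': 300, 'pos': 'V'},
    {'spelling': 'dog', 'frequency': 80, 'pos': 'N'},
]
filters = [('frequency', operator.gt, 100), ('pos', {'N'})]
kept = [w['spelling'] for w in words if matches(w, filters)]
print(kept)
```

In this sketch a word is kept only when it passes every filter in the list.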
-