WeightedVariantContext

class corpustools.contextmanagers.WeightedVariantContext(corpus, sequence_type, type_or_token, attribute=None, frequency_threshold=0)

Corpus context for transcriptions and tiers that weights the frequency of each pronunciation variant by either the number of variants or the variant's token frequency.

See the documentation of BaseCorpusContext for additional information.
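The two weighting schemes named above can be illustrated with a small standalone sketch. It does not use corpustools and is not the library's implementation: the function name `variant_weights`, the shape of its input, and the choice to normalize token counts so that the weights sum to 1 are all assumptions made for illustration.

```python
def variant_weights(variant_counts, type_or_token="type"):
    """Return one weight per pronunciation variant of a single word.

    variant_counts maps each variant (a tuple of segments) to its token
    count. Hypothetical helper, not part of corpustools.
    """
    if not variant_counts:
        return {}
    if type_or_token == "type":
        # Assumption: every variant gets an equal share, 1 / number of variants.
        share = 1 / len(variant_counts)
        return {variant: share for variant in variant_counts}
    if type_or_token == "token":
        # Assumption: each variant is weighted by its share of the word's tokens.
        total = sum(variant_counts.values())
        if total == 0:
            return {variant: 0.0 for variant in variant_counts}
        return {variant: count / total for variant, count in variant_counts.items()}
    raise ValueError("type_or_token must be 'type' or 'token'")


# Two variants of one word, observed 3 times and 1 time respectively.
counts = {("k", "a", "t"): 3, ("k", "a"): 1}
by_type = variant_weights(counts, "type")
by_token = variant_weights(counts, "token")
```

Under these assumptions, type weighting treats both variants equally, while token weighting favors the variant that was observed more often.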

Methods

`__init__`(corpus, sequence_type, type_or_token)
`get_frequency_base`([gramsize, halve_edges, ...])	Generate (and cache) frequencies for each segment in the Corpus.
`get_phone_probs`([gramsize, probability, ...])	Generate (and cache) phonotactic probabilities for segments in the Corpus.

get_frequency_base(gramsize=1, halve_edges=False, probability=False)

Generate (and cache) frequencies for each segment in the Corpus.

Parameters:

halve_edges : boolean

If True, word boundary symbols (‘#’) will only be counted once per word, rather than twice. Defaults to False.

gramsize : integer

Size of the n-gram to use when computing frequencies. Defaults to 1 (unigram).

probability : boolean

If True, frequency counts will be normalized by the total frequency. Defaults to False.

Returns:

dict

Keys are segments (or sequences of segments) and values are their frequency in the Corpus
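The effect of the three parameters can be sketched in plain Python. This is a minimal illustration, not the corpustools implementation: the function `ngram_frequencies`, the `(segments, frequency)` input format, and the decision to apply `halve_edges` only to the unigram key '#' are assumptions.

```python
from collections import defaultdict


def ngram_frequencies(words, gramsize=1, halve_edges=False, probability=False):
    """Count n-grams over words padded with the boundary symbol '#'.

    words is an iterable of (segments, frequency) pairs. Hypothetical
    helper, not part of corpustools.
    """
    counts = defaultdict(float)
    for segments, frequency in words:
        seq = ["#"] + list(segments) + ["#"]
        for i in range(len(seq) - gramsize + 1):
            gram = tuple(seq[i:i + gramsize])
            # Unigrams are keyed by the bare segment, longer n-grams by a tuple.
            key = gram[0] if gramsize == 1 else gram
            counts[key] += frequency
    if halve_edges and "#" in counts:
        # Each word contributed '#' twice (start and end); count it once.
        counts["#"] /= 2
    if probability:
        total = sum(counts.values())
        return {key: value / total for key, value in counts.items()}
    return dict(counts)


words = [(["k", "a", "t"], 2.0), (["t", "a"], 1.0)]
unigrams = ngram_frequencies(words)
halved = ngram_frequencies(words, halve_edges=True)
bigrams = ngram_frequencies(words, gramsize=2)
probs = ngram_frequencies(words, probability=True)
```

In this sketch `unigrams["#"]` is 6.0, because each word adds its frequency at both edges, and `halved["#"]` is 3.0.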

get_phone_probs(gramsize=1, probability=True, preserve_position=True, log_count=True)

Generate (and cache) phonotactic probabilities for segments in the Corpus.

Parameters:

gramsize : integer

Size of the n-gram to use when computing frequencies. Defaults to 1 (unigram).

probability : boolean

If True, frequency counts will be normalized by the total frequency. Defaults to True.

preserve_position : boolean

If True, segments in different positions in the transcription will not be collapsed. Defaults to True.

log_count : boolean

If True, token frequencies will be logarithmically transformed prior to being summed. Defaults to True.

Returns:

dict

Keys are segments (or sequences of segments) and values are their phonotactic probability in the Corpus
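How the parameters interact can be sketched in plain Python. This is an illustration under stated assumptions rather than the corpustools implementation: the function `positional_ngram_scores`, the `(segments, frequency)` input format, the use of a base-10 logarithm, the absence of boundary symbols, and normalization by the grand total are all assumptions.

```python
import math
from collections import defaultdict


def positional_ngram_scores(words, gramsize=1, probability=True,
                            preserve_position=True, log_count=True):
    """Sum (optionally log-transformed) frequencies per n-gram.

    words is an iterable of (segments, frequency) pairs with frequency > 0.
    Hypothetical helper, not part of corpustools.
    """
    scores = defaultdict(float)
    for segments, frequency in words:
        # Assumption: base-10 log; the source does not specify the base.
        weight = math.log10(frequency) if log_count else float(frequency)
        for i in range(len(segments) - gramsize + 1):
            gram = tuple(segments[i:i + gramsize])
            key = gram[0] if gramsize == 1 else gram
            if preserve_position:
                # Keep the same n-gram at different positions separate.
                key = (key, i)
            scores[key] += weight
    if probability:
        total = sum(scores.values())
        if total:
            return {key: value / total for key, value in scores.items()}
    return dict(scores)


words = [(["k", "a", "t"], 10), (["k", "a"], 100)]
raw = positional_ngram_scores(words, probability=False)
probs = positional_ngram_scores(words)
collapsed = positional_ngram_scores(words, probability=False,
                                    preserve_position=False, log_count=False)
```

With `preserve_position=True` the keys are `(segment, position)` pairs such as `("k", 0)`; with it set to False the same segment is pooled across positions, as in `collapsed["k"]`.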