API Reference

Data structures

class pygamma_agreement.Unit(segment: Segment, annotation: str | None = None)

Represents an annotated unit, i.e. a time segment and (optionally) a text annotation. Can be sorted or used in a set. If two units share the same time segment, they’re sorted alphabetically by annotation. The None annotation comes first in the “alphabet”.

>>> new_unit = Unit(segment=Segment(17.5, 21.3), annotation='Verb')
>>> new_unit.segment.start, new_unit.segment.end
(17.5, 21.3)
>>> new_unit.annotation
'Verb'
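The ordering rule described above can be illustrated without the library. The sketch below is not pygamma_agreement's implementation: SketchUnit and its fields are hypothetical stand-ins that only mimic the documented behavior (ordering by segment first, then by annotation, with None sorting before any string).

```python
from dataclasses import dataclass
from functools import total_ordering
from typing import Optional


@total_ordering
@dataclass(frozen=True)
class SketchUnit:
    """Hypothetical stand-in for Unit: a (start, end) segment plus a label."""
    start: float
    end: float
    annotation: Optional[str] = None

    def _sort_key(self):
        # The boolean places None (False) before any string (True).
        return (self.start, self.end,
                self.annotation is not None, self.annotation or "")

    def __lt__(self, other: "SketchUnit") -> bool:
        return self._sort_key() < other._sort_key()


units = [
    SketchUnit(17.5, 21.3, "Verb"),
    SketchUnit(17.5, 21.3, None),
    SketchUnit(17.5, 21.3, "Noun"),
    SketchUnit(2.0, 9.0, "Verb"),
]
ordered = sorted(units)
# Frozen dataclasses are hashable, so an equal unit collapses in a set.
unique = set(units + [SketchUnit(2.0, 9.0, "Verb")])
```

Only `__lt__` is written by hand here; `total_ordering` derives the other comparisons, which matches the way the `__ge__`, `__gt__` and `__le__` entries below describe themselves.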
__delattr__(name)

Implement delattr(self, name).

__eq__(other)

Return self==value.

__ge__(other, NotImplemented=NotImplemented)

Return a >= b. Computed by @total_ordering from (not a < b).

__gt__(other, NotImplemented=NotImplemented)

Return a > b. Computed by @total_ordering from (not a < b) and (a != b).

__hash__()

Return hash(self).

__init__(segment: Segment, annotation: str | None = None) None
__le__(other, NotImplemented=NotImplemented)

Return a <= b. Computed by @total_ordering from (a < b) or (a == b).

__lt__(other: Unit)

Return self<value.

__repr__()

Return repr(self).

__setattr__(name, value)

Implement setattr(self, name, value).

class pygamma_agreement.Continuum(uri: str | None = None)

Representation of a continuum, i.e. a set of segments annotated by multiple annotators. It is implemented as a dictionary of sorted sets:

{'annotator1': {unit1, ...}, ...}

__add__(other: Continuum)

Same as a “not-in-place” merge.

Parameters:

other (Continuum) – the continuum to merge into self

__bool__()

Truthiness: tests whether the continuum is non-empty.

>>> if continuum:
...    pass  # continuum is not empty
... else:
...    pass  # continuum is empty
__eq__(other: Continuum)

Two continua are equal if and only if all their annotators and all their units are strictly equal

__getitem__(keys: str | Tuple[str, int]) SortedSet | Unit

Get the set of annotations from an annotator, or a specific annotation. (Deep copies are returned to ensure some constraints cannot be violated)

>>> continuum['Alex']
SortedSet([Unit(segment=<Segment(2, 9)>, annotation='1'), Unit(segment=<Segment(11, 17)>, ...
>>> continuum['Alex', 0]
Unit(segment=<Segment(2, 9)>, annotation='1')
Parameters:

keys (Annotator or Annotator,int) –

Raises:

KeyError

__hash__ = None
__init__(uri: str | None = None)

Default constructor.

Parameters:

uri (optional str) – name of annotated resource (e.g. audio or video file)

__iter__() Generator[Tuple[str, Unit], None, None]

Iterates over (annotator, unit) tuples for every unit in the continuum.

__ne__(other: Continuum)

Return self!=value.

add(annotator: str, segment: Segment, annotation: str | None = None)

Add a segment to the continuum

Parameters:
  • annotator (Annotator (str)) – The annotator that produced the added annotation

  • segment (pyannote.core.Segment) – The segment for that annotation

  • annotation (optional str) – That segment’s annotation, if any.

add_annotation(annotator: str, annotation: Annotation)

Add a full pyannote annotation to the continuum.

Parameters:
  • annotator (Annotator (str)) – A string id for the annotator who produced that annotation.

  • annotation (pyannote.core.Annotation) – A pyannote Annotation object. If a label is present for a given segment, it will be considered as that label’s annotation.

add_annotator(annotator: str)

Adds the annotator to the set, with no annotated segment. Does nothing if already present.

add_elan(annotator: str, eaf_path: str | Path, selected_tiers: List[str] | None = None, use_tier_as_annotation: bool = False)

Add an Elan (.eaf) file’s content to the Continuum

Parameters:
  • annotator (Annotator (str)) – A string id for the annotator who produced that ELAN file.

  • eaf_path (Path or str) – Path to the .eaf (ELAN) file.

  • selected_tiers (optional list of str) – If set, will drop tiers that are not contained in this list.

  • use_tier_as_annotation (optional bool) – If True, the annotation for each non-empty interval will be the name of its parent Tier.

add_textgrid(annotator: str, tg_path: str | Path, selected_tiers: List[str] | None = None, use_tier_as_annotation: bool = False)

Add a textgrid file’s content to the Continuum

Parameters:
  • annotator (Annotator (str)) – A string id for the annotator who produced that TextGrid.

  • tg_path (Path or str) – Path to the textgrid file.

  • selected_tiers (optional list of str) – If set, will drop tiers that are not contained in this list.

  • use_tier_as_annotation (optional bool) – If True, the annotation for each non-empty interval will be the name of its parent Tier.

add_timeline(annotator: str, timeline: Timeline)

Add a full pyannote timeline to the continuum.

Parameters:
  • annotator (Annotator (str)) – A string id for the annotator who produced that timeline.

  • timeline (pyannote.core.Timeline) – A pyannote Timeline object. No annotation will be attached to segments.

property annotators: SortedSet

Returns a sorted set of the annotators in the Continuum.

>>> continuum.annotators
SortedSet(['annot_ref', 'annotator_a', 'annotator_b'])
property avg_length_unit: float

Mean of the annotated segments’ durations

property avg_num_annotations_per_annotator: float

Average number of annotated segments per annotator

property bounds: Tuple[float, float]

Bounds of the continuum. Initially defined as (0, 0), they grow as annotations are added.

property categories: SortedSet

Returns the (alphabetically) sorted set of all the categories of the continuum’s annotations.

property category_weights: SortedDict

Returns a dictionary where the keys are the categories in the continuum, and a key’s value is the proportion of occurrence of the category in the continuum.
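As an illustration of what "proportion of occurrence" means here, the following sketch computes such weights from a plain list of category labels. It does not use the library; category_proportions is a hypothetical helper, and a plain dict built in sorted order stands in for the SortedDict.

```python
from collections import Counter
from typing import Dict, Iterable


def category_proportions(categories: Iterable[str]) -> Dict[str, float]:
    """Map each category to its share of all annotated units."""
    counts = Counter(categories)
    total = sum(counts.values())
    # Keys are inserted in alphabetical order to mimic a sorted mapping.
    return {cat: counts[cat] / total for cat in sorted(counts)}


weights = category_proportions(["Verb", "Noun", "Verb", "Adj", "Verb"])
```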

compute_gamma(dissimilarity: AbstractDissimilarity | None = None, n_samples: int = 30, precision_level: float | Literal['high', 'medium', 'low'] | None = None, ground_truth_annotators: SortedSet | None = None, sampler: AbstractContinuumSampler = None, fast: bool = False, soft: bool = False) GammaResults
Parameters:
  • dissimilarity (AbstractDissimilarity, optional) – dissimilarity instance. Used to compute the disorder between units. If not set, it defaults to the combined categorical dissimilarity with parameters taken from the Java implementation.

  • n_samples (optional int) – number of random continua sampled from this continuum, used to estimate the gamma measure

  • precision_level (optional float or "high", "medium", "low") – error percentage of the gamma estimation. If a literal precision level is passed (e.g. “medium”), the corresponding numerical value will be used (high: 1%, medium: 2%, low : 5%)

  • ground_truth_annotators (SortedSet of str) – if set, the random continua will only be sampled from these annotators. This should be used when you want to compare a prediction against some ground truth annotation.

  • sampler (AbstractContinuumSampler) – Sampler object, which implements a sampling strategy for creating the random continua used to calculate the expected disorder. If not set, defaults to the statistical continuum sampler.

  • fast – Sets the algorithm to the much faster fast-gamma. It’s supposed to be less precise than the “canonical” algorithm from Mathet 2015, but usually isn’t. Performance gains and precision are explained in the Performance section of the documentation.

  • soft – Activate soft-gamma, an alternative measure that uses a slightly different definition of an alignment. For further information, please consult the ‘Soft-Gamma’ section of the documentation. Incompatible with fast-gamma: raises an error if both ‘fast’ and ‘soft’ are set to True.
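The argument rules in the list above can be sketched as a small validation step. This is an illustration only, not the library's code: resolve_precision and check_flags are hypothetical helpers, and the use of ValueError for the fast/soft conflict is an assumption, since the text only says that an error is raised.

```python
from typing import Optional, Union

# Numerical values documented for the literal precision levels.
PRECISION_LEVELS = {"high": 0.01, "medium": 0.02, "low": 0.05}


def resolve_precision(level: Union[float, str, None]) -> Optional[float]:
    """Turn 'high'/'medium'/'low' into a number; pass numbers and None through."""
    if level is None or isinstance(level, (int, float)):
        return level
    try:
        return PRECISION_LEVELS[level]
    except KeyError:
        raise ValueError(f"unknown precision level: {level!r}") from None


def check_flags(fast: bool, soft: bool) -> None:
    """Reject the documented incompatible combination (error type assumed)."""
    if fast and soft:
        raise ValueError("fast-gamma and soft-gamma cannot be combined")


medium = resolve_precision("medium")
```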

copy() Continuum

Makes a copy of the current continuum.

Returns:

continuum

Return type:

Continuum

copy_flush() Continuum

Returns a copy of the continuum without any annotators or annotations, but with all other information preserved.

classmethod from_csv(path: str | Path, discard_invalid_rows=True, delimiter: str = ',')

Load annotations from a CSV file with the structure annotator, category, segment_start, segment_end.

Warning

The CSV file mustn’t have any header

Parameters:
  • path (Path or str) – Path to the CSV file storing annotations

  • discard_invalid_rows (bool) – If set, every invalid row is ignored when parsing the file.

  • delimiter (str) – CSV columns delimiter. Defaults to ‘,’

Returns:

New continuum object loaded from the CSV

Return type:

Continuum
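To make the expected file layout concrete, here is a standalone sketch that parses the same headerless four-column structure with the standard csv module. It is not the library's parser: read_annotation_rows is a hypothetical helper, and what counts as an "invalid row" (wrong column count or non-numeric bounds) is an assumption.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple

Row = Tuple[Tuple[float, float], str]  # ((start, end), category)


def read_annotation_rows(handle: TextIO,
                         discard_invalid_rows: bool = True,
                         delimiter: str = ",") -> Dict[str, List[Row]]:
    """Group rows of annotator, category, segment_start, segment_end by annotator."""
    per_annotator: Dict[str, List[Row]] = {}
    for row in csv.reader(handle, delimiter=delimiter):
        try:
            annotator, category, start, end = row
            segment = (float(start), float(end))
        except ValueError:
            # Raised by a wrong column count or a non-numeric bound.
            if discard_invalid_rows:
                continue
            raise
        per_annotator.setdefault(annotator, []).append((segment, category))
    return per_annotator


sample = io.StringIO("Alex,Verb,2,9\nAlex,Noun,11,17\nbroken row\nMax,Verb,1,8.5\n")
rows = read_annotation_rows(sample)
```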

classmethod from_rttm(path: str | Path) Continuum

Load annotations from an RTTM file. The file name field will be used as the annotator of each annotation.

Parameters:

path (Path or str) – Path to the RTTM file storing annotations

Returns:

continuum – New continuum object loaded from the RTTM file

Return type:

Continuum

get_best_alignment(dissimilarity: AbstractDissimilarity) Alignment

Returns the best alignment of the continuum for the given dissimilarity. This alignment comes with the associated disorder, so you can obtain it in constant time with alignment.disorder. Beware that the computational complexity of the algorithm is very high: \(O(p_1 \times p_2 \times \dots \times p_n)\), where \(p_i\) is the number of annotations of annotator \(i\).

Parameters:

dissimilarity (AbstractDissimilarity) – the dissimilarity that will be used to compute unit-to-unit disorder.
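To get a feel for the growth rate quoted above, the bound can be evaluated directly. The helper below is hypothetical and only multiplies the per-annotator unit counts; it says nothing about the pruning the library may apply in practice.

```python
from math import prod
from typing import Sequence


def alignment_search_bound(units_per_annotator: Sequence[int]) -> int:
    """Evaluate p_1 * p_2 * ... * p_n for the given unit counts."""
    return prod(units_per_annotator)


three_annotators = alignment_search_bound([20, 20, 20])
four_annotators = alignment_search_bound([20, 20, 20, 20])
```

Adding a fourth annotator with 20 units multiplies the bound by 20, which is why the approximate get_fast_alignment exists.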

get_fast_alignment(dissimilarity: AbstractDissimilarity, window_size: int) Alignment

Returns an ‘approximation’ of the best alignment (Very likely to be the actual best alignment for continua with limited overlapping)

get_first_window(dissimilarity: AbstractDissimilarity, w: int = 1) Tuple[Continuum, float]
Returns a tuple (continuum, x_limit), where :
  • Before x_limit, there are the (w * nb_annotators) leftmost annotations of the continuum.

  • After x_limit, there are (approximately) all the annotations from the continuum that have a dissimilarity lower than (delta_empty * nb_annotators) with the annotations before x_limit.

iter_annotator(annotator: str) Generator[Unit, None, None]

Iterates over the annotations of the given annotator.

Raises:

KeyError – If the annotator is not in this continuum.

iterunits(annotator: str)

Iterate over units from the given annotator (in chronological and alphabetical order if annotations are present).

>>> for unit in continuum.iterunits("Max"):
...     pass  # do something with the unit
property max_num_annotations_per_annotator

The maximum number of annotated segments an annotator has in this continuum

measure_best_window_size(dissimilarity: AbstractDissimilarity)

Sets the best window size for computing the fast-gamma of this continuum, by sampling the computing-complexity function.

merge(continuum: Continuum, in_place: bool = False) Continuum | None

Merge two continua together. Units from the same annotators are also merged together (with the usual order of units).

Parameters:
  • continuum (Continuum) – other continuum to merge into the current one.

  • in_place (bool) – If set to true, the merge is done in place, and the current continuum (self) is the one being modified. A new continuum resulting in the merge is returned otherwise.

Returns:

The merged copy if in_place is False, None otherwise.

Return type:

Continuum, optional
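The in-place versus copy semantics can be sketched on the dictionary-of-sets representation mentioned in the class description. This is an illustration with plain dicts and sets, not the library's merge; merge_units and the tuple encoding of units are hypothetical.

```python
from copy import deepcopy
from typing import Dict, Optional, Set, Tuple

Units = Dict[str, Set[Tuple[float, float, str]]]  # annotator -> {(start, end, label)}


def merge_units(current: Units, other: Units,
                in_place: bool = False) -> Optional[Units]:
    """Union the units of `other` into `current`, annotator by annotator."""
    target = current if in_place else deepcopy(current)
    for annotator, units in other.items():
        target.setdefault(annotator, set()).update(units)
    # Mirror the documented contract: nothing is returned for in-place merges.
    return None if in_place else target


a: Units = {"Alex": {(2, 9, "1")}}
b: Units = {"Alex": {(11, 17, "2")}, "Max": {(1, 8, "1")}}
merged = merge_units(a, b)  # `a` is left untouched
```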

property num_annotators: int

Number of annotators

property num_units: int

Total number of units in the continuum.

remove(annotator: str, unit: Unit)

Removes the given unit from the given annotator’s annotations. Keeps the bounds of the continuum as they are.

Raises:

KeyError – if the unit is not from the annotator’s annotations.

reset_bounds()

Resets the bounds of the continuum (used in displaying and/or sampling) to the start of leftmost annotation and the end of rightmost annotation.

class pygamma_agreement.UnitaryAlignment(n_tuple: List[Tuple[str, Unit | None]])

Unitary Alignment

Parameters:

n_tuple – n-tuple, where n is the number of annotators of the continuum. This is a list of (annotator, unit) pairs.

__init__(n_tuple: List[Tuple[str, Unit | None]])
property bounds

Start of leftmost unit and end of rightmost unit

compute_disorder(dissimilarity: AbstractDissimilarity)

Builds a fake one-element alignment to compute the disorder.

property disorder: float

Disorder of the alignment. Raises ValueError if self.compute_disorder(dissimilarity) hasn’t been called before.

property nb_units

The number of non-empty units in the unitary alignment.
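The bounds and nb_units properties can be pictured on a plain list of (annotator, unit) pairs. The sketch below is not the library's class: units are reduced to hypothetical (start, end) tuples, and None marks an annotator with no unit in this unitary alignment.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float]
NTuple = List[Tuple[str, Optional[Segment]]]


def count_units(n_tuple: NTuple) -> int:
    """Number of entries that actually carry a unit."""
    return sum(1 for _, unit in n_tuple if unit is not None)


def alignment_bounds(n_tuple: NTuple) -> Segment:
    """Start of the leftmost unit and end of the rightmost unit."""
    present = [unit for _, unit in n_tuple if unit is not None]
    return (min(start for start, _ in present),
            max(end for _, end in present))


n_tuple: NTuple = [("Alex", (2.0, 9.0)), ("Max", (1.0, 8.5)), ("Sam", None)]
```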

class pygamma_agreement.Alignment(unitary_alignments: Iterable[UnitaryAlignment], continuum: Continuum | None = None, check_validity: bool = False, disorder: float | None = None)
__init__(unitary_alignments: Iterable[UnitaryAlignment], continuum: Continuum | None = None, check_validity: bool = False, disorder: float | None = None)

Alignment constructor.

Parameters:
  • unitary_alignments – set of unitary alignments that make a partition of the set of units/segments

  • continuum (optional Continuum) – Continuum where the alignment is from

  • check_validity (bool) – Check the validity of that Alignment against the specified continuum

  • disorder (float, optional) – If set, self.disorder returns it until a call to self.compute_disorder. This makes it possible to reuse the disorder already obtained during the best-alignment computation.

check(continuum: Continuum | None = None)

Checks that an alignment is a valid partition of a Continuum. That is, that all annotations from the referenced continuum can be found in the alignment and can be found only once. Empty units are not taken into account.

Parameters:

continuum (optional Continuum) – Continuum to check the alignment against. If none is specified, will try to use the one set at instantiation.

Raises:

ValueError, SetPartitionError
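The partition condition (every unit present, and present only once, with empty units ignored) can be sketched as follows. This is a simplified stand-in: check_partition is hypothetical, units are plain hashable tuples, and a single ValueError replaces the two exception types listed above.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Set, Tuple

UnitaryAlignment = List[Tuple[str, Optional[Hashable]]]


def check_partition(continuum: Dict[str, Set[Hashable]],
                    alignment: List[UnitaryAlignment]) -> None:
    """Raise ValueError unless each unit of the continuum appears exactly once."""
    seen = Counter(
        (annotator, unit)
        for unitary in alignment
        for annotator, unit in unitary
        if unit is not None  # empty units are not taken into account
    )
    expected = {(annotator, unit)
                for annotator, units in continuum.items()
                for unit in units}
    duplicated = {key for key, count in seen.items() if count > 1}
    missing = expected - set(seen)
    unknown = set(seen) - expected
    if duplicated or missing or unknown:
        raise ValueError(
            f"invalid partition: duplicated={duplicated}, "
            f"missing={missing}, unknown={unknown}"
        )


continuum = {"Alex": {(2, 9), (11, 17)}, "Max": {(1, 8)}}
valid = [
    [("Alex", (2, 9)), ("Max", (1, 8))],
    [("Alex", (11, 17)), ("Max", None)],
]
check_partition(continuum, valid)  # passes silently
```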

compute_disorder(dissimilarity: AbstractDissimilarity)

Recalculates the disorder of this alignment using the given dissimilarity computer. Usually not needed since most alignments are generated from a minimal disorder.

property disorder: float

Returns:

The disorder of the alignment.

Return type:

float

gamma_k_disorder(dissimilarity: AbstractDissimilarity, category: str | None) float

Returns the gamma-k or gamma-cat metric disorder. (Exact implementation of the algorithm from section 4.2.5 of https://hal.archives-ouvertes.fr/hal-01712281)

Parameters:
  • dissimilarity (AbstractDissimilarity) – the dissimilarity measure to be used in the algorithm. Raises ValueError if it is not a combined categorical dissimilarity, as gamma-cat requires both positional and categorical dissimilarity.

  • category – If set, the category to be used as reference for gamma-k. Leave it unset to compute the gamma-cat disorder.

class pygamma_agreement.GammaResults(best_alignment: Alignment, chance_alignments: List[Alignment], dissimilarity: AbstractDissimilarity, precision_level: float | None = None)

Gamma results object. Stores the information about a gamma measure computation, used for getting the values of measures from the gamma family (gamma, gamma-cat and gamma-k).

__eq__(other)

Return self==value.

__hash__ = None
__init__(best_alignment: Alignment, chance_alignments: List[Alignment], dissimilarity: AbstractDissimilarity, precision_level: float | None = None) None
__repr__()

Return repr(self).

property alignments_nb

Number of unitary alignments in the best alignment.

property approx_gamma_range

Returns a tuple of the expected boundaries of the computed gamma, obtained using the expected disagreement and the precision level

property expected_disorder: float

Returns the expected disagreement for computed random samples, i.e., the mean of the sampled continua’s disorders.

property gamma: float

Returns the gamma value

property gamma_cat: float

Returns the gamma-cat value

gamma_k(category: str) float

Returns the gamma-k value for the given category

property n_samples

Number of samples used for computation of the expected disorder.

property observed_disorder: float

Returns the disorder of the computed best alignment, i.e., the observed disagreement.
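The relation between these properties can be written down in a few lines. The sketch assumes the definition from the gamma paper, gamma = 1 - observed disorder / expected disorder, with the expected disorder taken as the mean over the sampled continua; gamma_from_disorders is a hypothetical helper, not a library function.

```python
from statistics import mean
from typing import Sequence


def gamma_from_disorders(observed_disorder: float,
                         sampled_disorders: Sequence[float]) -> float:
    """Agreement is 1 minus the ratio of observed to expected disorder."""
    expected_disorder = mean(sampled_disorders)
    return 1.0 - observed_disorder / expected_disorder


gamma = gamma_from_disorders(0.5, [2.0, 2.5, 1.5])
```

With an observed disorder equal to the expected one the value drops to 0, and it becomes negative when annotators disagree more than the random samples do.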

Dissimilarities

class pygamma_agreement.AbstractDissimilarity(categories: SortedSet | None = None, delta_empty: float = 1.0)

Function used to measure the difference between two annotations, using their positioning and categorization.

Parameters:
  • delta_empty (float) – Distance between a unit and a “null” unit. Defaults to 1.0

  • categories (SortedSet of str, optional) – Labels of annotations involved. Some dissimilarities don’t consider the actual content of the categories, so it is left optional.

__init__(categories: SortedSet | None = None, delta_empty: float = 1.0)
abstract compile_d_mat() Callable[[ndarray, ndarray], float]

Must set self.d_mat to the cfunc (decorated with @dissimilarity_dec) function that corresponds to the unit-to-unit (in arrays form) disorder given by the dissimilarity.

compute_disorder(alignment: Alignment) ndarray

Returns the disorder of the given alignment.

abstract d(unit1: Unit, unit2: Unit)

Dissimilarity between two units as a real Unit object.

valid_alignments(continuum: Continuum) Tuple[ndarray, ndarray]

Returns all the unitary alignments (in matrix form), together with their disorders, that could potentially be in the best alignment of the continuum, based on the criterion detailed in section 5.1.1 of the gamma paper (https://aclanthology.org/J15-3003.pdf).

class pygamma_agreement.PositionalSporadicDissimilarity(delta_empty: float = 1.0)

Positional-sporadic dissimilarity. Takes only the position of annotations into account. This distance is :

  • 0 when segments are equal

  • < delta_empty when segments completely overlap (\(A \cup B = A\) or \(B\))

  • > delta_empty when segments are separated (\(A \cap B = \emptyset\))
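The three cases listed above can be checked numerically. The sketch assumes the positional sporadic formula as given in the gamma paper, the squared ratio of summed boundary distances to summed durations, scaled by delta_empty. It is written from that definition rather than taken from the library, and positional_sporadic is a hypothetical name.

```python
from typing import Tuple

Segment = Tuple[float, float]


def positional_sporadic(a: Segment, b: Segment, delta_empty: float = 1.0) -> float:
    """((|start_a - start_b| + |end_a - end_b|) / (dur_a + dur_b))^2 * delta_empty."""
    (start_a, end_a), (start_b, end_b) = a, b
    distance = abs(start_a - start_b) + abs(end_a - end_b)
    total_duration = (end_a - start_a) + (end_b - start_b)
    return (distance / total_duration) ** 2 * delta_empty


equal = positional_sporadic((2.0, 9.0), (2.0, 9.0))       # identical segments
contained = positional_sporadic((0.0, 10.0), (2.0, 8.0))  # B lies inside A
separated = positional_sporadic((0.0, 1.0), (2.0, 3.0))   # disjoint segments
```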

__init__(delta_empty: float = 1.0)
compile_d_mat()

Must set self.d_mat to the cfunc (decorated with @dissimilarity_dec) function that corresponds to the unit-to-unit (in arrays form) disorder given by the dissimilarity.

d(unit1: Unit, unit2: Unit)

Dissimilarity between two units as a real Unit object.

class pygamma_agreement.CategoricalDissimilarity(categories: SortedSet, delta_empty: float = 1.0)

Abstract base class for categorical dissimilarity.

__init__(categories: SortedSet, delta_empty: float = 1.0)
class pygamma_agreement.AbsoluteCategoricalDissimilarity(delta_empty: float = 1.0)

Basic categorical dissimilarity. Worth 0.0 when categories are identical, delta_empty otherwise.

__init__(delta_empty: float = 1.0)
compile_d_mat()

Must set self.d_mat to the cfunc (decorated with @dissimilarity_dec) function that corresponds to the unit-to-unit (in arrays form) disorder given by the dissimilarity.

d(unit1: Unit, unit2: Unit)

Dissimilarity between two units as a real Unit object.

class pygamma_agreement.PrecomputedCategoricalDissimilarity(categories: SortedSet, matrix: ndarray, delta_empty: float = 1.0)

Categorical dissimilarity with a provided matrix that contains all the category-to-category dissimilarity. The indexes of the matrix correspond to the categories in alphabetical order.

__init__(categories: SortedSet, matrix: ndarray, delta_empty: float = 1.0)
compile_d_mat()

Must set self.d_mat to the cfunc (decorated with @dissimilarity_dec) function that corresponds to the unit-to-unit (in arrays form) disorder given by the dissimilarity.

d(unit1: Unit, unit2: Unit)

Dissimilarity between two units as a real Unit object.

class pygamma_agreement.OrdinalCategoricalDissimilarity(labels: Iterable[str], p: Iterable[float] | None = None, delta_empty=1.0)

Categorical dissimilarity where each label is given a position on the real axis. The disorder between categories at positions ‘a’ and ‘b’ is |a - b|/m * delta_empty, with m the maximum position. If not provided, positions are 0, 1, 2…

__init__(labels: Iterable[str], p: Iterable[float] | None = None, delta_empty=1.0)
Parameters:
  • labels (Iterable of str) – The categories involved in the dissimilarity

  • p (Iterable of floats) – The real numbers associated with each label, in the same order.
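The formula |a - b|/m * delta_empty can be tried out on a small label scale. This sketch only reproduces the arithmetic described above; ordinal_dissimilarity is a hypothetical factory and not the class constructor.

```python
from typing import Callable, Iterable, Optional


def ordinal_dissimilarity(labels: Iterable[str],
                          p: Optional[Iterable[float]] = None,
                          delta_empty: float = 1.0) -> Callable[[str, str], float]:
    """Build d(cat_a, cat_b) = |pos_a - pos_b| / max_pos * delta_empty."""
    labels = list(labels)
    # Default positions are 0, 1, 2, ... in the order the labels are given.
    positions = list(p) if p is not None else list(range(len(labels)))
    position_of = dict(zip(labels, positions))
    max_position = max(positions)

    def d(cat_a: str, cat_b: str) -> float:
        gap = abs(position_of[cat_a] - position_of[cat_b])
        return gap / max_position * delta_empty

    return d


d = ordinal_dissimilarity(["low", "medium", "high"])
```

Here d("low", "medium") is half of d("low", "high"), which is the point of an ordinal scale: neighbouring labels disagree less than distant ones.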

class pygamma_agreement.NumericalCategoricalDissimilarity(labels: Iterable[str], delta_empty: float = 1.0)

Categorical dissimilarity made for numerical categories (i.e. a category is a float or int literal). The disorder between categories ‘a’ and ‘b’ is |a - b|/m * delta_empty, with m the maximum category.

__init__(labels: Iterable[str], delta_empty: float = 1.0)
Parameters:
  • labels (Iterable of str) – The categories involved in the dissimilarity

class pygamma_agreement.LambdaCategoricalDissimilarity(labels: Iterable[str], delta_empty: float = 1.0)

Categorical dissimilarity, whose values are precomputed from a (str, str) -> float function (the cat_dissim_func method) and the list of categories provided.

__init__(labels: Iterable[str], delta_empty: float = 1.0)
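The precomputation step can be pictured as filling a square matrix from a (str, str) -> float function, with rows and columns following the alphabetical order of the categories. The helper below is a hypothetical sketch using nested lists; the library itself works with ndarray matrices.

```python
from typing import Callable, Iterable, List, Tuple


def precompute_matrix(labels: Iterable[str],
                      cat_dissim_func: Callable[[str, str], float]
                      ) -> Tuple[List[str], List[List[float]]]:
    """Evaluate the function on every ordered pair of categories."""
    ordered = sorted(set(labels))  # index i corresponds to ordered[i]
    matrix = [[cat_dissim_func(a, b) for b in ordered] for a in ordered]
    return ordered, matrix


ordered, matrix = precompute_matrix(["b", "a", "c"],
                                    lambda a, b: float(a != b))
```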
class pygamma_agreement.LevenshteinCategoricalDissimilarity(labels: Iterable[str], delta_empty: float = 1.0)

Precomputed categorical dissimilarity whose value is the proportional Levenshtein distance between the category labels.

__init__(labels: Iterable[str], delta_empty: float = 1.0)
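For readers unfamiliar with the measure, here is a self-contained Levenshtein distance together with one possible normalization. Dividing by the length of the longer label is an assumption about what "proportional" means here; the library may normalize differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def proportional_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


distance = levenshtein("kitten", "sitting")
```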
class pygamma_agreement.CombinedCategoricalDissimilarity(alpha: float = 1.0, beta: float = 1.0, delta_empty: float = 1.0, pos_dissim: AbstractDissimilarity | None = None, cat_dissim: CategoricalDissimilarity | None = None)

This dissimilarity takes both positioning and categorizing of annotations into account. Combined categorical dissimilarity constructor.

Parameters:
  • delta_empty (optional, float) – empty dissimilarity value. Defaults to 1.

  • alpha (optional float) – coefficient weighting the positional dissimilarity value. Defaults to 1.

  • beta (optional float) – coefficient weighting the categorical dissimilarity value. Defaults to 1.

  • cat_dissim (optional, CategoricalDissimilarity) – Categorical-only dissimilarity to be used. If not set, defaults to the absolute categorical dissimilarity.
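How alpha and beta weight the two components can be sketched with a linear combination, alpha * positional + beta * categorical, which is the form used in the gamma paper. The functions below are hypothetical stand-ins written for this example (positional sporadic formula assumed from the paper, absolute categorical dissimilarity as described in this reference), not the library's compiled implementation.

```python
from typing import Optional, Tuple

Unit = Tuple[float, float, Optional[str]]  # (start, end, category)


def positional(a: Unit, b: Unit, delta_empty: float = 1.0) -> float:
    """Squared ratio of boundary distances to summed durations (assumed formula)."""
    distance = abs(a[0] - b[0]) + abs(a[1] - b[1])
    total_duration = (a[1] - a[0]) + (b[1] - b[0])
    return (distance / total_duration) ** 2 * delta_empty


def absolute_categorical(a: Unit, b: Unit, delta_empty: float = 1.0) -> float:
    """0.0 when categories are identical, delta_empty otherwise."""
    return 0.0 if a[2] == b[2] else delta_empty


def combined(a: Unit, b: Unit, alpha: float = 1.0, beta: float = 1.0,
             delta_empty: float = 1.0) -> float:
    """Weighted sum of the positional and categorical components."""
    return (alpha * positional(a, b, delta_empty)
            + beta * absolute_categorical(a, b, delta_empty))


value = combined((0.0, 10.0, "Verb"), (2.0, 8.0, "Noun"), alpha=3.0, beta=1.0)
```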

__init__(alpha: float = 1.0, beta: float = 1.0, delta_empty: float = 1.0, pos_dissim: AbstractDissimilarity | None = None, cat_dissim: CategoricalDissimilarity | None = None)
compile_d_mat()

Must set self.d_mat to the cfunc (decorated with @dissimilarity_dec) function that corresponds to the unit-to-unit (in arrays form) disorder given by the dissimilarity.

d(unit1: Unit, unit2: Unit)

Dissimilarity between two units as a real Unit object.

Samplers

class pygamma_agreement.AbstractContinuumSampler

Tool for generating sampled continua from a reference continuum. Used to compute the “expected disorder” when calculating the gamma, using particular sampling techniques. Must be initialized (with self.init_sampling for instance).

__init__()

Super constructor, sets everything to None since a call to init_sampling to set parameters is mandatory.

init_sampling(reference_continuum: Continuum, ground_truth_annotators: Iterable[str] | None = None)
Parameters:
  • reference_continuum (Continuum) – the continuum that will be shuffled into the samples

  • ground_truth_annotators (iterable of str, optional) – the set of annotators (from the reference) that will be considered for sampling

abstract property sample_from_continuum: Continuum

Returns a shuffled continuum based on the reference. Everything in the generated sample is at least a copy.

Raises:

ValueError – if init_sampling or another initialization method hasn’t been called before.

class pygamma_agreement.ShuffleContinuumSampler(pivot_type: Literal['float_pivot', 'int_pivot'] = 'int_pivot')

This continuum sampler uses the method of the gamma software, i.e. the one described in section 5.2 of the gamma paper (https://www.aclweb.org/anthology/J15-3003.pdf) and implemented in GammaSoftware.

__init__(pivot_type: Literal['float_pivot', 'int_pivot'] = 'int_pivot')

This constructor allows setting the pivot type to int or float. Defaults to int to match the Java implementation.

init_sampling(reference_continuum: Continuum, ground_truth_annotators: Iterable[str] | None = None)
Parameters:
  • reference_continuum (Continuum) – the continuum that will be shuffled into the samples

  • ground_truth_annotators (iterable of str, optional) – the set of annotators (from the reference) that will be considered for sampling

property sample_from_continuum: Continuum

Returns a shuffled continuum based on the reference. Everything in the generated sample is at least a copy.

Raises:

ValueError – if init_sampling or another initialization method hasn’t been called before.

class pygamma_agreement.StatisticalContinuumSampler

This sampler creates continua using the average and standard deviation of :

  • The number of annotations per annotator

  • The gap between two of an annotator’s annotations

  • The duration of the annotations’ segments

The sample is thus created by computing normal distributions using these parameters.

It also requires the probability of occurrence of each annotation category. You can initialize sampling either with custom values or with a reference continuum.
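The idea of building a sample from normal distributions can be sketched for a single annotator. Everything here is a simplified assumption for illustration: sample_annotator_units is a hypothetical function, at least one unit is always drawn, negative durations are folded to positive, and units are laid out one after another from time 0. The library's sampler may handle these details differently.

```python
import random
from typing import List, Optional, Sequence, Tuple

SampledUnit = Tuple[float, float, str]  # (start, end, category)


def sample_annotator_units(rng: random.Random,
                           avg_num_units: float, std_num_units: float,
                           avg_gap: float, std_gap: float,
                           avg_duration: float, std_duration: float,
                           categories: Sequence[str],
                           weights: Optional[Sequence[float]] = None
                           ) -> List[SampledUnit]:
    """Draw unit count, gaps and durations from normal distributions."""
    nb_units = max(1, round(rng.gauss(avg_num_units, std_num_units)))
    units: List[SampledUnit] = []
    cursor = 0.0
    for _ in range(nb_units):
        # A negative gap simply makes the unit overlap the previous one.
        start = cursor + rng.gauss(avg_gap, std_gap)
        duration = abs(rng.gauss(avg_duration, std_duration))
        category = rng.choices(categories, weights=weights)[0]
        units.append((start, start + duration, category))
        cursor = start + duration
    return units


rng = random.Random(0)
units = sample_annotator_units(rng, 5, 1, 2.0, 0.5, 3.0, 1.0,
                               ["Verb", "Noun"], [0.7, 0.3])
```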

init_sampling(reference_continuum: Continuum, ground_truth_annotators: Iterable[str] | None = None)

Sets the sampling parameters using statistical values obtained from the reference continuum.

Parameters:
  • reference_continuum (Continuum) – the continuum that will be shuffled into the samples

  • ground_truth_annotators (iterable of str, optional) – the set of annotators (from the reference) that will be considered for sampling

init_sampling_custom(annotators: Iterable[str], avg_num_units_per_annotator: float, std_num_units_per_annotator: float, avg_gap: float, std_gap: float, avg_duration: float, std_duration: float, categories: Iterable[str], categories_weight: Iterable[float] | None = None)
Parameters:
  • annotators – the annotators that will be involved in the samples

  • avg_num_units_per_annotator (float, optional) – average number of units per annotator

  • std_num_units_per_annotator (float, optional) – standard deviation of the number of units per annotator

  • avg_gap (float, optional) – average gap between two of an annotator’s annotations

  • std_gap (float, optional) – standard deviation of the gap between two of an annotator’s annotations

  • avg_duration (float, optional) – average duration of an annotation

  • std_duration (float, optional) – standard deviation of the duration of an annotation

  • categories (np.array[str, 1d]) – The possible categories of the annotations

  • categories_weight (np.array[float, 1d], optional) – The probability of occurrence of each category. Can raise errors if len(categories) != len(categories_weight) or categories_weight.sum() != 1.0. If not set, every category is equiprobable.

property sample_from_continuum: Continuum

Returns a shuffled continuum based on the reference. Everything in the generated sample is at least a copy.

Raises:

ValueError – if init_sampling or another initialization method hasn’t been called before.

Corpus Shuffling Tool

class pygamma_agreement.CorpusShufflingTool(magnitude: float, reference_continuum: Continuum, categories: Iterable[str] | None = None)

Corpus shuffling tool as detailed in section 6.3 of the gamma paper (https://www.aclweb.org/anthology/J15-3003.pdf#page=30).

__init__(magnitude: float, reference_continuum: Continuum, categories: Iterable[str] | None = None)
Parameters:
  • magnitude – magnitude m of the cst (cf gamma paper)

  • reference_continuum – this continuum will serve as reference for the tweaks made by the corpus shuffling tool.

  • categories – this is used to consider additional categories when shuffling the corpus, in case the reference continuum does not contain any unit of a possible category.

category_shuffle(continuum: Continuum, overlapping_fun: Callable[[str, str], float] | None = None, prevalence: bool = False)

Shuffles the categories of the annotations in the given continuum using the process described in section 3.3.5 of https://hal.archives-ouvertes.fr/hal-00769639/.

Parameters:
  • overlapping_fun – gives the “categorical distance” between two annotations, which is taken into account when provided. (the lower the distance between categories, the higher the chance one will be changed into the other).

  • prevalence – specify whether or not to consider the proportion of presence of each category in the reference.

corpus_shuffle(annotators: int | Iterable[str], shift: bool = False, false_pos: bool = False, false_neg: bool = False, split: bool = False, cat_shuffle: bool = False, include_ref: bool = False) Continuum

Generates a new shuffled corpus with the provided (or generated) reference annotation set, using the method described in 6.3 of the gamma paper, https://www.aclweb.org/anthology/J15-3003.pdf#page=30 (and missing elements described in another article : https://hal.archives-ouvertes.fr/hal-00769639/).

false_neg_shuffle(continuum: Continuum) None

Tweaks the continuum by randomly removing units (“false negatives”). Every unit (for each annotator) has a probability equal to the magnitude of being removed. If this probability is one, a single random unit (for each annotator) will be left alone.
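The removal rule can be sketched for one annotator's units. The function below is a hypothetical illustration of the described behavior, not the tool's code; unlike false_neg_shuffle, which returns None and modifies the continuum, it returns a new list.

```python
import random
from typing import List, TypeVar

T = TypeVar("T")


def drop_units(units: List[T], magnitude: float, rng: random.Random) -> List[T]:
    """Remove each unit with probability `magnitude`; keep one unit at magnitude 1."""
    if magnitude >= 1.0:
        # Everything would be removed, so a single random unit is kept.
        return [rng.choice(units)] if units else []
    # rng.random() is in [0, 1), so a unit is dropped with probability `magnitude`.
    return [unit for unit in units if rng.random() >= magnitude]


rng = random.Random(42)
units = [(2, 9), (11, 17), (20, 25), (30, 34)]
untouched = drop_units(units, 0.0, rng)
survivor = drop_units(units, 1.0, rng)
```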

false_pos_shuffle(continuum: Continuum) None

Tweaks the continuum by randomly adding “false positive” units. The number of added units per annotator is constant and proportional to the magnitude of the CST. The chosen category is random and depends on the probability of occurrence of the category in the reference. The length of the segment is random (normal distribution) based on the average and standard deviation of those of the reference.

shift_shuffle(continuum: Continuum) None

Tweaks the given continuum by shifting the ends of each segment, with uniformly distributed values of bounds proportional to the magnitude of the CST and the length of the segment.
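A minimal sketch of such a shift is given below. The bound magnitude * segment length is an assumption made for the example (the exact factor used by the tool is not stated here), and shift_segment is a hypothetical helper that returns a new segment instead of modifying a continuum.

```python
import random
from typing import Tuple


def shift_segment(start: float, end: float, magnitude: float,
                  rng: random.Random) -> Tuple[float, float]:
    """Move both ends by uniform offsets bounded by magnitude * length."""
    max_shift = magnitude * (end - start)  # assumed bound
    new_start = start + rng.uniform(-max_shift, max_shift)
    new_end = end + rng.uniform(-max_shift, max_shift)
    # Reorder in case large shifts made the two ends cross.
    return (min(new_start, new_end), max(new_start, new_end))


rng = random.Random(7)
unchanged = shift_segment(10.0, 20.0, 0.0, rng)
shifted = shift_segment(10.0, 20.0, 0.2, rng)
```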

splits_shuffle(continuum: Continuum)

Tweaks the continuum by randomly splitting segments. The number of splits per annotator is constant and proportional to the magnitude of the CST and the number of units in the reference. A segment that has been split can be split again.