calibrax.metrics.functional.text

Text evaluation metrics for translation, summarization, and generation. The Tier 0 functions (BLEU, ROUGE-N, ROUGE-L, perplexity, and distinct-N) are implemented in pure Python/JAX without external NLP libraries.

All metrics in this module are n-gram counting or mathematical operations: no pretrained models, no transformers, no neural networks. The functions are registered with domain="text".

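Note: `bleu`, `rouge_n`, and `rouge_l` operate on Python strings or token lists, and `distinct_n` operates on a token list. None of them is JAX-traceable, so they cannot take traced values under `jax.jit`, `jax.grad`, or `jax.vmap`. `perplexity` operates on JAX arrays.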

`bleu(candidate: str | list[str], references: list[str | list[str]], *, max_n: int = 4, weights: tuple[float, ...] | None = None) -> float`

BLEU score for machine translation evaluation.

Computes the modified (clipped) n-gram precision for n = 1..max_n, combines the orders according to `weights`, and applies a brevity penalty.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `candidate` | `str \| list[str]` | Candidate translation (string or token list). | required |
| `references` | `list[str \| list[str]]` | List of reference translations. | required |
| `max_n` | `int` | Maximum n-gram order (default 4 for BLEU-4). | `4` |
| `weights` | `tuple[float, ...] \| None` | Weights for each n-gram order. Default: uniform (1/max_n each). | `None` |

Returns:

| Type | Description |
| --- | --- |
| `float` | BLEU score in [0, 1]. 1.0 = perfect match. |

Examples:

>>> bleu("the cat sat on the mat", ["the cat is on the mat"])
...
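
To make the computation concrete, here is a minimal standalone sketch of textbook BLEU. It is not calibrax's implementation: the names `bleu_sketch`, `_tokens`, and `_ngrams` are hypothetical, and whitespace tokenization, the absence of smoothing, and the closest-reference-length brevity penalty are assumptions that the library may not share.

```python
import math
from collections import Counter


def _tokens(text):
    # Assumption: strings are split on whitespace; token lists pass through.
    return text.split() if isinstance(text, str) else list(text)


def _ngrams(tokens, n):
    # Multiset of all contiguous n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_sketch(candidate, references, max_n=4, weights=None):
    """Hypothetical reference BLEU: clipped precision times brevity penalty."""
    cand = _tokens(candidate)
    refs = [_tokens(r) for r in references]
    if not cand or not refs:
        return 0.0
    weights = weights or tuple(1.0 / max_n for _ in range(max_n))

    log_precision = 0.0
    for n, w in zip(range(1, max_n + 1), weights):
        cand_counts = _ngrams(cand, n)
        total = sum(cand_counts.values())
        # Clip each candidate count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in _ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # No smoothing in this sketch (an assumption).
        log_precision += w * math.log(clipped / total)

    # Brevity penalty against the reference length closest to the candidate.
    ref_len = min((len(r) for r in refs),
                  key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) >= ref_len else math.exp(1.0 - ref_len / len(cand))
    return bp * math.exp(log_precision)


score_2 = bleu_sketch("the cat sat on the mat", ["the cat is on the mat"], max_n=2)
score_4 = bleu_sketch("the cat sat on the mat", ["the cat is on the mat"], max_n=4)
```

Under these assumptions the sentence pair shares no 4-gram, so the unsmoothed sketch gives `score_4 == 0.0`, while `score_2` is sqrt(5/6 × 3/5) ≈ 0.707. The library's own result may differ if it smooths zero counts or tokenizes differently.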

`rouge_n(candidate: str | list[str], reference: str | list[str], *, n: int = 1) -> float`

ROUGE-N recall: fraction of reference n-grams found in candidate.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `candidate` | `str \| list[str]` | Candidate text (string or token list). | required |
| `reference` | `str \| list[str]` | Reference text (string or token list). | required |
| `n` | `int` | N-gram order (default 1 for ROUGE-1). | `1` |

Returns:

| Type | Description |
| --- | --- |
| `float` | ROUGE-N recall in [0, 1]. 1.0 = all reference n-grams found. |

Examples:

>>> rouge_n("the cat sat on the mat", "the cat is on the mat", n=1)
...
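
The recall definition above can be illustrated with a short standalone sketch. The name `rouge_n_sketch` is hypothetical, and both whitespace tokenization and count-clipped overlap (each reference n-gram matched at most as often as it occurs in the candidate) are assumptions, not confirmed library behavior.

```python
from collections import Counter


def rouge_n_sketch(candidate, reference, n=1):
    """Hypothetical ROUGE-N recall: matched reference n-grams / reference n-grams."""
    # Assumption: strings are split on whitespace; token lists pass through.
    cand = candidate.split() if isinstance(candidate, str) else list(candidate)
    ref = reference.split() if isinstance(reference, str) else list(reference)

    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))

    total = sum(ref_counts.values())
    if total == 0:
        return 0.0  # Reference shorter than n: defined as 0 in this sketch.
    # Count each reference n-gram at most as often as the candidate contains it.
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total


r1 = rouge_n_sketch("the cat sat on the mat", "the cat is on the mat", n=1)
r2 = rouge_n_sketch("the cat sat on the mat", "the cat is on the mat", n=2)
```

In this sketch five of the six reference unigrams are matched (only "is" is missing), so `r1` is 5/6, and three of the five reference bigrams are matched, so `r2` is 3/5.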

`rouge_l(candidate: str | list[str], reference: str | list[str]) -> float`

ROUGE-L: F-measure based on the longest common subsequence (LCS) of candidate and reference.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `candidate` | `str \| list[str]` | Candidate text (string or token list). | required |
| `reference` | `str \| list[str]` | Reference text (string or token list). | required |

Returns:

| Type | Description |
| --- | --- |
| `float` | ROUGE-L F-measure in [0, 1]. 1.0 = identical sequences. |

Examples:

>>> rouge_l("the cat sat on the mat", "the cat is on the mat")
...
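
As an illustration of the LCS-based score, the following standalone sketch computes the LCS length by dynamic programming and turns it into an F-measure. The names `lcs_length` and `rouge_l_sketch` are hypothetical, and the balanced F1 (equal weight on precision and recall) and whitespace tokenization are assumptions; the library may weight recall differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, else carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_sketch(candidate, reference):
    """Hypothetical ROUGE-L: balanced F1 of LCS precision and LCS recall."""
    # Assumption: strings are split on whitespace; token lists pass through.
    cand = candidate.split() if isinstance(candidate, str) else list(candidate)
    ref = reference.split() if isinstance(reference, str) else list(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


f1 = rouge_l_sketch("the cat sat on the mat", "the cat is on the mat")
```

For this pair the LCS is "the cat on the mat" (length 5), so precision and recall are both 5/6 and the sketch's F1 is 5/6 ≈ 0.833.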

`perplexity(log_probabilities: Any) -> Any`

Perplexity from log-probabilities.

Computes exp(-mean(log_probabilities)). Lower perplexity indicates a better model.

Parameters:

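| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `log_probabilities` | `Any` | Array of log-probabilities from a language model. | required |

Returns:

| Type | Description |
| --- | --- |
| `Any` | Perplexity value, >= 1.0 whenever every log-probability is <= 0. |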

Examples:

>>> import jax.numpy as jnp
>>> perplexity(jnp.array([0.0, 0.0, 0.0]))  # Perfect model
1.0
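
The formula is easy to check by hand. The sketch below reproduces the arithmetic on plain Python lists rather than JAX arrays, purely for illustration; `perplexity_sketch` is a hypothetical name, and natural-log inputs are assumed because the result is obtained with `exp`.

```python
import math


def perplexity_sketch(log_probs):
    """Hypothetical reference: exp of the negative mean log-probability."""
    # Assumption: log_probs are natural logarithms of token probabilities.
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that spreads probability uniformly over 4 tokens at every step
# is "as confused as" a fair 4-way choice, so its perplexity is about 4.
uniform_4 = [math.log(1 / 4)] * 10
ppl_uniform = perplexity_sketch(uniform_4)

# A model that assigns probability 1 to every observed token.
ppl_perfect = perplexity_sketch([0.0, 0.0, 0.0])
```

This also shows why perplexity is often read as an effective branching factor: a uniform distribution over V tokens gives a perplexity of V.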

`distinct_n(tokens: list[str], *, n: int = 1) -> float`

Distinct-N: ratio of unique n-grams to total n-grams.

Measures lexical diversity. Higher = more diverse vocabulary usage.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokens` | `list[str]` | List of tokens. | required |
| `n` | `int` | N-gram order (default 1 for unigram diversity). | `1` |

Returns:

| Type | Description |
| --- | --- |
| `float` | Distinct-N ratio in [0, 1]. 1.0 = all n-grams unique. |

Examples:

>>> distinct_n(["the", "cat", "sat", "on"], n=1)
1.0
>>> distinct_n(["the", "the", "the"], n=1)
0.333...
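
The ratio can be sketched in a few lines. `distinct_n_sketch` is a hypothetical standalone helper, not the library function, and returning 0.0 when the input has fewer than `n` tokens is an assumption about an edge case the documentation does not specify.

```python
def distinct_n_sketch(tokens, n=1):
    """Hypothetical reference: unique n-grams divided by total n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # Assumption: too-short input yields 0.0.
    return len(set(ngrams)) / len(ngrams)


d1_varied = distinct_n_sketch(["the", "cat", "sat", "on"], n=1)
d1_repeat = distinct_n_sketch(["the", "the", "the"], n=1)
d2_cycle = distinct_n_sketch(["a", "b", "a", "b"], n=2)
```

The last call illustrates the bigram case: the tokens yield the bigrams (a, b), (b, a), (a, b), of which two are unique, so the ratio is 2/3.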

Plugin Metrics (Tier 1)

Optional Dependency

BERTScore requires pretrained BERT embeddings. Install the optional dependency with `uv pip install "calibrax[text]"`.

Import directly from the plugin module:

from calibrax.metrics.plugins.text import BERTScoreMetric

See Stateful Metrics for the base class API.
