calibrax.metrics.functional.text

Text evaluation metrics for translation, summarization, and generation. Tier 0 functions include BLEU, ROUGE-N, ROUGE-L, perplexity, and distinct-N -- all implemented in pure Python/JAX without external NLP libraries.

Text evaluation metrics based on n-gram counting and simple math.

Every metric in this module is either an n-gram counting routine or a mathematical operation. None of them uses pretrained models, transformers, or neural networks.

Includes BLEU, ROUGE-N, ROUGE-L, perplexity, and distinct-N, all registered with domain="text".

Note: bleu, rouge_n, and rouge_l accept Python strings or token lists, and distinct_n accepts a token list. None of these is JAX-traceable. perplexity operates on JAX arrays.

bleu(candidate: str | list[str], references: list[str | list[str]], *, max_n: int = 4, weights: tuple[float, ...] | None = None) -> float

BLEU score for machine translation evaluation.

Computes modified n-gram precision for n=1..max_n with brevity penalty.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| candidate | str \| list[str] | Candidate translation (string or token list). | required |
| references | list[str \| list[str]] | List of reference translations. | required |
| max_n | int | Maximum n-gram order (default 4 for BLEU-4). | 4 |
| weights | tuple[float, ...] \| None | Weights for each n-gram order. Default: uniform (1/max_n each). | None |

Returns:

| Type | Description |
| --- | --- |
| float | BLEU score in [0, 1]. 1.0 = perfect match. |

Examples:

>>> bleu("the cat sat on the mat", ["the cat is on the mat"])
...
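
The modified n-gram precision and brevity penalty described above can be sketched in plain Python. This is a minimal illustration of the standard unsmoothed sentence-level BLEU formula, not calibrax's actual implementation. The name bleu_sketch, the whitespace tokenization, the closest-reference-length rule, and the absence of smoothing are all assumptions made for the example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_sketch(candidate, references, max_n=4, weights=None):
    """Hypothetical unsmoothed BLEU; tokenization by str.split is an assumption."""
    cand = candidate.split() if isinstance(candidate, str) else list(candidate)
    refs = [r.split() if isinstance(r, str) else list(r) for r in references]
    weights = weights or tuple(1.0 / max_n for _ in range(max_n))
    if not cand:
        return 0.0

    log_precision = 0.0
    for n, weight in zip(range(1, max_n + 1), weights):
        cand_counts = ngram_counts(cand, n)
        # Clip each candidate n-gram by its highest count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precision += weight * math.log(clipped / total)

    # Brevity penalty against the reference length closest to the candidate length.
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    penalty = 1.0 if len(cand) > ref_len else math.exp(1.0 - ref_len / len(cand))
    return penalty * math.exp(log_precision)
```

In this sketch, "the cat sat on the mat" and "the cat is on the mat" share no 4-gram, so the unsmoothed score at max_n=4 is 0.0. At max_n=2 the precisions are 5/6 and 3/5, which gives sqrt(0.5), roughly 0.707. The library's own result may differ if it applies smoothing or tokenizes differently.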

rouge_n(candidate: str | list[str], reference: str | list[str], *, n: int = 1) -> float

ROUGE-N recall: fraction of reference n-grams found in candidate.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| candidate | str \| list[str] | Candidate text (string or token list). | required |
| reference | str \| list[str] | Reference text (string or token list). | required |
| n | int | N-gram order (default 1 for ROUGE-1). | 1 |

Returns:

| Type | Description |
| --- | --- |
| float | ROUGE-N recall in [0, 1]. 1.0 = all reference n-grams found. |

Examples:

>>> rouge_n("the cat sat on the mat", "the cat is on the mat", n=1)
...
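
As a rough illustration of "fraction of reference n-grams found in candidate", the sketch below counts clipped n-gram overlap and divides by the number of reference n-grams. The function name rouge_n_sketch, the whitespace tokenization, and the choice to return 0.0 for a reference with no n-grams are assumptions, and calibrax's implementation may handle these details differently.

```python
from collections import Counter


def rouge_n_sketch(candidate, reference, n=1):
    """Hypothetical ROUGE-N recall; tokenization by str.split is an assumption."""
    cand = candidate.split() if isinstance(candidate, str) else list(candidate)
    ref = reference.split() if isinstance(reference, str) else list(reference)

    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))

    total = sum(ref_counts.values())
    if total == 0:
        return 0.0  # assumed convention when the reference is shorter than n

    # Each reference n-gram is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total
```

For the example pair above, this sketch finds 5 of the 6 reference unigrams (recall 5/6) and 3 of the 5 reference bigrams (recall 0.6).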

rouge_l(candidate: str | list[str], reference: str | list[str]) -> float

ROUGE-L: longest common subsequence based F-measure.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| candidate | str \| list[str] | Candidate text (string or token list). | required |
| reference | str \| list[str] | Reference text (string or token list). | required |

Returns:

| Type | Description |
| --- | --- |
| float | ROUGE-L F-measure in [0, 1]. 1.0 = identical sequences. |

Examples:

>>> rouge_l("the cat sat on the mat", "the cat is on the mat")
...
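
The longest-common-subsequence idea behind ROUGE-L can be sketched with a small dynamic program. The version below assumes a balanced F1 (equal weight on precision and recall), which this page does not confirm. The names lcs_length and rouge_l_sketch and the whitespace tokenization are likewise hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_sketch(candidate, reference):
    """Hypothetical ROUGE-L F1; the beta=1 weighting is an assumption."""
    cand = candidate.split() if isinstance(candidate, str) else list(candidate)
    ref = reference.split() if isinstance(reference, str) else list(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)
```

For the example pair above, the longest common subsequence is "the cat on the mat" (5 tokens), so precision and recall are both 5/6 and this sketch returns 5/6.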

perplexity(log_probabilities: Any) -> Any

Perplexity from log-probabilities.

Computes exp(-mean(log_probabilities)). Lower perplexity means the model assigns higher probability to the observed tokens.

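Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| log_probabilities | Any | Array of log-probabilities from a language model. | required |

Returns:

| Type | Description |
| --- | --- |
| Any | Perplexity value (>= 1.0 for valid log-probabilities, which are all <= 0). |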

Examples:

>>> import jax.numpy as jnp
>>> perplexity(jnp.array([0.0, 0.0, 0.0]))  # Perfect model
1.0
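
To make the formula concrete, here is a pure-Python sketch of the same arithmetic, using math instead of JAX so it runs without extra dependencies. The name perplexity_sketch and the error on empty input are assumptions. The JAX form of the same expression would be jnp.exp(-jnp.mean(log_probabilities)).

```python
import math


def perplexity_sketch(log_probabilities):
    """Hypothetical helper: exp of the negative mean log-probability."""
    values = list(log_probabilities)
    if not values:
        raise ValueError("need at least one log-probability")  # assumed behavior
    return math.exp(-sum(values) / len(values))


# A model that assigns probability 0.25 to every observed token is as
# uncertain as a uniform choice among 4 options, so its perplexity is 4.
uniform_four = perplexity_sketch([math.log(0.25)] * 3)
```

Perplexity can be read as an effective branching factor: the value 4 here means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.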

distinct_n(tokens: list[str], *, n: int = 1) -> float

Distinct-N: ratio of unique n-grams to total n-grams.

Measures lexical diversity. Higher = more diverse vocabulary usage.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokens | list[str] | List of tokens. | required |
| n | int | N-gram order (default 1 for unigram diversity). | 1 |

Returns:

| Type | Description |
| --- | --- |
| float | Distinct-N ratio in [0, 1]. 1.0 = all n-grams unique. |

Examples:

>>> distinct_n(["the", "cat", "sat", "on"], n=1)
1.0
>>> distinct_n(["the", "the", "the"], n=1)
0.333...
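
The ratio of unique to total n-grams is simple enough to sketch in a few lines. The name distinct_n_sketch and the choice to return 0.0 when the input has fewer than n tokens are assumptions, and the library may treat that edge case differently.

```python
def distinct_n_sketch(tokens, n=1):
    """Hypothetical helper: unique n-grams divided by total n-grams."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0  # assumed convention for inputs shorter than n
    return len(set(grams)) / len(grams)


# Bigrams of a repetitive sequence: (a, b), (b, a), (a, b) -> 2 unique of 3.
repetitive_bigrams = distinct_n_sketch(["a", "b", "a", "b"], n=2)
```

Higher-order n-grams catch repeated phrases that unigram diversity misses: a text can use many different words yet still loop over the same bigrams.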

Plugin Metrics (Tier 1)

Optional Dependency

BERTScore requires pretrained BERT embeddings. Install the optional extra with: uv pip install "calibrax[text]"

Import directly from the plugin module:

from calibrax.metrics.plugins.text import BERTScoreMetric

See Stateful Metrics for the base class API.