calibrax.metrics.functional.text
Text evaluation metrics for translation, summarization, and generation. The Tier 0 functions are BLEU, ROUGE-N, ROUGE-L, perplexity, and distinct-N, all implemented in pure Python/JAX without external NLP libraries.
Every metric in this module is an n-gram counting or purely mathematical operation: no pretrained models, no transformers, no neural networks.
Registered with domain="text".
Note: BLEU/ROUGE/distinct_n operate on Python strings or token lists and are NOT JAX-traceable. Perplexity operates on JAX arrays.
bleu(candidate: str | list[str], references: list[str | list[str]], *, max_n: int = 4, weights: tuple[float, ...] | None = None) -> float
BLEU score for machine translation evaluation.
Computes modified n-gram precision for n=1..max_n with brevity penalty.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `candidate` | `str \| list[str]` | Candidate translation (string or token list). | required |
| `references` | `list[str \| list[str]]` | List of reference translations. | required |
| `max_n` | `int` | Maximum n-gram order (default 4 for BLEU-4). | `4` |
| `weights` | `tuple[float, ...] \| None` | Weights for each n-gram order. Default: uniform (1/max_n each). | `None` |
Returns:
| Type | Description |
|---|---|
| `float` | BLEU score in [0, 1]. 1.0 = perfect match. |
Examples:
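The description above (modified n-gram precision for n=1..max_n, combined with a brevity penalty) can be sketched in standalone Python. This is an illustrative reimplementation, not calibrax's own code: the name `bleu_sketch`, whitespace tokenization of strings, the absence of smoothing, and the use of the closest reference length for the brevity penalty are all assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu_sketch(candidate, references, max_n=4, weights=None):
    """Illustrative BLEU: clipped n-gram precision with a brevity penalty."""
    # Assumption: strings are tokenized on whitespace.
    cand = candidate.split() if isinstance(candidate, str) else list(candidate)
    refs = [r.split() if isinstance(r, str) else list(r) for r in references]
    if not cand or not refs:
        return 0.0
    if weights is None:
        weights = tuple(1.0 / max_n for _ in range(max_n))

    log_precision = 0.0
    for n, weight in zip(range(1, max_n + 1), weights):
        cand_counts = Counter(ngrams(cand, n))
        total = sum(cand_counts.values())
        # Clip each candidate count by its maximum count in any one reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precision += weight * math.log(clipped / total)

    # Brevity penalty against the reference length closest to the candidate.
    ref_len = min(
        (len(r) for r in refs),
        key=lambda length: (abs(length - len(cand)), length),
    )
    bp = 1.0 if len(cand) >= ref_len else math.exp(1.0 - ref_len / len(cand))
    return bp * math.exp(log_precision)
```

With calibrax installed, the equivalent call would presumably be `bleu(candidate, references)` following the signature above; exact scores may differ from this sketch if the library tokenizes or smooths differently.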
rouge_n(candidate: str | list[str], reference: str | list[str], *, n: int = 1) -> float
ROUGE-N recall: fraction of reference n-grams found in candidate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `candidate` | `str \| list[str]` | Candidate text (string or token list). | required |
| `reference` | `str \| list[str]` | Reference text (string or token list). | required |
| `n` | `int` | N-gram order (default 1 for ROUGE-1). | `1` |
Returns:
| Type | Description |
|---|---|
| `float` | ROUGE-N recall in [0, 1]. 1.0 = all reference n-grams found. |
Examples:
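The recall described above (the fraction of reference n-grams that also occur in the candidate) can be sketched as follows. This is a standalone illustration rather than the library's implementation; `rouge_n_sketch`, whitespace tokenization, and count clipping for repeated n-grams are assumptions.

```python
from collections import Counter


def rouge_n_sketch(candidate, reference, n=1):
    """Illustrative ROUGE-N recall over clipped n-gram counts."""
    # Assumption: strings are tokenized on whitespace.
    cand = candidate.split() if isinstance(candidate, str) else list(candidate)
    ref = reference.split() if isinstance(reference, str) else list(reference)

    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))

    total = sum(ref_counts.values())
    if total == 0:
        return 0.0  # reference is shorter than n tokens
    # A reference n-gram counts as found at most as often as the candidate has it.
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total
```

For the candidate `"the cat sat"` against the reference `"the cat sat on the mat"`, this sketch gives 3/6 for unigrams and 2/5 for bigrams.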
rouge_l(candidate: str | list[str], reference: str | list[str]) -> float
ROUGE-L: longest common subsequence based F-measure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `candidate` | `str \| list[str]` | Candidate text (string or token list). | required |
| `reference` | `str \| list[str]` | Reference text (string or token list). | required |
Returns:
| Type | Description |
|---|---|
| `float` | ROUGE-L F-measure in [0, 1]. 1.0 = identical sequences. |
Examples:
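An LCS-based F-measure of the kind described above can be sketched with a standard dynamic-programming table. The names `lcs_length` and `rouge_l_sketch` are hypothetical, and the balanced F1 (equal weight on precision and recall) is an assumption, since the source does not state which beta the library uses.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_sketch(candidate, reference):
    """Illustrative ROUGE-L as a balanced F1 over LCS precision and recall."""
    # Assumption: strings are tokenized on whitespace.
    cand = candidate.split() if isinstance(candidate, str) else list(candidate)
    ref = reference.split() if isinstance(reference, str) else list(reference)
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Unlike ROUGE-N, the subsequence does not need to be contiguous: `"the cat sat on the mat"` and `"the cat is on the mat"` share the five-token subsequence `the cat on the mat`, so this sketch scores them 5/6.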
perplexity(log_probabilities: Any) -> Any
Perplexity from log-probabilities.
Computes exp(-mean(log_probs)). Lower perplexity = better model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `log_probabilities` | `Any` | Array of log-probabilities from a language model. | required |
Returns:
| Type | Description |
|---|---|
| `Any` | Perplexity value >= 1.0. |
Examples:
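The formula exp(-mean(log_probs)) can be checked by hand with a small pure-Python sketch. The function name `perplexity_sketch` is hypothetical, and plain floats stand in for the JAX arrays the library operates on, so that the example has no dependencies.

```python
import math


def perplexity_sketch(log_probabilities):
    """Illustrative perplexity: exp of the negative mean log-probability."""
    values = [float(lp) for lp in log_probabilities]
    if not values:
        raise ValueError("need at least one log-probability")
    return math.exp(-sum(values) / len(values))


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among four options, so its perplexity is about 4.
uniform = [math.log(0.25)] * 3
result = perplexity_sketch(uniform)
```

On JAX arrays the same formula would presumably be written as `jnp.exp(-jnp.mean(log_probabilities))`. Because log-probabilities are at most 0, the result is never below 1.0.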
distinct_n(tokens: list[str], *, n: int = 1) -> float
Distinct-N: ratio of unique n-grams to total n-grams.
Measures lexical diversity. Higher = more diverse vocabulary usage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `list[str]` | List of tokens. | required |
| `n` | `int` | N-gram order (default 1 for unigram diversity). | `1` |
Returns:
| Type | Description |
|---|---|
| `float` | Distinct-N ratio in [0, 1]. 1.0 = all n-grams unique. |
Examples:
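The ratio of unique to total n-grams described above can be sketched in a few lines. This is a standalone illustration; `distinct_n_sketch` is a hypothetical name, and returning 0.0 when the input has fewer than n tokens is an assumption about edge-case handling.

```python
def distinct_n_sketch(tokens, n=1):
    """Illustrative distinct-N: unique n-grams divided by total n-grams."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0  # assumption: too-short input scores 0.0
    return len(set(grams)) / len(grams)
```

For the tokens `["a", "b", "a", "b"]`, this sketch gives 2/4 for unigrams (`a` and `b` out of four tokens) and 2/3 for bigrams (`(a, b)` appears twice among three bigrams).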
Plugin Metrics (Tier 1)
Optional Dependency
BERTScore requires pretrained BERT embeddings:
uv pip install "calibrax[text]"
Import directly from the plugin module:
See Stateful Metrics for the base class API.