Reading the report¶
Every run_benchmark call returns a
BenchmarkReport. It carries the full
per-seed history and exposes a small, opinionated set of views — each
designed to answer a different question.
Choosing the right view¶
Question |
Method |
Returns |
|---|---|---|
What’s the headline number per (task, model)? |
|
DataFrame: |
Are these two models significantly different on this task? |
|
DataFrame: per-pair |
Which model wins overall across many tasks? |
|
Demšar / Nemenyi CD diagram |
How do all tasks compare side by side? |
|
one row per task, models jittered within row |
Is the win on task X driven by an outlier seed? |
|
per-comparison CI whiskers |
How fast did each model converge? |
|
mean ± t-CI band, one panel per task |
I want raw seed values for my own analysis. |
|
nested dict |
Summary table¶
print(report.summary()) # default: Student's t CI
print(report.summary(method="bootstrap")) # percentile-bootstrap CI
Per-(task, model) DataFrame with n_seeds, mean, std, sem, ci_low, ci_high.
The metric is auto-detected from the per-epoch history; call
report.metric_name() to confirm. Set report.ci_method = "bootstrap"
once if you want every downstream plot/table to follow.
Pairwise tests¶
print(report.pairwise(alpha=0.05)) # default: paired t-test
print(report.pairwise(alpha=0.05, method="wilcoxon")) # non-parametric
Paired t-test (default) or Wilcoxon signed-rank test across seed-aligned
final metrics per task with Holm step-down adjustment. Columns: task, model_a, model_b, mean_diff, p_raw, p_holm, significant.
The method argument accepts "t" (parametric, default) or
"wilcoxon" (non-parametric). You can also set
report.pairwise_method = "wilcoxon" once to change the default for all
downstream calls. Wilcoxon is recommended at small seed counts where the
paired t-test’s normality assumption is most fragile (Benavoli et al.,
JMLR 17(5):1-36, 2016).
If every paired difference is exactly zero the signed-rank statistic is
undefined; the row reports NaN rather than an artificial p = 1.0.
Note
mean_diff = mean(model_a) − mean(model_b). For lower-is-better metrics
(MAE), a negative mean_diff means model_a is the better model.
Forest plot¶
fig, _ = report.plot_forest(ci=0.95, sort_within=True, band=True)
One row per task; models are jittered within the row. The figure scales with the number of tasks — adding more models widens the within-row jitter rather than adding new rows — so you can compare a dozen models on a dozen tasks in a single column-width figure.
Pairwise heatmap¶
fig, _ = report.plot_pairwise(layout="matrix", alpha=0.05)
One heatmap panel per task. The lower triangle shows
\(-\log_{10}(p_{\text{Holm}})\) (darker = more significant); the upper triangle
shows the signed mean difference. Switch to layout="list" for one row per
comparison with CI whiskers — better when you have only a few pairs.
Pass method="wilcoxon" to use the non-parametric p-values in the plot
(and optionally set report.pairwise_method = "wilcoxon" as a default).
Critical-difference diagram¶
fig, _ = report.plot_critical_difference(alpha=0.05)
A Friedman–Nemenyi diagram with average ranks across tasks and horizontal clique bars joining models within the Nemenyi critical difference. This is the canonical scalable view for multi-method × multi-dataset comparisons.
Tip
Needs ≥ 2 tasks and ≥ 2 models common to every task. With \(N \le 4\) tasks the diagram is informative for rank ordering but rarely strong enough to certify pairwise differences — fall back to the pairwise heatmap at that scale.
Learning curves¶
fig, _ = report.plot_learning_curves(ci=0.95, ylabel="Test accuracy")
One panel per task, sharing the y-axis, with the mean ± t-CI band across seeds. Useful for diagnosing under- vs over-training before you commit to a final epoch budget.
LaTeX exports¶
report.to_latex("results.tex", ci=0.95, bold_best=True)
report.pairwise_to_latex("pairwise.tex")
report.pairwise_to_latex("pairwise_wilcoxon.tex", method="wilcoxon")
to_latex writes a booktabs table with the row-best in bold green; ties
get an almond-cream shade. pairwise_to_latex writes the Holm-adjusted
pairwise comparison table with significance markers. Pass
method="wilcoxon" to emit the non-parametric variant.
Reading a single cell¶
finals = report.final_metrics() # dict[task][model] -> [v_seed_0, ...]
arr = finals["cora"]["GCN"] # ten seeds for GCN on Cora
mean = sum(arr) / len(arr)
final_metrics() is the canonical access path for downstream analyses
(custom plots, ad-hoc statistics, exporting to W&B / MLflow).
Persistence¶
BenchmarkReport is a plain dataclass; pickle it as-is to cache between
sessions:
import pickle
with open("report_social.pkl", "wb") as f:
pickle.dump(report, f)
See examples/experiment.py for a small driver that caches per-category
reports under examples/_artifacts/ and emits figures + LaTeX tables on
top of them.
Legacy mapping access¶
BenchmarkReport implements the read-only mapping protocol so older code
that expected a plain dict[task][model][history] still works — at the cost
of seeing only seed 0:
for task, per_task in report.items():
for model, history in per_task.items():
# history is the seed-0 per-epoch dict (legacy single-seed view)
...
This view is intentionally lossy. Use report.histories for the full
nested dict and report.final_metrics() for analysis-ready arrays.