Benchmark protocol
The benchmark is built around a single guarantee: every cell in the report
goes through the same pipeline. Whatever model and task you throw at
run_benchmark, the seeds are drawn the same way, the metrics are reduced
the same way, and the resulting BenchmarkReport
exposes the same set of statistical methods. This page walks through the
five stages of that pipeline and how to drive it.
The five stages
| # | Stage | What happens |
|---|---|---|
| 1 | Catalogue | The chosen |
| 2 | Encoders | Each model declares the task types it supports; incompatible (model, task) pairs are dropped before training. |
| 3 | Training | For every \((t, m, s)\) triple the runner reseeds Python |
| 4 | Statistics | Per-seed final metrics feed three reducers: per-cell mean ± CI, Holm-corrected paired t-tests (or Wilcoxon signed-rank) within each task, Friedman ranks + Nemenyi CD across tasks. |
| 5 | Report | Histories, summaries, and one-call exporters live on the |
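Stages 2 and 3 can be pictured as a filter followed by a loop over triples. The sketch below is illustrative only: the dictionaries, the `roads` task, and the `compatible_triples` helper are hypothetical stand-ins, not part of the graphnetz API.

```python
from itertools import product

# Hypothetical stand-ins: the real task and model objects are not shown on this page.
TASK_TYPES = {"cora": "node_cls", "citeseer": "node_cls", "roads": "link_pred"}
MODEL_SUPPORT = {"GCN": {"node_cls", "link_pred"}, "GAT": {"node_cls"}}
SEEDS = (0, 1, 2)


def compatible_triples(task_types, model_support, seeds):
    """Yield (task, model, seed) triples, skipping unsupported (model, task) pairs."""
    for task, model in product(task_types, model_support):
        if task_types[task] not in model_support[model]:
            continue  # stage 2: the pair is dropped before any training happens
        for seed in seeds:
            yield task, model, seed  # stage 3 would train once per triple


triples = list(compatible_triples(TASK_TYPES, MODEL_SUPPORT, SEEDS))
```

Here the `("roads", "GAT")` pair never reaches the training loop, so every remaining cell ends up with the same number of seeds.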
Running
The standard call:
from graphnetz import GAT, GCN, GraphSAGE, GraphTransformer, run_benchmark

report = run_benchmark(
    category="social",
    models={"GCN": GCN, "GAT": GAT, "GraphSAGE": GraphSAGE, "GraphTransformer": GraphTransformer},
    seeds=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    task_type="node_cls",       # restrict to one task family (optional)
    epochs=100,                 # override the per-task default (optional)
    only=["cora", "citeseer"],  # subset of tasks (optional)
)
A single class instead of a dict is also accepted — handy for one-off sanity checks:
run_benchmark("infrastructure", GAT, task_type="link_pred")
Tip
The first run of a real benchmark downloads and processes datasets into
data/benchmark/<category>/<task>/ (overridable with root=). Reruns
hit the on-disk cache, so iterating on plotting or report logic is fast.
The report object
Every method below operates on the same per-seed final_metrics() table; the
choice of method just decides which view to render.
- class graphnetz.benchmark.BenchmarkReport(seeds: tuple[int, ...], histories: dict[str, dict[str, list[dict[str, list[float]]]]], config: dict[str, typing.Any] = <factory>, ci_method: str = 't', bootstrap_n: int = 10000, bootstrap_seed: int = 0, pairwise_method: str = 't')
Bases: _ReportPlotsMixin
Structured outcome of a multi-seed benchmark run.
histories[task][model] is a list with one history dict per seed (in seed order). The report is also a read-only mapping task -> {model: history_seed_0} for backward compatibility with single-seed callers.
- ci_method: str = 't'
- bootstrap_n: int = 10000
- bootstrap_seed: int = 0
- pairwise_method: str = 't'
- items()
- keys()
- values()
- final_metrics(key: str | None = None) -> dict[str, dict[str, list[float]]]
Final metric value per (task, model, seed).
- summary(ci: float = 0.95, method: str | None = None) -> pandas.DataFrame
Per-(task, model) mean, std, sem, CI half-width and bounds.
method overrides self.ci_method for this call only; choose "t" for Student’s-t intervals (default) or "bootstrap" for percentile-bootstrap intervals (better for non-Gaussian metrics such as Hits@K, MRR, or AUC).
- pairwise(alpha: float = 0.05, method: str | None = None) -> pandas.DataFrame
Paired pairwise tests between models per task with Holm adjustment.
method overrides self.pairwise_method for this call only:
  - "t" (default): paired Student’s t-test on per-seed final metrics.
  - "wilcoxon": non-parametric Wilcoxon signed-rank test on the paired differences. Recommended at small seed counts where the paired t-test’s normality assumption is most fragile; see Benavoli et al., JMLR 17(5):1-36, 2016.
- friedman(alpha: float = 0.05) -> dict[str, float | int | bool]
Friedman omnibus test on per-task ranks of seed-mean metrics.
Returns a dict with the statistic chi2, the asymptotic \(\chi^2_{k-1}\) p-value, the rejection flag at alpha, and the \((k, N)\) shape used. The Nemenyi post-hoc surfaced in plot_critical_difference() should only be interpreted when rejected is true (Demšar, 2006).
Only models present in every task are included; per-task ranks use the metric direction (lower-is-better for val_mae and train_loss).
- to_latex(path: str | Path, *, ci: float = 0.95, bold_best: bool = True, pretty_tasks: Mapping[str, str] | None = None, caption: str | None = None, label: str | None = None, method: str | None = None) -> Path
Booktabs LaTeX table of mean ± CI half-width with bold-best per task.
method overrides self.ci_method ("t" or "bootstrap").
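To make the nested shapes concrete, the following sketch reduces a histories-style structure to a per-seed table with the same layout as the return type of final_metrics(). It assumes that "final" means the last recorded value of a metric; the page does not state this, and `final_from_histories` is a hypothetical helper, not the library's implementation.

```python
def final_from_histories(histories, key):
    """Map histories[task][model] (one history dict per seed) to task -> model -> per-seed values.

    Assumption: the final value is the last entry of the metric's list.
    """
    return {
        task: {
            model: [history[key][-1] for history in per_seed]
            for model, per_seed in by_model.items()
        }
        for task, by_model in histories.items()
    }


# Toy data: one task, two models, two seeds, two epochs each.
histories = {
    "cora": {
        "GCN": [{"val_mae": [0.60, 0.40]}, {"val_mae": [0.55, 0.42]}],
        "GAT": [{"val_mae": [0.58, 0.41]}, {"val_mae": [0.61, 0.39]}],
    }
}
table = final_from_histories(histories, "val_mae")
```

Each inner list is seed-aligned, which is what lets the paired tests compare models seed by seed.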
One-call publication artefacts
# Mean ± t-CI table; row-best in bold green, ties in almond cream
report.to_latex("results.tex", ci=0.95, bold_best=True)
# Holm-corrected pairwise test table
report.pairwise_to_latex("pairwise.tex")
# Per-task forest plot, models jittered within rows
fig, _ = report.plot_forest(ci=0.95)
# Pairwise significance heatmap (one panel per task)
fig, _ = report.plot_pairwise(layout="matrix")
# Demšar critical-difference diagram across tasks
fig, _ = report.plot_critical_difference(alpha=0.05)
For interactive analysis, see Reading the report.
Statistical guarantees
The library is opinionated about which tests are appropriate for which question. The entries below state each one explicitly so you can cite it without re-deriving it:
Per-cell CI (default Student’s t) over \(n\) seeds:
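In standard textbook notation, the two-sided Student's-t interval at confidence level \(1 - \alpha\) over per-seed values \(x_1, \dots, x_n\) takes the form below. The symbols \(\bar{x}\), \(s\), and \(\alpha\) are introduced here for illustration and may differ from the library's own notation.

\[
\bar{x} \;\pm\; t_{1-\alpha/2,\; n-1}\, \frac{s}{\sqrt{n}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2 .
\]

The second term is the CI half-width, and \(s / \sqrt{n}\) is the standard error of the mean.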
For non-Gaussian metrics (Hits@K, MRR, AUC on imbalanced splits), pass
method="bootstrap" to summary()
or set report.ci_method = "bootstrap".
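For intuition, a percentile-bootstrap interval for the mean can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `bootstrap_ci` and the sample values are hypothetical, and the defaults simply mirror the `bootstrap_n` and `bootstrap_seed` fields in the class signature.

```python
import random
import statistics


def bootstrap_ci(values, ci=0.95, n_resamples=10_000, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # local RNG so the result is repeatable
    n = len(values)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - ci) / 2 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 + ci) / 2 * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


per_seed_mrr = [0.41, 0.44, 0.39, 0.47, 0.42]  # made-up per-seed values
low, high = bootstrap_ci(per_seed_mrr)
```

Because the bounds are empirical quantiles of resampled means, they always stay within the range of the observed values, which is one reason this method suits bounded metrics.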
Holm pairwise. The default is a paired t-test per task across seed-aligned final metrics, then Holm step-down adjustment so the family-wise error rate is controlled:
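In standard notation (symbols introduced here for illustration), let \(d_i = x_{a,i} - x_{b,i}\) be the seed-aligned difference between models \(a\) and \(b\) on one task. The paired t statistic, with \(n - 1\) degrees of freedom, and the Holm-adjusted p-values over the \(m\) comparisons in a family, sorted so that \(p_{(1)} \le \dots \le p_{(m)}\), are:

\[
t = \frac{\bar{d}}{s_d / \sqrt{n}},
\qquad
\tilde{p}_{(i)} = \max_{j \le i} \, \min\!\bigl(1,\; (m - j + 1)\, p_{(j)}\bigr).
\]

Assuming the family is the set of model pairs within one task, \(m = k(k-1)/2\) for \(k\) models.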
For small seed counts (typically \(n < 10\)) where the normality assumption of
the paired t-test is fragile, pass method="wilcoxon" to
pairwise(),
pairwise_to_latex(), or
plot_pairwise(), or set
report.pairwise_method = "wilcoxon" to change the default for the whole
report (Benavoli et al., JMLR 17(5):1-36, 2016).
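Whichever test produces the raw p-values, the Holm step-down adjustment itself is short. The sketch below is a generic implementation for illustration; `holm_adjust` is a hypothetical name, not a graphnetz function.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The rank-th smallest p-value (0-based) is multiplied by (m - rank), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # made-up raw p-values for three model pairs
adj = holm_adjust(raw)    # approximately [0.03, 0.06, 0.06]
```

A comparison is declared significant when its adjusted p-value is at most alpha, which controls the family-wise error rate without assuming independence between tests.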
Friedman + Nemenyi. Average ranks across \(N\) tasks; clique bars in the CD diagram join models within the Nemenyi critical difference:
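Following the formulation in Demšar (2006), with \(r_{ij}\) the rank of model \(j\) on task \(i\), \(k\) models, and \(N\) tasks, the average rank, the Friedman statistic, and the Nemenyi critical difference are:

\[
R_j = \frac{1}{N}\sum_{i=1}^{N} r_{ij},
\qquad
\chi^2_F = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right],
\qquad
\mathrm{CD} = q_\alpha \sqrt{\frac{k(k+1)}{6N}} .
\]

Here \(q_\alpha\) is the critical value of the Studentized range statistic divided by \(\sqrt{2}\); two models are joined by a clique bar when their average ranks differ by less than \(\mathrm{CD}\).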
The CD diagram is the canonical scalable visualisation for multi-method, multi-dataset comparisons (Demšar, 2006); the implementation handles heterogeneous metric directions per-task before averaging ranks.
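A minimal sketch of direction-aware rank averaging and the critical difference follows. The helper names, the `task_c` regression task, and the scores are hypothetical; the value 2.343 is the \(q_{0.05}\) entry for three models from the table in Demšar (2006).

```python
import math


def average_ranks(scores, lower_is_better):
    """scores[task][model] -> mean rank per model (rank 1 = best, ties share the mean rank)."""
    models = sorted(next(iter(scores.values())))
    totals = dict.fromkeys(models, 0.0)
    for task, by_model in scores.items():
        sign = 1.0 if lower_is_better[task] else -1.0  # flip so best sorts first
        ordered = sorted(models, key=lambda m: sign * by_model[m])
        i = 0
        while i < len(ordered):
            j = i
            while j + 1 < len(ordered) and by_model[ordered[j + 1]] == by_model[ordered[i]]:
                j += 1  # extend the run of tied scores
            shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
            for m in ordered[i : j + 1]:
                totals[m] += shared
            i = j + 1
    return {m: totals[m] / len(scores) for m in models}


def nemenyi_cd(q_alpha, k, n_tasks):
    """Nemenyi critical difference for k models compared over n_tasks tasks."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_tasks))


scores = {
    "cora": {"GCN": 0.81, "GAT": 0.83, "GraphSAGE": 0.80},      # higher is better
    "citeseer": {"GCN": 0.70, "GAT": 0.72, "GraphSAGE": 0.71},  # higher is better
    "task_c": {"GCN": 0.50, "GAT": 0.40, "GraphSAGE": 0.45},    # an error metric, lower is better
}
direction = {"cora": False, "citeseer": False, "task_c": True}
ranks = average_ranks(scores, direction)
cd = nemenyi_cd(2.343, k=3, n_tasks=len(scores))
```

With only three tasks the critical difference is large relative to the spread of average ranks, which illustrates why the CD diagram becomes informative only as the number of tasks grows.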
Reproducibility
run_benchmark reseeds every RNG the training code reaches —
random.seed, numpy.random.seed, torch.manual_seed, and
torch.cuda.manual_seed_all — before each (task, model, seed) triple.
Combinatorial loaders thread the seed through to their data generator, so
cross-seed variance reflects both model initialisation and data
resampling, not only the former.
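The reseeding step can be sketched as a single helper that touches each RNG named above. This is an illustration under assumptions, not the library's code: `reseed_everything` is a hypothetical name, and NumPy and PyTorch are treated as optional so the snippet runs even where they are not installed.

```python
import random


def reseed_everything(seed):
    """Reseed Python, NumPy, and PyTorch RNGs; missing libraries are skipped."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call when no GPU is present
    except ImportError:
        pass


# Reseeding with the same value replays the same random stream.
reseed_everything(0)
first = random.random()
reseed_everything(0)
second = random.random()
```

Calling such a helper before each (task, model, seed) triple, rather than once per run, keeps a cell's result independent of which other cells ran before it.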
A run with the same seed list and software stack reproduces bit-for-bit on the same hardware.