Benchmark protocol
The benchmark is built around a single guarantee: every cell in the report
goes through the same pipeline. Whatever model and task you throw at
run_benchmark, the seeds are drawn the same way, the metrics are reduced
the same way, and the resulting BenchmarkReport
exposes the same set of statistical methods. This page walks through the
five stages of that pipeline and how to drive it.
The five stages
| # | Stage | What happens |
|---|---|---|
| 1 | Catalogue | The chosen … |
| 2 | Encoders | Each model declares the task types it supports; incompatible (model, task) pairs are dropped before training. |
| 3 | Training | For every \((t, m, s)\) triple the runner reseeds Python … |
| 4 | Statistics | Per-seed final metrics feed three reducers: per-cell mean ± CI, Holm-corrected paired t-tests (or Wilcoxon signed-rank) within each task, Friedman ranks + Nemenyi CD across tasks. |
| 5 | Report | Histories, summaries, and one-call exporters live on the … |
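Stages 2 and 3 can be pictured as a filter followed by a loop over triples. The sketch below is illustrative only: `supported_task_types`, `plan_cells`, and the toy classes are hypothetical names chosen for this example, not graphnetz API.

```python
from itertools import product


class ToyGCN:
    # Hypothetical attribute: the task types this model declares support for.
    supported_task_types = {"node_cls", "link_pred"}


class ToyGraphPooler:
    supported_task_types = {"graph_cls"}


def plan_cells(tasks, models, seeds):
    """Enumerate the (task, model, seed) triples that would be trained.

    `tasks` maps task name -> task type; `models` maps model name -> class.
    Incompatible (model, task) pairs are dropped before seeds are expanded.
    """
    cells = []
    for (task, task_type), (name, cls) in product(tasks.items(), models.items()):
        if task_type not in cls.supported_task_types:
            continue  # stage 2: drop the incompatible pair
        cells.extend((task, name, seed) for seed in seeds)  # stage 3 triples
    return cells


cells = plan_cells(
    tasks={"cora": "node_cls", "citeseer": "node_cls"},
    models={"GCN": ToyGCN, "Pooler": ToyGraphPooler},
    seeds=(0, 1, 2),
)
print(len(cells))  # 2 tasks x 1 compatible model x 3 seeds
```

Filtering before the seed loop means an unsupported pair costs nothing at training time and never appears as an empty cell in the report.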
Running
The standard call:
from graphnetz import GAT, GCN, GraphSAGE, GraphTransformer, run_benchmark

report = run_benchmark(
    category="social",
    models={"GCN": GCN, "GAT": GAT, "GraphSAGE": GraphSAGE, "GraphTransformer": GraphTransformer},
    seeds=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    task_type="node_cls",       # restrict to one task family (optional)
    epochs=100,                 # override the per-task default (optional)
    only=["cora", "citeseer"],  # subset of tasks (optional)
)
A single class instead of a dict is also accepted — handy for one-off sanity checks:
run_benchmark("infrastructure", GAT, task_type="link_pred")
Tip
The first run of a real benchmark downloads + processes datasets to
data/benchmark/<category>/<task>/ (overridable with root=). Reruns
hit the on-disk cache, so iteration on plotting / report logic is fast.
The report object
Every method below operates on the same per-seed final_metrics() table; the
choice of method just decides which view to render.
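To make the shape of that table concrete, here is a minimal sketch of reducing nested histories to per-seed final values. It assumes "final" means the last recorded value of a metric in each seed's history; `reduce_to_final` and the toy numbers are hypothetical and are not the library's implementation.

```python
def reduce_to_final(histories, key):
    """Reduce histories[task][model] (one history dict per seed) to final values.

    Assumption: the final metric is the last entry of `key` in each history.
    Returns {task: {model: [value_seed_0, value_seed_1, ...]}}.
    """
    return {
        task: {
            model: [float(history[key][-1]) for history in per_seed]
            for model, per_seed in per_model.items()
        }
        for task, per_model in histories.items()
    }


# Toy histories: one task, two models, two seeds, three epochs each.
histories = {
    "cora": {
        "GCN": [{"val_mae": [0.9, 0.7, 0.5]}, {"val_mae": [0.8, 0.6, 0.4]}],
        "GAT": [{"val_mae": [0.9, 0.8, 0.6]}, {"val_mae": [0.7, 0.5, 0.3]}],
    }
}
table = reduce_to_final(histories, "val_mae")
print(table["cora"]["GCN"])  # [0.5, 0.4]
```

Because the seed order is preserved in every list, the same index in two models' lists refers to the same seed, which is what the paired tests rely on.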
- class graphnetz.benchmark.BenchmarkReport(seeds: tuple[int, ...], histories: dict[str, dict[str, list[dict[str, list[float]]]]], config: dict[str, typing.Any] = <factory>, ci_method: str = 't', bootstrap_n: int = 10000, bootstrap_seed: int = 0, pairwise_method: str = 't')
Bases: object
Structured outcome of a multi-seed benchmark run.
histories[task][model] is a list with one history dict per seed (in seed order). The report is also a read-only mapping task -> {model: history_seed_0} for backward compatibility with single-seed callers.
- ci_method: str = 't'
- bootstrap_n: int = 10000
- bootstrap_seed: int = 0
- pairwise_method: str = 't'
- items()
- keys()
- values()
- final_metrics(key: str | None = None) -> dict[str, dict[str, list[float]]]
Final metric value per (task, model, seed).
- summary(ci: float = 0.95, method: str | None = None) -> pandas.DataFrame
Per-(task, model) mean, std, sem, CI half-width and bounds.
method overrides self.ci_method for this call only; choose "t" for Student’s-t intervals (default) or "bootstrap" for percentile-bootstrap intervals (better for non-Gaussian metrics such as Hits@K, MRR, or AUC).
- pairwise(alpha: float = 0.05, method: str | None = None) -> pandas.DataFrame
Paired pairwise tests between models per task with Holm adjustment.
method overrides self.pairwise_method for this call only:
  - "t" (default) – paired Student’s t-test on per-seed final metrics.
  - "wilcoxon" – non-parametric Wilcoxon signed-rank test on the paired differences. Recommended at small seed counts where the paired t-test’s normality assumption is most fragile; see Benavoli et al., JMLR 17(5):1-36, 2016.
- friedman(alpha: float = 0.05) -> dict[str, float | int | bool]
Friedman omnibus test on per-task ranks of seed-mean metrics.
Returns a dict with the statistic chi2, the asymptotic \(\chi^2_{k-1}\) p-value, the rejection flag at alpha, and the \((k, N)\) shape used. The Nemenyi post-hoc surfaced in plot_critical_difference() should only be interpreted when rejected is true (Demšar, 2006).
Only models present in every task are included; per-task ranks use the metric direction (lower-is-better for val_mae and train_loss).
- to_latex(path: str | Path, *, ci: float = 0.95, bold_best: bool = True, pretty_tasks: Mapping[str, str] | None = None, caption: str | None = None, label: str | None = None, method: str | None = None) -> Path
Booktabs LaTeX table of mean ± CI half-width with bold-best per task.
method overrides self.ci_method ("t" or "bootstrap").
- pairwise_to_latex(path: str | Path, *, alpha: float = 0.05, caption: str | None = None, label: str | None = None, method: str | None = None) -> Path
LaTeX booktabs table of pairwise Holm-adjusted p-values.
method overrides self.pairwise_method ("t" or "wilcoxon") for this call only.
- plot(ax: plt.Axes | None = None, *, ci: float = 0.95, ylabel: str | None = None, title: str | None = None, annotate: bool = True, pretty_tasks: Mapping[str, str] | None = None) -> tuple[plt.Figure, plt.Axes]
Grouped bar chart of mean ± CI half-width across seeds.
- plot_forest(ax: plt.Axes | None = None, *, ci: float = 0.95, pretty_tasks: Mapping[str, str] | None = None, xlabel: str | None = None, height_per_task: float = 0.42, sort_within: bool = False, band: bool = True) -> tuple[plt.Figure, plt.Axes]
Forest plot, one row per task with models jittered within the row.
Height scales with the number of tasks only – adding more models widens the within-row jitter rather than adding new rows – so the figure stays compact for many models.
sort_within=True orders the jittered positions per-task so the best mean lands at the top of the row (helps spot leaders when there are many models). Each model keeps a stable colour across tasks.
band=True shades alternating task rows (banded reading aid).
- plot_pairwise(ax: plt.Axes | None = None, *, ci: float = 0.95, alpha: float = 0.05, pretty_tasks: Mapping[str, str] | None = None, layout: str = 'matrix', max_cols: int = 3, method: str | None = None) -> tuple[plt.Figure, Any]
Pairwise comparison plot, with two layouts that scale differently.
  - layout="matrix" (default) – one significance heatmap per task, with the lower triangle holding \(-\log_{10}(p_{\text{Holm}})\) and the upper triangle holding the signed mean difference. Scales to many models and many tasks (panels arranged in a grid with at most max_cols columns).
  - layout="list" – one row per pairwise comparison with CI whiskers and a significance marker. Best for small numbers of comparisons.
method overrides self.pairwise_method ("t" or "wilcoxon") for this call only.
- plot_critical_difference(*, alpha: float = 0.05, title: str | None = None) -> tuple[matplotlib.pyplot.Figure, matplotlib.pyplot.Axes]
Demšar critical-difference (CD) diagram.
Computes mean ranks of every model across tasks and overlays the Nemenyi critical difference at level alpha. Models within CD of each other are joined by a thick horizontal “clique” bar (i.e., not significantly different). This is the canonical scalable visualization for multi-method, multi-dataset benchmarks (Demšar, 2006).
Only models present in every task are included. Requires at least two tasks and at least two such models.
- plot_learning_curves(*, ci: float = 0.95, metric_key: str | None = None, pretty_tasks: Mapping[str, str] | None = None, ylabel: str = 'Test accuracy', legend_loc: str = 'lower right') -> tuple[matplotlib.pyplot.Figure, numpy.ndarray]
Mean ± t-CI learning curves, one panel per task, sharing y-axis.
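The advice to read the Nemenyi post-hoc only after the Friedman test rejects can be wrapped in a small guard. This is a sketch under one assumption: that the rejection flag is stored under the dict key `"rejected"`, a name inferred from the description of friedman() and not confirmed here. `StubReport` is a hypothetical stand-in so the example runs without graphnetz installed.

```python
def critical_difference_if_warranted(report, alpha=0.05):
    """Draw the CD diagram only when the Friedman omnibus test rejects."""
    result = report.friedman(alpha=alpha)
    # Assumption: the rejection flag lives under the key "rejected".
    if not result["rejected"]:
        return None
    return report.plot_critical_difference(alpha=alpha)


class StubReport:
    """Stand-in exposing the two documented methods used above."""

    def __init__(self, rejected):
        self._rejected = rejected

    def friedman(self, alpha=0.05):
        return {"rejected": self._rejected}

    def plot_critical_difference(self, *, alpha=0.05, title=None):
        return ("figure", "axes")  # placeholder for (Figure, Axes)


print(critical_difference_if_warranted(StubReport(rejected=False)))  # None
```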
One-call publication artefacts
# Mean ± t-CI table; row-best in bold green, ties in almond cream
report.to_latex("results.tex", ci=0.95, bold_best=True)
# Holm-corrected pairwise test table
report.pairwise_to_latex("pairwise.tex")
# Per-task forest plot, models jittered within rows
fig, _ = report.plot_forest(ci=0.95)
# Pairwise significance heatmap (one panel per task)
fig, _ = report.plot_pairwise(layout="matrix")
# Demšar critical-difference diagram across tasks
fig, _ = report.plot_critical_difference(alpha=0.05)
For interactive analysis, see Reading the report.
Statistical guarantees
The library is opinionated about which tests are appropriate for which question. Each one is stated explicitly below so you can cite it without re-deriving:
Per-cell CI (default Student’s t) over \(n\) seeds:
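Assuming the interval takes the textbook Student’s-t form, with \(x_1, \dots, x_n\) the per-seed final metrics of one (task, model) cell and \(1 - \alpha\) the confidence level (the ci argument), it reads:

\[
\bar{x} \pm t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
\]

Here \(t_{1-\alpha/2,\,n-1}\) is the \(1-\alpha/2\) quantile of the t distribution with \(n-1\) degrees of freedom.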
For non-Gaussian metrics (Hits@K, MRR, AUC on imbalanced splits), pass
method="bootstrap" to summary()
or set report.ci_method = "bootstrap".
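To illustrate the difference between the two interval types, here is a standard-library sketch of a Student’s-t interval next to a percentile-bootstrap interval for the mean. It is not the library’s code: the function names, the nearest-rank percentile rule, and the toy Hits@10 values are assumptions made for this example, and the t critical value 2.776 (95%, 4 degrees of freedom) is hard-coded to avoid a SciPy dependency.

```python
import random
import statistics


def t_interval(values, t_crit):
    """Student's-t interval for the mean: mean +/- t_crit * s / sqrt(n)."""
    n = len(values)
    mean = statistics.fmean(values)
    half_width = t_crit * statistics.stdev(values) / n ** 0.5
    return mean - half_width, mean + half_width


def bootstrap_interval(values, ci=0.95, n_resamples=10_000, seed=0):
    """Percentile-bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - ci) / 2.0
    # Nearest-rank percentiles of the resampled means.
    lower = means[int(tail * (n_resamples - 1))]
    upper = means[int((1.0 - tail) * (n_resamples - 1))]
    return lower, upper


hits_at_10 = [0.61, 0.64, 0.58, 0.66, 0.60]  # toy per-seed values, 5 seeds
t_lo, t_hi = t_interval(hits_at_10, t_crit=2.776)
b_lo, b_hi = bootstrap_interval(hits_at_10)
print(f"t: [{t_lo:.3f}, {t_hi:.3f}]  bootstrap: [{b_lo:.3f}, {b_hi:.3f}]")
```

The bootstrap interval cannot extend beyond the observed minimum and maximum, which is one reason it behaves better for bounded metrics than a symmetric t interval.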
Holm pairwise. The default is a paired t-test per task across seed-aligned final metrics, then Holm step-down adjustment so the family-wise error rate is controlled:
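In standard notation, and assuming the usual definitions, the paired statistic for models \(A\) and \(B\) over \(n\) seeds and the Holm-adjusted p-values over the \(m\) model pairs of one task are:

\[
t = \frac{\bar{d}}{s_d / \sqrt{n}},
\qquad
d_i = x_i^{A} - x_i^{B},
\qquad
\tilde{p}_{(i)} = \max_{j \le i} \, \min\!\bigl(1,\ (m - j + 1)\, p_{(j)}\bigr)
\]

Here \(\bar{d}\) and \(s_d\) are the mean and sample standard deviation of the per-seed differences, \(t\) is referred to a t distribution with \(n-1\) degrees of freedom, and \(p_{(1)} \le \dots \le p_{(m)}\) are the raw p-values in ascending order.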
For small seed counts (typically \(n < 10\)) where the normality assumption of
the paired t-test is fragile, pass method="wilcoxon" to
pairwise(),
pairwise_to_latex(), or
plot_pairwise(), or set
report.pairwise_method = "wilcoxon" to change the default for the whole
report (Benavoli et al., JMLR 17(5):1-36, 2016).
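Whichever paired test produces the raw p-values, the Holm step-down adjustment itself is short. The following is a plain-Python sketch of the standard procedure, not the library’s implementation; `holm_adjust` and the example p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending raw p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Toy raw p-values for the three model pairs within one task.
pairs = ["GCN vs GAT", "GCN vs GraphSAGE", "GAT vs GraphSAGE"]
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
for pair, p in zip(pairs, adjusted):
    print(f"{pair}: {p:.2f}")
```

A pair is declared different at level alpha when its adjusted p-value is at most alpha, which controls the family-wise error rate across all pairs in the task.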
Friedman + Nemenyi. Average ranks across \(N\) tasks; clique bars in the CD diagram join models within the Nemenyi critical difference:
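Following Demšar (2006), with \(k\) models, \(N\) tasks, and \(R_j\) the average rank of model \(j\) across tasks, the Friedman statistic and the Nemenyi critical difference are usually written as:

\[
\chi^2_F = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right],
\qquad
\mathrm{CD} = q_\alpha \sqrt{\frac{k(k+1)}{6N}}
\]

\(\chi^2_F\) is compared against a \(\chi^2_{k-1}\) distribution, and \(q_\alpha\) is the Studentized-range critical value divided by \(\sqrt{2}\). Two models whose average ranks differ by less than \(\mathrm{CD}\) are not distinguished at level \(\alpha\).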
The CD diagram is the canonical scalable visualisation for multi-method, multi-dataset comparisons (Demšar, 2006); the implementation handles heterogeneous metric directions per-task before averaging ranks.
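A minimal sketch of direction-aware ranking and the critical difference follows. It is an illustration under stated assumptions, not the library’s code: the function names and toy scores are hypothetical, ties share the average rank, and only the \(\alpha = 0.05\) critical values for 2 to 5 models from Demšar (2006) are included.

```python
from math import sqrt

# Nemenyi critical values q_alpha at alpha = 0.05 for k = 2..5 models (Demsar, 2006).
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def ranks(scores, higher_is_better=True):
    """Rank models within one task (1 = best); ties share the average rank."""
    sign = -1.0 if higher_is_better else 1.0
    ordered = sorted(scores, key=lambda name: sign * scores[name])
    out = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # average of positions i+1 .. j+1
        for name in ordered[i : j + 1]:
            out[name] = shared
        i = j + 1
    return out


def mean_ranks(per_task):
    """`per_task` is a list of (scores, higher_is_better) pairs, one per task."""
    totals = {}
    for scores, higher_is_better in per_task:
        for name, r in ranks(scores, higher_is_better).items():
            totals[name] = totals.get(name, 0.0) + r
    return {name: total / len(per_task) for name, total in totals.items()}


def nemenyi_cd(k, n_tasks):
    """Critical difference CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    return Q_ALPHA_05[k] * sqrt(k * (k + 1) / (6.0 * n_tasks))


per_task = [
    ({"A": 0.80, "B": 0.75, "C": 0.70}, True),   # accuracy: higher is better
    ({"A": 0.30, "B": 0.20, "C": 0.40}, False),  # MAE: lower is better
    ({"A": 0.90, "B": 0.90, "C": 0.50}, True),   # A and B tie for first
    ({"A": 0.60, "B": 0.50, "C": 0.40}, True),
]
avg = mean_ranks(per_task)
cd = nemenyi_cd(k=3, n_tasks=4)
print(avg, round(cd, 3))
```

Ranking inside each task before averaging is what lets accuracy-style and error-style metrics share one diagram: only the ordering within a task carries over, never the raw scale.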
Reproducibility
run_benchmark reseeds every RNG the training code reaches —
random.seed, numpy.random.seed, torch.manual_seed, and
torch.cuda.manual_seed_all — before each (task, model, seed) triple.
Combinatorial loaders thread the seed through to their data generator, so
cross-seed variance reflects both model initialisation and data
resampling, not only the former.
A run with the same seed list and software stack reproduces bit-for-bit on the same hardware.
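The reseeding step can be sketched as a single helper that touches each of the four RNG entry points named above. `reseed_everything` is a hypothetical name, and the guarded imports are an assumption made so the sketch also runs where NumPy or PyTorch is not installed; how run_benchmark organises this internally is not shown here.

```python
import random


def reseed_everything(seed):
    """Reseed every RNG this sketch can reach; missing libraries are skipped."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    except ImportError:
        pass


# Calling the helper before each (task, model, seed) triple restores the same state.
reseed_everything(3)
first = [random.random() for _ in range(3)]
reseed_everything(3)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

On GPUs, seeding alone may not be enough for bit-identical results, because some CUDA kernels are nondeterministic; PyTorch offers torch.use_deterministic_algorithms(True) for that case. Whether run_benchmark enables it is not stated on this page.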