Benchmark protocol

The benchmark is built around a single guarantee: every cell in the report goes through the same pipeline. Whatever model and task you throw at run_benchmark, the seeds are drawn the same way, the metrics are reduced the same way, and the resulting BenchmarkReport exposes the same set of statistical methods. This page walks through the five stages of that pipeline and how to drive it.

The five stages

| # | Stage | What happens |
|---|-------|--------------|
| 1 | Catalogue | The chosen category is mapped to a list of curated tasks via graphnetz.benchmark.BENCHMARK_TASKS. |
| 2 | Encoders | Each model declares the task types it supports; incompatible (model, task) pairs are dropped before training. |
| 3 | Training | For every \((t, m, s)\) triple (task, model, seed) the runner reseeds Python random, NumPy, Torch CPU, and Torch CUDA, then trains for \(E\) epochs through the appropriate trainer. |
| 4 | Statistics | Per-seed final metrics feed three reducers: per-cell mean ± CI, Holm-corrected paired t-tests (or Wilcoxon signed-rank tests) within each task, and Friedman ranks with the Nemenyi CD across tasks. |
| 5 | Report | Histories, summaries, and one-call exporters live on the BenchmarkReport. |
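As a rough illustration of stages 2 and 3, the sketch below enumerates the (task, model, seed) triples that remain once incompatible pairs are dropped. The task names, `TASK_TYPES`, `SUPPORTED`, and `plan_runs` are hypothetical stand-ins chosen for this example; they are not graphnetz objects, and the real runner may order or filter its work differently.

```python
from itertools import product

# Hypothetical stand-ins for the task catalogue and the per-model declarations.
TASK_TYPES = {"cora": "node_cls", "citeseer": "node_cls", "roads": "link_pred"}
SUPPORTED = {"GCN": {"node_cls"}, "GAT": {"node_cls", "link_pred"}}


def plan_runs(task_types, supported, seeds):
    """Return every compatible (task, model, seed) triple in a fixed order."""
    runs = []
    for (task, task_type), (model, types) in product(task_types.items(), supported.items()):
        if task_type not in types:
            continue  # incompatible pair: dropped before any training happens
        runs.extend((task, model, seed) for seed in seeds)
    return runs


runs = plan_runs(TASK_TYPES, SUPPORTED, seeds=(0, 1, 2))
print(len(runs))  # number of compatible pairs times number of seeds
```

Because every surviving pair is crossed with the same seed tuple, each cell of the report ends up with the same number of runs, which is what makes the paired tests in stage 4 possible.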

Running

The standard call:

from graphnetz import GAT, GCN, GraphSAGE, GraphTransformer, run_benchmark

report = run_benchmark(
    category="social",
    models={"GCN": GCN, "GAT": GAT, "GraphSAGE": GraphSAGE, "GraphTransformer": GraphTransformer},
    seeds=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    task_type="node_cls",       # restrict to one task family (optional)
    epochs=100,                 # override the per-task default (optional)
    only=["cora", "citeseer"],  # subset of tasks (optional)
)

A single class instead of a dict is also accepted — handy for one-off sanity checks:

run_benchmark("infrastructure", GAT, task_type="link_pred")

Tip

The first run of a real benchmark downloads and processes the datasets into data/benchmark/<category>/<task>/ (overridable with root=). Reruns hit the on-disk cache, so iterating on plotting or report logic is fast.

The report object

Every method below operates on the same per-seed final_metrics() table; the choice of method just decides which view to render.

class graphnetz.benchmark.BenchmarkReport(seeds: tuple[int, ...], histories: dict[str, dict[str, list[dict[str, list[float]]]]], config: dict[str, Any] = <factory>, ci_method: str = 't', bootstrap_n: int = 10000, bootstrap_seed: int = 0, pairwise_method: str = 't')

Bases: object

Structured outcome of a multi-seed benchmark run.

histories[task][model] is a list with one history dict per seed (in seed order). The report is also a read-only mapping task -> {model: history_seed_0} for backward compatibility with single-seed callers.
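To make the nesting concrete, here is a small hand-built example of that layout. The numbers are made up, the `val_acc` key and the `final_values` helper are hypothetical, and treating the last recorded value as the "final" metric is an assumption of this sketch rather than a statement about how final_metrics() is implemented.

```python
# Hypothetical miniature of the layout: histories[task][model] holds one dict per seed.
seeds = (0, 1)
histories = {
    "cora": {
        "GCN": [
            {"train_loss": [1.2, 0.7, 0.5], "val_acc": [0.60, 0.74, 0.79]},  # seed 0
            {"train_loss": [1.3, 0.8, 0.6], "val_acc": [0.58, 0.71, 0.77]},  # seed 1
        ],
    },
}


def final_values(histories, key):
    """Collect the last recorded value of `key` for every (task, model, seed)."""
    return {
        task: {model: [run[key][-1] for run in runs] for model, runs in models.items()}
        for task, models in histories.items()
    }


finals = final_values(histories, "val_acc")
print(finals["cora"]["GCN"])  # one value per seed, in seed order

# Single-seed view: keep only the first seed's history for each model.
first_seed_view = {
    task: {model: runs[0] for model, runs in models.items()}
    for task, models in histories.items()
}
```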

seeds: tuple[int, ...]
histories: dict[str, dict[str, list[dict[str, list[float]]]]]
config: dict[str, Any]
ci_method: str = 't'
bootstrap_n: int = 10000
bootstrap_seed: int = 0
pairwise_method: str = 't'
items()
keys()
values()
final_metrics(key: str | None = None) -> dict[str, dict[str, list[float]]]

Final metric value per (task, model, seed).

metric_name() -> str
summary(ci: float = 0.95, method: str | None = None) -> pandas.DataFrame

Per-(task, model) mean, std, sem, CI half-width and bounds.

method overrides self.ci_method for this call only; choose "t" for Student’s-t intervals (default) or "bootstrap" for percentile-bootstrap intervals (better for non-Gaussian metrics such as Hits@K, MRR, or AUC).

pairwise(alpha: float = 0.05, method: str | None = None) -> pandas.DataFrame

Paired pairwise tests between models per task with Holm adjustment.

method overrides self.pairwise_method for this call only:

  • "t" (default) – paired Student’s t-test on per-seed final metrics.

  • "wilcoxon" – non-parametric Wilcoxon signed-rank test on the paired differences. Recommended at small seed counts where the paired t-test’s normality assumption is most fragile; see Benavoli et al., JMLR 17(5):1-36, 2016.
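The practical limit of the signed-rank test at small seed counts is easy to see by enumerating its exact null distribution. The sketch below is a minimal stand-alone version, assuming no zero differences and no tied absolute differences; `wilcoxon_exact` is a hypothetical helper written for this page, not the library's implementation, and in practice a routine such as scipy.stats.wilcoxon would normally be used instead.

```python
from itertools import product


def wilcoxon_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples.

    Assumes no zero differences and no tied absolute differences.
    Returns (W+, p) where W+ is the rank sum of the positive differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null every sign pattern is equally likely: enumerate all 2^n.
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return w_plus, hits / 2 ** len(ranks)


# Made-up per-seed metrics for two models over five shared seeds.
model_a = [0.81, 0.83, 0.80, 0.84, 0.82]
model_b = [0.79, 0.80, 0.79, 0.80, 0.77]
w_plus, p_value = wilcoxon_exact(model_a, model_b)
print(w_plus, p_value)
```

Here model_a wins on every seed, which is the most extreme outcome possible, yet the two-sided p-value is 2/32 = 0.0625. With five seeds the exact test can therefore never reject at alpha = 0.05, even before Holm adjustment; with ten seeds the smallest attainable value is 2/1024.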

friedman(alpha: float = 0.05) -> dict[str, float | int | bool]

Friedman omnibus test on per-task ranks of seed-mean metrics.

Returns a dict with the statistic chi2, the asymptotic \(\chi^2_{k-1}\) p-value, the rejection flag at alpha, and the \((k, N)\) shape used. The Nemenyi post-hoc surfaced in plot_critical_difference() should only be interpreted when rejected is true (Demšar, 2006).

Only models present in every task are included; per-task ranks use the metric direction (lower-is-better for val_mae and train_loss).
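A minimal sketch of the computation behind such a test is shown below, using made-up seed-mean scores for three models on four tasks. `rank_row` and `friedman_chi2` are hypothetical helpers, the statistic is the textbook form without a tie correction, and the p-value step (the upper tail of a chi-squared distribution with k - 1 degrees of freedom) is left out to keep the example dependency-free.

```python
def rank_row(values, higher_is_better=True):
    """Rank one task's scores, 1 = best; tied scores share the average rank."""
    keyed = [(-v if higher_is_better else v) for v in values]
    order = sorted(range(len(keyed)), key=keyed.__getitem__)
    ranks = [0.0] * len(keyed)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and keyed[order[j + 1]] == keyed[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_chi2(rank_rows):
    """Friedman statistic from an N x k table of per-task ranks."""
    n, k = len(rank_rows), len(rank_rows[0])
    mean_ranks = [sum(row[j] for row in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)
    return chi2, mean_ranks


# Columns: GCN, GAT, GraphSAGE. The last task is scored by an error metric,
# so its ranks are computed with lower-is-better.
rank_rows = [
    rank_row([0.80, 0.82, 0.78]),
    rank_row([0.70, 0.75, 0.72]),
    rank_row([0.90, 0.91, 0.85]),
    rank_row([0.30, 0.25, 0.35], higher_is_better=False),
]
chi2, mean_ranks = friedman_chi2(rank_rows)
print(chi2, mean_ranks)
```

Flipping the direction per task before ranking is what lets accuracy-style and error-style tasks contribute to the same mean ranks.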

to_latex(path: str | Path, *, ci: float = 0.95, bold_best: bool = True, pretty_tasks: Mapping[str, str] | None = None, caption: str | None = None, label: str | None = None, method: str | None = None) -> Path

Booktabs LaTeX table of mean ± CI half-width with bold-best per task.

method overrides self.ci_method ("t" or "bootstrap").

pairwise_to_latex(path: str | Path, *, alpha: float = 0.05, caption: str | None = None, label: str | None = None, method: str | None = None) -> Path

LaTeX booktabs table of pairwise Holm-adjusted p-values.

method overrides self.pairwise_method ("t" or "wilcoxon") for this call only.

plot(ax: plt.Axes | None = None, *, ci: float = 0.95, ylabel: str | None = None, title: str | None = None, annotate: bool = True, pretty_tasks: Mapping[str, str] | None = None) -> tuple[plt.Figure, plt.Axes]

Grouped bar chart of mean ± CI half-width across seeds.

plot_forest(ax: plt.Axes | None = None, *, ci: float = 0.95, pretty_tasks: Mapping[str, str] | None = None, xlabel: str | None = None, height_per_task: float = 0.42, sort_within: bool = False, band: bool = True) -> tuple[plt.Figure, plt.Axes]

Forest plot, one row per task with models jittered within the row.

Height scales with the number of tasks only – adding more models widens the within-row jitter rather than adding new rows – so the figure stays compact for many models.

sort_within=True orders the jittered positions per-task so the best mean lands at the top of the row (helps spot leaders when there are many models). Each model keeps a stable colour across tasks.

band=True shades alternating task rows (banded reading aid).

plot_pairwise(ax: plt.Axes | None = None, *, ci: float = 0.95, alpha: float = 0.05, pretty_tasks: Mapping[str, str] | None = None, layout: str = 'matrix', max_cols: int = 3, method: str | None = None) -> tuple[plt.Figure, Any]

Pairwise comparison plot, with two layouts that scale differently.

layout="matrix" (default) – one significance heatmap per task, with the lower triangle holding \(-\log_{10}(p_{\text{Holm}})\) and the upper triangle holding the signed mean difference. Scales to many models and many tasks (panels arranged in a grid with at most max_cols columns).

layout="list" – one row per pairwise comparison with CI whiskers and a significance marker. Best for small numbers of comparisons.

method overrides self.pairwise_method ("t" or "wilcoxon") for this call only.

plot_critical_difference(*, alpha: float = 0.05, title: str | None = None) -> tuple[matplotlib.pyplot.Figure, matplotlib.pyplot.Axes]

Demšar critical-difference (CD) diagram.

Computes mean ranks of every model across tasks and overlays the Nemenyi critical difference at level alpha. Models within CD of each other are joined by a thick horizontal “clique” bar (i.e., not significantly different). This is the canonical scalable visualization for multi-method, multi-dataset benchmarks (Demšar, 2006).

Only models present in every task are included. Requires at least two tasks and at least two such models.

plot_learning_curves(*, ci: float = 0.95, metric_key: str | None = None, pretty_tasks: Mapping[str, str] | None = None, ylabel: str = 'Test accuracy', legend_loc: str = 'lower right') -> tuple[matplotlib.pyplot.Figure, numpy.ndarray]

Mean ± t-CI learning curves, one panel per task, sharing y-axis.

One-call publication artefacts

# Mean ± t-CI table; row-best in bold green, ties in almond cream
report.to_latex("results.tex", ci=0.95, bold_best=True)

# Holm-corrected pairwise test table
report.pairwise_to_latex("pairwise.tex")

# Per-task forest plot, models jittered within rows
fig, _ = report.plot_forest(ci=0.95)

# Pairwise significance heatmap (one panel per task)
fig, _ = report.plot_pairwise(layout="matrix")

# Demšar critical-difference diagram across tasks
fig, _ = report.plot_critical_difference(alpha=0.05)

For interactive analysis, see Reading the report.

Statistical guarantees

The library is opinionated about which tests are appropriate for which question. Each one is stated explicitly below so you can cite it without re-deriving it:

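Per-cell CI (default Student’s t) over \(n\) seeds, at confidence level \(1-\alpha\), with sample mean \(\bar{x}\) and sample standard deviation \(s\):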

\[\bar{x} \;\pm\; t_{1-\alpha/2,\,n-1}\,\dfrac{s}{\sqrt{n}}\]

For non-Gaussian metrics (Hits@K, MRR, AUC on imbalanced splits), pass method="bootstrap" to summary() or set report.ci_method = "bootstrap".
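A worked example of both interval types on five made-up per-seed scores is sketched below. The t critical value is hard-coded for four degrees of freedom to keep the example dependency-free, and `percentile_bootstrap` is a hypothetical helper whose defaults simply mirror bootstrap_n and bootstrap_seed above; the library's own resampling scheme may differ.

```python
import random
import statistics

scores = [0.80, 0.82, 0.81, 0.79, 0.83]  # made-up per-seed final metrics, n = 5
n = len(scores)
mean = statistics.fmean(scores)
sem = statistics.stdev(scores) / n ** 0.5  # s / sqrt(n), s = sample std (n - 1 denominator)

# Student's t: the 0.975 quantile with n - 1 = 4 degrees of freedom is about 2.776.
T_CRIT_975_DF4 = 2.776
half_width = T_CRIT_975_DF4 * sem
t_interval = (mean - half_width, mean + half_width)


def percentile_bootstrap(values, n_resamples=10_000, ci=0.95, seed=0):
    """Percentile-bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lo_idx = int((1 - ci) / 2 * n_resamples)
    hi_idx = int((1 + ci) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


boot_interval = percentile_bootstrap(scores)
print(t_interval, boot_interval)
```

The bootstrap interval can never extend beyond the smallest and largest observed score, which is one reason it behaves more sensibly than the t interval for bounded metrics close to 0 or 1.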

Holm pairwise. The default is a paired t-test per task across seed-aligned final metrics, then Holm step-down adjustment so the family-wise error rate is controlled:

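\[p_{(i)}^{\text{adj}} \;=\; \max_{j \le i}\, \min\!\big((m - j + 1)\,p_{(j)},\, 1\big)\]

where \(p_{(1)} \le \dots \le p_{(m)}\) are the raw p-values of the \(m\) comparisons in the family, sorted in ascending order; the running maximum keeps the adjusted values monotone.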

For small seed counts (typically \(n < 10\)) where the normality assumption of the paired t-test is fragile, pass method="wilcoxon" to pairwise(), pairwise_to_latex(), or plot_pairwise(), or set report.pairwise_method = "wilcoxon" to change the default for the whole report (Benavoli et al., JMLR 17(5):1-36, 2016).
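The Holm step-down adjustment itself is only a few lines. The sketch below is the standard procedure written as a hypothetical stand-alone helper (`holm_adjust` is not a graphnetz function): sort the raw p-values, multiply the smallest by the number of comparisons, the next by one fewer, and so on, then enforce monotonicity with a running maximum.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):  # rank is 0-based, so the multiplier is m - rank
        candidate = min((m - rank) * p_values[idx], 1.0)
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Made-up raw p-values for the three pairwise comparisons among three models.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
print(adjusted)
```

In this example the largest raw p-value (0.04) would be left unchanged by its own multiplier of 1, but the running maximum lifts it to the adjusted value of the comparison ranked just before it, so the ordering of the raw p-values is never inverted.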

Friedman + Nemenyi. Models are ranked within each task and the ranks are averaged across \(N\) tasks; with \(k\) models, clique bars in the CD diagram join models whose mean ranks differ by less than the Nemenyi critical difference:

\[CD_\alpha \;=\; q_\alpha\,\sqrt{\dfrac{k(k+1)}{6N}}\]

The CD diagram is the canonical scalable visualisation for multi-method, multi-dataset comparisons (Demšar, 2006); the implementation handles heterogeneous metric directions per-task before averaging ranks.

Reproducibility

run_benchmark reseeds every RNG the training code reaches — random.seed, numpy.random.seed, torch.manual_seed, and torch.cuda.manual_seed_all — before each (task, model, seed) triple. Combinatorial loaders thread the seed through to their data generator, so cross-seed variance reflects both model initialisation and data resampling, not only the former.

A run with the same seed list and software stack reproduces bit-for-bit on the same hardware.
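The reseeding step described above can be sketched as follows. `reseed_everything` is a hypothetical helper, not the library's function; the NumPy and Torch imports are made optional here only so that the snippet runs in a minimal environment, and the four seeding calls are the ones named in the text.

```python
import random

try:
    import numpy as np
except ImportError:  # NumPy not installed: skip that RNG
    np = None

try:
    import torch
except ImportError:  # Torch not installed: skip those RNGs
    torch = None


def reseed_everything(seed):
    """Reset the global RNGs of Python, NumPy, and Torch (CPU and CUDA)."""
    random.seed(seed)
    if np is not None:
        np.random.seed(seed)
    if torch is not None:
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable


# Reseeding with the same value replays the same random stream.
reseed_everything(7)
first = [random.random() for _ in range(3)]
reseed_everything(7)
second = [random.random() for _ in range(3)]
print(first == second)
```

Seeding alone fixes the random streams; whether individual GPU kernels are deterministic is a separate question that depends on the software stack and hardware, which is why the guarantee above is stated for the same stack on the same hardware.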