graphnetz.benchmark
Statistically robust benchmarks across a category for one or many models.
The dispatcher trains every compatible (model, task) pair across multiple
seeds and returns a BenchmarkReport that exposes mean ± 95 % t-CI,
paired t-tests with Holm-Bonferroni correction, publication-ready LaTeX
tables, and plots.
Custom models are plugged in via the same three paths as before:
Decorator / registry:
    import torch
    from graphnetz import register_model

    @register_model(task_type="node_cls")
    class MyGNN(torch.nn.Module):
        def __init__(self, in_channels, hidden_channels, out_channels):
            ...
Class attribute:
    class MyGNN(torch.nn.Module):
        task_types = {"node_cls"}
Inline tuple (cls, tasks) or (cls, tasks, factory) in the models mapping:
    run_benchmark("social", {"MyGNN": (MyGNN, "node_cls")})
The default factory calls cls(in_channels, hidden_channels, out_channels);
DGI-task models receive (in_channels, hidden_channels) (the third argument
is dropped).
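The calling convention above can be illustrated with a minimal sketch. It is not the library's implementation: default_factory_sketch, the two stand-in classes, and the task-type label "dgi" are assumptions made for illustration only.

```python
class NodeClassifier:
    """Stand-in for a supervised model that takes three channel sizes."""

    def __init__(self, in_channels, hidden_channels, out_channels):
        self.dims = (in_channels, hidden_channels, out_channels)


class DGIEncoder:
    """Stand-in for a DGI-style encoder that has no output dimension."""

    def __init__(self, in_channels, hidden_channels):
        self.dims = (in_channels, hidden_channels)


def default_factory_sketch(cls, task_type, in_channels, hidden_channels, out_channels):
    # "dgi" is an assumed label for the DGI task type.
    if task_type == "dgi":
        # The third argument is dropped for DGI-task models.
        return cls(in_channels, hidden_channels)
    return cls(in_channels, hidden_channels, out_channels)


clf = default_factory_sketch(NodeClassifier, "node_cls", 16, 64, 7)
enc = default_factory_sketch(DGIEncoder, "dgi", 16, 64, 7)
```

A model whose constructor takes other arguments would instead need its own factory, passed through one of the three paths above.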
- class graphnetz.benchmark.BenchmarkReport(seeds: tuple[int, ...], histories: dict[str, dict[str, list[dict[str, list[float]]]]], config: dict[str, typing.Any] = <factory>, ci_method: str = 't', bootstrap_n: int = 10000, bootstrap_seed: int = 0, pairwise_method: str = 't')
Bases: object
Structured outcome of a multi-seed benchmark run.
histories[task][model] is a list with one history dict per seed (in seed order). The report is also a read-only mapping task -> {model: history_seed_0} for backward compatibility with single-seed callers.
- final_metrics(key: str | None = None) -> dict[str, dict[str, list[float]]]
Final metric value per (task, model, seed).
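As a rough illustration of the shape involved, the sketch below reduces a nested histories dict to one value per (task, model, seed). It assumes that "final" means the last recorded entry of the metric list, which the text does not state; final_metrics_sketch and the sample data are hypothetical.

```python
def final_metrics_sketch(histories, key):
    """Return {task: {model: [last value of `key` for each seed]}}."""
    return {
        task: {
            model: [history[key][-1] for history in per_seed]
            for model, per_seed in per_model.items()
        }
        for task, per_model in histories.items()
    }


# Hypothetical histories: one task, one model, two seeds, three epochs each.
histories = {
    "mutag": {
        "MyGNN": [
            {"train_loss": [0.90, 0.55, 0.41]},  # seed 0
            {"train_loss": [0.95, 0.60, 0.44]},  # seed 1
        ],
    },
}
finals = final_metrics_sketch(histories, "train_loss")
```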
- summary(ci: float = 0.95, method: str | None = None) -> pandas.DataFrame
Per-(task, model) mean, std, sem, CI half-width and bounds.
method overrides self.ci_method for this call only; choose "t" for Student’s-t intervals (default) or "bootstrap" for percentile-bootstrap intervals (better for non-Gaussian metrics such as Hits@K, MRR, or AUC).
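For readers unfamiliar with the percentile bootstrap, here is a minimal standard-library sketch of the idea: resample the per-seed values with replacement, take the mean of each resample, and read the interval off the empirical quantiles. The function name and sample values are hypothetical, and the library's own resampling details may differ; only the defaults of 10000 resamples and seed 0 mirror the constructor signature.

```python
import random
import statistics


def bootstrap_ci_sketch(values, ci=0.95, n_resamples=10_000, seed=0):
    """Percentile-bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Mean of each resample drawn with replacement, sorted for quantile lookup.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - ci) / 2.0
    lo_idx = int(tail * (n_resamples - 1))
    hi_idx = int((1.0 - tail) * (n_resamples - 1))
    return means[lo_idx], means[hi_idx]


# Hypothetical per-seed final metrics for one (task, model) pair.
per_seed_metric = [0.71, 0.74, 0.69, 0.73, 0.72]
low, high = bootstrap_ci_sketch(per_seed_metric)
```

Unlike the t-interval, this interval can never extend beyond the observed range of the data, which is one reason it suits bounded metrics.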
- pairwise(alpha: float = 0.05, method: str | None = None) -> pandas.DataFrame
Paired pairwise tests between models per task with Holm adjustment.
method overrides self.pairwise_method for this call only:
"t" (default) – paired Student’s t-test on per-seed final metrics.
"wilcoxon" – non-parametric Wilcoxon signed-rank test on the paired differences. Recommended at small seed counts where the paired t-test’s normality assumption is most fragile; see Benavoli et al., JMLR 17(5):1-36, 2016.
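The Holm adjustment applied to the raw p-values is the standard step-down procedure. A minimal sketch follows; holm_adjust is a hypothetical helper written for illustration, not the function the library uses.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is scaled by m, the next by m - 1, and so on.
        scaled = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)  # approximately [0.03, 0.06, 0.06]
```

A comparison is then declared significant when its adjusted p-value falls below alpha.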
- friedman(alpha: float = 0.05) -> dict[str, float | int | bool]
Friedman omnibus test on per-task ranks of seed-mean metrics.
Returns a dict with the statistic chi2, the asymptotic $\chi^2_{k-1}$ p-value, the rejection flag at alpha, and the $(k, N)$ shape used. The Nemenyi post-hoc surfaced in plot_critical_difference() should only be interpreted when rejected is true (Demšar, 2006).
Only models present in every task are included; per-task ranks use the metric direction (lower-is-better for val_mae and train_loss).
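To make the statistic concrete, the sketch below computes the textbook Friedman chi-square from a table of seed-mean scores with N tasks as rows and k models as columns. It omits the tie correction and the p-value, and both helper names are hypothetical; the library's exact computation may differ.

```python
def average_ranks(scores, higher_is_better=True):
    """Rank one task's scores (1 = best), averaging ranks over ties."""
    order = sorted(
        range(len(scores)), key=lambda i: scores[i], reverse=higher_is_better
    )
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0  # mean of positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def friedman_chi2(score_table, higher_is_better=True):
    """Return (chi2, mean_ranks) for rows = tasks, columns = models."""
    n_tasks = len(score_table)
    k = len(score_table[0])
    rank_rows = [average_ranks(row, higher_is_better) for row in score_table]
    mean_ranks = [sum(row[j] for row in rank_rows) / n_tasks for j in range(k)]
    chi2 = (
        12.0 * n_tasks / (k * (k + 1))
        * (sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0)
    )
    return chi2, mean_ranks


# Three tasks, three models, with the same ordering on every task.
table = [[0.90, 0.80, 0.70], [0.85, 0.80, 0.60], [0.95, 0.90, 0.88]]
chi2, mean_ranks = friedman_chi2(table)
```

For a lower-is-better metric such as val_mae, pass higher_is_better=False so that rank 1 still denotes the best model.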
- to_latex(path: str | Path, *, ci: float = 0.95, bold_best: bool = True, pretty_tasks: Mapping[str, str] | None = None, caption: str | None = None, label: str | None = None, method: str | None = None) -> Path
Booktabs LaTeX table of mean ± CI half-width with bold-best per task.
method overrides self.ci_method ("t" or "bootstrap").
- pairwise_to_latex(path: str | Path, *, alpha: float = 0.05, caption: str | None = None, label: str | None = None, method: str | None = None) -> Path
LaTeX booktabs table of pairwise Holm-adjusted p-values.
method overrides self.pairwise_method ("t" or "wilcoxon") for this call only.
- plot(ax: plt.Axes | None = None, *, ci: float = 0.95, ylabel: str | None = None, title: str | None = None, annotate: bool = True, pretty_tasks: Mapping[str, str] | None = None) -> tuple[plt.Figure, plt.Axes]
Grouped bar chart of mean ± CI half-width across seeds.
- plot_forest(ax: plt.Axes | None = None, *, ci: float = 0.95, pretty_tasks: Mapping[str, str] | None = None, xlabel: str | None = None, height_per_task: float = 0.42, sort_within: bool = False, band: bool = True) -> tuple[plt.Figure, plt.Axes]
Forest plot, one row per task with models jittered within the row.
Height scales with the number of tasks only – adding more models widens the within-row jitter rather than adding new rows – so the figure stays compact for many models.
sort_within=True orders the jittered positions per-task so the best mean lands at the top of the row (helps spot leaders when there are many models). Each model keeps a stable colour across tasks. band=True shades alternating task rows (banded reading aid).
- plot_pairwise(ax: plt.Axes | None = None, *, ci: float = 0.95, alpha: float = 0.05, pretty_tasks: Mapping[str, str] | None = None, layout: str = 'matrix', max_cols: int = 3, method: str | None = None) -> tuple[plt.Figure, Any]
Pairwise comparison plot, with two layouts that scale differently.
layout="matrix" (default) – one significance heatmap per task, with the lower triangle holding $-\log_{10}(p_{\text{Holm}})$ and the upper triangle holding the signed mean difference. Scales to many models and many tasks (panels arranged in a grid with at most max_cols columns).
layout="list" – one row per pairwise comparison with CI whiskers and a significance marker. Best for small numbers of comparisons.
method overrides self.pairwise_method ("t" or "wilcoxon") for this call only.
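The contents of one matrix panel can be pictured with the sketch below, which fills the lower triangle with the negative base-10 log of the Holm-adjusted p-value and the upper triangle with a signed mean difference. The sign convention (row model minus column model), the function name, and the sample numbers are assumptions; the plotting code itself is not shown.

```python
import math


def pairwise_matrix_sketch(models, mean_by_model, p_holm):
    """Build the values behind one per-task significance heatmap."""
    n = len(models)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            pair = frozenset((models[i], models[j]))
            if i > j:
                # Lower triangle: larger values mean stronger evidence.
                matrix[i][j] = -math.log10(p_holm[pair])
            else:
                # Upper triangle: assumed convention is row minus column.
                matrix[i][j] = mean_by_model[models[i]] - mean_by_model[models[j]]
    return matrix


# Hypothetical inputs for a two-model task.
models = ["A", "B"]
means = {"A": 0.80, "B": 0.70}
p_holm = {frozenset(("A", "B")): 0.01}
panel = pairwise_matrix_sketch(models, means, p_holm)
```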
- plot_critical_difference(*, alpha: float = 0.05, title: str | None = None) -> tuple[matplotlib.pyplot.Figure, matplotlib.pyplot.Axes]
Demšar critical-difference (CD) diagram.
Computes mean ranks of every model across tasks and overlays the Nemenyi critical difference at level alpha. Models within CD of each other are joined by a thick horizontal “clique” bar (i.e., not significantly different). This is the canonical scalable visualization for multi-method, multi-dataset benchmarks (Demšar, 2006).
Only models present in every task are included. Requires at least two tasks and at least two such models.
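The critical difference itself follows the formula given by Demšar (2006): the critical value q_alpha times the square root of k(k+1)/(6N), for k models and N tasks. The sketch below hard-codes the two-tailed Nemenyi critical values for alpha = 0.05 and k up to 10; the helper names are hypothetical and the library may obtain its critical values differently.

```python
import math

# Two-tailed Nemenyi critical values at alpha = 0.05 for k = 2..10 models,
# as tabulated by Demšar (2006).
Q_ALPHA_005 = {
    2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
    7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164,
}


def nemenyi_cd(k, n_tasks):
    """Critical difference in mean rank for k models over n_tasks tasks."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6.0 * n_tasks))


def same_clique(mean_rank_a, mean_rank_b, cd):
    """True when two models are not significantly different."""
    return abs(mean_rank_a - mean_rank_b) < cd


cd = nemenyi_cd(k=3, n_tasks=10)  # roughly 1.05 rank units
```

Because CD shrinks with the square root of N, a benchmark with few tasks can leave every model in one clique even when the mean ranks look well separated.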
- plot_learning_curves(*, ci: float = 0.95, metric_key: str | None = None, pretty_tasks: Mapping[str, str] | None = None, ylabel: str = 'Test accuracy', legend_loc: str = 'lower right') -> tuple[matplotlib.pyplot.Figure, numpy.ndarray]
Mean ± t-CI learning curves, one panel per task, sharing y-axis.
- class graphnetz.benchmark.ModelSpec(cls: type, task_type: frozenset[str] = <factory>, factory: collections.abc.Callable[[...], torch.nn.Module] | None = None)
Bases: object
How to instantiate a model and which task types it supports.
- factory: Callable[[...], torch.nn.Module] | None = None
- class graphnetz.benchmark.Task(name: str, task_type: str, loader: Callable[[...], Any], epochs: int = 30)
Bases: object
A single benchmark task: a dataset loader plus its training task type.
- graphnetz.benchmark.iter_benchmark_tasks(category: str | None = None, task_type: str | None = None) -> list[Task]
Flatten BENCHMARK_TASKS to a list, optionally filtered by category and task type.
Examples
>>> [
...     t.name
...     for t in iter_benchmark_tasks(category="biology", task_type="graph_cls")
... ]
['mutag', 'proteins']
- graphnetz.benchmark.plot_benchmark(results: BenchmarkReport | Mapping[str, Mapping[str, Mapping[str, list[float]]]], errors: Mapping[str, Mapping[str, float]] | None = None, ax: plt.Axes | None = None, title: str | None = None, annotate: bool = True, ci: float = 0.95) -> tuple[plt.Figure, plt.Axes]
Grouped bar chart with mean ± CI error bars.
Accepts a BenchmarkReport (preferred) or the legacy dict form for a single seed. errors overrides the default t-CI half-width.
- graphnetz.benchmark.register_model(cls: type | None = None, *, task_type: str | Iterable[str], factory: Callable[[...], torch.nn.Module] | None = None) -> Callable[[type], type] | type
Register a model with the benchmark dispatcher.
Usable as a decorator (@register_model(task_type="node_cls")) or as a plain function (register_model(MyGNN, task_type={"graph_cls", "graph_reg"})).
- graphnetz.benchmark.register_task(category: str, task_type: Task) -> None
Register task under category in BENCHMARK_TASKS.
The task becomes visible to run_benchmark(category) and to iter_benchmark_tasks(). Use unregister_task() to remove it (e.g. in tearDown of a test).
- graphnetz.benchmark.run_benchmark(category: str | None = None, models: type | tuple[Any, ...] | ModelSpec | dict[str, type | tuple[Any, ...] | ModelSpec] | None = None, root: str = 'data/benchmark', hidden_channels: int = 64, epochs: int | None = None, only: list[str] | None = None, verbose: bool = True, seeds: int | Iterable[int] | None = None, seed: int | None = None, task_type: str | None = None, tasks: Iterable[Task] | None = None, device: torch.device | str | None = 'auto') -> BenchmarkReport
Run a benchmark across one or more (model, task, seed) combinations.
Two ways to choose tasks:
By category (default) – tasks come from BENCHMARK_TASKS indexed as [category][task_type] -> list[Task]. Pass category="social" (etc.) and optionally restrict with task_type and only=.
Ad-hoc – pass tasks=[Task(...), ...] to bypass the registry entirely. Useful for benchmarking custom datasets without mutating global state. category then defaults to "custom" and is used only to namespace root/cache directories.
The runner trains every compatible (model, task) pair across each value in seeds (default (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)) and aggregates the per-seed histories into a BenchmarkReport.
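The shape of that loop can be sketched without any training code. In the illustration below, run_grid_sketch, fake_train, and the sample models and tasks are all hypothetical; the point is only how compatible pairs are selected and how per-seed histories end up in a histories[task][model] list in seed order.

```python
def run_grid_sketch(models, tasks, seeds, train_fn):
    """Train every compatible (model, task) pair once per seed."""
    histories = {}
    for task_name, task_type in tasks.items():
        for model_name, (model_cls, supported) in models.items():
            if task_type not in supported:
                continue  # skip incompatible (model, task) pairs
            per_seed = [train_fn(model_cls, task_name, seed) for seed in seeds]
            histories.setdefault(task_name, {})[model_name] = per_seed
    return histories


def fake_train(model_cls, task_name, seed):
    # Stand-in for a real training run: returns a three-epoch history dict.
    return {"train_loss": [1.0 / (epoch + 1 + seed) for epoch in range(3)]}


models = {
    "MyGNN": (object, {"node_cls"}),
    "MyPooler": (object, {"graph_cls"}),
}
tasks = {"mutag": "graph_cls"}
histories = run_grid_sketch(models, tasks, seeds=range(3), train_fn=fake_train)
```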
- graphnetz.benchmark.save_figure(fig: matplotlib.pyplot.Figure, path: str | Path, formats: Sequence[str] = ('pdf', 'png'), dpi: int = 300) -> list[Path]
Save fig to one path stem in multiple formats; returns the saved paths.
- graphnetz.benchmark.task_from_dataset(name: str, task_type: str, dataset: Any, *, epochs: int = 30) -> Task
Wrap an already-loaded dataset as a Task.
The dataset must satisfy the conventions for task_type: a PyG dataset or any object exposing ds[0] plus the relevant attributes (num_features / num_classes / num_relations). The benchmark dispatcher caches the dataset, so the same instance is reused across seeds without reloading.
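One plausible way to wrap an in-memory dataset is a loader closure that always returns the same object, which is what makes reuse across seeds free. The sketch below uses a stand-in dataclass with the same fields as the Task signature above; TaskSketch, task_from_dataset_sketch, and the toy dataset are hypothetical, and the library's actual caching mechanism is not documented here.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class TaskSketch:
    """Stand-in mirroring the fields of the Task signature."""

    name: str
    task_type: str
    loader: Callable[..., Any]
    epochs: int = 30


def task_from_dataset_sketch(name, task_type, dataset, *, epochs=30):
    # The loader ignores its arguments and hands back the same instance,
    # so nothing is reloaded between seeds.
    return TaskSketch(
        name, task_type, loader=lambda *args, **kwargs: dataset, epochs=epochs
    )


toy_dataset = [{"x": [0.0, 1.0], "y": 1}]  # hypothetical in-memory dataset
task = task_from_dataset_sketch("toy", "graph_cls", toy_dataset, epochs=5)
```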