`graphnetz.benchmark`

Statistically robust benchmarks across a category for one or many models.

The dispatcher trains every compatible (model, task) pair across multiple seeds and returns a BenchmarkReport that exposes mean ± 95 % t-CI, paired t-tests with Holm-Bonferroni correction, publication-ready LaTeX tables, and plots.

Custom models are plugged in via the same three paths as before:

Decorator / registry:

```
import torch
from graphnetz import register_model


@register_model(task_type="node_cls")
class MyGNN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels): ...
```

Class attribute:

```
class MyGNN(torch.nn.Module):
    task_types = {"node_cls"}
```

Inline tuple (cls, tasks) or (cls, tasks, factory) in the models mapping:
```
run_benchmark("social", {"MyGNN": (MyGNN, "node_cls")})
```

The default factory calls cls(in_channels, hidden_channels, out_channels); DGI-task models receive (in_channels, hidden_channels) (the third argument is dropped).
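The dispatch rule above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `default_factory`, `Classifier`, and `Encoder` are hypothetical names, and the assumption that DGI tasks are identified by the task-type string `"dgi"` is not confirmed by this page.

```python
def default_factory(cls, in_channels, hidden_channels, out_channels, *, task_type="node_cls"):
    """Hypothetical stand-in for the default factory described above."""
    # Assumption: DGI tasks carry the task-type string "dgi".
    if task_type == "dgi":
        # DGI-task models only take (in_channels, hidden_channels).
        return cls(in_channels, hidden_channels)
    return cls(in_channels, hidden_channels, out_channels)


class Classifier:
    """Plain-Python placeholder for a three-argument model."""

    def __init__(self, in_channels, hidden_channels, out_channels):
        self.shape = (in_channels, hidden_channels, out_channels)


class Encoder:
    """Plain-Python placeholder for a two-argument (DGI-style) model."""

    def __init__(self, in_channels, hidden_channels):
        self.shape = (in_channels, hidden_channels)


clf = default_factory(Classifier, 16, 32, 7)
enc = default_factory(Encoder, 16, 32, 7, task_type="dgi")
print(clf.shape)  # (16, 32, 7)
print(enc.shape)  # (16, 32)
```

A model whose constructor does not fit either shape would need a custom `factory`, supplied through one of the three paths above.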

class graphnetz.benchmark.BenchmarkReport(seeds: tuple[int, ...], histories: dict[str, dict[str, list[dict[str, list[float]]]]], config: dict[str, Any] = <factory>, ci_method: str = 't', bootstrap_n: int = 10000, bootstrap_seed: int = 0, pairwise_method: str = 't')

Bases: object

Structured outcome of a multi-seed benchmark run.

histories[task][model] is a list with one history dict per seed (in seed order). The report is also a read-only mapping task -> {model: history_seed_0} for backward compatibility with single-seed callers.
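To make the nesting concrete, here is a small sketch with made-up numbers. The task name `"cora"`, the model name `"GCN"`, and the metric key `"val_acc"` are illustrative assumptions, as is the reading that the "final" value of a metric is the last entry of its per-epoch list.

```python
# Hypothetical histories: one task, one model, two seeds, three epochs each.
histories = {
    "cora": {
        "GCN": [
            {"train_loss": [1.2, 0.8, 0.5], "val_acc": [0.60, 0.72, 0.80]},  # seed 0
            {"train_loss": [1.1, 0.7, 0.4], "val_acc": [0.62, 0.75, 0.82]},  # seed 1
        ],
    },
}


def final_values(histories, key):
    """Last recorded value of `key` for every (task, model, seed)."""
    return {
        task: {model: [run[key][-1] for run in runs] for model, runs in models.items()}
        for task, models in histories.items()
    }


# Single-seed view: task -> {model: history of the first seed}.
legacy_view = {
    task: {model: runs[0] for model, runs in models.items()}
    for task, models in histories.items()
}

finals = final_values(histories, "val_acc")
print(finals["cora"]["GCN"])  # [0.8, 0.82]
print(legacy_view["cora"]["GCN"]["val_acc"])  # [0.6, 0.72, 0.8]
```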

seeds: tuple[int, ...]

histories: dict[str, dict[str, list[dict[str, list[float]]]]]

config: dict[str, Any]

ci_method: str = 't'

bootstrap_n: int = 10000

bootstrap_seed: int = 0

pairwise_method: str = 't'

items()

keys()

values()

final_metrics(key: str | None = None) → dict[str, dict[str, list[float]]]

Final metric value per (task, model, seed).

metric_name() → str

summary(ci: float = 0.95, method: str | None = None) → pandas.DataFrame

Per-(task, model) mean, std, sem, CI half-width and bounds.

method overrides self.ci_method for this call only; choose "t" for Student’s-t intervals (default) or "bootstrap" for percentile-bootstrap intervals (better for non-Gaussian metrics such as Hits@K, MRR, or AUC).
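The difference between the two interval types can be illustrated with the standard library alone. The sketch below is not the library's code: the function names are hypothetical, the scores are made up, and the t critical value is hard-coded for five seeds (4 degrees of freedom) rather than looked up.

```python
import math
import random
import statistics


def t_interval(values, t_crit):
    """Student's-t interval: mean ± t_crit * standard error of the mean."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean - t_crit * sem, mean + t_crit * sem


def bootstrap_interval(values, ci=0.95, n_resamples=10_000, seed=0):
    """Percentile bootstrap: resample with replacement, take quantiles of the means."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - ci) / 2 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 + ci) / 2 * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


scores = [0.81, 0.79, 0.84, 0.80, 0.82]  # made-up per-seed final metrics
T_CRIT_95_DF4 = 2.776  # two-sided 95 % critical value, Student's t, 4 d.o.f.

t_lo, t_hi = t_interval(scores, T_CRIT_95_DF4)
b_lo, b_hi = bootstrap_interval(scores)
print(f"t-CI:         [{t_lo:.4f}, {t_hi:.4f}]")
print(f"bootstrap CI: [{b_lo:.4f}, {b_hi:.4f}]")
```

The bootstrap interval can never extend beyond the observed minimum and maximum, which is one reason it behaves more sensibly for bounded metrics.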

pairwise(alpha: float = 0.05, method: str | None = None) → pandas.DataFrame

Paired pairwise tests between models per task with Holm adjustment.
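For readers unfamiliar with the Holm step-down procedure, the following is a minimal pure-Python sketch of the adjustment applied to a family of raw p-values. `holm_adjust` is a hypothetical helper and the p-values are made up; how the library groups comparisons into families (presumably per task) is an assumption.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The i-th smallest p-value is multiplied by (m - i), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # made-up raw p-values for three model pairs
adj = holm_adjust(raw)
print([round(p, 4) for p in adj])  # [0.03, 0.06, 0.06]
```

A pair is declared significant when its adjusted p-value falls below `alpha`.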

method overrides self.pairwise_method for this call only:

"t" (default) – paired Student’s t-test on per-seed final metrics.
"wilcoxon" – non-parametric Wilcoxon signed-rank test on the paired differences. Recommended at small seed counts where the paired t-test’s normality assumption is most fragile; see Benavoli et al., JMLR 17(5):1-36, 2016.

friedman(alpha: float = 0.05) → dict[str, float | int | bool]

Friedman omnibus test on per-task ranks of seed-mean metrics.

Returns a dict with the statistic chi2, the asymptotic $\chi^2_{k-1}$ p-value, the rejection flag at alpha, and the $(k, N)$ shape used. The Nemenyi post-hoc surfaced in plot_critical_difference() should only be interpreted when rejected is true (Demšar, 2006).

Only models present in every task are included; per-task ranks use the metric direction (lower-is-better for val_mae and train_loss).
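As a rough illustration of the statistic, the sketch below ranks models within each task and evaluates the textbook Friedman formula from Demšar (2006), with $k$ models and $N$ tasks. It is a simplified sketch: the helper names and numbers are hypothetical, no tie correction is applied to the statistic, and the p-value lookup against $\chi^2_{k-1}$ is omitted.

```python
def average_ranks(values, higher_is_better=True):
    """Rank values 1..k (1 = best); tied entries share their mean rank."""
    order = sorted(
        range(len(values)),
        key=lambda i: -values[i] if higher_is_better else values[i],
    )
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def friedman_chi2(table, higher_is_better=True):
    """table[i][j] = seed-mean metric of model j on task i."""
    n_tasks, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(average_ranks(row, higher_is_better)):
            rank_sums[j] += r
    mean_ranks = [s / n_tasks for s in rank_sums]
    chi2 = 12 * n_tasks / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return chi2, mean_ranks


# Made-up seed-mean accuracies: 4 tasks (rows) x 3 models (columns).
table = [
    [0.80, 0.75, 0.70],
    [0.82, 0.78, 0.74],
    [0.79, 0.81, 0.70],
    [0.85, 0.80, 0.76],
]
chi2, mean_ranks = friedman_chi2(table)
print(mean_ranks)  # [1.25, 1.75, 3.0]
print(chi2)        # 6.5
```

For a lower-is-better metric such as val_mae, pass `higher_is_better=False` so that the smallest value receives rank 1.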

Booktabs LaTeX table of mean ± CI half-width with bold-best per task.

method overrides self.ci_method ("t" or "bootstrap").

pairwise_to_latex(path: str | Path, *, alpha: float = 0.05, caption: str | None = None, label: str | None = None, method: str | None = None) → Path

LaTeX booktabs table of pairwise Holm-adjusted p-values.

method overrides self.pairwise_method ("t" or "wilcoxon") for this call only.

plot(ax: plt.Axes | None = None, *, ci: float = 0.95, ylabel: str | None = None, title: str | None = None, annotate: bool = True, pretty_tasks: Mapping[str, str] | None = None) → tuple[plt.Figure, plt.Axes]

Grouped bar chart of mean ± CI half-width across seeds.

plot_forest(ax: plt.Axes | None = None, *, ci: float = 0.95, pretty_tasks: Mapping[str, str] | None = None, xlabel: str | None = None, height_per_task: float = 0.42, sort_within: bool = False, band: bool = True) → tuple[plt.Figure, plt.Axes]

Forest plot, one row per task with models jittered within the row.

Height scales with the number of tasks only – adding more models widens the within-row jitter rather than adding new rows – so the figure stays compact for many models.

sort_within=True orders the jittered positions per-task so the best mean lands at the top of the row (helps spot leaders when there are many models). Each model keeps a stable colour across tasks.

band=True shades alternating task rows (banded reading aid).

plot_pairwise(ax: plt.Axes | None = None, *, ci: float = 0.95, alpha: float = 0.05, pretty_tasks: Mapping[str, str] | None = None, layout: str = 'matrix', max_cols: int = 3, method: str | None = None) → tuple[plt.Figure, Any]

Pairwise comparison plot, with two layouts that scale differently.

layout="matrix" (default) – one significance heatmap per task, with the lower triangle holding $-\log_{10}(p_{\text{Holm}})$ and the upper triangle holding the signed mean difference. Scales to many models and many tasks (panels arranged in a grid with at most max_cols columns).

layout="list" – one row per pairwise comparison with CI whiskers and a significance marker. Best for small numbers of comparisons.

method overrides self.pairwise_method ("t" or "wilcoxon") for this call only.
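The content of one matrix panel can be sketched as follows. This only builds the numbers, not the heatmap; `pairwise_matrix` is a hypothetical helper, the inputs are made up, and the sign convention (row model minus column model) is an assumption.

```python
import math


def pairwise_matrix(models, mean, p_holm):
    """Lower triangle: -log10(Holm p-value); upper triangle: signed mean difference."""
    k = len(models)
    mat = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i > j:
                pair = frozenset((models[i], models[j]))
                mat[i][j] = -math.log10(p_holm[pair])
            elif i < j:
                # Assumed convention: row model minus column model.
                mat[i][j] = mean[models[i]] - mean[models[j]]
    return mat


models = ["A", "B"]
mean = {"A": 0.80, "B": 0.70}             # made-up seed means
p_holm = {frozenset(("A", "B")): 0.01}    # made-up Holm-adjusted p-value
mat = pairwise_matrix(models, mean, p_holm)
print(mat[1][0])            # -log10(0.01) = 2.0
print(round(mat[0][1], 3))  # 0.1
```

On the $-\log_{10}$ scale, larger cell values mean stronger evidence, and the `alpha = 0.05` threshold sits at roughly 1.3.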

plot_critical_difference(*, alpha: float = 0.05, title: str | None = None) → tuple[matplotlib.pyplot.Figure, matplotlib.pyplot.Axes]

Demšar critical-difference (CD) diagram.

Computes mean ranks of every model across tasks and overlays the Nemenyi critical difference at level alpha. Models within CD of each other are joined by a thick horizontal “clique” bar (i.e., not significantly different). This is the canonical scalable visualization for multi-method, multi-dataset benchmarks (Demšar, 2006).

Only models present in every task are included. Requires at least two tasks and at least two such models.
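The critical difference itself follows the formula in Demšar (2006), $CD = q_\alpha \sqrt{k(k+1)/(6N)}$ for $k$ models and $N$ tasks. The sketch below evaluates it for made-up mean ranks; the helper names are hypothetical and only the first few $q_{0.05}$ values from Demšar's table are included.

```python
import math
from itertools import combinations

# Nemenyi critical values q_alpha at alpha = 0.05, indexed by number of models k
# (from Demšar, 2006; only k = 2..6 listed here).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}


def nemenyi_cd(k, n_tasks, q_alpha):
    """Critical difference in mean rank for k models compared over n_tasks tasks."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_tasks))


def indistinguishable_pairs(mean_ranks, cd):
    """Pairs whose mean ranks differ by less than the critical difference."""
    return [
        (a, b)
        for a, b in combinations(sorted(mean_ranks), 2)
        if abs(mean_ranks[a] - mean_ranks[b]) < cd
    ]


mean_ranks = {"A": 1.25, "B": 1.75, "C": 3.0}  # made-up mean ranks over 4 tasks
cd = nemenyi_cd(k=3, n_tasks=4, q_alpha=Q_ALPHA_005[3])
print(round(cd, 3))                             # 1.657
print(indistinguishable_pairs(mean_ranks, cd))  # [('A', 'B'), ('B', 'C')]
```

In this example only A and C differ by more than the critical difference, so a diagram would join A with B and B with C.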

plot_learning_curves(*, ci: float = 0.95, metric_key: str | None = None, pretty_tasks: Mapping[str, str] | None = None, ylabel: str = 'Test accuracy', legend_loc: str = 'lower right') → tuple[matplotlib.pyplot.Figure, numpy.ndarray]

Mean ± t-CI learning curves, one panel per task, sharing the y-axis.

class graphnetz.benchmark.ModelSpec(cls: type, task_type: frozenset[str] = <factory>, factory: Callable[[...], torch.nn.Module] | None = None)

Bases: object

How to instantiate a model and which task types it supports.

cls: type

task_type: frozenset[str]

factory: Callable[[...], torch.nn.Module] | None = None

build(in_channels: int, hidden_channels: int, out_channels: int, *, task_type: str = 'node_cls') → torch.nn.Module

class graphnetz.benchmark.Task(name: str, task_type: str, loader: Callable[[...], Any], epochs: int = 30)

Bases: object

A single benchmark task: a dataset loader plus its training task type.

name: str

task_type: str

loader: Callable[[...], Any]

epochs: int = 30

graphnetz.benchmark.iter_benchmark_tasks(category: str | None = None, task_type: str | None = None) → list[Task]

Flatten BENCHMARK_TASKS to a list, optionally filtered by category and/or task type.

Examples

>>> [
...     t.name
...     for t in iter_benchmark_tasks(category="biology", task_type="graph_cls")
... ]
['mutag', 'proteins']
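The flattening can be pictured with a small self-contained model of the registry. Everything here is a stand-in: `Task` is a simplified dataclass mirroring the fields documented on this page, `REGISTRY` is a toy replacement for BENCHMARK_TASKS, and the `"karate"` entry is made up.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Task:
    """Simplified stand-in for graphnetz.benchmark.Task."""

    name: str
    task_type: str
    loader: Callable[..., Any]
    epochs: int = 30


# Toy registry shaped as [category][task_type] -> list[Task].
REGISTRY = {
    "biology": {
        "graph_cls": [
            Task("mutag", "graph_cls", lambda: None),
            Task("proteins", "graph_cls", lambda: None),
        ],
    },
    "social": {
        "node_cls": [Task("karate", "node_cls", lambda: None)],
    },
}


def iter_tasks(registry, category=None, task_type=None):
    """Flatten the nested registry, skipping categories/types that do not match."""
    out = []
    for cat, by_type in registry.items():
        if category is not None and cat != category:
            continue
        for ttype, tasks in by_type.items():
            if task_type is not None and ttype != task_type:
                continue
            out.extend(tasks)
    return out


print([t.name for t in iter_tasks(REGISTRY, category="biology", task_type="graph_cls")])
# ['mutag', 'proteins']
print([t.name for t in iter_tasks(REGISTRY)])
# ['mutag', 'proteins', 'karate']
```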

graphnetz.benchmark.plot_benchmark(results: BenchmarkReport | Mapping[str, Mapping[str, Mapping[str, list[float]]]], errors: Mapping[str, Mapping[str, float]] | None = None, ax: plt.Axes | None = None, title: str | None = None, annotate: bool = True, ci: float = 0.95) → tuple[plt.Figure, plt.Axes]

Grouped bar chart with mean ± CI error bars.

Accepts a BenchmarkReport (preferred) or the legacy dict form for a single seed. errors overrides the default t-CI half-width.

graphnetz.benchmark.register_model(cls: type | None = None, *, task_type: str | Iterable[str], factory: Callable[[...], torch.nn.Module] | None = None) → Callable[[type], type] | type

Usable as a decorator (@register_model(task_type="node_cls")) or as a plain function (register_model(MyGNN, task_type={"graph_cls", "graph_reg"})).

graphnetz.benchmark.register_task(category: str, task_type: Task) → None

The task becomes visible to run_benchmark(category) and to iter_benchmark_tasks(). Use unregister_task() to remove it (e.g. in tearDown of a test).

Run a benchmark across one or more (model, task, seed) combinations.

Two ways to choose tasks:

By category (default) – tasks come from BENCHMARK_TASKS indexed as [category][task_type] -> list[Task]. Pass category="social" (etc.) and optionally restrict with task_type and only=.
Ad-hoc – pass tasks=[Task(...), ...] to bypass the registry entirely. Useful for benchmarking custom datasets without mutating global state. category then defaults to "custom" and is used only to namespace root/cache directories.

The runner trains every compatible (model, task) pair across each value in seeds (default (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)) and aggregates the per-seed histories into a BenchmarkReport.

graphnetz.benchmark.save_figure(fig: matplotlib.pyplot.Figure, path: str | Path, formats: Sequence[str] = ('pdf', 'png'), dpi: int = 300) → list[Path]

Save fig to one path stem in multiple formats; returns the saved paths.

graphnetz.benchmark.task_from_dataset(name: str, task_type: str, dataset: Any, *, epochs: int = 30) → Task

Wrap an already-loaded dataset as a Task.

The dataset must satisfy the conventions for the given task_type: a PyG dataset or any object exposing ds[0] plus the relevant attributes (num_features / num_classes / num_relations). The benchmark dispatcher caches the dataset, so the same instance is reused across seeds without reloading.
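Conceptually, wrapping a loaded dataset amounts to building a Task whose loader returns the object it was given. The sketch below illustrates that idea under stated assumptions: `Task` is a simplified stand-in, `wrap_dataset` and `ToyDataset` are hypothetical, and the real function may accept loader arguments or perform validation that is not shown here.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Task:
    """Simplified stand-in for graphnetz.benchmark.Task."""

    name: str
    task_type: str
    loader: Callable[..., Any]
    epochs: int = 30


def wrap_dataset(name, task_type, dataset, *, epochs=30):
    """Return a Task whose loader hands back the already-loaded dataset."""

    def loader(*args, **kwargs):
        # Arguments are ignored: the dataset is already in memory.
        return dataset

    return Task(name, task_type, loader, epochs)


class ToyDataset:
    """Hypothetical dataset exposing the attributes a node-classification task needs."""

    num_features = 16
    num_classes = 3

    def __getitem__(self, index):
        return {"x": [[0.0] * self.num_features]}


ds = ToyDataset()
task = wrap_dataset("toy", "node_cls", ds, epochs=5)
print(task.loader() is ds)  # True: every call returns the same instance
```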

graphnetz.benchmark.unregister_task(category: str, name: str) → Task | None

Remove a previously registered task; returns it, or None if absent.
