graphnetz.benchmark

Statistically robust benchmarks across a category for one or many models.

The dispatcher trains every compatible (model, task) pair across multiple seeds and returns a BenchmarkReport that exposes mean ± 95 % t-CI, paired t-tests with Holm-Bonferroni correction, publication-ready LaTeX tables, and plots.

Custom models are plugged in via the same three paths as before:

  1. Decorator / registry:

    import torch

    from graphnetz import register_model


    @register_model(task_type="node_cls")
    class MyGNN(torch.nn.Module):
        def __init__(self, in_channels, hidden_channels, out_channels): ...
    
  2. Class attribute:

    class MyGNN(torch.nn.Module):
        task_types = {"node_cls"}
    
  3. Inline tuple (cls, tasks) or (cls, tasks, factory) in the models mapping:

    from graphnetz.benchmark import run_benchmark

    run_benchmark("social", {"MyGNN": (MyGNN, "node_cls")})
    

The default factory calls cls(in_channels, hidden_channels, out_channels); DGI-task models receive (in_channels, hidden_channels) (the third argument is dropped).
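The argument-dropping rule can be pictured with a minimal sketch. The helper default_build, the toy classes, and the task-type string "dgi" are hypothetical stand-ins rather than graphnetz names; only the two call shapes come from the paragraph above.

```python
class Classifier:
    """Toy model taking the full three-argument signature."""

    def __init__(self, in_channels, hidden_channels, out_channels):
        self.shape = (in_channels, hidden_channels, out_channels)


class Encoder:
    """Toy DGI-style encoder that has no output dimension."""

    def __init__(self, in_channels, hidden_channels):
        self.shape = (in_channels, hidden_channels)


def default_build(cls, in_channels, hidden_channels, out_channels, *, task_type="node_cls"):
    # Assumption: DGI tasks are identified by the string "dgi".
    if task_type == "dgi":
        # The third argument is dropped for DGI-task models.
        return cls(in_channels, hidden_channels)
    return cls(in_channels, hidden_channels, out_channels)


clf = default_build(Classifier, 16, 64, 7)
enc = default_build(Encoder, 16, 64, 7, task_type="dgi")
print(clf.shape, enc.shape)
```

A model whose constructor does not fit either shape would need the explicit factory slot of the inline tuple or of register_model.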

class graphnetz.benchmark.BenchmarkReport(seeds: tuple[int, ...], histories: dict[str, dict[str, list[dict[str, list[float]]]]], config: dict[str, typing.Any] = <factory>, ci_method: str = 't', bootstrap_n: int = 10000, bootstrap_seed: int = 0, pairwise_method: str = 't')[source]

Bases: object

Structured outcome of a multi-seed benchmark run.

histories[task][model] is a list with one history dict per seed (in seed order). The report is also a read-only mapping task -> {model: history_seed_0} for backward compatibility with single-seed callers.
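The nesting is easier to see on plain dictionaries. The sketch below does not use graphnetz; the task name "cora", the model names, and the metric key "val_acc" are illustrative assumptions.

```python
seeds = (0, 1)

# histories[task][model] -> one history dict per seed, in seed order.
histories = {
    "cora": {
        "GCN": [
            {"val_acc": [0.70, 0.80]},  # seed 0
            {"val_acc": [0.72, 0.79]},  # seed 1
        ],
        "GAT": [
            {"val_acc": [0.68, 0.81]},  # seed 0
            {"val_acc": [0.71, 0.82]},  # seed 1
        ],
    },
}

# Every (task, model) entry holds exactly one history per seed.
per_seed_counts = {
    (task, model): len(runs)
    for task, per_model in histories.items()
    for model, runs in per_model.items()
}

# Backward-compatible view: task -> {model: history of the first seed}.
first_seed_view = {
    task: {model: runs[0] for model, runs in per_model.items()}
    for task, per_model in histories.items()
}
print(first_seed_view["cora"]["GCN"])
```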

seeds: tuple[int, ...]
histories: dict[str, dict[str, list[dict[str, list[float]]]]]
config: dict[str, Any]
ci_method: str = 't'
bootstrap_n: int = 10000
bootstrap_seed: int = 0
pairwise_method: str = 't'
items()[source]
keys()[source]
values()[source]
final_metrics(key: str | None = None) dict[str, dict[str, list[float]]][source]

Final metric value per (task, model, seed).
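As a rough illustration of the return shape, the sketch below reduces each per-seed history to a single number. It assumes that "final" means the last recorded value of the chosen key, which the text above does not state; final_values and the sample data are hypothetical.

```python
def final_values(histories, key):
    """Collapse histories[task][model][seed][key] to its last entry per seed."""
    return {
        task: {
            model: [run[key][-1] for run in runs]
            for model, runs in per_model.items()
        }
        for task, per_model in histories.items()
    }


# Hypothetical regression task with two seeds and three epochs each.
sample = {
    "zinc": {
        "GIN": [
            {"val_mae": [0.52, 0.41, 0.38]},
            {"val_mae": [0.55, 0.43, 0.40]},
        ],
    },
}
finals = final_values(sample, "val_mae")
print(finals)
```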

metric_name() str[source]
summary(ci: float = 0.95, method: str | None = None) pandas.DataFrame[source]

Per-(task, model) mean, std, sem, CI half-width and bounds.

method overrides self.ci_method for this call only; choose "t" for Student’s-t intervals (default) or "bootstrap" for percentile-bootstrap intervals (better for non-Gaussian metrics such as Hits@K, MRR, or AUC).
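The two interval types can be compared on a toy sample. This is a standard-library sketch, not the library's implementation: the values are made up, the t critical value is hard-coded for five seeds, and percentile_bootstrap_ci is a hypothetical helper whose defaults merely echo bootstrap_n and bootstrap_seed.

```python
import math
import random
import statistics

values = [0.81, 0.79, 0.83, 0.80, 0.82]  # hypothetical per-seed final metrics
n = len(values)
mean = statistics.fmean(values)
std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
sem = std / math.sqrt(n)

# Student's t interval: two-sided 95 % critical value for n - 1 = 4 degrees of freedom.
T_CRIT_95_DF4 = 2.776
t_half = T_CRIT_95_DF4 * sem
t_ci = (mean - t_half, mean + t_half)


def percentile_bootstrap_ci(xs, ci=0.95, n_resamples=10_000, seed=0):
    """Percentile bootstrap interval for the mean of xs."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(xs, k=len(xs))) for _ in range(n_resamples)
    )
    lo_idx = int((1 - ci) / 2 * n_resamples)
    hi_idx = int((1 + ci) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


boot_lo, boot_hi = percentile_bootstrap_ci(values)
print(f"t-CI half-width: {t_half:.4f}")
print(f"bootstrap CI: ({boot_lo:.4f}, {boot_hi:.4f})")
```

The bootstrap interval is bounded by the observed values and need not be symmetric about the mean, which is why it can suit bounded or skewed metrics better than the t interval.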

pairwise(alpha: float = 0.05, method: str | None = None) pandas.DataFrame[source]

Paired pairwise tests between models per task with Holm adjustment.

method overrides self.pairwise_method for this call only:

  • "t" (default) – paired Student’s t-test on per-seed final metrics.

  • "wilcoxon" – non-parametric Wilcoxon signed-rank test on the paired differences. Recommended at small seed counts where the paired t-test’s normality assumption is most fragile; see Benavoli et al., JMLR 17(5):1-36, 2016.
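The Holm step-down adjustment applied to the pairwise p-values can be sketched in a few lines. holm_adjust is a hypothetical helper written for illustration, and the raw p-values are made up.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    # Visit hypotheses from the smallest to the largest raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The multiplier shrinks from m down to 1 as the rank increases.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]
adj = holm_adjust(raw)
print(adj)
```

In this example only the first comparison stays below alpha = 0.05 after adjustment, although all three raw p-values were below it.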

friedman(alpha: float = 0.05) dict[str, float | int | bool][source]

Friedman omnibus test on per-task ranks of seed-mean metrics.

Returns a dict with the statistic chi2, the asymptotic $\chi^2_{k-1}$ p-value, the rejection flag at alpha, and the $(k, N)$ shape used. The Nemenyi post-hoc test surfaced in plot_critical_difference() should only be interpreted when rejected is true (Demšar, 2006).

Only models present in every task are included; per-task ranks use the metric direction (lower-is-better for val_mae and train_loss).
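For orientation, the Friedman statistic can be computed by hand on a small table of seed-mean metrics. The sketch below is a standard-library illustration without tie correction; the scores are made up, and the closed-form p-value is valid only for k = 3 models (2 degrees of freedom).

```python
import math


def average_ranks(scores, higher_is_better=True):
    """Rank scores within one task (1 = best); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=higher_is_better)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = avg
        pos = end + 1
    return ranks


# Rows are tasks (N = 4), columns are models (k = 3); higher is better.
seed_means = [
    [0.80, 0.75, 0.70],
    [0.85, 0.80, 0.78],
    [0.90, 0.91, 0.85],
    [0.70, 0.65, 0.60],
]
n_tasks, k = len(seed_means), len(seed_means[0])
rank_rows = [average_ranks(row) for row in seed_means]
rank_sums = [sum(row[j] for row in rank_rows) for j in range(k)]

chi2 = 12.0 / (n_tasks * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n_tasks * (k + 1)
# For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
p_value = math.exp(-chi2 / 2)
rejected = p_value < 0.05
print(chi2, round(p_value, 4), rejected)
```

For a lower-is-better metric such as val_mae, the same helper would be called with higher_is_better=False.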

to_latex(path: str | Path, *, ci: float = 0.95, bold_best: bool = True, pretty_tasks: Mapping[str, str] | None = None, caption: str | None = None, label: str | None = None, method: str | None = None) Path[source]

Booktabs LaTeX table of mean ± CI half-width with bold-best per task.

method overrides self.ci_method ("t" or "bootstrap").

pairwise_to_latex(path: str | Path, *, alpha: float = 0.05, caption: str | None = None, label: str | None = None, method: str | None = None) Path[source]

LaTeX booktabs table of pairwise Holm-adjusted p-values.

method overrides self.pairwise_method ("t" or "wilcoxon") for this call only.

plot(ax: plt.Axes | None = None, *, ci: float = 0.95, ylabel: str | None = None, title: str | None = None, annotate: bool = True, pretty_tasks: Mapping[str, str] | None = None) tuple[plt.Figure, plt.Axes][source]

Grouped bar chart of mean ± CI half-width across seeds.

plot_forest(ax: plt.Axes | None = None, *, ci: float = 0.95, pretty_tasks: Mapping[str, str] | None = None, xlabel: str | None = None, height_per_task: float = 0.42, sort_within: bool = False, band: bool = True) tuple[plt.Figure, plt.Axes][source]

Forest plot, one row per task with models jittered within the row.

Height scales with the number of tasks only – adding more models widens the within-row jitter rather than adding new rows – so the figure stays compact for many models.

sort_within=True orders the jittered positions per-task so the best mean lands at the top of the row (helps spot leaders when there are many models). Each model keeps a stable colour across tasks.

band=True shades alternating task rows (banded reading aid).

plot_pairwise(ax: plt.Axes | None = None, *, ci: float = 0.95, alpha: float = 0.05, pretty_tasks: Mapping[str, str] | None = None, layout: str = 'matrix', max_cols: int = 3, method: str | None = None) tuple[plt.Figure, Any][source]

Pairwise comparison plot, with two layouts that scale differently.

layout="matrix" (default) – one significance heatmap per task, with the lower triangle holding $-\log_{10}(p_{\text{Holm}})$ and the upper triangle holding the signed mean difference. Scales to many models and many tasks (panels arranged in a grid with at most max_cols columns).

layout="list" – one row per pairwise comparison with CI whiskers and a significance marker. Best for small numbers of comparisons.

method overrides self.pairwise_method ("t" or "wilcoxon") for this call only.

plot_critical_difference(*, alpha: float = 0.05, title: str | None = None) tuple[matplotlib.pyplot.Figure, matplotlib.pyplot.Axes][source]

Demšar critical-difference (CD) diagram.

Computes mean ranks of every model across tasks and overlays the Nemenyi critical difference at level alpha. Models within CD of each other are joined by a thick horizontal “clique” bar (i.e., not significantly different). This is the canonical scalable visualization for multi-method, multi-dataset benchmarks (Demšar, 2006).

Only models present in every task are included. Requires at least two tasks and at least two such models.
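The Nemenyi critical difference follows Demšar's formula CD = q_alpha * sqrt(k(k+1) / (6N)) for k models and N tasks. The sketch below is illustrative only: the mean ranks are made up, nemenyi_cd is a hypothetical helper, and the q values are the alpha = 0.05 entries from Demšar (2006) for k = 2 to 5.

```python
import math
from itertools import combinations

# Critical values q_0.05 for the two-tailed Nemenyi test (Demšar, 2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def nemenyi_cd(k, n_tasks, q_table=Q_ALPHA_005):
    """Critical difference in mean rank for k models compared over n_tasks tasks."""
    return q_table[k] * math.sqrt(k * (k + 1) / (6.0 * n_tasks))


mean_ranks = {"A": 1.25, "B": 1.75, "C": 3.0}  # hypothetical mean ranks over 4 tasks
cd = nemenyi_cd(k=len(mean_ranks), n_tasks=4)

# Pairs whose mean ranks differ by at most CD are not significantly different.
not_different = [
    (a, b)
    for a, b in combinations(mean_ranks, 2)
    if abs(mean_ranks[a] - mean_ranks[b]) <= cd
]
print(round(cd, 4), not_different)
```

Here A and C differ by more than CD, while A-B and B-C do not, so a diagram would draw two overlapping clique bars.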

plot_learning_curves(*, ci: float = 0.95, metric_key: str | None = None, pretty_tasks: Mapping[str, str] | None = None, ylabel: str = 'Test accuracy', legend_loc: str = 'lower right') tuple[matplotlib.pyplot.Figure, numpy.ndarray][source]

Mean ± t-CI learning curves, one panel per task, with a shared y-axis.

class graphnetz.benchmark.ModelSpec(cls: type, task_type: frozenset[str] = <factory>, factory: collections.abc.Callable[[...], torch.nn.Module] | None = None)[source]

Bases: object

How to instantiate a model and which task types it supports.

cls: type
task_type: frozenset[str]
factory: Callable[[...], torch.nn.Module] | None = None
build(in_channels: int, hidden_channels: int, out_channels: int, *, task_type: str = 'node_cls') torch.nn.Module[source]
class graphnetz.benchmark.Task(name: str, task_type: str, loader: Callable[[...], Any], epochs: int = 30)[source]

Bases: object

A single benchmark task: a dataset loader plus its task type.

name: str
task_type: str
loader: Callable[[...], Any]
epochs: int = 30
graphnetz.benchmark.iter_benchmark_tasks(category: str | None = None, task_type: str | None = None) list[Task][source]

Flatten BENCHMARK_TASKS to a list, optionally filtered by category and/or task type.

Examples

>>> [
...     t.name
...     for t in iter_benchmark_tasks(category="biology", task_type="graph_cls")
... ]
['mutag', 'proteins']
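The flattening itself can be sketched on a stand-in registry laid out as [category][task_type] -> list of tasks. The Task dataclass below only mirrors the documented fields, and REGISTRY, flatten, and the "toy_social" entry are hypothetical; this is not the library's code.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Task:
    """Stand-in mirroring the documented Task fields."""

    name: str
    task_type: str
    loader: Callable[..., Any]
    epochs: int = 30


REGISTRY = {
    "biology": {
        "graph_cls": [Task("mutag", "graph_cls", dict), Task("proteins", "graph_cls", dict)],
    },
    "social": {
        "node_cls": [Task("toy_social", "node_cls", dict)],
    },
}


def flatten(registry, category=None, task_type=None):
    """Return every task, keeping only the requested category and/or task type."""
    out = []
    for cat, by_type in registry.items():
        if category is not None and cat != category:
            continue
        for ttype, tasks in by_type.items():
            if task_type is not None and ttype != task_type:
                continue
            out.extend(tasks)
    return out


names = [t.name for t in flatten(REGISTRY, category="biology", task_type="graph_cls")]
print(names)
```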
graphnetz.benchmark.plot_benchmark(results: BenchmarkReport | Mapping[str, Mapping[str, Mapping[str, list[float]]]], errors: Mapping[str, Mapping[str, float]] | None = None, ax: plt.Axes | None = None, title: str | None = None, annotate: bool = True, ci: float = 0.95) tuple[plt.Figure, plt.Axes][source]

Grouped bar chart with mean ± CI error bars.

Accepts a BenchmarkReport (preferred) or the legacy dict form for a single seed. errors overrides the default t-CI half-width.

graphnetz.benchmark.register_model(cls: type | None = None, *, task_type: str | Iterable[str], factory: Callable[[...], torch.nn.Module] | None = None) Callable[[type], type] | type[source]

Register a model with the benchmark dispatcher.

Usable as a decorator (@register_model(task_type="node_cls")) or as a plain function (register_model(MyGNN, task_type={"graph_cls", "graph_reg"})).

graphnetz.benchmark.register_task(category: str, task_type: Task) None[source]

Register a Task (passed as task_type) under category in BENCHMARK_TASKS.

The task becomes visible to run_benchmark(category) and to iter_benchmark_tasks(). Use unregister_task() to remove it (e.g. in tearDown of a test).

graphnetz.benchmark.run_benchmark(category: str | None = None, models: type | tuple[Any, ...] | ModelSpec | dict[str, type | tuple[Any, ...] | ModelSpec] | None = None, root: str = 'data/benchmark', hidden_channels: int = 64, epochs: int | None = None, only: list[str] | None = None, verbose: bool = True, seeds: int | Iterable[int] | None = None, seed: int | None = None, task_type: str | None = None, tasks: Iterable[Task] | None = None, device: torch.device | str | None = 'auto') BenchmarkReport[source]

Run a benchmark across one or more (model, task, seed) combinations.

Two ways to choose tasks:

  1. By category (default) – tasks come from BENCHMARK_TASKS indexed as [category][task_type] -> list[Task]. Pass category="social" (etc.) and optionally restrict with task_type and only=.

  2. Ad-hoc – pass tasks=[Task(...), ...] to bypass the registry entirely. Useful for benchmarking custom datasets without mutating global state. category then defaults to "custom" and is used only to namespace the cache directories under root/.

The runner trains every compatible (model, task) pair across each value in seeds (default (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)) and aggregates the per-seed histories into a BenchmarkReport.

graphnetz.benchmark.save_figure(fig: matplotlib.pyplot.Figure, path: str | Path, formats: Sequence[str] = ('pdf', 'png'), dpi: int = 300) list[Path][source]

Save fig to one path stem in multiple formats; returns the saved paths.

graphnetz.benchmark.task_from_dataset(name: str, task_type: str, dataset: Any, *, epochs: int = 30) Task[source]

Wrap an already-loaded dataset as a Task.

The dataset must satisfy the conventions for task_type: a PyG dataset or any object supporting indexing (dataset[0]) plus the relevant attributes (num_features / num_classes / num_relations). The benchmark dispatcher caches the dataset, so the same instance is reused across seeds without reloading.
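One way to picture the wrapping is a loader closure that hands back the same in-memory object on every call. Everything below is a hypothetical stand-in: the Task dataclass mirrors the documented fields, ToyDataset fakes the expected attributes, and wrap_dataset assumes the loader may ignore whatever arguments it is called with.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Task:
    """Stand-in mirroring the documented Task fields."""

    name: str
    task_type: str
    loader: Callable[..., Any]
    epochs: int = 30


class ToyDataset:
    """Minimal object exposing indexing plus the attributes named above."""

    num_features = 8
    num_classes = 3

    def __getitem__(self, idx):
        return {"x": [[0.0] * self.num_features]}


def wrap_dataset(name, task_type, dataset, *, epochs=30):
    def loader(*args, **kwargs):
        # Assumption: arguments such as a root directory are irrelevant here,
        # because the data is already loaded.
        return dataset

    return Task(name, task_type, loader, epochs)


ds = ToyDataset()
task = wrap_dataset("toy", "node_cls", ds)
print(task.name, task.loader() is ds)
```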

graphnetz.benchmark.unregister_task(category: str, name: str) Task | None[source]

Remove a previously registered task; returns it, or None if absent.