Dataset taxonomy

GraphNetz organises 63 loaders into 10 scientific categories, and each loader declares the task types it can serve. The taxonomy is the single source of truth: the curated benchmark, the per-category notebooks, and the documented loader names all derive from graphnetz.datasets.LOADER_REGISTRY.

from graphnetz.datasets import LOADER_REGISTRY, list_datasets

list_datasets(category="biology")
# {'biology': {'graph_cls': [...], 'graph_reg': [...], 'node_cls': [...], 'link_pred': [...]}}
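
The nested mapping can be flattened into per-task counts. The sketch below runs on a small hand-written stand-in with the same shape as the comment above (category, then task type, then loader names). The loader names in it are placeholders for illustration, not the library's real output.

```python
from collections import Counter

# Stand-in for the result of list_datasets(category="biology").
# The nested shape follows the comment above; the entries are illustrative
# placeholders, not what the library actually returns.
catalogue = {
    "biology": {
        "graph_cls": ["mutag", "proteins", "enzymes"],
        "graph_reg": ["peptides_struct"],
        "node_cls": [],
        "link_pred": ["celegans"],
    }
}


def count_by_task(catalogue):
    """Count loader names per task type across all categories."""
    counts = Counter()
    for tasks in catalogue.values():
        for task_type, names in tasks.items():
            counts[task_type] += len(names)
    return dict(counts)


task_counts = count_by_task(catalogue)
print(task_counts)
```

With the real catalogue, the same function could be applied to the value returned by `list_datasets`, assuming it keeps the shape shown above.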

Tasks

Every benchmark cell, identified by the tuple (category, task_type, dataset, model, seed), maps to one of four task families. The report headlines the default metric of each family:

Symbol	Kind	Default metric	Adapter
`node_cls`	Node classification	test accuracy	encoder used directly
`graph_cls`	Graph classification	val accuracy	mean-pool + linear head
`graph_reg`	Graph regression	val MAE	mean-pool + linear head
`link_pred`	Link prediction	test ROC-AUC	dot-product (homogeneous) or DistMult (relational) decoder
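
To make the Adapter column concrete, here is a minimal pure-Python sketch of the adapter kinds named in the table: mean pooling followed by a linear head, a dot-product edge score, and a DistMult edge score. The function names and toy numbers are assumptions made for illustration; GraphNetz's own adapter classes are not shown on this page and may differ.

```python
def mean_pool(node_embeddings):
    """Average node embeddings (equal-length vectors) into one graph vector."""
    n = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(vec[d] for vec in node_embeddings) / n for d in range(dim)]


def linear_head(x, weight, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def dot_product_score(z_u, z_v):
    """Homogeneous link score: inner product of the two endpoint embeddings."""
    return sum(a * b for a, b in zip(z_u, z_v))


def distmult_score(z_u, rel, z_v):
    """Relational link score (DistMult): sum over d of z_u[d] * rel[d] * z_v[d]."""
    return sum(a * r * b for a, r, b in zip(z_u, rel, z_v))


# Toy encoder output for a 3-node graph with 2-dimensional embeddings.
z = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

graph_vec = mean_pool(z)                                   # [1.0, 1.0]
logits = linear_head(graph_vec, [[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0])  # [0.0, 2.0]
edge_score = dot_product_score(z[0], z[2])                 # 2.0
rel_score = distmult_score(z[0], [0.5, 3.0], z[2])         # 1.0
```

For `node_cls` no adapter is needed, since the encoder's per-node output is used directly.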

Note

Unlabelled graphs (Netzschleuder networks, synthetic combinatorial instances, the Ising lattice) enter the benchmark through link_pred on a held-out edge split, so every cell carries a real held-out metric — there is no self-supervised pretext loss in the headline report.
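
A held-out edge split and the ROC-AUC computed on it can be sketched as follows. This is a generic illustration on a toy edge list; the split fraction, the seed handling, and the function names are assumptions, not GraphNetz's actual procedure.

```python
import random


def split_edges(edges, test_fraction=0.2, seed=0):
    """Shuffle edges with a fixed seed and hold out a fraction for evaluation."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def roc_auc(pos_scores, neg_scores):
    """ROC-AUC as the chance that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy graph: a 6-node ring plus one chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (0, 3)]
train_edges, test_edges = split_edges(edges, test_fraction=0.3, seed=42)

# In a real run the scores come from a decoder applied to embeddings trained
# on train_edges only. The numbers here are made up to exercise the metric.
auc = roc_auc(pos_scores=[0.9, 0.7], neg_scores=[0.8, 0.1])
print(len(train_edges), len(test_edges), auc)
```

The key property is that the held-out edges never reach the encoder during training, so the reported score measures generalisation rather than reconstruction.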

Categories

Task abbreviations: NC = `node_cls`, GC = `graph_cls`, GR = `graph_reg`, LP = `link_pred`.

Category	#	Tasks	Loaders
Combinatorial	6	LP	random TSP, VRP, max-flow, bipartite matching, coloring, max-cut
Biology	12	GC · GR · LP	MUTAG, PROTEINS, ENZYMES, Peptides-func/struct, PPI, C. elegans, Budapest connectome, hospital / high-school contacts, ogbg-molhiv†, ogbg-molpcba†
Social	16	NC · LP	Cora, CiteSeer, PubMed, WikiCS, Roman-empire, Amazon-ratings, Minesweeper, Tolokers, Questions, MovieLens-100k, Karate, Facebook friends, DBLP coauthor, DNC emails, ogbn-arxiv†, ogbl-collab†
Knowledge	3	LP	FB15k-237, WordNet18-RR, WordNet (Netzschleuder)
Infrastructure	6	LP	power grid, EuroRoad, US roads, EU airlines, London transport, urban streets
Finance	5	NC · LP	Elliptic Bitcoin, product space, board of directors, US patents, ogbn-products†
Computing	4	LP	Internet AS, Internet topology, AS-Skitter, route views
Vision	5	GC · NC	MNIST/CIFAR-10 superpixels, ModelNet10/40, ShapeNet
Physics	3	GR · LP	QM9, ZINC, Ising lattice
Security	3	GC · LP	MalNet-Tiny, 9/11 terrorists, train terrorists

Note

Loaders marked with † come from OGB and require the optional ogb extra (pip install graphnetz[ogb]): ogbn-arxiv and ogbl-collab in Social, ogbn-products in Finance, and ogbg-molhiv and ogbg-molpcba in Biology. They are folded into their domain categories rather than exposed as a separate ogb category, so they appear in run_benchmark(category, ...) alongside the curated built-ins. Without the extra installed, the catalogue exposes 58 of the 63 loaders.
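
Whether the optional extra is present can be checked without importing it. The sketch below uses the standard library's `importlib.util.find_spec`; the two constants restate the counts from this page, and the helper names are hypothetical rather than part of GraphNetz.

```python
import importlib.util

TOTAL_LOADERS = 63  # full catalogue size stated on this page
OGB_LOADERS = 5     # ogbn-arxiv, ogbl-collab, ogbn-products, ogbg-molhiv, ogbg-molpcba


def ogb_available():
    """Return True if the `ogb` package can be found in this environment."""
    return importlib.util.find_spec("ogb") is not None


def expected_loader_count():
    """Number of loaders the catalogue should expose given the installed extras."""
    return TOTAL_LOADERS if ogb_available() else TOTAL_LOADERS - OGB_LOADERS


print(f"ogb installed: {ogb_available()}, expected loaders: {expected_loader_count()}")
```

A check like this is handy in test suites that should skip the † loaders when the extra is missing.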

Loading individual datasets

Each category module exposes thin loader functions that return a PyG dataset:

from graphnetz.datasets.social import cora, roman_empire
from graphnetz.datasets.biology import mutag, peptides_func
from graphnetz.datasets.computing import internet_as

ds_cora = cora("data/cora")                       # node_cls + link_pred
ds_rom  = roman_empire("data/roman_empire")       # heterophilic node_cls
ds_mut  = mutag("data/mutag")                     # graph_cls
ds_pep  = peptides_func("data/peptides_func")     # LRGB graph_cls
ds_inet = internet_as("data/internet_as")         # link_pred

# Optional OGB loaders live in their domain modules
# (require `pip install graphnetz[ogb]`):
from graphnetz.datasets.social import ogbn_arxiv, ogbl_collab
from graphnetz.datasets.biology import ogbg_molhiv

ds_arxiv  = ogbn_arxiv("data/ogb")                # node_cls (in Social)
ds_collab = ogbl_collab("data/ogb")               # link_pred (in Social)
ds_hiv    = ogbg_molhiv("data/ogb")               # graph_cls (in Biology)

The first call downloads and processes the dataset into the directory you pass; subsequent calls read from the on-disk cache.

Arbitrary Netzschleuder networks

The Netz loader fetches any network from the Netzschleuder catalogue on demand and converts it to the PyG format used by the rest of the library:

from graphnetz import Netz

ds = Netz(root="data", dataset_name="urban_streets", network_name="brasilia")
data = ds[0]   # PyG Data object

# Multiplex / transit / airline networks need parallel-edge support:
ds_air = Netz(
    root="data",
    dataset_name="eu_airlines",
    network_name="eu_airlines",
    multigraph=True,
)
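
Why parallel-edge support matters can be seen on a toy edge list: collapsing to a simple graph drops repeated connections between the same pair of nodes, which in a multiplex network stand for distinct layers or routes. This is a generic plain-Python illustration with made-up layer names; it does not show how `Netz` handles `multigraph=True` internally.

```python
from collections import Counter

# Toy multiplex edge list: (source, target, layer).
# Two hypothetical airlines both fly 0 -> 1.
edges = [
    (0, 1, "airline_a"),
    (0, 1, "airline_b"),
    (1, 2, "airline_a"),
]

# Simple-graph view: one edge per node pair, layer information lost.
simple_edges = sorted({(u, v) for u, v, _ in edges})

# Multigraph view: every parallel edge is kept.
multi_edges = [(u, v) for u, v, _ in edges]
multiplicity = Counter(multi_edges)

print(simple_edges)          # [(0, 1), (1, 2)]
print(multiplicity[(0, 1)])  # 2
```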

Choosing a dataset

If you want…	Try
A small node-classification sanity check	`social.cora`, `social.citeseer`
Heterophilic node classification	`social.roman_empire`, `social.minesweeper`
Long-range graph classification	`biology.peptides_func` (LRGB)
Molecular regression	`physics.zinc`, `physics.qm9`
Knowledge-graph link prediction	`knowledge.fb15k_237`, `knowledge.wordnet18rr`
A real-world infrastructure network	`infrastructure.power_grid`, `infrastructure.euroroad`
Synthetic, deterministic, fast	`combinatorial.random_coloring`, `combinatorial.random_tsp`

Adding a new loader

See Contributing → Adding a dataset loader. The short version:

1. Write a thin loader function under the right category module.
2. Register it in LOADER_REGISTRY for each task it supports.
3. Optionally add a Task(...) to BENCHMARK_TASKS for the curated run.
4. Add a one-line smoke test in tests/test_smoke.py.
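
The registration and smoke-test steps might look roughly like the sketch below. It uses a local stand-in dictionary shaped like the `list_datasets` output shown earlier. The real structure of `LOADER_REGISTRY`, the `register` helper, and the `my_network` loader are all hypothetical, so follow the contributing guide for the actual API.

```python
# Stand-in registry: category -> task type -> loader names.
# The real LOADER_REGISTRY may be structured differently.
registry = {"infrastructure": {"link_pred": ["power_grid", "euroroad"]}}


def my_network(root):
    """Hypothetical thin loader; a real one would return a PyG dataset."""
    return {"root": root, "name": "my_network"}


def register(registry, category, task_types, name):
    """Add a loader name under every task type it supports, without duplicates."""
    tasks = registry.setdefault(category, {})
    for task_type in task_types:
        names = tasks.setdefault(task_type, [])
        if name not in names:
            names.append(name)


register(registry, "infrastructure", ["link_pred", "node_cls"], "my_network")


def test_my_network_smoke():
    """Smoke test: the loader is registered and can be called."""
    assert "my_network" in registry["infrastructure"]["link_pred"]
    assert my_network("data/my_network")["name"] == "my_network"


test_my_network_smoke()
```

Registering under each supported task type separately mirrors step 2 above, and keeping the helper idempotent means repeated imports do not add duplicate entries.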