Dataset taxonomy
GraphNetz organises 63 loaders across 10 scientific categories, each declaring the task types it can serve. The taxonomy is the single source of truth: the curated benchmark, the per-category notebooks, and the documented loader names all derive from graphnetz.datasets.LOADER_REGISTRY.
```python
from graphnetz.datasets import LOADER_REGISTRY, list_datasets

list_datasets(category="biology")
# {'biology': {'graph_cls': [...], 'graph_reg': [...], 'node_cls': [...], 'link_pred': [...]}}
```
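To make the shape of that return value concrete, here is a minimal, self-contained sketch of a registry keyed by category and then by task type. `TOY_REGISTRY`, `list_toy_datasets`, and `count_loaders` are hypothetical stand-ins written for illustration, not the library's implementation; the loader names and task assignments are the ones used in the loading examples further down this page.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a loader registry: category -> task type -> loader names.
TOY_REGISTRY: Dict[str, Dict[str, List[str]]] = {
    "biology": {"graph_cls": ["mutag", "peptides_func"]},
    "social": {"node_cls": ["cora", "roman_empire"], "link_pred": ["cora"]},
    "computing": {"link_pred": ["internet_as"]},
}


def list_toy_datasets(category: Optional[str] = None) -> Dict[str, Dict[str, List[str]]]:
    """Return the whole registry, or only the requested category."""
    if category is None:
        return TOY_REGISTRY
    return {category: TOY_REGISTRY[category]}


def count_loaders(registry: Dict[str, Dict[str, List[str]]]) -> int:
    """Count distinct (category, loader) pairs; a loader serving two tasks counts once."""
    pairs = {
        (category, name)
        for category, tasks in registry.items()
        for names in tasks.values()
        for name in names
    }
    return len(pairs)


biology_only = list_toy_datasets(category="biology")
total = count_loaders(TOY_REGISTRY)
```

Counting distinct pairs rather than list entries matters because one loader can be registered under several task types, as `cora` is here.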
Tasks
Every cell in the benchmark — (category, task_type, dataset, model, seed) —
maps to one of four task families. The default metric for each is what the
report headlines:
| Symbol | Kind | Default metric | Adapter |
|---|---|---|---|
| NC (node_cls) | Node classification | test accuracy | encoder used directly |
| GC (graph_cls) | Graph classification | val accuracy | mean-pool + linear head |
| GR (graph_reg) | Graph regression | val MAE | mean-pool + linear head |
| LP (link_pred) | Link prediction | test ROC-AUC | dot-product (homogeneous) or DistMult (relational) decoder |
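The adapters named in the table are standard constructions, and the arithmetic behind them can be sketched in a few lines. The following is an illustrative pure-Python version, assuming embeddings are plain lists of floats; the function names are hypothetical, and the library's own adapters presumably operate on tensors.

```python
from typing import List, Sequence


def mean_pool(node_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average all node embeddings into a single graph-level vector."""
    n = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(row[d] for row in node_embeddings) / n for d in range(dim)]


def linear_head(x: Sequence[float], weights: Sequence[Sequence[float]],
                bias: Sequence[float]) -> List[float]:
    """Apply y = W x + b, with one row of `weights` per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def dot_product_score(z_u: Sequence[float], z_v: Sequence[float]) -> float:
    """Homogeneous link score: inner product of the two node embeddings."""
    return sum(a * b for a, b in zip(z_u, z_v))


def distmult_score(z_h: Sequence[float], rel: Sequence[float],
                   z_t: Sequence[float]) -> float:
    """Relational link score: sum_i head[i] * relation[i] * tail[i]."""
    return sum(h * r * t for h, r, t in zip(z_h, rel, z_t))


graph_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])           # graph-level embedding
prediction = linear_head(graph_vector, [[1.0, 1.0]], [0.5])  # one-output head
```

With a relation vector of all ones, `distmult_score` reduces to `dot_product_score`, which is one way to see the dot-product decoder as the single-relation special case.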
Note
Unlabelled graphs (Netzschleuder networks, synthetic combinatorial
instances, the Ising lattice) enter the benchmark through link_pred on a
held-out edge split, so every cell carries a real held-out metric — there
is no self-supervised pretext loss in the headline report.
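As a rough illustration of what a held-out edge split involves, the sketch below hides a fraction of the edges, samples non-edges as negatives, and scores a ranking with ROC-AUC. It assumes an undirected graph given as a list of node-index pairs; the helper names, the 20% test fraction, and the uniform negative sampling are illustrative assumptions rather than the library's actual procedure.

```python
import random
from typing import List, Sequence, Set, Tuple

Edge = Tuple[int, int]


def split_edges(edges: Sequence[Edge], test_fraction: float = 0.2,
                seed: int = 0) -> Tuple[List[Edge], List[Edge]]:
    """Deduplicate undirected edges, shuffle them, and hold out a test fraction."""
    rng = random.Random(seed)
    unique = sorted({(min(u, v), max(u, v)) for u, v in edges})
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    return unique[n_test:], unique[:n_test]  # (train edges, held-out test edges)


def sample_negatives(num_nodes: int, positives: Set[Edge], k: int,
                     seed: int = 0) -> List[Edge]:
    """Sample k distinct node pairs that are not edges (k must not exceed the non-edge count)."""
    rng = random.Random(seed)
    negatives: Set[Edge] = set()
    while len(negatives) < k:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        pair = (min(u, v), max(u, v))
        if u != v and pair not in positives:
            negatives.add(pair)
    return sorted(negatives)


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative; ties count half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Toy example: a 10-node ring.
ring = [(i, (i + 1) % 10) for i in range(10)]
train_edges, test_edges = split_edges(ring, test_fraction=0.2, seed=0)
all_edges = set(train_edges) | set(test_edges)
test_negatives = sample_negatives(10, all_edges, k=len(test_edges), seed=0)
```

A model would be trained on `train_edges` only, then asked to score `test_edges` against `test_negatives`; `roc_auc` turns those two score lists into the headline number.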
Categories
| Category | # | Tasks | Loaders |
|---|---|---|---|
| Combinatorial | 6 | LP | random TSP, VRP, max-flow, bipartite matching, coloring, max-cut |
| Biology | 12 | GC · GR · LP | MUTAG, PROTEINS, ENZYMES, Peptides-func/struct, PPI, C. elegans, Budapest connectome, hospital / high-school contacts, ogbg-molhiv†, ogbg-molpcba† |
| Social | 16 | NC · LP | Cora, CiteSeer, PubMed, WikiCS, Roman-empire, Amazon-ratings, Minesweeper, Tolokers, Questions, MovieLens-100k, Karate, Facebook friends, DBLP coauthor, DNC emails, ogbn-arxiv†, ogbl-collab† |
| Knowledge | 3 | LP | FB15k-237, WordNet18-RR, WordNet (Netzschleuder) |
| Infrastructure | 6 | LP | power grid, EuroRoad, US roads, EU airlines, London transport, urban streets |
| Finance | 5 | NC · LP | Elliptic Bitcoin, product space, board of directors, US patents, ogbn-products† |
| Computing | 4 | LP | Internet AS, Internet topology, AS-Skitter, route views |
| Vision | 5 | GC · NC | MNIST/CIFAR-10 superpixels, ModelNet10/40, ShapeNet |
| Physics | 3 | GR · LP | QM9, ZINC, Ising lattice |
| Security | 3 | GC · LP | MalNet-Tiny, 9/11 terrorists, train terrorists |
Note
Loaders marked with † come from OGB and require the optional ogb
extra (pip install graphnetz[ogb]): ogbn-arxiv and ogbl-collab
in Social, ogbn-products in Finance, and ogbg-molhiv and
ogbg-molpcba in Biology. They are folded into their domain
categories rather than exposed as a separate ogb category, so they
appear in run_benchmark(category, ...) alongside the curated
built-ins. Without the extra installed, the catalogue exposes 58 of
the 63 loaders.
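If a script needs to know in advance whether the OGB-backed loaders will be usable, one generic approach is to probe for the `ogb` module with the standard library. This is a sketch of that check only; `has_optional_module` is a hypothetical helper, and the loader counts are simply the figures quoted in the note above.

```python
import importlib.util


def has_optional_module(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


OGB_AVAILABLE = has_optional_module("ogb")

# 63 loaders in total, 5 of which need the optional extra.
expected_loaders = 63 if OGB_AVAILABLE else 63 - 5
print(f"ogb installed: {OGB_AVAILABLE}; expecting {expected_loaders} loaders")
```

Probing with `find_spec` avoids importing the package just to test for it, which keeps the check cheap.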
Loading individual datasets
Each category exposes thin loader functions returning a PyG dataset:
```python
from graphnetz.datasets.social import cora, roman_empire
from graphnetz.datasets.biology import mutag, peptides_func
from graphnetz.datasets.computing import internet_as

ds_cora = cora("data/cora")                    # node_cls + link_pred
ds_rom = roman_empire("data/roman_empire")     # heterophilic node_cls
ds_mut = mutag("data/mutag")                   # graph_cls
ds_pep = peptides_func("data/peptides_func")   # LRGB graph_cls
ds_inet = internet_as("data/internet_as")      # link_pred

# Optional OGB loaders live in their domain modules
# (require `pip install graphnetz[ogb]`):
from graphnetz.datasets.social import ogbn_arxiv, ogbl_collab
from graphnetz.datasets.biology import ogbg_molhiv

ds_arxiv = ogbn_arxiv("data/ogb")     # node_cls (in Social)
ds_collab = ogbl_collab("data/ogb")   # link_pred (in Social)
ds_hiv = ogbg_molhiv("data/ogb")      # graph_cls (in Biology)
```
The first call downloads and processes the data into the directory you pass; subsequent calls read from the on-disk cache.
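The download-once behaviour follows a common pattern: check for a processed file under the root directory, and build it only if it is missing. The sketch below shows that pattern in isolation. `load_or_build`, the `processed.json` file name, and the JSON format are all hypothetical choices for illustration and say nothing about how the loaders store their data.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Tuple


def load_or_build(root: str, build: Callable[[], dict]) -> Tuple[dict, bool]:
    """Return (data, from_cache); call `build` only when no cached file exists."""
    cache_file = Path(root) / "processed.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text()), True
    data = build()  # stands in for the expensive download + processing step
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(data))
    return data, False


def build_toy_graph() -> dict:
    """Hypothetical 'processing' step producing a tiny graph."""
    return {"num_nodes": 3, "edges": [[0, 1], [1, 2]]}


with tempfile.TemporaryDirectory() as tmp:
    root = str(Path(tmp) / "toy")
    first, first_cached = load_or_build(root, build_toy_graph)    # builds and writes
    second, second_cached = load_or_build(root, build_toy_graph)  # reads the cache
```

One consequence of this pattern is that deleting the directory is usually the way to force a rebuild.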
Arbitrary Netzschleuder networks
The Netz loader fetches any network from the Netzschleuder
catalogue on demand and converts it to the PyG
format used by the rest of the library:
```python
from graphnetz import Netz

ds = Netz(root="data", dataset_name="urban_streets", network_name="brasilia")
data = ds[0]  # PyG Data object

# Multiplex / transit / airline networks need parallel-edge support:
ds_air = Netz(
    root="data",
    dataset_name="eu_airlines",
    network_name="eu_airlines",
    multigraph=True,
)
```
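To see why parallel-edge support matters for such networks, consider what happens when an edge list with repeated pairs is treated as a simple graph. The toy example below is independent of how `Netz` implements `multigraph=True`; the edge list and helper names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]

# Hypothetical multiplex edge list: two airlines both serve the 0-1 route.
edges: List[Edge] = [(0, 1), (0, 1), (1, 2)]


def simple_graph_edges(edge_list: Sequence[Edge]) -> List[Edge]:
    """Collapse parallel edges, as a simple-graph representation would."""
    return sorted(set(edge_list))


def edge_multiplicity(edge_list: Sequence[Edge]) -> Counter:
    """Count how many parallel edges connect each node pair."""
    return Counter(edge_list)


collapsed = simple_graph_edges(edges)
multiplicity = edge_multiplicity(edges)
lost = len(edges) - len(collapsed)  # edges that a simple graph silently drops
```

In this example one of the three edges disappears after collapsing, which is exactly the information a multigraph representation keeps.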
Choosing a dataset
| If you want… | Try |
|---|---|
| A small node-classification sanity check | |
| Heterophilic node classification | |
| Long-range graph classification | |
| Molecular regression | |
| Knowledge-graph link prediction | |
| A real-world infrastructure network | |
| Synthetic, deterministic, fast | |
Adding a new loader
See Contributing → Adding a dataset loader. The short version:
1. Write a thin loader function under the right category module.
2. Register it in LOADER_REGISTRY for each task it supports.
3. Optionally add a Task(...) to BENCHMARK_TASKS for the curated run.
4. Add a one-line smoke test in tests/test_smoke.py.
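The steps above can be sketched end to end with a toy registry. Everything here is hypothetical: the `register` decorator, the nested-dict layout of `TOY_REGISTRY`, and the `toy_triangle` loader are assumptions made for illustration, and the real registration mechanism is described in the Contributing guide.

```python
from typing import Callable, Dict

# Hypothetical registry layout: category -> task type -> loader name -> function.
TOY_REGISTRY: Dict[str, Dict[str, Dict[str, Callable]]] = {}


def register(category: str, *task_types: str):
    """Decorator that files a loader under every task type it supports."""
    def decorator(loader: Callable) -> Callable:
        for task in task_types:
            TOY_REGISTRY.setdefault(category, {}).setdefault(task, {})[loader.__name__] = loader
        return loader
    return decorator


# Steps 1 and 2: a thin loader, registered for each task it supports.
@register("social", "node_cls", "link_pred")
def toy_triangle(root: str) -> dict:
    """Hypothetical loader returning a tiny in-memory graph instead of downloading."""
    return {"root": root, "num_nodes": 3, "edges": [(0, 1), (1, 2), (0, 2)]}


# Step 4: a smoke test that only checks the loader runs and returns a sane graph.
def test_toy_triangle_smoke() -> None:
    ds = toy_triangle("data/toy_triangle")
    assert ds["num_nodes"] == 3
    assert len(ds["edges"]) == 3


test_toy_triangle_smoke()
```

Registering per task type means the same loader function appears under both `node_cls` and `link_pred` without being written twice.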