graphnetz.datasets
Dataset registry, organized by research category and task.
Each category exposes thin loader functions that return a torch_geometric
dataset (PyG built-in or Netz). The
LOADER_REGISTRY table maps every loader to its category and the task it can serve, mirroring the structure used by
graphnetz.benchmark.BENCHMARK_TASKS ([category][task_type] -> [...]).
Categories
- combinatorial: synthetic TSP, VRP, max-cut, max-flow, matching, coloring.
- biology: MUTAG, PROTEINS, ENZYMES, PPI, Peptides (LRGB), C. elegans, connectomes, contact networks.
- social: Cora, CiteSeer, PubMed, WikiCS, Heterophilous (Roman-empire, Amazon-ratings, Minesweeper, Tolokers, Questions), Karate, ego networks.
- knowledge: FB15k-237, WordNet18-RR, Netz WordNet.
- infrastructure: power grids, road and air networks.
- finance: product space, board interlocks, patents, Elliptic Bitcoin.
- computing: AS topology, Skitter, BGP route views.
- vision: MNIST/CIFAR10 superpixel graphs, ModelNet, ShapeNet.
- physics: QM9, ZINC, Ising lattice.
- security: terrorist association networks, MalNet-Tiny.
Tasks
node_cls, graph_cls, graph_reg, link_pred. A loader may
serve more than one task type (e.g. cora is used for both node_cls and
link_pred). Deep Graph Infomax is not a task: it is a
self-supervised training objective whose metric is its own loss, so the
benchmark routes unlabelled graphs through link_pred (a real held-out
edge split with an AUC metric) instead. train_dgi and the
DGIWrapper adapter remain available as utilities for users who want
unsupervised pre-training on top of any encoder.
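To make the link_pred routing concrete, here is a minimal, dependency-free sketch of a held-out edge split scored with AUC. It is not the benchmark's implementation: the helper names (split_edges, auc, score), the 20% test fraction, and the common-neighbour scorer are all assumptions chosen for illustration.

```python
import random
from itertools import combinations


def split_edges(edges, test_frac=0.2, seed=0):
    """Shuffle undirected edges and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = sorted(edges)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_frac))
    return shuffled[n_test:], shuffled[:n_test]


def auc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count as 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


# Toy graph: two triangles joined by one bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
train, test = split_edges(edges)

# Build adjacency from training edges only, so test edges stay unseen.
adj = {v: set() for v in range(6)}
for u, v in train:
    adj[u].add(v)
    adj[v].add(u)


def score(u, v):
    """Hypothetical scorer: number of common neighbours in the training graph."""
    return len(adj[u] & adj[v])


# Negatives: node pairs that are not edges in the full graph.
edge_set = set(edges)
negatives = [p for p in combinations(range(6), 2) if p not in edge_set]

test_auc = auc(
    [score(u, v) for u, v in test],
    [score(u, v) for u, v in negatives],
)
```

Unlike the DGI loss, the resulting number is comparable across encoders, because it is computed on edges the model never saw during training.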
- class graphnetz.datasets.Netz(*args: Any, **kwargs: Any)
Bases: InMemoryDataset
Netzschleuder network dataset.
- Parameters:
root (str) – Root directory where the dataset should be saved.
dataset_name (str) – Name of the dataset from the Netzschleuder repository.
network_name (str) – Name of the network within the dataset.
transform (callable, optional) – Standard PyG transform hooks.
pre_transform (callable, optional) – Standard PyG transform hooks.
Examples
>>> dataset = Netz(
...     root="data", dataset_name="urban_streets", network_name="brasilia"
... )
>>> dataset[0]
Data(...)
References
Tiago P. Peixoto. (2020). The Netzschleuder network catalogue and repository. https://networks.skewed.de/
- graphnetz.datasets.download_all_networks_netz(root: str, dataset_name: str, transform: Callable | None = None, pre_transform: Callable | None = None) -> None
Download and process every network in a Netzschleuder dataset.
- graphnetz.datasets.list_datasets(category: str | None = None, task: str | None = None) -> dict[str, dict[str, list[str]]]
Return loader names organized by category and task.
Output shape: {category: {task_type: [loader_name, ...]}}. Pass category and/or task to restrict the view.
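The nested output shape and the two optional filters can be sketched in plain Python. This is an illustrative re-implementation, not the library's code: REGISTRY and list_datasets_sketch are hypothetical names, the real LOADER_REGISTRY layout may differ, and the task assignments other than cora's are assumptions.

```python
from collections import defaultdict

# Hypothetical flat registry: loader name -> (category, task types it can serve).
REGISTRY = {
    "cora": ("social", ["node_cls", "link_pred"]),
    "mutag": ("biology", ["graph_cls"]),
    "peptides_struct": ("biology", ["graph_reg"]),
    "zinc": ("physics", ["graph_reg"]),
}


def list_datasets_sketch(category=None, task=None):
    """Group loader names as {category: {task_type: [loader_name, ...]}}."""
    view = defaultdict(lambda: defaultdict(list))
    for name, (cat, tasks) in sorted(REGISTRY.items()):
        if category is not None and cat != category:
            continue
        for t in tasks:
            if task is not None and t != task:
                continue
            view[cat][t].append(name)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {cat: dict(by_task) for cat, by_task in view.items()}


social_only = list_datasets_sketch(category="social")
regression_only = list_datasets_sketch(task="graph_reg")
```

A loader that serves several tasks, such as cora, appears once under each task key of its category.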
Per-category loaders
Combinatorial-optimization graph datasets.
All instances are synthetic; canonical PyG benchmarks do not cover this category. The library provides:
- RandomTSP / random_tsp(): Euclidean TSP (k-NN over 2D points).
- RandomVRP / random_vrp(): Capacitated VRP (multi-depot k-NN).
- RandomMaxFlow / random_maxflow(): random capacitated networks with a single source/sink, suitable for max-flow / min-cut benchmarks.
- RandomBipartiteMatching / random_bipartite_matching(): bipartite assignment instances with random weights.
- RandomColoring / random_coloring(): random Erdos-Renyi graphs for graph-coloring / max-cut experiments.
- RandomMaxCut / random_maxcut(): alias of RandomColoring.
- class graphnetz.datasets.combinatorial.RandomBipartiteMatching(*args: Any, **kwargs: Any)
Bases: InMemoryDataset
Random bipartite assignment instances with weighted edges.
Two sides, each with size nodes (the size parameter), are connected by a Bernoulli mask of probability edge_prob; edge weights are uniform on (0, 1] and stored in edge_attr. Node features mark each node’s side.
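The construction described above can be sketched with the standard library alone. The function name, the one-hot side encoding, and the node numbering are assumptions for illustration; the class may lay out its tensors differently.

```python
import random


def random_bipartite_instance(size, edge_prob, seed=0):
    """Return (x, edge_index, edge_attr) for one bipartite assignment instance."""
    rng = random.Random(seed)
    # Left nodes are 0..size-1, right nodes are size..2*size-1.
    # Assumed feature layout: one-hot marker of the node's side.
    x = [[1.0, 0.0] for _ in range(size)] + [[0.0, 1.0] for _ in range(size)]
    edge_index, edge_attr = [], []
    for u in range(size):
        for v in range(size, 2 * size):
            if rng.random() < edge_prob:  # Bernoulli mask over left-right pairs
                edge_index.append((u, v))
                # random() lies in [0, 1), so 1 - random() lies in (0, 1].
                edge_attr.append(1.0 - rng.random())
    return x, edge_index, edge_attr


x, edge_index, edge_attr = random_bipartite_instance(size=4, edge_prob=0.5, seed=1)
```

Sampling weights as 1 - random() keeps zero out of the range, which matches the half-open interval (0, 1] stated in the docstring.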
- class graphnetz.datasets.combinatorial.RandomColoring(*args: Any, **kwargs: Any)
Bases: InMemoryDataset
Random Erdos-Renyi graphs for graph-coloring and max-cut benchmarks.
- graphnetz.datasets.combinatorial.RandomMaxCut
alias of RandomColoring
- class graphnetz.datasets.combinatorial.RandomMaxFlow(*args: Any, **kwargs: Any)
Bases: InMemoryDataset
Random capacitated networks with a marked source/sink for max-flow tasks.
Nodes 0 and num_nodes - 1 are designated source and sink, respectively. Edges carry a positive capacity stored in edge_attr.
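For reference, the ground-truth value on such an instance can be computed with a textbook Edmonds-Karp routine. The sketch below assumes directed edges given as (u, v, capacity) triples and uses the source/sink convention stated above; the function name and the edge-list format are hypothetical and not part of the library.

```python
from collections import deque


def max_flow(num_nodes, edges):
    """Edmonds-Karp max flow from node 0 to node num_nodes - 1."""
    source, sink = 0, num_nodes - 1
    # Residual capacities as a dense matrix; fine for small illustrative graphs.
    residual = [[0.0] * num_nodes for _ in range(num_nodes)]
    for u, v, cap in edges:
        residual[u][v] += cap
    flow = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in range(num_nodes):
                if v not in parent and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck


# Four-node example: source is node 0, sink is node 3.
edges = [(0, 1, 3.0), (0, 2, 2.0), (1, 2, 1.0), (1, 3, 2.0), (2, 3, 3.0)]
value = max_flow(4, edges)
```

In this example the edges leaving the source have total capacity 5, and all of it can be routed to the sink, so the maximum flow equals that cut.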
- class graphnetz.datasets.combinatorial.RandomTSP(*args: Any, **kwargs: Any)
Bases: InMemoryDataset
Euclidean TSP instances as k-NN graphs over 2D points.
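A k-NN graph over 2D points can be built as follows. This is a brute-force sketch under assumed conventions (points in the unit square, directed edges carrying the Euclidean distance); knn_graph is a hypothetical helper, not a function exported by the library.

```python
import math
import random


def knn_graph(points, k):
    """Return directed edges (i, j, distance) from each point to its k nearest neighbours."""
    edges = []
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance to point i.
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        edges.extend((i, j, d) for d, j in others[:k])
    return edges


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(10)]  # cities in the unit square
edges = knn_graph(points, k=3)
```

The k-NN relation is not symmetric, so node i may list j as a neighbour without the reverse holding.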
- class graphnetz.datasets.combinatorial.RandomVRP(*args: Any, **kwargs: Any)
Bases: InMemoryDataset
Capacitated VRP instances: customers + multiple depots, k-NN connectivity.
Node features are [x, y, demand, is_depot]. Demands are zero for depots and uniform on (0, 1] for customers.
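The feature layout can be illustrated with a short generator. The function name, the choice to place depots first, and the unit-square coordinates are assumptions; only the [x, y, demand, is_depot] ordering and the demand ranges come from the description above.

```python
import random


def vrp_node_features(num_customers, num_depots, seed=0):
    """Return one [x, y, demand, is_depot] row per node."""
    rng = random.Random(seed)
    rows = []
    for i in range(num_depots + num_customers):
        is_depot = i < num_depots  # assumption: depots come first
        # Depots have zero demand; customers draw from (0, 1].
        demand = 0.0 if is_depot else 1.0 - rng.random()
        rows.append([rng.random(), rng.random(), demand, float(is_depot)])
    return rows


rows = vrp_node_features(num_customers=5, num_depots=2, seed=0)
```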
- graphnetz.datasets.combinatorial.random_bipartite_matching(root: str, **kwargs: int | float) -> RandomBipartiteMatching
- graphnetz.datasets.combinatorial.random_coloring(root: str, **kwargs: int | float) -> RandomColoring
- graphnetz.datasets.combinatorial.random_maxcut(root: str, **kwargs: int | float) -> RandomColoring
- graphnetz.datasets.combinatorial.random_maxflow(root: str, **kwargs: int | float) -> RandomMaxFlow
Health and biology datasets.
Coverage:
- Molecular: PyG TUDataset (MUTAG, PROTEINS, ENZYMES).
- Long-range peptides: PyG LRGBDataset (Peptides-func graph classification, Peptides-struct graph regression; Dwivedi et al., NeurIPS 2022 long-range graph benchmark).
- Protein-protein interaction: PyG PPI (inductive multi-graph).
- Metabolic: Netzschleuder celegans_metabolic.
- Brain connectomes: Netzschleuder budapest_connectome.
- Epidemiology: Netzschleuder sp_hospital and sp_high_school contact graphs.
- Open Graph Benchmark (optional ogb extra): ogbg_molhiv (~41 K molecules, binary HIV-inhibition), ogbg_molpcba (~438 K molecules, 128 binary bioassay tasks). Both also need the chem extra for RDKit featurisation.
Patient-disease-treatment knowledge graphs have no canonical free dataset and are intentionally omitted.
- graphnetz.datasets.biology.budapest_connectome(root: str, network_name: str = '100m_avg') -> Netz
Budapest reference connectome (mean connectivity across 100 subjects).
- graphnetz.datasets.biology.celegans(root: str) -> Netz
C. elegans metabolic network (Netzschleuder).
- graphnetz.datasets.biology.enzymes(root: str) -> torch_geometric.datasets.TUDataset
Enzymes: 600 graphs, 6 classes.
- graphnetz.datasets.biology.high_school_contacts(root: str) -> Netz
Sociopatterns high-school contact network.
- graphnetz.datasets.biology.hospital_contacts(root: str) -> Netz
Sociopatterns hospital ward face-to-face contact network.
- graphnetz.datasets.biology.mutag(root: str) -> torch_geometric.datasets.TUDataset
MUTAG mutagenicity dataset: 188 molecules, binary class.
- graphnetz.datasets.biology.ogbg_molhiv(root: str) -> Any
OGB MolHIV: ~41 K molecules, binary HIV-inhibition labels.
- graphnetz.datasets.biology.ogbg_molpcba(root: str) -> Any
OGB MolPCBA: ~438 K molecules, 128 binary bioassay labels.
- graphnetz.datasets.biology.peptides_func(root: str, split: str = 'train') -> torch_geometric.datasets.LRGBDataset
Peptides-func: long-range graph classification (10-way multilabel).
- graphnetz.datasets.biology.peptides_struct(root: str, split: str = 'train') -> torch_geometric.datasets.LRGBDataset
Peptides-struct: long-range graph regression (11 structural targets).
- graphnetz.datasets.biology.ppi(root: str, split: str = 'train') -> torch_geometric.datasets.PPI
Protein-protein interaction (inductive node multi-label classification).
- graphnetz.datasets.biology.proteins(root: str) -> torch_geometric.datasets.TUDataset
Proteins: 1113 graphs, binary class.
Coverage:
- Social: karate, facebook_friends (Netzschleuder).
- Citation/collaboration: Planetoid (Cora, CiteSeer, PubMed) + Netz dblp_coauthor.
- Web/hyperlink: PyG WikiCS.
- Heterophilic node classification: PyG HeterophilousGraphDataset (Roman-empire, Amazon-ratings, Minesweeper, Tolokers, Questions), a widely used stress test for GNNs whose accuracy drops outside the homophilic Planetoid setting (Platonov et al., ICLR 2023).
- Communication: Netz dnc (Democratic National Committee email leak).
- Recommendation: PyG MovieLens100K.
- Open Graph Benchmark (optional ogb extra): ogbn_arxiv (arXiv citation network for node classification), ogbl_collab (collaboration network for link prediction).
- graphnetz.datasets.social.amazon_ratings(root: str) -> torch_geometric.datasets.HeterophilousGraphDataset
Amazon-ratings heterophilic node-classification benchmark.
- graphnetz.datasets.social.citeseer(root: str) -> torch_geometric.datasets.Planetoid
CiteSeer citation network (3327 nodes, 6 classes).
- graphnetz.datasets.social.cora(root: str) -> torch_geometric.datasets.Planetoid
Cora citation network (2708 nodes, 7 classes).
- graphnetz.datasets.social.dblp_coauthor(root: str) -> Netz
DBLP co-authorship network (Netzschleuder).
- graphnetz.datasets.social.dnc_emails(root: str) -> Netz
DNC email communication network (Netzschleuder).
- graphnetz.datasets.social.facebook_friends(root: str) -> Netz
Facebook ego friendship network (Netzschleuder).
- graphnetz.datasets.social.karate(root: str) -> Netz
Zachary’s karate club (the canonical small social network).
- graphnetz.datasets.social.minesweeper(root: str) -> torch_geometric.datasets.HeterophilousGraphDataset
Minesweeper heterophilic node-classification benchmark.
- graphnetz.datasets.social.movielens100k(root: str) -> torch_geometric.datasets.MovieLens100K
MovieLens 100K user-item bipartite ratings graph.
- graphnetz.datasets.social.ogbl_collab(root: str) -> torch_geometric.data.Data
OGB collaboration network (~235 K author nodes, 128-d features).
Returns a single PyG Data graph; the benchmark runner re-splits via RandomLinkSplit rather than using OGB’s official edge split.
- graphnetz.datasets.social.ogbn_arxiv(root: str) -> torch_geometric.data.Data
OGB arXiv citation network (~169 K nodes, 40 subject classes).
- graphnetz.datasets.social.pubmed(root: str) -> torch_geometric.datasets.Planetoid
PubMed citation network (19717 nodes, 3 classes).
- graphnetz.datasets.social.questions(root: str) -> torch_geometric.datasets.HeterophilousGraphDataset
Questions heterophilic node-classification benchmark.
- graphnetz.datasets.social.roman_empire(root: str) -> torch_geometric.datasets.HeterophilousGraphDataset
Roman-empire heterophilic node-classification benchmark.
- graphnetz.datasets.social.tolokers(root: str) -> torch_geometric.datasets.HeterophilousGraphDataset
Tolokers heterophilic node-classification benchmark.
- graphnetz.datasets.social.wikics(root: str) -> torch_geometric.datasets.WikiCS
Wikipedia computer-science article hyperlink graph.
Knowledge graph and language datasets.
Wraps PyG knowledge-graph benchmarks for relational link prediction.
- graphnetz.datasets.knowledge.fb15k_237(root: str) -> torch_geometric.datasets.RelLinkPredDataset
FB15k-237 relational link prediction benchmark.
- graphnetz.datasets.knowledge.wordnet18rr(root: str) -> _WordNet18RRRel
WordNet18-RR relational link prediction benchmark.
- graphnetz.datasets.knowledge.wordnet_netz(root: str) -> Netz
WordNet semantic graph (Netzschleuder).
Infrastructure and physical-system networks (Netzschleuder).
- graphnetz.datasets.infrastructure.eu_airlines(root: str) -> Netz
European airline route multiplex.
- graphnetz.datasets.infrastructure.london_transport(root: str) -> Netz
London transport multiplex (rail + bus + underground).
- graphnetz.datasets.infrastructure.urban_streets(root: str, network_name: str = 'brasilia') -> Netz
Urban street network for a given city (e.g. brasilia, manhattan).
- graphnetz.datasets.infrastructure.us_roads(root: str, network_name: str = 'DC') -> Netz
US road network for a given state (e.g. DC, CA).
Finance and economics networks.
Coverage:
- Trade: product_space (economic complexity).
- Ownership / corporate control: board_directors (Norwegian boards).
- Innovation: us_patents citation network.
- Transactions / fraud / AML: PyG EllipticBitcoinDataset (illicit-wallet detection on Bitcoin transactions).
- Open Graph Benchmark (optional ogb extra): ogbn_products (Amazon co-purchase graph for node classification, ~2.4 M nodes, 47 product categories).
Inter-bank exposure datasets are typically confidential and have no canonical public benchmark.
- graphnetz.datasets.finance.board_directors(root: str, network_name: str = 'net1m_2002-05-01') -> Netz
Norwegian boards of directors interlock network (snapshot).
- graphnetz.datasets.finance.elliptic_bitcoin(root: str) -> torch_geometric.datasets.EllipticBitcoinDataset
Elliptic Bitcoin transactions dataset for illicit-wallet detection.
- graphnetz.datasets.finance.ogbn_products(root: str) -> torch_geometric.data.Data
OGB Amazon product co-purchase network (~2.4 M nodes, 47 classes).
Larger than ogbn_arxiv: full-graph training is feasible on a workstation GPU but slow; reduce epochs for quick iteration.
- graphnetz.datasets.finance.product_space(root: str) -> Netz
Product space of international trade (economic complexity).
Computing and systems networks (Netzschleuder).
Internet topology, autonomous-system graphs, and routing snapshots.
- graphnetz.datasets.computing.internet_as(root: str, network_name: str = 'internet_as') -> Netz
Internet AS-level topology snapshot (Karrer-Newman-Zdeborová, 2014).
- graphnetz.datasets.computing.route_views(root: str, network_name: str = '20030701') -> Netz
Route Views BGP snapshot.
Geometry and vision datasets.
Coverage:
- Image-derived superpixel graphs: MNISTSuperpixels, CIFAR10 (GNN benchmark).
- Meshes / point clouds: PyG ModelNet (10/40 classes), ShapeNet part segmentation.
- graphnetz.datasets.vision.cifar10_superpixels(root: str, split: str = 'train') -> torch_geometric.datasets.GNNBenchmarkDataset
CIFAR10 superpixel graphs (GNN benchmark suite).
- graphnetz.datasets.vision.mnist_superpixels(root: str, train: bool = True) -> torch_geometric.datasets.MNISTSuperpixels
MNIST images converted to 75-superpixel graphs.
- graphnetz.datasets.vision.modelnet10(root: str, train: bool = True) -> torch_geometric.datasets.ModelNet
ModelNet10 3D shapes (10 classes).
- graphnetz.datasets.vision.modelnet40(root: str, train: bool = True) -> torch_geometric.datasets.ModelNet
ModelNet40 3D shapes (40 classes).
- graphnetz.datasets.vision.shapenet(root: str, categories: list[str] | None = None) -> torch_geometric.datasets.ShapeNet
ShapeNet point clouds with part-segmentation labels.
Pass categories=['Chair'] (etc.) to limit to a subset.
Physics and chemistry datasets.
Coverage:
- Molecules: PyG QM9, ZINC.
- Spin systems / lattices: synthetic 2D Ising lattice graphs (IsingLattice).
Feynman diagrams, reaction networks, and large crystal-structure databases lack canonical PyG-format datasets and are intentionally omitted.
- class graphnetz.datasets.physics.IsingLattice(*args: Any, **kwargs: Any)
Bases: InMemoryDataset
Synthetic 2D Ising lattice ensemble.
Each graph is an L x L square lattice with periodic-free boundaries; node features are Bernoulli spins drawn at the temperature set by the temperature parameter (Glauber-style independent sampling: not a thermalised configuration, but a cheap proxy useful for representation-learning benchmarks).
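The lattice topology can be sketched as below, reading "periodic-free" as open boundaries with no wrap-around, which is an assumption. How temperature maps to the Bernoulli probability is not documented here, so the sketch takes a hypothetical p_up argument directly instead of a temperature; both function names are illustrative only.

```python
import random


def lattice_edges(L):
    """Undirected nearest-neighbour edges of an L x L grid with open boundaries."""
    edges = []
    for r in range(L):
        for c in range(L):
            node = r * L + c
            if c + 1 < L:  # right neighbour, no wrap-around
                edges.append((node, node + 1))
            if r + 1 < L:  # neighbour below, no wrap-around
                edges.append((node, node + L))
    return edges


def random_spins(L, p_up=0.5, seed=0):
    """Independent +1/-1 spins; p_up is an assumed stand-in for the temperature mapping."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_up else -1 for _ in range(L * L)]


edges = lattice_edges(4)
spins = random_spins(4)
```

With open boundaries an L x L grid has 2 * L * (L - 1) undirected edges, so the 4 x 4 example has 24; a periodic lattice would have 2 * L * L instead.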
- graphnetz.datasets.physics.qm9(root: str) -> torch_geometric.datasets.QM9
QM9 quantum-chemistry benchmark (134k small molecules).
- graphnetz.datasets.physics.zinc(root: str, subset: bool = True, split: str = 'train') -> torch_geometric.datasets.ZINC
ZINC molecular regression benchmark.
Security-related graph datasets.
Coverage:
- Terrorism association networks (Krebs 9/11; Madrid 2004 train bombings) via Netzschleuder.
- Malware function call graphs: PyG MalNetTiny (5 malware families).
Generic attack graphs and threat-intelligence/IoC graphs lack canonical public benchmarks and are intentionally omitted.
- graphnetz.datasets.security.malnet_tiny(root: str, split: str = 'train') -> torch_geometric.datasets.MalNetTiny
MalNet-Tiny: 5 malware family function-call graphs.