graphnetz.datasets

Dataset registry, organized by research category and task.

Each category exposes thin loader functions that return a torch_geometric dataset (PyG built-in or Netz). The LOADER_REGISTRY table maps every loader to its category and the task it can serve, mirroring the structure used by graphnetz.benchmark.BENCHMARK_TASKS ([category][task_type] -> [...]).

Categories

  • combinatorial: synthetic TSP, VRP, max-cut, max-flow, matching, coloring.

  • biology: MUTAG, PROTEINS, ENZYMES, PPI, Peptides (LRGB), C. elegans, connectomes, contact networks.

  • social: Cora, CiteSeer, PubMed, WikiCS, Heterophilous (Roman-empire, Amazon-ratings, Minesweeper, Tolokers, Questions), Karate, ego networks.

  • knowledge: FB15k-237, WordNet18-RR, Netz WordNet.

  • infrastructure: power grids, road and air networks.

  • finance: product space, board interlocks, patents, Elliptic Bitcoin.

  • computing: AS topology, Skitter, BGP route views.

  • vision: MNIST/CIFAR10 superpixel graphs, ModelNet, ShapeNet.

  • physics: QM9, ZINC, Ising lattice.

  • security: terrorist association networks, MalNet-Tiny.

Tasks

node_cls, graph_cls, graph_reg, link_pred. A loader may serve more than one task type (e.g. cora is used for both node_cls and link_pred). Deep Graph Infomax is not a task: it is a self-supervised training objective whose metric is its own loss, so the benchmark routes unlabelled graphs through link_pred (a real held-out edge split with an AUC metric) instead. train_dgi and the DGIWrapper adapter remain available as utilities for users who want unsupervised pre-training on top of any encoder.
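As an illustration of the link_pred evaluation described above, here is a minimal sketch assuming a random held-out edge split and a pairwise AUC. The helper names `split_edges` and `auc` are hypothetical and are not part of graphnetz; the benchmark's own split and metric code may differ.

```python
import random


def split_edges(edges, test_frac=0.2, seed=0):
    """Shuffle the edge list and hold out a fraction as positive test edges."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_frac))
    # First n_test shuffled edges are held out; the rest are kept for training.
    return shuffled[n_test:], shuffled[:n_test]


def auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win, which matches the usual rank-based AUC.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

Scores for held-out (positive) edges are compared against scores for sampled non-edges (negatives); an AUC of 0.5 corresponds to random ranking.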

class graphnetz.datasets.Netz(*args: Any, **kwargs: Any)

Bases: InMemoryDataset

Netzschleuder network dataset.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • dataset_name (str) – Name of the dataset from the Netzschleuder repository.

  • network_name (str) – Name of the network within the dataset.

  • transform (callable, optional) – Standard PyG transform hooks.

  • pre_transform (callable, optional) – Standard PyG transform hooks.

Examples

>>> dataset = Netz(
...     root="data", dataset_name="urban_streets", network_name="brasilia"
... )
>>> dataset[0]
Data(...)

References

Tiago P. Peixoto. (2020). The Netzschleuder network catalogue and repository. https://networks.skewed.de/

property raw_dir: str
property processed_dir: str
property raw_file_names: list[str]
property processed_file_names: str
download() -> None
process() -> None

graphnetz.datasets.download_all_networks_netz(root: str, dataset_name: str, transform: Callable | None = None, pre_transform: Callable | None = None) -> None

Download and process every network in a Netzschleuder dataset.

graphnetz.datasets.list_datasets(category: str | None = None, task: str | None = None) -> dict[str, dict[str, list[str]]]

Return loader names organized by category and task.

Output shape: {category: {task_type: [loader_name, ...]}}. Pass category and/or task to restrict the view.
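The filtering behaviour can be pictured with a small sketch over a nested dictionary of the same shape. The registry contents and the function name `filter_registry` are hypothetical, and dropping categories that end up empty is an assumption rather than documented behaviour of list_datasets.

```python
from typing import Optional

# Hypothetical miniature registry with the documented shape
# {category: {task_type: [loader_name, ...]}}.
REGISTRY = {
    "social": {"node_cls": ["cora", "citeseer"], "link_pred": ["cora"]},
    "physics": {"graph_reg": ["qm9", "zinc"]},
}


def filter_registry(registry, category: Optional[str] = None, task: Optional[str] = None):
    """Return a copy of the registry restricted to one category and/or task."""
    out = {}
    for cat, tasks in registry.items():
        if category is not None and cat != category:
            continue
        kept = {t: list(names) for t, names in tasks.items() if task is None or t == task}
        if kept:  # assumption: categories with no matching task are omitted
            out[cat] = kept
    return out
```

Calling `filter_registry(REGISTRY, task="link_pred")` keeps only the categories that have at least one link-prediction loader.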

Per-category loaders

Combinatorial-optimization graph datasets.

All instances are synthetic; canonical PyG benchmarks do not cover this category. The library provides:

class graphnetz.datasets.combinatorial.RandomBipartiteMatching(*args: Any, **kwargs: Any)

Bases: InMemoryDataset

Random bipartite assignment instances with weighted edges.

Two sides of size size are connected by a Bernoulli mask of probability edge_prob; edge weights are uniform on (0, 1] and stored in edge_attr. Node features mark each node’s side.
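The sampling scheme described above can be sketched in plain Python. This is an illustration only: the function name, the node numbering (left side first), and the one-hot side indicator are assumptions, not the library's implementation.

```python
import random


def random_bipartite(size, edge_prob, seed=0):
    """Sample one instance: left nodes 0..size-1, right nodes size..2*size-1."""
    rng = random.Random(seed)
    edge_index = [[], []]  # COO layout: row 0 holds sources, row 1 holds targets
    edge_attr = []
    for u in range(size):
        for v in range(size, 2 * size):
            if rng.random() < edge_prob:  # Bernoulli(edge_prob) mask
                edge_index[0].append(u)
                edge_index[1].append(v)
                # random() lies in [0, 1), so 1 - random() lies in (0, 1]
                edge_attr.append(1.0 - rng.random())
    # Assumed encoding of "side": one-hot [is_left, is_right]
    x = [[1.0, 0.0] for _ in range(size)] + [[0.0, 1.0] for _ in range(size)]
    return x, edge_index, edge_attr
```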

property processed_file_names: str
process() -> None

class graphnetz.datasets.combinatorial.RandomColoring(*args: Any, **kwargs: Any)

Bases: InMemoryDataset

Random Erdos-Renyi graphs for graph-coloring and max-cut benchmarks.
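To make the two objectives concrete, the sketch below samples a G(n, p) graph and evaluates a candidate max-cut partition and a candidate colouring. All function names are hypothetical; the dataset class itself only generates graphs, and how labels or objectives are attached is not specified here.

```python
import random


def erdos_renyi(num_nodes, edge_prob, seed=0):
    """Sample undirected G(n, p) edges as (u, v) pairs with u < v."""
    rng = random.Random(seed)
    return [
        (u, v)
        for u in range(num_nodes)
        for v in range(u + 1, num_nodes)
        if rng.random() < edge_prob
    ]


def cut_value(edges, side):
    """Max-cut objective: number of edges crossing the two-way partition."""
    return sum(1 for u, v in edges if side[u] != side[v])


def is_proper_coloring(edges, color):
    """Colouring constraint: no edge joins two nodes of the same colour."""
    return all(color[u] != color[v] for u, v in edges)
```

The same random graph serves both tasks, which is consistent with RandomMaxCut being an alias of RandomColoring.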

property processed_file_names: str
process() -> None

graphnetz.datasets.combinatorial.RandomMaxCut

alias of RandomColoring

class graphnetz.datasets.combinatorial.RandomMaxFlow(*args: Any, **kwargs: Any)

Bases: InMemoryDataset

Random capacitated networks with a marked source/sink for max-flow tasks.

Nodes 0 and num_nodes - 1 are designated source and sink. Edges carry a positive capacity stored in edge_attr.
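For reference, a target value for such an instance could be computed with a standard max-flow routine. The sketch below uses Edmonds-Karp (BFS augmenting paths) with the source/sink convention stated above; it is a generic textbook implementation with a hypothetical function name, not code taken from the library.

```python
from collections import deque


def max_flow(num_nodes, edges):
    """Edmonds-Karp max flow from node 0 to node num_nodes - 1.

    edges is an iterable of directed arcs (u, v, capacity).
    """
    if num_nodes < 2:
        raise ValueError("need distinct source and sink nodes")
    source, sink = 0, num_nodes - 1
    residual = [[0.0] * num_nodes for _ in range(num_nodes)]
    for u, v, cap in edges:
        residual[u][v] += cap
    flow = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = [-1] * num_nodes
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(num_nodes):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            return flow  # no augmenting path left
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while v != source:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow and update residual capacities.
        v = sink
        while v != source:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
```

The dense residual matrix keeps the sketch short; for large sparse instances an adjacency-list version would be preferable.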

property processed_file_names: str
process() -> None

class graphnetz.datasets.combinatorial.RandomTSP(*args: Any, **kwargs: Any)

Bases: InMemoryDataset

Euclidean TSP instances as k-NN graphs over 2D points.
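A k-NN graph over 2D points can be built as in the following sketch, which also includes the closed-tour length that a TSP solution would minimise. The function names are hypothetical, and emitting directed edges i -> j (rather than symmetrising them) is an assumption about the representation.

```python
import math


def knn_graph(points, k):
    """Directed edges (i, j) where j is one of the k nearest neighbours of i."""
    edges = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        edges.extend((i, j) for j in others[:k])
    return edges


def tour_length(points, order):
    """Euclidean length of the closed tour visiting points in the given order."""
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )
```

Restricting edges to the k nearest neighbours keeps the graph sparse while retaining the short edges that good tours tend to use.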

property processed_file_names: str
process() -> None

class graphnetz.datasets.combinatorial.RandomVRP(*args: Any, **kwargs: Any)

Bases: InMemoryDataset

Capacitated VRP instances: customers + multiple depots, k-NN connectivity.

Node features are [x, y, demand, is_depot]. Demands are zero for depots and uniform on (0, 1] for customers.
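The feature layout can be illustrated with a short sketch. The function name, the depots-first ordering, and coordinates drawn from the unit square are assumptions made for the example; only the column order and the demand rules come from the description above.

```python
import random


def vrp_node_features(num_customers, num_depots, seed=0):
    """Build rows of [x, y, demand, is_depot] for one VRP instance."""
    rng = random.Random(seed)
    rows = []
    for i in range(num_depots + num_customers):
        is_depot = i < num_depots  # assumption: depots occupy the first indices
        # Depots carry no demand; customer demand is uniform on (0, 1].
        demand = 0.0 if is_depot else 1.0 - rng.random()
        rows.append([rng.random(), rng.random(), demand, float(is_depot)])
    return rows
```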

property processed_file_names: str
process() -> None

graphnetz.datasets.combinatorial.random_bipartite_matching(root: str, **kwargs: int | float) -> RandomBipartiteMatching
graphnetz.datasets.combinatorial.random_coloring(root: str, **kwargs: int | float) -> RandomColoring
graphnetz.datasets.combinatorial.random_maxcut(root: str, **kwargs: int | float) -> RandomColoring
graphnetz.datasets.combinatorial.random_maxflow(root: str, **kwargs: int | float) -> RandomMaxFlow
graphnetz.datasets.combinatorial.random_tsp(root: str, **kwargs: int | float) -> RandomTSP
graphnetz.datasets.combinatorial.random_vrp(root: str, **kwargs: int | float) -> RandomVRP

Health and biology datasets.

Coverage:

  • Molecular: PyG TUDataset (MUTAG, PROTEINS, ENZYMES).

  • Long-range peptides: PyG LRGBDataset (Peptides-func graph classification, Peptides-struct graph regression — Dwivedi et al., NeurIPS 2022 long-range graph benchmark).

  • Protein-protein interaction: PyG PPI (inductive multi-graph).

  • Metabolic: Netzschleuder celegans_metabolic.

  • Brain connectomes: Netzschleuder budapest_connectome.

  • Epidemiology: Netzschleuder sp_hospital and sp_high_school contact graphs.

  • Open Graph Benchmark (optional ogb extra): ogbg_molhiv (~41 K molecules, binary HIV-inhibition), ogbg_molpcba (~438 K molecules, 128 binary bioassay tasks). Both also need the chem extra for RDKit featurisation.

Patient-disease-treatment knowledge graphs have no canonical free dataset and are intentionally omitted.

graphnetz.datasets.biology.budapest_connectome(root: str, network_name: str = '100m_avg') -> Netz

Budapest reference connectome (mean connectivity across 100 subjects).

graphnetz.datasets.biology.celegans(root: str) -> Netz

C. elegans metabolic network (Netzschleuder).

graphnetz.datasets.biology.enzymes(root: str) -> torch_geometric.datasets.TUDataset

Enzymes: 600 graphs, 6 classes.

graphnetz.datasets.biology.high_school_contacts(root: str) -> Netz

Sociopatterns high-school contact network.

graphnetz.datasets.biology.hospital_contacts(root: str) -> Netz

Sociopatterns hospital ward face-to-face contact network.

graphnetz.datasets.biology.mutag(root: str) -> torch_geometric.datasets.TUDataset

Mutagenicity: 188 molecules, binary class.

graphnetz.datasets.biology.ogbg_molhiv(root: str) -> Any

OGB MolHIV: ~41 K molecules, binary HIV-inhibition labels.

graphnetz.datasets.biology.ogbg_molpcba(root: str) -> Any

OGB MolPCBA: ~438 K molecules, 128 binary bioassay labels.

graphnetz.datasets.biology.peptides_func(root: str, split: str = 'train') -> torch_geometric.datasets.LRGBDataset

Peptides-func: long-range graph classification (10-way multilabel).

graphnetz.datasets.biology.peptides_struct(root: str, split: str = 'train') -> torch_geometric.datasets.LRGBDataset

Peptides-struct: long-range graph regression (11 structural targets).

graphnetz.datasets.biology.ppi(root: str, split: str = 'train') -> torch_geometric.datasets.PPI

Protein-protein interaction (inductive node multi-label classification).

graphnetz.datasets.biology.proteins(root: str) -> torch_geometric.datasets.TUDataset

Proteins: 1113 graphs, binary class.

Social and information network datasets.

Coverage:

  • Social: karate, facebook_friends (Netzschleuder).

  • Citation/collaboration: Planetoid (Cora, CiteSeer, PubMed) + Netz dblp_coauthor.

  • Web/hyperlink: PyG WikiCS.

  • Heterophilic node classification: PyG HeterophilousGraphDataset (Roman-empire, Amazon-ratings, Minesweeper, Tolokers, Questions), a stress test for GNNs whose accuracy degrades outside the homophilic Planetoid setting (Platonov et al., ICLR 2023).

  • Communication: Netz dnc (Democratic National Committee email leak).

  • Recommendation: PyG MovieLens100K.

  • Open Graph Benchmark (optional ogb extra): ogbn_arxiv (arXiv citation network for node classification), ogbl_collab (collaboration network for link prediction).

graphnetz.datasets.social.amazon_ratings(root: str) -> torch_geometric.datasets.HeterophilousGraphDataset

Amazon-ratings heterophilic node-classification benchmark.

graphnetz.datasets.social.citeseer(root: str) -> torch_geometric.datasets.Planetoid

CiteSeer citation network (3327 nodes, 6 classes).

graphnetz.datasets.social.cora(root: str) -> torch_geometric.datasets.Planetoid

Cora citation network (2708 nodes, 7 classes).

graphnetz.datasets.social.dblp_coauthor(root: str) -> Netz

DBLP co-authorship network (Netzschleuder).

graphnetz.datasets.social.dnc_emails(root: str) -> Netz

DNC email communication network (Netzschleuder).

graphnetz.datasets.social.facebook_friends(root: str) -> Netz

Facebook ego friendship network (Netzschleuder).

graphnetz.datasets.social.karate(root: str) -> Netz

Zachary’s karate club (the canonical small social network).

graphnetz.datasets.social.minesweeper(root: str) -> torch_geometric.datasets.HeterophilousGraphDataset

Minesweeper heterophilic node-classification benchmark.

graphnetz.datasets.social.movielens100k(root: str) -> torch_geometric.datasets.MovieLens100K

MovieLens 100K user-item bipartite ratings graph.

graphnetz.datasets.social.ogbl_collab(root: str) -> torch_geometric.data.Data

OGB collaboration network (~235 K author nodes, 128-d features).

Returns a single PyG Data graph; the benchmark runner re-splits via RandomLinkSplit rather than using OGB’s official edge split.

graphnetz.datasets.social.ogbn_arxiv(root: str) -> torch_geometric.data.Data

OGB arXiv citation network (~169 K nodes, 40 subject classes).

graphnetz.datasets.social.pubmed(root: str) -> torch_geometric.datasets.Planetoid

PubMed citation network (19717 nodes, 3 classes).

graphnetz.datasets.social.questions(root: str) -> torch_geometric.datasets.HeterophilousGraphDataset

Questions heterophilic node-classification benchmark.

graphnetz.datasets.social.roman_empire(root: str) -> torch_geometric.datasets.HeterophilousGraphDataset

Roman-empire heterophilic node-classification benchmark.

graphnetz.datasets.social.tolokers(root: str) -> torch_geometric.datasets.HeterophilousGraphDataset

Tolokers heterophilic node-classification benchmark.

graphnetz.datasets.social.wikics(root: str) -> torch_geometric.datasets.WikiCS

Wikipedia computer-science article hyperlink graph.

Knowledge graph and language datasets.

Wraps PyG knowledge-graph benchmarks for relational link prediction.

graphnetz.datasets.knowledge.fb15k_237(root: str) -> torch_geometric.datasets.RelLinkPredDataset

FB15k-237 relational link prediction benchmark.

graphnetz.datasets.knowledge.wordnet18rr(root: str) -> _WordNet18RRRel

WordNet18-RR relational link prediction benchmark.

graphnetz.datasets.knowledge.wordnet_netz(root: str) -> Netz

WordNet semantic graph (Netzschleuder).

Infrastructure and physical-system networks (Netzschleuder).

graphnetz.datasets.infrastructure.eu_airlines(root: str) -> Netz

European airline route multiplex.

graphnetz.datasets.infrastructure.euroroad(root: str) -> Netz

European road network.

graphnetz.datasets.infrastructure.london_transport(root: str) -> Netz

London transport multiplex (rail + bus + underground).

graphnetz.datasets.infrastructure.power_grid(root: str) -> Netz

US Western power grid.

graphnetz.datasets.infrastructure.urban_streets(root: str, network_name: str = 'brasilia') -> Netz

Urban street network for a given city (e.g. brasilia, manhattan).

graphnetz.datasets.infrastructure.us_roads(root: str, network_name: str = 'DC') -> Netz

US road network for a given state (e.g. DC, CA).

Finance and economics networks.

Coverage:

  • Trade: product_space (economic complexity).

  • Ownership / corporate control: board_directors (Norwegian boards).

  • Innovation: us_patents citation network.

  • Transactions / fraud / AML: PyG EllipticBitcoinDataset (illicit-wallet detection on Bitcoin transactions).

  • Open Graph Benchmark (optional ogb extra): ogbn_products (Amazon co-purchase graph for node classification, ~2.4 M nodes, 47 product categories).

Inter-bank exposure datasets are typically confidential and have no canonical public benchmark.

graphnetz.datasets.finance.board_directors(root: str, network_name: str = 'net1m_2002-05-01') -> Netz

Norwegian boards of directors interlock network (snapshot).

graphnetz.datasets.finance.elliptic_bitcoin(root: str) -> torch_geometric.datasets.EllipticBitcoinDataset

Elliptic Bitcoin transactions dataset for illicit-wallet detection.

graphnetz.datasets.finance.ogbn_products(root: str) -> torch_geometric.data.Data

OGB Amazon product co-purchase network (~2.4 M nodes, 47 classes).

Larger than ogbn_arxiv — full-graph training is feasible on a workstation GPU but slow; reduce epochs for quick iteration.

graphnetz.datasets.finance.product_space(root: str) -> Netz

Product space of international trade (economic complexity).

graphnetz.datasets.finance.us_patents(root: str) -> Netz

US patents citation network.

Computing and systems networks (Netzschleuder).

Internet topology, autonomous-system graphs, and routing snapshots.

graphnetz.datasets.computing.as_skitter(root: str) -> Netz

CAIDA Skitter AS-level network.

graphnetz.datasets.computing.internet_as(root: str, network_name: str = 'internet_as') -> Netz

Internet AS-level topology snapshot (Karrer-Newman-Zdeborová, 2014).

graphnetz.datasets.computing.route_views(root: str, network_name: str = '20030701') -> Netz

Route Views BGP snapshot.

graphnetz.datasets.computing.topology(root: str) -> Netz

Internet router-level topology.

Geometry and vision datasets.

Coverage:

  • Image-derived superpixel graphs: MNISTSuperpixels, CIFAR10 (GNN benchmark).

  • Meshes / point clouds: PyG ModelNet (10/40 classes), ShapeNet part segmentation.

graphnetz.datasets.vision.cifar10_superpixels(root: str, split: str = 'train') -> torch_geometric.datasets.GNNBenchmarkDataset

CIFAR10 superpixel graphs (GNN benchmark suite).

graphnetz.datasets.vision.mnist_superpixels(root: str, train: bool = True) -> torch_geometric.datasets.MNISTSuperpixels

MNIST images converted to 75-superpixel graphs.

graphnetz.datasets.vision.modelnet10(root: str, train: bool = True) -> torch_geometric.datasets.ModelNet

ModelNet10 3D shapes (10 classes).

graphnetz.datasets.vision.modelnet40(root: str, train: bool = True) -> torch_geometric.datasets.ModelNet

ModelNet40 3D shapes (40 classes).

graphnetz.datasets.vision.shapenet(root: str, categories: list[str] | None = None) -> torch_geometric.datasets.ShapeNet

ShapeNet point clouds with part-segmentation labels.

Pass categories=['Chair'] (etc.) to limit to a subset.

Physics and chemistry datasets.

Coverage:

  • Molecules: PyG QM9, ZINC.

  • Spin systems / lattices: synthetic 2D Ising lattice graphs (IsingLattice).

Feynman diagrams, reaction networks, and large crystal-structure databases lack canonical PyG-format datasets and are intentionally omitted.

class graphnetz.datasets.physics.IsingLattice(*args: Any, **kwargs: Any)

Bases: InMemoryDataset

Synthetic 2D Ising lattice ensemble.

Each graph is an L x L square lattice with periodic-free boundaries; node features are Bernoulli spins drawn at temperature temperature (Glauber-style independent sampling – not a thermalised configuration but a cheap proxy useful for representation-learning benchmarks).
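The construction can be sketched as follows, assuming open (non-periodic) boundaries and independent spin sampling. The function names and the `p_up` parameter are hypothetical; how the class maps temperature to a spin probability is not specified here, so the sketch takes the probability directly.

```python
import random


def square_lattice_edges(L):
    """Undirected nearest-neighbour edges of an L x L grid without wrap-around."""
    edges = []
    for r in range(L):
        for c in range(L):
            i = r * L + c
            if c + 1 < L:
                edges.append((i, i + 1))  # right neighbour
            if r + 1 < L:
                edges.append((i, i + L))  # neighbour below
    return edges


def sample_spins(L, p_up, seed=0):
    """Draw each spin independently: +1 with probability p_up, else -1."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_up else -1 for _ in range(L * L)]


def ising_energy(edges, spins, coupling=1.0):
    """Nearest-neighbour Ising energy E = -J * sum over edges of s_u * s_v."""
    return -coupling * sum(spins[u] * spins[v] for u, v in edges)
```

Because spins are drawn independently, neighbouring spins are uncorrelated, which is why such samples are only a proxy for thermalised configurations.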

property processed_file_names: str
process() -> None

graphnetz.datasets.physics.ising_lattice(root: str, **kwargs: int | float) -> IsingLattice

graphnetz.datasets.physics.qm9(root: str) -> torch_geometric.datasets.QM9

QM9 quantum-chemistry benchmark (134k small molecules).

graphnetz.datasets.physics.zinc(root: str, subset: bool = True, split: str = 'train') -> torch_geometric.datasets.ZINC

ZINC molecular regression benchmark.

Security-related graph datasets.

Coverage:

  • Terrorism association networks (Krebs 9/11; Madrid 2004 train bombings) via Netzschleuder.

  • Malware function call graphs: PyG MalNetTiny (5 graph types, benign software included).

Generic attack graphs and threat-intelligence/IoC graphs lack canonical public benchmarks and are intentionally omitted.

graphnetz.datasets.security.malnet_tiny(root: str, split: str = 'train') -> torch_geometric.datasets.MalNetTiny

MalNet-Tiny: function-call graphs of malicious and benign software across 5 types.

graphnetz.datasets.security.terrorists_911(root: str) -> Netz

Krebs 9/11 terrorist association network.

graphnetz.datasets.security.train_terrorists(root: str) -> Netz

Madrid 2004 train bombing terrorist network.