ogbnmag

package
v0.15.3
Warning: This package is not in the latest version of its module.

Published: Nov 25, 2024 License: Apache-2.0 Imports: 36 Imported by: 0

Documentation

Overview

Package ogbnmag provides a `Download` method for the corresponding dataset, along with some dataset tools.

See https://ogb.stanford.edu/ for all Open Graph Benchmark (OGB) datasets. See https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag for the `ogbn-mag` dataset description.

The task is to predict the venue of publication of a paper, given its relations.
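For orientation, a minimal end-to-end sketch of using the package follows. The import paths and the `backends.New`/`context.New` constructors are assumptions about the surrounding GoMLX API; only `Download` and `Train` are documented below.

package main

import (
	"log"

	"github.com/gomlx/gomlx/backends"         // Assumed import path.
	"github.com/gomlx/gomlx/examples/ogbnmag" // Assumed import path of this package.
	"github.com/gomlx/gomlx/ml/context"       // Assumed import path.
)

func main() {
	dataDir := "/tmp/ogbnmag" // Hypothetical local directory for the dataset (~1GB of disk).
	if err := ogbnmag.Download(dataDir); err != nil {
		log.Fatalf("download: %v", err)
	}

	backend := backends.New() // Assumed default-backend constructor.
	ctx := context.New()      // Assumed context constructor; holds hyperparameters and model variables.

	// Train a GNN model with the default hyperparameters; see Train below.
	if err := ogbnmag.Train(backend, ctx, dataDir, "/tmp/ogbnmag_checkpoint",
		true /* layerWiseEval */, true /* report */, nil); err != nil {
		log.Fatalf("train: %v", err)
	}
}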

Index

Constants

This section is empty.

Variables

var (
	ZipURL         = "http://snap.stanford.edu/ogb/data/nodeproppred/mag.zip"
	ZipFile        = "mag.zip"
	ZipChecksum    = "2afe62ead87f2c301a7398796991d347db85b2d01c5442c95169372bf5a9fca4"
	DownloadSubdir = "downloads"
)
var (
	NumPapers        = 736389
	NumAuthors       = 1134649
	NumInstitutions  = 8740
	NumFieldsOfStudy = 59965

	// NumLabels is the number of labels for the papers. These correspond to publication venues.
	NumLabels = 349

	// PaperEmbeddingsSize is the size of the node features given.
	PaperEmbeddingsSize = 128

	// PapersEmbeddings contains the embeddings, shaped `(Float32)[NumPapers, PaperEmbeddingsSize]`
	PapersEmbeddings *tensors.Tensor

	// PapersYears for each paper, with years counted from 2000 (so 10 corresponds to 2010). Shaped `(Int16)[NumPapers, 1]`.
	PapersYears *tensors.Tensor

	// PapersLabels for each paper, values from 0 to 348 (so 349 in total). Shaped `(Int16)[NumPapers, 1]`.
	PapersLabels *tensors.Tensor

	// TrainSplit, ValidSplit, TestSplit are the splits of the data.
	// These are indices to papers, with values in `[0, NumPapers-1]`. Shaped `(Int32)[n, 1]`.
	// They have 629571, 41939 and 64879 elements, respectively.
	TrainSplit, ValidSplit, TestSplit *tensors.Tensor

	// EdgesAffiliatedWith `(Int32)[1043998, 2]`, pairs with (author_id, institution_id).
	//
	// Thousands of institutions have only one affiliated author, and the number of institutions with more
	// affiliated authors decreases exponentially, all the way to one institution with 27K authors.
	//
	// Most authors are affiliated with only 1 institution, with an exponentially decreasing number of
	// affiliations, up to one author with 47 affiliations. ~300K authors have no affiliation.
	EdgesAffiliatedWith *tensors.Tensor

	// EdgesWrites `(Int32)[7145660, 2]`, pairs with (author_id, paper_id).
	//
	// Every author writes at least one paper, and every paper has at least one author.
	//
	// Most authors (~600K) wrote one paper, with a substantial tail of thousands of authors having written
	// hundreds of papers, and in the extreme one author wrote 1046 papers.
	//
	// Papers are written by 3 authors on average (140K papers), in a bell-curve distribution with a long
	// tail: a dozen papers were written by thousands of authors (5050 authors in one case).
	EdgesWrites *tensors.Tensor

	// EdgesCites `(Int32)[5416271, 2]`, pairs with (paper_id, paper_id).
	//
	// ~120K papers don't cite anyone, 95K papers cite only one paper, and there is a long, exponentially
	// decreasing tail; in the extreme, a paper cites 609 other papers.
	//
	// ~100K papers are never cited, 155K are cited once, and again a long, exponentially decreasing tail;
	// in the extreme, one paper is cited by 4744 other papers.
	EdgesCites *tensors.Tensor

	// EdgesHasTopic `(Int32)[7505078, 2]`, pairs with (paper_id, topic_id).
	//
	// All papers have at least one "field of study" topic. Most (550K) papers have 12 or 13 topics. At most a paper has
	// 14 topics.
	//
	// All "fields of study" are associated to at least one topic. ~17K (out of ~60K) have only one paper associated.
	// ~50%+ topics have < 10 papers associated. Some ~30% have < 1000 papers associated. A handful have 10s of
	// thousands papers associated, and there is one topic that is associated to everyone.
	EdgesHasTopic *tensors.Tensor

	// Counts for the various edge types.
	// These are all shaped `(Int32)[NumElements, 1]`, one for each of their entities.
	CountAuthorsAffiliations, CountInstitutionsAffiliations *tensors.Tensor
	CountPapersCites, CountPapersIsCited                    *tensors.Tensor
	CountPapersFieldsOfStudy, CountFieldsOfStudyPapers      *tensors.Tensor
	CountAuthorsPapers, CountPapersAuthors                  *tensors.Tensor
)
var (
	// OgbnMagVariablesRef maps variable names to a reference to their values.
	// We keep a reference to the values because the actual values change during the call to `Download()`
	//
	// They will be stored under the "/ogbnmag" scope.
	OgbnMagVariablesRef = map[string]**tensors.Tensor{
		"PapersEmbeddings":              &PapersEmbeddings,
		"PapersLabels":                  &PapersLabels,
		"EdgesAffiliatedWith":           &EdgesAffiliatedWith,
		"EdgesWrites":                   &EdgesWrites,
		"EdgesCites":                    &EdgesCites,
		"EdgesHasTopic":                 &EdgesHasTopic,
		"CountAuthorsAffiliations":      &CountAuthorsAffiliations,
		"CountInstitutionsAffiliations": &CountInstitutionsAffiliations,
		"CountPapersCites":              &CountPapersCites,
		"CountPapersIsCited":            &CountPapersIsCited,
		"CountPapersFieldsOfStudy":      &CountPapersFieldsOfStudy,
		"CountFieldsOfStudyPapers":      &CountFieldsOfStudyPapers,
		"CountAuthorsPapers":            &CountAuthorsPapers,
		"CountPapersAuthors":            &CountPapersAuthors,
	}

	// OgbnMagVariablesScope is the absolute scope where the dataset variables are stored.
	OgbnMagVariablesScope = "/ogbnmag"
)
var (
	// ParamEmbedDropoutRate adds an extra dropout to the learned embeddings.
	// This may be important because many embeddings are seen only once, so in testing many will likely never
	// have been seen, and we want the model to learn how to handle missing (zero-initialized) embeddings well.
	ParamEmbedDropoutRate = "mag_embed_dropout_rate"

	// ParamSplitEmbedTablesSize makes the embedding tables share entries across this many entries.
	// The default is 1, which means no splitting.
	ParamSplitEmbedTablesSize = "mag_split_embed_tables"
)
var (
	// BatchSize used for the sampler: the value was taken from the TF-GNN OGBN-MAG demo colab, where it was the
	// best found with some hyperparameter tuning. It leads to using almost 7GB of GPU RAM, but it works fine
	// on an Nvidia RTX 2080 Ti (with 11GB of memory).
	BatchSize = 128

	// ReuseShareableKernels will share the kernels across similar messages in the strategy tree.
	// So the authors-to-papers messages will be the same whether they come from authors of the seed papers
	// or of the co-authored papers.
	// The default is true.
	ReuseShareableKernels = true

	// KeepDegrees will also make the sampler keep the degrees of the edges as separate tensors.
	// These can be used by the GNN pooling functions to multiply the sum by the actual degree.
	KeepDegrees = true

	// IdentitySubSeeds controls whether to use an IdentitySubSeed, to allow more sharing of the kernel.
	IdentitySubSeeds = true
)
var (
	// ParamNumCheckpoints is the number of past checkpoints to keep.
	// The default is 10.
	ParamNumCheckpoints = "num_checkpoints"

	// ParamReuseKernels is the context parameter that configures whether the kernels for similar sampling rules
	// will be reused.
	ParamReuseKernels = "mag_reuse_kernels"

	// ParamIdentitySubSeeds controls whether to use an IdentitySubSeed, to allow more sharing of the kernel.
	ParamIdentitySubSeeds = "mag_identity_sub_seeds"

	// ParamDType controls the dtype to be used: either "float32" or "float16".
	ParamDType = "mag_dtype"
)
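The parameter names above are looked up as hyperparameters in the model's context. A minimal sketch of setting them before building the model, assuming the usual `context.New` constructor and `Context.SetParam` setter from GoMLX; the values are illustrative, not tuned defaults.

package main

import (
	"github.com/gomlx/gomlx/examples/ogbnmag" // Assumed import path.
	"github.com/gomlx/gomlx/ml/context"       // Assumed import path.
)

func main() {
	ctx := context.New() // Assumed context constructor.

	// Illustrative hyperparameter values -- tune them for your setup.
	ctx.SetParam(ogbnmag.ParamDType, "float16")      // Run the model in half precision.
	ctx.SetParam(ogbnmag.ParamEmbedDropoutRate, 0.1) // Extra dropout on learned embeddings.
	ctx.SetParam(ogbnmag.ParamReuseKernels, true)    // Reuse kernels across similar sampling rules.
	ctx.SetParam(ogbnmag.ParamIdentitySubSeeds, true)
	ctx.SetParam(ogbnmag.ParamNumCheckpoints, 5)

	_ = ctx // The context would then be passed to MagModelGraph / Train.
}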
var WithReplacement = false

WithReplacement indicates whether the training dataset is created with replacement.

Functions

func BuildLayerWiseCustomMetricFn added in v0.10.0

func BuildLayerWiseCustomMetricFn(backend backends.Backend, ctx *context.Context, strategy *sampler.Strategy) plots.CustomMetricFn

func BuildLayerWiseInferenceModel added in v0.10.0

func BuildLayerWiseInferenceModel(strategy *sampler.Strategy, predictions bool) func(ctx *context.Context, g *Graph) *Node

BuildLayerWiseInferenceModel returns a function that builds the OGBN-MAG GNN inference model, which expects to run inference on the whole dataset in one go.

It takes as input the sampler.Strategy, and returns a function that can be used with `context.NewExec` and executed with the values of the MAG graph. Batch size is irrelevant.

The returned function returns the predictions for all seeds shaped `Int16[NumSeedNodes]` if `predictions == true`, or the readout layer shaped `Float32[NumSeedNodes, mag.NumLabels]` (or Float16) if `predictions == false`.

func Download

func Download(baseDir string) error

Download downloads the dataset and prepares the tensors with the data into `baseDir`.

If files are already there, it's assumed they were correctly generated and nothing is done.

The data files occupy ~415MB, but to keep a copy of the raw tensors (for faster start-up), you'll need ~1GB of free disk.
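A minimal usage sketch; the import path and the `Tensor.Shape()` accessor are assumptions about the surrounding GoMLX packages:

package main

import (
	"fmt"
	"log"

	"github.com/gomlx/gomlx/examples/ogbnmag" // Assumed import path.
)

func main() {
	baseDir := "/tmp/ogbnmag" // Download caches the zip and the converted tensors here.
	if err := ogbnmag.Download(baseDir); err != nil {
		log.Fatalf("download: %v", err)
	}

	// After Download returns, the package-level tensors are populated.
	fmt.Println("papers embeddings:", ogbnmag.PapersEmbeddings.Shape()) // Expected (Float32)[736389, 128].
	fmt.Println("train split:", ogbnmag.TrainSplit.Shape())             // Expected (Int32)[629571, 1].
}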

func Eval

func Eval(backend backends.Backend, ctx *context.Context, dataDir, checkpointPath string, layerWise, skipTrain bool) error

func ExcludeOgbnMagVariablesFromSave added in v0.10.0

func ExcludeOgbnMagVariablesFromSave(ctx *context.Context, checkpoint *checkpoints.Handler)

ExcludeOgbnMagVariablesFromSave marks the OGBN-MAG variables as not to be saved by the given `checkpoint`. Since they are read separately and are constant, there is no need to repeat them at every checkpoint.
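A hedged sketch of the intended pairing: upload the frozen dataset variables, build a checkpoints handler, and exclude the dataset variables from saving. The `checkpoints.Build(ctx).Dir(...).Done()` builder and the constructors are assumptions about the GoMLX API, not part of this package.

package main

import (
	"log"

	"github.com/gomlx/gomlx/backends"               // Assumed import path.
	"github.com/gomlx/gomlx/examples/ogbnmag"       // Assumed import path.
	"github.com/gomlx/gomlx/ml/context"             // Assumed import path.
	"github.com/gomlx/gomlx/ml/context/checkpoints" // Assumed import path.
)

func main() {
	if err := ogbnmag.Download("/tmp/ogbnmag"); err != nil { // Populates the dataset tensors first.
		log.Fatal(err)
	}
	backend := backends.New() // Assumed default-backend constructor.
	ctx := context.New()      // Assumed context constructor.

	// Creates frozen dataset variables under the "/ogbnmag" scope.
	ctx = ogbnmag.UploadOgbnMagVariables(backend, ctx)

	// Assumed checkpoints builder API.
	checkpoint, err := checkpoints.Build(ctx).Dir("/tmp/ogbnmag_checkpoint").Done()
	if err != nil {
		log.Fatal(err)
	}

	// The dataset variables are large and constant: don't save them at every checkpoint.
	ogbnmag.ExcludeOgbnMagVariablesFromSave(ctx, checkpoint)
}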

func ExtractLabelsFromInput added in v0.10.0

func ExtractLabelsFromInput(inputs, labels []*tensors.Tensor) ([]*tensors.Tensor, []*tensors.Tensor)

ExtractLabelsFromInput creates the labels from the input seed indices. It returns the same inputs and the extracted labels (with mask).

func FeaturePreprocessing

func FeaturePreprocessing(ctx *context.Context, strategy *sampler.Strategy, inputs []*Node) (
	graphInputs map[string]*sampler.ValueMask[*Node], remainingInputs []*Node)

FeaturePreprocessing converts the `spec` and `inputs` given by the dataset into a map of node type name to its initial embeddings.

Most embeddings are seen only once per author/paper, so it is reasonable to expect that during validation/testing the model will see many zero-initialized embeddings.

func InitTrainingSchedule added in v0.13.0

func InitTrainingSchedule(ctx *context.Context)

InitTrainingSchedule initializes custom scheduled training. It's enabled with the hyperparameter "scheduled_training".

func LayerWiseEvaluation added in v0.10.0

func LayerWiseEvaluation(backend backends.Backend, ctx *context.Context, strategy *sampler.Strategy) (train, validation, test float64)

LayerWiseEvaluation returns the train, validation and test accuracy of the model, using layer-wise inference.

func MagModelGraph

func MagModelGraph(ctx *context.Context, spec any, inputs []*Node) []*Node

MagModelGraph builds an OGBN-MAG GNN model that runs [ParamNumGraphUpdates] graph updates along its sampling strategy, and then adds a final layer on top of the seeds.

It returns 2 tensors:

  - Predictions for all seeds, shaped `Float32[BatchSize, mag.NumLabels]` (or `Float16` or `Float64`).
  - Mask of the seeds, provided by the sampler, shaped `Bool[BatchSize]`.

func MakeDatasets

func MakeDatasets(dataDir string) (trainDS, trainEvalDS, validEvalDS, testEvalDS train.Dataset, err error)

MakeDatasets takes the directory where the downloaded data is stored and returns 4 datasets: "train", "trainEval", "validEval", "testEval".

It uses the package `ogbnmag` to download the data.
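A minimal sketch, under the same import-path assumption as the earlier examples:

package main

import (
	"log"

	"github.com/gomlx/gomlx/examples/ogbnmag" // Assumed import path.
)

func main() {
	trainDS, trainEvalDS, validEvalDS, testEvalDS, err := ogbnmag.MakeDatasets("/tmp/ogbnmag")
	if err != nil {
		log.Fatalf("MakeDatasets: %v", err)
	}
	// trainDS feeds the training loop; the three *EvalDS datasets are for periodic evaluation.
	_, _, _, _ = trainDS, trainEvalDS, validEvalDS, testEvalDS
}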

func NewSampler

func NewSampler(baseDir string) (*sampler.Sampler, error)

NewSampler will create a sampler.Sampler and configure it with the OGBN-MAG graph definition.

Usually, one will want to use NewSamplerStrategy instead, which calls this. Call this directly if crafting a custom sampling strategy.

`baseDir` is used to store a cached sampler called `sampler.bin` for faster startup. If empty, it will force re-creating the sampler.

func NewSamplerStrategy added in v0.10.0

func NewSamplerStrategy(magSampler *sampler.Sampler, batchSize int, seedIdsCandidates *tensors.Tensor) (strategy *sampler.Strategy)

NewSamplerStrategy creates a sampling strategy given the sampler, batch size and seeds candidates to sample from.

Args:

  - [magSampler] should have been created with ogbnmag.NewSampler.
  - [batchSize] is the number of seed nodes ("Papers") to sample.
  - [seedIdsCandidates] is the set of seed nodes to sample from, typically ogbnmag.TrainSplit, ogbnmag.ValidSplit or ogbnmag.TestSplit. If empty, it will sample from all possible papers.

It returns a sampler.Strategy for OGBN-MAG.
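A sketch of creating a sampler and a training strategy over the train split. It calls `Download` first, since the sampler and the `TrainSplit` tensor require the dataset tensors to be populated; the import path is assumed.

package main

import (
	"log"

	"github.com/gomlx/gomlx/examples/ogbnmag" // Assumed import path.
)

func main() {
	baseDir := "/tmp/ogbnmag"
	if err := ogbnmag.Download(baseDir); err != nil { // Fills TrainSplit, edges, etc.
		log.Fatal(err)
	}

	magSampler, err := ogbnmag.NewSampler(baseDir) // Caches "sampler.bin" under baseDir.
	if err != nil {
		log.Fatal(err)
	}

	// Sample seed papers from the training split, ogbnmag.BatchSize at a time.
	strategy := ogbnmag.NewSamplerStrategy(magSampler, ogbnmag.BatchSize, ogbnmag.TrainSplit)
	_ = strategy // Used by MagModelGraph / BuildLayerWiseInferenceModel.
}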

func PapersSeedDatasets

func PapersSeedDatasets(manager backends.Backend) (trainDS, validDS, testDS *mldata.InMemoryDataset, err error)

PapersSeedDatasets returns the train, validation and test datasets (`data.InMemoryDataset`) with only the paper seed nodes, to be used with FNNs (Feedforward Neural Networks). See [MakeDatasets] to make a dataset with sampled sub-graphs for GNNs.

The datasets can be shuffled and batched as desired.

The yielded values are papers indices, and the corresponding labels.
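A sketch of creating the seed-only datasets; `backends.New` and the `Shuffle`/`BatchSize` dataset helpers are assumptions about the GoMLX API:

package main

import (
	"log"

	"github.com/gomlx/gomlx/backends"         // Assumed import path.
	"github.com/gomlx/gomlx/examples/ogbnmag" // Assumed import path.
)

func main() {
	if err := ogbnmag.Download("/tmp/ogbnmag"); err != nil { // Populates papers, labels and splits.
		log.Fatal(err)
	}
	backend := backends.New() // Assumed default-backend constructor.

	trainDS, validDS, testDS, err := ogbnmag.PapersSeedDatasets(backend)
	if err != nil {
		log.Fatal(err)
	}
	// Assumed InMemoryDataset helpers: shuffle and batch the training seeds.
	trainDS = trainDS.Shuffle().BatchSize(128, true)
	_, _ = validDS, testDS
}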

func Train

func Train(backend backends.Backend, ctx *context.Context, dataDir, checkpointPath string, layerWiseEval, report bool, paramsSet []string) error

Train trains the GNN model based on the configuration in `ctx`.

func TrainingSchedule added in v0.13.0

func TrainingSchedule(ctx *context.Context, fromStep, toStep int) train.OnStepFn

TrainingSchedule is used to control hyperparameters during training. The parameters fromStep and toStep are the starting and final global_steps of training. It's enabled with the hyperparameter "scheduled_training".

func UploadOgbnMagVariables

func UploadOgbnMagVariables(backend backends.Backend, ctx *context.Context) *context.Context

UploadOgbnMagVariables creates frozen variables with the various static tables of the OGBN-MAG dataset, so they can be used by models.

They will be stored under the "/ogbnmag" scope.

Types

This section is empty.

Directories

Path  Synopsis
fnn   Package fnn implements a feed-forward neural network for the OGBN-MAG problem.
gnn   Package gnn implements a generic GNN modeling based on [TF-GNN MtAlbis].
