ogbnmag

package
v0.15.3
Warning: This package is not in the latest version of its module.

Published: Nov 25, 2024 License: Apache-2.0 Imports: 36 Imported by: 0

Documentation

Overview

Package ogbnmag provides a `Download` method for the corresponding dataset, along with some dataset tools.

See https://ogb.stanford.edu/ for all Open Graph Benchmark (OGB) datasets. See https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag for the `ogbn-mag` dataset description.

The task is to predict the venue of publication of a paper, given its relations.
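For orientation, a minimal end-to-end sketch of using the package follows. The import paths and the `backends.New`/`context.New` constructors are assumptions about the surrounding GoMLX API; only `Download` and `Train` are documented below.

package main

import (
	"log"

	"github.com/gomlx/gomlx/backends"         // Assumed import path.
	"github.com/gomlx/gomlx/examples/ogbnmag" // Assumed import path of this package.
	"github.com/gomlx/gomlx/ml/context"       // Assumed import path.
)

func main() {
	dataDir := "/tmp/ogbnmag" // Hypothetical local directory for the dataset (~1GB of disk).
	if err := ogbnmag.Download(dataDir); err != nil {
		log.Fatalf("download: %v", err)
	}

	backend := backends.New() // Assumed default-backend constructor.
	ctx := context.New()      // Assumed context constructor; holds hyperparameters and model variables.

	// Train a GNN model with the default hyperparameters; see Train below.
	if err := ogbnmag.Train(backend, ctx, dataDir, "/tmp/ogbnmag_checkpoint",
		true /* layerWiseEval */, true /* report */, nil); err != nil {
		log.Fatalf("train: %v", err)
	}
}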

Index

Constants

This section is empty.

Variables

var (
	ZipURL         = "http://snap.stanford.edu/ogb/data/nodeproppred/mag.zip"
	ZipFile        = "mag.zip"
	ZipChecksum    = "2afe62ead87f2c301a7398796991d347db85b2d01c5442c95169372bf5a9fca4"
	DownloadSubdir = "downloads"
)
var (
	NumPapers        = 736389
	NumAuthors       = 1134649
	NumInstitutions  = 8740
	NumFieldsOfStudy = 59965

	// NumLabels is the number of labels for the papers. These correspond to publication venues.
	NumLabels = 349

	// PaperEmbeddingsSize is the size of the node features given.
	PaperEmbeddingsSize = 128

	// PapersEmbeddings contains the embeddings, shaped `(Float32)[NumPapers, PaperEmbeddingsSize]`
	PapersEmbeddings *tensors.Tensor

	// PapersYears for each paper, with years counted from 2000 (so 10 corresponds to 2010). Shaped `(Int16)[NumPapers, 1]`.
	PapersYears *tensors.Tensor

	// PapersLabels for each paper, values from 0 to 348 (so 349 in total). Shaped `(Int16)[NumPapers, 1]`.
	PapersLabels *tensors.Tensor

	// TrainSplit, ValidSplit, TestSplit are the splits of the data.
	// These are indices to papers, with values in `[0, NumPapers-1]`. Shaped `(Int32)[n, 1]`.
	// They have 629571, 41939 and 64879 elements, respectively.
	TrainSplit, ValidSplit, TestSplit *tensors.Tensor

	// EdgesAffiliatedWith `(Int32)[1043998, 2]`, pairs with (author_id, institution_id).
	//
	// Thousands of institutions have only one affiliated author, and the number of institutions with more
	// affiliated authors decreases exponentially, all the way to one institution with 27K authors.
	//
	// Most authors are affiliated with only 1 institution, with an exponentially decreasing number of
	// affiliations, up to one author with 47 affiliations. ~300K authors have no affiliation.
	EdgesAffiliatedWith *tensors.Tensor

	// EdgesWrites `(Int32)[7145660, 2]`, pairs with (author_id, paper_id).
	//
	// Every author writes at least one paper, and every paper has at least one author.
	//
	// Most authors (~600K) wrote one paper, with a substantial tail of thousands of authors having written
	// hundreds of papers, and in the extreme one author wrote 1046 papers.
	//
	// Papers are written by 3 authors on average (140K papers), in a bell-curve distribution with a long
	// tail: a dozen papers were written by thousands of authors (5050 authors in one case).
	EdgesWrites *tensors.Tensor

	// EdgesCites `(Int32)[5416271, 2]`, pairs with (paper_id, paper_id).
	//
	// ~120K papers don't cite anyone, 95K papers cite only one paper, and there is a long, exponentially
	// decreasing tail; in the extreme, a paper cites 609 other papers.
	//
	// ~100K papers are never cited, 155K are cited once, and again a long, exponentially decreasing tail;
	// in the extreme, one paper is cited by 4744 other papers.
	EdgesCites *tensors.Tensor

	// EdgesHasTopic `(Int32)[7505078, 2]`, pairs with (paper_id, topic_id).
	//
	// All papers have at least one "field of study" topic. Most (550K) papers have 12 or 13 topics. At most a paper has
	// 14 topics.
	//
	// All "fields of study" are associated to at least one topic. ~17K (out of ~60K) have only one paper associated.
	// ~50%+ topics have < 10 papers associated. Some ~30% have < 1000 papers associated. A handful have 10s of
	// thousands papers associated, and there is one topic that is associated to everyone.
	EdgesHasTopic *tensors.Tensor

	// Counts for the various edge types.
	// These are all shaped `(Int32)[NumElements, 1]`, one for each of their entities.
	CountAuthorsAffiliations, CountInstitutionsAffiliations *tensors.Tensor
	CountPapersCites, CountPapersIsCited                    *tensors.Tensor
	CountPapersFieldsOfStudy, CountFieldsOfStudyPapers      *tensors.Tensor
	CountAuthorsPapers, CountPapersAuthors                  *tensors.Tensor
)
var (
	// OgbnMagVariablesRef maps variable names to a reference to their values.
	// We keep a reference to the values because the actual values change during the call to `Download()`
	//
	// They will be stored under the "/ogbnmag" scope.
	OgbnMagVariablesRef = map[string]**tensors.Tensor{
		"PapersEmbeddings":              &PapersEmbeddings,
		"PapersLabels":                  &PapersLabels,
		"EdgesAffiliatedWith":           &EdgesAffiliatedWith,
		"EdgesWrites":                   &EdgesWrites,
		"EdgesCites":                    &EdgesCites,
		"EdgesHasTopic":                 &EdgesHasTopic,
		"CountAuthorsAffiliations":      &CountAuthorsAffiliations,
		"CountInstitutionsAffiliations": &CountInstitutionsAffiliations,
		"CountPapersCites":              &CountPapersCites,
		"CountPapersIsCited":            &CountPapersIsCited,
		"CountPapersFieldsOfStudy":      &CountPapersFieldsOfStudy,
		"CountFieldsOfStudyPapers":      &CountFieldsOfStudyPapers,
		"CountAuthorsPapers":            &CountAuthorsPapers,
		"CountPapersAuthors":            &CountPapersAuthors,
	}

	// OgbnMagVariablesScope is the absolute scope where the dataset variables are stored.
	OgbnMagVariablesScope = "/ogbnmag"
)
var (
	// ParamEmbedDropoutRate adds an extra dropout to the learned embeddings.
	// This may be important because many embeddings are seen only once, so in testing many will likely never
	// have been seen, and we want the model to learn how to handle missing (zero-initialized) embeddings well.
	ParamEmbedDropoutRate = "mag_embed_dropout_rate"

	// ParamSplitEmbedTablesSize makes the embedding tables share entries across this many entries.
	// The default is 1, which means no splitting.
	ParamSplitEmbedTablesSize = "mag_split_embed_tables"
)
var (
	// BatchSize used for the sampler: the value was taken from the TF-GNN OGBN-MAG demo colab, where it was the
	// best found with some hyperparameter tuning. It leads to using almost 7GB of GPU RAM, but it works fine
	// on an Nvidia RTX 2080 Ti (with 11GB of memory).
	BatchSize = 128

	// ReuseShareableKernels will share the kernels across similar messages in the strategy tree.
	// So the authors-to-papers messages will be the same whether they come from authors of the seed papers
	// or of the co-authored papers.
	// The default is true.
	ReuseShareableKernels = true

	// KeepDegrees will also make the sampler keep the degrees of the edges as separate tensors.
	// These can be used by the GNN pooling functions to multiply the sum by the actual degree.
	KeepDegrees = true

	// IdentitySubSeeds controls whether to use an IdentitySubSeed, to allow more sharing of the kernel.
	IdentitySubSeeds = true
)
var (
	// ParamNumCheckpoints is the number of past checkpoints to keep.
	// The default is 10.
	ParamNumCheckpoints = "num_checkpoints"

	// ParamReuseKernels is the context parameter that configures whether the kernels for similar sampling rules
	// will be reused.
	ParamReuseKernels = "mag_reuse_kernels"

	// ParamIdentitySubSeeds controls whether to use an IdentitySubSeed, to allow more sharing of the kernel.
	ParamIdentitySubSeeds = "mag_identity_sub_seeds"

	// ParamDType controls the dtype to be used: either "float32" or "float16".
	ParamDType = "mag_dtype"
)
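The parameter names above are looked up as hyperparameters in the model's context. A minimal sketch of setting them before building the model, assuming the usual `context.New` constructor and `Context.SetParam` setter from GoMLX; the values are illustrative, not tuned defaults.

package main

import (
	"github.com/gomlx/gomlx/examples/ogbnmag" // Assumed import path.
	"github.com/gomlx/gomlx/ml/context"       // Assumed import path.
)

func main() {
	ctx := context.New() // Assumed context constructor.

	// Illustrative hyperparameter values -- tune them for your setup.
	ctx.SetParam(ogbnmag.ParamDType, "float16")      // Run the model in half precision.
	ctx.SetParam(ogbnmag.ParamEmbedDropoutRate, 0.1) // Extra dropout on learned embeddings.
	ctx.SetParam(ogbnmag.ParamReuseKernels, true)    // Reuse kernels across similar sampling rules.
	ctx.SetParam(ogbnmag.ParamIdentitySubSeeds, true)
	ctx.SetParam(ogbnmag.ParamNumCheckpoints, 5)

	_ = ctx // The context would then be passed to MagModelGraph / Train.
}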
var WithReplacement = false

WithReplacement indicates whether the training dataset is created with replacement.

Functions

func BuildLayerWiseCustomMetricFn added in v0.10.0

func BuildLayerWiseCustomMetricFn(backend backends.Backend, ctx *context.Context, strategy *sampler.Strategy) plots.CustomMetricFn

func BuildLayerWiseInferenceModel added in v0.10.0

func BuildLayerWiseInferenceModel(strategy *sampler.Strategy, predictions bool) func(ctx *context.Context, g *Graph) *Node

BuildLayerWiseInferenceModel returns a function that builds the OGBN-MAG GNN inference model, which expects to run inference on the whole dataset in one go.

It takes as input the sampler.Strategy, and returns a function that can be used with `context.NewExec` and executed with the values of the MAG graph. Batch size is irrelevant.

The returned function returns the predictions for all seeds shaped `Int16[NumSeedNodes]` if `predictions == true`, or the readout layer shaped `Float32[NumSeedNodes, mag.NumLabels]` (or Float16) if `predictions == false`.

func Download

func Download(baseDir string) error

Download downloads the dataset and prepares the tensors with the data into `baseDir`.

If files are already there, it's assumed they were correctly generated and nothing is done.

The data files occupy ~415MB, but to keep a copy of the raw tensors (for faster start-up), you'll need ~1GB of free disk.
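A minimal usage sketch; the import path and the `Tensor.Shape()` accessor are assumptions about the surrounding GoMLX packages:

package main

import (
	"fmt"
	"log"

	"github.com/gomlx/gomlx/examples/ogbnmag" // Assumed import path.
)

func main() {
	baseDir := "/tmp/ogbnmag" // Download caches the zip and the converted tensors here.
	if err := ogbnmag.Download(baseDir); err != nil {
		log.Fatalf("download: %v", err)
	}

	// After Download returns, the package-level tensors are populated.
	fmt.Println("papers embeddings:", ogbnmag.PapersEmbeddings.Shape()) // Expected (Float32)[736389, 128].
	fmt.Println("train split:", ogbnmag.TrainSplit.Shape())             // Expected (Int32)[629571, 1].
}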

func Eval

func Eval(backend backends.Backend, ctx *context.Context, dataDir, checkpointPath string, layerWise, skipTrain bool) error

func ExcludeOgbnMagVariablesFromSave added in v0.10.0

func ExcludeOgbnMagVariablesFromSave(ctx *context.Context, checkpoint *checkpoints.Handler)

ExcludeOgbnMagVariablesFromSave marks the OGBN-MAG variables as not to be saved by the given `checkpoint`. Since they are read separately and are constant, there is no need to repeat them at every checkpoint.
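A hedged sketch of the intended pairing: upload the frozen dataset variables, build a checkpoints handler, and exclude the dataset variables from saving. The `checkpoints.Build(ctx).Dir(...).Done()` builder and the constructors are assumptions about the GoMLX API, not part of this package.

package main

import (
	"log"

	"github.com/gomlx/gomlx/backends"               // Assumed import path.
	"github.com/gomlx/gomlx/examples/ogbnmag"       // Assumed import path.
	"github.com/gomlx/gomlx/ml/context"             // Assumed import path.
	"github.com/gomlx/gomlx/ml/context/checkpoints" // Assumed import path.
)

func main() {
	if err := ogbnmag.Download("/tmp/ogbnmag"); err != nil { // Populates the dataset tensors first.
		log.Fatal(err)
	}
	backend := backends.New() // Assumed default-backend constructor.
	ctx := context.New()      // Assumed context constructor.

	// Creates frozen dataset variables under the "/ogbnmag" scope.
	ctx = ogbnmag.UploadOgbnMagVariables(backend, ctx)

	// Assumed checkpoints builder API.
	checkpoint, err := checkpoints.Build(ctx).Dir("/tmp/ogbnmag_checkpoint").Done()
	if err != nil {
		log.Fatal(err)
	}

	// The dataset variables are large and constant: don't save them at every checkpoint.
	ogbnmag.ExcludeOgbnMagVariablesFromSave(ctx, checkpoint)
}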

func ExtractLabelsFromInput added in v0.10.0

func ExtractLabelsFromInput(inputs, labels []*tensors.Tensor) ([]*tensors.Tensor, []*tensors.Tensor)

ExtractLabelsFromInput creates the labels from the input seed indices. It returns the same inputs and the extracted labels (with mask).

func FeaturePreprocessing

func FeaturePreprocessing(ctx *context.Context, strategy *sampler.Strategy, inputs []*Node) (
	graphInputs map[string]*sampler.ValueMask[*Node], remainingInputs []*Node)

FeaturePreprocessing converts the `spec` and `inputs` given by the dataset into a map of node type name to its initial embeddings.

Most embeddings are seen only once per author/paper, so it is reasonable to expect that during validation/testing the model will see many zero-initialized embeddings.

func InitTrainingSchedule added in v0.13.0

func InitTrainingSchedule(ctx *context.Context)

InitTrainingSchedule initializes custom scheduled training. It's enabled with the hyperparameter "scheduled_training".

func LayerWiseEvaluation added in v0.10.0

func LayerWiseEvaluation(backend backends.Backend, ctx *context.Context, strategy *sampler.Strategy) (train, validation, test float64)

LayerWiseEvaluation returns the train, validation and test accuracy of the model, using layer-wise inference.

func MagModelGraph

func MagModelGraph(ctx *context.Context, spec any, inputs []*Node) []*Node

MagModelGraph builds an OGBN-MAG GNN model that runs [ParamNumGraphUpdates] graph updates along its sampling strategy, and then adds a final layer on top of the seeds.

It returns 2 tensors:

  - Predictions for all seeds, shaped `Float32[BatchSize, mag.NumLabels]` (or `Float16` or `Float64`).
  - Mask of the seeds, provided by the sampler, shaped `Bool[BatchSize]`.

func MakeDatasets

func MakeDatasets(dataDir string) (trainDS, trainEvalDS, validEvalDS, testEvalDS train.Dataset, err error)

MakeDatasets takes the directory where the downloaded data is stored and returns 4 datasets: "train", "trainEval", "validEval", "testEval".

It uses the package `ogbnmag` to download the data.
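A minimal sketch, under the same import-path assumption as the earlier examples:

package main

import (
	"log"

	"github.com/gomlx/gomlx/examples/ogbnmag" // Assumed import path.
)

func main() {
	trainDS, trainEvalDS, validEvalDS, testEvalDS, err := ogbnmag.MakeDatasets("/tmp/ogbnmag")
	if err != nil {
		log.Fatalf("MakeDatasets: %v", err)
	}
	// trainDS feeds the training loop; the three *EvalDS datasets are for periodic evaluation.
	_, _, _, _ = trainDS, trainEvalDS, validEvalDS, testEvalDS
}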

func NewSampler

func NewSampler(baseDir string) (*sampler.Sampler, error)

NewSampler will create a sampler.Sampler and configure it with the OGBN-MAG graph definition.

Usually, one will want to use NewSamplerStrategy instead, which calls this. Call this directly if crafting a custom sampling strategy.

`baseDir` is used to store a cached sampler called `sampler.bin` for faster startup. If empty, it will force re-creating the sampler.

func NewSamplerStrategy added in v0.10.0

func NewSamplerStrategy(magSampler *sampler.Sampler, batchSize int, seedIdsCandidates *tensors.Tensor) (strategy *sampler.Strategy)

NewSamplerStrategy creates a sampling strategy given the sampler, batch size and seeds candidates to sample from.

Args:

  - [magSampler] should have been created with ogbnmag.NewSampler.
  - [batchSize] is the number of seed nodes ("Papers") to sample.
  - [seedIdsCandidates] is the set of seed nodes to sample from, typically ogbnmag.TrainSplit, ogbnmag.ValidSplit or ogbnmag.TestSplit. If empty, it will sample from all possible papers.

It returns a sampler.Strategy for OGBN-MAG.
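A sketch of creating a sampler and a training strategy over the train split. It calls `Download` first, since the sampler and the `TrainSplit` tensor require the dataset tensors to be populated; the import path is assumed.

package main

import (
	"log"

	"github.com/gomlx/gomlx/examples/ogbnmag" // Assumed import path.
)

func main() {
	baseDir := "/tmp/ogbnmag"
	if err := ogbnmag.Download(baseDir); err != nil { // Fills TrainSplit, edges, etc.
		log.Fatal(err)
	}

	magSampler, err := ogbnmag.NewSampler(baseDir) // Caches "sampler.bin" under baseDir.
	if err != nil {
		log.Fatal(err)
	}

	// Sample seed papers from the training split, ogbnmag.BatchSize at a time.
	strategy := ogbnmag.NewSamplerStrategy(magSampler, ogbnmag.BatchSize, ogbnmag.TrainSplit)
	_ = strategy // Used by MagModelGraph / BuildLayerWiseInferenceModel.
}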

func PapersSeedDatasets

func PapersSeedDatasets(manager backends.Backend) (trainDS, validDS, testDS *mldata.InMemoryDataset, err error)

PapersSeedDatasets returns the train, validation and test datasets (`data.InMemoryDataset`) with only the paper seed nodes, to be used with FNNs (Feedforward Neural Networks). See [MakeDatasets] to make a dataset with sampled sub-graphs for GNNs.

The datasets can be shuffled and batched as desired.

The yielded values are papers indices, and the corresponding labels.
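A sketch of creating the seed-only datasets; `backends.New` and the `Shuffle`/`BatchSize` dataset helpers are assumptions about the GoMLX API:

package main

import (
	"log"

	"github.com/gomlx/gomlx/backends"         // Assumed import path.
	"github.com/gomlx/gomlx/examples/ogbnmag" // Assumed import path.
)

func main() {
	if err := ogbnmag.Download("/tmp/ogbnmag"); err != nil { // Populates papers, labels and splits.
		log.Fatal(err)
	}
	backend := backends.New() // Assumed default-backend constructor.

	trainDS, validDS, testDS, err := ogbnmag.PapersSeedDatasets(backend)
	if err != nil {
		log.Fatal(err)
	}
	// Assumed InMemoryDataset helpers: shuffle and batch the training seeds.
	trainDS = trainDS.Shuffle().BatchSize(128, true)
	_, _ = validDS, testDS
}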

func Train

func Train(backend backends.Backend, ctx *context.Context, dataDir, checkpointPath string, layerWiseEval, report bool, paramsSet []string) error

Train trains the GNN model based on the configuration in `ctx`.

func TrainingSchedule added in v0.13.0

func TrainingSchedule(ctx *context.Context, fromStep, toStep int) train.OnStepFn

TrainingSchedule is used to control hyperparameters during training. The parameters fromStep and toStep are the starting and final global_steps of training. It's enabled with the hyperparameter "scheduled_training".

func UploadOgbnMagVariables

func UploadOgbnMagVariables(backend backends.Backend, ctx *context.Context) *context.Context

UploadOgbnMagVariables creates frozen variables with the various static tables of the OGBN-MAG dataset, so they can be used by models.

They will be stored under the "/ogbnmag" scope.

Types

This section is empty.

Directories

Path  Synopsis
fnn   Package fnn implements a feed-forward neural network for the OGBN-MAG problem.
gnn   Package gnn implements a generic GNN modeling based on [TF-GNN MtAlbis].
