Documentation ¶
Overview ¶
Package ogbnmag provides a `Download` function for the corresponding dataset, along with some dataset tools.
See https://ogb.stanford.edu/ for all Open Graph Benchmark (OGB) datasets. See https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag for the `ogbn-mag` dataset description.
The task is to predict the venue of publication of a paper, given its relations.
Index ¶
- Variables
- func BuildLayerWiseCustomMetricFn(backend backends.Backend, ctx *context.Context, strategy *sampler.Strategy) plots.CustomMetricFn
- func BuildLayerWiseInferenceModel(strategy *sampler.Strategy, predictions bool) func(ctx *context.Context, g *Graph) *Node
- func Download(baseDir string) error
- func Eval(backend backends.Backend, ctx *context.Context, dataDir, checkpointPath string, ...) error
- func ExcludeOgbnMagVariablesFromSave(ctx *context.Context, checkpoint *checkpoints.Handler)
- func ExtractLabelsFromInput(inputs, labels []*tensors.Tensor) ([]*tensors.Tensor, []*tensors.Tensor)
- func FeaturePreprocessing(ctx *context.Context, strategy *sampler.Strategy, inputs []*Node) (graphInputs map[string]*sampler.ValueMask[*Node], remainingInputs []*Node)
- func InitTrainingSchedule(ctx *context.Context)
- func LayerWiseEvaluation(backend backends.Backend, ctx *context.Context, strategy *sampler.Strategy) (train, validation, test float64)
- func MagModelGraph(ctx *context.Context, spec any, inputs []*Node) []*Node
- func MakeDatasets(dataDir string) (trainDS, trainEvalDS, validEvalDS, testEvalDS train.Dataset, err error)
- func NewSampler(baseDir string) (*sampler.Sampler, error)
- func NewSamplerStrategy(magSampler *sampler.Sampler, batchSize int, seedIdsCandidates *tensors.Tensor) (strategy *sampler.Strategy)
- func PapersSeedDatasets(manager backends.Backend) (trainDS, validDS, testDS *mldata.InMemoryDataset, err error)
- func Train(backend backends.Backend, ctx *context.Context, dataDir, checkpointPath string, ...) error
- func TrainingSchedule(ctx *context.Context, fromStep, toStep int) train.OnStepFn
- func UploadOgbnMagVariables(backend backends.Backend, ctx *context.Context) *context.Context
Constants ¶
This section is empty.
Variables ¶
var (
	ZipURL         = "http://snap.stanford.edu/ogb/data/nodeproppred/mag.zip"
	ZipFile        = "mag.zip"
	ZipChecksum    = "2afe62ead87f2c301a7398796991d347db85b2d01c5442c95169372bf5a9fca4"
	DownloadSubdir = "downloads"
)
var (
	NumPapers        = 736389
	NumAuthors       = 1134649
	NumInstitutions  = 8740
	NumFieldsOfStudy = 59965

	// NumLabels is the number of labels for the papers. These correspond to publication venues.
	NumLabels = 349

	// PaperEmbeddingsSize is the size of the node features given.
	PaperEmbeddingsSize = 128

	// PapersEmbeddings contains the embeddings, shaped `(Float32)[NumPapers, PaperEmbeddingsSize]`
	PapersEmbeddings *tensors.Tensor

	// PapersYears for each paper, where year starts in 2000 (so 10 corresponds to 2010). Shaped `(Int16)[NumPapers, 1]`.
	PapersYears *tensors.Tensor

	// PapersLabels for each paper, values from 0 to 348 (so 349 in total). Shaped `(Int16)[NumPapers, 1]`.
	PapersLabels *tensors.Tensor

	// TrainSplit, ValidSplit, TestSplit splits of the data.
	// These are indices to papers, values from `[0, NumPapers-1]`. Shaped `(Int32)[n, 1]`.
	// They have 629571, 41939 and 64879 elements each.
	TrainSplit, ValidSplit, TestSplit *tensors.Tensor

	// EdgesAffiliatedWith `(Int32)[1043998, 2]`, pairs with (author_id, institution_id).
	//
	// Thousands of institutions with only one affiliated author, and an exponential decreasing amount
	// of institutions with more affiliated authors, all the way to one institution that has 27K authors.
	//
	// Most authors are affiliated to 1 institution only, and an exponentially decreasing number of affiliations up
	// to one author with 47 affiliations. ~300K authors with no affiliation.
	EdgesAffiliatedWith *tensors.Tensor

	// EdgesWrites `(Int32)[7145660, 2]`, pairs with (author_id, paper_id).
	//
	// Every author writes at least one paper, and every paper has at least one author.
	//
	// Most authors (~600K) wrote one paper, with a substantial tail with thousands of authors having written hundreds of
	// papers, and in the extreme one author wrote 1046 papers.
	//
	// Papers are written on average by 3 authors (140k papers), with a bell-curve distribution with a long
	// tail, with a dozen of papers written by thousands of authors (5050 authors in one case).
	EdgesWrites *tensors.Tensor

	// EdgesCites `(Int32)[5416271, 2]`, pairs with (paper_id, paper_id).
	//
	// ~120K papers don't cite anyone, 95K papers cite only one paper, and a long exponential decreasing tail,
	// in the extreme a paper cites 609 other papers.
	//
	// ~100K papers are never cited, 155K are cited once, and again a long exponential decreasing tail, in the extreme
	// one paper is cited by 4744 other papers.
	EdgesCites *tensors.Tensor

	// EdgesHasTopic `(Int32)[7505078, 2]`, pairs with (paper_id, topic_id).
	//
	// All papers have at least one "field of study" topic. Most (550K) papers have 12 or 13 topics. At most a paper has
	// 14 topics.
	//
	// All "fields of study" are associated to at least one topic. ~17K (out of ~60K) have only one paper associated.
	// ~50%+ topics have < 10 papers associated. Some ~30% have < 1000 papers associated. A handful have 10s of
	// thousands papers associated, and there is one topic that is associated to everyone.
	EdgesHasTopic *tensors.Tensor

	// Counts to the various edge types.
	// These are all shaped `(Int32)[NumElements, 1]` for each of their entities.
	CountAuthorsAffiliations, CountInstitutionsAffiliations *tensors.Tensor
	CountPapersCites, CountPapersIsCited                    *tensors.Tensor
	CountPapersFieldsOfStudy, CountFieldsOfStudyPapers      *tensors.Tensor
	CountAuthorsPapers, CountPapersAuthors                  *tensors.Tensor
)
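The `Count*` variables above hold per-node degree counts for each edge type. As a rough illustration of what such counts mean, the sketch below derives them from a list of (source, target) pairs. It uses plain Go slices rather than `tensors.Tensor`, and the function name `degreeCounts` and the toy edges are hypothetical, not part of this package.

```go
package main

import "fmt"

// degreeCounts returns, for an edge list of (source, target) pairs, how many
// edges touch each source node and each target node.
// numSources and numTargets are the sizes of the two node sets.
func degreeCounts(edges [][2]int32, numSources, numTargets int) (bySource, byTarget []int32) {
	bySource = make([]int32, numSources)
	byTarget = make([]int32, numTargets)
	for _, e := range edges {
		bySource[e[0]]++
		byTarget[e[1]]++
	}
	return bySource, byTarget
}

func main() {
	// Toy (author_id, paper_id) pairs, standing in for EdgesWrites.
	writes := [][2]int32{{0, 0}, {0, 1}, {1, 1}, {2, 1}}
	papersPerAuthor, authorsPerPaper := degreeCounts(writes, 3, 2)
	fmt.Println(papersPerAuthor) // [2 1 1]
	fmt.Println(authorsPerPaper) // [1 3]
}
```

In this toy example `papersPerAuthor` plays the role of `CountAuthorsPapers` and `authorsPerPaper` the role of `CountPapersAuthors`.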
var (
	// OgbnMagVariablesRef maps variable names to a reference to their values.
	// We keep a reference to the values because the actual values change during the call to `Download()`
	//
	// They will be stored under the "/ogbnmag" scope.
	OgbnMagVariablesRef = map[string]**tensors.Tensor{
		"PapersEmbeddings":              &PapersEmbeddings,
		"PapersLabels":                  &PapersLabels,
		"EdgesAffiliatedWith":           &EdgesAffiliatedWith,
		"EdgesWrites":                   &EdgesWrites,
		"EdgesCites":                    &EdgesCites,
		"EdgesHasTopic":                 &EdgesHasTopic,
		"CountAuthorsAffiliations":      &CountAuthorsAffiliations,
		"CountInstitutionsAffiliations": &CountInstitutionsAffiliations,
		"CountPapersCites":              &CountPapersCites,
		"CountPapersIsCited":            &CountPapersIsCited,
		"CountPapersFieldsOfStudy":      &CountPapersFieldsOfStudy,
		"CountFieldsOfStudyPapers":      &CountFieldsOfStudyPapers,
		"CountAuthorsPapers":            &CountAuthorsPapers,
		"CountPapersAuthors":            &CountPapersAuthors,
	}

	// OgbnMagVariablesScope is the absolute scope where the dataset variables are stored.
	OgbnMagVariablesScope = "/ogbnmag"
)
var (
	// ParamEmbedDropoutRate adds an extra dropout to learning embeddings.
	// This may be important because many embeddings are seen only once, so likely in testing many will have never
	// been seen, and we want the model to learn how to handle lack of embeddings (zero initialized) well.
	ParamEmbedDropoutRate = "mag_embed_dropout_rate"

	// ParamSplitEmbedTablesSize will make embed tables share entries across these many entries.
	// Default is 1, which means no splitting.
	ParamSplitEmbedTablesSize = "mag_split_embed_tables"
)
var (
	// BatchSize used for the sampler: the value was taken from the TF-GNN OGBN-MAG demo colab, and it was the
	// best found with some hyperparameter tuning. It does lead to using almost 7Gb of the GPU ram ...
	// but it works fine in an Nvidia RTX 2080 Ti (with 11Gb memory).
	BatchSize = 128

	// So the authors to papers messages will be the same if it comes from authors of the seed papers,
	// or of the coauthored-papers.
	// Default is true.
	ReuseShareableKernels = true

	// KeepDegrees will also make sampler keep the degrees of the edges as separate tensors.
	// These can be used by the GNN pooling functions to multiply the sum to the actual degree.
	KeepDegrees = true

	// IdentitySubSeeds controls whether to use an IdentitySubSeed, to allow more sharing of the kernel.
	IdentitySubSeeds = true
)
var (
	// ParamNumCheckpoints is the number of past checkpoints to keep.
	// The default is 10.
	ParamNumCheckpoints = "num_checkpoints"

	// ParamReuseKernels context parameter configs whether the kernels for similar sampling rules will be reused.
	ParamReuseKernels = "mag_reuse_kernels"

	// ParamIdentitySubSeeds controls whether to use an IdentitySubSeed, to allow more sharing of the kernel.
	ParamIdentitySubSeeds = "mag_identity_sub_seeds"

	// ParamDType controls the dtype to be used: either "float32" or "float16".
	ParamDType = "mag_dtype"
)
var NanLogger *nanlogger.NanLogger
var WithReplacement = false
WithReplacement indicates whether the training dataset is created with replacement.
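To illustrate the general difference between the two modes, here is a minimal sketch of drawing a batch of seed indices with and without replacement. It is a generic illustration using `math/rand`, not this package's sampler; `newRNG` and `sampleSeeds` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newRNG returns a random source with a fixed seed, so runs are repeatable.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// sampleSeeds draws batchSize entries from candidates.
// With replacement, the same position may be drawn more than once.
// Without replacement, each position is drawn at most once, so the batch
// holds at most len(candidates) entries.
func sampleSeeds(rng *rand.Rand, candidates []int32, batchSize int, withReplacement bool) []int32 {
	if len(candidates) == 0 || batchSize <= 0 {
		return nil
	}
	if withReplacement {
		batch := make([]int32, batchSize)
		for i := range batch {
			batch[i] = candidates[rng.Intn(len(candidates))]
		}
		return batch
	}
	if batchSize > len(candidates) {
		batchSize = len(candidates)
	}
	perm := rng.Perm(len(candidates)) // Random ordering of all positions.
	batch := make([]int32, batchSize)
	for i := range batch {
		batch[i] = candidates[perm[i]]
	}
	return batch
}

func main() {
	rng := newRNG(1)
	candidates := []int32{11, 22, 33, 44, 55} // Toy stand-in for a split of paper indices.
	fmt.Println(sampleSeeds(rng, candidates, 3, false))
	fmt.Println(sampleSeeds(rng, candidates, 3, true))
}
```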
Functions ¶
func BuildLayerWiseCustomMetricFn ¶ added in v0.10.0
func BuildLayerWiseInferenceModel ¶ added in v0.10.0
func BuildLayerWiseInferenceModel(strategy *sampler.Strategy, predictions bool) func(ctx *context.Context, g *Graph) *Node
BuildLayerWiseInferenceModel returns a function that builds the OGBN-MAG GNN inference model, that expects to run inference on the whole dataset in one go.
It takes as input the sampler.Strategy, and returns a function that can be used with `context.NewExec` and executed with the values of the MAG graph. Batch size is irrelevant.
The returned function returns the predictions for all seeds shaped `Int16[NumSeedNodes]` if `predictions == true`, or the readout layer shaped `Float32[NumSeedNodes, mag.NumLabels]` (or Float16) if `predictions == false`.
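The relationship between the two output forms can be sketched as follows, assuming (this is an assumption, not confirmed by the text above) that the predictions are the arg-max over the label axis of the readout layer. The sketch uses plain Go slices instead of graph nodes, and `argmaxRows` is a hypothetical helper.

```go
package main

import "fmt"

// argmaxRows returns, for each row of logits, the index of its largest value.
// Ties are resolved in favor of the lowest index.
func argmaxRows(logits [][]float32) []int16 {
	predictions := make([]int16, len(logits))
	for i, row := range logits {
		best := 0
		for j, v := range row {
			if v > row[best] {
				best = j
			}
		}
		predictions[i] = int16(best)
	}
	return predictions
}

func main() {
	// Two seed nodes and three labels; the real readout has NumLabels (349) columns.
	logits := [][]float32{
		{0.1, 2.0, -1.0},
		{3.0, 0.0, 0.0},
	}
	fmt.Println(argmaxRows(logits)) // [1 0]
}
```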
func Download ¶
Download downloads the dataset and prepares the tensors with the data in `baseDir`.
If files are already there, it's assumed they were correctly generated and nothing is done.
The data files occupy ~415 MB, but to keep a copy of the raw tensors (for faster startup), you'll need ~1 GB of free disk space.
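A downloaded archive can be checked against a checksum such as `ZipChecksum` before it is unpacked. The sketch below assumes the checksum is a hex-encoded SHA-256 digest (its 64 hexadecimal characters suggest so, but that is an assumption), and the helper names and the file path in `main` are hypothetical; it does not show how `Download` itself verifies the file.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

// digestReader streams r through SHA-256 and returns the hex-encoded digest.
func digestReader(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// hexDigest is a convenience wrapper for in-memory data.
func hexDigest(data []byte) string {
	digest, _ := digestReader(bytes.NewReader(data)) // Reading from memory does not fail.
	return digest
}

// verifyFile returns an error if the file at path does not hash to wantHex.
// The file is streamed, so large archives are not loaded into memory at once.
func verifyFile(path, wantHex string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	got, err := digestReader(f)
	if err != nil {
		return err
	}
	if !strings.EqualFold(got, wantHex) {
		return fmt.Errorf("checksum mismatch for %q: got %s, want %s", path, got, wantHex)
	}
	return nil
}

func main() {
	fmt.Println(hexDigest([]byte("abc")))

	// Hypothetical location: the file layout under baseDir is an assumption.
	err := verifyFile("downloads/mag.zip",
		"2afe62ead87f2c301a7398796991d347db85b2d01c5442c95169372bf5a9fca4")
	fmt.Println(err) // Non-nil if the file is missing or its digest differs.
}
```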
func ExcludeOgbnMagVariablesFromSave ¶ added in v0.10.0
func ExcludeOgbnMagVariablesFromSave(ctx *context.Context, checkpoint *checkpoints.Handler)
ExcludeOgbnMagVariablesFromSave marks the OGBN-MAG variables as not to be saved by the given `checkpoint`. Since they are read separately and are constant, there is no need to repeat them at every checkpoint.
func ExtractLabelsFromInput ¶ added in v0.10.0
func ExtractLabelsFromInput(inputs, labels []*tensors.Tensor) ([]*tensors.Tensor, []*tensors.Tensor)
ExtractLabelsFromInput creates the labels from the input seed indices. It returns the same inputs and the extracted labels (with mask).
func FeaturePreprocessing ¶
func FeaturePreprocessing(ctx *context.Context, strategy *sampler.Strategy, inputs []*Node) ( graphInputs map[string]*sampler.ValueMask[*Node], remainingInputs []*Node)
FeaturePreprocessing converts the `strategy` and `inputs` given by the dataset into a map of node type name to its initial embeddings.
author/paper, so it is reasonable to expect that during validation/testing it will see many embeddings zero initialized.
func InitTrainingSchedule ¶ added in v0.13.0
InitTrainingSchedule initializes custom scheduled training. It's enabled with the hyperparameter "scheduled_training".
func LayerWiseEvaluation ¶ added in v0.10.0
func LayerWiseEvaluation(backend backends.Backend, ctx *context.Context, strategy *sampler.Strategy) (train, validation, test float64)
LayerWiseEvaluation returns the train, validation and test accuracy of the model, using layer-wise inference.
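As a rough picture of what a per-split accuracy means, the sketch below computes the fraction of correctly predicted papers over a list of paper indices, such as those in `TrainSplit`, `ValidSplit` or `TestSplit`. It works on plain Go slices, and `splitAccuracy` is a hypothetical helper, not how this function is implemented.

```go
package main

import "fmt"

// splitAccuracy returns the fraction of papers listed in split whose predicted
// label matches the true label. It returns 0 for an empty split.
// predictions and labels are indexed by paper id.
func splitAccuracy(predictions, labels []int16, split []int32) float64 {
	if len(split) == 0 {
		return 0
	}
	correct := 0
	for _, paperIdx := range split {
		if predictions[paperIdx] == labels[paperIdx] {
			correct++
		}
	}
	return float64(correct) / float64(len(split))
}

func main() {
	labels := []int16{3, 1, 4, 1, 5}
	predictions := []int16{3, 1, 0, 1, 2}
	fmt.Println(splitAccuracy(predictions, labels, []int32{0, 1, 2})) // 2 of 3 correct
	fmt.Println(splitAccuracy(predictions, labels, []int32{3, 4}))    // 0.5
}
```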
func MagModelGraph ¶
MagModelGraph builds an OGBN-MAG GNN model that sends [ParamNumGraphUpdates] along its sampling strategy, and then adds a final layer on top of the seeds.
It returns 2 tensors:
* Predictions for all seeds shaped `Float32[BatchSize, mag.NumLabels]` (or `Float16` or `Float64`).
* Mask of the seeds, provided by the sampler, shaped `Bool[BatchSize]`.
func MakeDatasets ¶
func MakeDatasets(dataDir string) (trainDS, trainEvalDS, validEvalDS, testEvalDS train.Dataset, err error)
MakeDatasets takes a directory in which to store the downloaded data and returns 4 datasets: "train", "trainEval", "validEval", "testEval".
It uses the package `ogbnmag` to download the data.
func NewSampler ¶
NewSampler will create a sampler.Sampler and configure it with the OGBN-MAG graph definition.
Usually, one will want to use NewSamplerStrategy instead, which calls this. Call this directly only when crafting a custom sampling strategy.
`baseDir` is used to store a cached sampler called `sampler.bin` for faster startup. If empty, it will force re-creating the sampler.
func NewSamplerStrategy ¶ added in v0.10.0
func NewSamplerStrategy(magSampler *sampler.Sampler, batchSize int, seedIdsCandidates *tensors.Tensor) (strategy *sampler.Strategy)
NewSamplerStrategy creates a sampling strategy given the sampler, batch size and seeds candidates to sample from.
Args:
- [magSampler] should have been created with ogbnmag.NewSampler.
- [batchSize] is the number of seed nodes ("Papers") to sample.
- [seedIdsCandidates] is the set of seed nodes to sample from, typically ogbnmag.TrainSplit, ogbnmag.ValidSplit or ogbnmag.TestSplit. If empty it will sample from all possible papers.
It returns a sampler.Strategy for OGBN-MAG.
func PapersSeedDatasets ¶
func PapersSeedDatasets(manager backends.Backend) (trainDS, validDS, testDS *mldata.InMemoryDataset, err error)
PapersSeedDatasets returns the train, validation and test datasets (`data.InMemoryDataset`) with only the papers seed nodes, to be used with FNN (Feedforward Neural Networks). See [MakeDatasets] to make a dataset with sampled sub-graphs for GNNs.
The datasets can be shuffled and batched as desired.
The yielded values are paper indices and the corresponding labels.
func Train ¶
func Train(backend backends.Backend, ctx *context.Context, dataDir, checkpointPath string, layerWiseEval, report bool, paramsSet []string) error
Train trains the GNN model based on the configuration in `ctx`.
func TrainingSchedule ¶ added in v0.13.0
TrainingSchedule is used to control hyperparameters during training. The parameters fromStep and toStep are the starting and final global_steps of training. It's enabled with the hyperparameter "scheduled_training".
func UploadOgbnMagVariables ¶
UploadOgbnMagVariables creates frozen variables with the various static tables of the OGBN-MAG dataset, so it can be used by models.
They will be stored under the "/ogbnmag" scope (see OgbnMagVariablesScope).
Types ¶
This section is empty.
Directories ¶
Path | Synopsis |
---|---|
fnn | Package fnn implements a feed-forward neural network for the OGBN-MAG problem. |
gnn | Package gnn implements a generic GNN modeling based on [TF-GNN MtAlbis]. |