nebula-importer

Published: Jun 9, 2020 License: Apache-2.0


Nebula Importer


Introduction

Nebula Importer is a CSV import tool for Nebula Graph. It reads data from CSV files and inserts it into Nebula Graph.

Before you start Nebula Importer, ensure:

  • Nebula Graph is deployed
  • Schema is created

Currently, there are three ways to deploy Nebula Graph:

  1. nebula-docker-compose
  2. rpm package
  3. from source

The quickest way to deploy Nebula Graph is using docker-compose.

Use This Importer Tool by Source Code or Docker

After preparing the YAML configuration file and the CSV data files to be imported, you can use this tool to batch write data into Nebula Graph.

From Source code

Building Nebula Importer requires Go 1.13 or later, so make sure Go is installed on your system. The installation and configuration tutorial is referenced here.

Use git to clone the repository, change into the nebula-importer/ directory, and run make build:

$ git clone https://github.com/vesoft-inc/nebula-importer.git
$ cd nebula-importer
$ make build
$ ./nebula-importer --config /path/to/yaml/config/file

--config is used to pass in the path to the YAML configuration file.

From Docker

With Docker, you don't have to install Go locally. Just pull Nebula Importer's Docker image and mount the local configuration file and CSV data files into the container, as follows:

$ docker run --rm -ti \
    --network=host \
    -v {your-config-file}:{your-config-file} \
    -v {your-csv-data-dir}:{your-csv-data-dir} \
    vesoft/nebula-importer \
    --config {your-config-file}
  • {your-config-file}: Replace with the absolute path of the local YAML configuration file.
  • {your-csv-data-dir}: Replace with the absolute path of the local directory that holds the CSV data files.

Note: It is recommended to use relative paths in files.path. If you use a local absolute path instead, check carefully that it matches the path mapped into the Docker container.

Prepare Configuration File

Nebula Importer reads the CSV files to be imported and the Nebula Graph Server information from a YAML configuration file. Here's an example of the configuration file and the CSV file. Detailed descriptions of the configuration file are given in the following sections.

version: v1
description: example
removeTempFiles: false
  • version is a required parameter that indicates the configuration file's version; the default is v1.
  • description is an optional parameter that describes the configuration file.
  • removeTempFiles is an optional parameter that controls whether the generated temporary log and data files are removed; default value: false.
  • clientSettings holds all the Nebula Graph related configurations.
clientSettings:
  retry: 3
  concurrency: 10
  channelBufferSize: 128
  space: test
  connection:
    user: user
    password: password
    address: 192.168.8.1:3699,192.168.8.2:3699
  postStart:
    commands: |
      UPDATE CONFIGS storage:wal_ttl=3600;
      UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = true };
    afterPeriod: 8s
  preStop:
    commands: |
      UPDATE CONFIGS storage:wal_ttl=86400;
      UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = false };
  • clientSettings.retry is an optional parameter that sets the number of retries for failed nGQL statements in the Nebula Graph client.
  • clientSettings.concurrency is an optional parameter that sets the concurrency of the Nebula Graph client, i.e. the number of connections to the Nebula Graph Server; the default value is 10.
  • clientSettings.channelBufferSize is an optional parameter that sets the buffer size of the cache queue for each Nebula Graph client; the default value is 128.
  • clientSettings.space is a required parameter that specifies which space the data will be imported into. Do not import data into multiple spaces at one time, for performance's sake.
  • clientSettings.connection is a required parameter that contains the user, password and address information of the Nebula Graph Server.
  • clientSettings.postStart is an optional parameter that describes scripts to run after connecting to the Nebula Graph Server:
    • clientSettings.postStart.commands defines the commands to run after connecting to the Nebula Graph Server.
    • clientSettings.postStart.afterPeriod defines how long to wait after running the above commands before inserting data into the Nebula Graph Server.
  • clientSettings.preStop is an optional parameter that describes scripts to run before disconnecting from the Nebula Graph Server.
    • clientSettings.preStop.commands defines the commands to run before disconnecting from the Nebula Graph Server.

Files

The log and data file related configurations are:

  • logPath: Optional. Specifies the log file path used when importing data; the default is /tmp/nebula-importer-{timestamp}.log.
  • files: Required. An array that configures the different CSV files. You can also import data from an HTTP link by putting the link in the file path.
logPath: ./err/test.log
files:
  - path: ./edge.csv
    failDataPath: ./err/edge.csv
    batchSize: 100
    type: csv
    csv:
      withHeader: false
      withLabel: false
      delimiter: ","

CSV Data Files

One CSV file can store only one type of vertex or edge. Vertices and edges with different schemas should be stored in separate files.

  • path: Required. Specifies the path where the CSV data file is stored. If a relative path is used, it is resolved relative to the directory of the current configuration file.

  • failDataPath: Required. Specifies the file to which failed insertions are written, so that the failed data can be fixed and re-imported later.

  • batchSize: Optional. Specifies the batch size of the inserted data, the default value is 128.

  • type & csv: Required. Specifies the file type. Currently, only CSV is supported. You can specify whether the CSV file includes a header and an insert/delete (+/-) label column.

    • withHeader: The default value is false, the format of the header is described below.
    • withLabel: The default value is false, the format of the label is described below.
    • delimiter: Optional. The delimiter to separate different columns, default value is ",".
  • schema: Required. Describes the metadata information of the current data file. The schema.type has only two values: vertex and edge.

    • When type is specified as vertex, details should be described in the vertex field.
    • When type is specified as edge, details should be described in edge field.
schema.vertex
schema:
  type: vertex
  vertex:
    tags:
      - name: student
        props:
          - name: name
            type: string
          - name: age
            type: int
          - name: gender
            type: string

schema.vertex is a required parameter that describes the schema information such as tags of the inserted vertex. Since sometimes one vertex contains several tags, different tags should be given in the schema.vertex.tags array.

Each tag contains the following two properties:

  • name: The tag's name.
  • props: The tag's properties. Each property contains the following two fields:
    • name: The property name, the same as the tag property in Nebula Graph.
    • type: The property type; bool, int, float, double, timestamp and string are currently supported.

Note: The order of properties in the above props must be the same as that of the corresponding data in the CSV data file.
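To make this ordering requirement concrete, here is a minimal Python sketch (illustrative only, not the importer's actual code; the build_insert helper and the sample values are assumptions) showing how props map to CSV columns positionally when building an nGQL INSERT statement:

```python
# Illustrative only: the first CSV column is the vertex ID; the
# remaining columns map to the props list *by position*, which is
# why the YAML props order must match the CSV column order.
props = [("name", "string"), ("age", "int"), ("gender", "string")]

def build_insert(tag, props, row):
    vid, values = row[0], row[1:]
    rendered = []
    for (pname, ptype), value in zip(props, values):
        # String-typed props are quoted in nGQL; numeric ones are not.
        rendered.append(f'"{value}"' if ptype == "string" else value)
    names = ", ".join(pname for pname, _ in props)
    return f"INSERT VERTEX {tag}({names}) VALUES {vid}:({', '.join(rendered)})"

stmt = build_insert("student", props, ["200", "Monica", "16", "female"])
# stmt: INSERT VERTEX student(name, age, gender) VALUES 200:("Monica", 16, "female")
```

If the CSV columns were reordered without updating props, the values would silently land in the wrong properties.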

schema.edge
schema:
  type: edge
  edge:
    name: choose
    withRanking: false
    props:
      - name: grade
        type: int

schema.edge is a required parameter that describes the schema information of the inserted edge. Each edge contains the following three properties:

  • name: The edge's name.
  • withRanking: Specifies whether the edge has a rank value, which distinguishes edges that share the same edge type, source vertex, and destination vertex.
  • props: Same as for tags above. Note that the property order here must match that of the corresponding data in the CSV data file.

For details of all the configurations, please refer to the Configuration Reference.
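Putting the fragments above together, a minimal end-to-end configuration might look like this (the address, credentials and paths are placeholders):

```yaml
version: v1
description: example
removeTempFiles: false
clientSettings:
  retry: 3
  concurrency: 10
  channelBufferSize: 128
  space: test
  connection:
    user: user
    password: password
    address: 192.168.8.1:3699
logPath: ./err/test.log
files:
  - path: ./choose.csv
    failDataPath: ./err/choose.csv
    batchSize: 100
    type: csv
    csv:
      withHeader: false
      withLabel: false
      delimiter: ","
    schema:
      type: edge
      edge:
        name: choose
        withRanking: false
        props:
          - name: grade
            type: int
```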

About the CSV Header

Usually, you can add some descriptions in the first row of the CSV file to specify each column's type.

Data Without Header

If csv.withHeader is set to false, the CSV file contains only the data (no descriptive first row). Examples of vertices and edges are as follows:

Vertex Example

Take tag course for example:

101,Math,3,No5
102,English,6,No11

The first column is the vertex ID, the following three columns are the properties, corresponding to the course.name, course.credits and building.name in the configuration file. (See vertex.tags.props).

Edge Example

Take edge type choose for example:

200,101,5
200,102,3

The first two columns are the source vertex ID and the destination vertex ID; the third column is the property, corresponding to choose.grade in the configuration file. (If a rank value is included, the third column holds the rank, and the properties follow after the rank column.)

CSV Data Example

Two CSV data formats are supported: without a header line, as described above, and with a header line, as described below.

With Header Line

If the csv.withHeader is set to true, the first row of the CSV file is the header. The format of each column is <tag_name/edge_name>.<prop_name>:<prop_type>:

  • <tag_name/edge_name> is the name of the vertex or edge.
  • <prop_name> is the property name.
  • <prop_type> is the property type. It can be bool, int, float, double, string and timestamp, the default type is string.

Besides the <tag_name/edge_name>.<prop_name>:<prop_type> format, the following header keywords carry special semantics:

  • :VID is the vertex ID.
  • :SRC_VID is the source vertex ID.
  • :DST_VID is the destination vertex ID.
  • :RANK is the rank of the edge.
  • :IGNORE indicates that this column will be ignored.
  • :LABEL indicates the column that marks each row for insertion (+) or deletion (-).

If the CSV file contains the header, the importer parses the schema of each row according to the header and ignores the props in YAML.
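As an illustration, here is a small Python sketch (hypothetical, not the importer's code; parse_header_column is an assumed helper name) that splits one header column into its parts:

```python
# Hypothetical parser for one header column of the form
# <tag_name/edge_name>.<prop_name>:<prop_type>; keyword columns
# such as :VID or :IGNORE start with ":" and have no name part.
def parse_header_column(col):
    if col.startswith(":"):
        return {"keyword": col}          # e.g. :VID, :LABEL, :IGNORE
    name, _, ptype = col.partition(":")
    owner, _, prop = name.partition(".")
    # <prop_type> defaults to string when omitted.
    return {"owner": owner, "prop": prop, "type": ptype or "string"}

print(parse_header_column("course.credits:int"))
print(parse_header_column("building.name"))
print(parse_header_column(":VID"))
```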

Example of Vertex CSV File With Header

Take the vertex course as an example:

:LABEL,:VID,course.name,building.name:string,:IGNORE,course.credits:int
+,"hash(""Math"")",Math,No5,1,3
+,"uuid(""English"")",English,"No11 B\",2,6
:LABEL (Optional)
:LABEL,
+,
-,

Indicates whether the row is an insert (+) or delete (-) operation.

:VID (Required)
:VID
123,
"hash(""Math"")",
"uuid(""English"")"

In the :VID column, in addition to the common integer values (such as 123), you can also use the two built-in functions hash and uuid to automatically calculate the VID of the generated vertex (for example, hash("Math")).

Note that the double quotes (") are escaped in the CSV file. For example, hash("Math") should be written as "hash(""Math"")".
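This is standard CSV quoting; for example, Python's csv module produces exactly the doubled-quote form shown above:

```python
import csv, io

# Write one row whose second field contains double quotes; the csv
# writer quotes the field and doubles the embedded quote characters.
buf = io.StringIO()
csv.writer(buf).writerow(["+", 'hash("Math")', "Math", "No5", "1", "3"])
line = buf.getvalue().strip()
print(line)  # +,"hash(""Math"")",Math,No5,1,3
```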

Other Properties
course.name,:IGNORE,course.credits:int
Math,1,3
English,2,6

:IGNORE specifies columns to ignore when importing data. All columns except the :LABEL column can appear in any order, so for a large CSV file you can flexibly select the columns you need via the header.

Because a VERTEX can contain multiple TAGs, each property in the header must be prefixed with its TAG name (for example, it must be course.credits, not the abbreviated credits).

Example of Edge CSV File With Header

Take edge follow for example:

:DST_VID,follow.likeness:double,:SRC_VID,:RANK
201,92.5,200,0
200,85.6,201,1

In the preceding example, the source vertex ID of the edge is :SRC_VID (column 3), the destination vertex ID is :DST_VID (column 1), the property on the edge is follow.likeness:double (column 2), and the rank of the edge is :RANK (column 4; the default value is 0 if it is not specified).
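A tiny Python sketch (illustrative only) of why column order stops mattering once a header is present: fields can be looked up by header name instead of by position:

```python
# Illustrative: pair each header keyword with its value for one row,
# then fetch fields by name rather than by fixed column position.
header = [":DST_VID", "follow.likeness:double", ":SRC_VID", ":RANK"]
row = ["201", "92.5", "200", "0"]
by_col = dict(zip(header, row))
src = by_col[":SRC_VID"]    # "200"
dst = by_col[":DST_VID"]    # "201"
rank = by_col[":RANK"]      # "0"
```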

Label (Optional)
  • + means inserting
  • - means deleting

As with vertices, you can specify a label column in the edge CSV file.

TODO

  • Summary statistics of response
  • Write error log and data
  • Configure file
  • Concurrent request to Graph server
  • Create space and tag/edge automatically
  • Configure retry option for Nebula client
  • Support edge rank
  • Support label for add/delete(+/-) in first column
  • Support column header in the first line
  • Support vid partition
  • Support multi-tags insertion in vertex
  • Provide docker image and usage
  • Make header adapt to props order defined in the schema of the configuration file
  • Handle string column in an elegant way
  • Update concurrency and batch size online
  • Count duplicate vids
  • Support VID generation automatically
  • Output logs to file
