hdfs

package module
v1.1.3
Published: Jun 22, 2018 License: MIT Imports: 17 Imported by: 0

README

HDFS for Go

This is a native Go client for HDFS. It connects directly to the namenode using the protocol buffers API.

It tries to be idiomatic by aping the stdlib os package where possible, and implements the interfaces and types from it, including os.FileInfo and os.PathError.

Here's what it looks like in action:

client, _ := hdfs.New("namenode:8020")

file, _ := client.Open("/mobydick.txt")

buf := make([]byte, 59)
file.ReadAt(buf, 48847)

fmt.Println(string(buf))
// => Abominable are the tumblers into which he pours his poison.

For complete documentation, check out the Godoc.

The hdfs Binary

Along with the library, this repo contains a commandline client for HDFS. Like the library, its primary aim is to be idiomatic, by enabling your favorite unix verbs:

$ hdfs --help
Usage: hdfs COMMAND
The flags available are a subset of the POSIX ones, but should behave similarly.

Valid commands:
  ls [-lah] [FILE]...
  rm [-rf] FILE...
  mv [-fT] SOURCE... DEST
  mkdir [-p] FILE...
  touch [-amc] FILE...
  chmod [-R] OCTAL-MODE FILE...
  chown [-R] OWNER[:GROUP] FILE...
  cat SOURCE...
  head [-n LINES | -c BYTES] SOURCE...
  tail [-n LINES | -c BYTES] SOURCE...
  du [-sh] FILE...
  checksum FILE...
  get SOURCE [DEST]
  getmerge SOURCE DEST
  put SOURCE DEST

Since it doesn't have to wait for the JVM to start up, it's also a lot faster than hadoop fs:

$ time hadoop fs -ls / > /dev/null

real  0m2.218s
user  0m2.500s
sys 0m0.376s

$ time hdfs ls / > /dev/null

real  0m0.015s
user  0m0.004s
sys 0m0.004s

Best of all, it comes with bash tab completion for paths!

Installing the library

To install the library, once you have Go all set up:

$ go get -u github.com/colinmarc/hdfs

Installing the commandline client

Grab a tarball from the releases page and extract it wherever you like.

You'll want to add the following line to your .bashrc or .profile:

export HADOOP_NAMENODE="namenode:8020"

To install tab completion globally on linux, copy or link the bash_completion file which comes with the tarball into the right place:

ln -sT bash_completion /etc/bash_completion.d/gohdfs

By default, the HDFS user is set to the currently-logged-in user. You can override this in your .bashrc or .profile:

export HADOOP_USER_NAME=username

Compatibility

This library uses "Version 9" of the HDFS protocol, which means it should work with Hadoop distributions based on 2.2.x and above. The tests run against CDH 5.x and HDP 2.x.

Acknowledgements

This library is heavily indebted to snakebite.

Documentation

Overview

Package hdfs provides a native, idiomatic interface to HDFS. Where possible, it mimics the functionality and signatures of the standard os package.

Example:

client, _ := hdfs.New("namenode:8020")

file, _ := client.Open("/mobydick.txt")

buf := make([]byte, 59)
file.ReadAt(buf, 48847)

fmt.Println(string(buf))
// => Abominable are the tumblers into which he pours his poison.

Constants

This section is empty.

Variables

var StatFsError = errors.New("Failed to get HDFS usage")

Functions

func Username added in v1.0.0

func Username() (string, error)

Username returns the value of HADOOP_USER_NAME in the environment, or the current system user if it is not set.

Types

type Client

type Client struct {
	// contains filtered or unexported fields
}

A Client represents a connection to an HDFS cluster.

func New

func New(address string) (*Client, error)

New returns a connected Client, or an error if it can't connect. The user will be the user the code is running under. If address is an empty string, it will try to get the namenode address from the hadoop configuration files.
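
A minimal sketch of the fallback form, which reads the namenode address from the hadoop configuration files (HADOOP_CONF_DIR or ${HADOOP_HOME}/conf); error handling is included since the connection can fail:

client, err := hdfs.New("")
if err != nil {
	log.Fatal(err)
}
defer client.Close()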

func NewClient added in v1.1.0

func NewClient(options ClientOptions) (*Client, error)

NewClient returns a connected Client for the given options, or an error if the client could not be created.
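
For example, a sketch passing one or more namenode addresses and an explicit user via ClientOptions (defined below); the addresses and username here are hypothetical:

client, err := hdfs.NewClient(hdfs.ClientOptions{
	Addresses: []string{"namenode-1:8020", "namenode-2:8020"},
	User:      "hdfs",
})
if err != nil {
	log.Fatal(err)
}
defer client.Close()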

func NewForConnection deprecated added in v1.0.2

func NewForConnection(namenode *rpc.NamenodeConnection) *Client

NewForConnection returns a Client with the specified underlying rpc.NamenodeConnection. You can use rpc.WrapNamenodeConnection to wrap your own net.Conn.

Deprecated: Use NewClient with ClientOptions instead.

func NewForUser deprecated

func NewForUser(address string, user string) (*Client, error)

NewForUser returns a connected Client with the user specified, or an error if it can't connect.

Deprecated: Use NewClient with ClientOptions instead.

func (*Client) Append added in v1.0.0

func (c *Client) Append(name string) (*FileWriter, error)

Append opens an existing file in HDFS and returns an io.WriteCloser for writing to it. Because of the way that HDFS writes are buffered and acknowledged asynchronously, it is very important that Close is called after all data has been written.
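
A short sketch (assumes client is a connected *Client and the file already exists; the path is hypothetical):

writer, err := client.Append("/logs/app.log")
if err != nil {
	log.Fatal(err)
}
if _, err := writer.Write([]byte("another line\n")); err != nil {
	log.Fatal(err)
}
// Close must not be skipped; it waits for the datanodes to acknowledge the writes.
if err := writer.Close(); err != nil {
	log.Fatal(err)
}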

func (*Client) Chmod

func (c *Client) Chmod(name string, perm os.FileMode) error

Chmod changes the mode of the named file to mode.

func (*Client) Chown

func (c *Client) Chown(name string, user, group string) error

Chown changes the user and group of the file. Unlike os.Chown, this takes a string username and group (since that's what HDFS uses).

If an empty string is passed for user or group, that field will not be changed remotely.
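
For instance (the path and names are hypothetical):

// Change both the owner and the group.
if err := client.Chown("/data/report.csv", "alice", "analysts"); err != nil {
	log.Fatal(err)
}
// Change only the owner; the empty group string leaves the group unchanged.
if err := client.Chown("/data/report.csv", "bob", ""); err != nil {
	log.Fatal(err)
}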

func (*Client) Chtimes

func (c *Client) Chtimes(name string, atime time.Time, mtime time.Time) error

Chtimes changes the access and modification times of the named file.

func (*Client) Close added in v1.0.0

func (c *Client) Close() error

Close terminates all underlying socket connections to the remote server.

func (*Client) CopyToLocal

func (c *Client) CopyToLocal(src string, dst string) error

CopyToLocal copies the HDFS file specified by src to the local file at dst. If dst already exists, it will be overwritten.

func (*Client) CopyToRemote added in v1.0.0

func (c *Client) CopyToRemote(src string, dst string) error

CopyToRemote copies the local file specified by src to the HDFS file at dst.

func (*Client) Create added in v1.0.0

func (c *Client) Create(name string) (*FileWriter, error)

Create opens a new file in HDFS with the default replication, block size, and permissions (0644), and returns an io.WriteCloser for writing to it. Because of the way that HDFS writes are buffered and acknowledged asynchronously, it is very important that Close is called after all data has been written.
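
A minimal sketch (assumes client is a connected *Client; the path is hypothetical):

writer, err := client.Create("/tmp/hello.txt")
if err != nil {
	log.Fatal(err)
}
if _, err := writer.Write([]byte("hello, hdfs\n")); err != nil {
	log.Fatal(err)
}
// An error from Close means the data may not have been acknowledged, so check it.
if err := writer.Close(); err != nil {
	log.Fatal(err)
}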

func (*Client) CreateEmptyFile

func (c *Client) CreateEmptyFile(name string) error

CreateEmptyFile creates an empty file with the given name and permissions 0644.

func (*Client) CreateFile added in v1.0.0

func (c *Client) CreateFile(name string, replication int, blockSize int64, perm os.FileMode) (*FileWriter, error)

CreateFile opens a new file in HDFS with the given replication, block size, and permissions, and returns an io.WriteCloser for writing to it. Because of the way that HDFS writes are buffered and acknowledged asynchronously, it is very important that Close is called after all data has been written.

func (*Client) GetContentSummary added in v0.1.4

func (c *Client) GetContentSummary(name string) (*ContentSummary, error)

GetContentSummary returns a ContentSummary representing the named file or directory. The summary contains information about the entire tree rooted in the named file; for instance, it can return the total size of all the files contained.
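
A sketch of reading a few of the summary's fields (the path is hypothetical):

cs, err := client.GetContentSummary("/data")
if err != nil {
	log.Fatal(err)
}
fmt.Println("files:", cs.FileCount())
fmt.Println("logical size:", cs.Size())
fmt.Println("on-disk size with replication:", cs.SizeAfterReplication())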

func (*Client) Mkdir

func (c *Client) Mkdir(dirname string, perm os.FileMode) error

Mkdir creates a new directory with the specified name and permission bits.

func (*Client) MkdirAll

func (c *Client) MkdirAll(dirname string, perm os.FileMode) error

MkdirAll creates a directory for dirname, along with any necessary parents, and returns nil, or else returns an error. The permission bits perm are used for all directories that MkdirAll creates. If dirname is already a directory, MkdirAll does nothing and returns nil.

func (*Client) Open

func (c *Client) Open(name string) (*FileReader, error)

Open returns a FileReader, which can be used for reading.

func (*Client) ReadDir

func (c *Client) ReadDir(dirname string) ([]os.FileInfo, error)

ReadDir reads the directory named by dirname and returns a list of sorted directory entries.
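
For example (assumes client is a connected *Client):

infos, err := client.ReadDir("/user")
if err != nil {
	log.Fatal(err)
}
for _, info := range infos {
	fmt.Println(info.Name(), info.Size(), info.IsDir())
}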

func (*Client) ReadFile

func (c *Client) ReadFile(filename string) ([]byte, error)

ReadFile reads the file named by filename and returns the contents.

func (*Client) Remove

func (c *Client) Remove(name string) error

Remove removes the named file or directory.

func (*Client) Rename

func (c *Client) Rename(oldpath, newpath string) error

Rename renames (moves) a file.

func (*Client) Stat

func (c *Client) Stat(name string) (os.FileInfo, error)

Stat returns an os.FileInfo describing the named file or directory.

func (*Client) StatFs added in v1.0.3

func (c *Client) StatFs() (FsInfo, error)

func (*Client) Walk added in v1.1.1

func (c *Client) Walk(root string, walkFn filepath.WalkFunc) error

Walk walks the file tree rooted at root, calling walkFn for each file or directory in the tree, including root. All errors that arise visiting files and directories are filtered by walkFn. The files are walked in lexical order, which makes the output deterministic but means that for very large directories Walk can be inefficient. Walk does not follow symbolic links.
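
A short sketch that prints every path under a hypothetical root, mirroring filepath.Walk:

err := client.Walk("/data", func(path string, info os.FileInfo, err error) error {
	if err != nil {
		return err
	}
	fmt.Println(path)
	return nil
})
if err != nil {
	log.Fatal(err)
}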

type ClientOptions added in v1.1.0

type ClientOptions struct {
	Addresses []string
	Namenode  *rpc.NamenodeConnection
	User      string
}

ClientOptions represents the configurable options for a client.

type ContentSummary added in v0.1.4

type ContentSummary struct {
	// contains filtered or unexported fields
}

ContentSummary represents a set of information about a file or directory in HDFS. It's provided directly by the namenode, and has no unix filesystem analogue.

func (*ContentSummary) DirectoryCount added in v0.1.4

func (cs *ContentSummary) DirectoryCount() int

DirectoryCount returns the number of directories under the named one, including any subdirectories, and including the root directory itself. If the named path is a file, this returns 0.

func (*ContentSummary) FileCount added in v0.1.4

func (cs *ContentSummary) FileCount() int

FileCount returns the number of files under the named path, including any subdirectories. If the named path is a file, FileCount returns 1.

func (*ContentSummary) NameQuota added in v0.1.4

func (cs *ContentSummary) NameQuota() int

NameQuota returns the HDFS configured "name quota" for the named path. The name quota is a hard limit on the number of directories and files inside a directory; see http://goo.gl/sOSJmJ for more information.

func (*ContentSummary) Size added in v0.1.4

func (cs *ContentSummary) Size() int64

Size returns the total size of the named path, including any subdirectories.

func (*ContentSummary) SizeAfterReplication added in v0.1.4

func (cs *ContentSummary) SizeAfterReplication() int64

SizeAfterReplication returns the total size of the named path, including any subdirectories. Unlike Size, it counts the total replicated size of each file, and represents the total on-disk footprint for a tree in HDFS.

func (*ContentSummary) SpaceQuota added in v0.1.4

func (cs *ContentSummary) SpaceQuota() int64

SpaceQuota returns the HDFS configured "space quota" for the named path. The space quota is a hard limit on the number of bytes used by files inside a directory, counting replication; see http://goo.gl/sOSJmJ for more information.

type FileInfo

type FileInfo struct {
	// contains filtered or unexported fields
}

FileInfo implements os.FileInfo, and provides information about a file or directory in HDFS.

func (*FileInfo) AccessTime

func (fi *FileInfo) AccessTime() time.Time

AccessTime returns the last time the file was accessed. It's not part of the os.FileInfo interface.

func (*FileInfo) IsDir

func (fi *FileInfo) IsDir() bool

func (*FileInfo) ModTime

func (fi *FileInfo) ModTime() time.Time

func (*FileInfo) Mode

func (fi *FileInfo) Mode() os.FileMode

func (*FileInfo) Name

func (fi *FileInfo) Name() string

func (*FileInfo) Owner

func (fi *FileInfo) Owner() string

Owner returns the name of the user that owns the file or directory. It's not part of the os.FileInfo interface.

func (*FileInfo) OwnerGroup

func (fi *FileInfo) OwnerGroup() string

OwnerGroup returns the name of the group that owns the file or directory. It's not part of the os.FileInfo interface.

func (*FileInfo) Size

func (fi *FileInfo) Size() int64

func (*FileInfo) Sys

func (fi *FileInfo) Sys() interface{}

Sys returns the raw *hadoop_hdfs.HdfsFileStatusProto message from the namenode.

type FileReader

type FileReader struct {
	// contains filtered or unexported fields
}

A FileReader represents an existing file or directory in HDFS. It implements io.Reader, io.ReaderAt, io.Seeker, and io.Closer, and can only be used for reads. For writes, see FileWriter and Client.Create.

func (*FileReader) Checksum

func (f *FileReader) Checksum() ([]byte, error)

Checksum returns HDFS's internal "MD5MD5CRC32C" checksum for a given file.

Internally to HDFS, it works by calculating the MD5 of all the CRCs (which are stored alongside the data) for each block, and then calculating the MD5 of all of those.
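
For example, to print the checksum in hex (assumes file is an open *FileReader):

sum, err := file.Checksum()
if err != nil {
	log.Fatal(err)
}
fmt.Printf("%x\n", sum)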

func (*FileReader) Close

func (f *FileReader) Close() error

Close implements io.Closer.

func (*FileReader) Name

func (f *FileReader) Name() string

Name returns the name of the file.

func (*FileReader) Read

func (f *FileReader) Read(b []byte) (int, error)

Read implements io.Reader.

func (*FileReader) ReadAt

func (f *FileReader) ReadAt(b []byte, off int64) (int, error)

ReadAt implements io.ReaderAt.

func (*FileReader) Readdir

func (f *FileReader) Readdir(n int) ([]os.FileInfo, error)

Readdir reads the contents of the directory associated with file and returns a slice of up to n os.FileInfo values, as would be returned by Stat, in directory order. Subsequent calls on the same file will yield further os.FileInfos.

If n > 0, Readdir returns at most n os.FileInfo values. In this case, if Readdir returns an empty slice, it will return a non-nil error explaining why. At the end of a directory, the error is io.EOF.

If n <= 0, Readdir returns all the os.FileInfo from the directory in a single slice. In this case, if Readdir succeeds (reads all the way to the end of the directory), it returns the slice and a nil error. If it encounters an error before the end of the directory, Readdir returns the os.FileInfo read until that point and a non-nil error.
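
A sketch of listing a large directory in batches of 100 (assumes dir is a *FileReader opened on a directory):

for {
	infos, err := dir.Readdir(100)
	for _, info := range infos {
		fmt.Println(info.Name())
	}
	if err == io.EOF {
		break
	}
	if err != nil {
		log.Fatal(err)
	}
}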

func (*FileReader) Readdirnames

func (f *FileReader) Readdirnames(n int) ([]string, error)

Readdirnames reads and returns a slice of names from the directory f.

If n > 0, Readdirnames returns at most n names. In this case, if Readdirnames returns an empty slice, it will return a non-nil error explaining why. At the end of a directory, the error is io.EOF.

If n <= 0, Readdirnames returns all the names from the directory in a single slice. In this case, if Readdirnames succeeds (reads all the way to the end of the directory), it returns the slice and a nil error. If it encounters an error before the end of the directory, Readdirnames returns the names read until that point and a non-nil error.

func (*FileReader) Seek

func (f *FileReader) Seek(offset int64, whence int) (int64, error)

Seek implements io.Seeker.

The seek is virtual - it starts a new block read at the new position.

func (*FileReader) Stat

func (f *FileReader) Stat() os.FileInfo

Stat returns the FileInfo structure describing the file.

type FileWriter added in v1.0.0

type FileWriter struct {
	// contains filtered or unexported fields
}

A FileWriter represents a writer for an open file in HDFS. It implements Writer and Closer, and can only be used for writes. For reads, see FileReader and Client.Open.

func (*FileWriter) Close added in v1.0.0

func (f *FileWriter) Close() error

Close closes the file, writing any remaining data out to disk and waiting for acknowledgements from the datanodes. It is important that Close is called after all data has been written.

func (*FileWriter) Flush added in v1.1.1

func (f *FileWriter) Flush() error

Flush flushes any buffered data out to the datanodes. Even immediately after a call to Flush, it is still necessary to call Close once all data has been written.
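
For instance, a sketch of a long-lived writer that flushes after each chunk (assumes writer is an open *FileWriter and chunks is a [][]byte of pending data):

for _, chunk := range chunks {
	if _, err := writer.Write(chunk); err != nil {
		log.Fatal(err)
	}
	if err := writer.Flush(); err != nil {
		log.Fatal(err)
	}
}
// Close is still required after the final Flush.
if err := writer.Close(); err != nil {
	log.Fatal(err)
}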

func (*FileWriter) Write added in v1.0.0

func (f *FileWriter) Write(b []byte) (int, error)

Write implements io.Writer for writing to a file in HDFS. Internally, it writes data to an internal buffer first, and then later out to HDFS. Because of this, it is important that Close is called after all data has been written.

type FsInfo added in v1.0.3

type FsInfo struct {
	Capacity              uint64
	Used                  uint64
	Remaining             uint64
	UnderReplicated       uint64
	CorruptBlocks         uint64
	MissingBlocks         uint64
	MissingReplOneBlocks  uint64
	BlocksInFuture        uint64
	PendingDeletionBlocks uint64
}

FsInfo provides information about the HDFS cluster's usage and block status.

type HadoopConf added in v1.0.0

type HadoopConf map[string]string

HadoopConf represents a map of all the key-value configuration pairs found in a user's hadoop configuration files.

func LoadHadoopConf added in v1.0.0

func LoadHadoopConf(path string) HadoopConf

LoadHadoopConf returns a HadoopConf object representing configuration from the specified path, or finds the correct path in the environment. If path or the env variable HADOOP_CONF_DIR is specified, it should point directly to the directory where the xml files are. If neither is specified, ${HADOOP_HOME}/conf will be used.

func (HadoopConf) Namenodes added in v1.0.0

func (conf HadoopConf) Namenodes() ([]string, error)

Namenodes returns the namenode hosts present in the configuration. The returned slice will be sorted and deduped.
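
A short sketch of resolving the namenodes from the local configuration (an empty path falls back to the environment):

conf := hdfs.LoadHadoopConf("")
namenodes, err := conf.Namenodes()
if err != nil {
	log.Fatal(err)
}
fmt.Println(namenodes)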

type Property added in v1.0.0

type Property struct {
	Name  string `xml:"name"`
	Value string `xml:"value"`
}

Property is the struct representation of a hadoop configuration key-value pair.

Directories

Path Synopsis
cmd
protocol
  hadoop_common  Package hadoop_common is a generated protocol buffer package.
  hadoop_hdfs    Package hadoop_hdfs is a generated protocol buffer package.
rpc              Package rpc implements some of the lower-level functionality required to communicate with the namenode and datanodes.
