datareader

Version: v0.0.0-...-816b6ff
Published: Mar 25, 2021 License: BSD-3-Clause Imports: 13 Imported by: 0

README

datareader : read SAS and Stata files in Go

datareader is a pure Go (Golang) package that can read binary SAS format (SAS7BDAT) and Stata format (dta) data files into native Go data structures. For non-Go users, there are command line utilities that convert SAS and Stata files into text/csv and parquet files.

The Stata reader is based on the Stata documentation for the dta file format and supports dta versions 115, 117, and 118.

There is no official documentation for SAS binary format files. The code here is translated from the Python sas7bdat package, which in turn is based on an R package. Also see here for more information about the SAS7BDAT file structure.

This package also provides a simple column-oriented data container called a Series. Both the SAS reader and Stata reader return the data as an array of Series objects, corresponding to the columns of the data file. These can in turn be converted to other formats as needed.

Both the Stata and SAS reader support streaming access to the data (i.e. reading the file by chunks of consecutive records).

SAS

Here is an example of how the SAS reader can be used in a Go program (error handling omitted for brevity):

import (
        "datareader"
        "os"
)

// Create a SAS7BDAT object
f, _ := os.Open("filename.sas7bdat")
sas, _ := datareader.NewSAS7BDATReader(f)

// Read the first 10000 records (rows)
ds, _ := sas.Read(10000)

// If column 0 contains numeric data
// x is a []float64 containing the data
// m is a []bool containing missingness indicators
x, m, _ := ds[0].AsFloat64Slice()

// If column 1 contains text data
// s is a []string containing the data
// sm is a []bool containing missingness indicators
s, sm, _ := ds[1].AsStringSlice()

Stata

Here is an example of how the Stata reader can be used in a Go program (again with no error handling):

import (
        "datareader"
        "os"
)

// Create a StataReader object
f, _ := os.Open("filename.dta")
stata, _ := datareader.NewStataReader(f)

// Read the first 10000 records (rows)
ds, _ := stata.Read(10000)

CSV

The package includes a CSV reader with type inference for the column data types.

import (
        "datareader"
        "os"
)

f, _ := os.Open("filename.csv")
rt := datareader.NewCSVReader(f)
rt.HasHeader = true
dt, _ := rt.Read(-1)
// obtain data from dt as in the SAS example above

Command line utilities

We provide two command-line utilities that convert SAS and Stata datasets to other formats without using Go directly. Executables for several operating systems and architectures are contained in the bin directory. The script used to cross-compile these binaries is build.sh. To build and install the commands for your local architecture only, run the Makefile (the executables will be copied into your GOBIN directory).

The stattocsv command converts a SAS7BDAT or Stata dta file to a csv file. It can be used as follows:

> stattocsv file.sas7bdat > file.csv
> stattocsv file.dta > file.csv

The columnize command takes the data from either a SAS7BDAT or a Stata dta file, and writes the data from each column into a separate file. Numeric data can be stored in either binary (native 8 byte floats) or text format (binary is considerably faster).

> columnize -in=file.sas7bdat -out=cols -mode=binary
> columnize -in=file.dta -out=cols -mode=text
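
For readers who want to consume the binary output of columnize from Go, the following is a minimal sketch of decoding 8-byte floats. It assumes that a column file holds nothing but consecutive float64 values and that they are little-endian; neither detail is confirmed above. The function names decodeFloat64Column and encodeFloat64Column are made up for this illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// decodeFloat64Column interprets data as consecutive little-endian float64
// values. Little-endian is an assumption; "native" byte order depends on the
// machine that wrote the file.
func decodeFloat64Column(data []byte) ([]float64, error) {
	if len(data)%8 != 0 {
		return nil, fmt.Errorf("length %d is not a multiple of 8", len(data))
	}
	out := make([]float64, len(data)/8)
	for i := range out {
		bits := binary.LittleEndian.Uint64(data[8*i:])
		out[i] = math.Float64frombits(bits)
	}
	return out, nil
}

// encodeFloat64Column is the inverse operation, used here only to build
// sample input without needing a real column file.
func encodeFloat64Column(vals []float64) []byte {
	out := make([]byte, 8*len(vals))
	for i, v := range vals {
		binary.LittleEndian.PutUint64(out[8*i:], math.Float64bits(v))
	}
	return out
}

func main() {
	raw := encodeFloat64Column([]float64{1.5, -2, 3.25})
	vals, err := decodeFloat64Column(raw)
	fmt.Println(vals, err)
}
```

In practice the byte slice would come from os.ReadFile applied to one of the files written to the output directory.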

Parquet conversion

We provide a simple and efficient way to convert a SAS7BDAT file to parquet format, using the parquet-go package. To convert a SAS file called 'mydata.sas7bdat' to Parquet format, begin by running sas_to_parquet as follows:

sas_to_parquet -sasfile=mydata.sas7bdat -outdir=. -structname=MyStruct -pkgname=mypackage

If you want the Parquet file for use outside of Go, you can specify any values for structname and pkgname. The sas_to_parquet command generates a Go program called 'convert_data.go' that you can use to perform the data conversion.

The parquet file will be written to the specified destination directory, which in the above example is the current working directory. The parquet file name will be based on the SAS file name, e.g. in the above example it will be 'mydata.parquet'.

To facilitate reading the Parquet file into Go using the parquet-go package, a Go struct definition will be written to the directory specified by 'mypackage' above. See the sas_to_parquet_check.go script to see how to read the file into Go using these struct definitions.

Testing

Automated testing is implemented against the Stata files used to test the pandas Stata reader (for versions 115+):

https://github.com/pydata/pandas/tree/master/pandas/io/tests/data

A CSV data file for testing is generated by the gendat.go script. There are scripts make.sas and make.stata in the test directory that generate SAS and Stata files for testing. SAS and Stata software are required to run these scripts. The generated files are provided in the test_files/data directory, so go test can be run without having access to SAS or Stata.

The columnize_test.go and stattocsv_test.go scripts test the commands against stored output.

Feedback

Please file an issue if you encounter a file that is not properly handled. If possible, share the file that causes the problem.

Documentation

Types

type CSVReader

type CSVReader struct {

	// Skip this number of rows before reading the header.
	SkipRows int

	// If true, there is a header to read, otherwise default column names are used
	HasHeader bool

	// The column names, in the order that they appear in the
	// file.  Can be set by caller.
	ColumnNames []string

	// User-specified data types (maps column name to type name).
	TypeHintsName map[string]string

	// User-specified data types (indexed by column number).
	TypeHintsPos []string

	// The data type for each column.
	DataTypes []string
	// contains filtered or unexported fields
}

A CSVReader specifies how a data set in CSV format can be read from a text file.

func NewCSVReader

func NewCSVReader(r io.Reader) *CSVReader

NewCSVReader returns a CSVReader that reads CSV data from the given io.Reader, with type inference and chunking.

func (*CSVReader) Read

func (rdr *CSVReader) Read(lines int) ([]*Series, error)

Read reads up to lines rows of data and returns the results as an array of Series objects. If lines is negative the whole file is read. Data types of the Series objects are inferred from the file. Use type hints in the CSVReader struct to control the types directly.
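
To give a rough idea of what type inference on CSV columns involves, here is a minimal sketch built only on the standard library. It is not the CSVReader implementation: the rule used (a column is numeric when every non-empty cell parses as a float) and the type labels "float64" and "string" are assumptions made for this example.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// inferColumnTypes labels a column "float64" when every non-empty cell
// parses as a number, and "string" otherwise. A column with no non-empty
// cells is labeled "float64" under this rule.
func inferColumnTypes(records [][]string) []string {
	if len(records) == 0 {
		return nil
	}
	types := make([]string, len(records[0]))
	for j := range types {
		types[j] = "float64"
		for _, row := range records {
			if j >= len(row) {
				continue // tolerate short rows
			}
			cell := strings.TrimSpace(row[j])
			if cell == "" {
				continue
			}
			if _, err := strconv.ParseFloat(cell, 64); err != nil {
				types[j] = "string"
				break
			}
		}
	}
	return types
}

func main() {
	r := csv.NewReader(strings.NewReader("id,name,score\n1,ann,3.5\n2,bob,\n"))
	header, _ := r.Read()     // first record holds the column names
	records, _ := r.ReadAll() // remaining records hold the data
	fmt.Println(header, inferColumnTypes(records))
}
```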

type ColumnTypeT

type ColumnTypeT uint16

ColumnTypeT is the type of a data column in a SAS or Stata file.

const (
	SASNumericType ColumnTypeT = iota
	SASStringType
)
const (
	StataFloat64Type ColumnTypeT = 65526
	StataFloat32Type ColumnTypeT = 65527
	StataInt32Type   ColumnTypeT = 65528
	StataInt16Type   ColumnTypeT = 65529
	StataInt8Type    ColumnTypeT = 65530
	StataStrlType    ColumnTypeT = 32768
)

The Stata constants are the type codes used in dta files to represent different data types.

type SAS7BDAT

type SAS7BDAT struct {

	// Formats for the columns
	ColumnFormats []string

	// If true, trim whitespace from right of each string variable
	// (SAS7BDAT strings are fixed width)
	TrimStrings bool

	// If true, converts some date formats to Go date values (does
	// not work for all SAS date formats)
	ConvertDates bool

	// If true, strings are represented as uint64 values.  Call
	// the StringFactorMap method to obtain the mapping from these
	// coded values to the actual strings that they represent.
	FactorizeStrings bool

	// If true, turns off alignment correction when reading mixed-type pages.
	// In general this should be set to false.  However some files
	// are read incorrectly and need this flag set to true.  At present,
	// we do not know how to automatically detect the correct setting, so
	// we leave this as a configurable option.
	NoAlignCorrection bool

	// The creation date of the file
	DateCreated time.Time

	// The modification date of the file
	DateModified time.Time

	// The name of the data set
	Name string

	// The platform used to create the file
	Platform string

	// The SAS release used to create the file
	SASRelease string

	// The server type used to create the file
	ServerType string

	// The operating system type used to create the file
	OSType string

	// The operating system name used to create the file
	OSName string

	// The SAS file type
	FileType string

	// The encoding name
	FileEncoding string

	// True if the file was created on a 64 bit architecture
	U64 bool

	// The byte order of the file
	ByteOrder binary.ByteOrder

	// The compression mode of the file
	Compression string

	// A decoder for decoding text to unicode
	TextDecoder *xencoding.Decoder
	// contains filtered or unexported fields
}

SAS7BDAT represents a SAS data file in SAS7BDAT format.

func NewSAS7BDATReader

func NewSAS7BDATReader(r io.ReadSeeker) (*SAS7BDAT, error)

NewSAS7BDATReader returns a new reader object for SAS7BDAT files. Call the Read method to obtain the data.

func (*SAS7BDAT) ColumnLabels

func (sas *SAS7BDAT) ColumnLabels() []string

ColumnLabels returns the column labels.

func (*SAS7BDAT) ColumnNames

func (sas *SAS7BDAT) ColumnNames() []string

ColumnNames returns the names of the columns.

func (*SAS7BDAT) ColumnTypes

func (sas *SAS7BDAT) ColumnTypes() []ColumnTypeT

ColumnTypes returns integer codes for the column data types.

func (*SAS7BDAT) Read

func (sas *SAS7BDAT) Read(num_rows int) ([]*Series, error)

Read returns up to num_rows rows of data from the SAS7BDAT file, as an array of Series objects. The Series data types are either float64 or string. If num_rows is negative, the remainder of the file is read. Returns (nil, io.EOF) when no rows remain.

SAS string variables have a fixed width and are right-padded with whitespace. The TrimStrings field of the SAS7BDAT struct can be set to true to automatically trim this whitespace.
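
The (nil, io.EOF) convention described for Read lends itself to a simple loop over chunks. The sketch below does not import datareader: chunkReader and sliceReader are hypothetical stand-ins, with each chunk simplified to a single float64 column, so that the control flow can be run on its own. With the real reader, each chunk would be a []*Series.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// chunkReader is a hypothetical stand-in for a reader's Read method.
type chunkReader interface {
	Read(numRows int) ([]float64, error)
}

// sliceReader serves rows from memory and returns (nil, io.EOF) once no
// rows remain, mirroring the convention described above.
type sliceReader struct {
	rows []float64
	pos  int
}

func (s *sliceReader) Read(numRows int) ([]float64, error) {
	if s.pos >= len(s.rows) {
		return nil, io.EOF
	}
	end := len(s.rows) // a negative numRows reads the remainder
	if numRows >= 0 && s.pos+numRows < end {
		end = s.pos + numRows
	}
	chunk := s.rows[s.pos:end]
	s.pos = end
	return chunk, nil
}

// countRows drains the reader chunk by chunk and returns the number of
// rows seen. A zero chunk size is rejected because it would never advance.
func countRows(r chunkReader, chunkSize int) (int, error) {
	if chunkSize == 0 {
		return 0, errors.New("chunkSize must be non-zero")
	}
	total := 0
	for {
		chunk, err := r.Read(chunkSize)
		if errors.Is(err, io.EOF) {
			return total, nil
		}
		if err != nil {
			return total, err
		}
		total += len(chunk)
	}
}

func main() {
	n, err := countRows(&sliceReader{rows: make([]float64, 25)}, 10)
	fmt.Println(n, err)
}
```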

func (*SAS7BDAT) RowCount

func (sas *SAS7BDAT) RowCount() int

RowCount returns the number of rows in the data set.

func (*SAS7BDAT) StringFactorMap

func (sas *SAS7BDAT) StringFactorMap() map[uint64]string

StringFactorMap returns a map that associates integer codes with the string value that each code represents. This is only relevant if FactorizeStrings is set to true.
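
As an illustration of how such a map might be applied, the sketch below decodes a slice of uint64 codes (such as AsUint64Slice could return for a factorized column) back into strings. The helper decodeFactors and the sample map are hypothetical; reporting unknown codes through a second slice is a design choice of this example, not documented behavior.

```go
package main

import "fmt"

// decodeFactors maps coded values back to strings. Codes that are absent
// from the map are flagged in the second return value rather than guessed.
func decodeFactors(codes []uint64, levels map[uint64]string) ([]string, []bool) {
	out := make([]string, len(codes))
	unknown := make([]bool, len(codes))
	for i, c := range codes {
		s, ok := levels[c]
		out[i] = s
		unknown[i] = !ok
	}
	return out, unknown
}

func main() {
	// Sample data standing in for the result of StringFactorMap.
	levels := map[uint64]string{0: "control", 1: "treated"}
	vals, unknown := decodeFactors([]uint64{1, 0, 1, 7}, levels)
	fmt.Println(vals, unknown)
}
```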

type Series

type Series struct {

	// A name describing what is in this series.
	Name string
	// contains filtered or unexported fields
}

A Series is a fixed-type one-dimensional sequence of data values, with an optional mask for missing values.

func NewSeries

func NewSeries(name string, data interface{}, missing []bool) (*Series, error)

NewSeries returns a new Series value with the given name and data contents. The data slice parameter is not copied.

func (*Series) AllClose

func (ser *Series) AllClose(other *Series, tol float64) (bool, int)

AllClose returns true, 0 if the Series is within tol of the other series. If the Series have different lengths, AllClose returns false, -1. If the Series have different types, AllClose returns false, -2. If the Series have the same type and the same length but are not equal, AllClose returns false, j, where j is the index of the first position where the two series differ.
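
The return codes can be illustrated on plain float64 slices. This is a sketch of the convention as described above, not the package's implementation: the type-mismatch code (-2) and missing-value masks are left out, and the name allClose is local to this example.

```go
package main

import (
	"fmt"
	"math"
)

// allClose reports whether a and b agree elementwise to within tol.
// It returns (false, -1) on a length mismatch and (false, j) where j is
// the first index at which the values differ by more than tol.
// NaN values are not given any special treatment in this sketch.
func allClose(a, b []float64, tol float64) (bool, int) {
	if len(a) != len(b) {
		return false, -1
	}
	for i := range a {
		if math.Abs(a[i]-b[i]) > tol {
			return false, i
		}
	}
	return true, 0
}

func main() {
	fmt.Println(allClose([]float64{1, 2, 3}, []float64{1, 2.5, 3}, 0.1))
	fmt.Println(allClose([]float64{1, 2}, []float64{1, 2, 3}, 0.1))
}
```

With tol set to 0 the same function tests for exact equality, which matches the relationship between AllClose and AllEqual.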

func (*Series) AllEqual

func (ser *Series) AllEqual(other *Series) (bool, int)

AllEqual is equivalent to AllClose with tol=0.

func (*Series) AsFloat64Slice

func (ser *Series) AsFloat64Slice() ([]float64, []bool, error)

AsFloat64Slice returns the data of the series as a float64 slice, and a boolean slice for the missing value indicators.
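
A typical use of the two returned slices is to compute a summary that skips missing entries. The helper below is a hypothetical example operating on plain slices; it assumes that the mask, when non-nil, has the same length as the data, and it treats a nil mask as meaning that nothing is missing.

```go
package main

import "fmt"

// meanObserved averages the entries of x whose missing flag is false.
// The second return value is false when no values are observed.
func meanObserved(x []float64, missing []bool) (float64, bool) {
	sum, n := 0.0, 0
	for i, v := range x {
		if missing != nil && missing[i] {
			continue
		}
		sum += v
		n++
	}
	if n == 0 {
		return 0, false
	}
	return sum / float64(n), true
}

func main() {
	// Sample data standing in for the result of AsFloat64Slice.
	x := []float64{1, 2, 3, 100}
	m := []bool{false, false, false, true}
	fmt.Println(meanObserved(x, m))
}
```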

func (*Series) AsStringSlice

func (ser *Series) AsStringSlice() ([]string, []bool, error)

AsStringSlice returns the data of the series as a string slice, and a boolean slice for the missing value indicators.

func (*Series) AsUint64Slice

func (ser *Series) AsUint64Slice() ([]uint64, []bool, error)

AsUint64Slice returns the data of the series as a uint64 slice, and a boolean slice for the missing value indicators.

func (*Series) CountMissing

func (ser *Series) CountMissing() int

CountMissing returns the number of missing values in the Series.

func (*Series) Data

func (ser *Series) Data() interface{}

Data returns the data component of the Series.

func (*Series) DateFromDuration

func (ser *Series) DateFromDuration(base time.Time, units string) (*Series, error)

DateFromDuration returns a new Series in which the data are dates, derived from a given duration value. Currently, units must be "days".
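
The underlying idea, adding a number of days to a base date, can be sketched with the standard time package. The helper names below are hypothetical, and truncating fractional days toward zero is an assumption of this example. The base date of January 1, 1960 is the origin used for daily dates by both SAS and Stata.

```go
package main

import (
	"fmt"
	"time"
)

// epoch1960 returns January 1, 1960 (UTC), the origin for SAS and Stata
// daily date values.
func epoch1960() time.Time {
	return time.Date(1960, time.January, 1, 0, 0, 0, 0, time.UTC)
}

// datesFromDays converts day counts relative to base into time.Time values.
// Fractional days are truncated toward zero.
func datesFromDays(base time.Time, days []float64) []time.Time {
	out := make([]time.Time, len(days))
	for i, d := range days {
		out[i] = base.AddDate(0, 0, int(d))
	}
	return out
}

func main() {
	for _, t := range datesFromDays(epoch1960(), []float64{0, 366, -1}) {
		fmt.Println(t.Format("2006-01-02"))
	}
}
```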

func (*Series) ForceNumeric

func (ser *Series) ForceNumeric() *Series

ForceNumeric converts string values to float64 values, creating missing values where the conversion is not possible. If the data is not string type, it is unaffected.
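
The conversion rule can be sketched on plain slices using strconv.ParseFloat. This is an illustration under assumptions, not the package's code: the helper forceNumeric is hypothetical, surrounding whitespace is trimmed before parsing, and failed conversions are stored as 0 with the missing flag set.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// forceNumeric parses each string as a float64. Entries that cannot be
// parsed are left at 0 and flagged as missing.
func forceNumeric(vals []string) ([]float64, []bool) {
	x := make([]float64, len(vals))
	missing := make([]bool, len(vals))
	for i, s := range vals {
		v, err := strconv.ParseFloat(strings.TrimSpace(s), 64)
		if err != nil {
			missing[i] = true
			continue
		}
		x[i] = v
	}
	return x, missing
}

func main() {
	x, m := forceNumeric([]string{"1.5", " 2 ", "abc", ""})
	fmt.Println(x, m)
}
```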

func (*Series) Length

func (ser *Series) Length() int

Length returns the number of elements in a Series.

func (*Series) Missing

func (ser *Series) Missing() []bool

Missing returns the array of missing value indicators.

func (*Series) NullStringMissing

func (ser *Series) NullStringMissing() *Series

NullStringMissing returns a copy of a string series in which zero-length strings are treated as missing values. If the method is applied to a series that is not of string type, the series is returned unchanged.
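
In terms of plain slices, the effect on the missing-value mask can be sketched as follows. The helper markEmptyMissing is hypothetical; keeping entries that were already flagged as missing is an assumption of this example.

```go
package main

import "fmt"

// markEmptyMissing returns a new mask in which zero-length strings are
// flagged as missing, in addition to any entries already flagged.
// A nil input mask is treated as "nothing missing".
func markEmptyMissing(vals []string, missing []bool) []bool {
	out := make([]bool, len(vals))
	for i, s := range vals {
		out[i] = s == "" || (missing != nil && missing[i])
	}
	return out
}

func main() {
	vals := []string{"a", "", "c"}
	fmt.Println(markEmptyMissing(vals, []bool{false, false, true}))
}
```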

func (*Series) Print

func (ser *Series) Print()

Print prints the entire Series to the standard output.

func (*Series) PrintRange

func (ser *Series) PrintRange(first, last int)

PrintRange prints a slice of the Series to the standard output.

func (*Series) StringFunc

func (ser *Series) StringFunc(f func(string) string) *Series

StringFunc applies the given function to all values in the series, if the series holds string values. Otherwise calling this method has no effect.

func (*Series) ToString

func (ser *Series) ToString() *Series

ToString returns a Series with string values, derived from the given series.

func (*Series) UpcastNumeric

func (ser *Series) UpcastNumeric() *Series

UpcastNumeric converts in-place all numeric type variables to float64 values. Non-numeric data is not affected.

func (*Series) Write

func (ser *Series) Write(w io.Writer)

Write writes the entire Series to the given writer.

func (*Series) WriteRange

func (ser *Series) WriteRange(w io.Writer, first, last int)

WriteRange writes the given subinterval of the Series to the given writer.

type SeriesArray

type SeriesArray []*Series

SeriesArray is an array of pointers to Series objects. It can represent a dataset consisting of several variables.

func (SeriesArray) AllClose

func (ser SeriesArray) AllClose(other []*Series, tol float64) (bool, int, int)

AllClose returns (true, 0, 0) if all numeric values in corresponding columns of the two arrays of Series objects are within the given tolerance. If any corresponding columns are not identically equal, it returns (false, j, i), where j is the index of a column and i is the index of a row where the two Series are not identical. If the two SeriesArray objects have different numbers of columns, it returns (false, -1, -1). If column j has different lengths in the two SeriesArray objects, it returns (false, j, -1). If column j has different types in the two SeriesArray objects, it returns (false, j, -2).

func (SeriesArray) AllEqual

func (ser SeriesArray) AllEqual(other []*Series) (bool, int, int)

AllEqual is equivalent to AllClose with tol = 0.

type StataReader

type StataReader struct {

	// If true, the strl numerical codes are replaced with their
	// string values when available.
	InsertStrls bool

	// If true, the categorical numerical codes are replaced with
	// their string labels when available.
	InsertCategoryLabels bool

	// If true, dates are converted to Go date format.
	ConvertDates bool

	// A short text label for the data set.
	DatasetLabel string

	// The time stamp for the data set
	TimeStamp string

	// Number of variables
	Nvar int

	// An additional text entry describing each variable
	ColumnNamesLong []string

	// String labels for categorical variables
	ValueLabels     map[string]map[int32]string
	ValueLabelNames []string

	// Format codes for each variable
	Formats []string

	// Maps from strl keys to values
	Strls      map[uint64]string
	StrlsBytes map[uint64][]byte

	// The format version of the dta file
	FormatVersion int

	// The endian-ness of the file
	ByteOrder binary.ByteOrder
	// contains filtered or unexported fields
}

StataReader reads Stata dta data files. Currently dta format versions 114, 115, 117, and 118 can be read.

The Read method reads and returns the data. Several fields of the StataReader struct may also be of interest.

Technical information about the file format can be found here: http://www.stata.com/help.cgi?dta

func NewStataReader

func NewStataReader(r io.ReadSeeker) (*StataReader, error)

NewStataReader returns a StataReader for reading from the given io.ReadSeeker.

func (*StataReader) ColumnNames

func (rdr *StataReader) ColumnNames() []string

ColumnNames returns the names of the columns in the data file.

func (*StataReader) ColumnTypes

func (rdr *StataReader) ColumnTypes() []ColumnTypeT

ColumnTypes returns integer codes corresponding to the data types in the Stata file. See the Stata dta documentation for more information.

func (*StataReader) Read

func (rdr *StataReader) Read(rows int) ([]*Series, error)

Read returns the given number of rows of data from the Stata data file. The data are returned as an array of Series objects. If rows is negative, the remainder of the file is read.

func (*StataReader) RowCount

func (rdr *StataReader) RowCount() int

RowCount returns the number of rows in the data set.

type StatfileReader

type StatfileReader interface {
	ColumnNames() []string
	ColumnTypes() []ColumnTypeT
	RowCount() int
	Read(int) ([]*Series, error)
}

StatfileReader is an interface that can be used to work interchangeably with StataReader and SAS7BDAT objects.

Directories

Path Synopsis
cmd
