datastruct

package

v0.0.0-...-9be3a58 Published: May 22, 2017 License: BSD-3-Clause Imports: 1 Imported by: 0

Repository

github.com/qihoo360/poseidon

Documentation ¶

Overview ¶

Package datastruct is a generated protocol buffer package.

It is generated from these files:

poseidon_if.proto

It has these top-level messages:

DocGzMeta
DocId
DocIdList
CompressedDocIdList
InvertedIndex
CompressedInvertedIndex
InvertedIndexGzMeta

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type CompressedDocIdList ¶

type CompressedDocIdList struct {
	DocList []uint32 `protobuf:"varint,1,rep,name=docList" json:"docList,omitempty"`
	RowList []uint32 `protobuf:"varint,2,rep,name=rowList" json:"rowList,omitempty"`
}

Compressed docIdList. Both arrays are compressed with the FastPFOR algorithm and have equal length once decompressed.

func (*CompressedDocIdList) ProtoMessage ¶

func (*CompressedDocIdList) ProtoMessage()

func (*CompressedDocIdList) Reset ¶

func (m *CompressedDocIdList) Reset()

func (*CompressedDocIdList) String ¶

func (m *CompressedDocIdList) String() string

type CompressedInvertedIndex ¶

type CompressedInvertedIndex struct {
	Index map[string]*CompressedDocIdList `` /* 130-byte string literal not displayed */
}

func (*CompressedInvertedIndex) GetIndex ¶

func (m *CompressedInvertedIndex) GetIndex() map[string]*CompressedDocIdList

func (*CompressedInvertedIndex) ProtoMessage ¶

func (*CompressedInvertedIndex) ProtoMessage()

func (*CompressedInvertedIndex) Reset ¶

func (m *CompressedInvertedIndex) Reset()

func (*CompressedInvertedIndex) String ¶

func (m *CompressedInvertedIndex) String() string

type DocGzMeta ¶

type DocGzMeta struct {
	Path   string `protobuf:"bytes,1,opt,name=path" json:"path,omitempty"`
	Offset uint32 `protobuf:"varint,2,opt,name=offset" json:"offset,omitempty"`
	Length uint32 `protobuf:"varint,3,opt,name=length" json:"length,omitempty"`
}

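Raw data is stored in HDFS in gz-compressed file format. Every 128 lines of raw data are combined into one Document. Assuming an HDFS file of 2 GB, one file holds roughly 100,000 compressed Documents. The DocGzMeta struct describes the metadata of a Document.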

func (*DocGzMeta) ProtoMessage ¶

func (*DocGzMeta) ProtoMessage()

func (*DocGzMeta) Reset ¶

func (m *DocGzMeta) Reset()

func (*DocGzMeta) String ¶

func (m *DocGzMeta) String() string

type DocId ¶

type DocId struct {
	DocId    uint32 `protobuf:"varint,1,opt,name=docId" json:"docId,omitempty"`
	RowIndex uint32 `protobuf:"varint,2,opt,name=rowIndex" json:"rowIndex,omitempty"`
}

func (*DocId) ProtoMessage ¶

func (*DocId) ProtoMessage()

func (*DocId) Reset ¶

func (m *DocId) Reset()

func (*DocId) String ¶

func (m *DocId) String() string

type DocIdList ¶

type DocIdList struct {
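	// Document IDs associated with this token, sorted by docId in ascending order.
	// To suit protobuf's varint encoding, the IDs are stored as deltas:
	// each stored value equals its original value minus the previous original value.
	// Example:
	// if the original docId list is: 1,3,4,7,9,115,120,121,226
	// then the stored data is:       1,2,1,3,2,106,5,1,105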
	DocIds []*DocId `protobuf:"bytes,1,rep,name=docIds" json:"docIds,omitempty"`
}

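A token may appear in multiple documents. Because each document consists of multiple lines of raw data, each association needs two pieces of information to describe it: docId and rowIndex.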

func (*DocIdList) GetDocIds ¶

func (m *DocIdList) GetDocIds() []*DocId

func (*DocIdList) ProtoMessage ¶

func (*DocIdList) ProtoMessage()

func (*DocIdList) Reset ¶

func (m *DocIdList) Reset()

func (*DocIdList) String ¶

func (m *DocIdList) String() string

type InvertedIndex ¶

type InvertedIndex struct {
	Index map[string]*DocIdList `` /* 130-byte string literal not displayed */
}

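Token->DocIds inverted index table structure. This index data ultimately takes up about 2 TB per day. hashid = hash64(token) % 10 billion; duplicates (collisions) do no harm. Tokenization runs directly on HDFS and produces an intermediate data file (sorted by hashid, 10 billion rows in total) whose rows have the form: hashid token list<DocId>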
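The hashid formula can be sketched as below. The source only says "hash64", so the choice of FNV-1a from the standard library is an assumption made for illustration, and hashID and hashSpace are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashSpace is the 10 billion bucket count from the description above.
const hashSpace = 10000000000

// hashID maps a token to a bucket in [0, hashSpace).
// FNV-1a is an assumed stand-in; the real hash64 function is not
// named in the source.
func hashID(token string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(token)) // Write on a hash.Hash never returns an error
	return h.Sum64() % hashSpace
}

func main() {
	for _, token := range []string{"poseidon", "hdfs", "poseidon"} {
		fmt.Println(token, hashID(token))
	}
}
```

Two different tokens may land on the same hashid. The intermediate rows keep the token next to the hashid, which is consistent with the remark that collisions do no harm.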

Index file creation process

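loop:
    1. Take N rows (N=1000), build one InvertedIndex object, serialize it, gz-compress it, and append it to the HDFS file
       Record: hdfspath hashid offset length
    2. If hashid%N == M (M=1000; pick the value so that each HDFS file ends up around 256 MB), start writing a new HDFS file
       N*M*277 (DocIdList.Item count per token) * 4 (bytes per DocIdList.Item) * 0.2 (compression ratio) --> with N=1000, M=1000 the result stays within 256M
    3. Go back to 1
Of the 4 fields recorded in step 1, hdfspath and hashid can be derived by rule, so only offset and length need to be recorded
In total 10 million records are needed (= total token count / N), 8 bytes each, 80M overall; this file can be stored in HDFS and loaded into a cache (redis) when needed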
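The sizing estimates in the steps above can be checked with a few lines of arithmetic. This is only a sketch of the stated figures: the constant and function names are hypothetical, and the total token count is taken to be the 10 billion rows mentioned for the intermediate file.

```go
package main

import "fmt"

const (
	rowsPerGroup   = 1000        // N: rows packed into one InvertedIndex object
	groupsPerFile  = 1000        // M: InvertedIndex objects per HDFS file
	itemsPerToken  = 277         // DocIdList items per token, as stated in step 2
	bytesPerItem   = 4           // bytes per DocIdList item
	totalTokens    = 10000000000 // 10 billion rows in the intermediate file
	bytesPerRecord = 8           // offset (4 bytes) + length (4 bytes)
)

// estimatedFileBytes evaluates N*M*277*4*0.2; the 0.2 compression
// ratio is applied as an integer division by 5.
func estimatedFileBytes() uint64 {
	return rowsPerGroup * groupsPerFile * itemsPerToken * bytesPerItem / 5
}

// metaRecords is the number of (offset, length) records: one per group.
func metaRecords() uint64 { return totalTokens / rowsPerGroup }

// metaBytes is the total size of the offset/length table.
func metaBytes() uint64 { return metaRecords() * bytesPerRecord }

func main() {
	fmt.Println(estimatedFileBytes()) // 221600000
	fmt.Println(metaRecords())        // 10000000
	fmt.Println(metaBytes())          // 80000000
}
```

The first figure is about 221.6 MB, which fits the 256M target, and the last two match the 10 million records and 80M quoted in the text.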

func (*InvertedIndex) GetIndex ¶

func (m *InvertedIndex) GetIndex() map[string]*DocIdList

func (*InvertedIndex) ProtoMessage ¶

func (*InvertedIndex) ProtoMessage()

func (*InvertedIndex) Reset ¶

func (m *InvertedIndex) Reset()

func (*InvertedIndex) String ¶

func (m *InvertedIndex) String() string

type InvertedIndexGzMeta ¶

type InvertedIndexGzMeta struct {
	Offset uint32 `protobuf:"varint,1,opt,name=offset" json:"offset,omitempty"`
	Length uint32 `protobuf:"varint,2,opt,name=length" json:"length,omitempty"`
	Path   string `protobuf:"bytes,3,opt,name=path" json:"path,omitempty"`
}

func (*InvertedIndexGzMeta) ProtoMessage ¶

func (*InvertedIndexGzMeta) ProtoMessage()

func (*InvertedIndexGzMeta) Reset ¶

func (m *InvertedIndexGzMeta) Reset()

func (*InvertedIndexGzMeta) String ¶

func (m *InvertedIndexGzMeta) String() string

Source Files ¶

poseidon_if.pb.go