Elasticsearch adaptor
The elasticsearch adaptor sends data to defined endpoints. List of
supported versions is below.
Version |
Note |
1.X |
This version does not support bulk operations and will thus be much slower. |
2.X |
Will only receive bug fixes, please consider upgrading. |
5.X |
Most recent and supported version. |
IMPORTANT
If you want to keep the source _id
as the elasticsearch document _id
, transporter will
automatically do this. If you wish to use the auto-generated _id
field for elasticsearch but would
like to retain the originating _id
from the source, you'll need to include a transform function
similar to the following (assumes MongoDB source):
module.exports = function(msg) {
msg.data["mongo_id"] = msg.data._id['$oid']
msg.data = _.omit(msg.data, ["_id"]);
return msg;
}
NOTES
By using the elasticsearch auto-generated _id
, it is not currently possible for transporter to
process update/delete operations. Future work is planned in #39
to address this problem.
If no INDEX_NAME
is provided, transporter will configure a default index named test
.
Configuration:
es = elasticsearch({
"uri": "https://username:password@hostname:port/INDEX_NAME"
"timeout": "10s" // optional, defaults to 30s
"aws_access_key": "XXX" // optional, used for signing requests to AWS Elasticsearch service
"aws_access_secret": "XXX" // optional, used for signing requests to AWS Elasticsearch service
"parent_id": "elastic_parent" // optional, used for specifying parent-child relationships
})
NOTES
When writing to Elasticsearch with larger documents or any complex inserts that requires longer Elasticsearch timeouts, increase your write timeouts with Config({"write_timeout":"30s"})
in addition to adapter level configuration to prevent concurrency issues.
t.Config({"write_timeout":"30s"}).Source("source", source).Save("sink", sink)
Addressing #391
Parent-Child Relationships
Note
Only Elasticsearch 5.x is being supported at the moment.
If you have parent-child relationships in your data, specify parent_id
in the configs.
Be sure to add your parent-child mapping and make sure that your elasticsearch _id
in your parent corresponds with the parent_id
that you specified in your configs.
Check that after you add your parent-child mapping, that data is getting inserted properly.
Update
About Routing:
Elasticsearch recommends that routing values default to _id
if there are no parent values specified. However, if a parent
is specified, the parent and child should use the same routing value which is the parent _id
. In our case, we will be using parent_id
as our routing value to that it ensures both parent and child documents are on the same shard.
https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-parent-child.html
Example of Parent-Child Mapping:
This step is manual, you must set your mapping manually using a PUT
request to Elasticsearch.
PUT /<your index name>
{
"mappings":{
"company":{},
"employee":{
"_parent":{
"type":"company"
}
}
}
}
Then your transporter configs will be something like:
es = elasticsearch({
"uri": "https://username:password@hostname:port/INDEX_NAME"
"timeout": "10s"
"aws_access_key": "XXX"
"aws_access_secret": "XXX"
"parent_id": "company_id"
})
Sample Data for the config above to insert:
In this sample dataset, a company has many employees.
Note: company_id
is the parent reference, below .
Company
{"_id": "9g2g", "name": "gingerbreadhouse"}
Employee
{"_id": "9g2g", "name": "hansel", "company_id": "gingerbreadhouse"}
{"_id": "9g4g", "name": "gretel", "company_id": "gingerbreadhouse"}
{"_id": "9g6g", "name": "witch", "company_id": "gingerbreadhouse"}
Caution: If you try to insert / update data without the mapping step, the inserts will fail. Run transporter with the debug flag to see errors.
transporter run -log.level=debug