I’m scraping a large set of items using Node.js/request and mapping the fields to Elasticsearch documents. The original documents have an ID field which never changes:
{ id: 123456 }
Periodically, I’d like to “refresh” and see which original items are no longer available, for whatever reason. Currently, I have a script which scrapes directly and simply inserts into Elastic.
Is there a way to check if an item with the same ID already exists before doing an insert? I don’t want to end up with a ton of duplicates.
4 Answers
When you push data to Elasticsearch with the bulk API, you can use the index action and set _id to the ID from your source data. In that case Elasticsearch will create the document, or replace it if a document with the same ID already exists. So: build one index action per item, then push the data with the bulk API.
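A minimal sketch of what such a bulk body could look like. The helper name buildBulkBody, the index name "items" and the title field are made up for illustration; only the index action with an explicit _id is the point here.

```javascript
// Sketch only: `buildBulkBody` and the index name "items" are hypothetical.
// Each document is preceded by an `index` action whose _id is the source ID,
// so re-running the scrape overwrites documents instead of duplicating them.
function buildBulkBody(items, indexName) {
  const body = [];
  for (const item of items) {
    body.push({ index: { _index: indexName, _id: String(item.id) } });
    body.push(item);
  }
  return body;
}

// Hand-written sample items standing in for scraped data.
const scraped = [
  { id: 123456, title: 'First item' },
  { id: 123457, title: 'Second item' },
];

const bulkBody = buildBulkBody(scraped, 'items');
console.log(JSON.stringify(bulkBody, null, 2));
```

With a connected client, this array would be handed to the client's bulk call (something like client.bulk({ body: bulkBody })). The exact parameter name depends on the client version, so check the docs for the one you have installed.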
Hope this helps
If you want to check for the existence of an item before trying to insert it, you can just query your db for this document. If the result is not empty, this means that a document with this id already exists. You can use a term query for that. I suppose it will be quite time-consuming, but it is a way to be sure that no duplicate will be inserted.
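As a rough sketch of the term-query check, assuming the documents store the source ID in a field called id (as in the question). The helper names buildExistsQuery and hasHits are hypothetical, and the sample responses are hand-written, not real output.

```javascript
// Sketch only: helper names are hypothetical.
// Search body: exact match on the `id` field. size is 0 because only the
// hit count matters, not the documents themselves.
function buildExistsQuery(id) {
  return { size: 0, query: { term: { id: id } } };
}

// hits.total is a plain number in older Elasticsearch versions and an
// object like { value, relation } in 7.x and later, so handle both.
function hasHits(searchResult) {
  const total = searchResult.hits.total;
  const count = typeof total === 'number' ? total : total.value;
  return count > 0;
}

console.log(JSON.stringify(buildExistsQuery(123456)));
// → {"size":0,"query":{"term":{"id":123456}}}

// Hand-written sample responses, shaped like search results.
console.log(hasHits({ hits: { total: { value: 1, relation: 'eq' }, hits: [] } })); // → true
console.log(hasHits({ hits: { total: 0, hits: [] } })); // → false
```

The query body would go into the client's search call for your index, and the result would be passed to hasHits before deciding whether to insert. One search per item is where the "time-consuming" part comes from.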
Assuming you’re using the Elasticsearch JavaScript API, you can perform a simple get request on a known ID. A 404 response status indicates the document does not already exist.
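Something along these lines might work for the get-and-check-404 approach. documentExists is a hypothetical helper, and it assumes the client rejects with an error that carries the HTTP status as statusCode or status. That is an assumption about the client version, so verify it against yours.

```javascript
// Sketch only: `documentExists` is a hypothetical helper. It assumes the
// client's get() rejects with an error exposing `statusCode` or `status`
// when Elasticsearch answers 404.
async function documentExists(client, index, id) {
  try {
    await client.get({ index: index, id: String(id) });
    return true; // the document was found
  } catch (err) {
    const status = err.statusCode || err.status;
    if (status === 404) {
      return false; // no document with this ID
    }
    throw err; // anything else is a real failure, so pass it on
  }
}
```

Usage would be roughly: if (!(await documentExists(client, 'items', 123456))) { /* insert */ }. Older clients may also require a type parameter on get.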
Are you using your ID as the document _id? Then it should be easy: use the operation type, which lets you specify that a document with a specific ID should only be created, but not overwritten.
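For a bulk import, the equivalent of the create operation type is the create action. Here is a sketch, with hypothetical helper names (buildCreateBody, splitBulkResult) and a hand-written sample response. It relies on Elasticsearch reporting status 409 for a create on an ID that already exists.

```javascript
// Sketch only: helper names and the index name "items" are hypothetical.
// A `create` action fails for an _id that already exists instead of
// overwriting the stored document.
function buildCreateBody(items, indexName) {
  const body = [];
  for (const item of items) {
    body.push({ create: { _index: indexName, _id: String(item.id) } });
    body.push(item);
  }
  return body;
}

// Sort the per-item results of a bulk response by outcome:
// 201 = newly created, 409 = a document with that _id already existed.
function splitBulkResult(bulkResponse) {
  const created = [];
  const alreadyExisting = [];
  const failed = [];
  for (const entry of bulkResponse.items) {
    const result = entry.create;
    if (result.status === 201) created.push(result._id);
    else if (result.status === 409) alreadyExisting.push(result._id);
    else failed.push(result._id);
  }
  return { created, alreadyExisting, failed };
}

// Hand-written sample shaped like a bulk response, not real output.
const sampleResponse = {
  errors: true,
  items: [
    { create: { _id: '123456', status: 409 } },
    { create: { _id: '123457', status: 201 } },
  ],
};
console.log(splitBulkResult(sampleResponse));
```

For single documents the same idea applies through the create operation type on the index request, and a 409 response means the ID was already taken.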