
I’m scraping a large set of items using node.js/request and mapping the fields to Elasticsearch documents. The original documents have an ID field which never changes:

{ id: 123456 }

Periodically, I’d like to “refresh” and see which original items are no longer available, for whatever reason. Currently, I have a script which scrapes directly and simply inserts into Elastic.

Is there a way to check if an item with the same ID already exists before doing an insert? I don’t want to end up with a ton of duplicates.

4 Answers


  1. When you push data to Elasticsearch with the bulk API, you can use the index action and set _id to the ID from your source data. Elasticsearch will then create the document, or replace it if a document with the same _id already exists. Here is an example that builds the bulk body:

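    var _ = require('lodash');

    // Build the bulk request body: one action line followed by one source line per item.
    function createBulkBody(items, indexName) {
      var result = [];
      _.forEach(items, function(item) {
        // Action line: index (create or replace) the document under the source ID.
        result.push({
          index: {
            _index: indexName,
            _type: item.type,
            _id: item.id
          }
        });
        // Source line: the document itself.
        result.push(item);
      });
      return result;
    }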
    

    Then push the data with the bulk API:

    var body = createBulkBody(items, indexName);
    esClient.bulk({
      body: body
    }, function(err, resp) {
      if (err) {
        console.log(err);
      } else {
        console.log(resp);
      }
    });
    

    Hope this helps

  2. If you want to check whether an item exists before trying to insert it, you can query the index for that document. If the result is not empty, a document with this ID already exists.

    You can use a term query for that:

    q = {'term': {'id': '123456'}}
    

    I suppose it will be quite time-consuming, but it is a way to be sure that no duplicate will be inserted.

  3. Assuming you’re using the Elasticsearch JavaScript API, you can perform a simple get request on a known ID:

    client.get({
      index: 'myindex',
      type: 'mytype',
      id: 1
    }, function (error, response) {
      // ...
    });
    

    A 404 response status indicates that the document does not exist yet.

  4. Are you using your ID as the document _id? Then it should be easy: use the create operation type, which specifies that a document with a given ID should only be created, never overwritten. If a document with that ID already exists, the request fails with a 409 conflict:

    PUT your-index/your-type/123456/_create
    {
        "foo" : "bar"
    }
    
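Answers 1 and 4 can be combined: the bulk API also accepts a create action, which skips documents whose _id already exists instead of replacing them. The following is a minimal sketch under stated assumptions, not a definitive implementation. The helper names buildCreateBulkBody and summarizeBulkResponse are hypothetical, the source ID is assumed to live in item.id as in the question, and _type is left out (older clusters that still use mapping types would need it added to the action line). The summary function expects the bulk response body; depending on the client version, that may be the callback's resp itself or resp.body.

```javascript
// Build a bulk body that uses the "create" action, so an existing
// document with the same _id is left untouched instead of being replaced.
// Assumption: each item carries its source ID in item.id.
function buildCreateBulkBody(items, indexName) {
  var body = [];
  items.forEach(function (item) {
    // Action line: "create" is rejected with status 409 if _id already exists.
    body.push({ create: { _index: indexName, _id: String(item.id) } });
    // Source line: the document itself.
    body.push(item);
  });
  return body;
}

// Split the per-item results of a bulk response body into newly created IDs,
// IDs that already existed (status 409), and any other failures.
function summarizeBulkResponse(respBody) {
  var summary = { created: [], duplicates: [], failed: [] };
  (respBody.items || []).forEach(function (entry) {
    var result = entry.create;
    if (!result) return; // ignore actions other than "create"
    if (result.status === 201) {
      summary.created.push(result._id);
    } else if (result.status === 409) {
      summary.duplicates.push(result._id);
    } else {
      summary.failed.push(result._id);
    }
  });
  return summary;
}
```

A 409 entry here is not a real failure: it only means the item was already indexed, so a refresh run can log it and move on. This avoids one lookup request per item.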
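The question also asks which original items are no longer available after a refresh. None of the answers cover that part, so here is a rough sketch of one way to approach it: compare the IDs already in the index with the IDs seen in the latest scrape. The function name findMissingIds is hypothetical, and how the indexed IDs are collected is left open (for a large index, a scroll search would be one option).

```javascript
// Return the IDs that are in the index but were not seen in the latest scrape.
function findMissingIds(indexedIds, scrapedIds) {
  // Normalize to strings, because Elasticsearch returns _id as a string
  // while the scraped source IDs may be numbers.
  var seen = new Set(scrapedIds.map(String));
  return indexedIds.map(String).filter(function (id) {
    return !seen.has(id);
  });
}

var missing = findMissingIds(['123456', '123457', '123458'], [123456, 123458]);
console.log(missing); // logs [ '123457' ]
```

The returned IDs can then be flagged or deleted, depending on whether the index should keep a record of items that have disappeared.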