I have an Elasticsearch index containing "hit" documents (with fields like ip, timestamp, uri, etc.) that are populated from my nginx access logs.

I’m looking for a way to get the total number of hits per IP, but only for a subset of IPs: the ones that made a request today.

I know I can have a filtered aggregation by doing:

GET /_search?size=0
{
    "query": { "bool": { "must": [
        {"range": { "timestamp": { "gte": "now/d" }}},
        {"query_string": {"query": "status:200 OR status:404"}}
    ]}},
    "aggregations": {"c": {"terms": {"field": "ip", "size": 99999}}}
}

but this only counts the hits that were made today. I want the total number of hits in the index, but only for IPs that have hits today. Is this possible?

-edit-

I’ve tried the global option, but while

'aggregations': {'c': {'global': {}, 'aggs': {'c2': {'terms': {'field': 'remote_user', 'size': 99999}}}}}

returns counts for all IPs, it ignores my filter on timestamp (e.g. it includes IPs whose hits were a couple of days ago).

2 Answers


  1. In the example you shared, the query filters your documents, so the aggregation only sees the filtered set. You want your aggregation to consider all documents regardless of the query.

    This is why the global option exists. A global aggregation defines a single bucket of all the documents within the search execution context.

    This context is defined by the indices and the document types you’re searching on, but is not influenced by the search query itself.

    Sample query:

    {
      "query": {
        "match": { "type": "t-shirt" }
      },
      "aggs": {
        "all_products": {
          "global": {},
          "aggs": {
            "avg_price": { "avg": { "field": "price" } }
          }
        }
      }
    }
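
To relate this to the access-log index from the question, here is a minimal Python sketch that builds one request body with two aggregations side by side: one scoped by the query and one inside a global bucket. The function name `scoped_vs_global_body` and the aggregation names are made up for illustration, and the fields `timestamp` and `ip` are assumed from the question. The body is only built here, not sent anywhere.

```python
def scoped_vs_global_body():
    """Build a search body contrasting a query-scoped and a global aggregation."""
    return {
        "size": 0,
        # The query restricts the search context to today's hits.
        "query": {"range": {"timestamp": {"gte": "now/d"}}},
        "aggs": {
            # Sees only documents matched by the query (today's hits).
            "ips_today": {"terms": {"field": "ip"}},
            # The global bucket ignores the query, so the nested terms
            # aggregation counts hits for every IP in the index.
            "all_docs": {
                "global": {},
                "aggs": {"ips_all_time": {"terms": {"field": "ip"}}},
            },
        },
    }


body = scoped_vs_global_body()
```

Because the global bucket is not influenced by the query at all, `ips_all_time` also lists IPs that had no hits today, so on its own it does not restrict the result to today's IPs.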
    
    
  2. There is a way to achieve what you want in a single query, but since it involves scripting, performance might suffer depending on the volume of data you run this query on.

    The idea is to leverage the scripted_metric aggregation in order to build your own aggregation logic over the whole document set.

    What we do below is pretty simple:

    • we don’t give any query, so we consider the full document set
    • Map phase: we build a map of all IPs and, for each IP,
      • we count the total number of hits
      • we flag it if it had a hit today with one of the given statuses (same as what you do in your query)
    • Reduce phase: we return the total hits count for each IP that was flagged as having hits today
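
The steps above can be sketched in plain Python over in-memory documents, which may help when checking the logic before running it against a cluster. This is only a simulation under assumptions: the function names `map_shard` and `reduce_shards`, the sample documents, and the fixed `today` value are all invented for illustration. The sketch merges the per-shard maps before filtering, so an IP whose hits are spread over several shards still gets its full count.

```python
from datetime import date, datetime


def map_shard(docs, today, statuses):
    """Map/combine for one shard: per-IP hit totals plus a 'hit today' flag."""
    state = {}
    for doc in docs:
        entry = state.setdefault(doc["ip"], {"total_hits": 0, "hits_today": False})
        entry["total_hits"] += 1
        # Flag the IP if this hit happened today with one of the given statuses.
        if doc["timestamp"].date() == today and doc["status"] in statuses:
            entry["hits_today"] = True
    return state


def reduce_shards(states):
    """Reduce: merge all shard states, then keep only the flagged IPs."""
    totals, flagged = {}, set()
    for state in states:
        for ip, entry in state.items():
            totals[ip] = totals.get(ip, 0) + entry["total_hits"]
            if entry["hits_today"]:
                flagged.add(ip)
    return {ip: totals[ip] for ip in flagged}


# Hypothetical sample data split over two "shards".
today = date(2024, 1, 2)
shard_a = [
    {"ip": "10.0.0.1", "timestamp": datetime(2024, 1, 1, 9, 0), "status": 200},
    {"ip": "10.0.0.1", "timestamp": datetime(2024, 1, 2, 9, 0), "status": 200},
    {"ip": "10.0.0.2", "timestamp": datetime(2024, 1, 1, 9, 0), "status": 200},
]
shard_b = [
    {"ip": "10.0.0.1", "timestamp": datetime(2023, 12, 31, 9, 0), "status": 500},
    {"ip": "10.0.0.3", "timestamp": datetime(2024, 1, 2, 9, 0), "status": 500},
]
result = reduce_shards([map_shard(s, today, {200, 404}) for s in (shard_a, shard_b)])
print(result)
```

In this sample only `10.0.0.1` qualifies: `10.0.0.2` has no hit today, and the hit from `10.0.0.3` today has a status outside the given list.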

    Here is what the query looks like:

    POST my-index/_search
    {
      "size": 0,
      "aggs": {
        "all_time_hits": {
          "scripted_metric": {
            "init_script": "state.ips = [:]",
            "map_script": """
              // initialize total hits count for each IP and increment
              def ip = doc['ip.keyword'].value;
              if (state.ips[ip] == null) {
                state.ips[ip] = [
                  'total_hits': 0,
                  'hits_today': false
                ]
              }
              state.ips[ip].total_hits++;
    
              // flag IP if:
              // 1. it has hits today 
              // 2. the hit had one of the given statuses
              def today = Instant.ofEpochMilli(new Date().getTime()).truncatedTo(ChronoUnit.DAYS);
              def hitDate = doc['timestamp'].value.toInstant().truncatedTo(ChronoUnit.DAYS);
              def hitToday = today.equals(hitDate);
              def statusOk = params.statuses.indexOf((int) doc['status'].value) >= 0;
              state.ips[ip].hits_today = state.ips[ip].hits_today || (hitToday && statusOk);
            """,
            "combine_script": "return state.ips;",
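            "reduce_script": """
              // merge the per-shard maps first: an IP can be flagged on one
              // shard while some of its hits live on another shard
              def totals = [:];
              def flagged = [:];
              for (state in states) {
                for (ip in state.keySet()) {
                  if (totals[ip] == null) {
                    totals[ip] = 0;
                  }
                  totals[ip] += state[ip].total_hits;
                  if (state[ip].hits_today) {
                    flagged[ip] = true;
                  }
                }
              }
              // only keep IPs that had hits today
              def ips = [:];
              for (ip in flagged.keySet()) {
                ips[ip] = totals[ip];
              }
              return ips;
            """,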
            "params": {
              "statuses": [200, 404]
            }
          }
        }
      }
    }
    

    And here is what the response looks like:

      "aggregations" : {
        "all_time_hits" : {
          "value" : {
            "123.123.123.125" : 1,
            "123.123.123.123" : 4
          }
        }
      }
    

    I think that pretty much does what you expect.

    The other option (more performant, since it needs no script) requires two queries. First, run a query with the date range and status check plus a terms aggregation to retrieve all IPs that have hits today (as you do now). Then run a second query that filters on those IPs with a terms query over the whole index (no date range or status check) and gets the hit count for each of them using a terms aggregation.
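
A minimal Python sketch of that two-query approach is shown below. It only builds the request bodies and parses an aggregation response. Sending them with a client is left out, and the helper names (`todays_ips_body`, `ips_from_response`, `all_time_hits_body`), the aggregation names, and the sample response are assumptions for illustration. The field names `ip`, `timestamp` and `status` are taken from the question. Adjust them (for example to `ip.keyword`) to match your mapping.

```python
def todays_ips_body(statuses, max_ips=10000):
    """First request: IPs with at least one hit today with a given status."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"range": {"timestamp": {"gte": "now/d"}}},
            {"terms": {"status": list(statuses)}},
        ]}},
        "aggs": {"ips_today": {"terms": {"field": "ip", "size": max_ips}}},
    }


def ips_from_response(response):
    """Extract the bucket keys (the IPs) from the first response."""
    return [b["key"] for b in response["aggregations"]["ips_today"]["buckets"]]


def all_time_hits_body(ips):
    """Second request: all-time hit count per IP, restricted to the given IPs."""
    return {
        "size": 0,
        # No date range or status check here: count every hit of these IPs.
        "query": {"terms": {"ip": ips}},
        "aggs": {"all_time_hits": {"terms": {"field": "ip", "size": max(len(ips), 1)}}},
    }


# Hypothetical response to the first request, trimmed to the relevant part.
first_response = {"aggregations": {"ips_today": {"buckets": [
    {"key": "123.123.123.123", "doc_count": 2},
    {"key": "123.123.123.125", "doc_count": 1},
]}}}
second_body = all_time_hits_body(ips_from_response(first_response))
```

Keep in mind that a terms query accepts a limited number of values (65,536 by default, controlled by the `index.max_terms_count` setting), so with very many distinct IPs per day the second request would have to be split into batches.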
