I have a crawling platform (microservices in Node.js/JavaScript) where I have indexed my crawled documents (each crawled URL/subpage is a single document in my MongoDB), and I want to find the best approach to efficiently search my documents by keywords and sentences.

I want to run a query with an array of keywords or sentences as efficiently as possible, then get the top 100 results back (or fewer if they don't meet a threshold). If possible, I would like the results to be unique domains, so I don't get 100 documents back for the same domain.

The main goal is to generate prospects from the search results (if the score is good enough), where I merge the contact information etc. from each crawled URL/subpage of a domain (in another microservice, not related to Elasticsearch). So if I can get the top 100 crawled documents, I know which domains are good prospects based on the search.

Right now I have about 3 million documents (URLs). Each document holds the fields I want to rank by and return, and the fields should carry different weights, e.g.:

domain (highest weight)
url (second-highest weight)
content (low weight)
headers (medium weight)

Here is the current "crawled_data" index:


{
    "mappings": {
        "properties": {
            "content": {
                "type": "text"
            },
            "headers": {
                "type": "nested",
                "properties": {
                    "h1": {
                        "type": "keyword"
                    },
                    "h2": {
                        "type": "keyword"
                    },
                    "h3": {
                        "type": "keyword"
                    },
                    "h4": {
                        "type": "keyword"
                    },
                    "h5": {
                        "type": "keyword"
                    },
                    "h6": {
                        "type": "keyword"
                    }
                }
            },
            "domain": {
                "type": "keyword"
            },
            "url": {
                "type": "keyword"
            }
        }
    }
}

Should I use a function score query? Or would that be too resource-heavy, given that it needs to run on all 3 million+ documents? (Does it?)

Would it be better to run a more advanced aggregated search query?
Or is it maybe better to build a custom microservice that handles the scoring?

Do you have any better suggestions for the approach?

2 Answers


  1. Uniqueness

    if it possible i would like the result to be unique domains, …

    Take a look at field collapsing (the collapse parameter of the search API) to get a distinct list of results based on the domain field.
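
    As a rough illustration of collapsing on domain, here is a sketch of a search body built in Node.js. The helper name buildCollapsedSearch, the example phrases and the threshold value 5 are made up for this example; size, min_score and collapse are regular search body parameters, and the field names come from the mapping in the question.

    ```javascript
    // Sketch only: builds a search body that returns at most one hit per domain.
    // buildCollapsedSearch is a hypothetical helper, not part of any client library.
    function buildCollapsedSearch(phrases, minScore) {
      return {
        size: 100,            // at most 100 results, one per domain
        min_score: minScore,  // hits scoring below this threshold are dropped
        query: {
          bool: {
            // one clause per keyword or sentence; each match adds to the score
            should: phrases.map((phrase) => ({ match: { content: phrase } })),
            minimum_should_match: 1,
          },
        },
        // field collapsing: keep only the best-scoring document per domain value
        collapse: { field: "domain" },
      };
    }

    // Example phrases and threshold are arbitrary placeholders.
    const collapsedBody = buildCollapsedSearch(["solar panels", "roof installation"], 5);
    console.log(JSON.stringify(collapsedBody, null, 2));
    ```

    The resulting object would be sent as the body of a _search request against crawled_data. The threshold has to be tuned against real scores, since score values depend on the query and the data.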

    Weighting

    To weight fields differently, you can use the boost feature in Elasticsearch when querying your index, so that the results are sorted by document relevance.

    Note: you will probably need to experiment with the boost values to get the results you expect.

    You can also take a look at the rank_feature field type and query, which boost documents based on a numeric feature value stored in the document.
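
    To make the boosting idea concrete, here is a sketch of a weighted query built in Node.js. The weights (4, 3, 2, 1) and the helper name buildWeightedQuery are assumptions chosen only to mirror the ordering in the question. Two things follow from the mapping shown there: headers is a nested field, so it is queried through a nested query, and domain, url and the header fields are keyword fields, so a match on them only hits exact whole values unless they are remapped as text.

    ```javascript
    // Hypothetical weights mirroring the question: domain > url > headers > content.
    const WEIGHTS = { domain: 4, url: 3, headers: 2, content: 1 };

    // Sketch only: returns a bool query with per-field boosts for each phrase.
    function buildWeightedQuery(phrases) {
      // "field^N" multiplies the score contribution of that field by N
      const topLevelFields = [
        "domain^" + WEIGHTS.domain,
        "url^" + WEIGHTS.url,
        "content^" + WEIGHTS.content,
      ];
      const headerFields = ["h1", "h2", "h3", "h4", "h5", "h6"].map(
        (h) => "headers." + h + "^" + WEIGHTS.headers
      );

      const should = [];
      for (const phrase of phrases) {
        should.push({ multi_match: { query: phrase, fields: topLevelFields } });
        // headers is mapped as nested, so it needs its own nested query
        should.push({
          nested: {
            path: "headers",
            query: { multi_match: { query: phrase, fields: headerFields } },
          },
        });
      }
      return { bool: { should: should, minimum_should_match: 1 } };
    }

    const weightedQuery = buildWeightedQuery(["solar panels"]);
    console.log(JSON.stringify(weightedQuery, null, 2));
    ```

    The returned object goes under the query key of a search body, so it can be combined with collapsing or with an aggregation.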

  2. Or you could use the terms and top_hits aggregation pair.

    Sample documents

    POST /crawled_data/_bulk
    {"create":{"_id":1}}
    {"domain":"url.com","text":"aaaa"}
    {"create":{"_id":2}}
    {"domain":"url.com","text":"bbbb"}
    {"create":{"_id":3}}
    {"domain":"url.com","text":"cccc"}
    {"create":{"_id":4}}
    {"domain":"http.com","text":"aaaa"}
    {"create":{"_id":5}}
    {"domain":"http.com","text":"zzzz"}
    

    Query with the aggregation pair

    GET /crawled_data/_search?filter_path=aggregations.by_url.buckets.key,aggregations.by_url.buckets.document_list.hits.hits
    {
        "aggs": {
            "by_url": {
                "terms": {
                    "field": "domain",
                    "size": 100
                },
                "aggs": {
                    "document_list": {
                        "top_hits": {
                            "size": 100
                        }
                    }
                }
            }
        }
    }
    

    Filtered response

    {
        "aggregations" : {
            "by_url" : {
                "buckets" : [
                    {
                        "key" : "url.com",
                        "document_list" : {
                            "hits" : {
                                "hits" : [
                                    {
                                        "_index" : "crawled_data",
                                        "_type" : "_doc",
                                        "_id" : "1",
                                        "_score" : 1.0,
                                        "_source" : {
                                            "domain" : "url.com",
                                            "text" : "aaaa"
                                        }
                                    },
                                    {
                                        "_index" : "crawled_data",
                                        "_type" : "_doc",
                                        "_id" : "2",
                                        "_score" : 1.0,
                                        "_source" : {
                                            "domain" : "url.com",
                                            "text" : "bbbb"
                                        }
                                    },
                                    {
                                        "_index" : "crawled_data",
                                        "_type" : "_doc",
                                        "_id" : "3",
                                        "_score" : 1.0,
                                        "_source" : {
                                            "domain" : "url.com",
                                            "text" : "cccc"
                                        }
                                    }
                                ]
                            }
                        }
                    },
                    {
                        "key" : "http.com",
                        "document_list" : {
                            "hits" : {
                                "hits" : [
                                    {
                                        "_index" : "crawled_data",
                                        "_type" : "_doc",
                                        "_id" : "4",
                                        "_score" : 1.0,
                                        "_source" : {
                                            "domain" : "http.com",
                                            "text" : "aaaa"
                                        }
                                    },
                                    {
                                        "_index" : "crawled_data",
                                        "_type" : "_doc",
                                        "_id" : "5",
                                        "_score" : 1.0,
                                        "_source" : {
                                            "domain" : "http.com",
                                            "text" : "zzzz"
                                        }
                                    }
                                ]
                            }
                        }
                    }
                ]
            }
        }
    }
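
    On the Node.js side, a response of this shape can be reduced to one entry per domain before handing it to the prospect service. This is only a sketch: the function name domainsFromBuckets and the output shape are made up, while the by_url and document_list names are the aggregation names used in the query above.

    ```javascript
    // Sketch only: flattens the filtered aggregation response into one entry per domain.
    function domainsFromBuckets(response) {
      const buckets = response.aggregations.by_url.buckets;
      return buckets.map((bucket) => {
        const hits = bucket.document_list.hits.hits;
        return {
          domain: bucket.key,
          ids: hits.map((hit) => hit._id),
          // highest score among the documents returned for this domain
          bestScore: Math.max(...hits.map((hit) => hit._score)),
        };
      });
    }

    // Trimmed-down copy of the response shown above, used as sample input.
    const sampleResponse = {
      aggregations: {
        by_url: {
          buckets: [
            {
              key: "url.com",
              document_list: {
                hits: { hits: [{ _id: "1", _score: 1.0 }, { _id: "2", _score: 1.0 }] },
              },
            },
            {
              key: "http.com",
              document_list: { hits: { hits: [{ _id: "4", _score: 1.0 }] } },
            },
          ],
        },
      },
    };

    const domainSummaries = domainsFromBuckets(sampleResponse);
    console.log(domainSummaries);
    ```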
    

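    You can adjust the size values in both aggregations: the one in terms controls how many domains come back, and the one in top_hits controls how many documents are returned per domain.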
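
    One thing to keep in mind: the request above has no query, so every document scores 1.0, and terms buckets are ordered by document count by default. If the domains should come back ordered by relevance instead, a max sub-aggregation on _score can be used as the bucket order. The sketch below assumes that is what you want; the helper name buildDomainAggregation, the best_score aggregation name and the default sizes are made up for this example.

    ```javascript
    // Sketch only: terms + top_hits, with buckets ordered by their best document score.
    function buildDomainAggregation(query, domains = 100, docsPerDomain = 3) {
      return {
        size: 0, // only the aggregation result is needed, not the plain hit list
        query: query,
        aggs: {
          by_url: {
            terms: {
              field: "domain",
              size: domains,
              order: { best_score: "desc" }, // order buckets by the sub-aggregation below
            },
            aggs: {
              // highest _score among the documents in each domain bucket
              best_score: { max: { script: { source: "_score" } } },
              document_list: { top_hits: { size: docsPerDomain } },
            },
          },
        },
      };
    }

    // Placeholder query; any scoring query can be passed in here.
    const aggregationBody = buildDomainAggregation({ match: { content: "solar panels" } });
    console.log(JSON.stringify(aggregationBody, null, 2));
    ```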
