
I find the Elasticsearch documentation a bit confusing.

I need to create an index for data like this:

{
 "company_id": uuid, //indexable
 "author_id": uuid, //indexable
 "image_url": str,
 "created_at": datetime, //indexable
 "text": array //indexable
}

The text field is a map keyed by language code. The languages are dynamic: there may be only en, or there may be it, fr, hy, and so on. For example:

en => Hello
fr => Bonjour

or

ru => Привет
hy => Բարև

So I have trouble understanding how indexing works. The comments in the structure above show which fields need to be indexed and which do not.

I tried to set the mapping manually:

$params = [
        'index' => 'posts_index',
        'body' => [
            'mappings' => [
                'properties' => [
                    'text' => [
                        'type' => 'object',
                        'properties' => []
                    ],
                    'company_id' => [
                        'type' => 'keyword',
                        'index' => true,
                    ],
                    'author_id' => [
                        'type' => 'keyword',
                        'index' => true,
                    ],
                    'image_url' => [
                        'type' => 'text',
                        'index' => false,
                    ],
                    'created_at' => [
                        'type' => 'date',
                        'index' => true,
                        'format' => 'yyyy-MM-dd HH:mm:ss'
                    ],
                    'updated_at' => [
                        'type' => 'date',
                        'index' => false,
                        'format' => 'yyyy-MM-dd HH:mm:ss'
                    ],
                ],
            ],
        ],
    ];

Then I created 11,000 test records, and Kibana shows that the index takes about 36 MB.
If I delete the index and create the 11,000 docs WITHOUT a manual mapping, the index takes about 21 MB and it looks like all fields are indexed. I want to use as little storage as possible, but on the other hand it is important to have some fields indexed.

What should I do?

2 Answers


  1. This is an interesting use case because of the dynamic keys in the text field for the different languages. Elasticsearch can handle this with dynamic mapping; you just need to be careful with the mappings so that they do not hurt performance.

    You can try this mapping:

    $params = [
        'index' => 'posts_index',
        'body' => [
            'mappings' => [
                'dynamic_templates' => [
                    [
                        'texts' => [
                            'path_match' => 'text.*',
                            'mapping' => [
                                'type' => 'text',
                                'index' => true
                            ]
                        ]
                    ]
                ],
                'properties' => [
                    'text' => [
                        'type' => 'object',
                        'enabled' => true
                    ],
                    'company_id' => [
                        'type' => 'keyword'
                    ],
                    'author_id' => [
                        'type' => 'keyword'
                    ],
                    'image_url' => [
                        'type' => 'text',
                        'index' => false
                    ],
                    'created_at' => [
                        'type' => 'date',
                        'format' => 'yyyy-MM-dd HH:mm:ss'
                    ],
                    'updated_at' => [
                        'type' => 'date',
                        'index' => false,
                        'format' => 'yyyy-MM-dd HH:mm:ss'
                    ]
                ],
            ],
        ],
    ];
    
    

    The dynamic_templates section handles any new language that gets added to the text field. I removed the explicit index settings because keyword fields are indexed by default.
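
    As a rough illustration of how documents and queries line up with this mapping, the sketch below builds parameter arrays in the shape that the elasticsearch-php client's index() and search() methods accept. The helper names buildPostDocument and buildTextSearch are hypothetical, and the sample IDs and URL are placeholders, not part of the original question.

    ```php
    <?php
    // Hypothetical helpers; only the array shapes follow the mapping above.

    /** Build index() params for one post; $texts maps language code => string. */
    function buildPostDocument(string $companyId, string $authorId, string $imageUrl, array $texts, DateTimeInterface $createdAt): array
    {
        return [
            'index' => 'posts_index',
            'body' => [
                'company_id' => $companyId,
                'author_id' => $authorId,
                'image_url' => $imageUrl,
                // Must match the mapping's date format: yyyy-MM-dd HH:mm:ss
                'created_at' => $createdAt->format('Y-m-d H:i:s'),
                // Each key becomes a field such as text.en, picked up by path_match 'text.*'
                'text' => $texts,
            ],
        ];
    }

    /** Build search() params: full-text match on every text.* field, filtered by company. */
    function buildTextSearch(string $companyId, string $query): array
    {
        return [
            'index' => 'posts_index',
            'body' => [
                'query' => [
                    'bool' => [
                        'must' => [
                            ['multi_match' => ['query' => $query, 'fields' => ['text.*']]],
                        ],
                        // Exact-value filter on the keyword field
                        'filter' => [
                            ['term' => ['company_id' => $companyId]],
                        ],
                    ],
                ],
            ],
        ];
    }

    $doc = buildPostDocument(
        'c-1',
        'a-1',
        'https://example.com/i.png',
        ['en' => 'Hello', 'fr' => 'Bonjour'],
        new DateTimeImmutable('2024-01-02 03:04:05')
    );
    $search = buildTextSearch('c-1', 'Hello');
    ```

    The resulting arrays could then be passed to $client->index($doc) and $client->search($search), assuming $client is an already configured elasticsearch-php client.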

  2. The solution suggested by Yusuf Ganiyu should work, but I agree with Val’s comment. What he was referring to is the low recall that the standard analyzer gives, because many languages produce multiple word forms for what is essentially the same word.

    Some languages are worse than others in this regard. Compare "cow", "cows" with "корова", "коровы", "коров", "корове", "коровам", "корову", "коровой", "коровою", "коровами", and "коровах". The standard analyzer will only find words in exactly the same form. If you want to find all the different "коровы" when the user types "корова", you need to apply a language-specific analyzer to each field.

    The purpose of such an analyzer is to split the text into words, which is not trivial in some languages such as Chinese, Japanese, and Korean, and to reduce words to their lemmas, which is not trivial in other languages such as Hebrew, Russian, and Ukrainian.

    So, if you don’t care about all that, go with Yusuf’s proposal. If you care about good language support, I would suggest following Val’s advice and creating a mapping that looks like this:

    PUT test
    {
      "mappings": {
        "properties": {
          "author_id": {
            "type": "keyword"
          },
          "company_id": {
            "type": "keyword"
          },
          "created_at": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss"
          },
          "image_url": {
            "type": "text",
            "index": false
          },
          "text": {
            "dynamic": "strict",
            "properties": {
              "en": {
                "type": "text",
                "analyzer": "english"
              },
              "hy": {
                "type": "text",
                "analyzer": "armenian"
              },
              "ru": {
                "type": "text",
                "analyzer": "russian"
              }
            }
          },
          "updated_at": {
            "type": "date",
            "index": false,
            "format": "yyyy-MM-dd HH:mm:ss"
          }
        }
      }
    }
    

    Here I assign each language its own analyzer. I set "dynamic": "strict" so that if somebody tries to add a record with a language you have not configured, the indexing operation fails instead of creating a dynamically mapped field that is indexed twice (by default, as a text field with the standard analyzer plus a keyword sub-field). Another option is "dynamic": "false", in which case such fields are simply not indexed. This is less noisy, but it makes a new language easier to miss.
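
    Since "dynamic": "strict" rejects documents with unconfigured languages, it may help to check the language keys on the application side before indexing. The following is a minimal sketch of such a check; the helper name unknownLanguages and the $configured list are assumptions for illustration and must be kept in sync with the mapping by hand.

    ```php
    <?php
    // Hypothetical helper: return the language keys of a post's text object
    // that are not defined under text.properties in the mapping.
    function unknownLanguages(array $texts, array $configured): array
    {
        return array_values(array_diff(array_keys($texts), $configured));
    }

    // Languages defined in the mapping above (assumed to be maintained manually).
    $configured = ['en', 'hy', 'ru'];

    $unknown = unknownLanguages(['en' => 'Hello', 'fr' => 'Bonjour'], $configured);
    // $unknown is ['fr']: with "dynamic": "strict" this document would be rejected,
    // so the application can log the new language or skip it before indexing.
    ```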

    Elasticsearch has language-specific analyzers for many languages. The full list is in the Elasticsearch documentation on language analyzers. Support for some languages is better than for others, but it is still better than using the standard analyzer.
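
    Because the question uses the PHP client, here is a sketch of how the same mapping could be generated from a language-to-analyzer map, so that adding a language means adding one entry. The function name buildPostsMapping is hypothetical, and the fr and it entries are added only as examples; english, armenian, russian, french, and italian are built-in Elasticsearch analyzer names.

    ```php
    <?php
    // Hypothetical helper: build create-index params with one analyzed
    // text sub-field per language code.
    function buildPostsMapping(string $index, array $analyzers): array
    {
        $languages = [];
        foreach ($analyzers as $lang => $analyzer) {
            $languages[$lang] = ['type' => 'text', 'analyzer' => $analyzer];
        }

        $dateFormat = 'yyyy-MM-dd HH:mm:ss';

        return [
            'index' => $index,
            'body' => [
                'mappings' => [
                    'properties' => [
                        'author_id' => ['type' => 'keyword'],
                        'company_id' => ['type' => 'keyword'],
                        'created_at' => ['type' => 'date', 'format' => $dateFormat],
                        'image_url' => ['type' => 'text', 'index' => false],
                        // Reject documents that contain an unconfigured language
                        'text' => ['dynamic' => 'strict', 'properties' => $languages],
                        'updated_at' => ['type' => 'date', 'index' => false, 'format' => $dateFormat],
                    ],
                ],
            ],
        ];
    }

    $params = buildPostsMapping('test', [
        'en' => 'english',
        'hy' => 'armenian',
        'ru' => 'russian',
        'fr' => 'french',
        'it' => 'italian',
    ]);
    ```

    The result could be passed to $client->indices()->create($params), assuming $client is a configured elasticsearch-php client. Note that the analyzer of an existing field cannot be changed in place, so changing a language's analyzer later means reindexing.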
