
I have a mongo collection that stores city/country data in multiple languages. For example, the following query:

db.cities_database.find({ "name.pl.country": "Węgry" }).pretty().limit(10);

Returns data in the following format:

[
  {
    _id: ObjectId('67331d2a9566994a18c505aa'),
    geoname_id_city: 714073,
    latitude: 46.91667,
    longitude: 21.26667,
    geohash: 'u2r4guvvmm4m',
    country_code: 'HU',
    population: 7494,
    estimated_radius: 400,
    feature_code: 'PPL',
    name: {
      pl: { city: 'Veszto', admin1: null, country: 'Węgry' },
      ascii: { city: 'veszto', admin1: null, country: null },
      lt: { city: 'Veszto', admin1: null, country: 'Vengrija' },
      ru: { city: 'Veszto', admin1: null, country: 'Венгрия' },
      hu: { city: 'Veszto', admin1: null, country: 'Magyarország' },
      en: { city: 'Veszto', admin1: null, country: 'Hungary' },
      fr: { city: 'Veszto', admin1: null, country: 'Hongrie' }
    }
  }
...
]

I want to be able to run the same query using only English (ASCII) characters, so for this example I’d like to query by "name.pl.country": "Wegry" (instead of matching the character ę exactly, I’d like Mongo to treat it as e while performing this query).

Is it possible to achieve this?

So far I tried using collation like this:

db.cities_database.find({ "name.pl.country": "Wegry" }).collation({ locale: "pl", strength: 1 }).pretty().limit(10);

but this query doesn’t return anything.

2 Answers


  1. I don’t know Polish, so I can’t say how e and ę differ. But if you use MongoDB Atlas, you can set up a custom analyzer with the icuFolding token filter to perform diacritic-insensitive search.

    The index:

    {
      "analyzer": "diacriticFolder",
      "mappings": {
        "fields": {
          "name": {
            "type": "document",
            "fields": {
              "pl": {
                "type": "document",
                "fields": {
                  "country": {
                    "analyzer": "diacriticFolder",
                    "type": "string"
                  }
                }
              }
            }
          }
        }
      },
      "analyzers": [
        {
          "name": "diacriticFolder",
          "charFilters": [],
          "tokenizer": {
            "type": "keyword"
          },
          "tokenFilters": [
            {
              "type": "icuFolding"
            }
          ]
        }
      ]
    }
    

    $search query:

    [
      {
        $search: {
          "text": {
            "query": "Wegry",
            "path": "name.pl.country"
          }
        }
      }
    ]
    

    MongoDB Atlas search playground
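
    To get a feel for what diacritic folding does to the indexed value and to the query term, here is a minimal plain-JavaScript sketch. It only approximates the idea and is not the icuFolding implementation: `foldDiacritics` is a hypothetical helper that decomposes characters with NFD, drops the combining marks, and lowercases the result. Letters that have no canonical decomposition, such as ł, pass through this sketch unchanged.

```javascript
// Hypothetical helper, not part of MongoDB or Atlas Search.
// Approximates diacritic folding: decompose (NFD), strip combining marks,
// then lowercase so the comparison is also case-insensitive.
function foldDiacritics(text) {
  return text
    .normalize("NFD")          // "ę" becomes "e" + U+0328 (combining ogonek)
    .replace(/\p{M}/gu, "")    // remove every combining mark
    .toLowerCase();
}

console.log(foldDiacritics("Węgry"));        // "wegry"
console.log(foldDiacritics("Magyarország")); // "magyarorszag"
console.log(foldDiacritics("Łódź"));         // "łodz" (ł has no NFD decomposition)
```

    Because the stored value and the query string are folded the same way, "Wegry" and "Węgry" end up as the same token.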

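  2. I think that’s simply how the Polish collation is defined; see the Polish CLDR collation chart.

    ę and Ę are shown in black, which I guess means "must match exactly", i.e. the Polish rules treat ę as a separate letter from e even at strength 1.
    Other characters (e.g. é É è È ê Ê ë Ë) are shown in grey, and for them it works: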

    db.collection.insertMany([
       { codepoint: 'U+00EB', name: 'Latin Small Letter E with Diaeresis', char: 'ë' },
       { codepoint: 'U+0119', name: 'Latin Small Letter E with Ogonek', char: 'ę' },
       { codepoint: 'U+0065', name: 'Latin Small Letter E', char: 'e' }
    ])
    

    Querying them gives:

    db.collection.find({ char: "ë" }).collation({ locale: "pl", strength: 1 })
    [
      { name: 'Latin Small Letter E with Diaeresis', char: 'ë' },
      { name: 'Latin Small Letter E', char: 'e' }
    ]
    
    db.collection.find({ char: "ę" }).collation({ locale: "pl", strength: 1 })
    [
      { name: 'Latin Small Letter E with Ogonek', char: 'ę' }
    ]
    
    db.collection.find({ char: "e" }).collation({ locale: "pl", strength: 1 })
    [
      { name: 'Latin Small Letter E with Diaeresis', char: 'ë' },
      { name: 'Latin Small Letter E', char: 'e' }
    ]
    

    Maybe you are looking for

    db.cities_database.find({ "name.pl.country": "Wegry" }).collation({ locale: "en_US_POSIX", strength: 1 })
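
    The locale dependence can also be checked outside MongoDB. The sketch below uses JavaScript's built-in `Intl.Collator`, on the assumption that `sensitivity: "base"` corresponds roughly to collation strength 1 (a primary-strength comparison). The helper name `sameAtPrimaryStrength` is made up for illustration, and the results depend on the ICU data shipped with the runtime.

```javascript
// Assumption: sensitivity "base" is the closest Intl.Collator equivalent
// of MongoDB's collation strength 1 (primary differences only).
const polish = new Intl.Collator("pl", { sensitivity: "base" });
const english = new Intl.Collator("en", { sensitivity: "base" });

// Hypothetical helper: true when the collator treats both strings as equal.
function sameAtPrimaryStrength(collator, a, b) {
  return collator.compare(a, b) === 0;
}

// Under the English rules the ogonek is only an accent difference.
console.log(sameAtPrimaryStrength(english, "Wegry", "Węgry"));

// Under the Polish rules ę is expected to count as its own letter.
console.log(sameAtPrimaryStrength(polish, "Wegry", "Węgry"));

// ë is not a Polish letter, so it is compared as an accented e.
console.log(sameAtPrimaryStrength(polish, "e", "ë"));
```

    On a runtime with full ICU data, the Polish collator should report a mismatch for "Wegry" and "Węgry", which matches the empty result of the original query with locale "pl".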
    