
I am grepping multiple files (on Ubuntu) with this command:

LANG=en_US.UTF-8 grep  -P -R -i -I -H -A1 "^name#.*?r[AÀÁÂÃÄaàáâãä]f[AÀÁÂÃÄaàáâãä][EÈÉÊËeèéêë]l s[IÌÍÎÏiìíîï]m[OÔÒÓÕÖoòóôõö].*?#.*?#.*?#.*?#.*?$" image_args_*

Which returns a few results, this being one of them:

image_args_search_134.txt:name#Rafael Simões Vieira#1767###Emerenciana Rodrigues de Oliveira
image_args_search_134.txt-#bati.#134#somelinkhere.com##
--

But if I add [EÈÉÊËeèéêë] to the pattern, right after the [OÔÒÓÕÖoòóôõö] class, as shown below:

LANG=en_US.UTF-8 grep  -P -R -i -I -H -A1 "^name#.*?r[AÀÁÂÃÄaàáâãä]f[AÀÁÂÃÄaàáâãä][EÈÉÊËeèéêë]l s[IÌÍÎÏiìíîï]m[OÔÒÓÕÖoòóôõö][EÈÉÊËeèéêë].*?#.*?#.*?#.*?#.*?$" image_args_*

Then I get nothing.

Why is that?
Thanks!

2 Answers


  1. I can only reproduce your problem when I use a locale that is not installed.

    Please verify that the required locales are enabled (not commented out with a leading #):

    grep "en_US" /etc/locale.gen
    
    # en_US ISO-8859-1
    # en_US.ISO-8859-15 ISO-8859-15
    en_US.UTF-8 UTF-8
    

    (You can adapt /etc/locale.gen to your needs by removing the leading # from the locales you want.)

    Make sure that the configured locales are actually generated:

    sudo locale-gen
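
    To check whether a locale is really available, you can look for it in the output of locale -a. A minimal sketch, assuming a glibc system where locale -a prints names such as en_US.utf8; the helper name locale_available is made up for illustration:

    ```shell
    # locale_available NAME: succeed if NAME is listed by `locale -a`.
    # Names are compared in lower case and without hyphens, because
    # glibc typically prints "en_US.utf8" for "en_US.UTF-8".
    locale_available() {
        want=$(printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -d '-')
        locale -a 2>/dev/null | tr '[:upper:]' '[:lower:]' | tr -d '-' | grep -qxF -- "$want"
    }

    if locale_available en_US.UTF-8; then
        echo "en_US.UTF-8 is available"
    else
        echo "en_US.UTF-8 is missing; generate it before running grep"
    fi
    ```

    If the locale is reported as missing, grep quietly falls back to the C locale, where the accented letters are no longer treated as single characters.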
    
  2. From what I can see, you want to match every diacritic variant of a given letter. POSIX regular expressions provide equivalence classes for exactly this:

    An equivalence class expression shall represent the set of collating elements belonging to an equivalence class, as described in Collation Order. Only primary equivalence classes shall be recognized. The class shall be expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal ( [= and =] ) delimiters. For example, if ‘a’, ‘à’, and ‘â’ belong to the same equivalence class, then [[=a=]b], [[=à=]b], and [[=â=]b] are each equivalent to [aàâb]. If the collating element does not belong to an equivalence class, the equivalence class expression shall be treated as a collating symbol.

    So you might want to write something based on:

    $ grep -i 'r[[=a=]]f[[=a=]][[=e=]]l s[[=i=]]m[[=o=]][[=e=]]s' file1 file2 file3
    

    Note that equivalence classes do not exist in PCRE, so drop -P and use extended regular expressions (-E) instead:

    $ grep -A1 -iIREH '^name#[^#]*r[[=a=]]f[[=a=]][[=e=]]l s[[=i=]]m[[=o=]][[=e=]]s[^#]*(#[^#]*){4}$' *
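
    To try the expression without touching the real files, it can be run against a few made-up records. This is only a sketch: it assumes GNU grep and a generated en_US.UTF-8 locale, and the helper name find_rafael and the sample data are invented for illustration.

    ```shell
    # Wrap the ERE from above in a helper; the locale is set for grep only.
    find_rafael() {
        LC_ALL=en_US.UTF-8 grep -iE \
            '^name#[^#]*r[[=a=]]f[[=a=]][[=e=]]l s[[=i=]]m[[=o=]][[=e=]]s[^#]*(#[^#]*){4}$' "$@"
    }

    # Made-up records in the layout from the question.
    sample=$(mktemp)
    printf '%s\n' \
        'name#Rafael Simões Vieira#1767###Someone' \
        'name#RAFAEL SIMOES#1###Someone' \
        'name#Rafael Simoes#1##Someone' \
        'name#Rafaela Lima#2###Someone' > "$sample"

    find_rafael "$sample"
    rm -f "$sample"
    ```

    With a working UTF-8 locale the first two records should be printed. The third is rejected because only three #-separated fields follow the name, and the fourth because the name itself does not match. If the locale is missing, [[=o=]] only matches a plain o, so the accented record drops out as well.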
    