
I need to make diff ignore the case of my inputs. Both inputs contain German umlauts like ä and Ä. Option -i successfully makes diff ignore the case of my input for other characters like a and A, but not for umlauts:

$ diff -i <(echo ä) <(echo Ä)
1c1
< ä
---
> Ä

The output should be empty, as ä and Ä should be seen as the same letter if case is ignored. If I try this instead:

$ diff -i <(echo a) <(echo A)

Then it works as expected (no output).

I also tried to set the environment variable LANG to make diff use the correct locale, but this didn’t seem to have any influence:

$ LANG=de_DE.UTF-8 diff -i <(echo ä) <(echo Ä)

I tried various values for LANG.

Is there a way to make diff ignore the case of German umlauts?

(I’m on Ubuntu 22.04 FWIW.)

2 Answers


  1. A simple approach:

    de-ascii()(
        sed '
            s/utf/utf_/g
            s/ä/utf{ae}/g
            s/Ä/utf{Ae}/g
        ' "$@"
    )
    
    ascii-de()(
        sed '
            s/utf{ae}/ä/g
            s/utf{Ae}/Ä/g
            s/utf_/utf/g
        ' "$@"
    )
    
    diff -i <(echo ä | de-ascii) <(echo Ä | de-ascii) | ascii-de
    
    • select an escape sequence that won’t appear in the inputs or in normal diff output
      • (utf is probably not a good choice)
    • de-ascii – replace each umlaut with an ASCII escape sequence
    • ascii-de – undo the replacement
    • pipeline: encode both inputs, diff, decode the output
    • assumes a version of sed that correctly handles UTF-8
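
If the escape sequences are more machinery than needed, a more direct variant of the same idea is to fold the uppercase umlauts to lowercase before diffing. The sketch below is only an illustration: fold_umlauts and diff_iu are made-up names, only Ä, Ö and Ü are covered, and temporary files are used so that it also runs in a plain POSIX shell.

```shell
# Hypothetical helper: fold uppercase German umlauts to lowercase.
# Each substitution matches the literal UTF-8 byte sequence of the letter.
fold_umlauts() {
    sed -e 's/Ä/ä/g' -e 's/Ö/ö/g' -e 's/Ü/ü/g' "$@"
}

# Hypothetical wrapper: run diff -i on folded copies of two files.
diff_iu() {
    tmp1=$(mktemp) && tmp2=$(mktemp) || return 2
    fold_umlauts "$1" > "$tmp1"
    fold_umlauts "$2" > "$tmp2"
    diff -i "$tmp1" "$tmp2"
    status=$?              # keep diff's exit status (0 = same, 1 = different)
    rm -f "$tmp1" "$tmp2"
    return "$status"
}
```

The trade-off compared with the encode/decode version: the diff output shows the folded text, so an Ä in the input is reported as ä. In bash the helper can also be used with process substitution instead of temporary files.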
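  2. Compare normalized strings (see Unicode normalization forms). In NFD, ä and Ä decompose into a and A followed by a combining diaeresis (U+0308), so diff -i only has to fold the ASCII base letters: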

     diff -i <(echo ä | uconv -x Any-NFD) <(echo Ä | uconv -x Any-NFD)
    

    Note: uconv comes from the icu-devtools package (sudo apt install icu-devtools).
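
When there are real differences, the diff output of the command above contains the decomposed (NFD) text. A possible way around that, sketched here under the assumption that the inputs are UTF-8, is a pair of small filters: nfd and nfc are hypothetical names, and the explicit -f/-t options are only there so the result does not depend on the current locale.

```shell
# Hypothetical filters around uconv (from icu-devtools).
# -f/-t pin input and output to UTF-8 regardless of the locale.

# Decompose: ä becomes a + U+0308 (combining diaeresis).
nfd() {
    uconv -f utf-8 -t utf-8 -x Any-NFD "$@"
}

# Recompose: a + U+0308 becomes ä again.
nfc() {
    uconv -f utf-8 -t utf-8 -x Any-NFC "$@"
}
```

In bash these slot into the command above as `diff -i <(nfd file1) <(nfd file2) | nfc`, where file1 and file2 stand for your inputs.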

    FYI:

    Form   String StrLen Unicode
    ----   ------ ------ -------
    NFC    äÄ          2 \u00e4\u00c4
    NFD    äÄ          4 \u0061\u0308\u0041\u0308
    NFKC   äÄ          2 \u00e4\u00c4
    NFKD   äÄ          4 \u0061\u0308\u0041\u0308
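
To see the same thing at the byte level without uconv, the two forms can be written out by hand and dumped as hex. This is just an illustration; hex_bytes is a made-up helper, and the octal escapes are the UTF-8 encodings of U+00E4, U+00C4 and U+0308.

```shell
# Hypothetical helper: print stdin as a compact lowercase hex string.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# Octal escapes keep the example independent of the file's own encoding.
printf '\303\244' | hex_bytes; echo    # ä precomposed (NFC): c3a4
printf '\303\204' | hex_bytes; echo    # Ä precomposed (NFC): c384
printf 'a\314\210' | hex_bytes; echo   # ä decomposed (NFD):  61cc88
printf 'A\314\210' | hex_bytes; echo   # Ä decomposed (NFD):  41cc88
```

In the decomposed form the two letters differ only in the first byte, 61 versus 41, which is plain ASCII a versus A. That is presumably why diff -i treats them as equal after NFD but not before.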
    