Friday, April 5, 2019

Case-Insensitive String Comparisons

In Python, string comparisons using == are case-sensitive.

For example, the following evaluates to False:
'python' == 'PYTHON'

The typical technique for doing a case-insensitive comparison is to use the lower() (or using upper()) method in the str class, to compare lower-case versions of both strings:
'python'.lower() == 'PYTHON'.lower() --> True
But there's a slight problem with that!

The technique above uses something called case mapping, or case conversion. It is basically a process that converts strings to a particular form, such as lower case, upper case, or title case, and should primarily be used for display purposes.

Here's an example where case mapping fails:
s1 = 'STRASSE'
s2 = 'Straße'
Technically, from a case insensitive comparison perspective these two strings are equal!

But look at what happens when we do a lower-case comparison:
s1.lower(), s2.lower(), s1.lower() == s2.lower()
we get:
('strasse', 'straße', False)
If we had done an uppercase comparison that would actually have worked:
s1.upper(), s2.upper(), s1.upper() == s2.upper()
with result:
('STRASSE', 'STRASSE', True)
The better alternative for case-insensitive comparisons is to use case folding.

Case folding essentially provides us a more consistent method that we can use to compare two strings:
s1.casefold(), s2.casefold(), s1.casefold() == s2.casefold()
with the result now being:
('strasse', 'strasse', True)
So casefold can address some of the issues surrounding case-insensitive comparisons.

But not all!

Consider the following two strings:
s1 = 'ê'
s2 = 'ê'

These may look like the same character, but:
s1 == s2 --> False

Even though these two characters look the same (and we probably would want them to compare equal), casefold will not help us here:
s1.casefold(), s2.casefold(), s1.casefold() == s2.casefold()
results in:
('ê', 'ê', False)

This is happening because these two strings use different unicode encodings to define each character.

The first one uses a single character, whereas the second one is actually two characters!
s1, len(s1), s2, len(s2) --> ('ê', 1, 'ê', 2)
We can see what those two characters are:

import unicodedata
unicodedata.name(s1) --> 'LATIN SMALL LETTER E WITH CIRCUMFLEX'
[unicodedata.name(c) for c in s2]  --> ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT'] 

One additional step we should perform is a unicode normalization.

In this case we can use something called NFD (Normal Form D) normalization (see D145 in the official Unicode documentation here: http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)

We can achieve this using Python's unicodedata.normalize function:
unicodedata.normalize('NFD', s1) == unicodedata.normalize('NFD', s2) --> True

So now, we can deal with let's say the following two strings to do a case-insensitive comparison by combining the NFD normalization and the case folding:
s1 = 'ê'
s2 = 'Ê'
Just case folding would not work:
s1.casefold() == s2.casefold() --> False

Just normalization will not work since the characters are not the same case:

unicodedata.normalize('NFD', s1) == unicodedata.normalize('NFD', s2) --> False

But combining the two:
unicodedata.normalize('NFD', s1).casefold() == unicodedata.normalize('NFD', s2).casefold() --> True

I usually end up with a small helper function:

def strcomp(a, b, case_insensitive=False):
    a = unicodedata.normalize('NFD', a)
    b = unicodedata.normalize('NFD', b)
    if case_insensitive:
        return a.casefold() == b.casefold()
    else:
        return a == b

And this will work with all our examples so far:

s1 = 'être'
s2 = 'ÊTRE'
strcomp(s1, s2)
--> False
strcomp(s1, s2, True) --> True

s1 = 'STRASSE'
s2 = 'Straße'
strcomp(s1, s2)
--> False
strcomp(s1, s2, True) --> True

Case-insensitive comparisons can be quite simple using case folding, but once you start considering internalization issues, things get more complicated very fast!

No comments:

Post a Comment

Looping N Times

The itertools  module is one of my favorite modules in the Python standard library! Here's an interesting example. Often we just wan...