MathByte Academy: Case-Insensitive String Comparisons

In Python, string comparisons using == are case-sensitive.

For example, the following evaluates to False:

'python' == 'PYTHON'

The typical technique for doing a case-insensitive comparison is to use the lower() (or using upper()) method in the str class, to compare lower-case versions of both strings:

'python'.lower() == 'PYTHON'.lower() --> True

But there's a slight problem with that!

The technique above uses something called case mapping, or case conversion. It is basically a process that converts strings to a particular form, such as lower case, upper case, or title case, and should primarily be used for display purposes.

Here's an example where case mapping fails:

s1 = 'STRASSE'
s2 = 'Straße'

Technically, from a case insensitive comparison perspective these two strings are equal!

But look at what happens when we do a lower-case comparison:

s1.lower(), s2.lower(), s1.lower() == s2.lower()

we get:

('strasse', 'straße', False)

If we had done an uppercase comparison that would actually have worked:

s1.upper(), s2.upper(), s1.upper() == s2.upper()

with result:

('STRASSE', 'STRASSE', True)

The better alternative for case-insensitive comparisons is to use case folding.

Case folding essentially provides us a more consistent method that we can use to compare two strings:

s1.casefold(), s2.casefold(), s1.casefold() == s2.casefold()

with the result now being:

('strasse', 'strasse', True)

So casefold can address some of the issues surrounding case-insensitive comparisons.

But not all!

Consider the following two strings:

s1 = 'ê'
s2 = 'ê'

These may look like the same character, but:

s1 == s2 --> False

Even though these two characters look the same (and we probably would want them to compare equal), casefold will not help us here:

s1.casefold(), s2.casefold(), s1.casefold() == s2.casefold()

results in:

('ê', 'ê', False)

This is happening because these two strings use different unicode encodings to define each character.

The first one uses a single character, whereas the second one is actually two characters!

s1, len(s1), s2, len(s2) --> ('ê', 1, 'ê', 2)

We can see what those two characters are:

import unicodedata

unicodedata.name(s1) --> 'LATIN SMALL LETTER E WITH CIRCUMFLEX'

[unicodedata.name(c) for c in s2] --> ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']

One additional step we should perform is a unicode normalization.

In this case we can use something called NFD (Normal Form D) normalization (see D145 in the official Unicode documentation here: http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)

We can achieve this using Python's unicodedata.normalize function:

unicodedata.normalize('NFD', s1) == unicodedata.normalize('NFD', s2) --> True

So now, we can deal with let's say the following two strings to do a case-insensitive comparison by combining the NFD normalization and the case folding:

s1 = 'ê'
s2 = 'Ê'

Just case folding would not work:

s1.casefold() == s2.casefold() --> False

Just normalization will not work since the characters are not the same case:

unicodedata.normalize('NFD', s1) == unicodedata.normalize('NFD', s2) --> False

But combining the two:

unicodedata.normalize('NFD', s1).casefold() == unicodedata.normalize('NFD', s2).casefold() --> True

I usually end up with a small helper function:

def strcomp(a, b, case_insensitive=False):
a = unicodedata.normalize('NFD', a)
b = unicodedata.normalize('NFD', b)
if case_insensitive:
return a.casefold() == b.casefold()
else:
return a == b

And this will work with all our examples so far:

s1 = 'être'
s2 = 'ÊTRE'
strcomp(s1, s2) --> False
strcomp(s1, s2, True) --> True

s1 = 'STRASSE'
s2 = 'Straße'
strcomp(s1, s2) --> False
strcomp(s1, s2, True) --> True

Case-insensitive comparisons can be quite simple using case folding, but once you start considering internalization issues, things get more complicated very fast!

MathByte Academy

Friday, April 5, 2019

Case-Insensitive String Comparisons

No comments:

Post a Comment

Looping N Times

Report Abuse

Labels