In Python, string comparisons using == are case-sensitive.
For example, the following evaluates to
False:
'python' == 'PYTHON'
The typical technique for doing a case-insensitive comparison is to use the
lower() (or using
upper()) method in the
str class, to compare lower-case versions of both strings:
'python'.lower() == 'PYTHON'.lower() --> True
But there's a slight problem with that!
The technique above uses something called
case mapping, or
case conversion. It is basically a process that converts strings to a particular form, such as lower case, upper case, or title case, and should primarily be used for display purposes.
Here's an example where case mapping fails:
s1 = 'STRASSE'
s2 = 'Straße'
Technically, from a case insensitive comparison perspective these two strings
are equal!
But look at what happens when we do a lower-case comparison:
s1.lower(), s2.lower(), s1.lower() == s2.lower()
we get:
('strasse', 'straße', False)
If we had done an
uppercase comparison that would actually have worked:
s1.upper(), s2.upper(), s1.upper() == s2.upper()
with result:
('STRASSE', 'STRASSE', True)
The better alternative for case-insensitive comparisons is to use
case folding.
Case folding essentially provides us a more consistent method that we can use to compare two strings:
s1.casefold(), s2.casefold(), s1.casefold() == s2.casefold()
with the result now being:
('strasse', 'strasse', True)
So
casefold can address some of the issues surrounding case-insensitive comparisons.
But not all!
Consider the following two strings:
s1 = 'ê'
s2 = 'ê'
These may look like the same character, but:
s1 == s2 --> False
Even though these two characters look the same (and we probably would want them to compare equal),
casefold will not help us here:
s1.casefold(), s2.casefold(), s1.casefold() == s2.casefold()
results in:
('ê', 'ê', False)
This is happening because these two strings use
different unicode encodings to define each character.
The first one uses a
single character, whereas the second one is actually
two characters!
s1, len(s1), s2, len(s2) --> ('ê', 1, 'ê', 2)
We can see what those two characters are:
import unicodedata
unicodedata.name(s1) --> 'LATIN SMALL LETTER E WITH CIRCUMFLEX'
[unicodedata.name(c) for c in s2] --> ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']
One additional step we should perform is a
unicode normalization.
In this case we can use something called
NFD (Normal Form D) normalization (see D145 in the official Unicode documentation here:
http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)
We can achieve this using Python's
unicodedata.normalize function:
unicodedata.normalize('NFD', s1) == unicodedata.normalize('NFD', s2) --> True
So now, we can deal with let's say the following two strings to do a case-insensitive comparison by combining the NFD normalization and the case folding:
s1 = 'ê'
s2 = 'Ê'
Just case folding would not work:
s1.casefold() == s2.casefold() --> False
Just normalization will not work since the characters are not the same case:
unicodedata.normalize('NFD', s1) == unicodedata.normalize('NFD', s2) --> False
But combining the two:
unicodedata.normalize('NFD', s1).casefold() == unicodedata.normalize('NFD', s2).casefold() --> True
I usually end up with a small helper function:
def strcomp(a, b, case_insensitive=False):
a = unicodedata.normalize('NFD', a)
b = unicodedata.normalize('NFD', b)
if case_insensitive:
return a.casefold() == b.casefold()
else:
return a == b
And this will work with all our examples so far:
s1 = 'être'
s2 = 'ÊTRE'
strcomp(s1, s2) --> False
strcomp(s1, s2, True) --> True
s1 = 'STRASSE'
s2 = 'Straße'
strcomp(s1, s2) --> False
strcomp(s1, s2, True) --> True
Case-insensitive comparisons can be quite simple using case folding, but once you start considering internalization issues, things get more complicated very fast!