Friday, August 2, 2019

Looping N Times

The itertools module is one of my favorite modules in the Python standard library!

Here's an interesting example.

Often we just want to repeat an operation n times, and we don't particularly care about the loop counter.

The common approach is to use range to do this.

For example, to repeat something 10 times, we can write:

for _ in range(10):
    print('python')

(Here I use the underscore (_) character to indicate I don't care about the loop counter - but _ is a valid variable name, so nothing special going on here)

This works fine, but we have to create a range object, and we can in fact approach this more efficiently using the itertools module.

The function itertools.repeat(obj, n) will basically create a generator that will yield the specified obj, n times.

For example:

import itertools

for s in itertools.repeat('python', 5):
    print(s)

Now in the original intent we don't actually care about the loop counter, so we can use None as the object:

for _ in itertools.repeat(None, 5):
    print('python')

Using itertools.repeat actually offers us better performance.

Let's time some operations where we want to find the maximum value of a sequence of N random numbers.

from itertools import repeat
from timeit import timeit
import random

def find_max_range(n):
    random.seed(0)
    return max(random.random() for _ in range(n))

def find_max_iter(n):
    random.seed(0)
    return max(random.random() for _ in repeat(None, n))

 print(timeit('find_max_range(10_000)', 
              globals=globals(), number=1_000))
 print(timeit('find_max_iter(10_000)', 
              globals=globals(), number=1_000))

 When I run this, I get the following timings:

  •  using range: 1.09s
  •  using repeat: 0.93s

 So using repeat is about 14% faster than using range!

Friday, April 5, 2019

Case-Insensitive String Comparisons

In Python, string comparisons using == are case-sensitive.

For example, the following evaluates to False:
'python' == 'PYTHON'

The typical technique for doing a case-insensitive comparison is to use the lower() (or using upper()) method in the str class, to compare lower-case versions of both strings:
'python'.lower() == 'PYTHON'.lower() --> True
But there's a slight problem with that!

The technique above uses something called case mapping, or case conversion. It is basically a process that converts strings to a particular form, such as lower case, upper case, or title case, and should primarily be used for display purposes.

Here's an example where case mapping fails:
s1 = 'STRASSE'
s2 = 'Straße'
Technically, from a case insensitive comparison perspective these two strings are equal!

But look at what happens when we do a lower-case comparison:
s1.lower(), s2.lower(), s1.lower() == s2.lower()
we get:
('strasse', 'straße', False)
If we had done an uppercase comparison that would actually have worked:
s1.upper(), s2.upper(), s1.upper() == s2.upper()
with result:
('STRASSE', 'STRASSE', True)
The better alternative for case-insensitive comparisons is to use case folding.

Case folding essentially provides us a more consistent method that we can use to compare two strings:
s1.casefold(), s2.casefold(), s1.casefold() == s2.casefold()
with the result now being:
('strasse', 'strasse', True)
So casefold can address some of the issues surrounding case-insensitive comparisons.

But not all!

Consider the following two strings:
s1 = 'ê'
s2 = 'ê'

These may look like the same character, but:
s1 == s2 --> False

Even though these two characters look the same (and we probably would want them to compare equal), casefold will not help us here:
s1.casefold(), s2.casefold(), s1.casefold() == s2.casefold()
results in:
('ê', 'ê', False)

This is happening because these two strings use different unicode encodings to define each character.

The first one uses a single character, whereas the second one is actually two characters!
s1, len(s1), s2, len(s2) --> ('ê', 1, 'ê', 2)
We can see what those two characters are:

import unicodedata
unicodedata.name(s1) --> 'LATIN SMALL LETTER E WITH CIRCUMFLEX'
[unicodedata.name(c) for c in s2]  --> ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT'] 

One additional step we should perform is a unicode normalization.

In this case we can use something called NFD (Normal Form D) normalization (see D145 in the official Unicode documentation here: http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)

We can achieve this using Python's unicodedata.normalize function:
unicodedata.normalize('NFD', s1) == unicodedata.normalize('NFD', s2) --> True

So now, we can deal with let's say the following two strings to do a case-insensitive comparison by combining the NFD normalization and the case folding:
s1 = 'ê'
s2 = 'Ê'
Just case folding would not work:
s1.casefold() == s2.casefold() --> False

Just normalization will not work since the characters are not the same case:

unicodedata.normalize('NFD', s1) == unicodedata.normalize('NFD', s2) --> False

But combining the two:
unicodedata.normalize('NFD', s1).casefold() == unicodedata.normalize('NFD', s2).casefold() --> True

I usually end up with a small helper function:

def strcomp(a, b, case_insensitive=False):
    a = unicodedata.normalize('NFD', a)
    b = unicodedata.normalize('NFD', b)
    if case_insensitive:
        return a.casefold() == b.casefold()
    else:
        return a == b

And this will work with all our examples so far:

s1 = 'être'
s2 = 'ÊTRE'
strcomp(s1, s2)
--> False
strcomp(s1, s2, True) --> True

s1 = 'STRASSE'
s2 = 'Straße'
strcomp(s1, s2)
--> False
strcomp(s1, s2, True) --> True

Case-insensitive comparisons can be quite simple using case folding, but once you start considering internalization issues, things get more complicated very fast!

Thursday, April 4, 2019

Pairwise Iteration

The zip function in Python can be used in some really interesting ways.

One of those ways is for iterating through an iterable in consecutive pairs.

Consider this iterable:
l = [1, 2, 3, 4, 5, 6]
Suppose we need to iterate through this list pairwise as follows:
(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)
We could certain set up a loop of indices and do it the "normal" way, but a much more Pythonic way of doing this would be to use sequence slicing and the zip function as follows:
for t in zip(l, l[1:]):
    print(t)
which will give us the result:
(1, 2)(2, 3)(3, 4)(4, 5)(5, 6)
We can tweak this further to allow us to iterate thus:
(1, 2), (3, 4), (5, 6)

by writing it this way:
for t in zip(l[::2], l[1::2]):
    print(t)
which gives us the desired result:
(1, 2)(3, 4)(5, 6)
Of course we can expand on this to iterate in 3-tuple, 4-tuple, etc.

For example:
l = range(10)for t in zip(l, l[1::], l[2::]):
    print(t)
will result in this output:
(0, 1, 2)(1, 2, 3)(2, 3, 4)(3, 4, 5)(4, 5, 6)(5, 6, 7)(6, 7, 8)(7, 8, 9)
while this:
l = range(15)
for t in zip(l[::3], l[1::3], l[2::3]):
    print(t)
will result in this output:
(0, 1, 2)(3, 4, 5)(6, 7, 8)(9, 10, 11)(12, 13, 14)
For a three character function, zip packs a lot of punch!

Comparing Lists

There are a few different ways in which we may want to compare lists, depending on what is important in the comparison.

Identity Matters

In this situation we really want to know whether two lists objects are the same object or not. 

For this type of comparison, we want to use the is operator.

For example, if we have these three lists:
list_1 = [1, 2, 3]
list_2 = [1, 2, 3]
list_3 = list_1
Then list_1 and list_2 are not the same objects, but list_1 and list_3 are, so we have:
     
list_1 is list_2 --> False 
list_1 is list_3 --> True


Order and Duplicates Matter

In this case we don't care about whether the two lists objects are the same object, we just want to know if they have the same content, and in particular order matters as well as duplicate objects.

For example these two lists would not be considered equal because the items are not ordered in the same way:
[1, 2, 3, 4]
[4, 3, 2, 1]
In the same way, these two lists are not considered equal because they do not contain the same repeated elements:
[1, 1, 2, 2]
[1, 2]
For these types of comparisons, we can just use the == operator.
So we would have:
[1, 2, 3, 4] == [1, 2, 3, 4] --> True
[1, 2, 3, 4] == [4, 3, 2, 1] --> False


Order Ignored, Duplicates Matter

This one is less usual. Here we really want to know if the two lists contain the same elements, including repeated elements, but we don't care about the order of these elements.

For example each of these lists would be considered equal:
[1, 2, 3, 4, 4]
[1, 4, 3, 2, 4]
[4, 4, 3, 2, 1]
So what we really want to know here is whether the number of each distinct element matches between the two lists, irrespective of their order in the list.

The easiest way to do this is to use the Counter object in the collections module.

A Counter object is essentially a multi-set - you can think of it as a dictionary with a key for each unique element in the list and a corresponding value of the number of times the element appears in the list. 

So in the above example, the Counter for each list would be the same:
{1: 1, 2: 1, 3:1, 4:2}
We can implement this type of equality testing as follows:

from collections import Counter
list_1 = [1, 2, 3, 4, 4]
list_2 = [1, 4, 3, 2, 4]
Counter(list_1) == Counter(list_2)
--> True

Order Ignored, Duplicates Ignored

In this case duplicates can be ignored, and order does not matter. This is basically how sets are compared. Sets do not have duplicate elements, and sets have no ordering.

For example, the following lists would be considered equal:
[1, 2, 2, 3, 3, 4]
[4, 3, 2, 1]
We can easily do this using Python sets, by converting the lists to sets (which will result in duplicate elements being ignored), and comparing the two resulting sets using == (where order is ignored since sets do not have ordering):
list_1 = [1, 2, 2, 3, 3, 4]
list_2 = [4, 3, 2, 1]
set(list_1) == set(list_2)
--> True


And there you go, four different ways of comparing lists! Of course, the same works with (finite) iterables in general, not just lists.

Looping N Times

The itertools  module is one of my favorite modules in the Python standard library! Here's an interesting example. Often we just wan...