simont | RepugNaNt

Crossposts: http://simont.livejournal.com/234629.html

RepugNaNt

I had a disgusting thought just now. I needed to have some Python code check whether a dictionary contains a particular key/value pair. The obvious approach,

if dictionary[key] == value:

isn't sufficient, because I need to not throw an exception if the dictionary doesn't contain any mapping for that key at all.

An obvious way is to use a second clause to handle the latter case, so you end up writing

if key in dictionary and dictionary[key] == value:

But that's unpleasantly verbose, and also undesirable if dictionary or key is a complicated expression or one with side effects. So I wondered if we could do better.

Another possibility is to use the get() method of Python dictionaries, which lets you specify a default return value if the dictionary hasn't got the key in question. So you might say, for instance,

if dictionary.get(key, None) == value:

But in order to be able to do that, you have to be able to think of a default that will guarantee to compare unequal to value; for instance here, what if the value is None? If you know something about value (e.g. its type) then this is probably feasible one way or another, but one can imagine more general situations in which that wouldn't be the case.

I think the right answer, in fact, is to write

if (key, value) in dictionary.viewitems():

where viewitems() is a standard method which presents the dictionary as if it were a set of ordered pairs. (You could also use dictionary.items(), which would be semantically equivalent, but much slower due to constructing an explicit list and then iterating along it.)

But before I found the viewitems() method, I thought a bit harder about the get() approach. Perhaps, I thought, I could define a class with a comparison method that always returned false, so that an instance of that class would compare unequal to anything, including itself.

And then, as I thought the phrase ‘compare unequal to anything, including itself’, a stray neuron fired in my head. I think this would work pretty reliably:

if dictionary.get(key, float('nan')) == value:

and its only downside is that it is the most disgusting thing ever! :-)

Flat | Top-Level Comments Only

Ewww...

Especially the way you have to say float('nan') - surely there's a better way of getting hold of NaN??

*thinks* Now that you mention it, if someone calls your hypothetical function with (key, NaN) is the desired behaviour to return false under all circumstances, or to test for NaN? :)

I did wonder that myself, and reached no conclusion :-) so for the moment I'm comfortable with leaving that case unspecified and hence not disqualifying any solution for its behaviour in that situation.

I'm no Python expert, but it looks naïvely as though the viewitems version would take O(n) time where the others took O(log n)? Or does Python optimise?

In C++, what you want is relatively straightforward:

auto i = dictionary.find(key);
if (i!=dictionary.end() && *i==value) {

I assume Python has nothing similar to C++ STL's iterators to help you?

…except strictly it's probably i->second==value rather than *i==value, because dictionary will probably be an associative container.

…and you'd have to name the iterator's type explicitly before C++11.

…and in C++ you might be using an associative container that can map the same key to multiple values, in which case you'd need to know whether you wanted all entries with that key to have the specified value or at least one.

http://www.python.org/dev/peps/pep-3106/ (http://www.python.org/dev/peps/pep-3106/) has "pseudo-code to define the semantics":

class d_items:

    def __init__(self, d):
        self.__d = d

    def __len__(self):
        return len(self.__d)

    def __contains__(self, (key, value)):
        return key in self.__d and self.__d[key] == value

    (etc.)

which suggests that that sort of optimisation (if you can really call it that) is what is in mind.

it looks naïvely as though the viewitems version would take O(n) time

The Python help says that if D is a dictionary then D.viewitems() returns "a set-like object providing a view on D's items", which seems to me like a clear hint that it's not going to the effort of physically constructing an actual Python set object but is instead providing a duck-typed thing that behaves like a set. So I would expect it to DTRT, performance-wise.

In C++, what you want is relatively straightforward:

It's straightforward if you're happy to use a whole statement and a variable declaration! All of the Python idioms I suggest above, including the two-clause one that seems to me most analogous to your C++ version, can be used in the middle of a larger expression without having to stop to declare local variables first.

Edited 2011-12-07 16:33 (UTC)

Indeed, a quick test confirms that viewitems() answers queries of this type quickly as I'd expected. I just did this:

>>> d = dict([(i,i*i) for i in range(1000000)])
>>> (5,47) in d.viewitems()
False
>>> (5,47) in d.items()
False

and there was a delay of a second or so when constructing the dictionary in the first place, and when testing via items(), but testing via viewitems() was instant.

Yeah, I like C++, but that just seems to be a different way of expressing "if key in dictionary and dictionary[key] == value". FWIW, I think you can obviously choose to use a local variable in python, or to write out "dictionary.find(key)" twice in C++, depending whether you trust your optimiser to avoid doing double work if you write the code twice.

It seems what you actually want is an operator built into the language which does something like:

#define MACRO(lhs,rhs) (lhs) && (lhs rhs)

MACRO(dictionary[key],==value)

but without the ugliness of being a macro. But I don't know if any language does have that?

I'm sure you could ~~cock~~ cook one up in Lisp :-)

ROFL. Yeah, I guess you can. The limiting factor is that in most languages, you can't pass "==value" as an argument, but I guess in LISP that's just a function and then it works. It just has lots of brackets in :)

I'm sure that in lisp macros don't affect the level of ugliness.

Beautifully put :-)

If any C++ compiler managed to optimise two calls to dictionary.find(key) down to one I'd be extraordinarily impressed. Even more so if it managed without complicity between compiler and STL.

Delving deeper, I've always felt the STL gets things significantly wrong with iterators. Rather than iterators, it should be dealing in ranges, things you can dereference, increment and test for truth. Amongst the many more major advantages, that would mean one could do:

if (auto r = dictionary.find(key))
    if (r->second==value)

You're still doing two steps, but at least they nest and scope tidily!

If any C++ compiler managed to optimise two calls to dictionary.find(key) down to one I'd be extraordinarily impressed

Hm. I guess I thought "calling a member function with no side effects, doing a comparison, then calling the same function again with the same arguments" was one of the most obvious optimisations, but I guess (a) "no side effects" isn't obvious (if this is in a function, dictionary should probably be passed as a const reference, and then it will use a const version of find, assuming there is one, but I guess even then, if it's in a library, there's no way for the calling code to know it wasn't modifying some global or mutable state) and (b) and even if it were I know really, really little about what compilers actually do in practice, so you're probably right.

Also, the compiler doesn't know the structure find is traversing doesn't alias other objects (the key, for example). And it needs to pay attention to what side effects the compare functor might have (probably none, but not definitely none, without analysis).

Also, it's easy to write inherently efficient code that uses std::map, so compiler authors might not have seen optimising inefficient uses as a priority.

A quick with "gcc -S" shows you're right. I don't know if a whole program optimisation can do any better, although probably not.

I feel silly, because I've been generally trying to prioritise clarity over micro-optimisation, which normally means _not_ using an extra local variable (unless it makes it equally or more clear, as in some loops, or is necessary, as in cpu-bound inner loops) and now I wonder if I'm wrong.

I wonder, _should_ C++ be able to apply "const" somewhere to say "this function doesn't change any global or static state"? Would it help?

You can specify the const keyword when defining a class method, indicating that it's permissible to apply the method to a const instance of that class.

But the semantics aren't quite the same. It's easy to imagine functions that don't change any state and are safe to apply to a const object, but which nonetheless don't reliably return the same value if called twice. (For instance, a function that returns the number of nanoseconds until the object's expiry date, or one that returns the current contents of the file whose pathname is stored in the object. eta: or, come to think of it, a function that simply dereferences a pointer stored in the object! Much simpler, and no need to invoke the mysterious concurrency of the real world :-)

So you actually want a marker less like "const" and more like "pure", indicating that whatever the function does has no side effects and depends on only the obvious input values, so that the optimiser is permitted to assume that apparently identical calls to it are actually identical and fold them together.

Edited 2011-12-08 11:57 (UTC)

const is a semantic contract, and an overloaded-function differentiator, which is more to ensure correctness for the benefit of user code than intended to be any help at all to the compiler.

Apart from overload resolution, I believe there is never any circumstance in which changing the constness of a function would affect how it compiled. In the very most extreme case, it might const_cast<MyType *>(this).

Conversely, except in the actually-quite-common case where polymorphic objects or function pointers impinge, the compiler could, in theory, recursively annotate functions with their purity. But take a look at just how deep the call stack gets when you say if (dictionary.count(key) && dictionary[key]==value), having regard to all the std::pair accessors, forwardings to the underlying red-black tree, std::less calls, etc. While determining purity and eliminating a richer set of common sub-expressions is conceptually possible, and might even be achievable in time better than O(n²) on program size, it's a tall order.

In many cases the STL manages to look like it ought to make what you want to do short and elegant (if not always comprehensible to the untrained eye), but on deeper inspection either the standard or all implementations manage to make it just ever so slightly the wrong side of impossible. (If it didn't get so close, it wouldn't be so irritating.)

if (std::count_if(C.equal_range(key), value)>0) ...

Except that equal_range returns a std::pair of iterators and the <algorithm> functions and equivalent container methods only take individual iterators, so you have to at least calculate the range as a temporary.

if (C.count(C::value_type(key, value))>0) ...

Except that count() (or just find()!) are only defined on key_type, not value_type.

This is another case in point for my claim up there that the STL should offer ranges instead of — or at least as well as — iterators.

Given range support, especially if containers themselves presented the range concept, the algorithms that currently accept a pair of iterators would all accept a range and your std::count_if example would work out of the box.

Now that C++11 is out there, I'm seriously considering putting some concrete proposals in for the next revision of C++, including ranges.

if dictionary.get(key, not value) == value:

Ooh, neat!

I did consider using a transformed version of value, e.g. value+1 if it was an integer or value+"x" if a string, but in each of those cases I couldn't see any reason you'd do that – if you knew the type of value anyway, it would be easier to just compare it against some constant thing not of that type.

But it hadn't occurred to me that not value would apply to all types. Cunning.

Yeah. I considered that, if the function is parameterised by the type of value, you could have a meta-type, but I didn't expect to think of anything neat. Thinking about it, the tricky (but not very efficient) way would be to have a vector (or any other container type) containing value (assuming python doesn't support vectors nested infinitely deep :))

But "not" is very good. Are there any weird edge cases? :)

It doesn't fix the NaN <> NaN problem :-)

Not _quite_ equivalent to what simont had originally:

>>> class foo:
...   def __cmp__(self, other):
...     if other is True: return 0
...     if other is False: return 0
...     return 1
... 
>>> f = foo()
>>> g = foo()
>>> f == g
False
>>> d = {}
>>> d[5] = f
>>> d.get(6, not f) == f
True
>>> 6 in d and d[6] == f
False

Note cmp is -1,0,1 for less than/equal/greater than.

If you abuse semantics like that you deserve to lose :-)

Yes, I think that clever as

ptc24's counterexample is (and doubly so because I tried to construct a similar thing and didn't quite get it to work), it falls into the same category as

cartesiandaemon's question about NaN: if you define a class which has deliberately weird behaviour when compared, then you've violated the warranty of any of these methods once you try to use an instance of that class as a key or value in the dictionary or as a key or value to test for. It's only fair game to use it as the dummy non-value.

I may be missing something, but what is wrong with:
try: if dictionary[key] == value: # do stuff else: # do other stuff except: # do different stuff
?

It's a zillion lines long and requires dropping out of expression context to statement context!

If I was writing some complicated boolean expression along the lines of

if a<b and c==d and (dict[key]==value or other_equally_good_case):

and realised that the dictionary might not contain the key at all, then I'd much rather correct the code to

if a<b and c==d and ((key,value) in dict.viewitems() or other_equally_good_case):

than have to turn it into something like this:

try:
    tmpvar = dict[key]==value
except:
    tmpvar = False
if a<b and c==d and (tmpvar or other_equally_good_case):

There just doesn't seem to me to be any contest there.

Actually, that's close to what I'd probably write (though not always; context is everything, and I'd do "except KeyError" not bare except, of course). Though the reason for that is that I wouldn't call the variable tmpvar; I'd give it a meaningful name, and think of it as an opportunity to write more self-documenting code.

I don't think I've ever come across this in practice, though; normally, I just work with dicts which either don't contain None, or where None would be expected to be treated the same as the entry being missing.

That should be "except KeyError:" :p

[sorry, "except:" is one of my pet python hates]

Why not just create a function that does the nasty multi-line thing, and give it a nice interface, rather than trying to create something horrific that you can put inline?

Do you think (key, value) in dictionary.viewitems() is horrific? It seems to me that it's pretty close to ideal. The only improvement would be if I'd been able to just write (key, value) in dictionary, on the principle that a dictionary is mathematically speaking a restricted form of relation and a relation is a subset of the set of possible ordered pairs. Sadly, in dictionary was already used for something else.

It doesn't read naturally to me,and tell me what it's doing. I'd rather end up with code like:
if dictionary.itemequals(key,value):
but that may just be my lack of familiary with Python showing. If the former code is idiomatic Python then go for it.

My brain clicked this morning, and suddenly the viewitems seems much more natural. I was clearly too tired to be looking at code when I commented last night.

On the other hand, I freely concede the horrificness of the NaN approach! But then, I never intended to put that one in actual code, only to post it on the Internet and marvel at how horrible it was :-)

It's a rare and precious thing for me to see 'a proper language' doing something even nastier than my daily fare of VBA.

Standard Python idiom for creating unique sentinel values is to create an object() just for the job (since if you try to == it will fall back to is). So:

if dictionary.get(key, object()) == value:

gets you the same solution without NaN abuse. And remember that objects are pretty lightweight—they don't even have a __dict__.

Again, vulnerable to stupid classes such as:

>>> class bar:
...   def __cmp__(self, other):
...     return 0

but as

fanf says, if you do that, you deserve to lose.

If value can be a bar then I think the whole concept is an intractable problem. (Unless you know about bar and special case it, and given that the original issue was how to do this elegantly, that fails heavily.)

Flat | Top-Level Comments Only

RepugNaNt

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject