This article is part 9 of the Readability series.


Yes: a dictionary is a data type. No: a dictionary is not a way to implement abstract data types; doing so is lazy programming and is asking for trouble later on.

What do I mean by this? In Python and other similar dynamic languages, a dictionary is a mapping of keys to values with no typing restrictions: the dictionary is heterogeneous, and a single dictionary can contain elements of different types both as its keys and as its values. To make things worse, the syntax of the language makes it incredibly easy to create and populate dictionaries (unlike, say, in C++). Combine these two facts and the temptation to abuse a dictionary to implement a structured data type is high.
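To make the two facts concrete, here is a minimal sketch. The names grab_bag and limit and all the values are made up for illustration.

```python
# Hypothetical example: a single dictionary can mix unrelated key and
# value types, and the literal syntax makes it trivial to write.
grab_bag = {
    'name': 'init',        # str key, str value
    1: ['cpu', 'ram'],     # int key, list value
    ('a', 'b'): 3.5,       # tuple key, float value
}

# The same syntax makes it just as easy to fake a structured record.
limit = {'cpu': 0.5, 'ram': 1024}
limit['disk'] = 10  # Any new "field" can be added at any time.
```

Nothing in the language distinguishes the second dictionary from the first: both are just mappings, even though the second one is being used as if it were a type with fields.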

Let’s look at a fictitious function to check if a given process is within its current resource limits:

def process_within_limits(pid, limits, usages):
    """Checks if a process is within its limits.

    Args:
        pid: int.  The process identifier.
        limits: dict(int, dict(str, int)).  Mapping of process
            identifiers to the limits for the corresponding
            process.  The limits of a process are a mapping of
            resource names to the numerical limit.  The valid
            names currently are 'cpu' and 'ram'.
        usages: dict(int, dict(str, int)).  Same as limits but
            for the current instantaneous measurements of the
            process resource consumption.

    Returns:
        bool.  True if the process is within its limits.
    """
    if pid not in limits:
        raise ValueError('Missing process limit')
    limit = limits[pid]

    if pid not in usages:
        raise ValueError('Missing process usage')
    usage = usages[pid]

    assert 'cpu' in usage and 'cpu' in limit
    cpu_in_quota = usage['cpu'] <= limit['cpu']
    assert 'ram' in usage and 'ram' in limit
    ram_in_quota = usage['ram'] <= limit['ram']
    return cpu_in_quota and ram_in_quota

This code nests two dictionaries in the limits and usages arguments and uses each level with different semantics. At the first level we have a mapping of process identifiers to either the process's limits or its current usage counts; in other words, a perfectly valid use case for a dictionary, because the set of keys is open-ended and only known at run time. At the second level, however, we have a fixed collection of resource names mapped to values, which is really a record with known fields; needless to say, representing it as a dictionary is bad practice (with very few exceptions).
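A short sketch of why the second level is fragile. The process identifiers and values below are hypothetical; only the 'cpu' and 'ram' names come from the docstring above.

```python
# Hypothetical data, following the shape described in the docstring.
limits = {
    100: {'cpu': 50, 'ram': 4096},
    200: {'cpu': 25, 'rma': 2048},  # Typo in 'ram'; still a valid dict.
}

# First level: any pid is a legitimate key, so lookup by key is natural.
known_pids = sorted(limits)

# Second level: the set of keys is fixed by convention only.  The typo
# is accepted when the data is built and only surfaces at the point of
# use, possibly far away from where the mistake was made.
try:
    limits[200]['ram']
    error = None
except KeyError as e:
    error = e
```

This is why the function needs the assert statements on 'cpu' and 'ram': the dictionary itself gives no guarantee that the expected keys are there.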

The way to improve this code is by defining an actual data type so that the various attributes are properly represented by member fields. In Python, we can use the standard collections.namedtuple class to simplify this:

import collections

# Resource limit or usage values for a process.
#
# Attributes:
#     cpu: float.  Resource value for the CPU usage.
#     ram: int.  Resource value for the RAM usage, in bytes.
Resources = collections.namedtuple('Resources', 'cpu ram')

def process_within_limits(pid, limits, usages):
    """Checks if a process is within its limits.

    Args:
        pid: int.  The process identifier.
        limits: dict(int, Resources).  Mapping of process
            identifiers to the limits for the corresponding
            process.
        usages: dict(int, Resources).  Mapping of process
            identifiers to the current instantaneous resource
            usage values.

    Returns:
        bool.  True if the process is within its limits.
    """
    if pid not in limits:
        raise ValueError('Missing process limit')
    limit = limits[pid]

    if pid not in usages:
        raise ValueError('Missing process usage')
    usage = usages[pid]

    cpu_in_quota = usage.cpu <= limit.cpu
    ram_in_quota = usage.ram <= limit.ram
    return cpu_in_quota and ram_in_quota

This is already quite an improvement. There are two things that I want to highlight here:

  • The explanation of the function arguments no longer describes what the contents of the resources dictionary should be. Doing so is now unnecessary because we have an actual type with documentation to do so.
  • Member fields are accessed as attributes instead of dynamically via a map query. This allows accesses to be validated at build time (not in the case of Python, of course) and also allows the enforcement of types and/or data invariants, if any.
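Even without build-time checks, the named tuple fails earlier and more loudly than the dictionary did. The following is a minimal sketch with made-up values; note that collections.namedtuple validates field names only, not the types of the values.

```python
import collections

Resources = collections.namedtuple('Resources', 'cpu ram')

limit = Resources(cpu=0.5, ram=1024)

# A misspelled field name is rejected when the value is built...
try:
    Resources(cpu=0.5, rma=1024)
    construction_error = None
except TypeError as e:
    construction_error = e

# ...and a misspelled attribute fails on access instead of being
# treated as yet another key.
try:
    limit.rma
    access_error = None
except AttributeError as e:
    access_error = e
```

Because every Resources value is guaranteed to carry both fields, the assert statements from the first version of the function are no longer needed.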

There is one more twist to all this. By having extracted the dictionary as a data type, we can now clearly see that some functionality belongs in the data type itself for encapsulation purposes: invariant checking, operator overloading, auxiliary methods… In this specific example, just imagine that you ever wanted to add a new resource dimension to the Resources type: you wouldn’t like to have to hunt down all callers to ensure they know about the new field! So we move the necessary functionality into the type:

import collections

class Resources(collections.namedtuple('Resources', 'cpu ram')):
    """Resource limit or usage values for a process.

    Attributes:
        cpu: float.  Resource value for the CPU usage.
        ram: int.  Resource value for the RAM usage, in bytes.
    """

    def __le__(self, other):
        """Checks that every resource in self is at most that in other."""
        return self.cpu <= other.cpu and self.ram <= other.ram

def process_within_limits(pid, limits, usages):
    """Checks if a process is within its limits.

    Args:
        pid: int.  The process identifier.
        limits: dict(int, Resources).  Mapping of process
            identifiers to the limits for the corresponding
            process.
        usages: dict(int, Resources).  Mapping of process
            identifiers to the current instantaneous resource
            usage values.

    Returns:
        bool.  True if the process is within its limits.
    """
    if pid not in limits:
        raise ValueError('Missing process limit')
    limit = limits[pid]

    if pid not in usages:
        raise ValueError('Missing process usage')
    usage = usages[pid]

    return usage <= limit
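Two details are worth spelling out in a sketch. First, invariant checking can live in the type as well; the non-negative rule below is an assumption of mine for illustration, not something the original code enforces. Second, overriding only __le__ leaves the other comparison operators with the lexicographic behavior inherited from tuple, so callers should stick to <= unless the remaining operators are overridden too. The sketch assumes Python 3 for the argument-less super() call.

```python
import collections


class Resources(collections.namedtuple('Resources', 'cpu ram')):
    """Same type as above, with a hypothetical invariant check added."""

    __slots__ = ()  # Avoid a per-instance __dict__, as in a plain namedtuple.

    def __new__(cls, cpu, ram):
        # Assumed invariant: resource values are never negative.
        if cpu < 0 or ram < 0:
            raise ValueError('Resource values must be non-negative')
        return super().__new__(cls, cpu, ram)

    def __le__(self, other):
        return self.cpu <= other.cpu and self.ram <= other.ram


usage = Resources(cpu=0.4, ram=300)
limit = Resources(cpu=0.5, ram=200)

# Field-wise comparison: RAM is over its limit.
within_limits = usage <= limit

# Inherited from tuple: compares cpu first and stops there.
lexicographic_lt = usage < limit
```

With these values, usage <= limit is false while usage < limit is true, which is exactly the kind of surprise that is easy to fix once in the type and hard to fix across every caller.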

Let me conclude by saying that this is all inspired by actual production code I’ve had to deal with… so the example and its simplicity are not that contrived.
