Identity by Any Other Name

What’s in a name? That which we call a rose by any other name would smell as sweet.
— William Shakespeare (Romeo and Juliet)

Communications of the ACM, April 2019, Vol. 62 No. 4, Page 80
Practice: "Identity by Any Other Name"
By Pat Helland

As distributed systems scale in size and heterogeneity, increasingly identifiers connect them. These may be called IDs, names, keys, numbers, URLs, file names, references, UPCs (Universal Product Codes), and many other terms. Frequently, these terms refer to immutable things. At other times, they refer to stuff that changes as time goes on. Identifiers are even used to represent the nature of the computation working across distrusting systems.

The fascinating thing about identifiers is that while they identify the same "thing" over time, that referenced thing may slide around in its meaning. Product descriptions, reviews, and inventory balance all change, while the product ID does not. Reservations, orders, and bookings all have identifiers that do not change, while the stuff they identify may subtly change over time.
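As a minimal sketch of this idea, consider an order that holds only a product identifier while the catalog entry behind it changes. The names here (`catalog`, `order`, `describe`, and the `SKU-123` and `ORD-1` values) are hypothetical and chosen only for illustration.

```python
# Hypothetical catalog keyed by product ID; the values are free to change.
catalog = {"SKU-123": {"description": "Red rose bouquet", "inventory": 10}}

# The order stores only the identifier, never a copy of the product details.
order = {"order_id": "ORD-1", "product_id": "SKU-123", "quantity": 2}


def describe(order, catalog):
    """Resolve the order's product ID against the current catalog."""
    product = catalog[order["product_id"]]
    return f'{order["quantity"]} x {product["description"]}'


before = describe(order, catalog)

# The referenced "thing" slides around: description and inventory change.
catalog["SKU-123"]["description"] = "Dozen red roses, long stem"
catalog["SKU-123"]["inventory"] -= 2

after = describe(order, catalog)
```

The order itself is untouched between the two calls, yet `before` and `after` differ because the identifier resolves to whatever the catalog currently says.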

Identity and identifiers provide the immutable linkage. Both sides of this linkage may change, but they provide a semantic consistency needed by the business operation. No matter what you call it, identity is the glue that makes things stick and lubricates cooperative work.

This article is yet another thought experiment and rumination about the complex cacophony of intertwined systems.

The Need for Identity

For a long time, we worked behind the façade of a single centralized database. Attempting to talk to other computers was considered an "application problem" and not in the purview of the system. Data lived as values in cells in the relational database. Everything could be explained in simple abstractions, and life was good!

Then, we started splitting up centralized systems for scale and manageability. We also tried to get different systems that had been independently developed to work together. That created many challenges in understanding each other and ensuring predictable outcomes, especially for atomic transactions.

As time moved on, a number of usage patterns emerged that address the challenges of work across both homogeneous and heterogeneous boundaries. All of those patterns depend on connecting things with notions of identity. The identities involved frequently remain firm and intact over long periods of time.

Data on the outside vs. data on the inside. In 2005, I wrote a paper, "Data on the Outside versus Data on the Inside,"[7] that explored what it means to have data not kept in the SQL database but rather kept in messages, files, documents, and other representations. It turns out that information not kept in databases emerges as immutable messages, files, values (à la key/values), or other representations. These are typically semi-structured in their representations, but they always have some form of identifier.

Scale, long-running, and heterogeneous. Systems are knit together by identity, too. As homogeneous solutions are designed for scale, shards, replicas, and caches are all based on some form of identity. Solutions respond to stimuli over time, using one or more representations of identity to figure out what work to restart or continue. Connecting independently created systems with their own private and distrusting implementations always uses shared identities and identifiers that are the crux of their cooperation.
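One common way shards come to be "based on some form of identity" is to hash the identifier and take the result modulo the number of shards. The sketch below assumes that approach; the function name `shard_for`, the choice of SHA-256, and the sample identifier are illustrative assumptions, not something the article prescribes.

```python
import hashlib


def shard_for(identifier: str, num_shards: int) -> int:
    """Map an identifier to a shard number in the range [0, num_shards).

    A cryptographic digest is used instead of Python's built-in hash()
    because hash() on strings is randomized per process, while every
    node in a distributed system must agree on the same placement.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# The same identifier always routes to the same shard.
first = shard_for("order-1001", 4)
second = shard_for("order-1001", 4)
```

Because placement depends only on the identifier, a replica or cache can use the same function to find the data without consulting the shard that owns it. Note that simple modulo placement moves most keys when `num_shards` changes; real systems often use other schemes for that reason.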

Searching and learning. Many other parts of the computing landscape depend on identities. Searching assigns document IDs and then organizes indices of search terms associated with them. Machine learning binds attributes with identities. In many cases, a set of attributes becomes interesting and is then assigned an identity. The system repeatedly works to associate even more attributes to them. It's when these attributes form patterns across the identities that the machine has learned something.
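The search case can be sketched as a tiny inverted index: each document is assigned an ID on arrival, and each search term maps to the set of IDs that contain it. The class name `TinyIndex` and its methods are hypothetical, and the whitespace tokenization is a deliberate simplification.

```python
from collections import defaultdict
from itertools import count


class TinyIndex:
    """A toy inverted index: search terms point at document IDs."""

    def __init__(self):
        self._next_id = count(1)          # source of fresh document IDs
        self.docs = {}                    # doc_id -> original text
        self.postings = defaultdict(set)  # term -> set of doc_ids

    def add(self, text):
        """Assign a new document ID and index the document's terms."""
        doc_id = next(self._next_id)
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        return doc_id

    def search(self, query):
        """Return the IDs of documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
first_id = index.add("red rose garden")
second_id = index.add("a rose by any other name")
```

The index never stores documents under their contents, only under their IDs, so the text behind an ID could be moved or re-stored elsewhere without touching the postings.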

Identities: The new fulcrum. Computing patterns show our dependency on identities. We used to look only at relational databases but now we see pieces of computation and storage interconnected by identities. The data and computation connected by identities can swirl and shift around.

The identifiers connecting these pieces remain immutable while the stuff they identify spins and dances and evolves. Similarly, whatever is using the identity may be simply a mirage while the identifier used remains solid.

What's in a name? This article refers to identities. There are an astonishing number of synonyms for identity. All that really matters is that the identity is unique within the spatial and temporal bounds of its use. Name, key, pointer, file name, handle, check number, UPC, UUID (universally unique identifier), ASIN (Amazon Standard Identification Number), part number, model number, SKU (stock keeping unit), and more are unique either globally or within the scope of their use. It is the immutable nature of each identifier within the scope of its use that allows it to be the interstitial glue that holds computation together.


About the author:

Pat Helland has been implementing transaction systems, databases, application platforms, distributed systems, fault-tolerant systems, and messaging systems since 1978. He currently works at Salesforce.