2017-06-18

What is wrong with UUIDs and GUIDs

Introduction

Universally Unique Identifiers (UUIDs), otherwise known as Globally Unique Identifiers (GUIDs), are 128-bit numbers that are often used to identify information. In its canonical representation, a UUID looks like this: 2205cf3e-139c-4abc-be2d-e29b692934b0.

The Wikipedia entry for Universally Unique Identifier () says that they are, for practical purposes, unique, and that while the probability that a UUID will be duplicated is not zero, it is so close to zero as to be negligible. Wikipedia then does the math and shows that if 103 trillion UUIDs are generated, the chance of duplication among them is one in a billion.
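
The one-in-a-billion figure can be checked with the standard birthday approximation. What follows is a minimal sketch, assuming random (version 4) UUIDs, which carry 122 random bits because 6 of the 128 bits are fixed version and variant markers. The function name is my own and not part of any library.

```python
import math

def collision_probability(n: int, random_bits: int = 122) -> float:
    """Approximate chance of at least one duplicate among n random ids.

    Uses the birthday approximation p = 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    space = 2 ** random_bits
    exponent = n * (n - 1) / (2 * space)
    # expm1 keeps precision when the exponent is tiny.
    return -math.expm1(-exponent)

# 103 trillion UUIDs, the figure quoted from Wikipedia.
p = collision_probability(103 * 10**12)
print(f"{p:.2e}")  # roughly one in a billion
```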

Despite the infinitesimally small chances of receiving a duplicate UUID, there exist programmers out there who are afraid of this actually happening, and who will not hesitate to suspect duplicate UUIDs as being responsible for an observed malfunction in their software rather than first looking for a bug in their code. Clearly, these folks do not understand the meaning of "infinitesimally small chance".

Great. Now, let me tell you why I hate UUIDs.

(Useful pre-reading: About these papers)

Known disadvantages

Disadvantages of UUIDs that are unanimously recognized are the following:

  • A UUID is 4 times larger than a regular 32-bit integer. This undeniably affects the performance and storage demands of a system. Apparently, the industry has decided that the benefits of UUIDs are so great that they are worth the sacrifice.
  • The randomness of UUIDs is technically unsuitable in certain scenarios, for example in database clustered indexes, where the record ids must be sequential. When a UUID is needed in such applications, a special kind of UUID is used which contains a sequential part, but its uniqueness guarantees are severely limited. (Remember that one-in-a-billion chance of duplication mentioned earlier? Well, you may forget it now.)
  • UUIDs are cumbersome to debug with, because they are unreadable, non-sequential, and non-repeatable. Debugging is a notoriously difficult process, so we do not need anything that makes it harder than it already is, but the use of UUIDs imposes an additional burden on debugging.
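
The size claim in the first bullet is easy to verify. This is a small sketch using only the Python standard library; the variable names are mine.

```python
import struct
import uuid

int32_size = struct.calcsize("<i")      # bytes in a 32-bit integer
uuid_size = len(uuid.uuid4().bytes)     # bytes in a UUID's binary form
text_size = len(str(uuid.uuid4()))      # characters in the canonical text form

print(int32_size, uuid_size, text_size)
print(uuid_size // int32_size)          # how many times larger the UUID is
```

Note that the canonical text form, which is what usually ends up in logs and on screens, is larger still than the 16-byte binary form.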

In the paragraphs that follow I will address some of those disadvantages in greater detail, and I will also address some disadvantages that I have personally identified with UUIDs.

The entropy

When looking at a table of columns, I find that the UUID column is always the angry column. This is because the 32 hexadecimal digits that make up a UUID have a higher concentration of entropy than anything else that I deal with during a regular working day. (It helps that IntelliJ IDEA spares me from having to see git commit hashes.) This is to say that the overwhelming majority of all the entropy that I am exposed to nowadays is due to seeing UUIDs. This was not happening in the days before the UUID; entire weeks could pass without seeing something as hopelessly nonsensical as a UUID, requiring me to coerce my brain to ignore it because there is no sense to be made there.

The higher the entropy of the visual stimulus we are exposed to, the higher the cognitive effort required to process it, even if just to dismiss it as un-processable. This makes UUIDs very tiresome to work with.

The Undebuggability

Ben Morris says in The Problem with GUIDs (http://www.ben-morris.com/the-problem-with-guids/):

This readability issue is often dismissed as mere inconvenience, but it’s a real problem for anybody who has to support applications or trouble-shoot data. GUIDs are often a lazy solution selected by developers who will not have to deal with the support consequences.

If you’re going to replicate or combine disparate data sources then you really will need some globally unique identifiers. However, this is an implementation detail that does not have to be baked into data design. There’s nothing to stop you from adding separate identifiers onto your data rows in response to replication requirements.

Let me explain in a bit more detail what the problem is with troubleshooting in a system that identifies entities using UUIDs instead of regular sequentially issued integers.

  • With sequentially issued integers you can take a mental note of the id of the entity that you are troubleshooting, and then see when and where it pops up. This means noting, say, the number 1015, and then looking for a 1015 to appear again. With UUIDs you cannot do that, because a UUID is impossible to memorize. You literally cannot tell that the UUID that you are seeing now is the same as a UUID that you saw a few seconds earlier. Even if you write down the UUID that you are looking for, there is still considerable difficulty in visually comparing a UUID on the screen with a copy you made earlier.
  • While you are looking for that 1015, if you see 1010, you know you are close. When you see 1020, you know you passed it. With UUIDs, you cannot do that, because they do not form a sequence. Even when UUIDs are of the special sequentially issued kind, the sequential part is hidden among random digits, making extraction difficult, and even if you detect the subset of the digits that make up the counter, it is in hexadecimal instead of decimal, so it is hard to make sense out of it.
  • In the meantime, when the ids of some other entity increment from 2100 to 2200 while the ids of your entity go from 1010 to 1020, you know that for every entity of the kind you are troubleshooting, 10 entities of the other kind are being generated. So, if you suddenly see a newly issued id of the other kind in the 3000 range, you know that something for some reason generated more of that kind of entity than expected. No such hint is available when using UUIDs, because they are just random numbers.
  • Most importantly, on a subsequent test run, starting with the same initial database state, you can expect the exact same sequential ids to be issued, so you have the exact same ids to troubleshoot. Not so with UUIDs, which are entirely different from run to run.
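
The last point, repeatability, can be shown in a few lines. This is only an illustrative sketch: a fresh counter stands in for a database restored to the same initial state, and the function names and the starting id are made up for the example.

```python
import itertools
import uuid

def run_with_sequential_ids(count: int) -> list[int]:
    # A fresh counter stands in for a database restored to the same initial state.
    ids = itertools.count(start=1001)
    return [next(ids) for _ in range(count)]

def run_with_uuids(count: int) -> list[uuid.UUID]:
    # Random (version 4) UUIDs: nothing carries over from one run to the next.
    return [uuid.uuid4() for _ in range(count)]

first_run = run_with_sequential_ids(3)
second_run = run_with_sequential_ids(3)
print(first_run, first_run == second_run)      # identical ids on both runs

print(run_with_uuids(3) == run_with_uuids(3))  # different ids on every run
```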

So, what it boils down to is that none of the most common lines of reasoning are applicable when troubleshooting UUIDs: you are constantly in the dark about most aspects that have to do with the identifiers of the entities that you are dealing with.

Let that sink in for a moment:

The identifier of an entity is what you use to identify the entity with.

It is a very important piece of information.

Arguably, in most scenarios, it is the most important piece of information about an entity.

UUIDs invalidate all previously known methods of reasoning about identifiers.

They are essentially useless to humans.

We don't want that.

As a matter of fact, let me put it in blunt terms to drive home a point:

What kind of idiot thought that this would be a good idea?

The Needlessness

I agree that UUIDs have certain uses, but quite often I see them being used in situations where they are not needed, or are even unwanted. Here is a Stack Overflow question where some genius is assigning names to his threads, and he is using UUIDs as names: Stack Overflow - Writing a custom ThreadPool ()

The only scenario where you really need UUIDs is when you have a decentralized system (consisting of "nodes") in which all of the following conditions hold true:

  1. You want to have no single point of failure and therefore no single node issuing unique identifiers.
  2. You have such high performance requirements that you do not want the nodes to have to coordinate with each other in order to issue unique identifiers.
  3. You are for some reason unable to issue a guaranteed unique node id to each node, so as to trivially solve the problem of unique keys by making each key consist of node id + node-local sequential number.
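
The "node id + node-local sequential number" scheme from the third condition can be sketched as follows. The bit widths, the class name, and the helper are assumptions chosen for illustration; a real system would pick widths to suit its own node count and id volume.

```python
import itertools

NODE_BITS = 16      # assumed: room for 65,536 nodes
COUNTER_BITS = 48   # assumed: room for about 2.8e14 ids per node

class NodeIdGenerator:
    """Issues 64-bit keys of the form (node id << 48) | local counter."""

    def __init__(self, node_id: int) -> None:
        if not 0 <= node_id < 2 ** NODE_BITS:
            raise ValueError("node id out of range")
        self.node_id = node_id
        self._counter = itertools.count(start=1)

    def next_id(self) -> int:
        local = next(self._counter)
        if local >= 2 ** COUNTER_BITS:
            raise OverflowError("node-local counter exhausted")
        return (self.node_id << COUNTER_BITS) | local

def split(key: int) -> tuple[int, int]:
    """Recover (node id, local sequence number) from a key."""
    return key >> COUNTER_BITS, key & (2 ** COUNTER_BITS - 1)

a, b = NodeIdGenerator(1), NodeIdGenerator(2)
keys = [a.next_id(), a.next_id(), b.next_id()]
print([split(k) for k in keys])  # [(1, 1), (1, 2), (2, 1)]
```

Keys issued this way are guaranteed unique as long as node ids are unique, they need no coordination between nodes, and both halves remain readable when troubleshooting.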

If you do not have a situation that meets all of the above criteria, then you are only using GUIDs because you heard of some really smart and successful guys using them on some really monstrous systems, and you want to be like them.

The only kind of scenario that I can think of that would actually meet the above criteria would be a system with such a large number of nodes, and such a high new node join rate, that negotiation for a unique node id for each new node would be impractical. There are probably not very many systems in existence on the planet with such requirements, which in turn means that every single one of them is a special case. There is really no point in imposing a worldwide curse on computing just because a few special cases benefit from it.

If you are using a database, then you probably already have a single point of failure. So, go ahead and use an SQL SEQUENCE, which is very efficient because it caches thousands of ids at a time, and has been available in RDBMS products since the eighties, and part of the standard since SQL:2003.
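
To show why caching makes a sequence cheap, here is a toy sketch of the general idea. It is not how any particular RDBMS implements SEQUENCE; both classes and the block size of 1000 are invented for the example.

```python
import threading

class CentralSequence:
    """Stands in for the database-side sequence: one authoritative counter."""

    def __init__(self, start: int = 1) -> None:
        self._next = start
        self._lock = threading.Lock()
        self.round_trips = 0

    def reserve(self, block_size: int) -> range:
        # One round trip hands out a whole block of ids.
        with self._lock:
            block = range(self._next, self._next + block_size)
            self._next += block_size
            self.round_trips += 1
            return block

class CachedSequence:
    """Client-side view that contacts the central counter once per block.

    Not thread-safe; a single-threaded sketch only.
    """

    def __init__(self, central: CentralSequence, block_size: int = 1000) -> None:
        self._central = central
        self._block_size = block_size
        self._block = iter(())

    def next_id(self) -> int:
        for value in self._block:
            return value
        # Cached block exhausted: reserve the next one.
        self._block = iter(self._central.reserve(self._block_size))
        return next(self._block)

central = CentralSequence()
seq = CachedSequence(central)
ids = [seq.next_id() for _ in range(2500)]
print(ids[0], ids[-1], central.round_trips)  # 1 2500 3
```

Issuing 2500 ids costs only three trips to the central counter, and every id is still sequential and guaranteed unique.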

Many people appear to be under the impression that UUIDs are necessary for replication, but that is not true. What is necessary for replication is row identifiers that are unique over all nodes that participate in the replication of a specific table. That is "system-wide per-table unique identifiers", which are not even system-unique identifiers, and certainly a far cry from "globally-unique" identifiers. A unique row identifier could be created by concatenating a unique node identifier with a node-local, table-specific, sequential row number. It is an arbitrary choice of Microsoft SQL Server to require a ROWGUIDCOL of the xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx format for merge replication (and for transactional replication with queued updating subscriptions), and if we are to believe the documentation, this requirement can be circumvented by creating your own GUIDs instead of using Microsoft's newid() function.

Another thing that is sometimes cited as a benefit of UUIDs is their alleged ability to be issued off-line. "Off line" was a condition that computing systems could suffer from in the old times. It is generally not an issue today, and the vast majority of those who cite this as a benefit of UUIDs do not really have an application at hand which really needs to be able to issue ids off-line. However, even in the extremely rare case where being "off-line" is an issue today, it can be taken care of with special handling. We really do not need to pollute everything everywhere with nonsensical entity identifiers just because some exceedingly rare special cases might benefit from them.

The paradigm shift

When sequentially incrementing integers are used as identifiers, they represent an absolute guarantee that every identifier will be unique. When UUIDs are used, they represent an almost-absolute guarantee.

Thus, UUIDs have introduced a fundamental and completely unwanted paradigm shift in programming: we have gone from systematic absolute determinism (never leaving anything to chance) to systematic non-absolute determinism (regularly leaving something to a, however minuscule, chance).

You see, that's what the almost-absolute is: non-absolute. I am not sure all these people who are so happily using UUIDs realize this. I find it sacrilegious, like picking a buffer size which is not a power of two.

The technological compromise

Furthermore, I am not sure people realize that UUIDs represent a technological compromise. Why are UUIDs only 128 bits instead of 256 bits? 256 bits would give even more guarantees of uniqueness, right? How about 512 bits to really make sure no duplicate ever gets issued in this universe and in all parallel universes that we might one day somehow come in contact with? Wouldn't that be the ultimate? Well, obviously, there will always be an even higher number of bits that will always be better, so what it boils down to is that a compromise has been made.

The thing with GUIDs is that we don't want them to be huge, because then they would be wasteful, so someone had to come up with a number of bits that is small enough to not be too wasteful and yet large enough to give a reasonable guarantee against collisions.
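
To put numbers on that compromise, the birthday approximation can be inverted to ask how many random ids it takes to reach a one-in-a-billion chance of collision at each width. This is a rough sketch: 122 is the number of random bits in a version 4 UUID, while the 256-bit and 512-bit rows are hypothetical fully random identifiers, and the function name is mine.

```python
import math

def ids_for_collision_probability(random_bits: int, p: float = 1e-9) -> float:
    """Approximate number of random ids that yields collision probability p.

    Inverts the birthday approximation p ~ n**2 / (2 * 2**bits).
    """
    return math.sqrt(2 * p) * 2.0 ** (random_bits / 2)

for bits in (122, 256, 512):
    print(bits, f"{ids_for_collision_probability(bits):.2e}")
```

The first row lands at roughly 103 trillion, matching the Wikipedia figure. Each extra bit multiplies the count by only the square root of two, so more bits always buy more headroom, but never certainty.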

However, if history has taught us anything, it is that technological compromises always seem very reasonable at the time that they are made, and invariably turn out to be unreasonable at a later point in time. The difference between saying "128 bits should be enough for everyone" and saying "640K should be enough for everyone" is quantitative, not qualitative. In our century 128 bits seem to be a good compromise, but with almost mathematical certainty there will be another century when this compromise will not be so good anymore.

Epilogue

I do believe that there will be a time, maybe a couple of thousand years from now, maybe sooner, when we will be colonizing the galaxy, our population will be in the trillions, the number of individual devices embedded everywhere will be in the quadrillions, and these devices will be generating UUIDs at rates that are unthinkable today. When that time comes, we will inevitably start running into trouble with duplicate UUIDs popping up every once in a while in distant areas of the galaxy, and then it will be like 640K of memory all over again, two-digit-year millennium bug all over again, DLL hell all over again, all of them combined.

When that time comes, I hope that we as a species still have some sufficiently low-level understanding of how our computers work, so as to be able to fix them. I fear we might not.



This post was inspired by a Stack Overflow answer that I wrote, here:
https://stackoverflow.com/a/8642874/773113
