On UUIDs and GUIDs

Universally Unique Identifiers (UUIDs), otherwise called Globally Unique Identifiers (GUIDs), are 128-bit numbers that are often used to identify information. In its canonical representation, a UUID looks like this: 2205cf3e-139c-4abc-be2d-e29b692934b0
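
For anyone who wants to poke at that representation, here is a minimal Python sketch using the standard library's uuid module. It parses the example above and shows the same value as 32 hex digits, as one big integer, and as 16 raw bytes. The variable names are only illustrative.

```python
import uuid

# Parse the canonical 8-4-4-4-12 text form shown above.
u = uuid.UUID("2205cf3e-139c-4abc-be2d-e29b692934b0")

hex_digits = u.hex     # the 32 hexadecimal digits, hyphens removed
as_integer = u.int     # the same value as a single integer of at most 128 bits
raw_bytes = u.bytes    # the same value as 16 raw bytes

print(len(hex_digits))       # number of hex digits
print(len(raw_bytes) * 8)    # number of bits
print(u.version)             # version nibble: the first digit of the third group
```

The example UUID happens to be a version 4 (random) UUID, which is the kind that the rest of this post is mostly about.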

The Wikipedia entry for Universally Unique Identifier (https://en.wikipedia.org/wiki/Universally_unique_identifier) says that they are "for practical purposes unique" and that "while the probability that a UUID will be duplicated is not zero, it is so close to zero as to be negligible."  Wikipedia then does the math and shows that if 103 trillion UUIDs are generated, the chance of duplication among them is one in a billion.
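
That figure can be sanity-checked with the usual birthday-problem approximation. The sketch below is not Wikipedia's derivation, just a rough check under two assumptions: the UUIDs are version 4 (122 random bits, since 6 of the 128 bits are fixed), and the generator is perfectly random. The function name is made up for this example.

```python
import math

# Assumption: version-4 UUIDs, where 6 of the 128 bits are fixed
# (version and variant), leaving 122 random bits.
RANDOM_BITS = 122

def collision_probability(count: int, bits: int = RANDOM_BITS) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2**bits))."""
    exponent = count * (count - 1) / (2 * 2**bits)
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-exponent)

p = collision_probability(103 * 10**12)  # 103 trillion UUIDs
print(f"{p:.3e}")  # should come out close to 1e-9, i.e. one in a billion
```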

Great.  Now, let me tell you why I hate UUIDs.

The 32 hexadecimal digits that make up a UUID have a higher concentration of entropy than anything else that I deal with during a regular working day.  (It helps that IntelliJ IDEA spares me from having to see git commit hashes.)  This is to say that the overwhelming majority of all the entropy that I am exposed to nowadays is due to seeing UUIDs. This was not happening in the days before the UUID; entire weeks could pass without seeing something as hopelessly nonsensical as a UUID, requiring me to coerce my brain to ignore it because "there is no sense to be made here". The higher the entropy of the visual stimulus we are exposed to, the higher the cognitive effort required to process it, even if just to dismiss it as un-processable. This makes UUIDs very tiresome to work with. When looking at a table of columns, the UUID column is always the angry column.

I agree that UUIDs have certain uses, but quite often I see them being used in situations where they are not needed, or where they are outright unwanted. Here is a Stack Overflow question where some genius is assigning names to his threads, and he is using UUIDs as names: Stack Overflow - Writing a custom ThreadPool (https://stackoverflow.com/questions/44198702/writing-a-custom-threadpool)

Disadvantages of UUIDs that are universally recognized are the following:
  • A UUID is 4 times larger than a regular 32-bit integer. This undeniably affects the performance and storage demands of a system.  (Apparently, the industry has decided that the benefits of UUIDs are so great that they are worth the sacrifice, but I am not convinced.)
  • The randomness of UUIDs is technically unsuitable in certain scenarios, for example in database clustered indexes, requiring the use of a special kind of UUID in these cases which is partially sequential. The uniqueness guarantees of this special kind of UUID are severely limited. (Remember those 103 trillion ids for a one-in-a-billion chance of duplication mentioned earlier? Well, you may forget it now.)  Funnily, I have seen implementations (and heard of the existence of many more) where in order to overcome this problem they introduce a regular sequentially increasing primary key based on IDENTITY or SEQUENCE, so that it can be clustered, and they also have a GUID column which is unique but non-primary. So, they defeat the purpose of using a GUID in the first place, because there is now a single source of sequentially increasing numbers in the system.
  • UUIDs are cumbersome to debug with, because they are unreadable and non-sequential. Debugging is a notoriously difficult process, so we do not need anything that makes it harder than it already is. The use of UUIDs, however, imposes an unreasonable burden on debugging.

Ben Morris says in The Problem with GUIDs (http://www.ben-morris.com/the-problem-with-guids/):

"This readability issue is often dismissed as mere inconvenience, but it’s a real problem for anybody who has to support applications or trouble-shoot data. GUIDs are often a lazy solution selected by developers who will not have to deal with the support consequences."

"If you’re going to replicate or combine disparate data sources then you really will need some globally unique identifiers. However, this is an implementation detail that does not have to be baked into data design. There’s nothing to stop you from adding separate identifiers onto your data rows in response to replication requirements."

Let me explain in a bit greater detail what the problem is with troubleshooting in a system that identifies entities using UUIDs instead of regular sequentially issued integers.
  • With sequentially issued integers you can take a mental note of the id of the entity that you are troubleshooting, and then see when and where it pops up. This means noting, say, the number 1015, and then looking for a 1015 to appear again. With UUIDs you cannot do that, because a UUID is impossible to memorize. You literally cannot tell that the UUID that you are seeing now is the same as a UUID that you saw earlier. If you are still young and you have extra mental capacity to spare, you might actually be able to do it once or twice at the beginning of the working day, but you can't keep doing it, at least not reliably. Even if you write down the UUID that you are looking for, there is still considerable difficulty in visually comparing a UUID on the screen with a copy you made earlier.
  • While you are looking for that 1015, if you see 1010, you know you are close.  When you see 1020, you know you passed it.  With UUIDs, you cannot do that, because they do not form a sequence.  Even when UUIDs are of the special sequentially issued kind, the sequential part is hidden among random digits, making extraction difficult, and even if you detect the subset of the digits that make up the counter, it is in hexadecimal instead of decimal.
  • In the meantime, when the ids of some other kind of entity go from 2100 to 2200 while yours go from 1010 to 1020, you know that for every entity of the kind you are troubleshooting, 10 entities of the other kind are being generated. So, if you suddenly see a newly issued id of the other kind in the 3000 range, you know that something for some reason generated more of that kind of entity than expected. No such hint is available when using UUIDs, because they are just random numbers.
  • On a subsequent test run, starting with the same initial database state, you can expect the exact same sequential ids to be issued, so you have the exact same ids to troubleshoot.  Not so with UUIDs, which are entirely different from run to run.
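
The last point is easy to demonstrate. The following is a toy Python sketch, not a real database: run_with_sequence and run_with_uuids are hypothetical stand-ins for a test run that creates a few entities starting from the same initial state.

```python
import itertools
import uuid

def run_with_sequence(entity_count: int) -> list[int]:
    """Simulate a test run starting from the same initial state (next id = 1)."""
    next_id = itertools.count(1)
    return [next(next_id) for _ in range(entity_count)]

def run_with_uuids(entity_count: int) -> list[uuid.UUID]:
    """Simulate the same test run, but with random (version 4) UUIDs as ids."""
    return [uuid.uuid4() for _ in range(entity_count)]

# Two runs with sequential ids issue exactly the same ids ...
same_ids = run_with_sequence(5) == run_with_sequence(5)
# ... whereas two runs with UUIDs have nothing in common.
same_uuids = run_with_uuids(5) == run_with_uuids(5)

print(same_ids, same_uuids)
```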

So, what it boils down to is that none of the most common lines of reasoning are applicable when troubleshooting UUIDs: you are constantly in the dark about any aspects that have to do with the identifiers of the entities that you are dealing with.

Let that sink in for a moment:

The identifier of an entity is what you use to identify the entity with.

It is a very important piece of information.

Arguably, in most scenarios, it is the most important piece of information about an entity.

UUIDs invalidate all previously known methods of reasoning about identifiers.

They are essentially useless to humans.

We don't want that. 

As a matter of fact, let me put it in somewhat blunt words to drive home a point:

What kind of idiot thought that this would be a good idea?

UUIDs are quite often being used in situations where they are not really needed.  The only scenario where you really need UUIDs is when you have a decentralized system (consisting of "nodes") in which all of these conditions hold true:
  1. You want to have no single point of failure and therefore no single node issuing unique identifiers.
  2. You have such high performance requirements that you do not want the nodes to have to coordinate with each other in order to issue unique identifiers.
  3. You are for some reason unable to issue a guaranteed unique node id to each node, so as to trivially solve the problem of unique keys by making each key consist of node id + node-local sequential number.
If you do not have a situation that meets all of the above requirements, then you are only using GUIDs because you heard of some really smart and successful guys using them on some really monstrous systems, and you want to be like them.

The only kind of scenario that I can think of that would actually meet the above requirements would be a system with such a large number of nodes, and such a high new node join rate, that negotiation for a unique node id for each new node would be impractical.  There are probably not very many systems in existence on the planet with such requirements, which in turn means that every single one of them is a special case.  There is really no point in imposing a worldwide curse on computing just because a few special cases might benefit from it.

If you are using a database, then you probably already have a single point of failure.  So, go ahead and use an SQL SEQUENCE, which can cache thousands of ids at a time, and has been part of the standard since SQL:2003.
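
To see why caching makes a central sequence cheap, consider this in-memory Python sketch of block allocation. It only illustrates the idea and is not how any particular database implements sequences; CentralSequence and CachedIdAllocator are hypothetical names invented for the example, and the block size of 1000 is arbitrary.

```python
import threading

class CentralSequence:
    """Hypothetical stand-in for a database sequence: the single id authority."""

    def __init__(self, start: int = 1):
        self._next = start
        self._lock = threading.Lock()

    def reserve(self, block_size: int) -> int:
        """Reserve block_size consecutive ids and return the first of them."""
        with self._lock:
            first = self._next
            self._next += block_size
            return first

class CachedIdAllocator:
    """Issues ids locally, contacting the central sequence once per block.

    Not thread-safe by itself; the sketch assumes one allocator per worker.
    """

    def __init__(self, sequence: CentralSequence, block_size: int = 1000):
        self._sequence = sequence
        self._block_size = block_size
        self._next = 0
        self._limit = 0  # exclusive upper bound; the cache starts out empty

    def next_id(self) -> int:
        if self._next >= self._limit:
            # Cache exhausted: one round trip buys another block of ids.
            self._next = self._sequence.reserve(self._block_size)
            self._limit = self._next + self._block_size
        value = self._next
        self._next += 1
        return value

sequence = CentralSequence()
a = CachedIdAllocator(sequence)
b = CachedIdAllocator(sequence)
ids_a = [a.next_id() for _ in range(3)]  # [1, 2, 3]
ids_b = [b.next_id() for _ in range(3)]  # [1001, 1002, 1003]
```

Each allocator talks to the central authority once per thousand ids, and the ids it hands out are still plain integers that a human can read and compare. The price is that ids from different allocators interleave rather than forming one strict global sequence.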

Many people appear to be under the impression that UUIDs are necessary for replication, but that is not true. What is necessary for replication is row identifiers that are unique over all nodes that participate in the replication of a specific table. That is "system-wide per-table unique identifiers", which are not even system-unique identifiers, and certainly a far cry from "globally-unique" identifiers.  A unique row identifier could be created by concatenating a unique node identifier with a node-local, table-specific, sequential row number.  It is an arbitrary choice of Microsoft SQL Server to require a ROWGUIDCOL of the xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx format for merge replication (and transactional replication with queued updating subscriptions), and if we are to believe the documentation, this requirement can be circumvented by creating your own GUIDs instead of using Microsoft's newid() function.
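
The node identifier plus row number scheme could look something like the following Python sketch. It assumes each node has already been given a unique integer node id by some means outside the sketch; RowId and NodeRowIdGenerator are hypothetical names, and the "7-1" text form is just one possible way of writing the pair.

```python
import itertools
from collections import defaultdict
from typing import NamedTuple

class RowId(NamedTuple):
    """A row identifier: (unique node id, node-local per-table row number)."""
    node_id: int
    row_number: int

    def __str__(self) -> str:
        return f"{self.node_id}-{self.row_number}"

class NodeRowIdGenerator:
    """Issues row ids on one node without coordinating with other nodes."""

    def __init__(self, node_id: int):
        # Assumption: node_id was assigned uniquely when the node joined.
        self.node_id = node_id
        # One independent counter per table, each starting at 1.
        self._counters = defaultdict(lambda: itertools.count(1))

    def next_row_id(self, table: str) -> RowId:
        return RowId(self.node_id, next(self._counters[table]))

node_7 = NodeRowIdGenerator(node_id=7)
node_8 = NodeRowIdGenerator(node_id=8)

first = node_7.next_row_id("customer")   # node 7, row 1
second = node_7.next_row_id("customer")  # node 7, row 2
other = node_8.next_row_id("customer")   # node 8, row 1: no clash with node 7

print(first, second, other)
```

Identifiers built this way cannot collide as long as node ids are unique, and they stay short, sequential per node, and readable.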

Another thing that is sometimes cited as a benefit of UUIDs is their alleged ability to be issued off-line.  "Off line" was a condition that computing systems could suffer from in the old times. It is generally not an issue today, and the vast majority of those who cite this as a benefit of UUIDs do not really have an application at hand which really needs to be able to issue ids off-line.  However, even in the extremely rare case where "off-line" may be an issue even today, it can be taken care of with special handling; we really do not need to pollute everything everywhere with nonsensical entity identifiers just because some exceedingly rare special cases might benefit from them.

I believe that UUIDs have introduced a fundamental and completely unwanted paradigm shift in programming: we have gone from systematic absolute determinism (leaving nothing to chance) to systematic non-absolute determinism (leaving something to a, however minuscule, chance).  You see, that's what the almost-absolute is: non-absolute. I am not sure all these people who are so happily using UUIDs realize this. I find it sacrilegious, like picking buffer sizes which are not a power of two.

Furthermore, I am not sure people realize that UUIDs represent a technological compromise.  Why are UUIDs only 128 bits instead of 256 bits?  256 bits would give even more guarantees of uniqueness, right?  How about 512 bits to really make sure no duplicate ever gets issued in this universe and in all parallel universes?  Wouldn't that be the ultimate?  Well, obviously, there will always be an even higher number of bits that will always be better, so what it boils down to is that a compromise was made.

The thing with GUIDs is that we don't want them to be huge, because then they would be wasteful, so someone had to come up with a number of bits that is small enough to not be too wasteful and yet large enough to give a reasonable guarantee against collisions.
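
The trade-off can be put into rough numbers with the birthday approximation, solved for the number of ids instead of the probability. This is only a back-of-the-envelope Python sketch: the function name is made up, the one-in-a-billion threshold is borrowed from the Wikipedia figure quoted earlier, and the 64-bit and 256-bit rows describe hypothetical fully random identifiers, not any existing UUID format.

```python
import math

def ids_for_collision_probability(random_bits: int, probability: float) -> float:
    """Approximate n at which P(collision) ~= probability.

    Uses n ~= sqrt(2 * 2**bits * p), which holds for small probabilities.
    """
    return math.sqrt(2.0 * 2.0**random_bits * probability)

# 122 bits is what a version-4 UUID actually has; 64 and 256 are hypothetical.
for bits in (64, 122, 256):
    n = ids_for_collision_probability(bits, 1e-9)
    print(f"{bits:>3} random bits: about {n:.3g} ids for a one-in-a-billion chance")
```

For 122 random bits this gives roughly 1.03e14, which is the 103 trillion quoted at the top. Fewer bits run out of room quickly, and more bits buy headroom at the cost of size.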

However, if history has taught us anything, it is that technological compromises are always reasonable at the time that they are made, and not reasonable at a later point in time.  In our century 128 bits seem to be a good compromise, but with almost mathematical certainty there will be another century when this compromise will not be so good anymore.

I do believe that there will be a time, maybe in a couple of thousand years from now, maybe sooner, when we will be colonizing the galaxy, our population will be in the trillions, the number of individual devices embedded everywhere will number in the quadrillions, and the number of uniquely-identified items of information generated by those devices every Sol year will number in the quintillions, and then we will start running into trouble with duplicate UUIDs popping up every once in a while in distant areas of the galaxy, and then it will be like 640k of memory all over again, DLL hell all over again, two-digit-year millennium bug all over again, all of them combined.

I hope that when that time comes, we as a species will still have some low level understanding of how our computers work, so as to be able to fix them.  I fear we might not.

This post was inspired by a Stack Overflow answer that I wrote, here:
