Choosing a Postgres Primary Key

https://supabase.com/blog/choosing-a-postgres-primary-key

Primary keys are important. They uniquely identify rows of data in tables, and make it easy to fetch data. The job of a database is to archive and recall data and you're going to have a hard time finding data without a good primary key or a good index.

Sometimes it makes sense to use a “natural key” (like an email column in a users table) and sometimes it’s better to use a “surrogate key”, a value made for the purpose of identifying a row (and has no other meaning).

At first glance, the question of which primary key to use is easy! Just throw a integer/serial on there, right? Numeric IDs are cool, but what about random value IDs or Universally Unique IDentifiers (UUIDs)?

Turns out the question of which identifier (and in this case, UUID) to use is complicated -- we're going to dive into some of the complexity and inherent trade-offs, and figure things out:

  • What are the choices for identifiers?
  • If we choose to use/add UUIDs, which ones should we choose?
  • How can we get these UUIDs into postgres?
  • Which UUIDs perform best?

But first, a quick history lesson.

A brief history of identifiers and why we use them

integer/biginteger

Let's think about identifying rows of data from first principles. What's the first way you might think of identifying things? Assigning them numbers!

We can set up a table like this:

-- Let's enable access to case-insensitive text
CREATE EXTENSION IF NOT EXISTS citext;
-- Heres a basic users table
CREATE TABLE users (
  id integer PRIMARY KEY,
  email citext NOT NULL CHECK (LENGTH(email) < 255),
  name text NOT NULL
);
-- Let's assume we don't want two users with the exact same email
CREATE UNIQUE INDEX users_email_uniq ON users USING BTREE (email);

This looks great, but what should id be on new rows? I don't know -- maybe the application can figure it out? If they store some value in memory? That doesn't seem right.

Maybe we could figure out the next integer from what's in the table itself -- we just need to be able to "count" upwards. We do have all the users tables rows in there, so we should be able to do it:

INSERT INTO users (id, email, name)
SELECT COUNT(*) + 1, '[email protected]', 'new_user' FROM users;

After running that query, we can double check our results:

idemailname
1[email protected]new user
2[email protected]new user

Using COUNT(*) in our query is not the most efficient (or even easiest) solution though, and hopefully it's clear why -- counting a sequence of numbers for primary keys is a feature built in to Postgres!

serial/bigserial is the right tool in our toolbox to maintain a shared, auto-incrementing sequence of numbers. Let's pretend we read the postgres documentation and use those instead.

serial/bigserial

serial is essentially a convenient macro for using Postgres sequences, a database-managed auto-incrementing stream of integer.

Let's hear it from the docs:

The data types smallserial, serial and bigserial are not true types, but merely a notational convenience for creating unique identifier columns (similar to the AUTO_INCREMENT property supported by some other databases).

Using a serial column to create the users table would look like this:

CREATE TABLE users (
  id serial PRIMARY KEY,
  email citext NOT NULL CHECK (LENGTH(email) < 255),
  name text NOT NULL
);

OK, now let's try inserting into it - we shouldn't have to specify id:

INSERT INTO users_serial (email, name) 
VALUES ('[email protected]', 'new user');
idemailname
1[email protected]new user

It works, as you might expect - now the application doesn't have to somehow magically know the right ID to use when inserting.

But what does serial actually do? Using a serial column is operationally similar to the following SQL:

CREATE SEQUENCE tablename_colname_seq AS integer;
CREATE TABLE tablename (
    colname integer NOT NULL DEFAULT nextval('tablename_colname_seq')
);
ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;

Back in application land, the INSERT statement returns, and provides the new id the database assigned our new row. Multiple application instances don't need to coordinate what ID to use -- they just don't, and find out from the database.

We've taken a somewhat meandering path to get here, but this is the standard solution for most reasonable database schemas.

Why not stop at serial?

There are few issues with sequences:

  • When writing automation that simply iterates through id values, note that serial columns can have gaps, even if you never DELETE (e.x. if an INSERT was rolled back — sequences live outside transactions).
  • When used from outside code serial may leak some data or give attackers an edge (e.x., if yoursite.com/users/50 works, how about yoursite.com/users/51?).
  • serial is PostgreSQL specific (i.e. not SQL standards compliant)

Don't be too put off by these reasons -- serial is still the go-to for most use-cases.

Even the last point about serial not being standards compliant is solved in Postgres 10+ by using...

integer/biginteger (again!)

Postgres 10 added support for the IDENTITY column syntax in CREATE TABLE (EDB has a great writeup on the addition).

This means we can happily go back to using integer/biginteger:

CREATE TABLE (
  id integer PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
  email citext NOT NULL CHECK (LENGTH(email) < 255),
  name text NOT NULL
)

So how does this work? Postgres does the same thing under the covers -- it generates a sequence. As this syntax is standards compliant, it's generally recommended practice for DBAs going forward, for the sake of the realm.

So integers are great, but information leakage is still a problem. How do we fix that? Well, make the numbers random, obviously.

Random Numeric IDs

Let's say the application using the DB has some Python code like the following:

from random import randrange 
from models import User 
MAX_RANDOM_USER_ID = 1_000_000_000
def create_user():
    """
    Add new user to the database
    """
    user_id = randrange(1, MAX_RANDOM_USER_ID)
    user = User(id=user_id, email="[email protected]", name="new user")
    db.save(user)

That looks good, but there's a problem -- random is a pseudorandom generator.

Pseudo-random numbers are not what we want for ensuring user IDs cannot be easily guessed/collide. It's possible to get the exact same sequence of values out of a pseudo-random number generator by using the same seed value.

Sometimes you want pseudo-random behavior (let's say for testing or fuzzing), but it's generally not desired for production systems that might run from a duplicated identical application image, since they could have weak pseudo-random seed initialization.

(Secure) Random Numeric IDs

At the very least we need a properly secure random numbers -- we need Python's secrets module:

from secrets import randbelow
# ...
def create_user():
    """
    Add new user to the database (using secure random numbers)
    """
    user_id = randrange(1, MAX_RANDOM_USER_ID)
    user = User(id=user_id, email="[email protected]", name="new user")
    db.save(user)

Now we have a secure random value coming in for our user IDs. But having values like 583247 and 8923916 get generated are cool and all, but there are a few problems:

  • These numbers are random and quite inscrutable
  • The keyspace is fairly small (maybe good for comments on a popular website, but not for IDs!)
  • People can still technically check them all (the guessing space is 1 to MAX_RANDOM_USER_ID!)

We need something better.

(Secure) Random UUIDs

Along comes UUIDs -- you're probably used to seeing them now, values like this UUIDv4:

468e8075-5815-4fe2-80d3-45a31827954b .

They're very random (almost always generated with secure random sources), and while they're even worse for remembering, they're near impossible to practically guess -- the search space is just too large!

More importantly, UUIDs introduce methodology to the madness -- different versions of UUID are derived different ways -- combined with other sources of randomness or known values.

There are a lot of versions of UUID, but let's discuss the ones we're more likely to use/see day to day.

UUIDv1

Version 1 UUIDs have three two components:

  • a 60 bit date-time (at nanosecond precision)
  • a 48 bit MAC address

But where's the randomness? Well v1s assume that you won't generate a ton of values in the same nanosecond (and there are some extra bits reserved for differentiating even when you do), but another source is the MAC address. MAC addresses uniquely (usually) identify network cards -- which is a security risk -- and those bits can be made random.

Here's what a UUIDv1 looks like:

a9957082-0b47-11ed-8a91-3cf011fe32f1

You can generate v1 UUIDs in Postgres natively thanks to the uuid-ossp contrib module. Here's how to generate a v1 UUID with random MAC address:

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION
SELECT uuid_generate_v1mc();
          uuid_generate_v1mc
--------------------------------------
 dd1bbf10-0b47-11ed-80de-db48f6faaf86
 
(1 row)

UUIDv4

Version 4 UUIDs use all the available bits for randomness -- 122 bits worth!.

UUIDv4s look like this:

ce0b897d-03a0-4f54-8c97-41d29a325a23

These don't have a time component, but they don't have in time they make up for in randomness -- it is very unlikely for them to collide, so they make for excellent Global Unique IDentifiers ("GUID"s).

We can generate them in Postgres like this (with uuid-ossp):

SELECT gen_random_uuid();
           uuid_generate_v4
--------------------------------------
 6ca93dde-81d4-4ea0-bfe1-92ecb4d81ee4
 
(1 row)

Since Postgres would catch a collision on a PRIMARY KEY or UNIQUE INDEX column, we're done right? If we want to generate UUIDs all we need to do is choose UUID v1 or V4, and we won't leak any schema structure information to the outside world, right?

This is a workable solution, but as you might expect, it's not that easy.

The Post-UUIDv1/v4 era: A Cambrian explosion of identifiers

UUIDv1 and v4 were a start, but weren't enough for many companies out there. There are a couple shortcomings that plague both v1 and v2:

  • UUIDs are twice the size of bigint/bigserial
  • UUIDv1s contain a time element but they're not lexicographically sortable (this means they SORT terribly, relative to integer or a timestamp column)
  • UUIDv1s are less random than UUIDv4, and can collide/overlap in close enough time intervals, at large scale
  • UUIDv4s index terribly, as they're essentially random values (obviously, they SORT terribly as well)

Many of the world's biggest companies generated UUIDs at speeds that made all of these deficiencies a problem.

A cambrian explosion of UUIDs resulted, as noticed by the IETF -- this resulted in the new UUID formats (v6,v7,v8) being published in 2021.

Here's a quick list (from that IETF document):

That's... A lot of UUIDs. They're all slightly different, but the innovation was summed up by the IETF:

An inspection of these implementations details the following trends that help define this standard: Timestamps MUST be k-sortable. That is, values within or close to the same timestamp are ordered properly by sorting algorithms. Timestamps SHOULD be big-endian with the most-significant bits of the time embedded as-is without reordering. Timestamps SHOULD utilize millisecond precision and Unix Epoch as timestamp source. Although, there is some variation to this among implementations depending on the application requirements. The ID format SHOULD be Lexicographically sortable while in the textual representation. IDs MUST ensure proper embedded sequencing to facilitate sorting when multiple UUIDs are created during a given timestamp. IDs MUST NOT require unique network identifiers as part of achieving uniqueness. Distributed nodes MUST be able to create collision resistant Unique IDs without a consulting a centralized resource. The IETF went on to introduce three 3 new types of UUIDs that have these properties these companies were looking for: UUIDv6, UUIDv7, and UUIDv8.

So what's the difference you ask?

  • UUIDv6 - 62 bits of gregorian time + 48 bits of randomness
  • UUIDv7 - 36 bits of big endian unix timestamp (seconds since epoch + leapseconds w/ optional sub-second precision) + variable randomness up to 62 bits
  • UUIDv8 - variable size timestamp (32/48/60/64 bits) + variable size clock (8/12 bits) + variable randomness (54/62 bits)

It's not quite easy to work out what this all means but let's boil it down:

  • All of these UUIDs sort properly (the "high bits" of time are first, like putting the year before the month -- "2022/07")
  • UUIDv6 requires randomness
  • The data contained in the UUID can be variable (ex. UUIDv8), this means you can bytes that mean something else (ex. an encoding of the compute region you're running in)

Alright, done hearing about UUIDs? Let's get to the fun part.

Benchmarking ID generation with uuid-ossp and pg_idkit

With the history lesson behind us, let’s benchmark these ID generation mechanisms against each other! For UUIDv1 and UUIDv4 we can use uuid-ossp.

Unfortunately, uuid-ossp isn't quite so advanced as to have many of these newer UUIDs we've been discussing, so we’ll pull in pg_idkit here.

pg_idkit is built with Rust, so it gives us access to the following ID generation crates:

For each type of UUID, we can test the following:

  • Generation speed: **How fast can I generate IDs (let's say 1,000,000 of them)?
  • Table & Index size: How much larger do tables and associated indices get?
⚠️ Grain of salt required for these tests - we haven't audited the implementations of all the underlying crates!

Generation speed

Generation speed is pretty easy to test, we can enable \\timing mode on psql and run a simple benchmark with generate_series:

\timing -- enable psql timing mode
-- Generate IDs 1 million times with ksuid
SELECT COUNT(idkit_ksuid_generate()) FROM generate_series(1, 1000000);

Running all of the ID generation mechanisms on a single core of my machine (which happens to be an Oryx Pro), the lowest of 5 runs for each ID looks like this:

To be fair, generation speed shouldn’t be a deal breaker as it’s unlikely to be the bottle neck for most applications. That said, it is nice to have some data on where each ID generation mechanism lands.

ℹ️ Percona and CyberTec have some excellent posts on the topic:https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/ https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/

Table & Index size

We can check the size of our tables & related indices with this query (after running VACUUM):

SELECT
   relname  as table_name,
   pg_size_pretty(pg_total_relation_size(relid)) As "Table Size",
   pg_size_pretty(pg_indexes_size(relid)) as "Index Size",
   pg_size_pretty(pg_relation_size(relid)) as "Total Size"
   FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

Here are the sizes in tabular form:

These numbers are mostly a reflection of the length of the default settings of pg_idkit but probably worth having in front of you anyway.

With this, we probably have enough information to make a decision (and a new library to generate our UUIDs with)!

Which ID should you use?

As usual, it depends -- you didn't think it'd be that easy, did you?

All I can offer are some general rules of thumb that hopefully work for you:

  • integers and serial have obvious benefits for simplicity, storage, and sortability. You might not want to expose them to the world though.
  • If you want the ultimate in collision avoidance UUIDv4 is OK
  • UUIDv1 could have been great, but it doesn't lexicographically sort.
  • The best time-based ID seems to be xid, with good performance and sort friendliness
  • If you want to be a little more standards-oriented, UUID v6/v7

As usual, the best results will come from weighing all the options and finding what's best for your use-case, and doing appropriate testing on your data.

Possible Improvements

We’ve done some good exploration so far, but here are some ideas for interesting use cases for pg_idkit and measuring the impact of ID generation using it.

Usecase: Generating our created_at columns from our IDs

One interesting feature would be using at least partially time-based UUIDs for created_at columns -- we could save space by virtualizing our created_at columns:

-- At table creation
CREATE TABLE users (
  id text PRIMARY KEY DEFAULT idkit_ksuid_generate(),
  name text,
  email text,
);
-- An example query for a specific KSUID that uses created_at
SELECT *, idkit_ksuid_extract_timestamptz(id) 
FROM users
WHERE id = '0F755149A55730412B0AEC0E3B5B089C14B5B58D';

Ideally we could use the GENERATED ALWAYS AS ( ... ) syntax for generated columns while creating the table, but as the time of this post Postgres does not yet support virtual generated columns (only stored ones).

Benchmarking: Measuring index fragmentation

How fragmented do our indices get after use of each of these methods?

Luckily for us Postgres has the pgstattuple extension so we can find out -- thanks to Laurenz Albe on StackOverflow).

Integrating and including these tests in the pg_idkit README would greatly help people looking to make a decision.

Benchmarking: Measuring SORT friendliness

Another great metric to measure might be performance of these indices on certain common SORT patterns. While this is inherently workload-specific, it would be great to pick a workload and see what we get.

In most code bases, simple WHERE queries with SORTs abound, and one of the big benefits of UUIDv6, UUIDv7 and the other alternatives is lexicographic sorting, after all.

Knowing just how good a certain ID generation method is at maintaining locality would be nice to know.

Creating and using a function like idkit_uuidv1_extract_timestamptz and using it in a “functional index” (an index on an expression) could resolve the sort unfriendliness of UUIDv1 as well!

Wrap-up

Identifiers have a long history and are surprisingly unsolved at present. It can be confusing but thanks to the power of Postgres we don't have to over-think it -- tables can be migrated from one ID pattern to another, and we can use DDL statements in transactions.

Hopefully this article helps you head off some bikeshedding with your teammates when you next discuss which IDs are best to use.

More Postgres articles

Published

in bookmarks

Tagged

© 2010 - 2024 Daniel Nitsikopoulos. All rights reserved.

🕸💍  →