Normalizing a Django model
As your Django database schema changes over time, inevitably one day you will need to normalize a model. Typically this consists of pulling out a column into its own table, and using foreign keys to store the relationship.
Applying this change involves writing a migration which does two main things: updates the schema, and migrates the data. In Django it’s quite easy to make the schema change, but it’s also quite easy to write a very inefficient data migration.
These notes focus on the problem of writing an efficient data migration.
Example Scenario §
Say you are storing a map of servers to locations. The database table might look like the following:
id | location_name
------+--------------
1 | FLOOR-3
2 | FLOOR-3
3 | FLOOR-2
...
29873 | FLOOR-2
29874 | FLOOR-1
29875 | FLOOR-3

Initially you did this in Django with a simple Server model which contains a location_name field:
class Server(models.Model):
    location_name = models.CharField(max_length=16)
    # etc...

However, now you would like to normalize the schema by storing the location name in a separate table. The new tables should end up looking like this:
id | location_id
------+--------------
1 | 1
2 | 1
3 | 2
...
29873 | 2
29874 | 3
29875 | 1

id | name
------+--------------
1 | FLOOR-3
2 | FLOOR-2
3 | FLOOR-1
...

First, update your Django models.py to introduce a separate Location model:
class Location(models.Model):
    name = models.CharField(max_length=16, unique=True)

class Server(models.Model):
    location = models.ForeignKey(Location, on_delete=models.CASCADE)
    # etc...

Then run makemigrations to auto-generate a migration that handles the schema change, and follow the data migration documentation to add a custom function that is called during the migration by migrations.RunPython. What do we put in this custom function to migrate the data efficiently?
Naïve approach §
The naïve approach is to write the data migration from the perspective of the Server, and loop through each row in the Server model:
# Not recommended
for server in Server.objects.all():
    location_fk, _ = Location.objects.get_or_create(name=server.location_name)
    server.location = location_fk
    server.save()

This will fire off two or three queries for each row in the Server model:
- The get_or_create() will perform a SELECT query to get the location, if it already exists.
- Additionally, if the location does not exist, the get_or_create() will perform an INSERT query to create the location.
- The server.save() will perform an UPDATE of a single row in the server model.
Given N_s as the number of servers:
naive_approach__num_of_queries__best_case = N_s * 2
naive_approach__num_of_queries__worst_case = N_s * 3

If the model is small, this will not be a problem, but if it is large (e.g. a million rows) this will likely be time consuming. This is a classic case of the 1+N problem (better known as the N+1 problem).
Better approach §
A better approach is to write the data migration from the perspective of the Location: the field we are trying to normalize. By turning the approach on its head we actually run fewer queries!
First, run one query to get a list of all unique location names, then loop through that list, and “bulk” update each server row:
names = Server.objects.all().values_list("location_name", flat=True).distinct()
for name in names:
    location_fk = Location.objects.create(name=name)
    Server.objects.filter(location_name=name).update(location=location_fk)

This will fire off one query, then two more queries for each location name:
- The Server.objects.all().values_list(...).distinct() will perform a SELECT DISTINCT query, which returns a list such as ["FLOOR-1", "FLOOR-2", "FLOOR-3", ...].
- The create() will perform an INSERT. Since the location name is guaranteed to be unique, we do not need to do get_or_create().
- The filter().update() will perform an UPDATE of all rows in the server model with that location name.
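Putting this together, the custom function from earlier might be filled in like this sketch (again using the historical models and the assumed "servers" app label):

def migrate_data(apps, schema_editor):
    Server = apps.get_model("servers", "Server")
    Location = apps.get_model("servers", "Location")

    # One SELECT DISTINCT for the unique location names...
    names = Server.objects.all().values_list("location_name", flat=True).distinct()
    for name in names:
        # ...then one INSERT and one bulk UPDATE per location name.
        location_fk = Location.objects.create(name=name)
        Server.objects.filter(location_name=name).update(location=location_fk)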
Given N_l as the number of locations:
better_approach__num_of_queries = 1 + (N_l * 2)

Comparison §
Every server has a unique location §
If every server has a unique location, then N_s == N_l.
# Assume N_s = N_l = 1,000,000
naive_approach__num_of_queries__best_case = N_s * 2 = 2,000,000
naive_approach__num_of_queries__worst_case = N_s * 3 = 3,000,000
better_approach__num_of_queries = 1 + (N_l * 2) = 2,000,001

So if every server has a unique location, and every location already exists, then the naïve approach wins by a single query. Though in this case I would question why normalization is being done at all: if each location is unique, then there is no need to normalize!
There are 100 servers per location §
In a more typical case, where there are say 100 servers per location, the numbers would look quite different:
# Assume N_s = 1,000,000, and N_l = 10,000
naive_approach__num_of_queries__best_case = N_s * 2 = 2,000,000
naive_approach__num_of_queries__worst_case = N_s * 3 = 3,000,000
better_approach__num_of_queries = 1 + (N_l * 2) = 20,001

So in this more typical case, the better approach performs only ~1% of the number of queries of the naïve approach.
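If you want to verify these counts against your own data, one option is Django's CaptureQueriesContext, which records every query executed inside the block. A minimal sketch, run from a shell against a test database:

from django.db import connection
from django.test.utils import CaptureQueriesContext

with CaptureQueriesContext(connection) as ctx:
    names = Server.objects.all().values_list("location_name", flat=True).distinct()
    for name in names:
        location_fk = Location.objects.create(name=name)
        Server.objects.filter(location_name=name).update(location=location_fk)

print(len(ctx.captured_queries))  # expect 1 + (N_l * 2)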
Key points §
- When writing Django migrations, take time to consider alternative perspectives that may result in fewer queries.
- Calling .save() in a loop is a code smell indicating that a different perspective would probably be more efficient.
- Fewer queries being faster is only a rule of thumb. If speed matters, make sure to profile your migrations.
- If the amount of data being migrated is sufficiently small, perhaps the naïve approach is good enough.
- If the amount of data being migrated is sufficiently big, perhaps the call to .all().values_list(...).distinct() will pull too much data into Python’s memory. Consider chunking the call, as in the sketch below.
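As an illustration of that last point, here is one way to chunk the call using QuerySet.iterator(). The chunk size is an assumption to tune for your data, and how much memory this actually saves depends on your database backend's support for server-side cursors:

names = (
    Server.objects.all()
    .values_list("location_name", flat=True)
    .distinct()
    .iterator(chunk_size=2000)  # stream names in batches instead of one big list
)
for name in names:
    location_fk = Location.objects.create(name=name)
    Server.objects.filter(location_name=name).update(location=location_fk)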