Normalizing a Django model

As your Django database schema changes over time, inevitably one day you will need to normalize a model. Typically this consists of pulling a column out into its own table and using a foreign key to store the relationship.

Applying this change involves writing a migration which does two main things: updates the schema, and migrates the data. In Django it’s quite easy to make the schema change, but it’s also quite easy to write a very inefficient data migration.

These notes focus on the problem of writing an efficient data migration.

Example scenario §

Say you are storing a map of servers to locations. The database table might look like the following:

server_table
id    | location_name
------+--------------
1     | FLOOR-3
2     | FLOOR-3
3     | FLOOR-2
...
29873 | FLOOR-2
29874 | FLOOR-1
29875 | FLOOR-3

Initially you did this in Django with a simple Server model which contains a location_name field:

models.py
class Server(models.Model):
    location_name = models.CharField(max_length=16)
    # etc...

However, now you would like to normalize the schema by storing the location name in a separate table. The new tables should end up looking like this:

server_table
id    | location_id
------+--------------
1     | 1
2     | 1
3     | 2
...
29873 | 2
29874 | 3
29875 | 1

location_table
id    | name
------+--------------
1     | FLOOR-3
2     | FLOOR-2
3     | FLOOR-1
...

First, update your Django models.py to introduce a separate Location model:

models.py
class Location(models.Model):
    name = models.CharField(max_length=16, unique=True)

class Server(models.Model):
    location = models.ForeignKey(Location, on_delete=models.CASCADE)
    # etc...

Then run makemigrations to auto-generate a migration that handles the schema change, and follow the data migration documentation to add a custom function that is called during the migration by migrations.RunPython. (Note that the old location_name column must still exist while the data migration runs, so in practice the schema change is split: add the nullable location foreign key first, migrate the data, then drop location_name.) What do we put in this custom function to migrate the data efficiently?
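
For reference, here is a minimal sketch of what such a migration file might look like. The app label "servers" and the dependency name are assumptions; yours will differ.

migration.py
from django.db import migrations

def migrate_location_data(apps, schema_editor):
    # Use historical models, per the data migration documentation,
    # rather than importing from models.py directly.
    Server = apps.get_model("servers", "Server")
    Location = apps.get_model("servers", "Location")
    ...  # the data migration itself goes here (see below)

class Migration(migrations.Migration):
    dependencies = [
        ("servers", "0002_location"),  # assumed name of the schema migration
    ]

    operations = [
        migrations.RunPython(migrate_location_data, migrations.RunPython.noop),
    ]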

Naïve approach §

The naïve approach is to write the data migration from the perspective of the Server, and loop through each row in the Server model:

migration.py
# Not recommended
for server in Server.objects.all():
    location_fk, _ = Location.objects.get_or_create(name=server.location_name)
    server.location = location_fk
    server.save()

This will fire off two or three queries for each row in the Server model:

  • The get_or_create() will perform a SELECT query to get the location, if it already exists.
  • Additionally, if the location does not exist, the get_or_create() will perform an INSERT query to create the location.
  • The server.save() will perform an UPDATE of a single row in the server model.

Given N_s as the number of servers:

naive_approach__num_of_queries__best_case  = N_s * 2
naive_approach__num_of_queries__worst_case = N_s * 3

If the model is small, this will not be a problem, but if it is large (e.g. a million rows) this will likely be time-consuming. This is a classic case of the 1+N problem (better known as the N+1 problem).

Better approach §

A better approach is to write the data migration from the perspective of the Location: the data we are trying to normalize. By turning the approach on its head, we actually run fewer queries!

First, run one query to get a list of all unique location names, then loop through that list, and “bulk” update each server row:

migration.py
names = Server.objects.all().values_list("location_name", flat=True).distinct()
for name in names:
    location_fk = Location.objects.create(name=name)
    Server.objects.filter(location_name=name).update(location=location_fk)

This will fire off one query, then two more queries for each location name:

  • The Server.objects.all().values_list(...).distinct() will perform a SELECT DISTINCT query, which returns a list such as ["FLOOR-1", "FLOOR-2", "FLOOR-3", ...].
  • The create() will perform an INSERT. Since the location name is guaranteed to be unique, we do not need to do get_or_create().
  • The filter().update() will perform an UPDATE of all rows in the server model with that location name.

Given N_l as the number of locations:

better_approach__num_of_queries = 1 + (N_l * 2)

Comparison §

Every server has a unique location §

If every server has a unique location, then N_s == N_l.

# Assume N_s = N_l = 1,000,000
naive_approach__num_of_queries__best_case  = N_s * 2       = 2,000,000
naive_approach__num_of_queries__worst_case = N_s * 3       = 3,000,000
better_approach__num_of_queries            = 1 + (N_l * 2) = 2,000,001

So if every server has a unique location, and every location already exists, then the naïve approach wins by a single query. Though in this case I would question why normalization is even being done: if each location is unique, then there is no need to normalize!

There are 100 servers per location §

In a more typical case, where there are, say, 100 servers per location, the numbers would look quite different:

# Assume N_s = 1,000,000, and N_l = 10,000
naive_approach__num_of_queries__best_case  = N_s * 2       = 2,000,000
naive_approach__num_of_queries__worst_case = N_s * 3       = 3,000,000
better_approach__num_of_queries            = 1 + (N_l * 2) = 20,001

So in this more typical case, the better approach performs only ~1% as many queries as the naïve approach.

Key points §

  • When writing Django migrations, take time to consider alternative perspectives that may result in fewer queries.
  • Calling .save() in a loop is a code smell indicating that a different perspective would probably be more efficient.
  • Fewer queries being faster is only a rule of thumb. If speed matters, make sure to profile your migrations (see the query-counting sketch below).
  • If the amount of data being migrated is sufficiently small, perhaps the naïve approach is good enough.
  • If the amount of data being migrated is sufficiently large, the call to .all().values_list(...).distinct() may pull too much data into Python’s memory. Consider chunking the call, as sketched below.
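
For example, here is a minimal sketch of one way to chunk the call, assuming Django ≥ 2.0 and a database with server-side cursor support (e.g. PostgreSQL); the chunk_size value is arbitrary:

migration.py
# .iterator(chunk_size=...) streams the distinct names from a database
# cursor instead of materializing the full list in Python's memory.
names = (
    Server.objects.all()
    .values_list("location_name", flat=True)
    .distinct()
    .iterator(chunk_size=2000)
)
for name in names:
    location_fk = Location.objects.create(name=name)
    Server.objects.filter(location_name=name).update(location=location_fk)

And here is a sketch of counting the queries a data migration fires, using Django’s test utilities; run it against a disposable copy of the database, not production:

profile.py
from django.db import connection
from django.test.utils import CaptureQueriesContext

# Wrap the data migration body to count the queries it fires.
with CaptureQueriesContext(connection) as captured:
    names = Server.objects.all().values_list("location_name", flat=True).distinct()
    for name in names:
        location_fk = Location.objects.create(name=name)
        Server.objects.filter(location_name=name).update(location=location_fk)
print(f"Fired {len(captured)} queries")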