Normalizing a Django model
As your Django database schema changes over time, inevitably one day you will need to normalize a model. Typically this consists of pulling out a column into its own table, and using foreign keys to store the relationship.
Applying this change involves writing a migration which does two main things: updates the schema, and migrates the data. In Django it’s quite easy to make the schema change, but it’s also quite easy to write a very inefficient data migration.
These notes focus on the problem of writing an efficient data migration.
Example Scenario §
Say you are storing a map of servers to locations. The database table might look like the following:
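(The server and location names here are invented for illustration.)

| id | name  | location_name |
|----|-------|---------------|
| 1  | web-1 | FLOOR-1       |
| 2  | web-2 | FLOOR-1       |
| 3  | db-1  | FLOOR-2       |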
Initially you did this in Django with a simple `Server` model which contains a `location_name` field:
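(A minimal sketch; the `name` field is an assumption, and only `location_name` matters for this example.)

```python
from django.db import models


class Server(models.Model):
    name = models.CharField(max_length=100)
    location_name = models.CharField(max_length=100)
```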
However, now you would like to normalize the schema by storing the location name in a separate table. The new tables should end up looking like this:
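(Same invented data as above, now split across two tables.)

`location`:

| id | name    |
|----|---------|
| 1  | FLOOR-1 |
| 2  | FLOOR-2 |

`server`:

| id | name  | location_id |
|----|-------|-------------|
| 1  | web-1 | 1           |
| 2  | web-2 | 1           |
| 3  | db-1  | 2           |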
First, update your Django `models.py` to introduce a separate `Location` model:
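(Again a sketch; this assumes location names are unique and that the old `location_name` field is removed once the data migration has run.)

```python
from django.db import models


class Location(models.Model):
    name = models.CharField(max_length=100, unique=True)


class Server(models.Model):
    name = models.CharField(max_length=100)
    location = models.ForeignKey(Location, on_delete=models.CASCADE)
```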
Then run `makemigrations` to auto-generate a migration that handles the schema change, and follow the data migration documentation to add a custom function that is called during the migration by `migrations.RunPython`.
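The surrounding migration might end up looking roughly like this (the `servers` app label and previous migration name are hypothetical):

```python
from django.db import migrations


def migrate_location_data(apps, schema_editor):
    ...  # the data migration, discussed below


class Migration(migrations.Migration):
    dependencies = [
        ("servers", "0002_location"),  # hypothetical previous migration
    ]

    operations = [
        # Reverse is a no-op, so the migration can still be rolled back.
        migrations.RunPython(migrate_location_data, migrations.RunPython.noop),
    ]
```

What do we put in this custom function to migrate the data efficiently?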
Naïve approach §
The naïve approach is to write the data migration from the perspective of the `Server`, and loop through each row in the `Server` model:
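(A sketch using the historical models via `apps.get_model()`; the `servers` app label is an assumption.)

```python
def migrate_location_data(apps, schema_editor):
    Location = apps.get_model("servers", "Location")
    Server = apps.get_model("servers", "Server")

    # One loop iteration, and two or three queries, per server row.
    for server in Server.objects.all():
        location, _ = Location.objects.get_or_create(name=server.location_name)
        server.location = location
        server.save()
```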
This will fire off two or three queries for each row in the `Server` model:
- The `get_or_create()` will perform a `SELECT` query to get the location, if it already exists.
- Additionally, if the location does not exist, the `get_or_create()` will perform an `INSERT` query to create the location.
- The `server.save()` will perform an `UPDATE` of a single row in the server model.
Given `N_s` as the number of servers, this adds up to between `2 × N_s` and `3 × N_s` queries in total.
If the model is small, this will not be a problem, but if it is large (e.g. a million rows) the migration will likely be time-consuming. This is a classic case of the 1+N problem (better known as the N+1 problem).
Better approach §
A better approach is to write the data migration from the perspective of the `Location`: the field we are trying to normalize. By turning the approach on its head we actually run fewer queries!
First, run one query to get a list of all unique location names, then loop through that list, and “bulk” update each server row:
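(Same assumptions as the naïve sketch above.)

```python
def migrate_location_data(apps, schema_editor):
    Location = apps.get_model("servers", "Location")
    Server = apps.get_model("servers", "Server")

    # One query to fetch the unique location names.
    location_names = (
        Server.objects.all()
        .values_list("location_name", flat=True)
        .distinct()
    )
    # Two queries per location name: one INSERT, one bulk UPDATE.
    for name in location_names:
        location = Location.objects.create(name=name)
        Server.objects.filter(location_name=name).update(location=location)
```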
This will fire off one query, then two more queries for each location name:
- The `Server.objects.all().values_list(...).distinct()` will perform a `SELECT DISTINCT` query, which returns a list such as `["FLOOR-1", "FLOOR-2", "FLOOR-3", ...]`.
- The `create()` will perform an `INSERT`. Since the location name is guaranteed to be unique, we do not need to do `get_or_create()`.
- The `filter().update()` will perform an `UPDATE` of all rows in the server model with that location name.
Given `N_l` as the number of locations, this adds up to `1 + 2 × N_l` queries in total.
Comparison §
Every server has a unique location §
If every server has a unique location, then `N_s == N_l`. Plugging that in: when every location already exists, the naïve approach performs `2 × N_s` queries (one `SELECT` and one `UPDATE` per server), while the better approach performs `1 + 2 × N_s` queries.

So if every server has a unique location, and every location already exists, then the naïve approach wins by a single query. Though in this case I would question why normalization is even being done: if each location is unique, then there is no need to normalize!
There are 100 servers per location §
In a more typical case, where there are say 100 servers per location, the numbers would look quite different:
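For instance, with one million servers and ten thousand locations (an illustrative case of 100 servers per location), the query counts from above work out to:

| Approach | Queries |
|----------|---------|
| Naïve    | `2 × 1,000,000` to `3 × 1,000,000` = 2,000,000–3,000,000 |
| Better   | `1 + 2 × 10,000` = 20,001 |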
So in this more typical case, the better approach performs only ~1% as many queries as the naïve approach.
Key points §
- When writing Django migrations, take time to consider alternative perspectives that may result in fewer queries.
- Calling `.save()` in a loop is a code smell indicating that a different perspective would probably be more efficient.
- Fewer queries being faster is only a rule of thumb. If speed matters, make sure to profile your migrations.
- If the amount of data being migrated is sufficiently small, perhaps the naïve approach is good enough.
- If the amount of data being migrated is sufficiently big, perhaps the call to `.all().values_list(...).distinct()` will pull too much data into Python’s memory. Consider chunking the call, as in the sketch below.
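One way to chunk it is to stream the distinct names with `QuerySet.iterator()` instead of loading them all at once; a minimal sketch, under the same assumptions as the earlier examples:

```python
def migrate_location_data(apps, schema_editor):
    Location = apps.get_model("servers", "Location")
    Server = apps.get_model("servers", "Server")

    # iterator(chunk_size=...) fetches rows from the database in
    # batches instead of materializing the full list in memory.
    names = (
        Server.objects.all()
        .values_list("location_name", flat=True)
        .distinct()
        .iterator(chunk_size=10_000)
    )
    for name in names:
        location = Location.objects.create(name=name)
        Server.objects.filter(location_name=name).update(location=location)
```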