Normalizing a Django model
As your Django database schema changes over time, inevitably one day you will need to normalize a model. Typically this consists of pulling out a column into its own table, and using foreign keys to store the relationship.
Applying this change involves writing a migration which does two main things: updates the schema, and migrates the data. In Django it’s quite easy to make the schema change, but it’s also quite easy to write a very inefficient data migration.
These notes focus on the problem of writing an efficient data migration.
Example Scenario §
Say you are storing a map of servers to locations. The database table might look like the following:
id | location_name
------+--------------
1 | FLOOR-3
2 | FLOOR-3
3 | FLOOR-2
...
29873 | FLOOR-2
29874 | FLOOR-1
29875 | FLOOR-3

Initially you did this in Django with a simple Server model which contains a location_name field:
class Server(models.Model):
    location_name = models.CharField(max_length=16)
    # etc...

However, now you would like to normalize the schema by storing the location name in a separate table. The new tables should end up looking like this:
id | location_id
------+--------------
1 | 1
2 | 1
3 | 2
...
29873 | 2
29874 | 3
29875 | 1

id | name
------+--------------
1 | FLOOR-3
2 | FLOOR-2
3 | FLOOR-1
...

First, update your Django models.py to introduce a separate Location model:
class Location(models.Model):
    name = models.CharField(max_length=16, unique=True)

class Server(models.Model):
    location = models.ForeignKey(Location, on_delete=models.CASCADE)
    # etc...

Then run makemigrations to auto-generate a migration that handles the schema change, and follow the data migration documentation to add a custom function that is called during the migration by migrations.RunPython. What do we put in this custom function to migrate the data efficiently?
Naïve approach §
The naïve approach is to write the data migration from the perspective of the Server, and loop through each row in the Server model:
# Not recommended
for server in Server.objects.all():
    location_fk, _ = Location.objects.get_or_create(name=server.location_name)
    server.location = location_fk
    server.save()

This will fire off two or three queries for each row in the Server model:
- The get_or_create() will perform a SELECT query to get the location, if it already exists.
- Additionally, if the location does not exist, the get_or_create() will perform an INSERT query to create the location.
- The server.save() will perform an UPDATE of a single row in the server model.
Given N_s as the number of servers:
naive_approach__num_of_queries__best_case = N_s * 2
naive_approach__num_of_queries__worst_case = N_s * 3

If the model is small, this will not be a problem, but if it is large (e.g. a million rows) this will likely be time consuming. This is a classic case of the 1+N problem (better known as the N+1 problem).
Better approach §
A better approach is to write the data migration from the perspective of the Location: the field we are trying to normalize. By turning the approach on its head we actually run fewer queries!
First, run one query to get a list of all unique location names, then loop through that list, and “bulk” update each server row:
names = Server.objects.all().values_list("location_name", flat=True).distinct()
for name in names:
    location_fk = Location.objects.create(name=name)
    Server.objects.filter(location_name=name).update(location=location_fk)

This will fire off one query, then two more queries for each location name:
- The Server.objects.all().values_list(...).distinct() will perform a SELECT DISTINCT query, which returns a list such as ["FLOOR-1", "FLOOR-2", "FLOOR-3", ...].
- The create() will perform an INSERT. Since the location name is guaranteed to be unique, we do not need to do get_or_create().
- The filter().update() will perform an UPDATE of all rows in the server model with that location name.
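Putting this together, the custom function from earlier might be filled in like this sketch (again using the historical models and the assumed "servers" app label):

def migrate_data(apps, schema_editor):
    Server = apps.get_model("servers", "Server")
    Location = apps.get_model("servers", "Location")

    # One SELECT DISTINCT for the unique location names...
    names = Server.objects.all().values_list("location_name", flat=True).distinct()
    for name in names:
        # ...then one INSERT and one bulk UPDATE per location name.
        location_fk = Location.objects.create(name=name)
        Server.objects.filter(location_name=name).update(location=location_fk)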
Given N_l as the number of locations:
better_approach__num_of_queries = 1 + (N_l * 2)

Comparison §
Every server has a unique location §
If every server has a unique location, then N_s == N_l.
# Assume N_s = N_l = 1,000,000
naive_approach__num_of_queries__best_case = N_s * 2 = 2,000,000
naive_approach__num_of_queries__worst_case = N_s * 3 = 3,000,000
better_approach__num_of_queries = 1 + (N_l * 2) = 2,000,001

So if every server has a unique location, and every location already exists, then the naïve approach wins by a single query. Though in this case I would question why normalization is being done at all: if each location is unique, then there is no need to normalize!
There are 100 servers per location §
In a more typical case, where there are say 100 servers per location, the numbers would look quite different:
# Assume N_s = 1,000,000, and N_l = 10,000
naive_approach__num_of_queries__best_case = N_s * 2 = 2,000,000
naive_approach__num_of_queries__worst_case = N_s * 3 = 3,000,000
better_approach__num_of_queries = 1 + (N_l * 2) = 20,001

So in this more typical case, the better approach performs only ~1% of the number of queries of the naïve approach.
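If you want to verify these counts against your own data, one option is Django's CaptureQueriesContext, which records every query executed inside the block. A minimal sketch, run from a shell against a test database:

from django.db import connection
from django.test.utils import CaptureQueriesContext

with CaptureQueriesContext(connection) as ctx:
    names = Server.objects.all().values_list("location_name", flat=True).distinct()
    for name in names:
        location_fk = Location.objects.create(name=name)
        Server.objects.filter(location_name=name).update(location=location_fk)

print(len(ctx.captured_queries))  # expect 1 + (N_l * 2)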
Key points §
- When writing Django migrations, take time to consider alternative perspectives that may result in fewer queries.
- Calling .save() in a loop is a code smell indicating that a different perspective would probably be more efficient.
- Fewer queries being faster is only a rule of thumb. If speed matters, make sure to profile your migrations.
- If the amount of data being migrated is sufficiently small, perhaps the naïve approach is good enough.
- If the amount of data being migrated is sufficiently big, perhaps the call to .all().values_list(...).distinct() will pull too much data into Python’s memory. Consider chunking the call, as in the sketch below.
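As an illustration of that last point, here is one way to chunk the call using QuerySet.iterator(). The chunk size is an assumption to tune for your data, and how much memory this actually saves depends on your database backend's support for server-side cursors:

names = (
    Server.objects.all()
    .values_list("location_name", flat=True)
    .distinct()
    .iterator(chunk_size=2000)  # stream names in batches instead of one big list
)
for name in names:
    location_fk = Location.objects.create(name=name)
    Server.objects.filter(location_name=name).update(location=location_fk)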