Normalization and Denormalization in MongoDB

This article is excerpted from Chapter 8 of the "MongoDB: The Definitive Guide" and thoroughly answers the following two questions:

There are many ways to represent data, and one of the most important questions is to what extent data should be normalized. Normalization is the process of distributing data across multiple collections, where different collections can reference each other's data. While many documents can reference a piece of data, that piece of data is stored in only one collection. Therefore, if you need to modify that piece of data, you only need to change the document that stores that data. However, MongoDB does not provide a join tool, so performing join queries across different collections requires multiple queries.

Denormalization, on the other hand, is the opposite of normalization: it embeds all the data needed by each document within the document itself. Each document has its own copy of the data, rather than all documents sharing a single copy of the data. This means that if the information changes, all related documents need to be updated, but when executing queries, only one query is needed to retrieve all the data.

Deciding when to use normalization and when to use denormalization can be challenging. Normalization can improve write speeds, while denormalization can enhance read speeds. It is essential to weigh these factors carefully based on the specific needs of your application.

Examples of Data Representation

Suppose you want to save information about students and courses. One way to represent this is to use a students collection (where each student is a document) and a classes collection (where each course is a document). Then, use a third collection studentsClasses to store the relationship between students and courses.

> db.studentsClasses.findOne({"studentsId": id});
{
  "_id": ObjectId("..."),
  "studentId": ObjectId("...");
  "classes": [
    ObjectId("..."),
    ObjectId("..."),
    ObjectId("..."),
    ObjectId("...")
  ]
}

If you are familiar with relational databases, you might have previously created this type of table join, even though each document might only contain one student and one course (rather than a list of course "_id"). Placing courses in an array is somewhat in line with MongoDB's style, but in practice, data is often not stored this way because it requires many queries to retrieve the actual information.

Suppose you want to find the courses a student has selected. You would first look up the students collection to find the student information, then query studentClasses to find the course "_id", and finally query the classes collection to get the desired information. To find the course information, you would need to make three requests to the server. You likely do not want to organize your data in MongoDB this way unless student and course information changes frequently and there are no strict requirements for read speed.

If you embed the course references within the student document, you can save one query:

{
  "_id": ObjectId("..."),
  "name": "John Doe",
  "classes": [
    ObjectId("..."),
    ObjectId("..."),
    ObjectId("..."),
    ObjectId("...")
  ]
}

The "classes" field is an array that contains the course "_id" that John Doe needs to attend. When you need to find out information about these courses, you can use these "_id" to query the classes collection. This process only requires two queries. If the data does not need to be accessed frequently and does not change often ("frequently" is a higher requirement than "occasionally"), then this data organization is very good.

If you need to further optimize read speed, you can completely denormalize the data by saving course information as embedded documents within the student's "classes" field, allowing you to retrieve the student's course information with just one query:

{
  "_id": ObjectId("..."),
  "name": "John Doe"
  "classes": [
    {
      "class": "Trigonometry",
      "credites": 3,
      "room": "204"
    },
    {
      "class": "Physics",
      "credites": 3,
      "room": "159"
    },
    {
      "class": "Women in Literature",
      "credites": 3,
      "room": "14b"
    },
    {
      "class": "AP European History",
      "credites": 4,
      "room": "321"
    }
  ]
}

The advantage of this approach is that you can get the student's course information with just one query, but the downside is that it will consume more storage space and make data synchronization more challenging. For example, if the credits for Physics change from 3 to 4, then every student document that has taken the Physics course needs to be updated, not just the "Physics" document.

Finally, you can also mix embedded data and referenced data: create an array of sub-documents to store common information, and when you need to query for more detailed information, you can find the actual documents through references:

{
  "_id": ObjectId("..."),
  "name": "John Doe",
  "classes": [
    {
      "_id": ObjectId("..."),
      "class": "Trigonometry"
    },
    {
      "_id": ObjectId("..."),
      "class": "Physics"
    }, {
      "_id": ObjectId("..."),
      "class": "Women in Literature"
    }, {
      "_id": ObjectId("..."),
      "class": "AP European History"
    }
  ]
}

This method is also a good choice because the embedded information can be modified as needs change. If you want to include more (or less) information on a page, you can place more (or less) information in the embedded documents.

Another important issue to consider is whether information is updated more frequently or read more frequently. If this data will be updated periodically, normalization is a better choice. If the data does not change frequently, it is not worth sacrificing read/write speed to optimize update efficiency.

For example, a textbook example of normalization might involve storing users and user addresses in different collections. However, people rarely change their addresses, so it is not advisable to sacrifice the efficiency of every query for such a rare occurrence (someone changing their address). In this case, the address should be embedded in the user document.

If you decide to use embedded documents, when updating documents, you need to set up a cron job to ensure that every update you make successfully updates all documents. For example, if we try to propagate updates across multiple documents and the server crashes before all documents are updated, you need to be able to detect this issue and redo the unfinished updates.

In general, the more frequently data is generated, the less it should be embedded in other documents. If the number of embedded fields or embedded fields grows indefinitely, then that content should be stored in a separate collection and accessed via references, rather than embedded in other documents. Information such as comment lists or activity lists should be stored in separate collections and not embedded in other documents.

Finally, if certain fields are part of the document data, then those fields need to be embedded in the document. If a certain field is often excluded when querying documents, then that field should be placed in another collection rather than embedded in the current document.

Better for Embedding	Better for Referencing
Sub-documents are small	Sub-documents are large
Data does not change frequently	Data changes frequently
Eventual data consistency is acceptable	Intermediate data must be consistent
Document data increases slightly	Document data increases significantly
Data usually requires a second query to obtain	Data is usually not included in the results
Fast reads	Fast writes

Suppose we have a user collection. Here are some fields that might be needed, along with whether they should be embedded in the user document.

Account Preferences

User preferences are specific to each user and are likely to be queried together with other user information in the user document. Therefore, user preferences should be embedded in the user document.

Recent Activity

This field depends on the frequency of growth and change of recent activities. If it is a fixed-length field (like the last 10 activities), then this field should be embedded in the user document.

Friends

Friend information should generally not be embedded in the user document, at least not completely. The next section will cover social network applications.

All User-Generated Content

Should not be embedded in the user document.

Cardinality

The number of references to other collections contained in a collection is called cardinality. Common relationships include one-to-one, one-to-many, and many-to-many. Suppose there is a blog application. Each blog post has a title, which is a one-to-one relationship. Each author can have multiple posts, which is a one-to-many relationship. Each post can have multiple tags, and each tag can be used in multiple posts, making it a many-to-many relationship.

In MongoDB, many can be split into two subcategories: many and few. For example, the relationship between authors and posts may be one-to-few: each author has only published a few articles. The relationship between blog posts and tags may be many-to-few: the number of posts is likely greater than the number of tags. The relationship between blog posts and comments is one-to-many: each post can have many comments.

Once you determine the relationships of few and many, it becomes easier to weigh between embedded data and referenced data. Generally speaking, "few" relationships are better suited for embedding, while "many" relationships are better suited for referencing.

Friends, Followers, and Other Complications

Keep close friends, and stay away from enemies.

Many social applications need to link people, content, followers, friends, and other entities. Weighing whether to use embedded or referenced forms for these highly interconnected data can be challenging. This section will discuss considerations related to social graph data. Typically, following, friends, or favorites can be simplified into a publish-subscribe system: a user can subscribe to notifications related to another user. Thus, there are two basic operations that need to be efficient: how to save subscribers and how to notify all subscribers of an event.

There are three common ways to implement subscriptions. The first way is to embed content producers in subscriber documents:

{
    "_id": ObjectId("..."),
    "username": "batman",
    "email": "[email protected]",
    "following": [
        ObjectId("..."),
        ObjectId("...")
    ]
}

Now, for a given user document, you can query all activities of interest to that user using a query like db.activities.find({"user": {"$in": user["following"]}}). However, for a newly published activity, if you want to find all users interested in that activity, you would have to query the "following" field of all users.

Another way is to embed subscribers in producer documents:

{
    "_id": ObjectId("..."),
    "username": "joker",
    "email": "[email protected]",
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."
    ]
}

When this producer publishes a new piece of information, we can immediately know which users need to be notified. The downside is that if you need to find a list of users that a user is following, you must query the entire user collection. The pros and cons of these two methods are exactly opposite.

Additionally, both methods have another problem: they can make user documents grow larger and change more frequently. Typically, the "following" and "followers" fields do not even need to be returned: how often do you query the follower list? If users frequently follow or unfollow certain people, it can lead to a lot of fragmentation. Therefore, the final solution is to further normalize the data by storing subscription information in a separate collection to avoid these drawbacks. This level of normalization may seem excessive, but it is very useful for fields that change frequently and do not need to be returned with other fields in the document. Normalizing the "followers" field makes sense.

Use one collection to store the relationship between publishers and subscribers, where the document structure might look like this:

{
    "_id": ObjectId("..."),   // The "_id" of the followed user
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."
    ]
}

This can make user documents more concise, but it requires additional queries to obtain the followers list. Since the size of the "followers" array often changes, you can enable "usePowerOf2Sizes" on this collection to ensure that the users collection remains as small as possible. If the followers collection is stored in another database, it can also be compressed without significantly affecting the users collection.

Addressing the Wil Wheaton Effect

Regardless of the strategy used, embedded fields can only be effective when the number of subdocuments or references is not particularly large. For more famous users, this can lead to document overflow when saving the followers list. One solution for this situation is to use "contiguous" documents when necessary. For example:

> db.users.find({"username": "wil"})
{
    "_id": ObjectId("..."),
    "username": "wil",
    "email": "[email protected]",
    "tbc": [
        ObjectId("123"),    // just for example
        ObjectId("456")     // same as above
    ],
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."),
        ...
    ]
}
{
    "_id": ObjectId("123"),
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."),
        ...
    ]
}
{
    "_id": ObjectId("456"),
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."),
        ...
    ]
}

In this case, you need to add logic in the application to retrieve data from the "tbc" (to be continued) array.

A Few Words

No silver bullet.

@2014-06-25 15:42:13