When I first started learning about NoSQL databases, I was often unsure about the best approach for structuring related data. I knew that you could either have a parent document that embeds the other data, or you could use two separate collections and reference the data when needed. My first instructor taught me to design NoSQL database schemas with an "embed first" approach for documents with dependent data -- for example, a social media post with its comments embedded inside it. To better understand this architecture, I decided to create an application with a social aspect, allowing users to publish posts and comment on those posts.
Starting off, I immediately went with the embed approach for the post and comments relationship. This made querying for comments quite simple, as you just need to query the post and the comments come along with it. My first applications were all so small that I did not initially worry about scaling.
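For context, here is a minimal sketch of what one of my embedded post documents looked like (the field names are illustrative, not my exact schema):

```javascript
// A post document with its comments embedded directly inside it.
// One query on the posts collection returns everything at once.
const post = {
  _id: "post233",
  author: "cassius",
  content: "My first post!",
  comments: [
    { author: "alice", text: "Nice post!", likes: 3 },
    { author: "bob", text: "Agreed.", likes: 1 }
  ]
};

// Reading the comments requires no second query or join:
console.log(post.comments.length); // 2
```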
However, while taking a break I was scrolling on Twitter...
...and came across some viral posts. I thought about all of the comments associated with each post. While a tweet of mine could go relatively unnoticed and would query quickly in my small project, the latest viral post on Twitter had amassed over one thousand comments, not even counting the quotes -- ( imagine your favorite meme / post here ). The most commented Instagram post to date has amassed over 43 MILLION comments. Obviously, I would not have to worry about that level of scale in my project. But what if I did? My curiosity could not be contained, and I had to figure out which would be the better solution.
If a post of mine with embedded comments were to go insanely viral, querying and loading that post would take longer. Hauling in every comment is unnecessary when people want to see the post itself quickly and first. The next concern is that MongoDB has a document size limit of 16 MB. I can only imagine how quickly that space would get eaten up if a viral post had all of its comments embedded inside it. Furthermore, paginating the comments would be more difficult and less efficient.
If posts and comments were their own collections, referencing each other by ID, I believed it would be a better long-term solution. Firstly, we would no longer worry about hitting the 16 MB maximum document size, or the longer loading times and backend effort that come with approaching it. Secondly, pagination would be easier to accomplish, since the comments live in their own collection, instead of digging into the post and populating all the comments. Thirdly, in the embedded model, updating a comment (adding a like or a sub comment) forces you to update the entire post document, while references let you update just the comment. Also, with deep embedding, the complexity of writing the queries increases. All of these issues could cause significant problems and a less-than-ideal DX as the application scales.
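As a sketch (again with illustrative field names), the two-collection version splits the same data apart and ties it back together with a post_id reference:

```javascript
// posts collection: the post document stays small and roughly fixed-size
const post = { _id: "post233", author: "cassius", content: "My first post!" };

// comments collection: each comment is its own document,
// pointing back at its post via post_id
const comments = [
  { _id: "c1", post_id: "post233", author: "alice", text: "Nice post!", likes: 3 },
  { _id: "c2", post_id: "post233", author: "bob", text: "Agreed.", likes: 1 }
];

// Liking a comment now touches only that one small document,
// not the entire post:
comments[0].likes += 1;
```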
While everything seemed better in the two-collection scenario, there were some issues that I did not have the knowledge to solve at the time. One of those issues was that the query process would become more complex and less efficient. However, only the former turned out to be true. I had to learn about the aggregation pipeline, which lets you chain multiple query stages into one operation. I did not understand it completely at first, but with documentation and practice, I understood it well enough for it to be beneficial for my situation. In my case, the pipelines typically had 3 to 4 stages:
1. Match the document you want to find
2. Look up the associated data you want to include
3. Decide how to display or chop up that data
4. Decide which fields of the targeted data you want to include
Here is an example:
db.posts.aggregate([
  // Step 1: Match the specific post
  { $match: { _id: "post233" } },
  // Step 2: Look up comments associated with the post
  {
    $lookup: {
      from: "comments",        // the comments collection to join
      localField: "_id",       // the _id field in posts
      foreignField: "post_id", // the post_id field in comments
      as: "comments"           // output array name for the joined comments
    }
  },
  // Step 3: Sort comments by likes in descending order and keep the top 10
  {
    $addFields: {
      comments: {
        $slice: [
          { $sortArray: { input: "$comments", sortBy: { likes: -1 } } },
          10 // limit to the top 10 comments
        ]
      }
    }
  },
  // Step 4: Optionally project specific fields of the post (1 = include)
  {
    $project: {
      title: 1,
      content: 1,
      comments: 1
    }
  }
]);
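To convince myself the pipeline did what I expected, I found it helpful to mimic the four stages in plain JavaScript over in-memory arrays (the sample data here is made up):

```javascript
// In-memory stand-ins for the posts and comments collections
const posts = [{ _id: "post233", title: "Hello", content: "My first post!" }];
const comments = [
  { post_id: "post233", text: "first", likes: 5 },
  { post_id: "post233", text: "second", likes: 9 },
  { post_id: "other", text: "elsewhere", likes: 100 }
];

// Step 1 ($match): find the specific post
const post = posts.find(p => p._id === "post233");

// Step 2 ($lookup): join in the comments whose post_id matches
const joined = comments.filter(c => c.post_id === post._id);

// Step 3 ($sortArray + $slice): sort by likes descending, keep the top 10
const top = [...joined].sort((a, b) => b.likes - a.likes).slice(0, 10);

// Step 4 ($project): keep only the fields we care about
const result = { title: post.title, content: post.content, comments: top };

console.log(result.comments.map(c => c.likes)); // [ 9, 5 ]
```

The comment with 100 likes is excluded because it belongs to a different post, which is exactly what the $lookup stage guarantees.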
Another area of caution was losing efficiency with a very large collection of comments. If a post had somewhere between 10 and 100 comments and I wanted to find one comment, retrieving it would be pretty quick. But a dedicated comments collection might hold 100 thousand comments in total, and finding that one comment among them would take much longer. How can I battle this? The answer I found was indexing. When setting up your schema, you create an index on the referenced IDs. It is a one-time setup, and you experience the benefits ongoing. The index allows MongoDB to quickly locate the documents that share a given reference value instead of scanning the whole collection.
await db.collection('comments').createIndex({ post_id: 1 });
This creates an index on post_id in ascending order. Now, let's say our app has over 1 million total comments, and we need to find the 20 comments for a post. Instead of scanning through all 1 million comments, MongoDB jumps directly to the index entries for the 20 comments we need! There is a minor trade-off in storage and write time (since inserting or removing a comment also means updating the index), but it is well worth the increase in efficiency for read operations.
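As a toy illustration of why this helps (not how MongoDB's B-tree indexes actually work internally), an index is essentially a precomputed lookup from the indexed value to the matching documents:

```javascript
const comments = [
  { _id: "c1", post_id: "postA", text: "hi" },
  { _id: "c2", post_id: "postB", text: "yo" },
  { _id: "c3", post_id: "postA", text: "hey" }
];

// Without an index: a full collection scan, checking every document
const scan = comments.filter(c => c.post_id === "postA");

// "Building the index": a one-time pass grouping comments by post_id
const index = new Map();
for (const c of comments) {
  if (!index.has(c.post_id)) index.set(c.post_id, []);
  index.get(c.post_id).push(c);
}

// With the index: jump straight to the matching group
const indexed = index.get("postA");

console.log(indexed.map(c => c._id)); // [ 'c1', 'c3' ]
```

The write-time trade-off is visible here too: every new comment must also be added to its group in the map, but each read skips the scan entirely.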
With all of this in mind, I came to the conclusion that it is better long-term to have a separate collection of posts and comments versus having comments embedded inside of posts. Thank you for reading, and if you have any thoughts, counterpoints, advice, or insight I would love to hear! I also would like to know about your thoughts on when embedding data is absolutely the better option to ensure I am aware of the why and the when in every case.
-- Cassius Reynolds
Email: cassius.reynolds.dev@gmail.com
Github: Github_Profile
LinkedIn: LinkedIn_Profile