MongoDB: Aggregation

René Kulik on 18.04.2021

In this blogpost I want to talk about MongoDB’s aggregation framework and how I use it in my projects. As it is very powerful and contains a bunch of operations, I won’t go through its whole functionality and only focus on the features I am using for my use cases. For a deeper dive, I highly recommend the official documentation.

I will explain the functionality in the context of a fictional example application. Imagine you are working on a SaaS for freelancers to help them manage their clients and related projects. This application might have endpoints for querying those resources. With the help of the aggregation framework we will try to process the requested data records in a way that you could return the query results from the endpoints without the need of further formatting.

Database structure

In a relational database the entity-relationship model for such a service might look like this:

ERM

In MongoDB you could achieve the same structure by using multiple collections. But to keep things simple and to reduce reads to the database I usually try to aim for denormalization and only have one collection for storing everything. For our SaaS we could have a users collection containing the following document:

{
  "_id": "026c6774-34aa-4224-9106-c73f7f3ad51c",
  "name": "John Doe",
  "clients": [
    {
      "_id": "4c975554-a7a2-4577-8d8e-256147d06e33",
      "name": "Client 1",
      "projects": [
        {
          "_id": "ffe2d006-8a46-4d9e-9e13-0ae469c9bdb9",
          "name": "Project A"
        },
        {
          "_id": "abt4e3076-6u15-3k7e-5z14-5he865l3evb1",
          "name": "Project B"
        }
      ]
    },
    {
      "_id": "fe42cc47-9698-4c58-9903-7996581af5db",
      "name": "Client 2",
      "projects": [
        {
          "_id": "8de6d107-7f31-5r5e-8g31-4yu742k5ged2",
          "name": "Project C"
        },
        {
          "_id": "6tm6n321-5g56-3t1e-2b51-8xz324g5gac1",
          "name": "Project D"
        }
      ]
    }
  ]
}

The advantage of this approach is that you get all information with one query. The downside is that the documents could get quite big and it becomes more difficult to keep them in sync.

Aggregations

Based on this data structure we will implement aggregations to query client- and project-resources and compute the results.

Clients

The clients endpoint should return a list of clients, containing related projects. Therefore we are using the following aggregation:

db.users.aggregate([
  // First stage
  {
    $match: {
      _id: "026c6774-34aa-4224-9106-c73f7f3ad51c",
    },
  },
  // Second stage
  {
    $unwind: {
      path: "$clients",
    },
  },
  // Third stage
  {
    $group: {
      _id: "$clients",
    },
  },
]);

Let me explain what happens here in more detail.

First stage

Select a document by user-ID using $match. This stage passes the following document to the next stage:

{
  "_id": "026c6774-34aa-4224-9106-c73f7f3ad51c",
  "name": "John Doe",
  "clients": [
    {
      "_id": "4c975554-a7a2-4577-8d8e-256147d06e33",
      "name": "Client 1",
      "projects": [
        {
          "_id": "ffe2d006-8a46-4d9e-9e13-0ae469c9bdb9",
          "name": "Project A"
        },
        {
          "_id": "abt4e3076-6u15-3k7e-5z14-5he865l3evb1",
          "name": "Project B"
        }
      ]
    },
    {
      "_id": "fe42cc47-9698-4c58-9903-7996581af5db",
      "name": "Client 2",
      "projects": [
        {
          "_id": "8de6d107-7f31-5r5e-8g31-4yu742k5ged2",
          "name": "Project C"
        },
        {
          "_id": "6tm6n321-5g56-3t1e-2b51-8xz324g5gac1",
          "name": "Project D"
        }
      ]
    }
  ]
}

Second stage

$unwind outputs a new document for each element in the clients array and returns the following result:

{
  "_id": "026c6774-34aa-4224-9106-c73f7f3ad51c",
  "name": "John Doe",
  "clients": [
    {
      "_id": "4c975554-a7a2-4577-8d8e-256147d06e33",
      "name": "Client 1",
      "projects": [
        {
          "_id": "ffe2d006-8a46-4d9e-9e13-0ae469c9bdb9",
          "name": "Project A"
        },
        {
          "_id": "abt4e3076-6u15-3k7e-5z14-5he865l3evb1",
          "name": "Project B"
        }
      ]
    }
  ]
}
{
  "_id": "026c6774-34aa-4224-9106-c73f7f3ad51c",
  "name": "John Doe",
  "clients": [
    {
      "_id": "fe42cc47-9698-4c58-9903-7996581af5db",
      "name": "Client 2",
      "projects": [
        {
          "_id": "8de6d107-7f31-5r5e-8g31-4yu742k5ged2",
          "name": "Project C"
        },
        {
          "_id": "6tm6n321-5g56-3t1e-2b51-8xz324g5gac1",
          "name": "Project D"
        }
      ]
    }
  ]
}

Third stage

The $group stage groups the documents by clients:

{
  "_id": "4c975554-a7a2-4577-8d8e-256147d06e33",
  "name": "Client 1",
  "projects": [
    {
      "_id": "ffe2d006-8a46-4d9e-9e13-0ae469c9bdb9",
      "name": "Project A"
    },
    {
      "_id": "abt4e3076-6u15-3k7e-5z14-5he865l3evb1",
      "name": "Project B"
    }
  ]
}
{
  "_id": "fe42cc47-9698-4c58-9903-7996581af5db",
  "name": "Client 2",
  "projects": [
    {
      "_id": "8de6d107-7f31-5r5e-8g31-4yu742k5ged2",
      "name": "Project C"
    },
    {
      "_id": "6tm6n321-5g56-3t1e-2b51-8xz324g5gac1",
      "name": "Project D"
    }
  ]
}

As you can see, we are able to query all relevant data and construct a response by just using internal database functionality.

Projects

For projects it gets a bit more complex. I made the experience that in most cases the raw project data is not enough. Usually you will also need information from the client a project belongs to. For example if you want to display the client’s name besides the project information in the frontend. In our case, let’s assume the ideal aggregation returns this documents:

{
  "_id": "ffe2d006-8a46-4d9e-9e13-0ae469c9bdb9",
  "name": "Project A",
  "client": {
    "_id": "4c975554-a7a2-4577-8d8e-256147d06e33",
    "name": "Client 1"
  }
}
{
  "_id": "abt4e3076-6u15-3k7e-5z14-5he865l3evb1",
  "name": "Project B",
  "client": {
    "_id": "4c975554-a7a2-4577-8d8e-256147d06e33",
    "name": "Client 1"
  }
}
{
  "_id": "8de6d107-7f31-5r5e-8g31-4yu742k5ged2",
  "name": "Project C",
  "client": {
    "_id": "fe42cc47-9698-4c58-9903-7996581af5db",
    "name": "Client 2"
  }
}
{
  "_id": "6tm6n321-5g56-3t1e-2b51-8xz324g5gac1",
  "name": "Project D",
  "client": {
    "_id": "fe42cc47-9698-4c58-9903-7996581af5db",
    "name": "Client 2"
  }
}

To generate the result we use this aggregation:

db.users.aggregate([
  // First stage
  {
    $match: {
      _id: "026c6774-34aa-4224-9106-c73f7f3ad51c",
    },
  },
  // Second stage
  {
    $unwind: {
      path: "$clients",
    },
  },
  // Third stage
  {
    $unwind: {
      path: "$clients.projects",
    },
  },
  // Fourth stage
  {
    $project: {
      _id: "$clients.projects._id",
      name: "$clients.projects.name",
      client: "$clients",
    },
  },
  // Fifth stage
  {
    $project: {
      "client.projects": 0,
    },
  },
]);

First Stage

Select a document by user-ID and return the following result:

{
  "_id": "026c6774-34aa-4224-9106-c73f7f3ad51c",
  "name": "John Doe",
  "clients": [
    {
      "_id": "4c975554-a7a2-4577-8d8e-256147d06e33",
      "name": "Client 1",
      "projects": [
        {
          "_id": "ffe2d006-8a46-4d9e-9e13-0ae469c9bdb9",
          "name": "Project A"
        },
        {
          "_id": "abt4e3076-6u15-3k7e-5z14-5he865l3evb1",
          "name": "Project B"
        }
      ]
    },
    {
      "_id": "fe42cc47-9698-4c58-9903-7996581af5db",
      "name": "Client 2",
      "projects": [
        {
          "_id": "8de6d107-7f31-5r5e-8g31-4yu742k5ged2",
          "name": "Project C"
        },
        {
          "_id": "6tm6n321-5g56-3t1e-2b51-8xz324g5gac1",
          "name": "Project D"
        }
      ]
    }
  ]
}

Second stage

$unwind outputs a new document for each element in the clients array and passes the following documents to the next stage:

{
  "_id": "026c6774-34aa-4224-9106-c73f7f3ad51c",
  "name": "John Doe",
  "clients": [
    {
      "_id": "4c975554-a7a2-4577-8d8e-256147d06e33",
      "name": "Client 1",
      "projects": [
        {
          "_id": "ffe2d006-8a46-4d9e-9e13-0ae469c9bdb9",
          "name": "Project A"
        },
        {
          "_id": "abt4e3076-6u15-3k7e-5z14-5he865l3evb1",
          "name": "Project B"
        }
      ]
    }
  ]
}
{
  "_id": "026c6774-34aa-4224-9106-c73f7f3ad51c",
  "name": "John Doe",
  "clients": [
    {
      "_id": "fe42cc47-9698-4c58-9903-7996581af5db",
      "name": "Client 2",
      "projects": [
        {
          "_id": "8de6d107-7f31-5r5e-8g31-4yu742k5ged2",
          "name": "Project C"
        },
        {
          "_id": "6tm6n321-5g56-3t1e-2b51-8xz324g5gac1",
          "name": "Project D"
        }
      ]
    }
  ]
}

Third stage

Use $unwind again to output a new document for each element in the projects array:

{
  "_id": "026c6774-34aa-4224-9106-c73f7f3ad51c",
  "name": "John Doe",
  "clients": [
    {
      "_id": "4c975554-a7a2-4577-8d8e-256147d06e33",
      "name": "Client 1",
      "projects": [
        {
          "_id": "ffe2d006-8a46-4d9e-9e13-0ae469c9bdb9",
          "name": "Project A"
        }
      ]
    }
  ]
}
{
  "_id": "026c6774-34aa-4224-9106-c73f7f3ad51c",
  "name": "John Doe",
  "clients": [
    {
      "_id": "4c975554-a7a2-4577-8d8e-256147d06e33",
      "name": "Client 1",
      "projects": [
        {
          "_id": "abt4e3076-6u15-3k7e-5z14-5he865l3evb1",
          "name": "Project B"
        }
      ]
    }
  ]
}
{
  "_id": "026c6774-34aa-4224-9106-c73f7f3ad51c",
  "name": "John Doe",
  "clients": [
    {
      "_id": "fe42cc47-9698-4c58-9903-7996581af5db",
      "name": "Client 2",
      "projects": [
        {
          "_id": "8de6d107-7f31-5r5e-8g31-4yu742k5ged2",
          "name": "Project C"
        }
      ]
    }
  ]
}
{
  "_id": "026c6774-34aa-4224-9106-c73f7f3ad51c",
  "name": "John Doe",
  "clients": [
    {
      "_id": "fe42cc47-9698-4c58-9903-7996581af5db",
      "name": "Client 2",
      "projects": [
        {
          "_id": "6tm6n321-5g56-3t1e-2b51-8xz324g5gac1",
          "name": "Project D"
        }
      ]
    }
  ]
}

Fourth stage

Pass along documents including _id, name and client to the next stage:

{
  "_id": "ffe2d006-8a46-4d9e-9e13-0ae469c9bdb9",
  "name": "Project A",
  "client": {
    "_id": "4c975554-a7a2-4577-8d8e-256147d06e33",
    "name": "Client 1",
    "projects": [
      {
        "_id": "ffe2d006-8a46-4d9e-9e13-0ae469c9bdb9",
        "name": "Project A"
      }
    ]
  }
}
{
  "_id": "abt4e3076-6u15-3k7e-5z14-5he865l3evb1",
  "name": "Project B",
  "client": {
    "_id": "4c975554-a7a2-4577-8d8e-256147d06e33",
    "name": "Client 1",
    "projects": [
      {
        "_id": "abt4e3076-6u15-3k7e-5z14-5he865l3evb1",
        "name": "Project B"
      }
    ]
  }
}
{
  "_id": "8de6d107-7f31-5r5e-8g31-4yu742k5ged2",
  "name": "Project C",
  "client": {
    "_id": "fe42cc47-9698-4c58-9903-7996581af5db",
    "name": "Client 2",
    "projects": [
      {
        "_id": "8de6d107-7f31-5r5e-8g31-4yu742k5ged2",
        "name": "Project C"
      }
    ]
  }
}
{
  "_id": "6tm6n321-5g56-3t1e-2b51-8xz324g5gac1",
  "name": "Project D",
  "client": {
    "_id": "fe42cc47-9698-4c58-9903-7996581af5db",
    "name": "Client 2",
    "projects": [
      {
        "_id": "6tm6n321-5g56-3t1e-2b51-8xz324g5gac1",
        "name": "Project D"
      }
    ]
  }
}

Fifth stage

Remove the projects field from the clients:

{
  "_id": "ffe2d006-8a46-4d9e-9e13-0ae469c9bdb9",
  "name": "Project A",
  "client": {
    "_id": "4c975554-a7a2-4577-8d8e-256147d06e33",
    "name": "Client 1"
  }
}
{
  "_id": "abt4e3076-6u15-3k7e-5z14-5he865l3evb1",
  "name": "Project B",
  "client": {
    "_id": "4c975554-a7a2-4577-8d8e-256147d06e33",
    "name": "Client 1"
  }
}
{
  "_id": "8de6d107-7f31-5r5e-8g31-4yu742k5ged2",
  "name": "Project C",
  "client": {
    "_id": "fe42cc47-9698-4c58-9903-7996581af5db",
    "name": "Client 2"
  }
}
{
  "_id": "6tm6n321-5g56-3t1e-2b51-8xz324g5gac1",
  "name": "Project D",
  "client": {
    "_id": "fe42cc47-9698-4c58-9903-7996581af5db",
    "name": "Client 2"
  }
}

Conclusion

I hope I could give you a little insight into the aggregation framework. It might be a bit overwhelming at first, as it comes with a lot of functionality, but it becomes very handy if you get some experience in using it.