Scaling GraphQL with Postgres - Lessons learned from our database timeout issues
Context
Last year at Cycle was challenging. We were in a big transition phase, experiencing a lot of technical issues while trying to monetize a product that had not yet delivered the vision we pitched.
On the technical side, the main problem was database timeouts, which caused the servers to constantly restart and become unstable. The only way to “quick fix” it was to increase the health check period and force a restart of all the servers. That was dirty, and more importantly, it did not fix the real problem. Why were we experiencing these timeouts?
It was crucial for the company to resolve this issue, as Cycle is a mission-critical product that some of our users rely on for more than 6 hours a day. Having the servers down meant we were disrupting their processes.
The deployment problem
Every time we deployed, we encountered timeout issues. This occurred specifically when there were many users connected and working on Cycle. We realized that schema desynchronization between the database and the API could be a reason for these errors, as the API was requesting data that did not match the database. However, even when we didn't have any breaking migration to run, we still experienced the same issue. In fact, right after every deployment, we saw a surge in database requests. It was as if we had a large number of new users connecting to the app simultaneously.
Reflecting on this now, I realize how trivial this problem seems. In Cycle, everything is real-time. We use GraphQL subscriptions and simple web sockets for real-time collaboration. When the frontend loses connection, it needs to reconnect and relaunch all the observable queries required for workspace configuration/initialization. When our servers were restarting after a deployment, the hundreds of connected users would lose connection and then reconnect almost simultaneously, triggering multiple queries at the same time, resulting in a kind of DDoS on our servers 🫣.
The solution is simple: instead of relaunching all the queries every time the socket reconnects, we check whether the client itself lost its internet connection. If it never did (the connection dropped because our servers restarted), there is no need to relaunch anything, since the workspace configuration is still up to date. When do we need to relaunch all the observable queries, then? Only when the client actually lost its connection to the internet, for instance when you close your laptop and reopen it with Cycle still open. That's it.
A small part of our ApolloLink configuration with the line that solved our reconnection issues.
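For illustration, here is a minimal sketch of that idea, not our exact configuration, assuming the client uses Apollo Client with a graphql-ws subscription link (the endpoint URL and the listener wiring are placeholders): relaunch the observable queries on reconnect only if the browser actually went offline in the meantime.

```typescript
import { ApolloClient, InMemoryCache } from '@apollo/client';
import { GraphQLWsLink } from '@apollo/client/link/subscriptions';
import { createClient } from 'graphql-ws';

// Track whether the *client* lost its internet connection, as opposed to
// the websocket dropping because our servers restarted during a deploy.
let wentOffline = false;
window.addEventListener('offline', () => {
  wentOffline = true;
});

const wsLink = new GraphQLWsLink(
  createClient({
    url: 'wss://api.example.com/graphql', // placeholder endpoint
    on: {
      connected: () => {
        // Relaunch the observable queries only if the client itself went
        // offline; a reconnect caused by a server restart doesn't need it.
        if (wentOffline) {
          wentOffline = false;
          apolloClient.reFetchObservableQueries();
        }
      },
    },
  })
);

const apolloClient = new ApolloClient({
  link: wsLink, // in a real setup this is combined with an HTTP link
  cache: new InMemoryCache(),
});
```

The key part is the `wentOffline` check: a reconnect triggered by a deployment no longer turns into a refetch storm.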
The big query problem
Solving this problem wasn’t enough. We still experienced issues when some workspaces had very complex view configurations in Cycle. One thing we know for sure is that Cycle has a very flexible data model. It can model any relational data structure and can be seen as a lightweight relational database. It’s a dream for no-code users, but not so much for developers.
To give you a sense of how intricate the data can be: the values are custom, the properties those values belong to are custom, and the links between those properties and the doc types are custom. The hierarchy links between those doc types are also custom. This gives us a very deep graph to query in the context of a view configuration. To meet our performance targets, we optimized all our dataloaders and followed the GraphQL best practices (have a look here, as our API is public), but that was not enough. Querying multi-dimensional group-by configurations with filters on those same levels led to database timeouts.
The more our users use Cycle, the more extensively they use these concepts (even more than we do), and the complex view configurations they created were the cause of those timeouts when queried repeatedly. But why? Taking a closer look at the data structure, we found that the tables responsible for the multi-dimensional group-by logic were missing proper indexes. Adding those simple indexes instantly solved the issue. This led us to review our indexing logic entirely and make drastic changes that significantly improved the whole app’s performance. It was a silly mistake, and I recommend everyone read Markus Winand’s wonderful “Use the Index, Luke” to fully understand the impact of a good indexing strategy and how to get it right.
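To make that concrete, the fix is essentially a migration adding a composite index covering the columns that the group-by and filter clauses touch. The sketch below uses Knex, and the table and column names are hypothetical, not our actual schema:

```typescript
import { Knex } from 'knex';

// Hypothetical table backing the multi-dimensional group-by logic.
// Without an index on the columns used for grouping and filtering,
// Postgres falls back to sequential scans and the queries time out.
export async function up(knex: Knex): Promise<void> {
  await knex.schema.alterTable('doc_property_values', (table) => {
    table.index(['workspace_id', 'property_id', 'doc_id']);
  });
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.alterTable('doc_property_values', (table) => {
    table.dropIndex(['workspace_id', 'property_id', 'doc_id']);
  });
}
```

Running the slow query with EXPLAIN ANALYZE before and after is the quickest way to confirm that the planner actually uses the new index.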
The big updates problem
The last source of database timeouts was found more quickly. Bulk updates are a significant part of Cycle: people use them to change large amounts of data all at once. Doing so on a big data set can quickly overwhelm the servers. Say you want to update the status of 300 documents, and each status change triggers notifications and automations. If several people do this at the same time, the servers will crash for sure.
Implementing rate limits and complexity limits is not enough if you want to provide a seamless user experience. That’s when we introduced our queuing system in Cycle. Doing so was a game changer for the way we approached the state management of the API. Every action potentially harmful to the state was batched and added to the queue.
Not only did it drastically improve the state health in general, but most queuing systems also let you retry failed jobs and give you access to their inputs. That makes debugging sessions a lot easier and gives you more visibility into the data coming in from your integrations.
The batch logic for the queuing system
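Again, a sketch rather than our actual implementation: assuming a Redis-backed queue like BullMQ, a bulk status update is chunked into small jobs and drained by a worker at a pace the database can handle. The queue name, job shape, chunk size, and helper function are all illustrative.

```typescript
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 }; // assumed Redis connection

// Potentially harmful mutations are not executed inline: they are
// chunked and pushed onto a queue instead.
const bulkUpdateQueue = new Queue('bulk-doc-updates', { connection });

export async function enqueueStatusUpdate(docIds: string[], statusId: string) {
  const chunkSize = 50; // illustrative batch size
  const jobs: { name: string; data: { docIds: string[]; statusId: string } }[] = [];
  for (let i = 0; i < docIds.length; i += chunkSize) {
    jobs.push({
      name: 'update-status',
      data: { docIds: docIds.slice(i, i + chunkSize), statusId },
    });
  }
  await bulkUpdateQueue.addBulk(jobs);
}

// Hypothetical helper performing the actual update plus the
// notifications and automations it triggers.
async function updateDocStatuses(docIds: string[], statusId: string): Promise<void> {
  // ...batched UPDATE against Postgres, then side effects
}

// The worker drains the queue with bounded concurrency, and failed jobs
// keep their input around, which makes retries and debugging much easier.
new Worker(
  'bulk-doc-updates',
  async (job) => {
    const { docIds, statusId } = job.data;
    await updateDocStatuses(docIds, statusId);
  },
  { connection, concurrency: 5 }
);
```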
What we learned from this
None of the above is rocket science. It's common sense, but oftentimes we focus on cutting-edge solutions to what seem like very complex problems, while the root cause is actually much simpler.
The general rule of thumb would be: protect your state. By state, I mean the interface between your business logic and your access to third-party APIs and the database. This is the most critical part of your application, and taking good care of it can only improve the user (and dev) experience.
There are a lot of techniques out there to protect it. The ones I described above are only a small subset, and their impact depends on the type of application you are building. Here is the full list of techniques that have worked at Cycle so far:
- Extensive use of dataloaders to batch and cache query results (see the sketch after this list)
- Pagination of every potentially unlimited list, including all things related to view configuration
- Rework of the indexing logic, specifically for deeply linked data
- Setup of a queuing system to batch and mitigate intensive mutations
- Rate and complexity limits to avoid a single user/app using too many API resources
- Custom subscription service to have complete control over the data we publish
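To make the first item concrete, here is a minimal dataloader sketch (the table and loader names are illustrative, not our actual code): every property lookup made while resolving a view is collapsed into one batched query per request instead of one query per doc.

```typescript
import DataLoader from 'dataloader';
import type { Knex } from 'knex';

type DocProperty = { id: string; doc_id: string; name: string; value: string };

// One set of loaders per request, so the per-request cache is never
// shared between users or between requests.
export function createLoaders(db: Knex) {
  return {
    // Collapses every load(docId) call made while resolving the GraphQL
    // tree into a single query, instead of one query per doc.
    propertiesByDocId: new DataLoader<string, DocProperty[]>(async (docIds) => {
      const rows = await db<DocProperty>('doc_properties') // hypothetical table
        .whereIn('doc_id', [...docIds]);
      // DataLoader expects results in the same order as the input keys.
      return docIds.map((id) => rows.filter((row) => row.doc_id === id));
    }),
  };
}

// In a resolver: context.loaders.propertiesByDocId.load(doc.id)
```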
With that mindset and a good observability setup, most API problems can be avoided and/or foreseen. This is how we achieved over 99.95% uptime this year, and I can only hope this will continue to improve!