A Checklist for Building with GraphQL

What do you need to operate GraphQL in production?

Feb 21, 2023

Are you ready to adopt GraphQL and budget out the costs of building, maintaining, and operating this system for your organization? If so, this essay is for you. I’m going to walk you through the engineers and engineering investments you’ll need to be successful with GraphQL.

It’s worth noting that there are GraphQL vendors that can provide these requirements as a service. When vetting these vendors, this essay can help guide you in evaluating whether a vendor is right for your organization. Just take each requirement listed here and consider whether the vendor can handle it for you. If so, you’ve found a great partner.

Need Help?

Do you feel like you could use some help?

I help organizations build GraphQL services and teams. Reach out to me at consulting@blankenship.io

How Many Engineers?

Let’s begin by tallying up the engineers we’ll need to run this service.

You don’t need an entire team of experienced GraphQL engineers to make this architecture work, but you should have at least two engineers who can mentor others through code reviews and support the team during difficult incidents.

These engineers will be responsible for the architecture of your GraphQL service, for reviewing the changes their team members propose to the service or schema, and for assisting during incidents.

Why two? Redundancy. If we only have one, they’ll be on-call (through escalation policies) 24/7 and become a bottleneck for all major work in that codebase.

These two engineers need to be well-versed in GraphQL schema design to make sure the schema:

is well-connected for client queries
makes it easy to represent valid states
makes it difficult to query for invalid states
is performant given the downstream services that are being queried
makes it easy to apply a layer of access controls over the schema
is consistent in how users access data across models and domains

They should also be experienced in building and maintaining the GraphQL tooling in the following section.

Tooling

Next, we’re going to need to invest in some tooling. Let’s break down the minimum investments we’ll need to make.

Runtime safety

We need to protect our server against misbehaving clients and malicious users.

We’ll need query complexity guards to make sure a single query, or a few well-crafted queries, can’t act as a denial-of-service attack against our backend infrastructure.
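As a rough illustration of what a complexity guard checks, here is a minimal sketch in Python. It assumes the query has already been parsed into nested dictionaries (field name mapped to its sub-selection), which is a hypothetical representation and not the output of any particular GraphQL library. The limits are placeholders. A production guard would typically also weigh list sizes and per-field costs.

```python
from typing import Dict

# Hypothetical representation of a parsed query:
# field name -> sub-selection (an empty dict marks a leaf field).
Selection = Dict[str, "Selection"]

MAX_DEPTH = 5    # assumed limit, tune for your schema
MAX_FIELDS = 50  # assumed limit, tune for your schema


def query_depth(selection: Selection) -> int:
    """Return the deepest nesting level in the selection."""
    if not selection:
        return 0
    return 1 + max(query_depth(child) for child in selection.values())


def field_count(selection: Selection) -> int:
    """Count every field requested anywhere in the selection."""
    return sum(1 + field_count(child) for child in selection.values())


def is_allowed(selection: Selection) -> bool:
    """Decide whether a query stays within both limits."""
    return (
        query_depth(selection) <= MAX_DEPTH
        and field_count(selection) <= MAX_FIELDS
    )


# Illustrative query: a user, their friends, and friends of friends.
example_query: Selection = {
    "user": {"name": {}, "friends": {"name": {}, "friends": {"name": {}}}}
}
```

The check is meant to run before any resolver fires, so a rejected query never reaches the downstream services.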

We’ll also need an access control layer to ensure that the user making the query is authorized to access the data the query would return.
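One way to picture such a layer is a field-level policy that is consulted before a query runs. The sketch below is only an assumption about how it could look: the policy table, the role names, and the "Type.field" naming are all hypothetical, and anything missing from the policy is denied by default.

```python
from typing import Dict, List, Set

# Hypothetical policy: "Type.field" -> roles allowed to read that field.
FIELD_POLICY: Dict[str, Set[str]] = {
    "User.name": {"viewer", "admin"},
    "User.email": {"admin"},
}


def denied_fields(requested: List[str], roles: Set[str]) -> List[str]:
    """Return the requested fields that none of the caller's roles may read.

    Fields that are missing from the policy are denied by default.
    """
    return [
        field_name
        for field_name in requested
        if not (FIELD_POLICY.get(field_name, set()) & roles)
    ]
```

If the returned list is non-empty, the service can reject the query outright and record the attempt, which also gives the alerting layer something to watch.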

Without these safety checks, a malicious user can help our company make the news for the wrong reasons.

Introspection

We’ll need to be prepared for when this service starts misbehaving.

Tracing data is essential. These traces should connect an incoming query to its performance characteristics. At the very least, this should show the query and the resolvers that were fired. An ideal setup would also link the tracing data from our GraphQL service to the tracing data in downstream services used to resolve the query.
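To make the "query plus the resolvers that were fired" idea concrete, here is a small sketch of a resolver decorator that appends timing spans to a per-query trace. The QueryTrace class, the traced decorator, and the resolve_user resolver are hypothetical names invented for this example. A real setup would more likely hand these spans to an existing tracing system so they can be linked to downstream traces.

```python
import time
from dataclasses import dataclass, field
from functools import wraps
from typing import Any, Callable, List, Tuple


@dataclass
class QueryTrace:
    """Hypothetical per-request trace: the query text plus resolver spans."""
    query: str
    spans: List[Tuple[str, float]] = field(default_factory=list)


def traced(name: str) -> Callable:
    """Record how long the wrapped resolver ran, in seconds, on the trace."""
    def decorator(resolver: Callable) -> Callable:
        @wraps(resolver)
        def wrapper(trace: QueryTrace, *args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            try:
                return resolver(trace, *args, **kwargs)
            finally:
                # Runs even when the resolver raises, so failures are traced too.
                trace.spans.append((name, time.perf_counter() - start))
        return wrapper
    return decorator


@traced("Query.user")
def resolve_user(trace: QueryTrace, user_id: int) -> dict:
    # Stand-in for a call to a downstream service.
    return {"id": user_id, "name": "Ada"}


trace = QueryTrace(query="{ user(id: 1) { name } }")
user = resolve_user(trace, 1)
```

After the call, trace.spans holds one entry naming Query.user alongside its duration, tied to the query text that triggered it.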

Health metrics are also important. We need to track the shape, size, and volume of incoming queries, the resolvers being invoked and their performance characteristics, requests to downstream services, and the overall performance of our API. These are in addition to the standard HTTP and host metrics we are used to collecting for other services.

We are going to have a large number of traces. The metrics help us identify the rough characteristics of queries that might be causing problems. This helps us sift through those traces to find the likely culprits during an incident.
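As a sketch of how metrics can narrow down which traces to read, the snippet below groups latencies by query "shape". It assumes queries are available as nested dictionaries (a hypothetical parsed form) and builds a signature that ignores field order, so structurally identical queries land in the same bucket. The function names and the in-memory store are illustrative only.

```python
from collections import defaultdict
from statistics import mean
from typing import DefaultDict, Dict, List

# Hypothetical parsed query: field name -> sub-selection.
Selection = Dict[str, "Selection"]


def shape_signature(selection: Selection) -> str:
    """Build an order-independent string describing the query's structure."""
    parts = []
    for name in sorted(selection):
        child = selection[name]
        parts.append(f"{name}{{{shape_signature(child)}}}" if child else name)
    return " ".join(parts)


# In-memory stand-in for a metrics backend: shape -> observed latencies.
latencies_ms: DefaultDict[str, List[float]] = defaultdict(list)


def record(selection: Selection, latency_ms: float) -> None:
    """Record one request's latency under its query shape."""
    latencies_ms[shape_signature(selection)].append(latency_ms)


def slowest_shape() -> str:
    """Return the shape with the highest mean latency."""
    return max(latencies_ms, key=lambda shape: mean(latencies_ms[shape]))


record({"user": {"name": {}, "id": {}}}, 40.0)
record({"user": {"id": {}, "name": {}}}, 60.0)  # same shape, different order
record({"orders": {"items": {"sku": {}}}}, 900.0)
```

During an incident, the slowest shape tells us which group of traces to open first.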

Alerts

Finally, we’ll need to set up alerts that page out to the GraphQL team when the performance of our service degrades, or when our access control layer indicates that a malicious user may be attempting to search for vulnerabilities in our schema.

This can be particularly tricky. GraphQL behaves much like an API gateway, so its performance is tightly coupled to the performance of the downstream services it calls. If we don’t take care when constructing our metrics and alerts, we’ll find that the GraphQL team gets paged for every incident in the company.
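One possible way to avoid that trap is to alert on the GraphQL layer’s own overhead instead of total latency. The sketch below assumes each request records how long it spent waiting on downstream services. The RequestTiming type, the 50 ms budget, and the choice of the median are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class RequestTiming:
    total_ms: float       # wall-clock time for the whole GraphQL request
    downstream_ms: float  # portion spent waiting on downstream services


OVERHEAD_BUDGET_MS = 50.0  # assumed threshold, tune for your service


def should_page_graphql_team(timings: List[RequestTiming]) -> bool:
    """Page only when the GraphQL layer's own overhead exceeds its budget."""
    if not timings:
        return False
    overhead = [t.total_ms - t.downstream_ms for t in timings]
    return median(overhead) > OVERHEAD_BUDGET_MS


# A downstream outage: requests are slow, but the GraphQL layer is not.
downstream_outage = [RequestTiming(2000.0, 1980.0) for _ in range(5)]

# A GraphQL regression: downstream is fast, the overhead is high.
graphql_regression = [RequestTiming(300.0, 100.0) for _ in range(5)]
```

With this rule, the downstream outage would page the owning team through their own alerts, while only the regression inside the GraphQL service pages the GraphQL team.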

Conclusion

If you’ve been paying close attention, you’ll notice that the engineering skillset you’re searching for is quite similar to that of a database engineer. These engineers are responsible for designing schemas, and the tools you are implementing are quite similar to those you’d want for something like PostgreSQL.

It’s important to remember that this is just the basic set of requirements you’ll need to safely operate a GraphQL service. As you scale, there are plenty of performance optimizations and architecture designs you will need to explore down the road.

I’m always interested in feedback and good conversations. Email me at essays@blankenship.io

William Blankenship
