Five scalability pitfalls to avoid with your Kafka application

Apache Kafka is a high-performance, highly scalable event streaming platform. To unlock Kafka’s full potential, you need to carefully consider the design of your application. It’s all too easy to write Kafka applications that perform poorly or eventually hit a scalability brick wall. Since 2015, IBM has provided the IBM Event Streams service, which is a fully-managed Apache Kafka service running on IBM Cloud®. Since then, the service has helped many customers, as well as teams within IBM, resolve scalability and performance problems with the Kafka applications they have written.

This article describes some of the common problems seen with Apache Kafka applications and provides recommendations for how you can avoid running into scalability problems with your applications.

1. Minimize waiting for network round-trips

Certain Kafka operations work by the client sending data to the broker and waiting for a response. A whole round-trip might take 10 milliseconds, which sounds speedy, but limits you to at most 100 operations per second. For this reason, it’s recommended that you try to avoid these kinds of operations whenever possible. Fortunately, Kafka clients provide ways for you to avoid waiting on these round-trip times. You just need to ensure that you’re taking advantage of them.

Tips to maximize throughput:

Don’t check every message sent to see if it succeeded. Kafka’s API allows you to decouple sending a message from checking if the message was successfully received by the broker. Waiting for confirmation that a message was received can introduce network round-trip latency into your application, so aim to minimize this where possible. This could mean sending as many messages as possible before checking to confirm they were all received. Or it could mean delegating the check for successful message delivery to another thread of execution within your application so it can run in parallel with you sending more messages.
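
As a rough illustration, here is a minimal Java sketch of this pattern using the standard Kafka producer client; the broker address, topic name and serializer choices are assumptions made for the example, not recommendations.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class AsyncSendSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 10_000; i++) {
                    ProducerRecord<String, String> record =
                            new ProducerRecord<>("example-topic", "key-" + i, "value-" + i);
                    // Slow: producer.send(record).get() blocks on a broker round-trip per message.
                    // Faster: pass a callback so failures are reported without stalling the loop.
                    producer.send(record, (metadata, exception) -> {
                        if (exception != null) {
                            System.err.println("Send failed: " + exception.getMessage());
                        }
                    });
                }
                // Wait once for everything still in flight, rather than once per message.
                producer.flush();
            }
        }
    }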

Don’t follow the processing of each message with an offset commit. Committing offsets (synchronously) is implemented as a network round-trip with the server. Either commit offsets less frequently, or use the asynchronous offset commit function to avoid paying the price for this round-trip for every message you process. Just be aware that committing offsets less frequently can mean that more data needs to be re-processed if your application fails.
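
Sketched against the standard Java consumer, one possible shape of this is shown below; processRecord is a placeholder for your own logic, and committing once per batch is just one possible choice of frequency.

    // Inside the application's poll loop; `consumer` is a KafkaConsumer<String, String>.
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
    for (ConsumerRecord<String, String> record : records) {
        processRecord(record);   // placeholder for your own processing
        // Avoid calling consumer.commitSync() here: that is one broker round-trip per message.
    }
    // Commit once per batch instead, and do it asynchronously so the loop isn't blocked.
    consumer.commitAsync((offsets, exception) -> {
        if (exception != null) {
            System.err.println("Offset commit failed: " + exception.getMessage());
        }
    });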

If you read the above and thought, “Uh oh, won’t that make my application more complex?” the answer is yes, it likely will. There is a trade-off between throughput and application complexity. What makes network round-trip time a particularly insidious pitfall is that once you hit this limit, it can require extensive application changes to achieve further throughput improvements.

2. Don’t let increased processing times be mistaken for consumer failures

One helpful feature of Kafka is that it monitors the “liveness” of consuming applications and disconnects any that might have failed. This works by having the broker track when each consuming client last called “poll” (Kafka’s terminology for asking for more messages). If a client doesn’t poll frequently enough, the broker to which it is connected concludes that it must have failed and disconnects it. This is designed to allow the clients that are not experiencing problems to step in and pick up work from the failed client.

Unfortunately, with this scheme the Kafka broker can’t distinguish between a client that is taking a long time to process the messages it received and a client that has actually failed. Consider a consuming application that loops: 1) calls poll and gets back a batch of messages; 2) processes each message in the batch, taking 1 second to process each message.
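
In Java, against the standard consumer API, that loop looks roughly like the sketch below (the sleep stands in for 1 second of real processing per message, and the imports from org.apache.kafka.clients.consumer and java.time are assumed).

    // Simplified consuming loop for the scenario described above.
    static void consumeLoop(KafkaConsumer<String, String> consumer) throws InterruptedException {
        while (true) {
            ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(1000));
            for (ConsumerRecord<String, String> record : batch) {
                Thread.sleep(1000);   // stands in for 1 second of work per message
            }
            // The next poll() only happens after the whole batch has been processed,
            // so the time between polls grows with the size of the batch.
        }
    }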

If this consumer is receiving batches of 10 messages, then it’ll be approximately 10 seconds between calls to poll. By default, Kafka will allow up to 300 seconds (5 minutes) between polls before disconnecting the client, so everything would work fine in this scenario. But what happens on a really busy day when a backlog of messages starts to build up on the topic that the application is consuming from? Rather than just getting 10 messages back from each poll call, your application gets 500 messages (by default this is the maximum number of records that can be returned by a call to poll). At 1 second per message, that is over 8 minutes of processing, which is enough time for Kafka to decide the application instance has failed and disconnect it. This is bad news.

You’ll be delighted to learn that it can get worse. It is possible for a kind of feedback loop to occur. As Kafka starts to disconnect clients because they aren’t calling poll frequently enough, there are fewer instances of the application to process messages. The likelihood of there being a large backlog of messages on the topic increases, leading to an increased likelihood that more clients will get large batches of messages and take too long to process them. Eventually all the instances of the consuming application get into a restart loop, and no useful work is done.

What steps can you take to avoid this happening to you?

The maximum amount of time between poll calls can be configured using the Kafka consumer “max.poll.interval.ms” configuration. The maximum number of messages that can be returned by any single poll is also configurable using the “max.poll.records” configuration. As a rule of thumb, aim to reduce “max.poll.records” in preference to increasing “max.poll.interval.ms”, because setting a large maximum poll interval will make Kafka take longer to identify consumers that really have failed.
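
For example, a consumer configuration along these lines trades smaller batches for a comfortable safety margin on the poll interval; the exact values below are illustrative, not recommendations, and the ConsumerConfig constants come from the standard Java client.

    Properties props = new Properties();
    props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "50");          // default is 500
    props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");  // the 5-minute default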

Kafka consumers can also be instructed to pause and resume the flow of messages. Pausing consumption prevents the poll method from returning any messages, but still resets the timer used to determine if the client has failed. Pausing and resuming is a useful tactic if you both: a) expect that individual messages will potentially take a long time to process; and b) want Kafka to be able to detect a client failure part way through processing an individual message.
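
A minimal sketch of this tactic is shown below, assuming the long-running work is handed off to a separate thread; the executor, the processSlowly method and the 100-millisecond poll interval are all illustrative assumptions (imports from java.util.concurrent are also assumed).

    static final ExecutorService executor = Executors.newSingleThreadExecutor();

    static void handleSlowRecord(KafkaConsumer<String, String> consumer,
                                 ConsumerRecord<String, String> record) {
        consumer.pause(consumer.assignment());        // stop receiving new messages
        Future<?> work = executor.submit(() -> processSlowly(record));
        while (!work.isDone()) {
            // Returns no records while paused, but still counts as a poll,
            // so the broker doesn't mistake slow processing for a failure.
            consumer.poll(Duration.ofMillis(100));
        }
        consumer.resume(consumer.assignment());       // start receiving messages again
    }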

Don’t overlook the usefulness of the Kafka client metrics. The topic of metrics could fill a whole article in its own right, but in this context the consumer exposes metrics for both the average and maximum time between polls. Monitoring these metrics can help identify situations where a downstream system is the reason that each message received from Kafka is taking longer than expected to process.
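
As a quick sketch, these metrics can be read directly from the consumer; the metric names used below are the ones exposed by recent Java clients, so check them against your client version.

    // Fragment: print the poll-interval metrics from a KafkaConsumer instance.
    for (Map.Entry<MetricName, ? extends Metric> entry : consumer.metrics().entrySet()) {
        String name = entry.getKey().name();
        if (name.equals("time-between-poll-avg") || name.equals("time-between-poll-max")) {
            System.out.println(name + " = " + entry.getValue().metricValue());
        }
    }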

We’ll return to the topic of consumer failures later in this article, when we look at how they can trigger consumer group re-balancing and the disruptive effect this can have.

3. Minimize the cost of idle consumers

Under the hood, the protocol used by the Kafka consumer to receive messages works by sending a “fetch” request to a Kafka broker. As part of this request the client indicates what the broker should do if there aren’t any messages to hand back, including how long the broker should wait before sending an empty response. By default, Kafka consumers instruct the brokers to wait up to 500 milliseconds (controlled by the “fetch.max.wait.ms” consumer configuration) for at least 1 byte of message data to become available (controlled with the “fetch.min.bytes” configuration).

Waiting for 500 milliseconds doesn’t sound unreasonable, but if your application has consumers that are mostly idle, and scales to say 5,000 instances, that’s potentially 10,000 fetch requests per second (each idle instance completes roughly two empty fetches per second) that accomplish absolutely nothing. Each of these requests takes CPU time on the broker to process, and at the extreme can impact the performance and stability of the Kafka clients that want to do useful work.

Normally Kafka’s approach to scaling is to add more brokers, and then evenly re-balance topic partitions across all the brokers, both old and new. Unfortunately, this approach might not help if your clients are bombarding Kafka with needless fetch requests. Each client will send fetch requests to every broker leading a topic partition that the client is consuming messages from. So it is possible that even after scaling the Kafka cluster, and re-distributing partitions, most of your clients will be sending fetch requests to most of the brokers.

So, what can you do?

Changing the Kafka consumer configuration can help reduce this effect. If you want to receive messages as soon as they arrive, “fetch.min.bytes” must remain at its default of 1; however, the “fetch.max.wait.ms” setting can be increased to a larger value, and doing so will reduce the number of requests made by idle consumers.
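
A consumer configuration sketch for mostly-idle consumers might look like this; the 5-second wait is illustrative rather than a recommendation.

    Properties props = new Properties();
    props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1");        // the default: deliver messages promptly
    props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "5000");   // default is 500; fewer empty fetches when idle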

At a broader scope, does your application need to have potentially thousands of instances, each of which consumes very infrequently from Kafka? There may be very good reasons why it does, but perhaps there are ways that it could be designed to make more efficient use of Kafka. We’ll touch on some of these considerations in the next section.

4. Choose appropriate numbers of topics and partitions

If you come to Kafka from a background with other publish–subscribe systems (for example Message Queuing Telemetry Transport, or MQTT for short) then you might expect Kafka topics to be very lightweight, almost ephemeral. They are not. Kafka is much more comfortable with a number of topics measured in the thousands. Kafka topics are also expected to be relatively long lived. Practices such as creating a topic to receive a single reply message, then deleting the topic, are uncommon with Kafka and do not play to Kafka’s strengths.

Instead, plan for topics that are long lived. Perhaps they share the lifetime of an application or an activity. Also aim to limit the number of topics to the hundreds or perhaps low thousands. This might require taking a different perspective on what messages are interleaved on a particular topic.

A related question that often arises is, “How many partitions should my topic have?” Traditionally, the advice is to overestimate, because adding partitions after a topic has been created doesn’t change the partitioning of existing data held on the topic (and hence can affect consumers that rely on partitioning to offer message ordering within a partition). This is good advice; however, we’d like to suggest a few additional considerations:

For topics that can expect a throughput measured in MB/second, or where throughput could grow as you scale up your application, we strongly recommend having more than one partition, so that the load can be spread across multiple brokers. The Event Streams service always runs Kafka with a multiple of 3 brokers. At the time of writing, it has a maximum of up to 9 brokers, but perhaps this will be increased in the future. If you pick a multiple of 3 for the number of partitions in your topic then it can be balanced evenly across all the brokers (see the sketch after this list).

The number of partitions in a topic is the limit to how many Kafka consumers can usefully share consuming messages from the topic with Kafka consumer groups (more on these later). If you add more consumers to a consumer group than there are partitions in the topic, some consumers will sit idle not consuming message data.

There’s nothing inherently wrong with having single-partition topics as long as you’re absolutely sure they’ll never receive significant messaging traffic, or you won’t be relying on ordering within a topic and are happy to add more partitions later.
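
To make the partition-count advice concrete, here is a sketch of creating a topic with its partition count chosen up front using the Kafka admin client; the topic name, the 6 partitions (a multiple of 3) and the replication factor of 3 are illustrative assumptions, and error handling is omitted for brevity.

    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumption: local broker
    try (Admin admin = Admin.create(props)) {
        // 6 partitions (a multiple of 3) can be spread evenly across a 3-broker cluster.
        NewTopic topic = new NewTopic("orders", 6, (short) 3);
        admin.createTopics(Collections.singleton(topic)).all().get();
    }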

5. Consumer group re-balancing can be surprisingly disruptive

Most Kafka applications that consume messages take advantage of Kafka’s consumer group capabilities to coordinate which clients consume from which topic partitions. If your recollection of consumer groups is a little hazy, here’s a quick refresher on the key points:

Consumer groups coordinate a group of Kafka clients such that only one client is receiving messages from a particular topic partition at any given time. This is useful if you need to share out the messages on a topic among a number of instances of an application.

When a Kafka client joins a consumer group or leaves a consumer group that it has previously joined, the consumer group is re-balanced. Commonly, clients join a consumer group when the application they are part of is started, and leave because the application is shut down, restarted or crashes.

When a group re-balances, topic partitions are re-distributed among the members of the group. So for example, if a client joins a group, some of the clients that are already in the group might have topic partitions taken away from them (or “revoked” in Kafka’s terminology) to give to the newly joining client. The reverse is also true: when a client leaves a group, the topic partitions assigned to it are re-distributed amongst the remaining members.

As Kafka has matured, increasingly sophisticated re-balancing algorithms have been (and continue to be) devised. In early versions of Kafka, when a consumer group re-balanced, all the clients in the group had to stop consuming, the topic partitions would be redistributed amongst the group’s new members and all the clients would start consuming again. This approach has two drawbacks (don’t worry, these have since been improved):

All the clients in the group stop consuming messages while the re-balance occurs. This has obvious repercussions for throughput.

Kafka clients typically try to keep a buffer of messages that have yet to be delivered to the application and fetch more messages from the broker before the buffer is drained. The intent is to prevent message delivery to the application stalling while more messages are fetched from the Kafka broker (yes, as per earlier in this article, the Kafka client is also trying to avoid waiting on network round-trips). Unfortunately, when a re-balance causes partitions to be revoked from a client then any buffered data for the partition has to be discarded. Likewise, when re-balancing causes a new partition to be assigned to a client, the client will start to buffer data starting from the last committed offset for the partition, potentially causing a spike in network throughput from broker to client. This is caused by the client to which the partition has been newly assigned re-reading message data that had previously been buffered by the client from which the partition was revoked.

More recent re-balance algorithms have made significant improvements by, to use Kafka’s terminology, adding “stickiness” and “cooperation”:

“Sticky” algorithms try to ensure that after a re-balance, as many group members as possible keep the same partitions they had prior to the re-balance. This minimizes the amount of buffered message data that is discarded or re-read from Kafka when the re-balance occurs.

“Cooperative” algorithms allow clients to keep consuming messages while a re-balance occurs. When a client has a partition assigned to it prior to a re-balance and keeps the partition after the re-balance has occurred, it can keep consuming from that partition uninterrupted by the re-balance. This is synergistic with “stickiness,” which acts to keep partitions assigned to the same client.

Despite these enhancements to more recent re-balancing algorithms, if your applications are frequently subject to consumer group re-balances, you will still see an impact on overall messaging throughput and be wasting network bandwidth as clients discard and re-fetch buffered message data. Here are some suggestions about what you can do:

Ensure you can spot when re-balancing is occurring. At scale, collecting and visualizing metrics is your best option. This is a situation where a breadth of metric sources helps build the complete picture. The Kafka broker has metrics for both the amount of bytes of data sent to clients, and also the number of consumer groups re-balancing. If you’re gathering metrics from your application, or its runtime, that show when restarts occur, then correlating this with the broker metrics can provide further confirmation that re-balancing is an issue for you.

Avoid unnecessary application restarts when, for example, an application crashes. If you are experiencing stability issues with your application then this can lead to much more frequent re-balancing than anticipated. Searching application logs for common error messages emitted by an application crash, for example stack traces, can help identify how frequently problems are occurring and provide information helpful for debugging the underlying issue.

Are you using the best re-balancing algorithm for your application? At the time of writing, the gold standard is the “CooperativeStickyAssignor”; however, the default (as of Kafka 3.0) is to use the “RangeAssignor” (an earlier assignment algorithm) in preference to the cooperative sticky assignor. The Kafka documentation describes the migration steps required for your clients to pick up the cooperative sticky assignor. It is also worth noting that while the cooperative sticky assignor is a good all-round choice, there are other assignors tailored to specific use cases.
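
Opting in is a one-line consumer configuration change, sketched below; follow the migration steps in the Kafka documentation before rolling it out to an existing consumer group.

    Properties props = new Properties();
    props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
              CooperativeStickyAssignor.class.getName());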

Are the members of a consumer group fixed? For example, perhaps you always run 4 highly available and distinct instances of an application. You might be able to take advantage of Kafka’s static group membership feature. By assigning unique IDs to each instance of your application, static group membership allows you to side-step re-balancing altogether.
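
A configuration sketch for static membership is shown below; the group name and the environment variable supplying each instance’s unique ID are assumptions about how your deployment is set up.

    Properties props = new Properties();
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments");
    // Each instance supplies a stable, unique ID, e.g. "payments-consumer-1" .. "payments-consumer-4".
    props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, System.getenv("INSTANCE_ID"));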

Commit the current offset when a partition is revoked from your application instance. Kafka’s consumer client provides a listener for re-balance events. If an instance of your application is about to have a partition revoked from it, the listener provides the opportunity to commit an offset for the partition that is about to be taken away. The advantage of committing an offset at the point the partition is revoked is that it ensures whichever group member is assigned the partition picks up from this point, rather than potentially re-processing some of the messages from the partition.
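
Below is a sketch of this pattern; currentOffsets is a map that your processing loop is assumed to keep up to date with the next offset to read for each partition, and the topic name is illustrative.

    Map<TopicPartition, OffsetAndMetadata> currentOffsets = new HashMap<>();

    consumer.subscribe(Collections.singleton("example-topic"), new ConsumerRebalanceListener() {
        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            // Commit synchronously: revocation doesn't complete until this returns, so the
            // partition's next owner picks up from exactly this point.
            Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata offset = currentOffsets.get(tp);
                if (offset != null) {
                    toCommit.put(tp, offset);
                }
            }
            consumer.commitSync(toCommit);
        }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            // Nothing extra is needed here for this sketch.
        }
    });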

What’s Next?

You’re now an expert in scaling Kafka applications. You’re invited to put these points into practice and try out the fully-managed Kafka offering on IBM Cloud. For any challenges in setup, see the Getting Started Guide and FAQs.

Learn more about Kafka and its use cases

Explore Event Streams on IBM Cloud

Event Streams for IBM Cloud Engineer
