Why Do You Need Better Kafka Monitoring for Your Data Engineering Team?

Marc is the Head of Data Engineering at a large enterprise. The company has just launched a set of new cloud services that are in high demand. Marc’s Data Engineering department depends heavily on Apache Kafka streaming to deliver critical services to its customers. He is responsible for his team meeting its SLAs, for the productivity of the engineers reporting to him, and for keeping engineering attrition to a minimum.


Recently, the engineers in Marc’s group have started complaining about spending too much time setting up and tuning their Kafka clusters. They are also often overwhelmed by the volume of Kafka monitoring data, which dramatically impacts their productivity. The developers are using an open-source platform to gain insight into their Kafka streaming data, but the platform often fails, and they end up losing valuable data. Marc is now under pressure to prove that his decision to adopt Kafka for handling streaming data was the right one.


Every new Kafka-related incident adds to his stress, as he is concerned about the next meeting with the executives who approved his budget and supported his choice of architecture. On top of all these problems, there is the stress that naturally comes with a fast-paced engineering environment. Marc is always on the hook to deliver, and he has even started having trouble sleeping.


If only he could ensure that app developers were more productive. Maybe the developers could start creating their own Kafka dashboards? What if the Data Engineering team had granular, 24/7 visibility that could help them optimize their Kafka clusters? Would that be a move in the right direction? It could take the form of customized Kafka dashboards that track how much storage space remains available for each topic partition while the team fine-tunes time-based and size-based retention settings. Besides, wouldn’t it be awesome to add analytics that look for anomalies in metrics and logs? And if the developers had access to tracing instrumentation, it would help them obtain meaningful Kafka consumer lag data points.
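To make the consumer lag idea concrete, here is a minimal sketch in Python. It assumes the team has already collected, for each topic partition, the log end offset and the consumer group's last committed offset; the function name, the data layout, and the sample "orders" topic are hypothetical and not tied to any particular monitoring product or Kafka client. Lag per partition is simply the end offset minus the committed offset.

```python
from typing import Dict, Tuple

# A topic partition is identified here by (topic name, partition number).
TopicPartition = Tuple[str, int]


def consumer_lag(
    end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return the lag (unconsumed messages) for every partition.

    Simplifying assumption: a partition with no committed offset is
    counted from offset 0, so its whole log is reported as lag.
    """
    lag = {}
    for tp, end in end_offsets.items():
        committed = committed_offsets.get(tp, 0)
        # Clamp at zero in case the two snapshots were taken at
        # slightly different moments.
        lag[tp] = max(end - committed, 0)
    return lag


# Hypothetical snapshot for a topic named "orders" with two partitions.
end_offsets = {("orders", 0): 120, ("orders", 1): 75}
committed_offsets = {("orders", 0): 100}

lag = consumer_lag(end_offsets, committed_offsets)
total_lag = sum(lag.values())
print(lag, total_lag)
```

A dashboard would typically plot these per-partition values over time; a lag that keeps growing suggests that consumers are falling behind producers.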


Marc might also think about adding alerts based on the number of partitions that fail to replicate, which would help proactively isolate issues before the engineering team misses its SLOs. With all these real-time insights from streaming data, Marc would be able to prove that the outages that plagued the team recently were just hard-to-predict deficiencies in the code, coupled with some misconfiguration issues. None of these outages could be considered a reason to doubt his choice of Kafka as the preferred streaming technology.
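As a rough illustration of such an alert, the sketch below assumes that "failed replicated partitions" means partitions whose in-sync replica (ISR) set is smaller than their assigned replica set, which is how Kafka defines an under-replicated partition. The `PartitionState` class, the `should_alert` function, the threshold, and the sample data are all hypothetical; in practice the partition metadata would come from whatever monitoring source the team already uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PartitionState:
    topic: str
    partition: int
    replicas: List[int]  # broker ids assigned to hold a copy
    isr: List[int]       # broker ids currently in sync with the leader


def under_replicated(partitions: List[PartitionState]) -> List[PartitionState]:
    """Return the partitions that have fewer in-sync replicas than assigned replicas."""
    return [p for p in partitions if len(p.isr) < len(p.replicas)]


def should_alert(partitions: List[PartitionState], threshold: int = 0) -> bool:
    """Alert when more than `threshold` partitions are under-replicated."""
    return len(under_replicated(partitions)) > threshold


# Hypothetical cluster snapshot: one healthy and one under-replicated partition.
snapshot = [
    PartitionState("orders", 0, replicas=[1, 2, 3], isr=[1, 2, 3]),
    PartitionState("orders", 1, replicas=[1, 2, 3], isr=[2]),
]

affected = under_replicated(snapshot)
print([(p.topic, p.partition) for p in affected], should_alert(snapshot))
```

A threshold of zero alerts on any under-replicated partition; a team that sees brief dips during rolling restarts might prefer a higher threshold or a minimum duration before paging someone.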


Marc hopes that, with the right observability, his team will be able to exceed its SLAs and stay on top of any imminent or systemic app issues. With real-time monitoring, his team would no longer be overwhelmed by data and could become more proactive, which would free up more time for innovation. He is proud to have built a team of dedicated engineers who love working on features that contribute to the success of the business. Marc is hopeful that his team’s achievements will reduce the risk of engineers jumping ship to that glitzy startup everyone is talking about.


For Marc’s engineering team, the high availability of the Kafka clusters is essential. All his engineers participate in a demanding on-call rotation, and they dread the weekend and 2 AM wake-up calls. If only his Kafka environment were a little more stable. Continuous monitoring of the clusters while they are being tuned would help reduce the burden on the already stretched engineers.


Marc’s passion for building high-performing engineering teams with a positive culture, together with his success in scaling the team, makes him the right candidate for a promotion. To be promoted, he needs to demonstrate that his engineering team is capable of delivering a reliable Kafka infrastructure that meets customers’ growing demand for cloud services.


Marc realizes that an investment in monitoring could free the engineers from tedious daily maintenance tasks. As the newly launched cloud services become more and more successful, new Kafka clusters need to be spun up quickly, which means more maintenance. For instance, without monitoring it may become hard to track all the unpatched hosts. Marc must work with his engineering managers to find ways to monitor and then automate such maintenance tasks.


Marc has risen through the ranks because of his ability to find order in chaos and his love of solving tough problems and driving innovation. He smiles, knowing that he has figured out how to remove the last obstacle on his road to promotion: 24/7 Kafka monitoring. Happily, he drifts back to sleep after the latest ‘all hands on deck’ escalation call.

