My esteemed colleague Tony Jones discussed the merits of machine learning in a recent blog, and now it’s my turn to tackle the subject from an Operations perspective! First, I think the best way to look at this technology is to explain a little about the changes happening at MediaKind itself. We’re rapidly extending operations in the public cloud, enabling us to run both 24/7 and event-based channels that may exist only for a matter of hours. For example, at sports events where the broadcast is required only for the duration of the game, the whole video processing chain can be run in the public cloud. That approach means we can spin channels up rapidly, and infrastructure costs are limited to when the customer needs the broadcast.
However, it comes at a cost. There’s greater pressure on the MediaKind team to get telemetry off the system and analyze it at the necessary rate, both in real time while events are happening and retrospectively, which means storing telemetry for later analysis. And there is a lot of telemetry coming out of the video processing chain, spanning encoding, packaging, networks, protocols, adapters, and so on. Every component produces essential data and detailed insight into the various buffers and the potential paths of each service.
But if we want real-time analysis, there needs to be a quicker way to identify problem components, so that when an alert comes up it’s easy to understand and rectify the fault. Or, if the problem lies outside MediaKind’s domain (e.g., in the source video), we want to identify that quickly and notify the customer of an issue on their end. In this scenario, every second matters, and it’s just not practical for our team to spend a long time trawling through unnecessary data.
The importance of Grafana Machine Learning for MediaKind
The ongoing challenge for our industry has been to render the data in the most concise way possible, and that’s why the Grafana Machine Learning tool has been useful for our team. To date, there have simply been too many graphs for skilled people to scan daily. Grafana Machine Learning uses an open-source forecasting algorithm called Prophet. It’s a powerful algorithm that can look back over the history of the data and detect both a trend and seasonality, i.e. regular but normal peaks and troughs in the data over a day or a week. It can then flag a spike whenever the data goes outside that normal envelope, even for just a few seconds.
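To make that idea concrete, here is a minimal sketch in Python using the open-source Prophet library directly. The metric file, column names, and interval width are illustrative assumptions, not a description of how Grafana Machine Learning is configured internally; the point is simply how a learned trend-plus-seasonality envelope turns raw telemetry into a short list of anomalies.

```python
# Minimal sketch: learn a "normal envelope" for one channel's telemetry and
# flag samples that fall outside it. The CSV name and columns are assumptions.
import pandas as pd
from prophet import Prophet

# Historical telemetry for one channel: Prophet expects a 'ds' (timestamp)
# column and a 'y' (metric value) column, e.g. encoder output bitrate samples.
history = pd.read_csv("channel_bitrate.csv", parse_dates=["ds"])

# Fit a model that learns the overall trend plus daily/weekly seasonality,
# producing a normal envelope (yhat_lower..yhat_upper) around the forecast.
model = Prophet(interval_width=0.95, daily_seasonality=True, weekly_seasonality=True)
model.fit(history)

# Score the observed samples against the learned envelope.
forecast = model.predict(history[["ds"]])
joined = history.merge(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]], on="ds")

# Any sample outside the envelope is flagged as an anomaly, even a brief spike.
anomalies = joined[(joined["y"] < joined["yhat_lower"]) | (joined["y"] > joined["yhat_upper"])]
print(anomalies[["ds", "y", "yhat_lower", "yhat_upper"]])
```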
Whilst a single channel could be monitored against a fixed limit, when you’re monitoring multiple channels, particularly for live sports content like the Olympic Games, which could be broadcasting multiple events simultaneously, it’s very hard to write a static function that can simply unravel that data. Grafana can create customized filters automatically for all the channels we monitor, which can be anywhere around 100 at any one time, as sketched below.
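For scale, the same idea extends naturally: fit one model per channel so each gets its own learned baseline rather than one static threshold shared by all. This sketch is purely illustrative; the channel IDs and the `load_channel_metrics` helper are hypothetical placeholders, and in practice Grafana Machine Learning creates and refreshes these per-channel models for us.

```python
# Illustrative sketch only: one Prophet model per channel, so each channel gets
# its own learned baseline instead of a single static limit shared by all.
# load_channel_metrics() and the channel IDs are hypothetical placeholders.
from prophet import Prophet

def detect_anomalies(history):
    """Fit a per-channel model and return samples outside its normal envelope."""
    model = Prophet(interval_width=0.95)
    model.fit(history)                       # history has 'ds' and 'y' columns
    forecast = model.predict(history[["ds"]])
    merged = history.merge(forecast[["ds", "yhat_lower", "yhat_upper"]], on="ds")
    return merged[(merged["y"] < merged["yhat_lower"]) |
                  (merged["y"] > merged["yhat_upper"])]

channels = [f"channel-{i:03d}" for i in range(100)]    # ~100 channels at any one time
for channel in channels:
    history = load_channel_metrics(channel)            # hypothetical telemetry loader
    anomalies = detect_anomalies(history)
    if not anomalies.empty:
        print(f"{channel}: {len(anomalies)} samples outside the normal envelope")
```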
Machine learning can be a complex area with many parameters to tune, but conceptually and practically, Grafana Machine Learning is very easy to understand. The biggest gain in using tools such as this lies in observing and detecting anomalies and filtering the incoming data so that we present less of it to the human operator. That, in turn, enables the operator to diagnose a problem and make a decision more quickly. It’s a handy and exciting tool for intelligent anomaly detection, and essentially, it means we can achieve a higher level of reliability.
In the video below, I discuss how machine learning is helping both MediaKind and the wider industry to spot the root causes of issues and resolve problems quickly. The discussion also addresses the recent impact of machine learning in our 2021 global hackathon, my biggest success stories around machine learning and AI, and some (possibly ill-advised) predictions for the future of these technologies…