Information Technology (IT) Operations management is a vexing problem for most companies that rely on IT systems for mission-critical business applications.
Despite the best intentions of engineers, good designs, and solid development practices, software and hardware systems deployed in companies in service of critical business applications are susceptible to outages, resulting in millions of dollars in labor, revenue loss, and customer satisfaction issues each year.
Artificial intelligence (AI) can help solve these problems. AI can help IT operations managers, also known as Site Reliability Engineers (SREs), detect issues early, predict them before they occur, reduce event and alert noise, locate the specific application or infrastructure component that is the source of the issue, determine the scope of incident impact, and recommend relevant and timely actions.
All these analytics help reduce the mean times to detect (MTTD), identify/isolate (MTTI) and resolve (MTTR) an incident. This, in turn, saves millions of dollars by preventing direct costs (lost revenue, penalties, opportunity costs, etc.) and indirect costs (customer dissatisfaction, lost customers, lost references, etc.).
Fortunately, Watson AIOps can do all of this for you! Figure 1 shows the various types of data that Watson AIOps can process. In this article, we describe the AI in our Watson AIOps solution:
Figure 1: Input Data Types
Log anomaly prediction
IBM Watson AIOps’ state-of-the-art, multi-patent-pending log anomaly detection technology automatically parses IT application and infrastructure logs from log aggregation tools (e.g., Splunk, Humio, LogDNA, and Logstash) to detect anomalies in real time. It detects them much sooner than traditional thresholding-based or error-string-matching alerting techniques can, thereby significantly reducing the mean time to diagnose an incident. We use deep learning algorithms both to prepare features from logs during log parsing and to make anomaly predictions. Users don’t have to set static thresholds or write manual rules to detect anomalies.
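To make the idea of parsing logs into features concrete, here is a minimal Python sketch. It is not the deep learning approach described above: it is a much simpler stand-in that masks variable fields (IP addresses and numbers) to turn each log line into a template, then flags lines whose template never appeared in a baseline of normal logs. The function names and sample log lines are hypothetical.

```python
import re
from collections import Counter

def to_template(line):
    """Mask variable fields so that similar log lines share one template."""
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<IP>", line)  # IPv4 addresses
    line = re.sub(r"\b\d+\b", "<NUM>", line)                     # remaining numbers
    return line

def learn_baseline(normal_lines):
    """Count how often each template occurs in logs assumed to be normal."""
    return Counter(to_template(line) for line in normal_lines)

def unseen_lines(baseline, window):
    """Return the lines in a new window whose template is not in the baseline."""
    return [line for line in window if to_template(line) not in baseline]

# Hypothetical sample data, for illustration only.
normal_logs = [
    "GET /orders 200 in 12 ms from 10.0.0.5",
    "GET /orders 200 in 15 ms from 10.0.0.7",
    "cache hit for key 881",
]
new_window = [
    "GET /orders 200 in 14 ms from 10.0.0.9",
    "connection refused to db at 10.0.0.12",
]

baseline = learn_baseline(normal_logs)
suspicious = unseen_lines(baseline, new_window)
print(suspicious)
```

Note that no threshold or error string is configured here; the baseline is learned from the data. A learned model would go further, for example by scoring unusual template frequencies or sequences.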
Metric anomaly prediction
Watson AIOps’ metric-based anomaly detection analyzes metrics data from various systems (e.g., New Relic, AppDynamics, and SolarWinds) to automatically learn the normal behavior of metrics in your company and detect anomalies in those metrics. It employs a set of time-tested time-series algorithms (e.g., Granger Causality, Robust Bounds, Variant/Invariant, Finite Domain, and Predominant Range) to capture seasonality and significant trends and to perform forecasting.
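As an illustration of learning normal behavior from data rather than setting static thresholds, the following Python sketch derives bounds from a metric's history using the median and the median absolute deviation (MAD). This is a generic robust-statistics technique chosen for the example, not the Robust Bounds algorithm in Watson AIOps, and it ignores seasonality and trend. The function names, the multiplier `k`, and the sample values are assumptions.

```python
from statistics import median

def learn_bounds(history, k=3.0):
    """Derive (lower, upper) bounds from the median and MAD of past values."""
    med = median(history)
    mad = median([abs(v - med) for v in history])
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    spread = 1.4826 * mad
    return med - k * spread, med + k * spread

def find_anomalies(history, new_points, k=3.0):
    """Return (index, value) pairs for new points outside the learned bounds."""
    lower, upper = learn_bounds(history, k)
    return [(i, v) for i, v in enumerate(new_points) if v < lower or v > upper]

# Hypothetical response-time samples in milliseconds.
history = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98]
new_points = [101, 99, 180, 100]

anomalies = find_anomalies(history, new_points)
print(anomalies)  # [(2, 180)]
```

The median and MAD are used instead of the mean and standard deviation so that a few past outliers in the history do not widen the bounds.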
Event grouping and classification
An event indicates that something noteworthy has happened in an IT operations environment, for example, an application has become unavailable or a disk is full or nearing capacity. The goal of event grouping and classification is to reduce the noise for IT operations management personnel and help them focus on the few important events that need their immediate attention. In Watson AIOps, anomalies detected from metrics, logs, and tickets are grouped using multiple algorithms (e.g., Temporal, Spatial, and Association Rule mining).
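The temporal flavor of grouping can be sketched in a few lines of Python. This is an illustrative simplification, not the Watson AIOps implementation: events are sorted by time, and an event joins the current group when it arrives within `max_gap` seconds of the previous one. The `Event` class, the `max_gap` parameter, and the sample events are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: float  # seconds since some reference point
    resource: str
    message: str

def group_by_time(events, max_gap=60.0):
    """Group events whose gap to the previous event is at most max_gap seconds."""
    groups = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if groups and event.timestamp - groups[-1][-1].timestamp <= max_gap:
            groups[-1].append(event)   # close in time: same group
        else:
            groups.append([event])     # quiet period passed: start a new group
    return groups

# Hypothetical events, for illustration only.
events = [
    Event(0, "db-1", "disk usage above 90%"),
    Event(20, "app-1", "slow queries"),
    Event(45, "web-1", "HTTP 500 rate rising"),
    Event(600, "cache-1", "evictions spike"),
]

groups = group_by_time(events)
group_sizes = [len(g) for g in groups]
print(group_sizes)  # [3, 1]
```

Four raw events collapse into two groups, so an SRE would look at two items instead of four. Spatial grouping would additionally use where the resources sit in the topology, and association rule mining would use which events have tended to occur together in the past.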
Static and dynamic topology management
Application and network topology refers to a map or diagram that lays out the connections between different mission-critical applications in an enterprise. Static topology refers to a map that is constructed from build and deployment information about applications and infrastructure components. Dynamic topology, on the other hand, refers to a map that captures resources and their relationships as the environment changes at run time and provides near-real-time visibility into them.
With Topology Manager in Watson AIOps, you can compare the current topology with a historical one to answer questions such as “What happened?” and “What's happening now?” It helps you investigate the details that led up to an incident and see the topology (and status) changes over time. In addition, faults are localized on the topology.
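Comparing a historical topology with the current one can be pictured as a diff between two graph snapshots. The Python sketch below assumes each snapshot is a set of (source, target) connections; that representation, the function name, and the resource names are hypothetical and do not reflect the Topology Manager data model.

```python
def diff_topology(before, after):
    """Compare two snapshots, each a set of (source, target) connections."""
    return {
        "added": sorted(after - before),    # connections that appeared
        "removed": sorted(before - after),  # connections that disappeared
    }

# Hypothetical snapshots taken before and during an incident.
historical = {("web", "api"), ("api", "db-primary")}
current = {("web", "api"), ("api", "db-replica")}

changes = diff_topology(historical, current)
print(changes)
```

In this example the diff shows that the api component stopped talking to db-primary and started talking to db-replica, which is the kind of change that helps answer “What happened?”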
Fault localization and blast radius
Entity mentions are the names of the resources (e.g., service or application component names, server names, server IP addresses, pod IDs, node IDs, etc.) that are referenced in anomalous logs, alerts, tickets, and events. Once events are grouped, entity mentions in anomalous logs, metrics, alerts, and events are extracted. These entities are resolved with topological resources to isolate the problem and to place the identified entities on the corresponding dynamic topology instances that match the time at which the mentions were noted. Traversing the topological graph in the application, infrastructure, and network layers enables us to map out the impacted components, known as the blast radius.
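The graph traversal behind a blast radius can be sketched as a breadth-first search. The Python example below assumes the topology is available as a mapping from each resource to the resources that depend on it; that structure, the function name, and the resource names are hypothetical, and a real topology would span application, infrastructure, and network layers.

```python
from collections import deque

def blast_radius(dependents, faulty):
    """Breadth-first traversal collecting everything downstream of a faulty resource."""
    impacted = set()
    queue = deque([faulty])
    while queue:
        node = queue.popleft()
        for nxt in dependents.get(node, []):
            if nxt != faulty and nxt not in impacted:
                impacted.add(nxt)
                queue.append(nxt)
    return impacted

# Hypothetical topology: key -> resources that depend on the key.
dependents = {
    "db-1": ["orders-svc", "billing-svc"],
    "orders-svc": ["web-frontend"],
    "billing-svc": ["web-frontend"],
    "cache-1": ["search-svc"],
}

radius = blast_radius(dependents, "db-1")
print(sorted(radius))  # ['billing-svc', 'orders-svc', 'web-frontend']
```

Here a fault localized to db-1 impacts the two services that use it and the frontend that uses them, while search-svc is outside the blast radius because it depends only on cache-1.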
Next-best-action recommendations
Watson AIOps ingests and mines prior incident ticket data by connecting to tools such as ServiceNow to provide timely and relevant next-best-action recommendations for the currently diagnosed problem. Current incident symptoms are framed as a query against the indexed ticket data. The query retrieves the top k relevant prior incident records, and important entity-action (aka noun-verb) phrases are extracted from each relevant record so that SREs can get a quick glimpse of the suggested action. We apply various natural language processing techniques, including rule-based systems, to extract entity and action phrases.
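The two steps, retrieving the top k similar tickets and pulling out entity-action phrases, can be sketched in Python as follows. This is a deliberately simple stand-in: similarity is word overlap (Jaccard) rather than a real search index, and extraction is a single hand-written rule. The ticket fields, the verb list, and the sample tickets are all hypothetical.

```python
import re

def tokens(text):
    """Lowercase word set used for a simple overlap-based similarity."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def top_k_tickets(query, tickets, k=2):
    """Rank prior tickets by similarity of their symptom text to the query."""
    ranked = sorted(tickets, key=lambda t: jaccard(query, t["symptom"]), reverse=True)
    return ranked[:k]

# One hand-written rule: an action verb followed by the entity it acts on.
ACTION_RULE = re.compile(r"\b(restarted|increased|cleared)\s+(?:the\s+)?([\w-]+)", re.IGNORECASE)

def extract_actions(resolution):
    """Return (action, entity) pairs found in a ticket's resolution text."""
    return [(verb.lower(), entity) for verb, entity in ACTION_RULE.findall(resolution)]

# Hypothetical prior incident tickets.
tickets = [
    {"id": "INC001", "symptom": "payments service pod crash loop out of memory",
     "resolution": "Restarted payments-pod and increased memory limit"},
    {"id": "INC002", "symptom": "database connection pool exhausted timeouts",
     "resolution": "Increased connection-pool size"},
    {"id": "INC003", "symptom": "disk full on payments log server",
     "resolution": "Cleared old-logs on the server"},
]

top = top_k_tickets("payments pod out of memory errors", tickets, k=2)
top_ids = [t["id"] for t in top]
actions = extract_actions(top[0]["resolution"])
print(top_ids)   # ['INC001', 'INC003']
print(actions)   # [('restarted', 'payments-pod'), ('increased', 'memory')]
```

The extracted pairs are the quick glimpse an SRE would see first, with the full ticket available as supporting evidence.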
Insight delivery and action implementation
In Watson AIOps, all of the insights described above are delivered via both ChatOps and dashboards. Real-time, in-the-moment insights are delivered via ChatOps to SREs directly in the place where they work. Within ChatOps, SREs can explore the evidence behind the insights and share selected incident resolution suggestions with other collaborators. From ChatOps, SREs can launch log, metric, and ticket monitoring tools to explore further details. Similarly, SREs can launch interactive dashboards for detailed exploration of events, event groups, metric anomalies, and topology. Applicable actions/runbooks can then be automatically run via Runbook execution.
A note on AI model lifecycle management
Watson AIOps takes a representative set of metrics, logs, and ticket data for training and building its unsupervised models. These models are set up to learn continuously by using up-to-date data from your environment and to improve based on user feedback. All AI models are transparent and explainable to earn your trust:
Figure 2: AI pipelines in Watson AIOps.
What’s next for Watson AIOps?
In the next generations of our Watson AIOps solution, we envision fully instrumented, observable, self-aware, automated, and autonomic IT operations environments. AIOps solutions will not only help resolve issues in a reactive mode, but also help prevent issues from happening in the first place by designing the Development-Security-Operations (DevSecOps) lifecycle activities for efficient operations right from the get-go. For example, smart checks and gates can prevent risky deployments from getting pushed to production, stop under-tested code modules from getting into deployment phases, and block code with risky security vulnerabilities from getting to the deployment phase. We can’t wait to shape the future and take you all with us on this journey.
Figure 3: Shifting left in the DevSecOps lifecycle while closing the loop with virtuous feedback and feedforward cycles for efficient operations management.