AIOps

The central functions of our AIOps platforms include:

Ingestion

An AIOps platform can ingest, index, and normalize events or telemetry from multiple domains, vendors, or sources, including infrastructure, networks, apps, the cloud, or existing monitoring tools (for cross-domain analysis). The platform must also enable data analytics using machine learning at a minimum of two points:

  • Real-time analysis at the point of ingestion (streaming analytics)
  • Historical analysis of stored data
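
As a rough illustration of the two analysis points, the sketch below computes the same statistics twice: incrementally as samples arrive (streaming) and in one pass over stored data (historical). The class name RunningStats and the sample values are assumptions made for this example, not part of any particular AIOps product.

```python
import statistics


class RunningStats:
    """Streaming mean and variance via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one new sample into the running statistics.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of the samples seen so far.
        return self._m2 / self.n if self.n else 0.0


# Hypothetical latency samples (milliseconds).
samples = [12.0, 15.0, 11.0, 14.0, 13.0]

# Streaming analysis: one sample at a time, at the point of ingestion.
stream_stats = RunningStats()
for value in samples:
    stream_stats.update(value)

# Historical analysis: the same samples read back from storage as a batch.
batch_mean = statistics.fmean(samples)
batch_variance = statistics.pvariance(samples)
```

Both paths arrive at the same mean and variance; the streaming path simply never needs the full history in memory.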

Topology

AIOps platforms discover and assemble a unified topology of IT assets, including applications, across domains. Topology can include physical proximity, logical dependence, or another dimension that captures the relationship between IT assets and services.
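
One way to picture a logical-dependence topology is as a directed graph that can be queried for impact. The following is a minimal sketch under that assumption; the asset names, the DEPENDS_ON edge list, and the impacted_by function are all hypothetical.

```python
from collections import deque

# Hypothetical dependency edges, written as (dependent, dependency).
DEPENDS_ON = [
    ("checkout-svc", "payments-api"),
    ("payments-api", "db-1"),
    ("search-svc", "cache-1"),
    ("checkout-svc", "cache-1"),
]


def build_dependents(edges):
    """Map each asset to the set of assets that depend on it directly."""
    dependents = {}
    for dependent, dependency in edges:
        dependents.setdefault(dependency, set()).add(dependent)
    return dependents


def impacted_by(asset, edges):
    """Return every asset that directly or transitively depends on `asset`."""
    dependents = build_dependents(edges)
    seen = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen
```

With these edges, a fault on db-1 reaches payments-api and, through it, checkout-svc.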

Correlation

The AIOps platform correlates and compresses events across telemetry domains or sources, reducing unnecessary human intervention. The correlation combines time and topology to group related events.
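
The idea of combining time and topology can be sketched as follows. This is a deliberately simple heuristic, not a description of how any specific platform correlates events: an event joins an existing group when it arrives within a time window of the group's latest event and its asset is the same as, or adjacent to, an asset already in the group. The edge list, event fields, and 60-second window are assumptions.

```python
from collections import defaultdict

# Hypothetical topology: undirected links between assets.
EDGES = [("web-1", "app-1"), ("app-1", "db-1"), ("web-2", "app-2")]

# Hypothetical events; `ts` is seconds since an arbitrary start.
EVENTS = [
    {"ts": 0, "asset": "db-1", "msg": "disk latency high"},
    {"ts": 10, "asset": "app-1", "msg": "query timeouts"},
    {"ts": 20, "asset": "web-1", "msg": "HTTP 500 rate up"},
    {"ts": 25, "asset": "web-2", "msg": "certificate expiring"},
    {"ts": 500, "asset": "db-1", "msg": "disk latency high"},
]


def build_adjacency(edges):
    """Build an undirected adjacency map from the edge list."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def correlate(events, edges, window_seconds=60):
    """Group events that are close in time and topologically related."""
    adjacency = build_adjacency(edges)
    groups = []
    for event in sorted(events, key=lambda e: e["ts"]):
        for group in groups:
            close_in_time = event["ts"] - group[-1]["ts"] <= window_seconds
            related = any(
                event["asset"] == member["asset"]
                or event["asset"] in adjacency[member["asset"]]
                for member in group
            )
            if close_in_time and related:
                group.append(event)
                break
        else:
            # No existing group matched, so this event starts a new one.
            groups.append([event])
    return groups


groups = correlate(EVENTS, EDGES)
```

Here the five raw events compress into three groups: the db-1, app-1, and web-1 events form one incident, while the unrelated web-2 event and the much later db-1 event stand alone.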

Recognition

An AIOps platform processes the event and telemetry data to detect or predict important events or incidents. The platform continually learns and refines individual patterns of important events from operator input and reinforcement mechanisms.

Remediation

The AIOps platform continuously learns and improves associations between each important event and the operations response by either explicit operator specification or by observation. Analytics also facilitates automated insights, eases root cause determination, and enables automated actions for resolving identified issues.
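
Learning associations "by observation" can be illustrated with a very small frequency model: record which action operators take for each type of event, then suggest the most common action once it has been seen often enough. The ResponseLearner class, the event and action names, and the threshold of three observations are all hypothetical choices for this sketch.

```python
from collections import Counter, defaultdict


class ResponseLearner:
    """Learn event-to-response associations from observed operator actions."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def observe(self, event_type, action):
        # Record that an operator responded to `event_type` with `action`.
        self._counts[event_type][action] += 1

    def suggest(self, event_type, min_observations=3):
        # Suggest the most frequent action, but only with enough evidence.
        counts = self._counts.get(event_type)
        if not counts:
            return None
        action, seen = counts.most_common(1)[0]
        return action if seen >= min_observations else None


learner = ResponseLearner()
for _ in range(3):
    learner.observe("disk_full", "purge_tmp")
learner.observe("disk_full", "expand_volume")
learner.observe("oom_kill", "restart_service")
```

After these observations the learner suggests purge_tmp for disk_full events, and declines to suggest anything for oom_kill, which it has seen only once.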

Our platform follows a three-stage framework:

  • Data ingestion and handling
  • Machine learning (ML) analytics
  • Remediation

Data Ingestion and Handling

AIOps platforms must be able to ingest data at rest (historical) and data in motion (real-time, streaming). These platforms allow for the ingestion, indexing, and storage of logs, event data, metrics, traces, and graph and document data.
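
Normalization during ingestion can be pictured as mapping each source's record layout onto one common schema. The sketch below assumes two invented source formats ("vendor_a" and "vendor_b") with invented field names and severity scales; real sources will differ.

```python
from datetime import datetime, timezone

# Hypothetical severity scales for the two invented sources.
VENDOR_A_SEVERITY = {1: "critical", 2: "warning", 3: "info"}
VENDOR_B_SEVERITY = {"CRITICAL": "critical", "WARN": "warning", "INFO": "info"}


def normalize(record, source):
    """Map a source-specific record onto a common event schema."""
    if source == "vendor_a":
        # Assumed layout: epoch seconds, numeric severity, 'host' and 'text'.
        return {
            "timestamp": datetime.fromtimestamp(record["epoch"], tz=timezone.utc),
            "asset": record["host"],
            "severity": VENDOR_A_SEVERITY[record["sev"]],
            "message": record["text"],
        }
    if source == "vendor_b":
        # Assumed layout: ISO 8601 time with offset, string level.
        return {
            "timestamp": datetime.fromisoformat(record["time"]),
            "asset": record["resource"],
            "severity": VENDOR_B_SEVERITY[record["level"]],
            "message": record["msg"],
        }
    raise ValueError(f"unknown source: {source}")


event_a = normalize(
    {"epoch": 1704067200, "host": "db-1", "sev": 1, "text": "disk full"},
    "vendor_a",
)
event_b = normalize(
    {
        "time": "2024-01-01T00:00:00+00:00",
        "resource": "db-1",
        "level": "CRITICAL",
        "msg": "disk full",
    },
    "vendor_b",
)
```

Once both records share a schema, the two reports of the same condition become directly comparable, which is what makes cross-domain indexing and correlation possible.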

Machine Learning Analytics

AIOps platforms use the following types of analytic approaches:

  • Statistical, probabilistic analysis. A combination of univariate and multivariate analysis, including the use of correlation, clustering, classifying, and extrapolation on metrics captured across IT entities.
  • Automated pattern discovery and prediction. Discovering patterns, clusters, or groups that implicitly describe correlations in historical and/or streaming data. These patterns may then be used to predict incidents with varying degrees of probability.
  • Anomaly detection. Using the patterns discovered by the previous components to determine normal behavior and then to discern departures from that normal behavior, both univariate and multivariate. Detecting outliers is not enough on its own: to be fully useful, and not just create more alert noise, anomalies must be correlated with business impact and with other concurrent processes such as release management (see Augment Decision Making in DevOps Using AI Techniques).
  • Probable cause determination. Pruning down the network of correlations established by the automated pattern discovery and ingestion of graph data to define causality chains linking cause and effect.
  • Topological analysis. AIOps platforms may use the application, network, infrastructure, or other topologies to provide contextualized analysis. Deriving patterns from data within a topology will establish relevancy and illustrate hidden dependencies. Using topology as part of causality determination can greatly increase its accuracy and effectiveness.
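
As a small illustration of the univariate anomaly detection described in this list, the sketch below learns a baseline (mean and standard deviation) from historical values and flags new values that depart from it by more than a chosen number of standard deviations. The function names, sample data, and threshold of 3.0 are assumptions; production systems typically use more robust models that account for seasonality and trend.

```python
import statistics


def fit_baseline(history):
    """Summarize normal behavior as (mean, population standard deviation)."""
    return statistics.fmean(history), statistics.pstdev(history)


def is_anomaly(value, baseline, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean, std = baseline
    if std == 0:
        # A perfectly flat history: any different value is a departure.
        return value != mean
    return abs(value - mean) / std > threshold


# Hypothetical response times (milliseconds) observed during normal operation.
history = [100, 102, 98, 101, 99]
baseline = fit_baseline(history)
```

With this baseline, a reading of 101 falls within normal variation while a reading of 120 is flagged. As the list above notes, a flag like this is only a starting point until it is tied to business impact and concurrent changes.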

Remediation

As the technology matures, users will be able to leverage prescriptive advice from the platform, enabling the action stage. The steps for this are shown in Figure 4.

The Future of AI-Assisted Automation — Triage and Remediation of Problems

Time to Value

A common complaint among Gartner clients is that the length of time required to deploy, configure and receive value from an AIOps solution may be as long as six months and, in extreme cases, up to two years. AIOps is an emerging technology, which means that best practices in the area are still evolving. But organizations are reluctant to invest in a product when the potential payoff is so distant on the time horizon.

To counteract this, vendors are responding with initiatives to speed up deployments. These include:

  • Moving to SaaS-based deployments
  • Improving out-of-the-box integrations for common interfaces
  • Building repeatable workflows into the system based on field-tested best practices
  • Reducing the number of false positives generated by the system

AIOps in DevOps

As part of the general “shift left” trend, in which IT operations tools merge with the DevOps pipeline, early adopters are experimenting with AIOps earlier in the development pipeline. Together with the increasing use of automation, AI is helping developers deliver software more quickly and securely, and that software is easier to manage in production. Examples of AIOps in the DevOps pipeline are shown in Figure 5.

Applying AIOps Platforms across a Spectrum of Use Cases over the Life Cycle of an Application