
Root-Cause Analysis (RCA)
Benefited IT Teams


Challenges
In the cloud-centric virtualized world, IT operations teams are increasingly overwhelmed with Root-Cause Analysis (RCA) of application-delivery infrastructures.
RCA is becoming too complex with the increase of collected data volumes and types. Traditional IT methods are turning out to be limited in analyzing such a large volume of heterogeneous infrastructure data, analysis of real-time data, and automated IT event correlations. The result – high Mean Time to Acknowledge (MTTA) and Mean Time to Repair (MTTR), and hence poor performance of applications, users, and relevant business outcomes. There is a lot of data shrouded in noise; and, unfortunately, not much actionable data exists in the RCA process to take fast and (if possible) automated remediation actions to alleviate performance issues before they impact application and user performances.
Solution
When a user complains about lack of access (or slow response) from an application, the root cause can be at various domains/layers – at server domain (e.g., high resource utilization at servers, database connection issue, etc.), at network domain (e.g., packet drops, high CPU usage at router/switch, broadcast attack, etc.), or at services domain (e.g., connection capacity limit at load balancers, DNS failures, authentication errors, etc.). A simple Active Directory (AD) and/or LDAP down problem can permeate and create poor application performance. Inclusions of third-party services/APIs in application delivery make the RCA process even more challenging. One issue is clear from this – IT domains are interdependent and closely related. The problem is one layer/domain can easily affect other domains; ultimately end users suffer from any of these issues. Therefore, a disjointed triage in the RCA process where each IT team (network, application, service, etc.) is performing triage in their own limited circle (without cross-domain IT event correlations) becomes ineffective.
Problem is one IT
layer/domain can
easily affect other
domains;
ultimately end
users suffer from
any of these
issues
Ennetix xVisor AIOps platform streamlines the RCA process with a holistic approach in triaging performance issues. xVisor continuously measures and analyzes end-to-end application-delivery infrastructures – each user device, each network hop/link, each path, each server, each network function, each API gateway, etc.
xVisor RCA process starts with dynamic application-service topology discovery so that each new application/service path/device can be continuously measured and investigated for performance deviations. Manual configuration of topologies to measure performance is limited in today’s dynamic IT environments where service and application access points change frequently, based on demand and time of the day.
xVisor AIOps intelligently analyzes large volumes of complex infrastructure data and performs intelligent IT event correlations using sophisticated AI/ML algorithms, making it faster to detect anomalies in infrastructure performance behavior, reducing the manual labor of correlation and threshold-based analysis in the RCA process.
Conclusion
Continuous analysis of real-time data based on application-service topology helps xVisor to automatically group similar infrastructure events and IT event correlations, downgrade/discard events that are just symptoms, and reduce event noise to make the RCA process very effective. xVisor can quickly discover patterns and predict impending issues, thereby preventing problems before they affect user/application performance.
Results of such continuous and innovative RCA – xVisor can reduce MTTA from hours to minutes, and MTTR from days to minutes. xVisor can also provide preemptive notifications (e.g., upgrade suggestions, dynamic path changes, scaling of resources, etc.) and can allow smooth integrations for automating IT Service Management (ITSM) processes.