POD-diagnosis: Error diagnosis of sporadic operations on cloud applications


Sherry Xu, Liming Zhu, Ingo Weber, Len Bass and Daniel Sun




Applications in the cloud are subject to sporadic changes due to operational activities such as upgrade, redeployment, and on-demand scaling. These operations are also subject to interference from other simultaneous operations. Increasing the dependability of these sporadic operations is non-trivial, particularly since traditional anomaly-detection-based diagnosis techniques are less effective during sporadic operation periods. A wide range of legitimate changes confounds anomaly diagnosis and makes it difficult to establish a baseline for “normal” operation. The increasing frequency of these sporadic operations (e.g., due to continuous deployment) is exacerbating the problem. Diagnosing failures during sporadic operations relies heavily on logs, yet the log analysis challenges stemming from noisy, inconsistent, and voluminous logs from multiple sources remain largely unsolved. In this paper, we propose Process Oriented Dependability (POD)-Diagnosis, an approach that explicitly models these sporadic operations as processes. These models allow us to (i) determine whether the process is executing in an orderly fashion, and (ii) use the process context to filter logs, trigger assertion evaluations, visit fault trees, and perform on-demand assertion evaluation for online error diagnosis and root cause analysis. We evaluated the approach on rolling upgrade operations in Amazon Web Services (AWS) while performing other simultaneous operations. In our evaluation, we correctly detected all 160 injected faults, as well as 46 interferences caused by concurrent operations, with 91.95% precision. For the correctly detected faults, the accuracy of error diagnosis was 96.55%.
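The process-oriented idea summarized above can be illustrated with a minimal sketch. It is not the POD-Diagnosis implementation: the names `Step`, `FaultNode`, `diagnose`, and `monitor`, the log line formats, and the strictly linear process model are all assumptions made for illustration. The sketch shows how a process model might be used to filter log lines, check execution order, trigger an assertion after a step, and walk a fault tree with on-demand tests when that assertion fails.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One step of a (hypothetical) linear operation process."""
    name: str
    pattern: str  # regex that identifies this step's log line
    assertion: Optional[Callable[[], bool]] = None  # check run after the step


@dataclass
class FaultNode:
    """Fault tree node; `test` is an on-demand check, True if the fault is present."""
    description: str
    test: Callable[[], bool]
    children: List["FaultNode"] = field(default_factory=list)


def diagnose(node: FaultNode) -> List[str]:
    """Return the deepest confirmed faults below `node` as root cause candidates."""
    if not node.test():
        return []
    causes = [c for child in node.children for c in diagnose(child)]
    return causes or [node.description]


def monitor(steps: List[Step], fault_trees: Dict[str, FaultNode],
            log_lines: List[str]) -> List[str]:
    """Replay log lines against the process model and collect findings."""
    findings: List[str] = []
    expected = 0  # index of the step the model expects next
    for line in log_lines:
        matched = next(
            (i for i, s in enumerate(steps) if re.search(s.pattern, line)), None)
        if matched is None:
            continue  # line is unrelated to the process: filtered out as noise
        step = steps[matched]
        if matched != expected:
            wanted = steps[expected].name if expected < len(steps) else "end of process"
            findings.append(
                f"conformance error: saw '{step.name}' but expected '{wanted}'")
        expected = matched + 1
        if step.assertion is not None and not step.assertion():
            findings.append(f"assertion failed after '{step.name}'")
            tree = fault_trees.get(step.name)
            if tree is not None:
                findings.extend(
                    f"root cause candidate: {c}" for c in diagnose(tree))
    return findings


# Hypothetical rolling-upgrade fragment; the lambdas stand in for real cloud API checks.
steps = [
    Step("update launch configuration", r"launch configuration .* updated"),
    Step("terminate old instance", r"terminating instance"),
    Step("new instance in service", r"instance \S+ in service",
         assertion=lambda: False),  # pretend the post-step check fails
]
fault_trees = {
    "new instance in service": FaultNode(
        "new instance unhealthy", lambda: True,
        [FaultNode("wrong machine image", lambda: False),
         FaultNode("security group misconfigured", lambda: True)]),
}
log_lines = [
    "INFO launch configuration lc-2 updated",
    "DEBUG heartbeat ok",
    "INFO terminating instance i-old",
    "INFO instance i-new in service",
]
for finding in monitor(steps, fault_trees, log_lines):
    print(finding)
```

In this sketch the process model does double duty: its step patterns decide which log lines are relevant, and its step order decides when a deviation is reported. A real rolling upgrade loops over instances and runs alongside other operations, so a production model would need loops and concurrency handling that this linear example omits.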

