Learning Spatiotemporal Failure Dependencies for Resilient Edge Computing Services

Abstract

Edge computing services are exposed to infrastructural failures due to geographical dispersion, ad hoc deployment, and rudimentary support systems. Two unique characteristics of the edge computing paradigm necessitate a novel failure resilience approach. First, edge servers, unlike their cloud counterparts with reliable data center networks, are typically connected via ad hoc networks. Thus, link failures need more attention to ensure truly resilient services. Second, network delay is a critical factor for the deployment of edge computing services. This restricts replication decisions to geographical proximity and necessitates joint consideration of delay and resilience. In this article, we propose a novel machine-learning-based mechanism that evaluates the failure resilience of a service deployed redundantly on the edge infrastructure. Our approach learns the spatiotemporal dependencies between edge server failures and combines them with the topological information to incorporate link failures. Ultimately, we infer the probability that a certain set of servers fails or disconnects concurrently during service runtime. Furthermore, we introduce Dependency- and Topology-aware Failure Resilience (DTFR), a two-stage scheduler that minimizes either failure probability or redundancy cost, while maintaining low network delay. Extensive evaluation with various real-world failure traces and workload configurations demonstrates superior performance in terms of availability, number of failures, network delay, and cost compared with state-of-the-art schedulers.
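
To make the quantity in the abstract concrete, the following is a minimal sketch of what "the probability that a certain set of servers fails or disconnects concurrently" can mean on a small topology. It is not the learning-based mechanism or the DTFR scheduler from the article: it assumes that server and link failures are independent with known probabilities, which is exactly the simplification the article's dependency learning is meant to replace. All names (`reachable`, `service_failure_probability`, the node labels) are hypothetical and chosen for illustration only.

```python
from itertools import product


def reachable(source, up_nodes, up_links):
    """Return the nodes reachable from source using only working nodes and links."""
    if source not in up_nodes:
        return set()
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        for a, b in up_links:
            # Links are undirected, so check both orientations.
            for u, v in ((a, b), (b, a)):
                if u == node and v in up_nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return seen


def service_failure_probability(user, replicas, node_fail, link_fail):
    """Exact probability that no replica is both alive and reachable from user.

    ASSUMPTION: failures are independent. node_fail maps node -> failure
    probability, link_fail maps (node, node) -> failure probability.
    Enumerates every up/down state, so it is only suitable for tiny examples.
    """
    nodes = list(node_fail)
    links = list(link_fail)
    total = 0.0
    for state in product([True, False], repeat=len(nodes) + len(links)):
        node_up = dict(zip(nodes, state[:len(nodes)]))
        link_up = dict(zip(links, state[len(nodes):]))
        prob = 1.0
        for n in nodes:
            prob *= (1 - node_fail[n]) if node_up[n] else node_fail[n]
        for l in links:
            prob *= (1 - link_fail[l]) if link_up[l] else link_fail[l]
        up_nodes = {n for n in nodes if node_up[n]}
        up_links = [l for l in links if link_up[l]]
        # The service is down when no replica is alive and connected to the user.
        if not (reachable(user, up_nodes, up_links) & set(replicas)):
            total += prob
    return total


# Hypothetical example: a user access point "u" with two replica servers.
node_fail = {"u": 0.0, "a": 0.1, "b": 0.1}
link_fail = {("u", "a"): 0.2, ("u", "b"): 0.2}
one_replica = service_failure_probability("u", ["a"], node_fail, link_fail)
two_replicas = service_failure_probability("u", ["a", "b"], node_fail, link_fail)
print(one_replica, two_replicas)
```

In this toy setting a second replica lowers the failure probability only because the two replicas are assumed to fail independently. When failures are correlated in space or time, as the abstract argues they are on edge infrastructure, the benefit of replication can be considerably smaller, which is why placement has to account for those dependencies.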

Authors
  • Aral, Atakan
  • Brandić, Ivona
Shortfacts
Category
Journal Paper
Divisions
Scientific Computing
Subjects
Data Processing Management
Artificial Intelligence
Parallel Data Processing
System Architecture (General)
Journal or Publication Title
IEEE Transactions on Parallel and Distributed Systems (TPDS)
ISSN
1045-9219
Publisher
IEEE
Page Range
pp. 1578-1590
Number
7
Volume
32
Date
22 December 2020