Client devices at the edge are generating increasingly large amounts of rich data suitable for learning powerful statistical models. However, privacy concerns and heavy communication load make it infeasible to move the client data to a centralized location for training. In many distributed learning setups, client nodes carry out gradient computations on their local data, while a central master server receives the local gradients and aggregates them to take the global model update step. To guarantee robustness against straggling communication links, we consider a hierarchical setup with n_e clients and n_h reliable helper nodes that are available to aid in gradient aggregation at the master. To achieve resiliency against straggling client-to-helper links, we propose two approaches that leverage coded redundancy. The first is Aligned Repetition Coding (ARC), which repeats gradient components on the helper links, allowing significant partial aggregations at the helpers and resulting in a helpers-to-master communication load (C_HM) of O(n_h). ARC, however, incurs a client-to-helpers communication load (C_EH) of Θ(n_h), which is prohibitive for client nodes due to limited and costly bandwidth. We thus propose Aligned Maximum Distance Separable Coding (AMC), which achieves the optimal C_EH of Θ(1) for a given resiliency threshold by applying an MDS code over the gradient components, while achieving a C_HM of O(n_e).
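The any-k-of-n recovery property that underlies the MDS-based approach can be illustrated with a minimal sketch. The code below is not the paper's AMC construction; it is a generic real-valued Vandermonde MDS code showing how a gradient split into k pieces can be encoded into n coded pieces so that any k of them (i.e., tolerating n - k stragglers) suffice for exact recovery. All function names and parameters are illustrative.

```python
import numpy as np

def mds_encode(gradient, k, n):
    """Split a gradient vector into k pieces and produce n coded pieces.

    Uses an n x k Vandermonde generator matrix (distinct evaluation
    points), so every k x k submatrix is invertible: any k coded
    pieces determine the original k pieces. Assumes len(gradient)
    is divisible by k for simplicity.
    """
    parts = gradient.reshape(k, -1)                          # k systematic pieces
    G = np.vander(np.arange(1, n + 1, dtype=float), k,
                  increasing=True)                           # n x k generator
    return G @ parts                                         # n coded pieces

def mds_decode(received, indices, k):
    """Recover the gradient from any k coded pieces.

    `received` holds k coded pieces; `indices` gives which of the
    n rows of the generator they correspond to (0-based).
    """
    rows = np.vander(np.asarray(indices, dtype=float) + 1, k,
                     increasing=True)                        # k x k, invertible
    parts = np.linalg.solve(rows, received)
    return parts.reshape(-1)

# Toy example: n = 5 coded pieces, any k = 3 suffice
# (tolerates 2 straggling links).
g = np.arange(6.0)
coded = mds_encode(g, k=3, n=5)
survivors = [0, 2, 4]                                        # pieces 1 and 3 straggle
recovered = mds_decode(coded[survivors], survivors, k=3)
assert np.allclose(recovered, g)
```

Note that each coded piece has 1/k the size of the full gradient, which is how the per-link communication load stays Θ(1) in the gradient dimension regardless of the number of helpers.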


Saurav Prakash

University of Southern California

Amirhossein Reisizadeh

University of California, Santa Barbara

Ramtin Pedarsani

University of California, Santa Barbara

Salman Avestimehr

University of Southern California

Session Chair

Chao Tian

Texas A&M University