Conflicts among task gradients arise when a single set of parameters must serve multiple objectives. In deep networks, the gradients backpropagated from different task-specific losses can point in opposing directions, producing gradient conflict and negative transfer, where learning for one task degrades performance on another. The consequences include slower convergence, unstable training, and models that perform well on some tasks but poorly on others. Foundational work by Rich Caruana framed multi-task learning as leveraging shared representations to improve generalization, highlighting both its promise and its liabilities.
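The notion of "opposing directions" can be made operational: two task gradients conflict when the angle between them exceeds 90 degrees, that is, when their cosine similarity is negative. A minimal NumPy sketch (the function name and toy vectors are illustrative, not from any particular implementation):

```python
import numpy as np

def grads_conflict(g1, g2):
    """Return True when two task gradients point in opposing directions,
    i.e. when their cosine similarity is negative."""
    cos = np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2))
    return bool(cos < 0.0)

# Toy gradients for a three-parameter shared layer.
g_task_a = np.array([1.0, 2.0, -1.0])
g_task_b = np.array([-1.0, 0.5, 1.0])

print(grads_conflict(g_task_a, g_task_b))  # True: their dot product is -1.0
print(grads_conflict(g_task_a, g_task_a))  # False: a gradient never conflicts with itself
```

In practice this check is applied to the gradients of the shared parameters only; task-specific heads receive gradients from a single loss and cannot conflict in this sense.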
Causes of conflicting gradients
At root, conflicts reflect differences in task geometry and data distributions. Tasks with divergent input–output mappings, or with imbalanced data, contribute disproportionately large gradients, so shared layers receive mixed signals. Domain and cultural variation amplify this: a speech recognition model trained across languages or regional accents may see gradients that reflect distinct phonetic structures, while medical models trained across health systems can receive gradients shaped by local clinical practices. These regional and environmental nuances mean conflicts are not only mathematical but also socio-technical, affecting fairness and local utility.
Strategies to manage conflicts
A range of algorithmic responses aims to align or arbitrate gradient updates. Loss weighting adapts the scale of each task's loss; for example, task uncertainty weighting, proposed by Alex Kendall at University of Cambridge and Yarin Gal at University of Oxford, derives per-task weights from learned uncertainty so that noisier objectives are down-weighted. Gradient normalization rescales task gradients so that no single task dominates the update. Gradient projection (also called gradient surgery) modifies conflicting gradients by projecting them onto non-conflicting subspaces, reducing interference while preserving useful signal. Optimization-focused approaches cast training as a multi-objective problem and seek Pareto-optimal updates that trade off tasks explicitly. Architectural choices such as task-specific layers or soft parameter sharing reduce the degree of forced compromise in shared parameters.
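The gradient-projection step can be sketched concretely. The following NumPy fragment is a minimal version in the spirit of gradient surgery (the function name and example vectors are illustrative): when one task's gradient conflicts with another's, its component along the conflicting direction is removed.

```python
import numpy as np

def project_out_conflict(g_i, g_j):
    """If g_i conflicts with g_j (negative dot product), return g_i with its
    component along g_j projected out; otherwise return g_i unchanged."""
    dot = np.dot(g_i, g_j)
    if dot < 0.0:
        return g_i - (dot / np.dot(g_j, g_j)) * g_j
    return g_i

g_a = np.array([1.0, 2.0])
g_c = np.array([-1.0, 0.0])  # conflicts with g_a: their dot product is -1

adjusted = project_out_conflict(g_a, g_c)
print(adjusted)               # [0. 2.] -- the interfering component along g_c is gone
print(np.dot(adjusted, g_c))  # 0.0 -- the adjusted gradient no longer opposes g_c
```

With more than two tasks, this projection is typically applied pairwise over the tasks in a random order before the adjusted gradients are summed into a single update.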
Each method has trade-offs: weighting and normalization are simple to apply but may not fully resolve conflicts; projection techniques can preserve per-task gains yet add computational cost; multi-objective optimizers yield principled balances but may slow training in practice. In applied settings, combining techniques and incorporating domain knowledge about cultural or environmental differences often yields the best outcomes, ensuring models respect local variation while maintaining overall robustness.
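As a concrete illustration of the loss-weighting family, uncertainty weighting is often implemented in a simplified form, L = Σᵢ exp(−sᵢ)·Lᵢ + sᵢ, where sᵢ = log σᵢ² is a learnable per-task parameter. The sketch below assumes this common simplified variant (the original formulation includes a factor of 1/2 for regression losses, omitted here), with the sᵢ values held fixed for clarity rather than learned:

```python
import numpy as np

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses using log-variances s_i = log(sigma_i^2):
    L = sum_i exp(-s_i) * L_i + s_i. Tasks with high uncertainty are
    down-weighted, while the +s_i penalty keeps the learned uncertainties
    from growing without bound (which would zero out every loss)."""
    task_losses = np.asarray(task_losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-log_vars) * task_losses + log_vars))

# With zero log-variance, every task keeps its raw weight.
print(uncertainty_weighted_loss([1.0, 2.0], [0.0, 0.0]))  # 3.0
# Raising a task's log-variance shrinks its contribution but pays a penalty.
print(uncertainty_weighted_loss([1.0, 2.0], [0.0, np.log(2.0)]))
```

In a real training loop the log_vars would be trainable parameters updated by the optimizer alongside the network weights.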