From Wikipedia, the free encyclopedia
Horovod is a free and open-source software framework for distributed deep learning training using TensorFlow, Keras, PyTorch, and Apache MXNet. It is hosted under the Linux Foundation AI (LF AI),[3] and aims to improve the speed, scale, and resource allocation of training machine learning models.[4]
Horovod was created at Uber as part of the company's internal machine learning platform, Michelangelo, to simplify scaling TensorFlow models across many GPUs.[1] The first public release of the library, version 0.9.0, was tagged on GitHub in August 2017 under the Apache License 2.0.[2] In October 2017, Uber Engineering publicly introduced Horovod as an open-source component of its deep learning toolkit.[1]
In February 2018, Alexander Sergeev and Mike Del Balso published a technical paper describing Horovod's design and benchmarking its performance on up to 512 GPUs, showing near-linear scaling for several image-classification models when compared with single-GPU baselines.[1]
In December 2018, Uber contributed Horovod to the LF Deep Learning Foundation (later LF AI & Data), making it a Linux Foundation project.[5][6][7] Horovod entered incubation under LF AI & Data and graduated as a full foundation project in 2020.[8]
Design and features
Horovod implements synchronous data-parallel training, in which each worker process maintains a replica of the model and computes gradients on different mini-batches of data.[1] The gradients are aggregated across workers using the ring-allreduce communication pattern rather than a central parameter server, which reduces communication bottlenecks and can improve scaling on multi-GPU clusters.[1] Communication is built on top of collective-communication libraries such as MPI, NCCL, Gloo and Intel oneCCL, and supports both GPU and CPU training.[9]
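The ring-allreduce pattern described above can be illustrated with a short single-process simulation. This is an illustrative sketch only, not Horovod's implementation: Horovod runs workers as separate processes and delegates the actual communication to MPI, NCCL, Gloo, or oneCCL, whereas here the "sends" between workers are simulated serially, and the function name `ring_allreduce` is an assumption of the sketch.

```python
def ring_allreduce(worker_grads):
    """Sum equal-length gradient vectors across n simulated workers using
    the ring-allreduce pattern: a scatter-reduce phase followed by an
    allgather phase, each taking n - 1 communication steps."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    assert length % n == 0, "sketch assumes the vector splits evenly into n chunks"
    chunk = length // n
    bufs = [list(g) for g in worker_grads]  # copies: leave the inputs untouched

    # Phase 1: scatter-reduce. In step s, worker r sends chunk (r - s) mod n
    # to its ring neighbour (r + 1) mod n, which adds it into its own buffer.
    # Payloads are snapshotted first to mimic simultaneous sends.
    for s in range(n - 1):
        routes = [(r, (r - s) % n) for r in range(n)]
        data = [bufs[r][c * chunk:(c + 1) * chunk] for r, c in routes]
        for (r, c), d in zip(routes, data):
            dst = (r + 1) % n
            for i, v in enumerate(d):
                bufs[dst][c * chunk + i] += v

    # After scatter-reduce, worker r holds the fully reduced chunk (r + 1) mod n.
    # Phase 2: allgather. In step s, worker r forwards its completed chunk
    # (r + 1 - s) mod n to worker (r + 1) mod n, which overwrites its copy.
    for s in range(n - 1):
        routes = [(r, (r + 1 - s) % n) for r in range(n)]
        data = [bufs[r][c * chunk:(c + 1) * chunk] for r, c in routes]
        for (r, c), d in zip(routes, data):
            dst = (r + 1) % n
            bufs[dst][c * chunk:(c + 1) * chunk] = d

    return bufs  # every worker now holds the element-wise sum


# Two simulated workers, each with a 4-element gradient vector.
result = ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40]])
# Every worker ends up with the element-wise sum [11, 22, 33, 44].
```

Because each worker only ever exchanges vector chunks of size 1/n with its two ring neighbours, the data each worker sends and receives stays roughly constant as workers are added, which is why this pattern avoids the bottleneck of a central parameter server.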
References
- ^ a b c d e f Sergeev, Alexander (October 17, 2017). "Meet Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow". Uber Engineering Blog. Retrieved November 28, 2025.
- ^ "Releases · horovod/horovod". GitHub. Retrieved July 11, 2023.
- ^ Johnson, Khari (December 13, 2018). "Uber brings Horovod project for distributed deep learning to Linux Foundation". VentureBeat. Retrieved July 9, 2020.
- ^ "Projects – LF AI". Linux Foundation – LF AI. Retrieved July 9, 2020. Quote: "Horovod, a distributed training framework for TensorFlow, Keras and PyTorch, improves speed, scale and resource allocation in machine learning training activities. Uber uses Horovod for self-driving vehicles, fraud detection, and trip forecasting. It is also being used by Alibaba, Amazon and NVIDIA."
- ^ "Projects – LF AI & Data". LF AI & Data Foundation. Retrieved November 28, 2025.
- ^ Johnson, Khari (December 13, 2018). "Uber brings Horovod project for distributed deep learning to Linux Foundation". VentureBeat. Retrieved November 28, 2025.
- ^ "Horovod: an open-source distributed training framework by Uber for TensorFlow, Keras, PyTorch, and MXNet". Packt. April 9, 2019. Retrieved November 28, 2025.
- ^ "LF AI Foundation Announces Graduation of Horovod Project". LF AI & Data Foundation. September 9, 2020. Retrieved November 28, 2025.
- ^ "Using Horovod for Distributed Training". NASA Advanced Supercomputing Division. October 6, 2022. Retrieved November 28, 2025.
