Horovod (machine learning)

From Wikipedia, the free encyclopedia


Horovod is a free and open-source software framework for distributed deep learning training using TensorFlow, Keras, PyTorch, and Apache MXNet. It is hosted under the Linux Foundation AI (LF AI).[3] The project aims to improve the speed, scale, and resource allocation of machine learning training.[4]

Horovod was created at Uber as part of the company’s internal machine learning platform, Michelangelo, to simplify scaling TensorFlow models across many GPUs.[1] The first public release of the library, version 0.9.0, was tagged on GitHub in August 2017 under the Apache 2.0 license.[2] In October 2017, Uber Engineering publicly introduced Horovod as an open-source component of its deep learning toolkit.[1]

In February 2018, Alexander Sergeev and Mike Del Balso published a technical paper describing Horovod’s design and benchmarking its performance on up to 512 GPUs, showing near-linear scaling for several image-classification models relative to single-GPU baselines.[1]

In December 2018 Uber contributed Horovod to the LF Deep Learning Foundation (later LF AI & Data), making it a Linux Foundation project.[5][6][7] Horovod entered incubation under LF AI & Data and graduated as a full foundation project in 2020.[8]

Design and features

Horovod implements synchronous data-parallel training, in which each worker process maintains a replica of the model and computes gradients on different mini-batches of data.[1] The gradients are aggregated across workers using the ring-allreduce communication pattern rather than a central parameter server, which reduces communication bottlenecks and can improve scaling on multi-GPU clusters.[1] Communication is built on top of collective-communication libraries such as MPI, NCCL, Gloo and Intel oneCCL, and supports both GPU and CPU training.[9]
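The ring-allreduce pattern described above can be illustrated with a small simulation. The sketch below is purely educational and is not Horovod's API: `ring_allreduce` is a hypothetical helper in which each simulated worker's gradient vector is split into chunks that circulate around the ring, first in a scatter-reduce phase (each worker accumulates one chunk's full sum) and then in an allgather phase (the completed chunks are passed around so every worker ends with the total).

```python
def ring_allreduce(grads):
    """Simulate ring-allreduce: every worker ends with the elementwise sum.

    grads: list of per-worker gradient lists, all the same length.
    Returns the per-worker buffers after the reduction (all identical).
    """
    n = len(grads)                                       # workers in the ring
    length = len(grads[0])
    bounds = [(i * length) // n for i in range(n + 1)]   # chunk boundaries
    data = [list(g) for g in grads]                      # per-worker buffers

    # Scatter-reduce: after n-1 steps, worker w holds the complete sum
    # of chunk (w + 1) % n.
    for step in range(n - 1):
        for w in range(n):
            c = (w - step) % n           # chunk worker w forwards this step
            nxt = (w + 1) % n            # ring neighbour
            for i in range(bounds[c], bounds[c + 1]):
                data[nxt][i] += data[w][i]

    # Allgather: circulate the completed chunks so every worker has all sums.
    for step in range(n - 1):
        for w in range(n):
            c = (w + 1 - step) % n       # completed chunk worker w forwards
            nxt = (w + 1) % n
            for i in range(bounds[c], bounds[c + 1]):
                data[nxt][i] = data[w][i]

    return data
```

Because each worker only ever exchanges one chunk with its ring neighbour per step, the per-worker communication volume is roughly constant in the number of workers, which is why the pattern avoids the bottleneck of a central parameter server.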

  1. ^ a b c d e f Sergeev, Alexander (October 17, 2017). “Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow”. Uber Engineering Blog. Retrieved November 28, 2025.
  2. ^ a b “Releases · horovod/horovod”. horovod. Retrieved July 11, 2023.
  3. ^ Johnson, Khari (December 13, 2018). “Uber brings Horovod project for distributed deep learning to Linux Foundation”. VentureBeat. Retrieved July 9, 2020.
  4. ^ “Projects – LF AI”. Linux Foundation – LF AI. Retrieved July 9, 2020. Horovod, a distributed training framework for TensorFlow, Keras and PyTorch, improves speed, scale and resource allocation in machine learning training activities. Uber uses Horovod for self-driving vehicles, fraud detection, and trip forecasting. It is also being used by Alibaba, Amazon and NVIDIA.
  5. ^ “Projects – LF AI & Data”. LF AI & Data Foundation. Retrieved November 28, 2025.
  6. ^ Johnson, Khari (December 13, 2018). “Uber brings Horovod project for distributed deep learning to Linux Foundation”. VentureBeat. Retrieved November 28, 2025.
  7. ^ “Horovod: an open-source distributed training framework by Uber for TensorFlow, Keras, PyTorch, and MXNet”. Packt. April 9, 2019. Retrieved November 28, 2025.
  8. ^ “LF AI Foundation Announces Graduation of Horovod Project”. LF AI & Data Foundation. September 9, 2020. Retrieved November 28, 2025.
  9. ^ “Using Horovod for Distributed Training”. NASA Advanced Supercomputing Division. October 6, 2022. Retrieved November 28, 2025.
