Horovod (machine learning)

From Wikipedia, the free encyclopedia


Horovod is a free and open-source software framework for distributed deep learning training using TensorFlow, Keras, PyTorch, and Apache MXNet. It is hosted under the Linux Foundation AI (LF AI).[3] The project aims to improve the speed, scale, and resource allocation of machine learning training.[4]

Horovod was created at Uber as part of the company’s internal machine learning platform, Michelangelo, to simplify scaling TensorFlow models across many GPUs.[1] The first public release of the library, version 0.9.0, was tagged on GitHub in August 2017 under the Apache 2.0 license.[2] In October 2017, Uber Engineering publicly introduced Horovod as an open-source component of its deep learning toolkit.[1]

In February 2018, Alexander Sergeev and Mike Del Balso published a technical paper describing Horovod’s design and benchmarking its performance on up to 512 GPUs, showing near-linear scaling for several image-classification models relative to single-GPU baselines.[1]

In December 2018 Uber contributed Horovod to the LF Deep Learning Foundation (later LF AI & Data), making it a Linux Foundation project.[5][6][7] Horovod entered incubation under LF AI & Data and graduated as a full foundation project in 2020.[8]

Design and features

Horovod implements synchronous data-parallel training, in which each worker process maintains a replica of the model and computes gradients on different mini-batches of data.[1] The gradients are aggregated across workers using the ring-allreduce communication pattern rather than a central parameter server, which reduces communication bottlenecks and can improve scaling on multi-GPU clusters.[1] Communication is built on top of collective-communication libraries such as MPI, NCCL, Gloo and Intel oneCCL, and supports both GPU and CPU training.[9]
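The ring-allreduce pattern described above can be illustrated with a small simulation. The sketch below is purely educational and is not Horovod's API: `ring_allreduce` is a hypothetical helper in which each simulated worker's gradient vector is split into chunks that circulate around the ring, first in a scatter-reduce phase (each worker accumulates one chunk's full sum) and then in an allgather phase (the completed chunks are passed around so every worker ends with the total).

```python
def ring_allreduce(grads):
    """Simulate ring-allreduce: every worker ends with the elementwise sum.

    grads: list of per-worker gradient lists, all the same length.
    Returns the per-worker buffers after the reduction (all identical).
    """
    n = len(grads)                                       # workers in the ring
    length = len(grads[0])
    bounds = [(i * length) // n for i in range(n + 1)]   # chunk boundaries
    data = [list(g) for g in grads]                      # per-worker buffers

    # Scatter-reduce: after n-1 steps, worker w holds the complete sum
    # of chunk (w + 1) % n.
    for step in range(n - 1):
        for w in range(n):
            c = (w - step) % n           # chunk worker w forwards this step
            nxt = (w + 1) % n            # ring neighbour
            for i in range(bounds[c], bounds[c + 1]):
                data[nxt][i] += data[w][i]

    # Allgather: circulate the completed chunks so every worker has all sums.
    for step in range(n - 1):
        for w in range(n):
            c = (w + 1 - step) % n       # completed chunk worker w forwards
            nxt = (w + 1) % n
            for i in range(bounds[c], bounds[c + 1]):
                data[nxt][i] = data[w][i]

    return data
```

Because each worker only ever exchanges one chunk with its ring neighbour per step, the per-worker communication volume is roughly constant in the number of workers, which is why the pattern avoids the bottleneck of a central parameter server.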

  1. ^ a b c d e f Sergeev, Alexander (October 17, 2017). “Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow”. Uber Engineering Blog. Retrieved November 28, 2025.
  2. ^ a b “Releases · horovod/horovod”. horovod. Retrieved July 11, 2023.
  3. ^ Johnson, Khari (December 13, 2018). “Uber brings Horovod project for distributed deep learning to Linux Foundation”. VentureBeat. Retrieved July 9, 2020.
  4. ^ “Projects – LF AI”. Linux Foundation – LF AI. Retrieved July 9, 2020. Horovod, a distributed training framework for TensorFlow, Keras and PyTorch, improves speed, scale and resource allocation in machine learning training activities. Uber uses Horovod for self-driving vehicles, fraud detection, and trip forecasting. It is also being used by Alibaba, Amazon and NVIDIA.
  5. ^ “Projects – LF AI & Data”. LF AI & Data Foundation. Retrieved November 28, 2025.
  6. ^ Johnson, Khari (December 13, 2018). “Uber brings Horovod project for distributed deep learning to Linux Foundation”. VentureBeat. Retrieved November 28, 2025.
  7. ^ “Horovod: an open-source distributed training framework by Uber for TensorFlow, Keras, PyTorch, and MXNet”. Packt. April 9, 2019. Retrieved November 28, 2025.
  8. ^ “LF AI Foundation Announces Graduation of Horovod Project”. LF AI & Data Foundation. September 9, 2020. Retrieved November 28, 2025.
  9. ^ “Using Horovod for Distributed Training”. NASA Advanced Supercomputing Division. October 6, 2022. Retrieved November 28, 2025.
