Parallel
class ignite.distributed.launcher.Parallel(backend=None, nproc_per_node=None, nnodes=None, node_rank=None, master_addr=None, master_port=None, **spawn_kwargs)

Distributed launcher context manager to simplify distributed configuration setup for multiple backends:
- backends from native torch distributed configuration: "nccl", "gloo", "mpi" (if available)
- XLA on TPUs via pytorch/xla (if installed)
- using Horovod distributed framework (if installed)
Namely, it can 1) spawn nproc_per_node child processes and initialize a processing group according to the provided backend (useful for standalone scripts), or 2) only initialize a processing group given the backend (useful with tools like torch.distributed.launch, horovodrun, etc.).

Examples
1) Single node or Multi-node, Multi-GPU training launched with torch.distributed.launch or horovodrun tools
Single node option with 4 GPUs
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py
# or if installed horovod
horovodrun -np=4 python main.py
Multi-node option: 2 nodes with 8 GPUs each
## node 0
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --master_addr=master --master_port=3344 --nproc_per_node=8 --use_env main.py
# or if installed horovod
horovodrun -np 16 -H hostname1:8,hostname2:8 python main.py

## node 1
python -m torch.distributed.launch --nnodes=2 --node_rank=1 --master_addr=master --master_port=3344 --nproc_per_node=8 --use_env main.py
User code is the same for both options:
# main.py
import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

backend = "nccl"  # or "horovod" if package is installed

with idist.Parallel(backend=backend) as parallel:
    parallel.run(training, config, a=1, b=2)
2) Single node, Multi-GPU training launched with python
python main.py
# main.py
import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

backend = "nccl"  # or "horovod" if package is installed

with idist.Parallel(backend=backend, nproc_per_node=4) as parallel:
    parallel.run(training, config, a=1, b=2)
3) Single node, Multi-TPU training launched with python
python main.py
# main.py
import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

with idist.Parallel(backend="xla-tpu", nproc_per_node=8) as parallel:
    parallel.run(training, config, a=1, b=2)
4) Multi-node, Multi-GPU training launched with python. For example, 2 nodes with 8 GPUs each:
Using torch native distributed framework:
# node 0
python main.py --node_rank=0

# node 1
python main.py --node_rank=1
# main.py
import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

dist_config = {
    "nproc_per_node": 8,
    "nnodes": 2,
    "node_rank": args.node_rank,
    "master_addr": "master",
    "master_port": 15000
}

with idist.Parallel(backend="nccl", **dist_config) as parallel:
    parallel.run(training, config, a=1, b=2)
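The snippet above leaves args undefined. As a hedged sketch only (the --node_rank flag and the argparse usage are illustrative assumptions, not part of the Ignite API), the node rank could be read from the command line like this:

# main.py (hypothetical argument parsing for the example above)
import argparse
import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--node_rank", type=int, required=True)  # assumed flag name
    args = parser.parse_args()

    config = {}
    dist_config = {
        "nproc_per_node": 8,
        "nnodes": 2,
        "node_rank": args.node_rank,
        "master_addr": "master",
        "master_port": 15000
    }

    with idist.Parallel(backend="nccl", **dist_config) as parallel:
        parallel.run(training, config, a=1, b=2)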
- Parameters

  - backend (Optional[str]) – backend to use: nccl, gloo, xla-tpu, horovod. If None, no distributed configuration.
  - nproc_per_node (Optional[int]) – optional argument, number of processes per node to specify. If not None, run() will spawn nproc_per_node processes that run the input function with its arguments.
  - nnodes (Optional[int]) – optional argument, number of nodes participating in distributed configuration. If not None, run() will spawn nproc_per_node processes that run the input function with its arguments. Total world size is nproc_per_node * nnodes. This option is only supported by the native torch distributed module. For other modules, please set up spawn_kwargs with backend-specific arguments.
  - node_rank (Optional[int]) – optional argument, current machine index. Mandatory argument if nnodes is specified and larger than one. This option is only supported by the native torch distributed module. For other modules, please set up spawn_kwargs with backend-specific arguments.
  - master_addr (Optional[str]) – optional argument, master node TCP/IP address for torch native backends (nccl, gloo). Mandatory argument if nnodes is specified and larger than one.
  - master_port (Optional[int]) – optional argument, master node port for torch native backends (nccl, gloo). Mandatory argument if master_addr is specified.
  - spawn_kwargs (Any) – kwargs to idist.spawn function.
- Return type
  None
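For illustration, a minimal sketch of the backend=None case mentioned in the backend parameter above (an assumption drawn from that description, not an example from this page): without a backend, no distributed configuration is applied and run simply calls the function once in the current process.

import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    # With backend=None this runs once, non-distributed; local_rank is expected to be 0.
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())

config = {"lr": 0.01}

# No backend: useful for debugging the same script without any distributed setup.
with idist.Parallel(backend=None) as parallel:
    parallel.run(training, config)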
Changed in version 0.4.2: backend now accepts horovod distributed framework.
Methods

run(func, *args, **kwargs)

Execute func with provided arguments in distributed context.

Example
def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

with idist.Parallel(backend=backend) as parallel:
    parallel.run(training, config, a=1, b=2)
- Parameters

  - func (Callable) – function to execute. First argument of the function should be local_rank - local process index.
  - args (Any) – positional arguments of func (without local_rank).
  - kwargs (Any) – keyword arguments of func.
- Return type
  None
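As a final hedged sketch (the keyword names a and b are illustrative assumptions), this shows how the args and kwargs given to run are forwarded to func after local_rank:

import ignite.distributed as idist

def training(local_rank, config, a=0, b=0):
    # run() prepends local_rank, then forwards the remaining arguments as-is.
    print(local_rank, config, a, b)

with idist.Parallel(backend=None) as parallel:
    # Calls training(local_rank, {"lr": 0.01}, a=1, b=2) on each process (here: one).
    parallel.run(training, {"lr": 0.01}, a=1, b=2)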