Parallel

class ignite.distributed.launcher.Parallel(backend=None, nproc_per_node=None, nnodes=None, node_rank=None, master_addr=None, master_port=None, **spawn_kwargs)[source]

Distributed launcher context manager to simplify distributed configuration setup for multiple backends (nccl, gloo, xla-tpu, horovod).

Namely, it can 1) spawn nproc_per_node child processes and initialize a process group according to the provided backend (useful for standalone scripts), or 2) only initialize a process group for the given backend (useful with tools like torch.distributed.launch, horovodrun, etc.).
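
The choice between the two modes can be pictured with a small sketch. It assumes, based on the parameter descriptions on this page, that a non-None nproc_per_node is what triggers spawning. launch_mode is a hypothetical helper for illustration only and is not part of ignite.

```python
from typing import Optional


def launch_mode(backend: Optional[str], nproc_per_node: Optional[int]) -> str:
    """Hypothetical helper (not an ignite API) naming what Parallel would do."""
    if backend is None:
        # No backend: no distributed configuration at all
        return "no distributed configuration"
    if nproc_per_node is not None:
        # Mode 1: standalone script, child processes are spawned
        return f"spawn {nproc_per_node} processes and initialize '{backend}' process group"
    # Mode 2: an external launcher already started the processes
    return f"initialize '{backend}' process group only"


print(launch_mode("nccl", 4))     # standalone script
print(launch_mode("nccl", None))  # started by torch.distributed.launch or horovodrun
print(launch_mode(None, None))    # plain, non-distributed run
```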

Examples

1) Single node or Multi-node, Multi-GPU training launched with torch.distributed.launch or horovodrun tools

Single node option with 4 GPUs

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py
# or, if Horovod is installed
horovodrun -np 4 python main.py

Multi-node option: 2 nodes with 8 GPUs each

# node 0
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --master_addr=master --master_port=3344 --nproc_per_node=8 --use_env main.py

# or, if Horovod is installed
horovodrun -np 16 -H hostname1:8,hostname2:8 python main.py

# node 1
python -m torch.distributed.launch --nnodes=2 --node_rank=1 --master_addr=master --master_port=3344 --nproc_per_node=8 --use_env main.py

User code is the same for both options:

# main.py

import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

backend = "nccl"  # or "horovod" if package is installed

with idist.Parallel(backend=backend) as parallel:
    parallel.run(training, config, a=1, b=2)
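
With --use_env, torch.distributed.launch hands each process its settings through environment variables such as RANK, LOCAL_RANK and WORLD_SIZE, which is why main.py above takes no rank arguments of its own. A minimal sketch of reading them follows; read_launcher_env is a hypothetical helper, and the fallback defaults for a non-distributed run are assumptions.

```python
import os
from typing import Dict, Mapping


def read_launcher_env(environ: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Hypothetical helper: collect the values a launcher exports per process."""
    return {
        "rank": int(environ.get("RANK", 0)),              # global process index
        "local_rank": int(environ.get("LOCAL_RANK", 0)),  # index within the node
        "world_size": int(environ.get("WORLD_SIZE", 1)),  # total number of processes
    }


# Simulated environment of the last process on node 1 (2 nodes x 8 GPUs)
fake_env = {"RANK": "15", "LOCAL_RANK": "7", "WORLD_SIZE": "16"}
print(read_launcher_env(fake_env))
```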

2) Single node, Multi-GPU training launched with python

python main.py

# main.py

import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

backend = "nccl"  # or "horovod" if package is installed

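# The __main__ guard keeps spawned child processes from re-running the launcher
if __name__ == "__main__":
    with idist.Parallel(backend=backend, nproc_per_node=4) as parallel:
        parallel.run(training, config, a=1, b=2)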

3) Single node, Multi-TPU training launched with python

python main.py

# main.py

import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

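# The __main__ guard keeps spawned child processes from re-running the launcher
if __name__ == "__main__":
    with idist.Parallel(backend="xla-tpu", nproc_per_node=8) as parallel:
        parallel.run(training, config, a=1, b=2)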

4) Multi-node, Multi-GPU training launched with python. For example, 2 nodes with 8 GPUs each:

Using torch native distributed framework:

# node 0
python main.py --node_rank=0

# node 1
python main.py --node_rank=1
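
# main.py

import argparse

import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

# The __main__ guard keeps spawned child processes from re-running the launcher
if __name__ == "__main__":
    # Read the --node_rank flag passed on the command line above
    parser = argparse.ArgumentParser()
    parser.add_argument("--node_rank", type=int, required=True)
    args = parser.parse_args()

    dist_config = {
        "nproc_per_node": 8,
        "nnodes": 2,
        "node_rank": args.node_rank,
        "master_addr": "master",
        "master_port": 15000,
    }

    with idist.Parallel(backend="nccl", **dist_config) as parallel:
        parallel.run(training, config, a=1, b=2)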
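
For the layout in this example (2 nodes with 8 processes each), the total world size is 16. The sketch below enumerates the global ranks per node, assuming the usual torch convention rank = node_rank * nproc_per_node + local_rank. The helper name global_rank is hypothetical.

```python
def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Assumed torch-style layout: ranks are contiguous within a node."""
    return node_rank * nproc_per_node + local_rank


nproc_per_node, nnodes = 8, 2
world_size = nproc_per_node * nnodes  # total number of processes

for node_rank in range(nnodes):
    ranks = [global_rank(node_rank, nproc_per_node, lr) for lr in range(nproc_per_node)]
    print(f"node {node_rank}: global ranks {ranks[0]}..{ranks[-1]} of {world_size}")
```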

Parameters
  • backend (Optional[str]) – backend to use: nccl, gloo, xla-tpu, horovod. If None, no distributed configuration is set up.

  • nproc_per_node (Optional[int]) – optional argument, number of processes to spawn per node. If not None, run() will spawn nproc_per_node processes that each run the input function with its arguments.

  • nnodes (Optional[int]) – optional argument, number of nodes participating in the distributed configuration. If not None, run() will spawn nproc_per_node processes that each run the input function with its arguments. Total world size is nproc_per_node * nnodes. This option is only supported by the native torch distributed module. For other modules, please set up spawn_kwargs with backend-specific arguments.

  • node_rank (Optional[int]) – optional argument, index of the current machine. Mandatory argument if nnodes is specified and larger than one. This option is only supported by the native torch distributed module. For other modules, please set up spawn_kwargs with backend-specific arguments.

  • master_addr (Optional[str]) – optional argument, master node TCP/IP address for torch native backends (nccl, gloo). Mandatory argument if nnodes is specified and larger than one.

  • master_port (Optional[int]) – optional argument, master node port for torch native backends (nccl, gloo). Mandatory argument if master_addr is specified.

  • spawn_kwargs (Any) – keyword arguments passed to the idist.spawn function.
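
The "mandatory if" rules in the list above can be summarized as a single check. This is only a sketch of the documented constraints: check_dist_args is a hypothetical helper, and the exceptions and messages that ignite actually raises may differ.

```python
from typing import Optional


def check_dist_args(
    nproc_per_node: Optional[int] = None,
    nnodes: Optional[int] = None,
    node_rank: Optional[int] = None,
    master_addr: Optional[str] = None,
    master_port: Optional[int] = None,
) -> int:
    """Hypothetical validator; returns the total world size if arguments are consistent."""
    if nnodes is not None and nnodes > 1:
        # Multi-node runs need this machine's index and the master's address
        if node_rank is None:
            raise ValueError("node_rank is mandatory when nnodes > 1")
        if master_addr is None:
            raise ValueError("master_addr is mandatory when nnodes > 1")
    if master_addr is not None and master_port is None:
        raise ValueError("master_port is mandatory when master_addr is given")
    # Total world size is nproc_per_node * nnodes (unset values count as 1 here)
    return (nproc_per_node or 1) * (nnodes or 1)


print(check_dist_args(nproc_per_node=8, nnodes=2, node_rank=0,
                      master_addr="master", master_port=15000))
```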

Return type

None

Changed in version 0.4.2: backend now accepts horovod distributed framework.

Methods

run – Execute func with provided arguments in distributed context.

run(func, *args, **kwargs)[source]

Execute func with provided arguments in distributed context.

Example

def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

with idist.Parallel(backend=backend) as parallel:
    parallel.run(training, config, a=1, b=2)

Parameters
  • func (Callable) – function to execute. The first argument of the function should be local_rank, the local process index.

  • args (Any) – positional arguments of func (without local_rank).

  • kwargs (Any) – keyword arguments of func.
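
The calling convention can be illustrated without any distributed setup. The stand-in below simply loops over local ranks inside a single process. It is not how ignite executes func (which involves real processes), and run_locally is a hypothetical name.

```python
from typing import Any, Callable, List


def run_locally(func: Callable, nproc: int, *args: Any, **kwargs: Any) -> List[Any]:
    """Hypothetical stand-in: call func once per local rank, sequentially."""
    # local_rank is injected as the first argument; args and kwargs follow unchanged
    return [func(local_rank, *args, **kwargs) for local_rank in range(nproc)]


def training(local_rank, config, **kwargs):
    return f"rank {local_rank}: config={config}, extras={kwargs}"


results = run_locally(training, 2, {"lr": 0.1}, a=1, b=2)
for line in results:
    print(line)
```

Note that the caller passes only config and the keyword arguments; local_rank is supplied by the launcher.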

Return type

None