Torch Distributed Multi-GPU FWI

Source file:

  • examples/multi-gpu/torch/fwi_marmousi_dist.py

This script is a data-parallel version of the 2D acoustic Marmousi Torch FWI example. It uses torch.distributed with one process per GPU.

Each rank keeps its own copy of the velocity model on one GPU. At each iteration:

  1. rank 0 samples a global shot batch and broadcasts the shot indices
  2. each rank receives a subset of those shots
  3. each rank computes its local synthetic data and local loss contribution
  4. model gradients are summed with torch.distributed.all_reduce
  5. every rank applies the same optimizer step

The model is a plain inversion tensor, so the script does not wrap a module in DistributedDataParallel. The only state synchronized across ranks is the gradient, the loss summary, the shot indices, and the optional illumination panels used for visualization.
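
The five numbered steps can be simulated in a single process, with "ranks" as list entries and broadcast/all_reduce modeled explicitly. This is a torch-free sketch of the communication pattern only; all names are illustrative and not taken from the script:

```python
import random

WORLD_SIZE = 2
N_SHOTS = 8
BATCH = 4
observed = [float(i) for i in range(N_SHOTS)]   # stand-in for observed data
models = [5.0 for _ in range(WORLD_SIZE)]       # replicated scalar "model"

rng = random.Random(0)

for epoch in range(3):
    # 1. rank 0 samples a global shot batch; 2. broadcast it to every rank
    shot_idx = rng.sample(range(N_SHOTS), BATCH)
    # 3. each rank handles its slice of the batch and forms a local gradient;
    #    toy misfit per shot: (m - d_i)**2, gradient 2*(m - d_i),
    #    normalized by the *global* batch size
    per_rank = BATCH // WORLD_SIZE
    grads = []
    for rank in range(WORLD_SIZE):
        local = shot_idx[rank * per_rank:(rank + 1) * per_rank]
        grads.append(sum(2.0 * (models[rank] - observed[i]) for i in local) / BATCH)
    # 4. all_reduce(SUM) over the per-rank gradients
    g = sum(grads)
    # 5. identical optimizer step (plain gradient descent here) on every rank
    models = [m - 0.1 * g for m in models]

# every rank ends each epoch holding the same model copy
assert all(m == models[0] for m in models)
```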

How the Split Works

torchrun starts one Python process per GPU. Each process receives environment variables such as RANK, WORLD_SIZE, and LOCAL_RANK. The example maps LOCAL_RANK to a CUDA device and initializes NCCL:

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
device = torch.device(f"cuda:{local_rank}")

The global inversion model is replicated on every rank:

inv_vp = torch.from_numpy(init_model).to(device).requires_grad_(True)
optimizer = torch.optim.Adam([inv_vp], lr=cfg["lr"], eps=1e-22)

At each epoch, rank 0 chooses the global shot batch and broadcasts the selected shot indices:

if rank == 0:
    shot_idx = sample_global_batch(...)
else:
    shot_idx = torch.empty(batchsize, dtype=torch.int64, device=device)
dist.broadcast(shot_idx, src=0)

The selected shot indices are then split across ranks. For example, with WORLD_SIZE=4 and batchsize=8, each rank receives two shots:

local_idx = torch.tensor_split(shot_idx, world_size)[rank]
syn = solver(wave, sources[local_idx], receivers[local_idx], models=[inv_vp])
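
When given an integer, torch.tensor_split divides along dim 0 into chunks whose sizes differ by at most one, with the larger chunks first, so it also handles batch sizes that do not divide evenly. The chunk sizes can be sketched without torch (the helper name is hypothetical):

```python
def tensor_split_sizes(n, world_size):
    """Chunk sizes produced by torch.tensor_split(x, world_size) for len(x) == n."""
    base, rem = divmod(n, world_size)
    # the first `rem` chunks get one extra element
    return [base + 1 if r < rem else base for r in range(world_size)]

# 8 shots over 4 ranks: an even split of two shots per rank
print(tensor_split_sizes(8, 4))   # [2, 2, 2, 2]
# 10 shots over 4 ranks: earlier ranks absorb the remainder
print(tensor_split_sizes(10, 4))  # [3, 3, 2, 2]
```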

Each rank computes only its local residual and backpropagates a loss normalized by the global selected-shot batch size. After local backward, the model gradient is summed across GPUs:

local_loss.backward()
dist.all_reduce(inv_vp.grad, op=dist.ReduceOp.SUM)
optimizer.step()

Because every rank starts from the same inv_vp, receives the same reduced gradient, and uses the same optimizer settings, all ranks keep identical model copies after each optimizer step.
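
The normalization claim can be checked with a toy quadratic misfit: summing per-rank gradients of local losses divided by the *global* batch size reproduces the gradient of the global mean loss, which is exactly what all_reduce(SUM) delivers. All numbers below are illustrative:

```python
# Toy misfit per shot: loss_i(m) = (m - d_i)**2, analytic gradient 2*(m - d_i)
m = 3.0
data = [0.5, 1.0, 2.0, 4.0, 4.5, 5.0, 6.0, 7.0]  # global batch of 8 shots
batch = len(data)

# Single-process reference: gradient of the mean loss over the full batch
ref_grad = sum(2.0 * (m - d) for d in data) / batch

# Two "ranks", each holding half the shots, each normalizing its local loss
# by the global batch size, followed by an all_reduce(SUM) of the gradients
rank_shots = [data[:4], data[4:]]
local_grads = [sum(2.0 * (m - d) for d in shots) / batch for shots in rank_shots]
reduced = sum(local_grads)

assert abs(reduced - ref_grad) < 1e-12
```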

Run

Use torchrun from the repository root. For two GPUs:

torchrun --standalone --nproc_per_node=2 \
  examples/multi-gpu/torch/fwi_marmousi_dist.py --backend cuda

The script also accepts --backend eager, which still uses one distributed process per CUDA device but runs each local propagation through the eager PropTorch backend.

Outputs

Only rank 0 writes figures. Outputs are saved under:

  • examples/multi-gpu/torch/multi_gpu_acoustic_fwi_cuda/
  • examples/multi-gpu/torch/multi_gpu_acoustic_fwi_eager/

Saved files include:

  • ricker.png
  • observed_data.png
  • loss.png
  • epoch_XXXX.png

The progress panel includes the true model, inverted model, gradient, source illumination, and receiver illumination. Illumination is shown only for inspection; it is not used by the optimizer.

Notes

  • Launch with one process per GPU.
  • The default CUDA path uses boundary saving through PropTorch(..., backend="cuda").
  • Rank 0 generates observed data once, then broadcasts it to the other ranks.
  • The loss is normalized by the global selected-shot batch, so ranks with different local shot counts still contribute the correct gradient scale.
  • If GPU memory is tight, reduce --batchsize or use fewer receivers in the shared Marmousi configuration.