Performance

Distributed simulation

When a circuit is too large for one machine, QXel partitions the statevector across multiple nodes with MPI. Each rank holds a slice of the statevector and exchanges amplitudes with the others as gates are applied. You do not change your circuit code; you launch the same script under mpirun.
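To get a feel for when a circuit outgrows one machine, the arithmetic behind the per-rank slice can be sketched as follows. This is a back-of-the-envelope estimate, not QXel's actual allocation logic: it assumes 16 bytes per amplitude (double-precision complex) and equal slices per rank, and the function names are hypothetical.

```python
def statevector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Total bytes needed for the 2**num_qubits amplitudes of a statevector.

    Assumption: 16 bytes per amplitude (complex128); pass 8 for complex64.
    """
    return (1 << num_qubits) * bytes_per_amplitude


def bytes_per_rank(num_qubits: int, num_ranks: int, bytes_per_amplitude: int = 16) -> int:
    """Bytes held by each rank, assuming the statevector is split into equal slices."""
    total_amplitudes = 1 << num_qubits
    if num_ranks <= 0 or total_amplitudes % num_ranks != 0:
        raise ValueError("num_ranks must be positive and divide 2**num_qubits evenly")
    return (total_amplitudes // num_ranks) * bytes_per_amplitude


if __name__ == "__main__":
    # Print the slice size for a 34-qubit state on 1, 2, and 8 ranks.
    for ranks in (1, 2, 8):
        gib = bytes_per_rank(34, ranks) / 2**30
        print(f"34 qubits on {ranks} rank(s): {gib:.0f} GiB per rank")
```

Under these assumptions a 34-qubit state occupies 256 GiB in total, which drops to 32 GiB per rank when spread over eight ranks. Real runs also need working memory for communication buffers, so treat the result as a lower bound.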

Two parallelism modes

• SPPN (single process per node): one MPI process per node manages all of that node's GPUs. Simplest to launch.
• SPPD (single process per device): one MPI process per GPU. More processes, often better scaling on multi-GPU nodes.
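The difference between the two modes comes down to how many processes run on each node and how many GPUs each process drives. A minimal sketch of that bookkeeping, using a hypothetical helper that is not part of QXel:

```python
def process_layout(mode: str, num_nodes: int, gpus_per_node: int) -> dict:
    """Return the process counts implied by a parallelism mode.

    Hypothetical helper for illustration only; QXel does not expose this function.
    """
    mode = mode.upper()
    if mode == "SPPN":
        # One process per node, driving every GPU on that node.
        processes_per_node, gpus_per_process = 1, gpus_per_node
    elif mode == "SPPD":
        # One process per GPU.
        processes_per_node, gpus_per_process = gpus_per_node, 1
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "processes_per_node": processes_per_node,
        "gpus_per_process": gpus_per_process,
        "total_processes": processes_per_node * num_nodes,
    }


if __name__ == "__main__":
    print(process_layout("SPPN", num_nodes=2, gpus_per_node=4))
    print(process_layout("SPPD", num_nodes=2, gpus_per_node=4))
```

For two nodes with four GPUs each, SPPN gives two processes in total and SPPD gives eight.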

Launching

Run the same Python script under mpirun. The --host argument lists each host and how many processes to start on it. Your script builds and runs the circuit exactly as on a single node; QXel detects the MPI world and distributes automatically.
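If you launch from a wrapper script, the --host value can be assembled from a list of host names and a per-host process count. The helper below is a hypothetical convenience, not a QXel API; it only builds the command line and leaves execution to you.

```python
import shlex
from typing import List, Sequence


def build_mpirun_command(
    hosts: Sequence[str],
    processes_per_host: int,
    script: str,
    extra_flags: Sequence[str] = (),
) -> List[str]:
    """Build an mpirun argument list of the form used on this page.

    Hypothetical helper: host names, script path, and extra flags are supplied
    by the caller and are not validated against the cluster.
    """
    if processes_per_host < 1:
        raise ValueError("processes_per_host must be at least 1")
    # --host takes a comma-separated list of host:process_count entries.
    host_arg = ",".join(f"{host}:{processes_per_host}" for host in hosts)
    return ["mpirun", *extra_flags, "--host", host_arg, "python", script]


if __name__ == "__main__":
    # One process per node (SPPN) on two placeholder hosts.
    cmd = build_mpirun_command(["node0", "node1"], 1, "your_script.py")
    print(shlex.join(cmd))
    # To actually launch: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string keeps the command safe to pass to subprocess.run without shell quoting issues.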

SPPN across two nodes (one process each):

```bash
mpirun --host node0:1,node1:1 python your_script.py
```

SPPD across two nodes with four GPUs each (four processes per node, eight total):

```bash
mpirun --bind-to none --host node0:4,node1:4 python your_script.py
```

Note: Distributed runs use GPU kernels, so set compute_type='cuda' in your script. QXel uses Open MPI with UCX for inter-node communication (tested with Open MPI 5.0.7 + UCX 1.18.0). See the repository's Development guide for building UCX and MPI with CUDA support. On QXel SaaS this is handled for you, so you only need to request multiple nodes.
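When debugging a launch, it can help to confirm that each process really was started by Open MPI with the rank and world size you expect. The sketch below reads the environment variables that Open MPI's mpirun sets for each process; it is independent of QXel, and the function name is hypothetical.

```python
import os
from typing import Mapping, Optional


def mpi_world_from_env(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return rank information exported by Open MPI's mpirun, or None if absent.

    Only Open MPI's OMPI_COMM_WORLD_* variables are checked; other MPI
    implementations use different names and will yield None here.
    """
    try:
        return {
            "rank": int(env["OMPI_COMM_WORLD_RANK"]),
            "size": int(env["OMPI_COMM_WORLD_SIZE"]),
            "local_rank": int(env["OMPI_COMM_WORLD_LOCAL_RANK"]),
        }
    except KeyError:
        # At least one variable is missing: not launched by Open MPI's mpirun.
        return None


if __name__ == "__main__":
    world = mpi_world_from_env()
    if world is None:
        print("Not running under Open MPI's mpirun")
    else:
        print(f"rank {world['rank']} of {world['size']} (local rank {world['local_rank']})")
```

Running this under the SPPD command above should report eight ranks, with local ranks 0 to 3 on each node.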