Performance
Distributed simulation
When a circuit is too large for one machine, QXel partitions the statevector across multiple nodes with MPI. Each rank holds a slice of the statevector and exchanges amplitudes with the others as gates are applied. You do not change your circuit code; you launch the same script under mpirun.
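To make the slicing concrete, here is a minimal sketch of one common way to partition a statevector across ranks. It assumes a power-of-two number of ranks, equal contiguous slices, and that qubit q corresponds to bit q of the amplitude index. This page does not describe QXel's internal layout, so the layout and the function names `local_slice` and `exchange_partner` are illustrative assumptions, not part of the QXel API.

```python
def _log2_ranks(world_size: int) -> int:
    """Return log2(world_size), requiring a power of two (assumed layout)."""
    if world_size <= 0 or world_size & (world_size - 1):
        raise ValueError("world_size must be a power of two")
    return world_size.bit_length() - 1


def local_slice(num_qubits: int, world_size: int, rank: int) -> range:
    """Global amplitude indices held by `rank` under a contiguous layout."""
    rank_bits = _log2_ranks(world_size)
    if rank_bits > num_qubits:
        raise ValueError("more ranks than amplitudes")
    per_rank = 1 << (num_qubits - rank_bits)
    return range(rank * per_rank, (rank + 1) * per_rank)


def exchange_partner(num_qubits: int, world_size: int, rank: int, target_qubit: int):
    """Rank that must be contacted for a single-qubit gate, or None if local.

    A gate on qubit q pairs amplitude i with amplitude i ^ (1 << q). If q is
    one of the low "local" bits, both amplitudes sit in the same slice.
    Otherwise the partner amplitude lives on the rank whose index differs
    in the corresponding high bit.
    """
    local_qubits = num_qubits - _log2_ranks(world_size)
    if target_qubit < local_qubits:
        return None
    return rank ^ (1 << (target_qubit - local_qubits))


# 4 qubits (16 amplitudes) over 4 ranks: rank 1 holds indices 4..7.
print(list(local_slice(4, 4, 1)))
print(exchange_partner(4, 4, 1, target_qubit=0))  # low qubit: no exchange
print(exchange_partner(4, 4, 1, target_qubit=3))  # high qubit: exchange needed
```

Under these assumptions, gates on low-index qubits need no communication, while gates on the highest qubits trigger an exchange between pairs of ranks. As a rough sizing guide, if each amplitude takes 16 bytes (double-precision complex, also an assumption here), a 34-qubit state needs 256 GiB in total, or 32 GiB per rank when split across 8 ranks.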
Two parallelism modes
• SPPN (single process per node): one MPI process per node manages all of that node's GPUs. Simplest to launch.
• SPPD (single process per device): one MPI process per GPU. More processes, often better scaling on multi-GPU nodes.
Launching
Run the same Python script under mpirun. The --host argument lists each host and how many processes to start on it. Your script builds and runs the circuit exactly as on a single node; QXel detects the MPI world and distributes automatically.
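If you generate launch commands from a job script, the host list can be built programmatically. The sketch below is a hypothetical helper (`mpirun_command` is not part of QXel) that mirrors the two launch patterns this section uses: one process per host for SPPN, and one process per GPU plus `--bind-to none` for SPPD. The host names and GPU count are placeholders.

```python
import shlex


def mpirun_command(hosts, gpus_per_node, mode, script="your_script.py"):
    """Build an mpirun argv list for SPPN or SPPD (hypothetical helper)."""
    mode = mode.upper()
    if mode == "SPPN":
        procs_per_host = 1              # one process drives all GPUs on the node
        extra = []
    elif mode == "SPPD":
        procs_per_host = gpus_per_node  # one process per GPU
        extra = ["--bind-to", "none"]   # as in this section's SPPD example
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # --host takes comma-separated host:count entries.
    host_arg = ",".join(f"{host}:{procs_per_host}" for host in hosts)
    return ["mpirun", *extra, "--host", host_arg, "python", script]


print(shlex.join(mpirun_command(["node0", "node1"], 4, "SPPN")))
print(shlex.join(mpirun_command(["node0", "node1"], 4, "SPPD")))
```

Returning an argv list rather than a string lets you pass the result straight to `subprocess.run` without shell quoting concerns; `shlex.join` is used here only for display.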
SPPN across two nodes (one process each):
bash
mpirun --host node0:1,node1:1 python your_script.py
SPPD across two nodes with four GPUs each (four processes per node, eight total):
bash
mpirun --bind-to none --host node0:4,node1:4 python your_script.py
Note: Distributed runs use GPU kernels, so set compute_type='cuda' in your script. QXel uses OpenMPI with UCX for inter-node communication (tested with OpenMPI 5.0.7 and UCX 1.18.0). See the repository's Development guide for building UCX and MPI with CUDA support. On QXel SaaS this is handled for you, so you only need to request multiple nodes.
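To confirm that mpirun started the processes you expected, a script can inspect the environment variables that Open MPI sets for each launched process. This is an optional diagnostic sketch, not something QXel requires, since QXel detects the MPI world on its own. The function name `ompi_launch_info` is hypothetical; the `OMPI_COMM_WORLD_*` variables are set by Open MPI's launcher.

```python
import os


def ompi_launch_info(env=None):
    """Read rank and size information exported by Open MPI's mpirun.

    Values are None when the process was not started by Open MPI.
    """
    env = os.environ if env is None else env

    def read(name):
        value = env.get(name)
        return int(value) if value is not None else None

    return {
        "rank": read("OMPI_COMM_WORLD_RANK"),              # global rank
        "size": read("OMPI_COMM_WORLD_SIZE"),              # total processes
        "local_rank": read("OMPI_COMM_WORLD_LOCAL_RANK"),  # rank within this node
        "local_size": read("OMPI_COMM_WORLD_LOCAL_SIZE"),  # processes on this node
    }


if __name__ == "__main__":
    info = ompi_launch_info()
    if info["rank"] is None:
        print("Not launched under Open MPI's mpirun")
    else:
        print(f"rank {info['rank']}/{info['size']}, "
              f"local rank {info['local_rank']}/{info['local_size']}")
```

With the commands above, each process should report a local size of 1 in SPPN mode and 4 in SPPD mode on four-GPU nodes.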