fit <- didgpu(panel, ..., backend = "auto")
10 GPU & performance
The CUDA backend isn’t a gimmick. On a typical event-study panel with a few hundred units, twelve periods, and 500 cluster-bootstrap reps, the speedups look roughly like:
| Estimator | CPU (r) | CUDA | Speedup |
|---|---|---|---|
| didgpu() (dynamic DID) | ~95 s | ~6 s | ~16× |
| didgpu_cs() (CS-DID, DR + cluster boot) | ~210 s | ~9 s | ~23× |
| didgpu_fect() (IFEct + boot) | ~430 s | ~14 s | ~31× |
(Indicative numbers on an RTX PRO Blackwell with CUDA 13.2 and 60k-cell panels. Exact numbers depend on panel shape and bootstrap rep count. See tools/bench-*.R for the reproducible benchmark scripts.)
10.1 When the GPU pays for itself
The fixed cost of a GPU call is non-trivial (~10 ms per kernel launch), so the breakeven point is around a few thousand panel cells or ~50 bootstrap reps, whichever comes first. Below that, the R backend is fine and sometimes faster. Above it, the GPU wins by an order of magnitude.
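The rule of thumb above can be sketched as a small predicate. The helper name `gpu_likely_pays_off()` and the cutoff of 3000 cells are assumptions made for this illustration (the text only says "a few thousand"); nothing here is part of didgpu itself.

```r
# Hypothetical helper, not part of didgpu: encodes the rule of thumb that the
# GPU starts to win at a few thousand panel cells or ~50 bootstrap reps,
# whichever threshold is crossed first.
gpu_likely_pays_off <- function(n_units, n_periods, n_boot,
                                cell_cutoff = 3000,  # assumed value for "a few thousand"
                                boot_cutoff = 50) {
  n_cells <- n_units * n_periods
  n_cells >= cell_cutoff || n_boot >= boot_cutoff
}

# A small job: 1200 cells and 20 reps, below both cutoffs.
small_job <- gpu_likely_pays_off(n_units = 100, n_periods = 12, n_boot = 20)

# A larger job: 3600 cells and 500 reps, above both cutoffs.
large_job <- gpu_likely_pays_off(n_units = 300, n_periods = 12, n_boot = 500)
```

In practice backend = "auto" makes this decision for you; the sketch is only meant to make the breakeven logic concrete.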
10.2 Setup
Detailed CUDA setup instructions (Windows + Linux, including the “DLL relocation” workaround for CUDA 13) live in inst/doc/cuda_setup.md. The short version:
- Install the NVIDIA CUDA Toolkit (≥ 12.0).
- On Linux: nothing else.
- On Windows: install Rtools45 + the MSVC Build Tools (nvcc on Windows invokes cl.exe internally), then either set CUDA_PATH or rely on the default Program Files install location.
- Reinstall didgpu from source. Verify with didgpu::didgpu_has_cuda_support().
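A minimal post-install check might look like the following. It assumes that didgpu_has_cuda_support() returns a single TRUE/FALSE value, which the text implies but does not state; the wrapper `cuda_ready()` is a hypothetical name introduced here so the check does not error when didgpu is missing.

```r
# Report whether a CUDA-enabled didgpu build is available in this R session.
# Returns FALSE (rather than erroring) when didgpu is not installed at all.
# Assumption: didgpu_has_cuda_support() returns a single logical value.
cuda_ready <- function() {
  if (!requireNamespace("didgpu", quietly = TRUE)) {
    return(FALSE)
  }
  isTRUE(didgpu::didgpu_has_cuda_support())
}

# CUDA_PATH only matters on Windows when the toolkit is not in its default
# location; print it so a wrong or missing value is easy to spot.
message("CUDA_PATH: ", Sys.getenv("CUDA_PATH", unset = "<not set>"))
message("CUDA backend available: ", cuda_ready())
```

If the last line reports FALSE after a source reinstall, the detailed notes in inst/doc/cuda_setup.md are the place to look next.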
10.3 Picking a backend
backend = "auto" does the right thing:
- If CUDA is compiled in and the panel is large enough for the GPU to win, it uses CUDA;
- otherwise it falls back to the Rcpp backend (cpu);
- and if that path doesn’t support a feature you asked for (controls, continuous treatment, normalized estimator), it falls back further to the pure-R backend.
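The fallback order in the list above can be written out as a short decision function. This is an illustrative sketch only: `resolve_backend()` and its three logical arguments are hypothetical and do not mirror didgpu's internal code.

```r
# Illustrative only: follows the fallback order described above.
# All three arguments are plain logical flags supplied by the caller.
resolve_backend <- function(cuda_compiled, gpu_worthwhile, cpu_supports_request) {
  if (cuda_compiled && gpu_worthwhile) {
    "cuda"   # GPU build present and the panel is large enough
  } else if (cpu_supports_request) {
    "cpu"    # Rcpp backend
  } else {
    "r"      # pure-R backend covers the remaining features
  }
}

resolve_backend(TRUE,  TRUE,  TRUE)   # "cuda"
resolve_backend(FALSE, TRUE,  TRUE)   # "cpu"
resolve_backend(FALSE, FALSE, FALSE)  # "r"
```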
You can force a backend explicitly with backend = "r", "cpu", or "cuda". The deterministic point estimates are bit-for-bit identical across backends; only the bootstrap RNG stream differs between the GPU and CPU bootstraps, so the bootstrap SEs can differ slightly at finite B and converge as B → ∞.
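To see why different RNG streams matter less as B grows, here is a toy base-R illustration that does not use didgpu at all. Two seeds stand in for the CPU and GPU streams, and the bootstrapped statistic is a plain sample mean; both choices are assumptions made purely for the example.

```r
# Bootstrap SE of a sample mean using B resamples drawn from a given seed.
boot_se <- function(x, B, seed) {
  set.seed(seed)
  means <- replicate(B, mean(sample(x, replace = TRUE)))
  sd(means)
}

set.seed(1)
x <- rnorm(200)  # toy data standing in for an estimator's input

# Two seeds play the role of the CPU and GPU random-number streams.
se_gap <- function(B) abs(boot_se(x, B, seed = 101) - boot_se(x, B, seed = 202))

gap_small <- se_gap(50)    # few reps: the two SEs can differ noticeably
gap_large <- se_gap(5000)  # many reps: both settle near the same value
c(B_50 = gap_small, B_5000 = gap_large)
```

The point estimate (here, mean(x)) is the same regardless of seed; only the resampling noise in the SE depends on the stream.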
10.4 Bookkeeping: which Blackwell card?
If you have a current-generation Blackwell card (RTX PRO Blackwell, RTX 50-series), make sure you’re on didgpu ≥ 0.1.1 — earlier versions silently returned all-zero CUDA results on Blackwell because the kernels were compiled only for sm_75–sm_90. The fix added sm_120 and a compute_120 PTX fallback; this is in every release from 0.1.1 onward.
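A quick way to confirm the installed version is new enough is to compare it against 0.1.1. The helper `has_blackwell_fix()` is a hypothetical name used only for this sketch; the comparison relies on base R's version classes.

```r
# TRUE when a version is new enough to include the sm_120 kernels (>= 0.1.1).
# Hypothetical helper, not part of didgpu.
has_blackwell_fix <- function(version) {
  as.package_version(version) >= "0.1.1"
}

has_blackwell_fix("0.1.0")  # FALSE
has_blackwell_fix("0.1.1")  # TRUE

# Check the installed copy, if there is one.
if (requireNamespace("didgpu", quietly = TRUE)) {
  has_blackwell_fix(utils::packageVersion("didgpu"))
}
```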
A full benchmark chapter with reproducible bench scripts and per-method profiling output is in progress.