# CUDA-accelerated Grimme's D3

## Overview
We support a GPU-accelerated implementation of Grimme's D3 dispersion (van der Waals) correction using CUDA, which can be used with ASE and LAMMPS in conjunction with SevenNet. It follows [Grimme's D3 method](https://doi.org/10.1063/1.3382344) and was ported from the [original Fortran code](https://www.chemie.uni-bonn.de/grimme/de/software/dft-d3). Although the D3 method is significantly faster than DFT, existing CPU implementations were slower than SevenNet. To address this, we adopted CUDA and single-precision (FP32) operations to accelerate the code.

:::{caution}
Currently, the D3 implementation does not support multi-GPU or multi-core parallelism.
:::
:::{caution}
The implementation requires a GPU with a [compute capability](https://developer.nvidia.com/cuda/gpus) of **at least 6.0**.
The target compute capabilities follow the LibTorch settings, except that 5.0 is excluded.
:::


## Usage
### ASE

The SevenNet package provides two ASE calculators: `D3Calculator()` and `SevenNetD3Calculator()`. See {doc}`./ase_calculator` for their usage.

### Install and Usage of GPU-D3 in LAMMPS

To use the LAMMPS `d3` pair style, you need to patch `CMakeLists.txt` and add the pair-style source code to the LAMMPS source tree. Detailed instructions for this procedure and its usage can be found in {doc}`lammps_torch` or {doc}`lammps_mliap`.

## Input parameters for D3 dispersion

This section explains the input parameters required by the `d3` pair style itself. Note that, in practice, the D3 dispersion term is almost always combined with another classical or machine-learning interatomic potential via the LAMMPS `pair_style hybrid/overlay` command.

The `d3` pair style uses the following syntax:
```
pair_style d3  {sq_r_cut} {sq_r_cn_cut} {type_of_damping} {name_of_functional}
pair_coeff * * {space_separated_chemical_species}
```

### Cutoff radii
`sq_r_cut` and `sq_r_cn_cut` are the **squares of the cutoff radii** for the energy/force calculation and the coordination number, respectively. Units are squared Bohr (Bohr²), where 1 Bohr = 0.52917721 Å.
The default values are `9000` and `1600` Bohr², which correspond to cutoff radii of `50.2022` Å and `21.1671` Å, respectively. These are also the default values used in VASP.[^1]
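
Because the cutoffs are given as squared radii in Bohr² rather than as radii in Å, a small conversion helper can prevent unit mistakes. The following is a minimal sketch; the function names are hypothetical and not part of SevenNet or LAMMPS, and the Bohr constant is the value quoted in this section.

```python
import math

# Conversion factor quoted in this document: 1 Bohr = 0.52917721 Å
BOHR_IN_ANGSTROM = 0.52917721


def angstrom_to_sq_bohr(r_angstrom: float) -> float:
    """Convert a cutoff radius in Å to a squared radius in Bohr²."""
    return (r_angstrom / BOHR_IN_ANGSTROM) ** 2


def sq_bohr_to_angstrom(sq_r_bohr: float) -> float:
    """Convert a squared radius in Bohr² back to a radius in Å."""
    return math.sqrt(sq_r_bohr) * BOHR_IN_ANGSTROM


if __name__ == "__main__":
    # Check the default values against the radii stated in this section.
    for sq_r in (9000, 1600):
        print(sq_r, "Bohr^2 ->", round(sq_bohr_to_angstrom(sq_r), 4), "Å")
```

For example, `angstrom_to_sq_bohr(21.1671)` returns a value very close to `1600`, the default `sq_r_cn_cut`.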


### Damping type
The available `type_of_damping` values are:
- `damp_zero`: Zero damping
- `damp_bj`: Becke-Johnson (BJ) damping
- `damp_zerom`: Modified version of zero damping
- `damp_bjm`: Modified version of BJ damping

### Available XC functionals
The available `name_of_functional` options include all functionals in the original Fortran code. SevenNet-0 is trained on the PBE functional, so you should specify `pbe` in the script when using it. For the other supported functionals, see the "List of parametrized functionals" on the [DFT-D3 page](https://www.chemie.uni-bonn.de/grimme/de/software/dft-d3). We are also actively collecting and updating D3 parameters for other functionals, such as `r2scan` and `wb97m`.
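
To show how the four parameters and the species list fit into the syntax above, here is a minimal sketch that assembles the two LAMMPS lines as strings. The helper `d3_pair_lines` is hypothetical and not part of SevenNet; the default of `damp_bj` is an arbitrary choice made for this illustration, and the functional name is not validated because the full list lives in the original Fortran code.

```python
# Damping keywords listed in this document.
DAMPING_TYPES = ("damp_zero", "damp_bj", "damp_zerom", "damp_bjm")


def d3_pair_lines(species, sq_r_cut=9000, sq_r_cn_cut=1600,
                  damping="damp_bj", functional="pbe"):
    """Return the `pair_style` and `pair_coeff` lines for the `d3` pair style.

    `species` is a sequence of chemical symbols. The cutoffs are squared
    radii in Bohr^2 (the defaults are the documented ones).
    """
    if damping not in DAMPING_TYPES:
        raise ValueError(f"unknown damping type: {damping!r}")
    if not species:
        raise ValueError("at least one chemical species is required")
    pair_style = f"pair_style d3 {sq_r_cut} {sq_r_cn_cut} {damping} {functional}"
    pair_coeff = "pair_coeff * * " + " ".join(species)
    return [pair_style, pair_coeff]


if __name__ == "__main__":
    # Print the two lines for a hypothetical C/H/O system.
    print("\n".join(d3_pair_lines(["C", "H", "O"])))
```

When combining `d3` with another potential, these lines need to be adapted to the hybrid syntax described in the linked LAMMPS pages.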

## Cautions
- Selective (or no) periodic boundary conditions: implemented, but only the fully periodic and non-periodic cases can be checked against the original Fortran code; selective PBC cannot.
- 3-body term and n > 8 terms: not implemented (same as VASP).
- With a small number of atoms, it can be slower than a CPU implementation.
- The maximum number of atoms that can be calculated is 46,340 (overflow issue).
- Small numerical errors can occur.
  - The introduction of some FP32 operations can lead to minor numerical errors, particularly in pressure calculations, but these are generally smaller than those seen with SevenNet.
  - If the error is too large, ensure that the `fmad=false` option in `patch_lammps.sh` is correctly applied during build.
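
The 46,340-atom limit above coincides with the largest integer whose square still fits in a signed 32-bit integer. This document does not state which quantity overflows, so reading the limit as an N × N count stored in a 32-bit integer is an assumption; the sketch below only checks the arithmetic.

```python
import math

INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer

# Largest N such that N * N does not exceed INT32_MAX.
# Assumption: the limit comes from an N x N quantity held in a 32-bit int.
max_atoms = math.isqrt(INT32_MAX)

print(max_atoms)                         # largest N with N*N <= INT32_MAX
print(max_atoms**2 <= INT32_MAX)         # square still fits
print((max_atoms + 1)**2 > INT32_MAX)    # one more atom would overflow
```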

## Contributors
- Hyungmin An: Ported the original Fortran D3 code to C++ with OpenMP and MPI.
- Gijin Kim: Accelerated the C++ D3 code with OpenACC[^2] and CUDA, and currently maintains it.

[^1]: On the [VASP DFT-D3](https://www.vasp.at/wiki/index.php/DFT-D3) page, the `VDW_RADIUS` and `VDW_CNRADIUS` are `50.2` and `20.0`, respectively (units are Å). However, when running VASP 6.3.2, the default values in the OUTCAR file are `50.2022` and `21.1671`. These values are the same as our defaults.
[^2]: Since OpenACC is not compatible with LibTorch, we chose to use CUDA.