SIGSEGV: Segmentation fault, when training a GAP model with big train dataset #696

@MushroomYa

Hi QUIP developer,

I want to use gap_fit to train a GAP model. Here is my job script:

#!/bin/bash
#SBATCH --time=71:59:59
#SBATCH --partition=hugemem
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=40
#SBATCH --mem=1400G
#SBATCH --job-name=GAPtrain
module restore myenv
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_STACKSIZE=10000
ulimit -s unlimited
~/software/QUIP/build/linux_x86_64_gfortran_openmp/gap_fit atoms_filename=train.xyz e0={'C':0.0} energy_parameter_name=energy \
    force_parameter_name=forces virial_parameter_name=virial sparse_jitter=1.0e-8 default_sigma={0.002 0.2 0.2 0.0} \
    do_copy_at_file=F sparse_separate_file=T gp_file=GAP.xml \
    gap={distance_Nb order=2 \
                 cutoff=5.0 \
                 covariance_type=ARD_SE \
                 theta_uniform=1.0 \
                 n_sparse=20 \
                 delta=1.0}

The output is below:

Restoring modules from user's myenv
libAtoms::Hello World: 2025-12-13 23:32:33
libAtoms::Hello World: git version  https://github.com/libAtoms/QUIP.git,v0.9.14-38-ga4523cb0f-dirty
libAtoms::Hello World: QUIP_ARCH    linux_x86_64_gfortran_openmp
libAtoms::Hello World: compiled on  Dec 13 2025 at 22:54:35
libAtoms::Hello World: OpenMP parallelisation with 40 threads
libAtoms::Hello World: OMP_STACKSIZE=10000
libAtoms::Hello World: Random Seed = 84753448
libAtoms::Hello World: global verbosity = 0

Calls to system_timer will do nothing by default


================================ Input parameters ==============================

config_file =
atoms_filename = train.xyz
at_file = //MANDATORY//
gap = "distance_Nb order=2 cutoff=5.0 covariance_type=ARD_SE theta_uniform=1.0 n_sparse=20 delta=1.0"
e0 = C:0.0
local_property0 = 0.0
e0_offset = 0.0
e0_method = isolated
default_kernel_regularisation = //MANDATORY//
default_sigma = "0.002 0.2 0.2 0.0"
default_kernel_regularisation_local_property = 0.001
default_local_property_sigma = 0.001
sparse_jitter = 1.0e-8
hessian_displacement = 1.0e-2
hessian_delta = 1.0e-2
baseline_param_filename = quip_params.xml
core_param_file = quip_params.xml
baseline_ip_args =
core_ip_args =
energy_parameter_name = energy
local_property_parameter_name = local_property
force_parameter_name = forces
virial_parameter_name = virial
stress_parameter_name = stress
hessian_parameter_name = hessian
config_type_parameter_name = config_type
kernel_regularisation_parameter_name = sigma
sigma_parameter_name = sigma
force_mask_parameter_name = force_mask
local_property_mask_parameter_name = local_property_mask
parameter_name_prefix =
config_type_kernel_regularisation =
config_type_sigma =
kernel_regularisation_is_per_atom = T
sigma_per_atom = T
do_copy_atoms_file = T
do_copy_at_file = F
sparse_separate_file = T
sparse_use_actual_gpcov = F
gap_file = gap_new.xml
gp_file = GAP.xml
verbosity = NORMAL
rnd_seed = -1
openmp_chunk_size = 0
do_ip_timing = F
template_file = template.xyz
sparsify_only_no_fit = F
dryrun = F
condition_number_norm =
linear_system_dump_file =
mpi_blocksize_rows = 0
mpi_blocksize_cols = 100
mpi_print_all = F
export_covariance = F

========================================  ======================================


============== Gaussian Approximation Potentials - Database fitting ============


Initial parsing of command line arguments finished.
Found 1 GAPs.
Descriptors have been parsed
XYZ file read
Old GAP: {distance_Nb order=2 cutoff=5.0 covariance_type=ARD_SE theta_uniform=1.0 n_sparse=20 delta=1.0}
New GAP: {distance_Nb order=2 cutoff=5.0 covariance_type=ARD_SE theta_uniform=1.0 n_sparse=20 delta=1.0 Z={6 6 }}
Multispecies support added where requested

===================== Report on number of descriptors found ====================

---------------------------------------------------------------------
Descriptor 1: distance_Nb order=2 cutoff=5.0 covariance_type=ARD_SE theta_uniform=1.0 n_sparse=20 delta=1.0 Z={6 6 }
Number of descriptors:                        77623148
Number of partial derivatives of descriptors: 1397216664

========================================  ======================================


========================= Memory Estimate (per process) ========================

Descriptors
Descriptor 1 :: x 1 77623148 memory 620 MB
Descriptor 1 :: xPrime 1 1397216664 memory 11 GB
Subtotal 11 GB

Covariances
yY 20 2537171 memory 405 MB * 2
yy 20 20 memory 3200  B
A 20 2537191 memory 405 MB * 2
Subtotal 1623 MB

Peak1 12 GB
Peak2 1623 MB
PEAK  12 GB

Free system memory  1612 GB
Total system memory 1622 GB

========================================  ======================================


========== Report on number of target properties found in training XYZ: ========

Number of target energies (property name: energy) found: 2033
Number of target local_properties (property name: local_property) found: 0
Number of target forces (property name: forces) found: 2522940
Number of target virials (property name: virial) found: 12198
Number of target Hessian eigenvalues (property name: hessian) found: 0

================================= End of report ================================

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Backtrace for this error:

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x15552e315ba0 in ???
#0  0x15552e315ba0 in ???
#1  0x15552e314dd5 in ???
#2  0x15552d79d5af in ???
#3  0x716532 in ???
#4  0x15552f053915 in ???
#5  0x15552db2e1c9 in ???
#6  0x15552d7888d2 in ???
#7  0xffffffffffffffff in ???
/var/spool/slurmd/job30932840/slurm_script: line 24: 3005521 Segmentation fault      /software/QUIP/build/linux_x86_64_gfortran_openmp/gap_fit atoms_filename=train.xyz e0={'C':0.0} energy_parameter_name=energy force_parameter_name=forces virial_parameter_name=virial sparse_jitter=1.0e-8 default_sigma={0.002 0.2 0.2 0.0} do_copy_at_file=F sparse_separate_file=T gp_file=GAP.xml gap={distance_Nb order=2 cutoff=5.0 covariance_type=ARD_SE theta_uniform=1.0 n_sparse=20 delta=1.0}

Could you help me to solve this problem? Thank you so much. If you need any further information, please let me know.
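
For reference, the covariance sizes in the memory estimate above appear consistent with the target counts in the report: 2033 energies + 2522940 forces + 12198 virials gives the 2537171 columns of yY. The sketch below redoes that arithmetic. It assumes 8-byte reals and decimal megabytes (1 MB = 10^6 bytes, truncated); neither is stated in the output, and the variable names are only illustrative. Either way, the estimated peak of 12 GB is far below the 1400G requested for the job.

```bash
#!/bin/bash
# Assumption (not stated in the gap_fit output): every matrix element is an
# 8-byte double, and sizes are printed in decimal units, truncated.
bytes_per_real=8

# Counts copied from the log above.
n_energies=2033
n_forces=2522940
n_virials=12198
n_sparse=20
n_descriptors=77623148

# Total number of target values (columns of yY in the log).
n_targets=$(( n_energies + n_forces + n_virials ))

# Sizes of the covariance blocks, in bytes.
yY_bytes=$(( n_sparse * n_targets * bytes_per_real ))
yy_bytes=$(( n_sparse * n_sparse * bytes_per_real ))
A_bytes=$(( n_sparse * (n_targets + n_sparse) * bytes_per_real ))

# The log lists yY and A as "* 2", so both are counted twice here.
cov_bytes=$(( 2 * yY_bytes + yy_bytes + 2 * A_bytes ))

# Descriptor array x (dimension 1 per descriptor).
x_bytes=$(( n_descriptors * bytes_per_real ))

echo "targets:              $n_targets"
echo "yY:                   $(( yY_bytes / 1000000 )) MB"
echo "yy:                   $yy_bytes B"
echo "A:                    $(( A_bytes / 1000000 )) MB"
echo "covariance subtotal:  $(( cov_bytes / 1000000 )) MB"
echo "descriptor x:         $(( x_bytes / 1000000 )) MB"
```

Under these assumptions the numbers match the "Covariances" block and the "Descriptor 1 :: x" line of the estimate.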
