VASP stops MLFF training without apparent reason

eduardoarielmenendezproupin
Newbie
Posts: 1
Joined: Wed May 19, 2021 10:50 am

VASP stops MLFF training without apparent reason

#1 Post by eduardoarielmenendezproupin » Thu Aug 08, 2024 12:55 pm

Hi,
VASP is stopping a molecular dynamics training run with MLFF after a small number of steps for no apparent reason, at least to me. I set NSW=6000, but VASP stops at step 17, and I see no indication of why in the output files OUTCAR and ML_LOGFILE.
Here are the input and output files (with POTCAR deleted):
Run2b
https://uses0-my.sharepoint.com/:f:/g/p ... g?e=SL8NB9

I had made a previous run that stopped at step 133 with the same input files; the only difference was that the INCAR file contained a RANDOM_SEED.

Run2
https://uses0-my.sharepoint.com/:f:/g/p ... A?e=UOcpgX

Both runs are continuations of a previous run that ended at the time limit after more than 100 steps. I just copied ML_ABN to ML_AB and CONTCAR to POSCAR in Run2/ and Run2b/. That initial run is here:
Run1-22jul
https://uses0-my.sharepoint.com/:f:/g/p ... g?e=GCgldS

Thank you

Eduardo Menendez-Proupin
University of Seville

ferenc_karsai
Global Moderator
Posts: 459
Joined: Mon Nov 04, 2019 12:44 pm

Re: VASP stops MLFF training without apparent reason

#2 Post by ferenc_karsai » Fri Aug 16, 2024 3:03 pm

I could not run your job on one of our machines with 512 GByte because it ran out of memory. I then estimated that with your current parameters the design matrix alone needs around 950 GByte when fully filled. With the rest of the arrays needed for the machine learning and the VASP ab-initio arrays, the total required memory for the job will be clearly above 1 TByte.
The operating system usually does lazy allocation: when arrays are allocated, the system only records that the memory will be required, and the pages are only touched when the arrays are actually filled. This way, actually running out of memory is delayed to the point where the matrices are filled, instead of happening at the allocation itself. The symptoms of running out of memory can vary: I have seen the code crash, but I have also seen it get stuck during the allocation of some helper arrays. This could be the case in your example, but that is just speculation on my part.
Be careful: I just realized that the memory estimation at the beginning of the ML_LOGFILE is broken for ML_MODE=TRAIN when starting from an ML_AB file (continuation run). It will be fixed in the next version!
Depending on how much memory you have, you should set ML_MCONF and ML_MB to smaller values. If you don't specify anything, ML_MCONF is set to the number of training structures from the ML_AB file plus MIN(NSW,1500). Please have a look at this VASP wiki page:
wiki/index.php/Best_practices_for_machi ... lculations.
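
For concreteness, a minimal sketch of how those caps could look in the continuation INCAR. The numbers below are purely illustrative, not a recommendation for your system; you have to choose them from your available memory, and ML_MCONF still needs to be large enough to hold the training structures already present in your ML_AB:

Code: Select all

ML_MODE  = TRAIN
ML_MCONF = 800    ! illustrative: max. number of training structures held in memory
ML_MB    = 2000   ! illustrative: max. number of local reference configurations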
You have a system with 7 atom types and 760 atoms in the unit cell. Both are considered huge for on-the-fly MLFF and for DFT. You most likely can't do much about the number of atom types, but you could try to reduce the number of atoms in the system.

Another thing: we saw that you have a lot of hydrogen atoms in the system. Although you set POTIM=0.5, they may still move too fast, so please try to increase their mass by a factor of 4-8 (wiki/index.php/POMASS); see the sketch below.
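
As a sketch of how that looks in the INCAR: POMASS takes one value per species, in the same order as in POSCAR/POTCAR, and overrides the masses read from the POTCAR. The species order and masses below are hypothetical placeholders for a 7-species system; only the hydrogen entry (scaled by a factor of 4) is the point:

Code: Select all

! hypothetical 7-species example; use your actual POSCAR/POTCAR order
! only the hydrogen mass is changed (1.008 -> 4.032, a factor of 4)
POMASS = 4.032 12.011 14.007 15.999 32.066 126.904 207.2

Note that the masses only affect the dynamics used for sampling; the energies and forces that are learned do not depend on them.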
