
Memory issues for larger systems

Posted: Thu Nov 01, 2012 2:58 am
by bhaskarchilukuri
Hello,

I run my VASP jobs on an AMD Opteron node with 4 processors of 16 cores each (4x16 = 64 cores in total) and 64 GB of memory.

Most calculations run fine on this node. But recently I had to run a larger system, a gold surface with nearly 400 Au atoms, and the job started crashing after a certain number of SCF steps, before completing the first optimization cycle.

From the UNIX log file I figured out that the job runs out of memory after some time.

I ran the same job several times in parallel, using 25, 36, and 48 cores respectively, and monitored how the code used memory in each case.

What I have noticed is that each core uses only about 1 GB of RAM, irrespective of how many cores you pick and how much free memory you have on the node.
For example:
If I run the job with 25 cores, it uses nearly 24-26 GB.
If I run the job with 36 cores, it uses nearly 35-37 GB.
If I run the job with 48 cores, it uses nearly 46-48 GB.
...and so on.

Ultimately, my question is:
How can I get VASP to use the free memory available on the node to finish the job, instead of using only 1 GB per core and crashing after some time?


I am attaching the VASP log file here. It only shows that the job crashes with a very generic UNIX error (signal 9), which by itself doesn't tell you that the job ran out of memory; signal 9 (SIGKILL) is what the kernel's out-of-memory killer sends when it terminates a process. I had to look in the UNIX system log (e.g. dmesg or /var/log/messages) to confirm that the job did run out of memory.

running on 36 nodes
distr: one band on 6 nodes, 6 groups
vasp.5.2.12 11Nov11 complex
POSCAR found type information on POSCAR C
POSCAR found : 1 types and 512 ions
LDA part: xc-table for Ceperly-Alder, standard interpolation
POSCAR, INCAR and KPOINTS ok, starting setup
resort distribution
[the line above is printed 36 times in total, once per core; the repetitions are omitted here]
WARNING: small aliasing (wrap around) errors must be expected
FFT: planning ...( 14 )
WAVECAR not read
entering main loop
N E dE d eps ncg rms rms(c)
RMM: 1 0.317052588125E+05 0.31705E+05 -0.71846E+05 5136 0.126E+03
RMM: 2 0.100362415386E+05 -0.21669E+05 -0.24838E+05 5136 0.307E+02
RMM: 3 0.236375640794E+04 -0.76725E+04 -0.10883E+05 5136 0.245E+02
RMM: 4 -0.240864746685E+04 -0.47724E+04 -0.47982E+04 5136 0.197E+02
RMM: 5 -0.429003846826E+04 -0.18814E+04 -0.16881E+04 5136 0.126E+02
RMM: 6 -0.505042959014E+04 -0.76039E+03 -0.64604E+03 5136 0.985E+01
RMM: 7 -0.532618844136E+04 -0.27576E+03 -0.26137E+03 5136 0.606E+01
RMM: 8 -0.545307140609E+04 -0.12688E+03 -0.11234E+03 5136 0.449E+01
RMM: 9 -0.554774843934E+04 -0.94677E+02 -0.92157E+02 12242 0.271E+01
RMM: 10 -0.555209501201E+04 -0.43466E+01 -0.51133E+01 13531 0.406E+00
RMM: 11 -0.555241417275E+04 -0.31916E+00 -0.19683E+00 12445 0.114E+00
RMM: 12 -0.555244832743E+04 -0.34155E-01 -0.28424E-01 13107 0.285E-01 0.109E+02
RMM: 13 -0.527612620274E+04 0.27632E+03 -0.27108E+02 10277 0.146E+01 0.592E+01
RMM: 14 -0.518459214017E+04 0.91534E+02 -0.41872E+02 10297 0.199E+01 0.874E+00
RMM: 15 -0.518386255369E+04 0.72959E+00 -0.10694E+01 10917 0.406E+00 0.122E+00
RMM: 16 -0.518388532905E+04 -0.22775E-01 -0.13082E+00 11857 0.893E-01 0.102E+00
RMM: 17 -0.518389338419E+04 -0.80551E-02 -0.12588E-01 10309 0.388E-01 0.641E-01
RMM: 18 -0.518387677956E+04 0.16605E-01 -0.42371E-02 10363 0.160E-01 0.270E-01
RMM: 19 -0.518389993534E+04 -0.23156E-01 -0.55966E-02 10280 0.162E-01 0.217E-01
RMM: 20 -0.518390873521E+04 -0.87999E-02 -0.13270E-02 10327 0.103E-01 0.245E-01
RMM: 21 -0.518393224705E+04 -0.23512E-01 -0.24408E-02 10292 0.130E-01 0.136E-01
RMM: 22 -0.518393962678E+04 -0.73797E-02 -0.95333E-04 7987 0.406E-02 0.996E-02
RMM: 23 -0.518395248316E+04 -0.12856E-01 -0.31120E-03 10272 0.510E-02 0.346E-02
RMM: 24 -0.518395844731E+04 -0.59641E-02 -0.56841E-04 7197 0.165E-02 0.260E-02
RMM: 25 -0.518395976503E+04 -0.13177E-02 -0.88745E-05 6256 0.112E-02 0.847E-03
RMM: 26 -0.518395999067E+04 -0.22564E-03 -0.24774E-05 6170 0.628E-03 0.532E-03
RMM: 27 -0.518396020845E+04 -0.21778E-03 -0.13539E-05 5559 0.455E-03 0.270E-03
RMM: 28 -0.518396027706E+04 -0.68612E-04 -0.76265E-06 4584 0.362E-03

=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)



Let me know if you have any suggestions.
Thank you in advance.

Memory issues for larger systems

Posted: Thu Nov 22, 2012 4:48 pm
by huanxiong
We have also encountered the same error:
VASP dies right at the point where "aborting loop because EDIFF is reached" is printed.
Both vasp 5.2.12 and 5.3.2 have this problem, while vasp 5.2.2 and vasp 4.6.32 do not.

Memory issues for larger systems

Posted: Mon Nov 26, 2012 8:27 am
by bill686433
I have the same problem but I don't know how to solve it. Do you still have this problem, or have you solved it?

Memory issues for larger systems

Posted: Mon Nov 26, 2012 12:09 pm
by juhL
The reason your jobs break is the allocation of huge arrays after the first geometry step (NSTEP==1). This is done in order to call the PEAD routine, which calculates the response to electric fields. In most cases, however, you don't need this, i.e., you allocate those arrays, go into the subroutine, do nothing, and deallocate the arrays again. So as long as you don't need it, simply patch this section (in main.F) in such a way that the call is only executed if the respective flags are set in the INCAR file, rather than unconditionally after the first step (NSTEP==1).

This of course does not solve all memory issues with VASP, but for this particular problem,

"calculation runs with vasp 4.X, but breaks after the first iteration cycle with vasp 5.X",

this does the trick, since it is the initialization of the large arrays that kills your calculation.
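
To illustrate the idea, here is a minimal sketch of such a guard. The flag name LPEAD below is only illustrative, standing for whatever logical VASP sets when the electric-field response is requested in the INCAR file (e.g. via the LPEAD or LCALCEPS tags); check main.F for the actual variable.

! Sketch only: enter the PEAD branch, and hence allocate the large
! work arrays, only when the electric-field response has actually
! been requested in the INCAR file.
! "LPEAD" is an illustrative name, not necessarily the real flag.
IF (NSTEP==1 .AND. LPEAD) THEN
   ! ... allocate W_F, W_G, CHAM, CHF as in the original code ...
   ! ... CALL PEAD_ELMIN(...) with the original argument list ...
   ! ... deallocate the arrays again ...
ENDIF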

Memory issues for larger systems

Posted: Thu Nov 29, 2012 2:45 pm
by huanxiong
Thanks for the suggestions, juhL.
We found that the problem can be temporarily worked around by modifying the following lines (around line 2636) in main.F:
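! The allocation block before the PEAD_ELMIN call and the
! deallocation block after it are the lines we commented out;
! the CALL PEAD_ELMIN itself is left in place.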
IF (NSTEP==1) THEN
!  IF (.NOT.INFO%LONESW) THEN
!     CALL ALLOCW(WDES,W_F,WTMP,WTMP)
!     CALL ALLOCW(WDES,W_G,WTMP,WTMP)
!     DEALLOCATE(CHAM, CHF)
!     ALLOCATE(CHAM(WDES%NB_TOT,WDES%NB_TOT,WDES%NKPTS,WDES%ISPIN), &
!              CHF (WDES%NB_TOT,WDES%NB_TOT,WDES%NKPTS,WDES%ISPIN))
!!    ! setup for scf subspace rotation
!!    INFO%LONESW=.TRUE.
!!    CALL SETUP_SUBROT_SCF(INFO,WDES,LATT_CUR,GRID,GRIDC,GRID_SOFT,SOFT_TO_C,IO%IU0,IO%IU5,IO%IU6)
!!    INFO%LONESW=.FALSE.
!  ENDIF
   CALL PEAD_ELMIN( &
        HAMILTONIAN,KINEDEN, &
        P,WDES,NONLR_S,NONL_S,W,W_F,W_G,LATT_CUR,LATT_INI, &
        T_INFO,DYN,INFO,IO,MIX,KPOINTS,SYMM,GRID,GRID_SOFT, &
        GRIDC,GRIDB,GRIDUS,C_TO_US,B_TO_C,SOFT_TO_C,E, &
        CHTOT,CHTOTL,DENCOR,CVTOT,CSTRF, &
        CDIJ,CQIJ,CRHODE,N_MIX_PAW,RHOLM,RHOLM_LAST, &
        CHDEN,SV,DOS,DOSI,CHF,CHAM,ECONV, &
        NSTEP,LMDIM,IRDMAX,NEDOS, &
        TOTEN,EFERMI,LDIMP,LMDIMP)
!  IF (.NOT.INFO%LONESW) THEN
!     CALL DEALLOCW(W_F)
!     CALL DEALLOCW(W_G)
!     DEALLOCATE(CHAM, CHF)
!     ALLOCATE(CHAM(1,1,1,1),CHF (1,1,1,1))
!  ENDIF
ENDIF
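
With these changes the large W_F, W_G, CHAM and CHF work arrays are no longer allocated before the call; since PEAD_ELMIN does nothing when no electric-field response is requested (as juhL described), the arrays are never touched and the out-of-memory crash after the first ionic step is avoided.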