Parallelization/MPI problem on Cray XC30/40 (hangs)

ahampel
Global Moderator
Posts: 2
Joined: Tue Feb 16, 2016 11:41 am

Parallelization/MPI problem on Cray XC30/40 (hangs)

#1 Post by ahampel » Wed Feb 24, 2016 4:36 pm

Hi,

Over the last few weeks I have encountered quite weird behaviour of VASP on a Cray XC30 or XC40 machine (CSCS Daint or Dora). I am used to an ordinary OpenMPI cluster (CSCS Monch), where my self-compiled VASP runs very well. When I switched to Dora I first compiled VASP 5.3.5 with intel/15.0.1.133, the Intel FFT (cray-libsci unloaded) and cray-mpich/7.2.2, as suggested by Peter Larsson (https://www.nsc.liu.se/~pla/blog/2015/0 ... cray-xc40/). I did extensive testing and everything looked really good, so I went on to VASP 5.4.1 and compiled it with the same modules. The tests also look really good on a single node. One important test case for me is a relaxation with (S)GGA+U, but here I run into a problem whenever I use more than one node (the number of cores does not matter): the first two relaxation steps run fine, but then everything goes bananas. Sometimes VASP does not find an electronic minimum, and more often it just hangs after the last iteration. The end of the OUTCAR then looks like this:

----------------------------------------- Iteration 3( 14) ---------------------------------------


POTLOK: cpu time 0.0480: real time 0.0462
SETDIJ: cpu time 0.0040: real time 0.0055
EDDAV: cpu time 14.1769: real time 14.2087
DOS: cpu time 0.0240: real time 0.0214

The problem is that it just freezes, so I have to watch my jobs the whole time to see whether they are still working.
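
To avoid watching the jobs by hand, a small watchdog along these lines can at least flag a stalled run. This is only a rough sketch of my own, nothing VASP-specific; watching OSZICAR and the 20-minute threshold are arbitrary assumptions:

# Rough watchdog sketch (my own, not part of VASP): warn when OSZICAR has not
# been touched for a while, since a hung run stops appending to it.
# The file name and the threshold are arbitrary assumptions.
import os
import sys
import time

OSZICAR = "OSZICAR"        # VASP appends a line here after every electronic step
STALL_SECONDS = 20 * 60    # treat 20 minutes of silence as a possible hang

while True:
    try:
        age = time.time() - os.path.getmtime(OSZICAR)
    except OSError:
        age = None         # file not created yet
    if age is not None and age > STALL_SECONDS:
        print("WARNING: OSZICAR untouched for %.0f min - job may be hanging"
              % (age / 60.0), file=sys.stderr)
    time.sleep(60)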

I tried several build setups: an older cray-mpich version, an older Intel compiler version, the FFT from cray-libsci, the -DMPI_barrier_after_bcast flag, only -O1 optimization, and a build without AVX/AVX2 support. I also contacted the CSCS support team; they can reproduce my problem but have no idea what causes it. I am using 5.4.1 with the latest patches. I attach an example makefile and my INCAR and POSCAR files.

I also tried changing NSIM, NCORE and KPAR, but still no luck, and the more nodes I use the worse the problem gets. Has anyone encountered a similar problem? With VASP 5.3.5, or on CSCS Monch or ETH Euler, the same calculation runs fine; on the Cray with 5.4.1 it always hangs at the same step, even when numerical parameters are changed. I also have a traceback for this error, but somehow I am not allowed to upload txt files?
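
For reference, this is how I understand KPAR and NCORE divide the MPI ranks; the small helper below is only my own illustration, not anything taken from the VASP sources:

# My reading of the rank layout (illustration only, not VASP code):
# the total MPI ranks are split into KPAR k-point groups, and each group is
# split further into band groups of NCORE ranks that share the FFTs of one band.
def layout(total_ranks, kpar, ncore):
    assert total_ranks % kpar == 0, "KPAR should divide the total number of ranks"
    ranks_per_kgroup = total_ranks // kpar
    assert ranks_per_kgroup % ncore == 0, "NCORE should divide the ranks per k-point group"
    bands_in_parallel = ranks_per_kgroup // ncore
    return ranks_per_kgroup, bands_in_parallel

# e.g. 8 nodes x 24 tasks = 192 ranks with KPAR=2 and NCORE=24 from my INCAR:
print(layout(192, kpar=2, ncore=24))   # -> (96, 4)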

# Precompiler options

CPP_OPTIONS= -DMPI -DHOST=\"CrayXC-Intel\" \
-DIFC \
-DCACHE_SIZE=32000 \
-DPGF90 \
-DscaLAPACK \
-Davoidalloc \
-DMPI_BLOCK=128000 \
-Duse_collective \
-DnoAugXCmeta \
-Duse_bse_te \
-Duse_shmem \
-Dtbdyn \
-DVASP2WANNIER90 \
-DMPI_barrier_after_bcast

CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)

FC = ftn -I$(MKLROOT)/include/fftw -g -traceback -heap-arrays
FCL = ftn

FREE = -free -names lowercase

FFLAGS = -assume byterecl
OFLAG = -O1 -ip -xCORE-AVX2
#OFLAG = -O0 -g -traceback
OFLAG_IN = $(OFLAG)
DEBUG = -O0

BLAS = -mkl=cluster #sequential
LAPACK =
SCA = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64

OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
INCS =

LLIBS = $(SCA) $(LAPACK) $(BLAS) \
/store/p504/ahampel/codes/dora/wannier90/1.2/wannier90-1.2/libwannier.a

OBJECTS_O2 += fftw3d.o fftmpi.o fftmpiw.o fft3dlib.o
OBJECTS_O1 += fft3dfurth.o mpi.o wave_mpi.o electron.o charge.o us.o

# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = cc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)

OBJECTS_LIB= linpack_double.o getshmem.o

# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin


######## INCAR ###########
SYSTEM=lanio3-sggau-rel

#Startup
ICHARG=2
ISTART=0


# parameters that determine general accuracy
PREC = Accurate
ENCUT = 550
EDIFF = 1e-7

LWAVE = .FALSE.

# relaxation
EDIFFG = -0.001
ISIF = 3
NSW = 100
IBRION = 2

ISMEAR = 0
SIGMA = 0.001

NEDOS = 1001

# spin polarization
ISPIN = 2
MAGMOM = 4*0 4 4 4 4 12*0
LORBIT = 11

# parallelization & performance
IALGO=38
LPLANE = .TRUE.
NCORE=24
LSCALU = .FALSE.
NSIM = 2
KPAR = 2

# LSDA+U
LMAXMIX=4
LDAU = .TRUE.
LDAUTYPE= 1
LDAUL= -1 2 -1
LDAUU= 0 5 0
LDAUJ= 0 1 0

#### job output #######
Running on 8 nodes (nid00[701-707,712]), 24 tasks/node
running on 192 total cores
distrk: each k-point on 24 cores, 8 groups
distr: one band on 24 cores, 1 groups
using from now: INCAR
vasp.5.4.1 24Jun15 (build Jan 28 2016 14:53:33) complex

POSCAR found type information on POSCAR La Ni O
POSCAR found : 3 types and 20 ions

...

0.10177E+01 -0.20455E+01 94944 0.338E+01 0.138E+01
DAV: 2 -0.137266193277E+03 -0.33294E+00 -0.12622E+01119188 0.450E+01 0.173E+01
DAV: 3 -0.136069236957E+03 0.11970E+01 -0.26720E+00104420 0.259E+01 0.724E+00
DAV: 4 -0.135969111317E+03 0.10013E+00 -0.44404E-01127460 0.931E+00 0.350E+00
DAV: 5 -0.135941844668E+03 0.27267E-01 -0.21654E-01142004 0.220E+00 0.166E+00
DAV: 6 -0.135932672328E+03 0.91723E-02 -0.17165E-01141520 0.216E+00 0.162E+00
DAV: 7 -0.135930126912E+03 0.25454E-02 -0.30734E-02137968 0.142E+00 0.926E-01
DAV: 8 -0.135930465065E+03 -0.33815E-03 -0.33151E-02142880 0.606E-01 0.551E-01
DAV: 9 -0.135928787348E+03 0.16777E-02 -0.72248E-03155584 0.670E-01 0.228E-01
DAV: 10 -0.135929064014E+03 -0.27667E-03 -0.66241E-03153224 0.319E-01 0.162E-01
DAV: 11 -0.135928908660E+03 0.15535E-03 -0.81901E-04141224 0.164E-01 0.126E-01
DAV: 12 -0.135928873697E+03 0.34963E-04 -0.32048E-04141420 0.829E-02 0.717E-02
DAV: 13 -0.135928875002E+03 -0.13046E-05 -0.52199E-05 91544 0.427E-02
2 F= -.13592888E+03 E0= -.13592887E+03 d E =-.132833E-01 mag= 4.0000
trial-energy change: -0.013283 1 .order -0.014666 -0.069607 0.040276
step: 0.6156(harm= 0.6335) dis= 0.00388 next Energy= -135.936694 (dE=-0.211E-01)
bond charge predicted
N E dE d eps ncg rms rms(c)
DAV: 1 -0.136080861388E+03 -0.15199E+00 -0.30120E+00 95640 0.129E+01 0.465E+00
DAV: 2 -0.136098448919E+03 -0.17588E-01 -0.16609E+00120316 0.175E+01 0.655E+00
DAV: 3 -0.135954445365E+03 0.14400E+00 -0.35270E-01106132 0.925E+00 0.239E+00
DAV: 4 -0.135942004858E+03 0.12441E-01 -0.46751E-02127024 0.301E+00 0.900E-01
DAV: 5 -0.135937623995E+03 0.43809E-02 -0.37683E-02140576 0.838E-01 0.411E-01
DAV: 6 -0.135937041107E+03 0.58289E-03 -0.16542E-02122660 0.113E+00 0.438E-01
DAV: 7 -0.135936778213E+03 0.26289E-03 -0.34062E-03155024 0.354E-01 0.217E-01
DAV: 8 -0.135936701451E+03 0.76762E-04 -0.76129E-04127800 0.297E-01 0.662E-02
DAV: 9 -0.135936718409E+03 -0.16958E-04 -0.37154E-04137044 0.139E-01 0.264E-02

fish
Newbie
Posts: 12
Joined: Tue Jun 14, 2005 1:13 pm
License Nr.: 198
Location: Argonne National Lab

Re: Parallelization/MPI problem on Cray XC30/40 (hangs)

#2 Post by fish » Thu Oct 06, 2016 5:27 pm

I have seen similar behavior when using k-point parallelism: my jobs hang when I use it and finish normally when I do not.

I compiled VASP with no optimization (-O0), and the job then completes normally with k-point parallelism.

Have you resolved this issue and could you share the solution?

ahampel
Global Moderator
Posts: 2
Joined: Tue Feb 16, 2016 11:41 am

Re: Parallelization/MPI problem on Cray XC30/40 (hangs)

#3 Post by ahampel » Wed Oct 26, 2016 9:49 am

Hi,

unfortunately I have not resolved the problem. It seems to be related to MPI communication and is hard to identify or fix. For me the best solution was to go back to VASP 5.3.5, which runs very stably on our Cray cluster and gives the same results as the 5.4.1 version. I use 5.4.1 only when necessary, for small systems or when I do not need +U and magnetism.
