ZPOTRF, Sub-Space-Matrix not Hermitian error occur for large systems using MPI, but not serial

dhfphysics
Newbie
Posts: 21
Joined: Thu Apr 29, 2010 12:29 am
License Nr.: 5-730
Location: Corvallis, OR

#1 Post by dhfphysics » Mon Dec 17, 2012 7:49 pm

Hello All,
It has long been known in our little group that we cannot run certain large VASP problems under MPI (Open MPI) because of the errors mentioned in the title. I recently ran tests on a different Linux (Intel Xeon) cluster than the one we usually use to see whether things are different, but the same behavior occurs. Below is a description of the problem, with one of the INCARs and a POSCAR. Atoms have reasonable starting bond lengths, and I am using the lapack_double supplied with VASP along with Intel MKL BLAS. (The same problem occurs using Intel MKL's LAPACK.)

-VASP version 5.3.2 (several older versions have shown the same type of problem)
-Intel ifort 10.1 compiler (the newer compiler on our usual cluster shows the same type of problem)
-openmpi-1.4.3
-Intel MKL 10.0.010 for BLAS (BLAS = -lmkl_em64t -lguide -lpthread; we also have a much newer MKL version on our usual cluster)
-optimization flags: -xT -O2 -ip

The ZPOTRF error often occurs before the results of the first non-self-consistent DAV step have been written to OSZICAR. When the INCAR is simplified and ENCUT, AMIX, and AMIN are lowered, thousands of Sub-Space-Matrix not Hermitian errors appear instead.
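
To see which of the two failure modes a given run hit, the captured output can be tallied. This is only a minimal sketch: the substrings `ZPOTRF` and `Sub-Space-Matrix` are taken from the error names in this post, the exact wording VASP prints may differ, and the function name `tally_errors` is made up for illustration.

```python
from collections import Counter

# Substrings taken from the error names in this post; the exact text that
# VASP prints may differ, so adjust these patterns to match your output.
PATTERNS = {
    "zpotrf": "ZPOTRF",
    "subspace": "Sub-Space-Matrix",
}


def tally_errors(lines):
    """Count the lines that mention each error substring."""
    counts = Counter({key: 0 for key in PATTERNS})
    for line in lines:
        for key, needle in PATTERNS.items():
            if needle in line:
                counts[key] += 1
    return counts
```

An open file can be passed directly, for example `tally_errors(open("vasp.out", errors="replace"))`, where `vasp.out` is a hypothetical name for the redirected standard output of the run.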

The following INCAR is perhaps the simplest of the dozens that I have tried in an attempt to circumvent the problem. It produces many Sub-Space-Matrix errors. I have tried variants of AMIX, AMIN, ISMEAR, PREC, LASPH, LDAU (ultimately we need this), ALGO, NPAR, KPAR, ENCUT, and EDIFF, as well as dynamic library compilation vs. "-static -static-intel". In general, simpler INCARs with slower mixing cause Sub-Space errors, while more accurate or default-mixing INCARs exit with ZPOTRF errors. (I have been surprised by a few exceptions to this rule, with ZPOTRF errors on gentle INCARs.)
All tests start from random wavefunctions and initially run the non-self-consistent Davidson electronic minimization. Thus setting ICHARG=12 does not change anything.
The serial version works; the MPI version always fails. A supercell with fewer bulk unit cells in each slab (96 atoms, I believe) has no problem running under MPI, while this cell has 144 atoms. The following was tested on a single computer with 2 quad-core processors, asking mpirun to use 8 processors and setting OMP_NUM_THREADS=1.
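
For anyone trying to reproduce the single-node test, the launch described above can be sketched roughly as follows. The executable name `vasp` and the helper `build_launch` are assumptions for illustration; only the 8 MPI ranks and OMP_NUM_THREADS=1 come from this post.

```python
import os


def build_launch(n_procs, executable="vasp", omp_threads=1):
    """Return (command, environment) for an mpirun launch.

    The executable name is an assumption; use the path of your own build.
    """
    env = dict(os.environ)
    # Keep threaded libraries such as MKL to one thread per MPI rank.
    env["OMP_NUM_THREADS"] = str(omp_threads)
    command = ["mpirun", "-np", str(n_procs), executable]
    return command, env


command, env = build_launch(8)
print(" ".join(command))
# To actually start the run from the directory holding the input files:
#   import subprocess
#   subprocess.run(command, env=env, check=True)
```

Building the command separately from running it makes it easy to print and compare the exact launch line between the serial and MPI tests.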

POTCAR: PAW_PBEs, default versions for Si, Zn, S

KPOINTS: Gamma centered: 6x6x1

INCAR:
SYSTEM = Si-ZnS
LWAVE=.FALSE.
LCHARG=.FALSE.
LVHAR=.FALSE.

ALGO = Normal
NSIM = 1
NPAR = 1
LREAL=Auto
#LASPH=.TRUE.
NELMDL=5
AMIX=0.04
AMIN=0.01

ENCUT = 280
PREC = normal

ISMEAR = 0
SIGMA = 0.03

EDIFF = 1e-5
EDIFFG = 1e-4
NSW = 0
IBRION = 2
ISIF = 1
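
Since dozens of INCAR variants are involved, it may help to compare them tag by tag when matching settings to the error they produce. The sketch below is a simplified reader, not a full INCAR parser: it treats `#` and `!` as comment markers and `;` as a separator between tags on one line, and the function names are made up for illustration.

```python
def parse_incar(text):
    """Parse INCAR text into a dict of upper-cased tag -> value string."""
    tags = {}
    for raw in text.splitlines():
        # Drop anything after a comment marker.
        line = raw.split("#", 1)[0].split("!", 1)[0]
        for part in line.split(";"):
            if "=" in part:
                key, value = part.split("=", 1)
                tags[key.strip().upper()] = value.strip()
    return tags


def diff_incar(first, second):
    """Return {tag: (value_in_first, value_in_second)} for differing tags."""
    keys = sorted(set(first) | set(second))
    return {
        key: (first.get(key), second.get(key))
        for key in keys
        if first.get(key) != second.get(key)
    }
```

A tag missing from one file shows up with `None` on that side, so commented-out lines such as `#LASPH=.TRUE.` appear as a difference against a variant where the tag is active.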


POSCAR:
Si-ZnS test
5.468586
-1.0000000000000000 1.0000000000000000 0.0000000000000000
-1.0000000000000000 0.0000000000000000 1.0000000000000000
5.9443739999999989 5.9443739999999989 5.9443739999999989
Si S Zn
73 35 36
direct
0.3332701161042725 0.3334487121844025 0.0210358466499900
0.3333840668170857 -0.1667096936641326 0.0210216381980764
-0.1666185198799777 0.3332100618893786 0.0210218006941743
-0.1666817656802643 -0.1667567467642730 0.0210212118632949
-0.0000258056240640 0.0001347657523997 0.0350501862075766
-0.0000383021252124 0.4999523286191414 0.0350293281284112
-0.4997623162555701 -0.0001165234802117 0.0350370839237600
0.4999488340185663 -0.4999935350031234 0.0350405407763392
-0.0000953161261905 0.0001764813133159 0.0771043042690625
0.0000662100776604 -0.4999251321172099 0.0771123916639672
-0.4998903733395471 0.0000328945359120 0.0771011117058084
0.4999233615794916 -0.4998685875211670 0.0770939201916907
0.1666652893977039 0.1667685091423266 0.0911176886953862
0.1666567813998168 -0.3332765110218207 0.0911398642276121
-0.3332656159246743 0.1666198006608161 0.0911179003399126
-0.3334375498116086 -0.3332946888651714 0.0911147049424962
0.1665574557609579 0.1668111286560735 0.1331759288807631
0.1665292535687981 -0.3332520274838268 0.1331960830620760
-0.3332954851018329 0.1667543408928532 0.1331831549220725
-0.3332027983341149 -0.3333388473591442 0.1331844165842326
0.3332193312641562 0.3332779176448344 0.1471929305105706
0.3335277560113273 -0.1667230075731006 0.1471939686422086
-0.1666850365698827 0.3331966643467916 0.1471905741570010
-0.1668475649938693 -0.1665326231515160 0.1472011487344980
0.3333438745583680 0.3333291655932813 0.1892505599998482
0.3332362718045342 -0.1667173620617851 0.1892531628679436
-0.1666742151194670 0.3334405961571077 0.1892417716193337
-0.1667696033616081 -0.1664634015415564 0.1892454029147715
-0.0001089232847485 0.0000734215850406 0.2032666328455413
-0.0000448767171000 -0.4999183649907142 0.2032581593669197
-0.4999230747060615 0.0000553582085602 0.2032844189849891
-0.4998618838456362 0.4998619462415974 0.2032733670028024
-0.0000585389747536 -0.0000643573481173 0.2453220661565005
-0.0000231395845980 -0.4999714037944525 0.2453240769991303
0.4999608357224623 -0.0000497446587071 0.2453323888747364
0.4999315455992220 0.4999971029466527 0.2453252958540946
0.1666523768841424 0.1668105404802223 0.2593511413702672
0.1667665026237506 -0.3333924892834208 0.2593354588389654
-0.3333058263943880 0.1666222593481801 0.2593693976870021
-0.3333352710866327 -0.3333653020509709 0.2593267051876815
0.1665794890699174 0.1666073266049741 0.3013973817285648
0.1665834775297847 -0.3333995495892781 0.3014003575645577
-0.3332774613423672 0.1667357424302440 0.3014178887246128
-0.3335057376473508 -0.3332846395737344 0.3014129307014829
0.3333880542504090 0.3333251466523761 0.3154049207558848
0.3334424638162009 -0.1666397402181601 0.3154254632341582
-0.1666598465872767 0.3334627659744798 0.3154306098244862
-0.1667426023197670 -0.1667231579003317 0.3154124729993562
0.3334610853149438 0.3332026320530412 0.3574729648213454
0.3331970253866918 -0.1665537869048597 0.3574897623610454
-0.1666337259490237 0.3332506677851330 0.3574669275081340
-0.1665513814977407 -0.1668636432340909 0.3574838250862336
-0.0000831605305952 0.0000203826945777 0.3714958664570466
0.0001142952116475 0.4999426799231557 0.3715030867127696
-0.4999772331268719 -0.0000765156282789 0.3715136153441687
0.4998488818037131 0.4999741738348280 0.3714956340929624
0.0000666925842563 -0.0000441592546949 0.4135523853538227
0.0001084549512166 -0.4999652050338592 0.4135582606454121
-0.4999372637806231 -0.0001098015479439 0.4135694531829381
-0.4998886248842307 0.4998957607170313 0.4135522753683675
0.1668348917403356 0.1665541567986697 0.4275648278093045
0.1665294468356562 -0.3333352200875113 0.4275794760998223
-0.3333018989891179 0.1667711633087894 0.4275797965278231
-0.3335041938679780 -0.3333109435164714 0.4275757712579036
0.1667055874227011 0.1667844475422116 0.4696274452245786
0.1666303195999216 -0.3333844315893739 0.4696466524027317
-0.3333245598600232 0.1667053558077265 0.4696427963461762
-0.3334082817221474 -0.3334084981634393 0.4696234398212198
0.3333992600200615 0.3333291963913205 0.4836586922146967
0.3333680377132716 -0.1666194926997007 0.4836671987861440
-0.1667614352982923 0.3332547261288292 0.4836502704527834
-0.1666358992122718 -0.1668168867401114 0.4836500157863499
-0.1668381663838797 -0.1665721338500614 -0.4745090751937238
0.3334532912664145 0.3332790797279082 -0.4745300068701603
0.3334055987506374 -0.1667407106353387 -0.4745341582648517
-0.1667502219484245 0.3333209844134142 -0.4745207495929998
-0.0000112987142732 -0.0000227739527203 -0.4194901699352881
-0.0000721792617862 -0.4998627707737962 -0.4194936039155318
-0.4998669936539681 -0.0001626191087352 -0.4194820098262425
-0.4999141158442429 0.4999193844672645 -0.4194711729992875
0.1667171940574214 0.1666120966390137 -0.3644642711067068
0.1667504589640494 -0.3332053847838745 -0.3644438001753422
-0.3332206299601710 0.1665700871566402 -0.3644476646886273
-0.3332763281951785 -0.3333323702608670 -0.3644287699071860
0.3332236133180079 0.3333143082985517 -0.3094256650915417
0.3335237307440807 -0.1667363284439205 -0.3094183944232640
-0.1668712782460607 0.3334452144943940 -0.3094061514223238
-0.1667421791196450 -0.1666649141707319 -0.3094107092229308
0.0000392737315814 -0.0000729000673114 -0.2543661751667987
0.0000877981431496 0.4999891561819209 -0.2543780514392033
0.4999311334606509 -0.0000614114556772 -0.2543930072746277
-0.4999221398914857 0.4999716940087414 -0.2543813002756039
0.1666345715966626 0.1668121623301703 -0.1993527476409120
0.1667211837664618 -0.3333439743995965 -0.1993369236102139
-0.3334284986483446 0.1666845692499294 -0.1993518138330508
-0.3333364065442332 -0.3332678759885740 -0.1993464409485898
0.3332007169904655 0.3334318354040041 -0.1443090490903434
0.3333706018066668 -0.1666569569890427 -0.1442948026992679
-0.1665352585122272 0.3331861808934581 -0.1442992774475660
-0.1667306519494440 -0.1666340481512335 -0.1443085133537441
-0.0000857649025797 0.0000034890829546 -0.0892731168159685
0.0000100426369947 -0.4999844870701367 -0.0892613734642091
-0.4999545002653688 -0.0000676829355163 -0.0892743718044648
-0.4999065467193972 0.4999787416073116 -0.0892656252184963
0.1666519364827267 0.1665998696335678 -0.0342502968860840
0.1666437101160805 -0.3333262332409225 -0.0342594382638962
-0.3332937860019511 0.1666675022349230 -0.0342277071350919
-0.3334054147761223 -0.3333625334215875 -0.0342286433070994
0.0000101828226405 0.0000790856298458 -0.4610767043287973
0.0000462303468558 -0.4999691659455030 -0.4610764967455639
-0.4999526783464271 -0.0000041730853240 -0.4610754131825739
0.4999440099983854 -0.4999804396968608 -0.4610818526085538
0.1668091007962695 0.1665332825967287 -0.4060533308546262
0.1666442899203099 -0.3333353281192306 -0.4060338950061567
-0.3333302069344450 0.1666847352603549 -0.4060760603763299
-0.3333283567088552 -0.3332272011705013 -0.4060448373663480
0.3332695241498024 0.3333387375149279 -0.3509996664451893
0.3332734726397492 -0.1667240169987021 -0.3510225205936363
-0.1665269479656771 0.3332648032195944 -0.3510181840579200
-0.1665275399897213 -0.1666222012565122 -0.3510101134726178
-0.0000828600109596 -0.0000800598807327 -0.2959837692387465
0.0000296313156918 -0.4999707681729446 -0.2959674549440757
-0.4999282506171521 0.0000635263703810 -0.2959797622767003
-0.4999861955373690 -0.4998541000444462 -0.2959729556425186
0.1666389916459226 0.1666377914938493 -0.2409632374322218
0.1668050475629942 -0.3334049840165683 -0.2409440389841894
-0.3334121188254038 0.1666483973638052 -0.2409346036755921
-0.3334725732040010 -0.3332268964403695 -0.2409430574617726
0.3332921030975213 0.3334841783026276 -0.1859135093329313
0.3333481694034060 -0.1668347289231257 -0.1859072244158739
-0.1666837213840007 0.3334216927226507 -0.1859259855520721
-0.1666960416652966 -0.1666521514043575 -0.1859323229377019
0.0000108468905068 0.0000992618913873 -0.1308646862689997
-0.0000594799798549 0.4999751879400238 -0.1308700202053212
-0.4999881038585041 0.0000039856030325 -0.1308742288970177
-0.4999238082943203 0.4999318515899318 -0.1308805054945497
0.1666827000619334 0.1666951570805720 -0.0758581153802631
0.1667917787174852 -0.3332532218653256 -0.0758318469102036
-0.3333210930869563 0.1667585365110482 -0.0758483483244968
-0.3332766463649418 -0.3332342650979203 -0.0758293677918450
0.3333041812007727 0.3332313672856156 -0.0207962739802632
0.3333380245791318 -0.1667332373286115 -0.0208022946356239
-0.1665698509872580 0.3332422633996034 -0.0208051328868731
-0.1664969613970304 -0.1667691587511596 -0.0208046665733718

I can provide more info on the compiler, Open MPI, etc., if needed.
Thanks for your help.
David


#2 Post by dhfphysics » Mon Dec 17, 2012 7:56 pm

Note: A single computer may not have enough memory (8 GB) to run this computation to completion, but a serial test shows that it can successfully complete non-self-consistent DAV cycles using about 6 GB. I have also tested similar INCARs running MPI on 2 and 4 machines (with NPAR set to 2 and 4, respectively), and the same errors occur.

It would be very helpful if someone with a Linux cluster could run this on their system. If it works, perhaps we could talk by email about compiler/MPI2 details.

Thanks,
fosterd@physics.oregonstate.edu