Failure of cRPA example with parallel calculation

Posted: Thu Jun 06, 2024 2:19 pm
by zhishuo_huang
Dear developers and users,

I am trying to run cRPA calculations with VASP 6.4.3 and Wannier90 3.1.0, compiled with the Intel compiler 2022.
I first ran the calculation following the example (https://www.vasp.at/wiki/index.php/CRPA_of_SrVO3).
However, the last step, the cRPA calculation for a set of automatically chosen imaginary frequency points, fails when run in parallel, while the serial run finishes successfully.
The error message is as follows (the full output is in the PBS output file in the attachment):
[colo-chmlu-01:308605:0:308605] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308609:0:308609] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308581:0:308581] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308582:0:308582] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308583:0:308583] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308584:0:308584] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308585:0:308585] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308587:0:308587] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308588:0:308588] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308593:0:308593] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308594:0:308594] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308595:0:308595] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308596:0:308596] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308597:0:308597] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308598:0:308598] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308599:0:308599] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308600:0:308600] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308601:0:308601] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308602:0:308602] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308603:0:308603] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308604:0:308604] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308606:0:308606] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308607:0:308607] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308608:0:308608] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308610:0:308610] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308611:0:308611] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308612:0:308612] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308586:0:308586] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308589:0:308589] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308590:0:308590] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308591:0:308591] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[colo-chmlu-01:308592:0:308592] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 308586) ====
0 0x0000000000538a41 scala_mp_check_gdes_mat_size_() ???:0
1 0x0000000001bd88b2 chi_super_mp_calculate_xi_real_() ???:0
2 0x0000000001e5b9cc vamp_IP_do_rpa_() main.f90:0
3 0x0000000001e3173d MAIN__() ???:0
4 0x0000000000408512 main() ???:0
5 0x0000000000022555 __libc_start_main() ???:0
6 0x0000000000408429 _start() ???:0
=================================
==== backtrace (tid: 308591) ====
0 0x0000000000538a41 scala_mp_check_gdes_mat_size_() ???:0
1 0x0000000001bd88b2 chi_super_mp_calculate_xi_real_() ???:0
2 0x0000000001e5b9cc vamp_IP_do_rpa_() main.f90:0
3 0x0000000001e3173d MAIN__() ???:0
4 0x0000000000408512 main() ???:0
5 0x0000000000022555 __libc_start_main() ???:0
6 0x0000000000408429 _start() ???:0
=================================
...
...
Image PC Routine Line Source
vasp_std 0000000001FDB35A Unknown Unknown Unknown
libpthread-2.17.s 00002B39B1152630 Unknown Unknown Unknown
vasp_std 0000000000538A41 Unknown Unknown Unknown
vasp_std 0000000001BD88B2 Unknown Unknown Unknown
vasp_std 0000000001E5B9CC Unknown Unknown Unknown
vasp_std 0000000001E3173D Unknown Unknown Unknown
vasp_std 0000000000408512 Unknown Unknown Unknown
libc-2.17.so 00002B39B1683555 __libc_start_main Unknown Unknown
vasp_std 0000000000408429 Unknown Unknown Unknown
==== backtrace (tid: 308581) ====
0 0x0000000000538a41 scala_mp_check_gdes_mat_size_() ???:0
1 0x0000000001bd88b2 chi_super_mp_calculate_xi_real_() ???:0
2 0x0000000001e5b9cc vamp_IP_do_rpa_() main.f90:0
3 0x0000000001e3173d MAIN__() ???:0
4 0x0000000000408512 main() ???:0
5 0x0000000000022555 __libc_start_main() ???:0
6 0x0000000000408429 _start() ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
vasp_std 0000000001FDB35A Unknown Unknown Unknown
libpthread-2.17.s 00002B16F59C2630 Unknown Unknown Unknown
vasp_std 0000000000538A41 Unknown Unknown Unknown
vasp_std 0000000001BD88B2 Unknown Unknown Unknown
vasp_std 0000000001E5B9CC Unknown Unknown Unknown
vasp_std 0000000001E3173D Unknown Unknown Unknown
vasp_std 0000000000408512 Unknown Unknown Unknown
libc-2.17.so 00002B16F5EF3555 __libc_start_main Unknown Unknown
vasp_std 0000000000408429 Unknown Unknown Unknown
======================================================================================

I attach the relevant files:
INCAR.CRPA_wan: the modified INCAR for cRPA at omega=0 with Wannier orbitals,
cRPA_Wan_parallel.o6135280: PBS output file,
makefile.include: the makefile settings used for the compilation,
pbs_vasp6.4.3_intel_testcRPA: PBS script file.

I appreciate your time and any suggestions or explanations.

Best regards
Zhishuo Huang

Re: Failure of cRPA example with parallel calculation

Posted: Tue Jun 11, 2024 2:07 pm
by merzuk.kaltak
Dear Zhishuo Huang,

Thank you for submitting an error report.
There is indeed a bug in the code that is triggered when you use a large number of MPI ranks for such a small job.
The fix will be released in version 6.5.0.
For the time being, I suggest running this job with fewer MPI ranks; 4 should suffice.
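
For readers applying this workaround, the launch line might look like the sketch below. It is only an illustration under stated assumptions: the launcher name `mpirun` and its `-np` option are assumed (they are not taken from the attached PBS script), and `vasp_std` is the executable name as it appears in the traceback. Adjust both to the local MPI installation and scheduler.

```shell
#!/bin/sh
# Workaround sketch: cap the number of MPI ranks for the small SrVO3 cRPA job.
# Assumptions (not from the thread): the launcher is `mpirun` and accepts `-np`.
NRANKS=4            # rank count suggested in the reply above
VASP_BIN=vasp_std   # executable name as shown in the traceback

LAUNCH_CMD="mpirun -np ${NRANKS} ${VASP_BIN}"
echo "launch command: ${LAUNCH_CMD}"

# Start the job only if both programs are actually available on PATH.
if command -v mpirun >/dev/null 2>&1 && command -v "${VASP_BIN}" >/dev/null 2>&1; then
    ${LAUNCH_CMD}
fi
```

In a PBS job, the resource request would normally be reduced to match, so that the scheduler does not reserve cores that stay idle.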

Note that I have updated the tutorial page to suggest using WANNIER90_WIN for version 6.2.0 and newer.

Moreover, a new cRPA tutorial that works in conjunction with py4vasp will be published soon.

Re: Failure of cRPA example with parallel calculation

Posted: Fri Jun 14, 2024 9:20 am
by zhishuo_huang
Dear Merzuk Kaltak,

Thank you for the information.

Best regards
Zhishuo Huang