
MPI not working for 6.4.2 on QDR IB fabric with Intel MPI 2021.7.0

Posted: Thu Feb 27, 2025 8:11 pm
by stephen_wheat

On our OmniPath network, with everything else the same, VASP works fine. On our QDR IB network, we get the following error, which can be reproduced simply by running "vasp --version".

The procedure for running VASP is:
module load VASP/6.4.2

That is configured to load the following modules:
1) tbb/2022.0 3) compiler-rt/2025.0.4 5) mpi/2021.7.0 7) VASP/6.4.2
2) umf/0.9.1 4) compiler/2025.0.4 6) mkl/2025.0

vasp --version returns
[0] MPI startup(): FI_PSM3_UUID was not generated, please set it to avoid possible resources ownership conflicts between MPI processes
c035:pid4457.vasp: Failed to modify UD QP to INIT on mlx4_0: Operation not permitted
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:

VASP is built with the same modules as above, using arch/makefile.include.intel_omp as the makefile.include, with the following options:
FC = mpiifx -qopenmp
FCL = mpiifx
CC_LIB = icx
CXX_PARS = icpx
LLIBS = -lstdc++
FCL += -mkl
MKLROOT ?= /opt/intel/oneapi/mkl/latest
LLIBS += -L$(MKLROOT)/lib -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
INCS =-I$(MKLROOT)/include/fftw


Re: MPI not working for 6.4.2 on QDR IB fabric with Intel MPI 2021.7.0

Posted: Fri Feb 28, 2025 10:00 am
by ferenc_karsai

It will not be possible for us to reproduce the error, since we don't have those compilers installed here.

It's very likely a toolchain problem if it already happens with "--version". I assume you get the same error when you run VASP normally?

I would suggest trying a different MPI version (the one you used is fairly old).
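
It could also be worth checking which libfabric provider Intel MPI picks on the QDR IB nodes, since the message mentions PSM3 while the hardware is mlx4. A rough sketch (tool and variable names taken from the Intel MPI / libfabric documentation; which provider is right for QDR mlx4 hardware is something you would have to verify on your side, and I assume the IMB binaries shipped with Intel MPI are on your PATH):

# list the fabric interfaces/providers that libfabric can see on a node
fi_info

# show which provider Intel MPI actually selects during MPI_Init
I_MPI_DEBUG=5 mpirun -n 2 IMB-MPI1 PingPong

# try forcing a specific provider instead of psm3, e.g. verbs
FI_PROVIDER=verbs I_MPI_DEBUG=5 mpirun -n 2 IMB-MPI1 PingPong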


Re: MPI not working for 6.4.2 on QDR IB fabric with Intel MPI 2021.7.0

Posted: Sat Mar 01, 2025 7:14 pm
by stephen_wheat

Thank you for the guidance.

I have updated the toolchain completely. The Intel components are:
Compiler=compiler/2025.0.4
MKL = mkl/2025.0
MPI = mpi/2021.14

The relevant parts of the makefile.include are below. It is interesting that the Intel makefile.include has not been updated to use the new executables, even though the old executables are no longer in the toolchain. They are more than deprecated; they simply aren't there.

Note that I use icx, icpx, and mpiifx, and that I build with -mkl rather than -qmkl.

# Default precompiler options
CPP_OPTIONS = -DHOST=\"LinuxIFC\" \
              -DMPI -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -DCACHE_SIZE=4000 \
              -Davoidalloc \
              -Dvasp6 \
              -Dtbdyn \
              -Dfock_dblbuf \
              -D_OPENMP

CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)

FC = mpiifx -qopenmp
FCL = mpiifx

FREE = -free -names lowercase

FFLAGS = -assume byterecl -w

OFLAG = -O2
OFLAG_IN = $(OFLAG)
DEBUG = -O0

# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = icx
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)

OBJECTS_LIB = linpack_double.o

# For the parser library
CXX_PARS = icpx
LLIBS = -lstdc++

VASP_TARGET_CPU ?= -xHOST
FFLAGS += $(VASP_TARGET_CPU)

FCL += -mkl
MKLROOT ?= /opt/intel/oneapi/mkl/latest
LLIBS += -L$(MKLROOT)/lib -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
INCS =-I$(MKLROOT)/include/fftw
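
(As a scripted summary of the substitutions above: starting from the shipped arch/makefile.include.intel_omp, and assuming it still names the classic mpiifort/icc/icpc drivers, the changes amount to roughly the following; the exact variable spellings in the template copy may differ.)

cp arch/makefile.include.intel_omp makefile.include
# swap the classic drivers for the current LLVM-based ones
sed -i -e 's/^FC *=.*/FC       = mpiifx -qopenmp/' \
       -e 's/^FCL *=.*/FCL      = mpiifx/' \
       -e 's/^CC_LIB *=.*/CC_LIB   = icx/' \
       -e 's/^CXX_PARS *=.*/CXX_PARS = icpx/' \
       makefile.include
# (I additionally changed -qmkl to -mkl, which I will revert for the qmkl rebuild)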

I have built this for both my Sandy Bridge and my Broadwell partitions, and I get the same test-suite failures. One example is below; this happens for a couple of vasp_ncl tests and for a couple of vasp_std tests. I get the exact same error when trying to run vasp_std on a real problem.

My basic question is: what are the most vanilla makefile.include settings for using the Intel compiler, MPI, and MKL? Is something wrong with my makefile.include?

While waiting for a response, I will rebuild again, this time with -qmkl.

bulk_InP_SOC_G0W0_sym step DIAG
------------------------------------------------------------------
bulk_InP_SOC_G0W0_sym step DIAG
entering run_vasp_nc
running 4 mpi-ranks, with 18 threads/rank, on 1 nodes
distrk: each k-point on 2 cores, 2 groups
distr: one band on 1 cores, 2 groups
vasp.6.5.0 16Dec24 (build Mar 01 2025 12:23:04) complex

POSCAR found : 2 types and 2 ions
scaLAPACK will be used
LDA part: xc-table for (Slater(with rela. corr.)+CA(PZ))
, standard interpolation
found WAVECAR, reading the header
number of bands has changed, file: 16 present: 240
trying to continue reading WAVECAR, but it might fail
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ... GRIDC
FFT: planning ... GRID_SOFT
FFT: planning ... GRID
reading WAVECAR
random initialization beyond band 16
the WAVECAR file was read successfully
initial charge from wavefunction
entering main loop
N E dE d eps ncg rms rms(c)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread-2.28.s 00007F0722894D10 Unknown Unknown Unknown
vasp_ncl 00000000004C6088 Unknown Unknown Unknown
vasp_ncl 000000000098070B Unknown Unknown Unknown
vasp_ncl 0000000001054505 Unknown Unknown Unknown
vasp_ncl 00000000011C60F6 Unknown Unknown Unknown
vasp_ncl 0000000001DA68CB Unknown Unknown Unknown
vasp_ncl 0000000001D985C6 Unknown Unknown Unknown
vasp_ncl 0000000001D71BC5 Unknown Unknown Unknown
vasp_ncl 000000000040CF5D Unknown Unknown Unknown
libc-2.28.so 00007F0721D7A7E5 __libc_start_main Unknown Unknown
vasp_ncl 000000000040CE7E Unknown Unknown Unknown
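
If it helps with the diagnosis, this is how I intend to double-check which MPI, MKL, and fabric libraries the failing binary actually resolves at run time (binary path as on our system):

# confirm which MPI, MKL and fabric libraries the binary picks up
ldd bin/vasp_ncl | grep -Ei 'mpi|mkl|fabric|ucx'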


Re: MPI not working for 6.4.2 on QDR IB fabric with Intel MPI 2021.7.0

Posted: Mon Mar 03, 2025 2:21 pm
by ferenc_karsai

Here are some toolchains that we use daily to test VASP:
intel: oneapi-2024.0.2_intel-oneapi-mkl-2023.2.0_intel-oneapi-mpi-2021.10.0_
gfortran: gfortran-11.2_ompi-4.1.2_scalapack-2.1.0_fftw-3.3.10_openblas-0.3.18_gcc-11.2.0
nvidia: nvhpc-22.11_ompi-3.1.5_nvhpc-sdk/22.11_fftw/3.3.10-omp

We run the whole test suite with these toolchains and see no problems. So it would be best if you could try one of these toolchains.


Re: MPI not working for 6.4.2 on QDR IB fabric with Intel MPI 2021.7.0

Posted: Wed Mar 05, 2025 11:18 pm
by stephen_wheat

Ferenc,

Thank you for that. The OpenMPI toolchain is not an option here, and the NVIDIA toolchain is strongly undesired. The Intel toolchain is where we need to end up.

I don't have a support agreement with Intel, so I can only use the latest oneAPI release, the versions I happen to have downloaded, or the parts that are available through the EasyBuild environments.

The closest I could get to the recommended Intel toolchain is imkl/2023.1.0 with impi/2021.10.0-intel-compilers-2023.2.1. After a module purge, loading those gives me the following modules:

Currently Loaded Modules:
1) imkl/2023.1.0 4) binutils/2.40-GCCcore-13.2.0 7) UCX/1.15.0-GCCcore-13.2.0
2) GCCcore/13.2.0 5) intel-compilers/2023.2.1 8) impi/2021.10.0-intel-compilers-2023.2.1
3) zlib/1.2.13-GCCcore-13.2.0 6) numactl/2.0.16-GCCcore-13.2.0

I then tried imkl/2023.1.0 and impi/2021.9.0-intel-compilers-2023.1.0.
This gave me the following modules:
Currently Loaded Modules:
1) imkl/2023.1.0 4) binutils/2.40-GCCcore-12.3.0 7) UCX/1.14.1-GCCcore-12.3.0
2) GCCcore/12.3.0 5) intel-compilers/2023.1.0 8) impi/2021.9.0-intel-compilers-2023.1.0
3) zlib/1.2.13-GCCcore-12.3.0 6) numactl/2.0.16-GCCcore-12.3.0

In all cases, I get the same test failures. Since the failure is identical across all toolchain combinations, I have to look at what is common, which makes me suspect that the issue lies in the build process or the configuration. My make command line is:
make DEPS=1 -j16 all

Am I missing something for that?


Re: MPI not working for 6.4.2 on QDR IB fabric with Intel MPI 2021.7.0

Posted: Fri Mar 07, 2025 9:51 am
by ferenc_karsai

The make command line looks OK. To be sure, you can also try compiling without parallelism (make all).
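
Something along these lines, i.e. a serial build from a clean tree (assuming the top-level makefile still provides the usual veryclean target):

make veryclean   # removes the previous build directories (assuming this target exists)
make all         # serial build, no -j and no DEPS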


Re: MPI not working for 6.4.2 on QDR IB fabric with Intel MPI 2021.7.0

Posted: Fri Mar 07, 2025 7:14 pm
by stephen_wheat

Ferenc,

As much as I would like to say otherwise, that made no difference; I still get the same crashes in the test suite as before.

Are there environment variables that must be set or unset for the test suite? I don't think that's the issue, though, since I get the same errors when I run VASP as a regular user across multiple nodes.
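
For reference, my understanding is that the test suite mainly respects the VASP_TESTSUITE_EXE_* launch overrides; this is roughly how I would pin them down explicitly (variable names as I understand them from the testsuite documentation, so please correct me if they are wrong):

# sketch: spell out how the testsuite launches each binary
export VASP_TESTSUITE_EXE_STD="mpirun -np 4 $PWD/bin/vasp_std"
export VASP_TESTSUITE_EXE_GAM="mpirun -np 4 $PWD/bin/vasp_gam"
export VASP_TESTSUITE_EXE_NCL="mpirun -np 4 $PWD/bin/vasp_ncl"
make test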

Stephen