Diagonalization and Several Linalg Packages
Exact Diagonalization
This is a note on some problems encountered while doing exact diagonalization (ED) in quantum many-body physics; it also serves as a notebook for learning. For now we do not consider the symmetries of the Hamiltonian and just focus on the coding. Thanks to gxliu from the Julia community, who was really helpful in the thread "Why is my Julia code twice as slow as Python on Windows, but the other way around on macOS?".
Background
I am writing a program for exact diagonalization in quantum many-body physics. Because Python was too slow, I turned to Julia, expecting it to perform much better. Below are my two programs; the Julia version was written by imitating the Python one.
The Hamiltonian is:
where on-site potentials are independent Gaussian random numbers with mean zero and variance
I first wrote a Python version:
```python
import os
# ... (rest of the Python ED script truncated in this note)
```
Imitating the Python code above, I wrote a Julia version intended to speed up the computation:
```julia
using SparseArrays
# ... (rest of the Julia ED script truncated in this note)
```
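For orientation, here is a minimal Julia sketch of the kind of workflow both scripts implement: build the half-filled basis, assemble a sparse Hamiltonian, then diagonalize it densely. The model details (spinless particles hopping on a periodic chain with Gaussian on-site potentials) and all names are illustrative assumptions, not the author's exact code:

```julia
# A minimal, illustrative sketch (not the author's script): spinless particles hopping on a
# periodic chain with Gaussian on-site disorder, restricted to half filling.
using SparseArrays, LinearAlgebra

N = 8                                           # chain length (small for the example)

# Half-filled basis states encoded as integers, kept sorted.
basis = sort([s for s in 0:(2^N - 1) if count_ones(s) == N ÷ 2])
index = Dict(s => i for (i, s) in enumerate(basis))

h = randn(N)                                    # Gaussian on-site potentials (mean 0, variance 1)
rows, cols, vals = Int[], Int[], Float64[]
for (i, s) in enumerate(basis)
    # Diagonal element: sum of on-site potentials over occupied sites.
    push!(rows, i); push!(cols, i)
    push!(vals, sum(h[j + 1] for j in 0:N-1 if (s >> j) & 1 == 1))
    # Off-diagonal elements: nearest-neighbour hopping with periodic boundaries.
    for j in 0:N-1
        k = (j + 1) % N
        if (s >> j) & 1 != (s >> k) & 1         # exactly one of the two sites is occupied
            t = xor(s, (1 << j) | (1 << k))     # move the particle across the bond
            push!(rows, i); push!(cols, index[t]); push!(vals, 1.0)
        end
    end
end
H = sparse(rows, cols, vals, length(basis), length(basis))   # sparse Hamiltonian (preparation step)

E = eigvals(Symmetric(Matrix(H)))               # dense full diagonalization (the expensive step)
```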
Device and Benchmark
I ran into the problem that my Python code runs twice as fast as the Julia code on my Windows machine, while on macOS it is the other way around. How can that be? We ran a lot of tests on different platforms.
Here are the hardware devices and the package environments.
Hardware | OS | CPU | Python version | NumPy version | Julia version | BLAS |
---|---|---|---|---|---|---|
AMD | Windows 10 | AMD Ryzen 5 4600H | 3.7.12 | 1.21.6 | 1.8.5 | OpenBLAS |
Mac | Ventura 13.2.1 | M1 Max | 3.10.9 | 1.24.2 | 1.8.5 | OpenBLAS -> MKL |
Intel | Windows 11 | i9-11900H | / | / | 1.8.5 | OpenBLAS -> MKL |
Linux | CentOS 7 | dual Xeon Gold 6238R | 3.9.12 | 1.21.5 | 1.8.5 | NumPy MKL, Julia OpenBLAS |
Two parts of the code dominate the run time: preparing the (sparse) Hamiltonian matrix, and diagonalizing it.
Here is my ED benchmark for the Hamiltonian with N = 16:
BLAS (platform) | np.linalg.eigvalsh (Python) | LinearAlgebra.eigvals (Julia) |
---|---|---|
OpenBLAS (AMD) | 70 s (SciPy 75 s) | ~160 s |
OpenBLAS (AMD, 1 thread) | 120 s | / |
OpenBLAS (Mac) | 2 min (SciPy 1 min) | 60 s |
OpenBLAS (Mac, 1 thread) | 8 min | / |
veclib (Mac) | 7 min 09 s | / |
OpenBLAS (Intel) | 5 min | 140 s |
MKL (Intel) | 33 s | 37 s |
MKL (Intel, 1 thread) | 66 s | 231 s |
MKL (Intel, CentOS 7) | 57 s | ~85 s |
MKL (Intel, CentOS 7, 1 thread) | 64 s | 95 s |
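The single-thread rows can be reproduced with something like the following sketch; the matrix size here and the use of eigvals on a Symmetric matrix are illustrative, not the exact benchmark script:

```julia
# Sketch: probe how the BLAS thread count affects dense diagonalization time.
using LinearAlgebra

A = Symmetric(rand(2000, 2000))

println("BLAS threads: ", BLAS.get_num_threads())
@time eigvals(A)            # multi-threaded BLAS/LAPACK (the default rows)

BLAS.set_num_threads(1)     # force single-threaded BLAS, as in the "1 thread" rows
@time eigvals(A)
```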
Running np.show_config() with my NumPy 1.21.6 returns:
```
blas_info:
... (output truncated)
```
Running versioninfo() on my AMD machine returns:
```
Platform Info:
... (output truncated)
```
LinearAlgebra.BLAS.get_config() returns:
```
LinearAlgebra.BLAS.LBTConfig
... (output truncated)
```
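Since several rows of the benchmark switch Julia's backend from OpenBLAS to MKL, here is a sketch of how that switch is usually done. It assumes the MKL.jl package is installed; the exact output of get_config differs per machine:

```julia
# Switching Julia's BLAS/LAPACK backend to MKL via the MKL.jl package.
using LinearAlgebra
LinearAlgebra.BLAS.get_config()   # typically shows libopenblas by default

using MKL                         # loads MKL and forwards BLAS/LAPACK calls to it
LinearAlgebra.BLAS.get_config()   # should now list the MKL library among the loaded backends
```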
Running torch.__config__.show() for my PyTorch returns:
```
PyTorch built with:
... (output truncated)
```
Optimization
Since the list of basis states is already sorted, we do not need list.index (Python) or findfirst (Julia); we can use bisect.bisect_left() and searchsortedfirst instead, which perform a binary search. In addition, the return type of the half_filling function was never specified, so we can annotate it explicitly as String. The code then becomes:
```python
newcolumnstate[(j+1)%N] = 0
# ... (rest of the optimized Python code truncated in this note)
```
```julia
using SparseArrays
# ... (rest of the optimized Julia code truncated in this note)
```
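To make the lookup change concrete, here is a small illustration in Julia; the basis strings and the half_filling_example signature are made up for the example and are not the author's actual function:

```julia
# Illustration: on a sorted basis list, binary search replaces the linear scan.
basis = sort(["000111", "001011", "001101", "010011", "100011", "110001"])
state = "010011"

i_linear = findfirst(==(state), basis)       # O(n) scan (original version)
i_binary = searchsortedfirst(basis, state)   # O(log n) binary search (optimized version)
@assert i_linear == i_binary

# Hypothetical signature: annotating the return type of a basis-generating function
# (the real half_filling is defined in the scripts above).
half_filling_example(N::Int)::Vector{String} =
    sort([join(reverse(digits(s, base=2, pad=N))) for s in 0:(2^N - 1) if count_ones(s) == N ÷ 2])
```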
The matrix-preparation times before and after this optimization are:
Stage | Python | Julia |
---|---|---|
before | 24 s | 26 s |
after | 4 s | 1 s |
Conclusion
Overall, after the optimizations above, the Python program on AMD went from the initial 70 s down to 50 s, and the Julia program from the initial 170 s down to 144 s; on Linux it was optimized to 72 s on the first run and 65 s on the second. Note that SciPy and NumPy do not necessarily use the same linear algebra library (as the SciPy documentation puts it: "Another advantage of using scipy.linalg over numpy.linalg is that it is always compiled with BLAS/LAPACK support, while for numpy this is optional. Therefore, the scipy version might be faster depending on how numpy was installed. Therefore, unless you don't want to add scipy as a dependency to your numpy program, use scipy.linalg instead of numpy.linalg."). The linear algebra backends of NumPy and of Julia's LinearAlgebra are both selectable, and it is advisable to link against whichever library fits the device when building. Julia's OpenBLAS support is honestly not great: on Intel chips MKL is still recommended, and although in theory AMD+OpenBLAS should match Intel+MKL, Python bears this out while Julia does not, which is strange.
Math Package
Basic Math Package
The basic linear algebra packages include the following.
BLAS
Basic Linear Algebra Subprograms, the reference implementation written by Netlib in Fortran; it is not heavily optimized itself. Various software and hardware vendors provide highly optimized implementations of the BLAS interface for their own products.
LAPACK
Linear Algebra PACKage, also by Netlib in Fortran. It is built on top of BLAS and provides higher-level routines (linear solvers, factorizations, eigenvalue problems), which are more efficient than coding these operations directly on BLAS.
ScaLAPACK
Short for Scalable LAPACK, a distributed-memory version suitable for MIMD (multiple instruction, multiple data) machines.
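As a concrete illustration of this layering, the sketch below calls a BLAS level-3 routine (gemm) and a LAPACK symmetric eigensolver (syev) through Julia's LinearAlgebra wrappers rather than calling the Fortran libraries directly; the matrix sizes are arbitrary:

```julia
# The BLAS/LAPACK layering in practice, via Julia's thin wrappers.
using LinearAlgebra

A = rand(500, 500); B = rand(500, 500)
C = BLAS.gemm('N', 'N', A, B)              # dgemm: general matrix-matrix multiply (BLAS level 3)

S = Symmetric(rand(500, 500))
w = LAPACK.syev!('N', 'U', Matrix(S))      # dsyev: symmetric eigenvalues (LAPACK, built on BLAS)
```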
Advanced Math Package
BLAS, LAPACK, and ScaLAPACK can be seen as interface standards, and other packages are implemented on top of them. Among these, ATLAS and OpenBLAS are open source, while others, such as MKL and cuBLAS, are commercial.
Atlas
Automatically Tuned Linear Algebra Software, which can automatically adjust its operating parameters based on the hardware it runs on.
OpenBLAS
Just as the name says, an open BLAS. (Remarkably, its initial release was on 22 March 2011, and its developers include Zhang Xianyi and Wang Qian.) OpenBLAS is optimized at compile time, so its run-time efficiency is generally higher than ATLAS's; but this also means it is strongly hardware dependent, and switching to a different machine may require recompilation.
MKL
Intel's Math Kernel Library, built with the Intel C++ and Fortran compilers, with threading implemented via OpenMP (Open Multi-Processing, a parallel programming model). It supports Linux/Windows on Intel chips (such as the i9-11900H).
Its components include:
- BLAS: all matrix-matrix operations (level 3) are threaded for both dense and sparse BLAS. Many vector-vector (level 1) and matrix-vector (level 2) operations are threaded for dense matrices in 64-bit programs on the Intel 64 architecture. For sparse matrices, all level-2 operations except the triangular sparse solvers are threaded.
- LAPACK: some computational routines are threaded for certain problem types: linear equation solvers, orthogonal factorizations, singular value decomposition, and symmetric eigenvalue problems. LAPACK also calls BLAS, so even non-threaded routines may run in parallel.
- ScaLAPACK: a distributed-memory parallel version of LAPACK oriented toward clusters.
- PARDISO: the three stages of this parallel direct sparse solver are threaded: reordering (optional), factorization, and the solve (when multiple right-hand sides are used).
- DFTs: discrete Fourier transforms.
- VML: vector mathematics library.
- VSL: vector statistics library.
Armadillo
An easy-to-use C++ matrix library built with template metaprogramming and a MATLAB-like interface. It wraps efficient LAPACK, BLAS, and ATLAS backends, including high-performance implementations such as Intel MKL, AMD ACML, and OpenBLAS.
Its backends:
BLAS/LAPACK: supports OpenBLAS, ACML, and MKL
Eigen(3)
Its backends:
- BLAS/LAPACK: any F77-based BLAS or LAPACK library can be used as the backend (EIGEN_USE_BLAS, EIGEN_USE_LAPACKE)
- MKL: MKL can be used as the backend (EIGEN_USE_MKL_ALL)
- CUDA: Eigen can be used inside CUDA kernels
- OpenMP: multi-threaded optimization
How they relate:
- In the narrow sense, BLAS/LAPACK can be understood as APIs for linear algebra libraries
- Netlib provides the Fortran/C implementations: BLAS/LAPACK and CBLAS/CLAPACK
- The open-source community and commercial vendors provide optimized implementations targeting these APIs, for BLAS (ATLAS, OpenBLAS) and LAPACK (MKL, ACML, cuBLAS)
- Eigen and Armadillo implement their own linear algebra routines, and in addition can use any of the BLAS/LAPACK-based backends above to accelerate computation
A comparison between them:
- Candidates: MKL, OpenBLAS, Eigen, Armadillo
- Ease of use of the interface: Eigen > Armadillo > MKL/OpenBLAS
- Speed: MKL ≈ OpenBLAS > Eigen (with MKL) > Eigen > Armadillo
In particular:
- OpenBLAS has no single-core build, and forcing OMP_NUM_THREADS=1 costs a lot of performance, so that option is not considered
- Using MKL directly has a higher learning cost, but its performance is the strongest
- Armadillo is less efficient and its interface is less convenient than Eigen's
- Eigen's native BLAS/LAPACK implementation is slower than MKL and OpenBLAS, but with MKL as its backend the performance is almost the same as native MKL, so whether to use MKL can be decided case by case
- AMD+OpenBLAS can rival Intel+MKL: on an Intel processor just pick MKL, while on an AMD processor OpenBLAS is about twice as fast as MKL.
- In 2018, MATLAB users reported absurdly low performance on Zen CPUs, worse than Intel CPUs from several years earlier. It turned out that the MKL used by MATLAB does not enable the AVX2 instruction set on non-Intel CPUs. Someone then reverse-engineered MKL and dug out the famous environment variable MKL_DEBUG_CPU_TYPE=5, which forces AVX2 on Zen CPUs. In 2020, however, Intel removed this environment variable from new MKL versions, and a wave of AMD users downgraded again. Fortunately, by then OpenBLAS was a usable open-source substitute: some of its performance still lags MKL, but it is far better than no acceleration at all, and from then on NumPy and other numerical libraries gradually moved toward the open-source OpenBLAS. Later, people reverse-engineered MKL further and managed to force full Zen AVX/AVX2 support by hijacking a dynamic library.
The above focuses on CPUs. On GPUs, NVIDIA provides cuBLAS as part of CUDA; other GPU-accelerated packages include cuSOLVER, MAGMA, and the ROCm SMI Library. Besides CUDA, other parallel computing platforms include OpenCL and ROCm.
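As a small sketch of GPU-side BLAS (assuming an NVIDIA GPU and the CUDA.jl package, which are not part of the setups benchmarked above):

```julia
# Matrix products on CuArrays dispatch to cuBLAS; dense eigen/svd routines go through cuSOLVER.
using CUDA, LinearAlgebra

A = CUDA.rand(Float32, 2000, 2000)
B = CUDA.rand(Float32, 2000, 2000)
C = A * B                          # executed by cuBLAS on the GPU
```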
Compilation instructions
-O4 -msse2 -msse3 -msse4
-O4 is a compiler optimization-level option denoting the highest level of optimization. It enables a range of optimization techniques to maximize execution speed and performance. Different compilers may define and implement optimization levels differently.
-msse2, -msse3, and -msse4 are instruction-set extension options that enable specific SIMD (single instruction, multiple data) instruction sets. These extensions allow multiple data elements to be processed in parallel, improving execution efficiency. SSE (Streaming SIMD Extensions) is a family of instruction sets for vectorized computation on the x86 architecture.
Common SIMD instruction sets:
- SSE (Streaming SIMD Extensions): the earliest SIMD instruction set family, including SSE, SSE2, SSE3, SSSE3, and so on; widely supported on x86 and x86-64 processors.
- AVX (Advanced Vector Extensions): a set of SIMD extensions including AVX, AVX2, and AVX-512, providing higher vectorization performance and wider vector registers.
- NEON: the SIMD instruction set of the ARM architecture, widely used in mobile devices and embedded systems.
- AltiVec: the SIMD instruction set of the PowerPC architecture.
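As a small illustration of what these instruction sets buy in practice, Julia's @simd macro lets the compiler vectorize a loop with whatever SIMD instructions (SSE/AVX/NEON) the host CPU supports; this sketch is not tied to any of the code above:

```julia
# Allow the compiler to emit SIMD instructions for a simple reduction.
function simd_sum(x::Vector{Float64})
    s = 0.0
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end

simd_sum(rand(10^6))
```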