Exact Diagonalization

These are notes on problems encountered while doing exact diagonalization in quantum many-body physics, also serving as a learning notebook. For now we ignore the symmetries of the Hamiltonian and just focus on the coding. Thanks to gxliu from the Julia community, who was really helpful. Why is the Julia code twice as slow as Python on Windows, yet exactly the opposite on macOS?

Background

I am writing a program for exact diagonalization in quantum many-body physics. Python was too slow, so I turned to Julia, expecting it to perform much better. Here are my two programs; the Julia version was written in imitation of the Python one.

The Hamiltonian is:

$$H=\sum_i \left[\omega_i n_i+V\left(n_i-\tfrac{1}{2}\right)\left(n_{i+1}-\tfrac{1}{2}\right)+c^{\dagger}_i c_{i+1}+c^{\dagger}_{i+1}c_{i}+c^{\dagger}_{i+2}c_{i}+c^{\dagger}_i c_{i+2}\right]$$

where the on-site potentials $\omega_i$ are independent Gaussian random numbers with mean zero and variance $W^2$.
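To see where the $\pm V/4$ increments in the code below come from: since $n_i\in\{0,1\}$ in the occupation basis, each interaction bond contributes

$$V\left(n_i-\tfrac{1}{2}\right)\left(n_{i+1}-\tfrac{1}{2}\right)=\begin{cases}+V/4, & n_i=n_{i+1}\\[2pt] -V/4, & n_i\neq n_{i+1}\end{cases}$$

so a bond with equal occupations adds $+V/4$ to the diagonal and an unequal one adds $-V/4$.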

I first wrote a Python version:

import os

# Thread caps must be set before numpy is imported, otherwise they have no effect.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

import numpy as np
import scipy.sparse as sp
import datetime

t0 = datetime.datetime.now()

# Generate all computational basis states with occupation N/2. For example,
# for N=4 these are 0011, 0101, 0110, 1001, 1010, 1100, in increasing decimal
# order. Return a list of second-quantized basis states, each stored as a
# list of 0/1 occupation numbers.
def half_filling(N):
    basis = []
    for x in range(0, 2**N):
        xb = bin(x)[2:].zfill(N)
        if xb.count('1') == N//2:
            basis.append(list(map(int, xb)))
    return basis

# Build the Hamiltonian, expanded in the half-filled basis generated above.
def sparse_hamiltonian(W, N, t=1, tprime=1, V=2):
    basis = half_filling(N)
    L = len(basis)
    H = sp.dok_matrix((L, L))
    omegalis = np.ones(N)  # placeholder; for disorder use np.random.normal(0, W, N)

    for row in range(L):
        rowstate = basis[row]
        temp = 0
        # Write the diagonal elements first.
        for i in range(N):
            if rowstate[i] == 1:
                temp += omegalis[i]
                if rowstate[(i+1) % N] == 1:
                    temp += V/4
                else:
                    temp -= V/4
            else:
                if rowstate[(i+1) % N] == 1:
                    temp -= V/4
                else:
                    temp += V/4
        H[row, row] = temp
        # Now the off-diagonal elements. Because c_i^dagger c_{i+1} conserves
        # particle number, the image of a basis state stays inside the subspace
        # spanned by the basis we already built. Traverse each basis state as
        # the row index; e.g. for the state 0110, find the corresponding state
        # 1010 as the column index, and so on. If this is unfamiliar, see
        # Sandvik's tutorial on exact diagonalization.
        for j in range(N):
            if rowstate[j] == 0 and rowstate[(j+1) % N] == 1:
                newcolumnstate = rowstate.copy()
                newcolumnstate[j] = 1
                newcolumnstate[(j+1) % N] = 0
                column = basis.index(newcolumnstate)
                H[row, column] += t
            elif rowstate[j] == 0 and rowstate[(j+2) % N] == 1:
                newcolumnstate = rowstate.copy()
                newcolumnstate[j] = 1
                newcolumnstate[(j+2) % N] = 0
                column = basis.index(newcolumnstate)
                H[row, column] += tprime
    H = H + H.T
    for i in range(L):
        H[i, i] /= 2
    return H.toarray()

energy = np.linalg.eigvalsh(sparse_hamiltonian(1, 16))
print(datetime.datetime.now() - t0)

Imitating the Python code above, I wrote a Julia version intended to accelerate the computation.

using SparseArrays
using LinearAlgebra

function half_filling(N)
    basis = []   # untyped Vector{Any}; this costs performance (see Optimization below)
    for x in range(0, 2^N-1)
        xb = string(x, base=2, pad=N)
        if count('1', xb) == N÷2
            push!(basis, map(x -> parse(Int64, x), split(xb, "")))
        end
    end
    return basis
end

function Hamiltonian(W, N, t=1, tprime=1, V=2)
    basis = half_filling(N)
    L = length(basis)
    H = spzeros(L, L)
    omegalis = ones(N)   # placeholder; for disorder use W .* randn(N)
    for row in 1:L
        rowstate = basis[row]
        temp = 0
        # Diagonal elements.
        for i in 1:N
            if rowstate[i] == 1
                temp += omegalis[i]
                if rowstate[mod1(i+1, N)] == 1
                    temp += V/4
                else
                    temp -= V/4
                end
            else
                if rowstate[mod1(i+1, N)] == 1
                    temp -= V/4
                else
                    temp += V/4
                end
            end
        end
        H[row, row] = temp
        # Off-diagonal (hopping) elements.
        for j in 1:N
            if rowstate[j] == 0 && rowstate[mod1(j+1, N)] == 1
                newcolumnstate = copy(rowstate)
                newcolumnstate[j] = 1
                newcolumnstate[mod1(j+1, N)] = 0
                column = findfirst(x -> isequal(x, newcolumnstate), basis)
                H[row, column] += t
            elseif rowstate[j] == 0 && rowstate[mod1(j+2, N)] == 1
                newcolumnstate = copy(rowstate)
                newcolumnstate[j] = 1
                newcolumnstate[mod1(j+2, N)] = 0
                column = findfirst(x -> isequal(x, newcolumnstate), basis)
                H[row, column] += tprime
            end
        end
    end
    H = H + transpose(H)
    for k in 1:L
        H[k, k] /= 2
    end
    return Matrix(H)
end

@time eigvals(Hamiltonian(1, 16))
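As a quick sanity check (my own addition, not part of the original benchmark): the half-filled sector of N=16 sites has dimension $\binom{16}{8}$, so both versions should build a 12870 × 12870 matrix.

length(half_filling(16)) == binomial(16, 8) == 12870   # true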

Device and Benchmark

I ran into the puzzle that my Python code runs twice as fast as the Julia code on my Windows machine, while on macOS it is the reverse. How can that be? We ran many tests on different platforms.

Here are the hardware and software environments:

Hardware | OS | CPU | Python version | numpy version | Julia version | BLAS
AMD | Windows 10 | AMD Ryzen 5 4600H | 3.7.12 | 1.21.6 | 1.8.5 | OpenBLAS
Mac | Ventura 13.2.1 | M1 Max | 3.10.9 | 1.24.2 | 1.8.5 | OpenBLAS -> MKL
Intel | Windows 11 | i9-11900H | / | / | 1.8.5 | OpenBLAS -> MKL
Linux | CentOS 7 | dual-socket Xeon Gold 6238R | 3.9.12 | 1.21.5 | 1.8.5 | numpy MKL, Julia OpenBLAS

Two parts of the code dominate the runtime: building the matrix and diagonalizing it.
Here is my ED benchmark for the Hamiltonian at N=16:

Backend (platform) | np.linalg.eigvalsh | LinearAlgebra.eigvals
OpenBLAS (AMD) | 70 s (scipy 75 s) | ~160 s
OpenBLAS (AMD, 1 thread) | 120 s | /
OpenBLAS (Mac) | 2 min (scipy 1 min) | 60 s
OpenBLAS (Mac, 1 thread) | 8 min | /
vecLib (Mac) | 7 min 09 s | /
OpenBLAS (Intel) | 5 min | 140 s
MKL (Intel) | 33 s | 37 s
MKL (Intel, 1 thread) | 66 s | 231 s
MKL (Intel, CentOS 7) | 57 s | ~85 s
MKL (Intel, CentOS 7, 1 thread) | 64 s | 95 s
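For reference, a minimal sketch of how the two phases can be timed separately in Julia, assuming the Hamiltonian function defined above (BLAS.set_num_threads controls the thread count the eigensolver uses):

using LinearAlgebra

BLAS.set_num_threads(1)        # pin the BLAS/LAPACK thread count
@time H = Hamiltonian(1, 16)   # phase 1: building the matrix (pure Julia)
@time eigvals(Symmetric(H))    # phase 2: dense diagonalization (BLAS/LAPACK-bound)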

Running np.show_config() under numpy 1.21.6 returns:

blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    language = f77
lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL

On my AMD machine, versioninfo() returns:

Platform Info:
  OS: Windows (x86_64-w64-mingw32)  # x86_64 architecture, a.k.a. AMD64 or Intel 64; w64 means 64-bit Windows; mingw32 is the toolchain (Minimalist GNU for Windows)
  CPU: 12 × AMD Ryzen 5 4600H with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm  # the math library the Julia runtime links against
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)  # ORCJIT is LLVM's JIT compiler component; znver2 means it targets the AMD Zen 2 microarchitecture; shipped with the installed Julia
  Threads: 1 on 12 virtual cores  # Julia is currently using 1 thread; the processor has 12 virtual cores

and LinearAlgebra.BLAS.get_config() returns:

LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.dll  # an OpenBLAS build with the ILP64 (64-bit integer) interface
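The "OpenBLAS -> MKL" entries in the table above refer to swapping Julia's BLAS backend. A minimal sketch, assuming the MKL.jl package has been installed with ] add MKL:

using MKL            # loading MKL.jl makes libblastrampoline forward BLAS/LAPACK calls to MKL
using LinearAlgebra
LinearAlgebra.BLAS.get_config()   # should now list an MKL library instead of libopenblas64_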

Running torch.__config__.show() in my PyTorch install returns:

PyTorch built with:
  - C++ Version: 199711
  - MSVC 193431937
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 2019
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.5
  - Magma 2.5.4
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=C:/cb/pytorch_1000000000000/work/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj /FS -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=OFF, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF,

Optimization

Notice that the basis list is already sorted, since the states are generated in increasing order. So we do not need list.index or findfirst: a binary search, bisect.bisect_left() in Python and searchsortedfirst in Julia, performs the same lookup in O(log L) instead of O(L). Also, the element type of half_filling's return value was never specified, so we can declare it explicitly as String.
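A toy illustration of both points (my own example, not from the benchmark code): searchsortedfirst finds the same index that findfirst would, but by binary search, and a concretely typed Vector{String} avoids the per-element dynamic dispatch of an untyped Vector{Any}.

basis = ["0011", "0101", "0110", "1001", "1010", "1100"]  # N=4 half filling, already sorted
searchsortedfirst(basis, "1010")    # 5, by binary search
findfirst(isequal("1010"), basis)   # 5, by linear scan

typeof(String[])  # Vector{String}: concrete element type
typeof([])        # Vector{Any}: every access needs a runtime type check

With these changes, the code becomes: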

            # Only the lookup changes (requires: import bisect); fragment of sparse_hamiltonian:
                newcolumnstate[(j+1)%N] = 0
                column = bisect.bisect_left(basis, newcolumnstate)
                H[row, column] += t
            elif rowstate[j] == 0 and rowstate[(j+2)%N] == 1:
                newcolumnstate = rowstate.copy()
                newcolumnstate[j] = 1
                newcolumnstate[(j+2)%N] = 0
                # column = basis.index(newcolumnstate)
                column = bisect.bisect_left(basis, newcolumnstate)
                # After switching to this lookup, building the matrix went from 24 s to 4.4 s.
                H[row, column] += tprime
using SparseArrays
using LinearAlgebra

function half_filling_v2(N)
    @assert mod(N, 2) == 0

    N2 = N ÷ 2
    basis = String[]   # concrete element type instead of Vector{Any}

    for x in range(0, 2^N-1)
        xb = string(x, base=2, pad=N)
        if count('1', xb) == N2
            push!(basis, xb)
        end
    end
    return basis
end

function Hamiltonian_v2(W, N, t=1, tprime=1, V=2)
    basis = half_filling_v2(N)

    L = length(basis)
    H = spzeros(L, L)
    omegalis = ones(N)
    for row in 1:L
        rowstate = basis[row]
        temp = 0
        for i in 1:N
            if rowstate[i] == '1'
                temp += omegalis[i]
                if rowstate[mod1(i+1, N)] == '1'
                    temp += V/4
                else
                    temp -= V/4
                end
            else
                if rowstate[mod1(i+1, N)] == '1'
                    temp -= V/4
                else
                    temp += V/4
                end
            end
        end
        H[row, row] = temp
        for j in 1:N
            if rowstate[j] == '0' && rowstate[mod1(j+1, N)] == '1'
                statearray = collect(rowstate)
                statearray[j] = '1'
                statearray[mod1(j+1, N)] = '0'
                newcolumnstate = join(statearray)
                # column = findfirst(x -> isequal(x, newcolumnstate), basis)
                column = searchsortedfirst(basis, newcolumnstate)   # binary search on the sorted basis
                H[row, column] += t
            elseif rowstate[j] == '0' && rowstate[mod1(j+2, N)] == '1'
                statearray = collect(rowstate)
                statearray[j] = '1'
                statearray[mod1(j+2, N)] = '0'
                newcolumnstate = join(statearray)
                column = searchsortedfirst(basis, newcolumnstate)
                H[row, column] += tprime
            end
        end
    end
    H = H + transpose(H)
    for k in 1:L
        H[k, k] /= 2
    end
    return Matrix(H)
end

a = Hamiltonian_v2(1, 16)
# energy = eigvals(Hamiltonian_v2(1, 16))

The matrix-preparation time after this optimization:

Preparation | Python | Julia
before | 24 s | 26 s
after | 4 s | 1 s

Conclusion

Overall, after optimizing the initial programs, the Python runtime on AMD dropped from the original 70 s to 50 s, and the Julia runtime from the original 170 s to 144 s; on Linux it came down to 72 s on the first run and 65 s on the second. Note that scipy and numpy do not necessarily use the same linear algebra library ("Another advantage of using scipy.linalg over numpy.linalg is that it is always compiled with BLAS/LAPACK support, while for numpy this is optional. Therefore, the scipy version might be faster depending on how numpy was installed. Therefore, unless you don't want to add scipy as a dependency to your numpy program, use scipy.linalg instead of numpy.linalg"). Both numpy's backend and Julia's LinearAlgebra backend are selectable, and it is advisable to link against whichever linear algebra library suits your hardware. Julia's OpenBLAS support is honestly not great; on Intel chips, use MKL. In theory AMD+OpenBLAS should match Intel+MKL, and Python bears this out, but Julia does not, which is strange.

Math Packages

Basic Math Packages

The basic linear algebra packages include the following:

BLAS

Basic Linear Algebra Subprograms, specified and reference-implemented by Netlib in Fortran; the reference implementation itself is not heavily optimized. Various software and hardware vendors provide highly optimized implementations of the BLAS interface for their own products.
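As an illustration of the three BLAS levels, here is a minimal Julia sketch using the low-level wrappers in LinearAlgebra.BLAS (toy data, not from the benchmark):

using LinearAlgebra

x = randn(3); y = randn(3)
A = randn(3, 3); B = randn(3, 3); C = zeros(3, 3)

BLAS.axpy!(2.0, x, y)                    # level 1 (vector-vector):  y .= 2x .+ y
BLAS.gemv!('N', 1.0, A, x, 0.0, y)       # level 2 (matrix-vector):  y = A*x
BLAS.gemm!('N', 'N', 1.0, A, B, 0.0, C)  # level 3 (matrix-matrix):  C = A*B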

LAPACK

Linear Algebra PACKage, also by Netlib in Fortran. It is built on top of BLAS and provides higher-level routines (factorizations, linear solvers, eigenproblems) that are more efficient than composing them from raw BLAS calls.
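In Julia, eigvals on a symmetric matrix ultimately dispatches to LAPACK's symmetric eigensolver family; the low-level wrapper can also be called directly. A minimal sketch:

using LinearAlgebra

A = randn(4, 4)
A = (A + A') / 2                     # make it symmetric
w = LAPACK.syev!('N', 'U', copy(A))  # 'N': eigenvalues only; 'U': use the upper triangle
w ≈ eigvals(Symmetric(A))            # true: the high-level call agrees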

ScaLAPACK

short for Scalable LAPACK, a distributed-memory version of LAPACK suitable for MIMD (multiple instruction, multiple data) machines.

Advanced Math Packages

BLAS, LAPACK, and ScaLAPACK can be viewed as interface standards on top of which other packages are implemented. Among these, ATLAS and OpenBLAS are open source, while others, such as MKL and cuBLAS, are commercial.

Atlas

Automatically Tuned Linear Algebra Software: it can automatically adjust its operating parameters at runtime based on the hardware.

OpenBLAS

Just as its name says: an open BLAS. (Remarkably, its initial release was on 22 March 2011, and its developers include Zhang Xianyi and Wang Qian.) OpenBLAS's optimization is done at compile time, so it generally runs more efficiently than ATLAS. But this also makes OpenBLAS highly hardware-dependent: if you change machines, it may need to be recompiled.

MKL

Built with Intel's C++ and Fortran compilers, with threading implemented via OpenMP (Open Multi-Processing, a parallel programming model). Supported on Linux/Windows with Intel chips (such as the i9-11900H).

Under the hood it provides:

  1. BLAS: all matrix-matrix operations (level 3) are threaded for both dense and sparse BLAS. Many vector-vector (level 1) and matrix-vector (level 2) operations are threaded for dense matrices in 64-bit programs on the Intel 64 architecture. For sparse matrices, all level-2 operations except the triangular sparse solver are threaded.
  2. LAPACK: some computational routines are threaded for certain problem types: linear equation solvers, orthogonal factorizations, singular value decomposition, and symmetric eigenvalue problems. LAPACK also calls BLAS, so even non-threaded routines may run in parallel.
  3. ScaLAPACK: a distributed-memory parallel version of LAPACK for clusters.
  4. PARDISO: the three stages of this parallel direct sparse solver are threaded: reordering (optional), factorization, and solve (when there are multiple right-hand sides).
  5. DFTs: Discrete Fourier Transforms
  6. VML: Vector Mathematics Library
  7. VSL: Vector Statistics Library

Armadillo

An easy-to-use C++ matrix library built on template metaprogramming, with a Matlab-like interface. It delegates to efficient LAPACK/BLAS backends, including high-performance implementations such as Intel MKL, AMD ACML, and OpenBLAS.

Its backends:

BLAS/LAPACK: supports OpenBLAS, ACML, and MKL

Eigen(3)

Its backends:

  1. BLAS/LAPACK: any F77-compatible BLAS or LAPACK library can serve as the backend (EIGEN_USE_BLAS, EIGEN_USE_LAPACKE)
  2. MKL: supported as a backend (EIGEN_USE_MKL_ALL)
  3. CUDA: Eigen can be used inside CUDA kernels
  4. OpenMP: multi-threaded optimization

How they relate

  1. In the narrow sense, BLAS/LAPACK can be understood as APIs for linear algebra libraries
  2. Netlib provides the Fortran/C reference implementations: BLAS/LAPACK and CBLAS/CLAPACK
  3. The open-source community and commercial vendors provide optimized implementations of those APIs: for BLAS (ATLAS, OpenBLAS) and for LAPACK (MKL, ACML, cuBLAS)
  4. Eigen and Armadillo implement their own linear algebra routines but can also sit on top of any of the BLAS/LAPACK backends above to accelerate computation

How they compare:

  1. Candidates: MKL, OpenBLAS, Eigen, Armadillo
  2. Ease of use: Eigen > Armadillo > MKL/OpenBLAS
  3. Speed: MKL ≈ OpenBLAS > Eigen (with MKL) > Eigen > Armadillo
    In particular:
  4. OpenBLAS has no single-core build; forcing OMP_NUM_THREADS=1 costs a lot of performance, so it is ruled out in that setting
  5. Using MKL directly has a steeper learning curve, but it delivers the best performance
  6. Armadillo is less efficient and less ergonomic than Eigen
  7. Eigen's native BLAS/LAPACK implementation is slower than MKL and OpenBLAS, but with MKL as its backend it is nearly as fast as raw MKL, so whether to use MKL can be decided case by case
  8. AMD+OpenBLAS can rival Intel+MKL: on Intel processors just pick MKL, while on AMD processors OpenBLAS is about twice as fast as MKL.
  9. In 2018, MATLAB users reported absurdly low performance on Zen CPUs, worse than Intel CPUs from several years earlier. Only then did people discover that the MKL MATLAB ships does not enable the AVX2 instruction set on non-Intel CPUs. Someone reverse-engineered MKL and dug up the famous environment variable MKL_DEBUG_CPU_TYPE=5, which forces AVX2 on Zen CPUs. But in 2020 Intel removed that environment variable from new MKL releases, and a wave of AMD users downgraded in response. Fortunately, by then a usable open-source replacement, OpenBLAS, existed: still behind MKL in some respects, but far better than no acceleration at all. From that point on, NumPy and a number of other numerical libraries gradually moved toward the open-source OpenBLAS. Later still, people reverse-engineered MKL further and found that full Zen AVX/AVX2 support could be forced through dynamic-library hijacking.

The above focuses on CPUs. On GPUs, NVIDIA provides cuBLAS as part of CUDA; other GPU-accelerated packages include cuSOLVER, MAGMA, and the ROCm SMI Library.

Besides CUDA, other parallel computing platforms include OpenCL and ROCm.

Compilation flags

-O4 -msse2 -msse3 -msse4

-O4 is a compiler optimization-level flag, denoting the highest level of optimization. It enables a variety of optimization techniques to maximize execution speed and performance. Different compilers may define and implement optimization levels differently.

-msse2, -msse3, and -msse4 are instruction-set extension flags that enable specific SIMD (single instruction, multiple data) instruction sets. These extensions allow multiple data elements to be processed in parallel, improving execution efficiency. SSE (Streaming SIMD Extensions) is a family of instruction sets for vectorized computation on the x86 architecture.

Common SIMD instruction sets include:

  1. SSE (Streaming SIMD Extensions): the earliest SIMD family, including SSE, SSE2, SSE3, SSSE3, and so on; widely supported on x86 and x86-64 processors.

  2. AVX (Advanced Vector Extensions): a family of SIMD extensions including AVX, AVX2, and AVX-512, offering higher vectorization throughput and wider vector registers.

  3. NEON: the SIMD instruction set of the ARM architecture, widely used in mobile devices and embedded systems.

  4. AltiVec: the SIMD instruction set of the PowerPC architecture.