Posted on 2005/08/29 11:36
Filed Under Cluster/High-Performance Computing (HPC)

Building an HPC Cluster Environment with InfiniBand


                             Author : 서 진우 (alang@syszone.co.kr)

1. Setting up the basic Linux HPC cluster environment

        - Network setup (ifcfg-eth0, hostname, ...)
        - rsh and ssh setup
        - Intel compiler installation (icc, ifc, ...)

2. Installing IBGD (InfiniBand Gold Distribution)

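IBGD is a script-driven tool that automatically sets up an InfiniBand development
environment. It performs the basic configuration and then builds and installs
system-optimized RPMs in sequence. It provides the HCA card modules used by
InfiniBand, an MPICH that supports the InfiniBand protocol, benchmark tools, and more.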

With the basic development environment in place, glib-devel must also be installed.
Extract the IBGD-1.8.0.gz source and run install.sh.
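
The glib-devel prerequisite can be checked before the installer is started. The following is a minimal sketch, assuming an RPM-based system; require_pkgs and the RPM_QUERY override are hypothetical names introduced here for illustration and are not part of IBGD.

```shell
# Hypothetical helper (not part of IBGD): print a line for every
# prerequisite package that the RPM database does not know about.
# RPM_QUERY can be overridden, for example for a dry run.
require_pkgs() {
    for pkg in "$@"; do
        ${RPM_QUERY:-rpm -q} "$pkg" >/dev/null 2>&1 || echo "missing: $pkg"
    done
}

# Check the prerequisite named in the text before running install.sh.
require_pkgs glib-devel
```

If nothing is printed, the package is installed and install.sh can be run.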


[root@noco01 infini]# tar xzvf IBGD-1.8.0.gz
[root@noco01 infini]# cd IBGD-1.8.0
[root@noco01 IBGD-1.8.0]# ./install.sh


         InfiniBand Gold Distribution (IBGD) Software Installation Menu

          1) View IBGD Installation Guide
          2) Install IBGD Software
          3) Show Installed Software
          4) Configure IPoIB Network Interface, IBADM Server, and OpenSM Server
          5) Uninstall IBGD Software
          6) Build IBGD Software RPMs

          Q) Exit

Select Option [1-6]:
                           -> 2 (Install IBGD Software)


          Select IBGD Software

          1) Typical (ib-verbs, ib-ipoib, opensm, ibadm and mpi)
          2) Minimal (ib-verbs only)
          3) All packages (ib-verbs, ib-ipoib, ib-cm, ib-sdp, ib-dapl, ib-srp, opensm, ibadm, mpi, pdsh)
          4) Customize

          Q) Exit


                                ib-ipoib -> IP-over-InfiniBand network device driver
                                ib-verbs -> InfiniBand verbs layer (HCA driver and API)
                                opensm   -> InfiniBand subnet manager
                                ibadm    -> administration commands
                                mpi      -> MPI for InfiniBand


Select Option [1-4]:
                          -> 1 (Typical)

The following compiler(s) on your system can be used to build/install MPI:  gcc intel

Next you will be prompted to choose the compiler(s) with which to build/install the MPI RPM(s)

Do you wish to create/install an MPI RPM with gcc? [Y/n]:
Do you wish to create/install an MPI RPM with intel? [Y/n]:

Next you will be prompted to enter the number of CPUs in your cluster.

        small:  1 - 63 CPUs
        medium: 64 - 255 CPUs
        big:    256+ CPUs

Please select the size of your cluster [small/medium/big]:

                                                               -> small

Following is the list of IBGD packages that you have chosen
            (some may have been added by the installation program due to package dependencies):
ib-ipoib
ib-verbs
opensm
mpi_osu
ibadm

Preparing to build the IBGD RPMs:


RPM build process uses a temporary directory.

Please enter the temporary directory [/var/tmp/IBGD]:

Please enter IBGD installation directory [/usr/local/ibgd]:

The following compiler(s) will be used to build the MPI RPM(s): intel gcc


Checking dependencies. Please wait ...


Building InfiniBand Software RPMs. Please wait...


Building ib RPMs. Please wait...

Running /tmp/ib-1.8.0/build_rpm.sh --prefix /usr/local/ibgd --build_root /var/tmp/IBGD \\
--packages ib_ipoib ib_verbs -- -kver 2.6.9-11.EL.rootsmp --ksrc /lib/modules/2.6.9-11.EL.rootsmp/source
.
.


The installer builds RPMs optimized for the system. When the build completes, it continues with:

Configuring IPoIB:


The default IPoIB interface configuration is based on a LAN interface configuration.
You may change this default configuration in the following steps.

Enter LAN interface to be used for setting ib0 interface [eth0]:ib0

Configuring IPoIB:


The default IPoIB interface configuration is based on a LAN interface configuration.
You may change this default configuration in the following steps.

Enter LAN interface to be used for setting ib0 interface [eth0]:ib1

ib0 configuration:

  Current IPOIB configuration for ib0

DEVICE=ib0
BOOTPROTO=static
IPADDR=193.168.123.111
NETMASK=255.255.255.0
NETWORK=193.168.123.0
BROADCAST=193.168.123.255
ONBOOT=yes

Do you want to change this configuration? [y/N]: n

IPOIB interface configured successfully


Configuring OpenSM:

Enter OpenSM Server IP Address [192.168.123.111]:

Configuring IBADM:


Please provide IBADM (Infiniband Administration Package) configuration:

Enter IBADM Name Server and In-Band Server IP Address (one per IB subnet) [192.168.123.111]:
Enter In-Band Server Hostname [H-1]:       ->   node01
Enter firmware work directory [/tmp]:
Creating FW directory to be used by IBADM server

Running tar xzvf /usr/local/src/infini/IBGD-1.8.0/SOURCES/ibgd1.8.0.fwrel.tgz

/usr/local/ibgd/FW directory updated with new FW release

Do you want to install IBGranite Cluster Verification Suite (ibgfvs-1.0.0 - beta version) [y/N]?

         InfiniBand Gold Distribution (IBGD) Software Installation Menu

          1) View IBGD Installation Guide
          2) Install IBGD Software
          3) Show Installed Software
          4) Configure IPoIB Network Interface, IBADM Server, and OpenSM Server
          5) Uninstall IBGD Software
          6) Build IBGD Software RPMs

          Q) Exit

Select Option [1-6]:


At this point the installation is complete.

Running openibd is supposed to load the modules and bring up the network automatically,
but it is full of bugs, so it is better to configure the network manually.
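
As a rough illustration of manual configuration, the sketch below writes a static ifcfg-ib0 file with the same keys the installer displayed for ib0. write_ifcfg_ib0 is a hypothetical helper, the address is a placeholder, and the file is written to a scratch directory rather than to /etc/sysconfig/network-scripts so that nothing on the system is changed.

```shell
# Hypothetical helper: write a static ifcfg file for ib0 into the given
# directory (on a Red Hat style system the real location would be
# /etc/sysconfig/network-scripts). The netmask defaults to a /24.
write_ifcfg_ib0() {
    dir=$1
    ipaddr=$2
    netmask=${3:-255.255.255.0}
    cat > "$dir/ifcfg-ib0" <<EOF
DEVICE=ib0
BOOTPROTO=static
IPADDR=$ipaddr
NETMASK=$netmask
ONBOOT=yes
EOF
}

# Example: generate the file in a scratch directory and inspect it.
# The address below is a placeholder; substitute the node's own IPoIB address.
scratch=$(mktemp -d)
write_ifcfg_ib0 "$scratch" 192.168.123.111
cat "$scratch/ifcfg-ib0"
```

Once a file like this is in place under /etc/sysconfig/network-scripts, `ifup ib0` should bring the interface up, assuming the InfiniBand kernel modules are already loaded.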

With the installation complete, configure ibadm.

[root@noco01 ~]# vi /etc/ibadm.hosts
------------------------------------------------------------------------------------------
192.168.123.111
192.168.123.112
.
.
(list every node that is part of the InfiniBand fabric)



Then copy the following files from the first node to all the other nodes.

[root@noco01 ~]# dua2 /etc/ibadm.hosts
[root@noco01 ~]# dua2 /etc/ibadm.conf
[root@noco01 ~]# dua2 /etc/ibfw
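
dua2 is a site-local helper whose implementation is not shown here. On a cluster without it, the same copy step could be approximated with a loop over /etc/ibadm.hosts. The sketch below is an assumption about what such a helper does, not its actual code; push_to_nodes and COPY_CMD are hypothetical names.

```shell
# Hypothetical stand-in for the dua2 helper: copy each file to the same
# path on every host listed in a hosts file, skipping the local node.
# COPY_CMD defaults to scp; set COPY_CMD=echo for a dry run.
push_to_nodes() {
    hosts_file=$1
    self=$2
    shift 2
    while read -r host; do
        [ -z "$host" ] && continue
        [ "$host" = "$self" ] && continue
        for f in "$@"; do
            # stdin is redirected so the copy command cannot eat the host list
            ${COPY_CMD:-scp -rp} "$f" "root@$host:$f" < /dev/null
        done
    done < "$hosts_file"
}

# Dry run against a two-node list: prints the copies that would be made.
list=$(mktemp)
printf '192.168.123.111\n192.168.123.112\n' > "$list"
COPY_CMD=echo push_to_nodes "$list" 192.168.123.111 /etc/ibadm.hosts /etc/ibadm.conf
```

Dropping COPY_CMD=echo performs the real copy, which relies on the rsh/ssh setup from step 1.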

Then run the following daemon commands in order.

[root@noco01 ~]# dush2 /etc/rc.d/init.d/ibadmd stop
[root@noco01 ~]# dush2 /etc/rc.d/init.d/opensmd stop
[root@noco01 ~]# dush2 /etc/rc.d/init.d/openibd restart



3. Basic network performance tests (perf_main, NetPIPE)

- perf_main Test :

Run the following on noco01:

[root@noco01 ~]# perf_main --send -trc -mbw -s 128000 -n 1000
********************************************
*********  perf_main version 10.3  *********
*********  CPU is: 2993.00 Mcps    *********
*********  Architecture X86     *********
********************************************


Run the following on noco02:

[root@noco02 ~]# perf_main -a 192.168.123.111 (noco01 ip)
********************************************
*********  perf_main version 10.3  *********
*********  CPU is: 2993.00 Mcps    *********
*********  Architecture X86     *********
********************************************


The test result then appears on the noco01 console as follows.

************* RC BW Unidirection Test started for port 1  *********************

BW: 935.6 MBytes/sec [size: 128000 bytes, iter: 1000, total 128000000]

************* RC BW Unidirection Test Finished for port 1 *********************

This indicates a network bandwidth of about 935 MB/sec. By comparison, Gigabit Ethernet
delivers roughly 100 MB/sec in practice (its line rate is 125 MB/sec).
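
To compare this figure with link speeds quoted in bits per second, the BW line can be converted with a small awk wrapper. This is only a sketch; perf_main_bw and mbytes_to_gbits are hypothetical helper names, and decimal prefixes (1 Gbit = 1000 Mbit) are assumed.

```shell
# Pull the MBytes/sec figure out of a perf_main "BW:" result line.
perf_main_bw() {
    awk '$1 == "BW:" { print $2 }'
}

# Convert MBytes/sec to Gbit/sec (1 byte = 8 bits, decimal prefixes).
mbytes_to_gbits() {
    awk -v mb="$1" 'BEGIN { printf "%.2f\n", mb * 8 / 1000 }'
}

bw=$(echo 'BW: 935.6 MBytes/sec [size: 128000 bytes, iter: 1000, total 128000000]' | perf_main_bw)
echo "$bw MBytes/sec = $(mbytes_to_gbits "$bw") Gbit/sec"
```

The same conversion applied to 125 MBytes/sec gives the 1 Gbit/sec line rate of Gigabit Ethernet.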

- Measuring NPmpi, NPtcp and NPib performance with NetPIPE

First, compile NetPIPE.

[root@noco01 infini]# tar xzvf NetPIPE_3.6.2.tar.tar
[root@noco01 infini]# cd NetPIPE_3.6.2

Edit the makefile to match the current MPICH environment.

[root@noco01 NetPIPE_3.6.2]# vi makefile

MPICC       = /usr/local/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpicc
MTHOME  = /usr/local/ibgd/driver/infinihost

Then run make as follows.

[root@noco01 NetPIPE_3.6.2]# make mpi
[root@noco01 NetPIPE_3.6.2]# make tcp
[root@noco01 NetPIPE_3.6.2]# make ib

Synchronize the compiled executables to all nodes.

[root@noco01 NetPIPE_3.6.2]# dua2 *

Now use NPmpi to measure the maximum network bandwidth available to MPI communication.

*** NPmpi (MPI bandwidth measured with the MPICH build that includes the InfiniBand driver)

[root@noco01 NetPIPE_3.6.2]# /usr/local/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpirun_rsh -rsh -np 2 node01 node02 ./NPmpi
-----------------------------------------------------------------------------------------------------------------------
0: noco01
1: noco02

Now starting the main loop
  0:       1 bytes  20491 times -->      1.80 Mbps in       4.24 usec
  1:       2 bytes  23590 times -->      3.61 Mbps in       4.23 usec
  2:       3 bytes  23655 times -->      5.41 Mbps in       4.23 usec
.
.

108: 1572867 bytes     40 times -->   7314.13 Mbps in    1640.66 usec
109: 2097149 bytes     20 times -->   7335.83 Mbps in    2181.07 usec
110: 2097152 bytes     22 times -->   7334.86 Mbps in    2181.36 usec
111: 2097155 bytes     22 times -->   7331.74 Mbps in    2182.30 usec
112: 3145725 bytes     22 times -->   7354.99 Mbps in    3263.09 usec
113: 3145728 bytes     20 times -->   7355.64 Mbps in    3262.80 usec
114: 3145731 bytes     20 times -->   7351.55 Mbps in    3264.62 usec
115: 4194301 bytes     10 times -->   7365.29 Mbps in    4344.70 usec
116: 4194304 bytes     11 times -->   7366.41 Mbps in    4344.04 usec
117: 4194307 bytes     11 times -->   7372.73 Mbps in    4340.32 usec
118: 6291453 bytes     11 times -->   7387.76 Mbps in    6497.23 usec
119: 6291456 bytes     10 times -->   7388.09 Mbps in    6496.95 usec
120: 6291459 bytes     10 times -->   7385.65 Mbps in    6499.09 usec
121: 8388605 bytes      5 times -->   7393.03 Mbps in    8656.80 usec
122: 8388608 bytes      5 times -->   7393.03 Mbps in    8656.80 usec
123: 8388611 bytes      5 times -->   7391.16 Mbps in    8659.00 usec
--------------------------------------------------------------------------------------------------------------------------

As shown above, the peak measured bandwidth is about 7,390 Mbps, roughly 920 MB/sec, which is about 7 times that of Gigabit Ethernet.
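
NetPIPE prints throughput in megabits per second, so dividing by 8 gives megabytes per second. The sketch below picks the highest Mbps value out of NetPIPE output and converts it; netpipe_peak is a hypothetical helper name, and the two sample lines are taken from the run above.

```shell
# Read NetPIPE output on stdin and report the highest throughput seen,
# both in Mbps (as printed) and in MB/sec (Mbps divided by 8).
netpipe_peak() {
    awk '
        /-->/ {
            for (i = 2; i <= NF; i++)
                if ($i == "Mbps" && $(i-1) + 0 > max) max = $(i-1) + 0
        }
        END { printf "%.2f Mbps = %.2f MB/sec\n", max, max / 8 }
    '
}

netpipe_peak <<'EOF'
121: 8388605 bytes      5 times -->   7393.03 Mbps in    8656.80 usec
123: 8388611 bytes      5 times -->   7391.16 Mbps in    8659.00 usec
EOF
```

In practice the full NPmpi, NPtcp or NPib output can be piped into the function instead of the here-document.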

*** NPtcp (ordinary TCP network bandwidth)

Run the following on node02:

[root@noco02 NetPIPE_3.6.2]# ./NPtcp

Run the following on node01:

[root@noco01 NetPIPE_3.6.2]# ./NPtcp -h node02
-----------------------------------------------------------------------------
.
.
109: 2097149 bytes      3 times -->   1055.77 Mbps in   15154.83 usec
110: 2097152 bytes      3 times -->   1060.92 Mbps in   15081.18 usec
111: 2097155 bytes      3 times -->   1056.31 Mbps in   15147.15 usec
112: 3145725 bytes      3 times -->   1059.30 Mbps in   22656.51 usec
113: 3145728 bytes      3 times -->   1061.99 Mbps in   22599.18 usec
114: 3145731 bytes      3 times -->   1057.49 Mbps in   22695.18 usec
115: 4194301 bytes      3 times -->   1058.45 Mbps in   30232.83 usec
116: 4194304 bytes      3 times -->   1056.31 Mbps in   30294.00 usec
117: 4194307 bytes      3 times -->   1061.43 Mbps in   30148.17 usec
118: 6291453 bytes      3 times -->   1063.24 Mbps in   45144.85 usec
119: 6291456 bytes      3 times -->   1061.07 Mbps in   45237.15 usec
120: 6291459 bytes      3 times -->   1060.43 Mbps in   45264.83 usec
121: 8388605 bytes      3 times -->   1060.78 Mbps in   60332.68 usec
122: 8388608 bytes      3 times -->   1061.87 Mbps in   60271.00 usec
123: 8388611 bytes      3 times -->   1062.30 Mbps in   60246.67 usec
-------------------------------------------------------------------------------

The result is at about the level of ordinary Gigabit Ethernet.
  
*** NPib (native InfiniBand bandwidth)

[root@noco01 NetPIPE_3.6.2]# ./NPib -h node02
-------------------------------------------------------------------------------
.
.

108: 1572867 bytes     40 times -->   7299.72 Mbps in    1643.90 usec
109: 2097149 bytes     20 times -->   7327.83 Mbps in    2183.45 usec
110: 2097152 bytes     22 times -->   7326.93 Mbps in    2183.73 usec
111: 2097155 bytes     22 times -->   7329.99 Mbps in    2182.82 usec
112: 3145725 bytes     22 times -->   7354.01 Mbps in    3263.52 usec
113: 3145728 bytes     20 times -->   7354.85 Mbps in    3263.15 usec
114: 3145731 bytes     20 times -->   7353.85 Mbps in    3263.60 usec
115: 4194301 bytes     10 times -->   7367.33 Mbps in    4343.50 usec
116: 4194304 bytes     11 times -->   7368.11 Mbps in    4343.04 usec
117: 4194307 bytes     11 times -->   7368.11 Mbps in    4343.05 usec
118: 6291453 bytes     11 times -->   7382.49 Mbps in    6501.87 usec
119: 6291456 bytes     10 times -->   7382.23 Mbps in    6502.10 usec
120: 6291459 bytes     10 times -->   7382.06 Mbps in    6502.25 usec
121: 8388605 bytes      5 times -->   7388.34 Mbps in    8662.30 usec
122: 8388608 bytes      5 times -->   7388.76 Mbps in    8661.81 usec
123: 8388611 bytes      5 times -->   7388.18 Mbps in    8662.49 usec
--------------------------------------------------------------------------------

The result is about 7,390 Mbps (roughly 920 MB/sec), similar to the MPI result and about 7 times that of Gigabit Ethernet.


4. HPL Linpack performance test

The bundled installation performed by the automatic installer above also installs an MPICH
built for InfiniBand with the Intel compiler. However, it does not integrate correctly, so reinstall it manually.

[root@noco01 src]# tar xzvf mvapich-0.9.5.tar.gz
[root@noco01 src]# cd mvapich-0.9.5
[root@noco01 mvapich-0.9.5]# ./configure --prefix=/usr/local/mvapich-intel -cc=/opt/intel/cc/9.0/bin/icc -c++=/opt/intel/cc/9.0/bin/icc -fc=/opt/intel/fc/9.0/bin/ifort -f90=/opt/intel/fc/9.0/bin/ifort -f90linker=/opt/intel/fc/9.0/bin/ifort --with-arch=LINUX --enable-f77 --enable-f90modules --with-device=vapi --disable-weak-symbols

[root@noco01 mvapich-0.9.5]# make && make install

To use the mvapich installed by the bundled package instead, run it as follows.

# /usr/local/ibgd/mpi/osu/intel/mvapich-0.9.5/bin/mpirun_rsh -rsh -np 4 node01 node01 node02 node02 ./xhpl
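
In the command above each node name is repeated once per process to be started on that node. For larger runs the list can be generated rather than typed. This is a small sketch; mpi_hostlist is a hypothetical helper and not part of mvapich.

```shell
# Hypothetical helper: repeat each node name once per process so the
# result can be passed to mpirun_rsh after the -np argument.
mpi_hostlist() {
    per_node=$1
    shift
    out=""
    for node in "$@"; do
        i=0
        while [ "$i" -lt "$per_node" ]; do
            out="$out $node"
            i=$((i + 1))
        done
    done
    echo "${out# }"
}

# Two processes on each of two nodes, matching the command above.
hosts=$(mpi_hostlist 2 node01 node02)
echo "mpirun_rsh -rsh -np 4 $hosts ./xhpl"
```

The value given to -np has to equal the number of names in the generated list.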

Then run the HPL test.
-------------------------------------------------------------------------------------------------------------------
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   20000
NB     :     104
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Crout
NBMIN  :       4
NDIV   :       2
RFACT  :   Right
BCAST  :   1ring
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be          5.421011e-20
- Computational tests pass if scaled residuals are less than           16.0

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR10R2C4       20000   104     1     4             736.74          7.240e+00
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =       44.6046954 ...... FAILED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =       43.1710652 ...... FAILED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        8.2276427 ...... PASSED
||Ax-b||_oo  . . . . . . . . . . . . . . . . . =           0.000000
||A||_oo . . . . . . . . . . . . . . . . . . . =        5076.247098
||A||_1  . . . . . . . . . . . . . . . . . . . =        5074.832190
||x||_oo . . . . . . . . . . . . . . . . . . . =           5.419810
||x||_1  . . . . . . . . . . . . . . . . . . . =       20664.162508
============================================================================

Finished      1 tests with the following results:
              0 tests completed and passed residual checks,
              1 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================

Average CPU usage: 99%

For reference, these are the results of testing in 100 Mbit and 1000 Mbit network environments.

**** 100M environment

Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   20000
NB     :     104
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Crout  
NBMIN  :       4
NDIV   :       2
RFACT  :   Right
BCAST  :   1ring  
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be          5.421011e-20
- Computational tests pass if scaled residuals are less than           16.0

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR10R2C4       20000   104     1     4             952.82          5.598e+00
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =       35.8555115 ...... FAILED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =       34.7030870 ...... FAILED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        6.6137956 ...... PASSED
||Ax-b||_oo  . . . . . . . . . . . . . . . . . =           0.000000
||A||_oo . . . . . . . . . . . . . . . . . . . =        5076.247098
||A||_1  . . . . . . . . . . . . . . . . . . . =        5074.832190
||x||_oo . . . . . . . . . . . . . . . . . . . =           5.419810
||x||_1  . . . . . . . . . . . . . . . . . . . =       20664.162508
============================================================================

Finished      1 tests with the following results:
              0 tests completed and passed residual checks,
              1 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================

Average CPU usage: 80%
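
The Gflops column in both runs can be reproduced from N and the solve time. The sketch below assumes the usual HPL operation count of 2/3*N^3 + 3/2*N^2 floating-point operations; hpl_gflops is a hypothetical helper name. It also prints the ratio between the InfiniBand run and the 100M run.

```shell
# Reproduce HPL's Gflops figure from the problem size N and the solve
# time in seconds, assuming 2/3*N^3 + 3/2*N^2 floating-point operations.
hpl_gflops() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.3f\n", ((2/3)*n*n*n + 1.5*n*n) / t / 1e9 }'
}

ib=$(hpl_gflops 20000 736.74)     # InfiniBand run
eth=$(hpl_gflops 20000 952.82)    # 100M Ethernet run
awk -v a="$ib" -v b="$eth" 'BEGIN { printf "InfiniBand %.3f vs 100M %.3f Gflops, ratio %.2f\n", a, b, a / b }'
```

With the times reported above this gives 7.240 and 5.598 Gflops, so the InfiniBand run is about 1.29 times faster on this four-process problem.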


