Posted on 2009/08/20 17:09

Enhanced Intel SpeedStep® Technology (EIST) enables modern computers to adjust CPU frequency and power consumption according to CPU usage. With the addition of Nehalem Turbo Boost it became quite cumbersome to use all these features in the benchmarking data center Intel runs in DuPont, Washington.

As the focus is on benchmarking, users (both Intel engineers and external customers) want to run the systems in various configurations. Some want best performance, others do not want to enable Turbo Boost, and a few even want to clock down the CPU to simulate the behavior of slower CPUs or to evaluate how programs scale with clock speed.

Supporting these requirements by hand became impossible for the small administrator staff, and an automatic solution had to be found. This paper presents a way to switch CPU frequency on a PER JOB basis in our PBS Pro managed HPC cluster.

Technical Background – EIST:

The following is an excerpt from a far more complete overview, “Enhanced Intel SpeedStep® Technology and Demand-Based Switching on Linux” by Venkatesh Pallipadi.

The following figure depicts the 2.6.8 kernel cpufreq infrastructure at a high level:


The cpufreq module of the Linux kernel provides a framework to support frequency and voltage changes. It depends on hardware-specific drivers (like acpi and speedstep-centrino) and provides a hardware-independent interface to the so-called governors. These governors can either reside in the kernel or be completely user controlled via the /proc or /sys file systems; the user-controlled /sys interface is the one used in our approach.

For every CPU found in the system, including logical CPUs implemented via Hyper-Threading, the Linux kernel creates a subdirectory /sys/devices/system/cpu/cpu?/cpufreq.

[root]# ls /sys/devices/system/cpu
cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7 sched_mc_power_savings
[root]# ls /sys/devices/system/cpu/cpu0/cpufreq
affected_cpus cpuinfo_max_freq scaling_available_frequencies scaling_cur_freq scaling_governor scaling_min_freq
cpuinfo_cur_freq cpuinfo_min_freq scaling_available_governors scaling_driver scaling_max_freq scaling_setspeed

These files can be read and written using standard Unix methods. In a shell the system administrator can use cat {filename} and echo {value} > {filename} to make all necessary changes.
Readable files include:

  • scaling_driver: low-level CPU-specific driver currently in use
  • scaling_available_frequencies: list of all the frequencies supported on this processor (all frequency values are in kHz)
  • scaling_available_governors: lists all the governors that can be used in this system
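
As a small illustration of the read-only files listed above, the following sketch prints them for every CPU that has a cpufreq directory. The function name show_cpufreq_info and its optional directory argument are assumptions made for this example (the argument only exists so the function can be tried on a copy of the tree); they are not part of the original scripts.

```shell
# Print driver, available governors and frequency table for each CPU.
# $1 (optional, hypothetical): root of the cpu tree, default /sys/devices/system/cpu
show_cpufreq_info() {
    cpu_root=${1:-/sys/devices/system/cpu}
    for dir in "$cpu_root"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || continue              # no cpufreq support for this CPU
        cpu=${dir%/cpufreq}                    # strip trailing /cpufreq
        cpu=${cpu##*/}                         # keep only "cpuN"
        for name in scaling_driver scaling_available_governors \
                    scaling_available_frequencies; do
            if [ -r "$dir/$name" ]; then
                printf '%s %s: %s\n' "$cpu" "$name" "$(cat "$dir/$name")"
            fi
        done
    done
    return 0
}
```

Called without an argument, show_cpufreq_info reads the live /sys tree; no root privileges are needed for reading these files.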

The following files describe and control the frequency policy. Reading gives the current settings; where writing is allowed, only specific values are accepted. Allowed values can be found by reading the files in the previous paragraph.

  • scaling_governor: current policy governor being used
  • scaling_cur_freq: provides an interface to get the current frequency (read-only)
  • scaling_max_freq: limits the maximum frequency that can be set by the governor
  • scaling_min_freq: limits the minimum frequency that can be set by the governor
  • scaling_setspeed: usable only if the governor is set to userspace; writing a value from scaling_available_frequencies then changes the CPU frequency accordingly.
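
The interplay of these files can be sketched as follows: check the requested value against scaling_available_frequencies, select the userspace governor, then write scaling_setspeed. This is only a minimal sketch under stated assumptions. The function name set_fixed_freq and the optional directory argument are hypothetical, and on a real system the writes require root.

```shell
# Pin every CPU to one advertised frequency (value in kHz).
# $1: frequency in kHz
# $2 (optional, hypothetical): root of the cpu tree, default /sys/devices/system/cpu
set_fixed_freq() {
    freq=$1
    cpu_root=${2:-/sys/devices/system/cpu}
    case $freq in
        ''|*[!0-9]*) echo "not a frequency: '$freq'" >&2; return 1 ;;
    esac
    for dir in "$cpu_root"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || continue
        # Refuse values the driver does not advertise.
        case " $(cat "$dir/scaling_available_frequencies" 2>/dev/null) " in
            *" $freq "*) ;;
            *) echo "unsupported frequency: $freq" >&2; return 1 ;;
        esac
        # scaling_setspeed is only honoured under the userspace governor.
        echo userspace > "$dir/scaling_governor" || return 1
        echo "$freq" > "$dir/scaling_setspeed" || return 1
    done
}
```

For example, set_fixed_freq 2261000 would pin all cores to 2261000 kHz, provided that value appears in scaling_available_frequencies.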

Red Hat Enterprise Linux 5 employs this interface (userspace governor and scaling_setspeed file) to control demand-based frequency switching with the user-level daemon cpuspeed.

The PBS Pro batch scheduling system:

To allow parallel usage of our clusters we employ Altair’s PBS Pro batch management solution. In our configuration PBS Pro ensures that every node is usable by exactly ONE job and user at any given time. While a user’s job has control over the system, the user can use remote commands like ssh to access it. All other processes with a user ID greater than 1000 are automatically detected by PBS and killed.

Once PBS schedules a job to a number of nodes, on the first node the script prologue is executed (default location /var/spool/PBS/mom_priv) with an effective UID of 0 (aka run as root). In our environment this is a shell script used for a couple of reasons:

  • Checking consistency of all nodes reserved for a job.
  • Ensuring all file systems report properly
  • Ensuring no processes from previous jobs have been left behind
  • Setting on ALL nodes associated to a job special configurations as requested by a user. One of these items is CPU frequency.
  • Printing a report on important characteristics of the nodes in a job. This includes the kernel and OFED versions in use, the versions of the motherboard BIOS and IB-card firmware, and so on

After the job is done an epilogue script is run (in effect we use the same script that is executed under different names). Again the nodes are checked, and any special configurations returned to their default states.

Additional tools used

It is important to note that these two scripts are only executed on the first node associated with a job (the head node). To process various commands in parallel on all nodes within a job, we use the program pdsh. A typical command might look like

[root]# pdsh -w en[001,003-004] -u 3 pwd

This executes the command pwd in PARALLEL on the nodes en001, en003 and en004. Often the output is then passed through dshbak, which combines identical output into a format that is easier to read.

Methodology

EIST must be enabled in the BIOS and supported by the Linux kernel. We use Red Hat 5.3 within CRT-DC, which supports all required features and can use Linux capabilities to switch between various states. Our methodology consists of:

  1. ensure all necessary drivers are loaded and all files in /sys have been created
  2. a user submits a job and requests via a PBS resource that all nodes on this job are set to a specified CPU frequency
  3. during PBS pre-execution the prologue script parses the user request and takes appropriate measures to set everything accordingly. If a node is found wanting, it is taken offline, the prologue script exits with an error, and the job is automatically requeued by PBS.
  4. job executes
  5. after the job has finished, the epilogue script ensures all nodes are returned to the standard configuration. If a node is found wanting, it is taken offline.

Preparation of nodes

One has to ensure that all necessary drivers are loaded and all files in /sys have been created. We found the easiest way to do this under Red Hat 5 Linux is to execute

/etc/init.d/cpuspeed start; sleep 1; /etc/init.d/cpuspeed stop 

on our compute nodes. After cpuspeed is stopped the system will remain in the highest frequency available. Checking

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies

on our NHM systems gives

2794000 2793000 2660000 2527000 2394000 2261000 2128000 1995000 1862000 1596000

with 2794000 indicating "Turbo Mode" (notice it's only a single step above the next lower frequency). As our default behavior is "Turbo Off", we next force the system to switch to 2793000 kHz.

speed=2793000
for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_setspeed
do
    echo "$speed" > "$file"
done

At this point the system will run fixed at the design speed. One can check via

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq

or

grep MHz /proc/cpuinfo

We run these steps during our validation process. Each time before a system is handed over to users, an elaborate script runs to ensure consistency of all compute nodes across the cluster.
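
The validation script itself is not shown in this paper. As a rough sketch of what a per-node frequency check could look like, the function below compares scaling_cur_freq on every core with an expected value and returns non-zero if any core differs. The name check_freq and the optional directory argument are assumptions made for this example.

```shell
# Return 0 only if every CPU reports the expected frequency (kHz).
# $1: expected frequency in kHz
# $2 (optional, hypothetical): root of the cpu tree, default /sys/devices/system/cpu
check_freq() {
    expected=$1
    cpu_root=${2:-/sys/devices/system/cpu}
    checked=0
    status=0
    for file in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$file" ] || continue
        checked=$((checked + 1))
        current=$(cat "$file")
        if [ "$current" != "$expected" ]; then
            echo "$file: $current kHz (expected $expected)" >&2
            status=1
        fi
    done
    # A node without any readable cpufreq files is treated as a failure.
    [ "$checked" -gt 0 ] || status=1
    return "$status"
}
```

A validation run might call check_freq 2793000 on each node and take the node offline when it fails.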

Job Submission by User

To allow users to change speed settings (and Turbo mode) at run time we use features of PBS, the batch processing and queuing system managing our clusters. Users indicate at the time of job submission via resource EIST what speed they want to use during their run, and if Turbo should be enabled or not.

Turbo can be activated in two ways: either by writing the highest possible frequency (FREQ=2794000 in the example above) to every scaling_setspeed file

for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_setspeed; do echo "$FREQ" > "$file"; done

or by starting the cpuspeed daemon. We recommend the latter option, as it also ensures that CPU frequencies on cores not needed are kept low. This optimizes both power consumption and turbo behavior.

Within our environment a user can change the behavior of all nodes via the resource EIST. Valid options are:

EIST=0    CPU speed fixed at 2.793 GHz, no cpuspeed daemon (default)
EIST=1    cpuspeed daemon started; Turbo active
EIST=X    CPU speed set to the value closest to X kHz (X>1)

Some examples:

qsub … -l EIST=1 ./test.sh          switches the cpuspeed daemon on (Turbo active)
qsub … -l EIST=2261000 ./test.sh    sets the CPU speed to 2261000 kHz
qsub … -l EIST=1596000 ./test.sh    sets the CPU speed to 1596000 kHz
qsub … -l EIST=0 ./test.sh          switches the cpuspeed daemon off (disables Turbo mode, default)
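
The option table promises the "value closest to X", while the prologue excerpt shown later writes the requested number as is. One possible way to pick the closest advertised frequency is sketched below. The helper closest_freq is hypothetical and not taken from the original prologue.

```shell
# Print the available frequency closest to a requested one (all values in kHz).
# $1: requested frequency; remaining arguments: the available frequencies
closest_freq() {
    target=$1
    shift
    [ $# -gt 0 ] || return 1                 # nothing to choose from
    printf '%s\n' "$@" | awk -v t="$target" '
        {
            d = $1 - t
            if (d < 0) d = -d                # absolute distance to the request
            if (NR == 1 || d < best) { best = d; pick = $1 }
        }
        END { print pick }'
}
```

With the frequency list from our NHM systems, closest_freq 2300000 followed by the list (for example via $(cat .../scaling_available_frequencies)) selects 2261000.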

The PBS prologue script

Under PBS, before a job runs, the script /var/spool/PBS/mom_priv/prologue is executed with root privileges on the headnode (the first node used in a job). In this script we evaluate the resources requested and set the frequency accordingly.

Unfortunately the author did not find a more direct way to query PBS (version 8) for resources. So we use "qstat -f", transform the output with a fairly complicated sed statement, and "eval" the resulting string. In our environment a request "-l EIST=1596000" will therefore create a shell variable Resource_List_EIST with the value 1596000.

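# Turn the "qstat -f" output into a list of export name="value"; statements.
# The second sed joins all lines and then splits them again at the four-space
# indent that precedes each attribute (assumed here; the spacing was lost in the listing).
RETURN=$(qstat -f "$JOBID" | sed -e 's/\t//g' -e 's/Job Id:/Job_Id =/' | \
    sed -e ':a' -e '$!N; s/\n//; ta' -e 's/    /\n/g' | \
    sed -e 's/ = /="/' -e 's/$/";/g' -e 's/resources_used\./resources_used_/' -e 's/Resource_List\./Resource_List_/' \
        -e 's/^/export /')
eval "$RETURN"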
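
If only a single resource is needed, a narrower extraction avoids running eval on the complete job description. The following is a sketch, not the method used in our prologue. The function name qstat_attr is hypothetical; it reads "qstat -f"-style text from standard input and does not handle values that wrap onto continuation lines.

```shell
# Print the value of one attribute from "qstat -f"-style text on stdin.
# $1: attribute name exactly as printed, e.g. Resource_List.EIST
qstat_attr() {
    awk -v key="$1" '
        $1 == key && $2 == "=" {
            sub(/^[ \t]*[^ \t]+ = /, "")     # drop indent, name and " = "
            print
            exit
        }'
}
```

In the prologue this could be used as Resource_List_EIST=$(qstat -f "$JOBID" | qstat_attr Resource_List.EIST), which leaves the variable empty when the user did not request the resource.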

In the configuration part of the script we set EIST to the default frequency. We also use this variable to switch this option on a clusterwide level.

EIST=2793000

The code evaluating the user-set resource and setting the frequency is shown below. Keep in mind that the script is only executed on the headnode. We use "pdsh" to distribute the settings to all nodes used in this job.

# check if this feature is currently enabled on the cluster
if [ "${EIST}" -gt 0 ]
then
    # if the user set "-l EIST=0" we are going to use
    # the default frequency
    if [ "${Resource_List_EIST}" = 0 ]
    then
        Resource_List_EIST=${EIST}
    fi

    # during prologue, take over the value requested by the user (if any)
    if [ -n "$prologue" ] && [ -n "${Resource_List_EIST}" ]
    then
        EIST=${Resource_List_EIST}
    fi

    # if EIST is set to one, we only start cpuspeed
    if [ "${EIST}" = 1 ]
    then
        pdsh -w "${NODES}" -x "${HEADNODE}" -u 5 \
            /etc/init.d/cpuspeed start | dshbak -c
    else
        # ensure cpuspeed is stopped, and then the requested speed
        # is set on all nodes
        pdsh -w "${NODES}" -x "${HEADNODE}" -u 5 \
            /etc/init.d/cpuspeed stop | dshbak -c
        for I in $(seq 0 "${MAXCORES}")
        do
            FILE=/sys/devices/system/cpu/cpu${I}/cpufreq/scaling_setspeed
            pdsh -w "${NODES}" -x "${HEADNODE}" -u 5 \
                "[ -f ${FILE} ] && echo ${EIST} > ${FILE};exit 0"
        done
    fi
fi
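
Because the requested value comes from the user and ends up inside a command line that pdsh runs as root, it seems prudent to accept digits only before using it. The check below is a suggestion rather than part of the prologue shown above, and the function name valid_eist is hypothetical.

```shell
# Succeed only if the argument is a non-empty string of decimal digits.
valid_eist() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
        *)           return 0 ;;
    esac
}
```

In the prologue one might guard the assignment with something like valid_eist "${Resource_List_EIST}" || Resource_List_EIST=0, so that a malformed request falls back to the default frequency.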


At the end of the script we inform the user about the current settings. This will show up in the standard output file of each job before any other user output.

echo "speedstep setting: EIST=${EIST}"
pdsh -w "${NODES}" -x "${HEADNODE}" -u 5 \
    '/etc/init.d/cpuspeed status;exit 0' | dshbak -c
pdsh -w "${NODES}" -x "${HEADNODE}" -u 5 \
    "cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq" \
    | dshbak -c

Beyond the fence – SuSE SEL11

SuSE systems are a little bit different. As with Red Hat, some items depend on the configuration. Using a default desktop configuration the author found that power management was directed by the “Gnome Power Management” utility. SEL11 also comes with the powersaved package, which contains a userspace daemon to control the CPU frequency. Please take a look at your configuration and the documentation provided by Novell.

Using SEL11 in runlevel 3 (no X windows; typical for HPC server farms) the author found that the kernel-provided ondemand governor was regulating CPU frequencies, and without load the CPU would run at the lowest frequency:

# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
ondemand
...
ondemand
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
1596000
...
1596000

Nevertheless, a small benchmark program revealed EIST was working, and the CPU would go into Turbo mode as soon as load was applied:

> ./bin/blackscholes 1 100000000
The integral of BS(T) over [0,1] with 100000000 steps (1 threads) is 0.770042642388
Time Elapsed: 9.07 sec

Not surprisingly the same possible frequency range as found under Redhat was seen again:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
2794000 2793000 2660000 2527000 2394000 2261000 2128000 1995000 1862000 1596000

Remember – the highest frequency denotes Turbo mode, the second highest value gives the rated CPU frequency.
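
This rule of thumb can be put into a few lines of shell. The sketch below treats the top entry as Turbo only when it sits exactly 1000 kHz above the second entry, which is a heuristic derived from the list shown above and not a documented guarantee. The function name rated_and_turbo is hypothetical.

```shell
# Split a frequency list (kHz, highest first) into rated and Turbo entries.
# $1: contents of scaling_available_frequencies as one string
rated_and_turbo() {
    set -- $1                        # split the list into positional parameters
    [ $# -ge 1 ] || return 1         # empty list
    if [ $# -ge 2 ] && [ $(( $1 - $2 )) -eq 1000 ]; then
        echo "rated=$2 turbo=$1"
    else
        echo "rated=$1 turbo=none"
    fi
}
```

It would be called as rated_and_turbo "$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies)".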

To customize frequencies by hand one first has to switch the governor to userspace (at that point the CPU frequency remains unchanged from its current state):

# for I in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; \
do echo userspace > $I; done
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
userspace
...
userspace
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
1596000
...
1596000

The Blackscholes benchmark now takes more than twice as long as before:

> ./bin/blackscholes 1 100000000
The integral of BS(T) over [0,1] with 100000000 steps (1 threads) is 0.770042642388
Time Elapsed: 18.20 sec

Again one can easily set the CPU to its highest standard value WITHOUT turbo:

# for I in /sys/devices/system/cpu/cpu*/cpufreq/scaling_setspeed; \
do echo 2793000 > $I; done
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
2793000
...
2793000

The Blackscholes benchmark regains almost its original speed:

> ./bin/blackscholes 1 100000000
The integral of BS(T) over [0,1] with 100000000 steps (1 threads) is 0.770042642388
Time Elapsed: 10.36 sec

And lastly one can enable Turbo mode (highest available frequency):

# for I in /sys/devices/system/cpu/cpu*/cpufreq/scaling_setspeed; \
do echo 2794000 > $I; done
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
2794000
...
2794000

The Blackscholes test now runs as fast as before; the small difference of 0.01 s should not be taken seriously:

> ./bin/blackscholes 1 100000000
The integral of BS(T) over [0,1] with 100000000 steps (1 threads) is 0.770042642388
Time Elapsed: 9.06 sec
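
The four timings can be put in relation to the clock settings with a little arithmetic. The helper ratio below is only an illustration. The numbers suggest that this benchmark scales almost linearly with clock speed between 1596000 and 2793000 kHz, and that Turbo yields roughly 14% on top, far more than the nominal 1000 kHz step in the frequency list would imply.

```shell
# Print the quotient of two numbers with two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 2793000 1596000   # clock ratio, rated vs. lowest frequency: 1.75
ratio 18.20 10.36       # run-time ratio for the same two settings: 1.76
ratio 10.36 9.06        # run-time ratio, rated frequency vs. Turbo: 1.14
```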

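For this specific SuSE installation the only change necessary in the PBS prologue script would be to exchange the line

/etc/init.d/cpuspeed start

with

for I in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; \
do echo ondemand > $I; done

and the line

/etc/init.d/cpuspeed stop

with

for I in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; \
do echo userspace > $I; done

The ondemand governor takes over the demand-based switching (including Turbo) that the cpuspeed daemon provides under Red Hat, while the userspace governor is what allows a fixed frequency to be written to scaling_setspeed.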

Summary

This paper explained in detail a method for enabling Enhanced Intel SpeedStep® Technology and Nehalem Turbo Mode within the constraints of a multi-user HPC cluster. The author hopes this whitepaper helps the reader use Intel technology to best advantage. He can be reached via e-mail at Michael.hebenstreit@intel.com.
