There are many benchmarking tools, each focusing on a different aspect of a cluster: computing capability, energy efficiency, data and storage, memory, and so on.

In this article I’m going to use Sombrero, a High Performance Computing benchmark that you can run against anything from a home PC to a cluster of supercomputers.

Sombrero is particularly useful for testing a mix of computation power, RAM speed and the cluster interconnect network.

You can use it on a local cluster, a cloud cluster, or Oracle RAC (Real Application Clusters), as it’s very flexible and easy to use.

You can find more about how to install Sombrero at the following link:

**https://github.com/sa2c/sombrero**

However, there is one mistake in the installation instructions, which is why I’m providing here the correct procedure that works on Ubuntu (RedHat clones use slightly different commands, which you can find on the Web).

Let’s start with the installation procedure:

```
sudo apt install git make gcc bc gnuplot openssh-server openmpi-bin openmpi-common
#this step is missing from the official repo
sudo apt install mpich
cd
git clone https://github.com/sa2c/sombrero.git
cd sombrero
make
```

Once the installation completes, everything is ready for testing.

For the first run I suggest using just one CPU core and setting the test size to small.

`username@hostname:~/sombrero>./sombrero.sh -n 1 -s small`

As a result you’ll get the total amount of work done and the rate in operations per second, like the following:

[RESULT] Case 1 2356.71 Gflops in 172.899673 seconds

[RESULT] Case 1 13.63 Gflops/seconds
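The second line is simply the first one divided out: total Gflop divided by elapsed seconds. Here is a minimal sketch of that arithmetic, where `gflops_rate` is a hypothetical helper name used only for illustration and is not part of Sombrero:

```shell
# gflops_rate TOTAL_GFLOP SECONDS
# Prints the rate in Gflop/s, rounded to two decimals like the [RESULT] line.
# (Hypothetical helper for illustration; not part of Sombrero.)
gflops_rate() {
    awk -v work="$1" -v secs="$2" 'BEGIN { printf("%.2f\n", work / secs) }'
}

# Values taken from the sample output above
gflops_rate 2356.71 172.899673   # prints 13.63
```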

You can experiment on your PC by running the test with more cores, as in the following example:

`username@hostname:~/sombrero>./sombrero.sh -n 2 -s small`

**One important note!**

You need to know the number of cores in your server/cluster.

Below is one of many ways to find the real number of cores:

```
username@hostname:~>grep -c ^processor /proc/cpuinfo
8
```

The result I’ve got is the number of logical processors, so it needs to be divided by 2 to get the number of cores, because hyper-threading exposes 2 threads per core on x86-64 Intel processors (on some RISC processors there can be 4 or even more threads per core). This only applies when hyper-threading is enabled.

In this case, the correct number of cores is four.
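If you would rather not rely on the divide-by-two rule, one possible approach is to count the unique (physical id, core id) pairs in `/proc/cpuinfo`. The sketch below assumes those two fields are present, which is usually true on x86 Linux but not on every VM or architecture; `count_physical_cores` is a hypothetical name.

```shell
# count_physical_cores: reads /proc/cpuinfo-style text on stdin and prints
# the number of distinct (physical id, core id) pairs.
# Assumption: the input contains "physical id" and "core id" lines;
# if it does not, the function prints 0.
count_physical_cores() {
    awk -F': *' '
        /^physical id/ { pkg = $2 }               # remember the current socket
        /^core id/     { seen[pkg "," $2] = 1 }   # one entry per socket/core pair
        END { n = 0; for (k in seen) n++; print n }
    '
}

# Usage on a Linux machine:
#   count_physical_cores < /proc/cpuinfo
```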

You can also produce formatted output by using the awk utility:

```
./sombrero.sh -n 2 -s small | awk 'BEGIN {printf("hostname");} /^\[RESULT\] Case.*Gflops.seconds/ {printf("\t" $4);} END {printf("\n");}' >> test_results.dat
```
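Note that the one-liner above writes the literal word `hostname` as the label of every row. If you collect several runs in one file, a distinct label per row makes the graph easier to read. A small sketch of the same awk filter with a label parameter (`results_row` is a hypothetical helper name):

```shell
# results_row LABEL
# Reads Sombrero output on stdin and prints one tab-separated row:
# LABEL followed by the Gflops/seconds value of every [RESULT] rate line.
# (Hypothetical helper; same pattern as the one-liner above.)
results_row() {
    awk -v label="$1" '
        BEGIN { printf("%s", label) }
        /^\[RESULT\] Case.*Gflops.seconds/ { printf("\t%s", $4) }
        END { printf("\n") }
    '
}

# Usage, one row per core count:
#   ./sombrero.sh -n 1 -s small | results_row "1 core"  >> test_results.dat
#   ./sombrero.sh -n 2 -s small | results_row "2 cores" >> test_results.dat
```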

The next step is to produce a graph to check the results visually:

```
username@hostname:~/Desktop>gnuplot
G N U P L O T
Version 5.0 patchlevel 6 last modified 2017-03-18
Copyright (C) 1986-1993, 1998, 2004, 2007-2017
Thomas Williams, Colin Kelley and many others
gnuplot home: http://www.gnuplot.info
faq, bugs, etc: type "help FAQ"
immediate help: type "help" (plot window: hit 'h')
Terminal type set to 'qt'
gnuplot> set style data histogram
gnuplot> set style histogram cluster gap 1
gnuplot> set style fill solid border rgb "black"
gnuplot> set auto x
gnuplot> set yrange [0:*]
gnuplot> set datafile separator "\t"
gnuplot> set ylabel "GFLOPs/s"
gnuplot> plot for [i=1:6] 'test_results.dat' using (column(i+1)):xtic(1) title 'TEST '.i
```
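The same commands can also be run without the interactive prompt. The sketch below only prints a gnuplot script, using the settings from the session above plus two extra lines that write a PNG file. It assumes your gnuplot build includes the `png` terminal; the function name and file names are placeholders.

```shell
# plot_script DATAFILE OUTPUT_PNG
# Prints a gnuplot script equivalent to the interactive session above.
# Assumption: the gnuplot build supports "set terminal png".
plot_script() {
    cat <<EOF
set terminal png size 800,600
set output '$2'
set style data histogram
set style histogram cluster gap 1
set style fill solid border rgb "black"
set auto x
set yrange [0:*]
set datafile separator "\t"
set ylabel "GFLOPs/s"
plot for [i=1:6] '$1' using (column(i+1)):xtic(1) title 'TEST '.i
EOF
}

# Usage:
#   plot_script test_results.dat results.png | gnuplot
```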

And here is the result of my test in graphical format:

You can continue to play with various gcc options.

If you want to tweak your tests, you can change the default configuration in MkFlags.

First you need to execute the following commands:

```
username@hostname:~>cd ~/sombrero/Make/
username@hostname:~/sombrero/Make>vi MkFlags
```

The default content of MkFlags is:

```
#Compiler
CC = mpicc
CFLAGS = -std=c99 -g -O3
LDFLAGS =
```

You can change the file to something like this:

```
#Compiler
CC = mpicc
CFLAGS = -std=c99 -march=native -O3
LDFLAGS =
```

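Removing the -g option (debug symbols) makes the binary smaller, while -march=native allows gcc to use modern processor features. With this change I got up to about 12% better performance on an individual test, although most tests changed by only a few percent: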

**Default tests 10.06 10.03 10.99 10.89 10.53 10.12 12.15 11.44 13.74 13.64**

**Optimized tests 11.32 10.88 11.39 10.80 10.19 10.13 11.57 11.52 13.95 13.71**
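To see how the gain is distributed, the two rows can be compared value by value. This is a minimal sketch; `compare_runs` is a hypothetical helper, and the "mean" it prints is the change in the sum of all values, which is only one of several reasonable ways to summarise the runs.

```shell
# compare_runs "BASELINE VALUES" "OPTIMIZED VALUES"
# Prints the percentage change for each pair of values, then the change
# of the overall sum. (Hypothetical helper for illustration.)
compare_runs() {
    awk -v base="$1" -v opt="$2" 'BEGIN {
        n = split(base, b, " "); m = split(opt, o, " ")
        if (n != m || n == 0) { print "mismatched input"; exit 1 }
        for (i = 1; i <= n; i++) {
            printf("test %d: %.1f%%\n", i, (o[i] / b[i] - 1) * 100)
            sb += b[i]; so += o[i]
        }
        printf("mean: %.1f%%\n", (so / sb - 1) * 100)
    }'
}

compare_runs "10.06 10.03 10.99 10.89 10.53 10.12 12.15 11.44 13.74 13.64" \
             "11.32 10.88 11.39 10.80 10.19 10.13 11.57 11.52 13.95 13.71"
```

With the numbers above, the first test improves by 12.5%, while the overall sum improves by 1.6%.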

You can continue to play with various gcc parameters (you can even change the compiler to clang), but that would be far beyond the scope of this post.

The last interesting part is to create the same test on a cluster of machines.

For this demo you first need to create 2 VMs (Virtual Machines) that share the same network.

In the article “Easy way to create Virtualbox VM’s internal network” you can find one easy way to do it.

For more details, please check the following link:

**https://www.performatune.com/en/easy-way-to-create-virtualbox-vms-internal-network/**

The next step is to enable passwordless login from each node/machine in the cluster to the others.

First you need to create a private/public key pair (on every node in the cluster)

`ssh-keygen`

and then, from each node, copy the generated public key to all other nodes in the cluster, as in the following example:

`ssh-copy-id <username>@node1_IP_address`

For example, for a two-node cluster where the host names and IP addresses are:

- node1 172.25.1.1
- node2 172.25.1.2

from node1 you can execute:

```
ssh-keygen
ssh-copy-id <username>@node2
```

Then from node2 almost the same (the only difference is @node1 instead of @node2):

```
ssh-keygen
ssh-copy-id <username>@node1
```

Instead of the hostname you can also use the IP address.
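With more than two nodes, typing these commands by hand gets tedious. The sketch below only prints the `ssh-copy-id` commands a given node would need, so you can review them before running anything. The function name, the user name `alice` and the node names are placeholders.

```shell
# key_copy_commands USER SELF NODE...
# Prints one ssh-copy-id command for every node in the list except SELF.
# (Hypothetical helper; it prints the commands and does not run them.)
key_copy_commands() {
    user=$1
    self=$2
    shift 2
    for node in "$@"; do
        if [ "$node" != "$self" ]; then
            printf 'ssh-copy-id %s@%s\n' "$user" "$node"
        fi
    done
}

# Usage on node1 of a three-node cluster:
#   key_copy_commands alice node1 node1 node2 node3
```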

You also need to install all the packages (MPI and others) on every other node, as I did on the first node (at the beginning of the article), and then repeat the installation of Sombrero there.

After that you need to create hostfile.txt with the same content **on each node in the cluster.**

Here is what you need to do:

```
username@hostname:~/sombrero>cd ~/sombrero/
username@hostname:~/sombrero> touch hostfile.txt
```

Then edit the file (the first line below is the command, the remaining lines are the content of my hostfile.txt):

```
vi hostfile.txt
172.25.1.1
172.25.1.2
```

As you can observe, I’m using IP addresses instead of hostnames in this example.

The last question is how to control how many CPUs will be used on each node in the cluster.

The answer is very simple: if you want to use one CPU on node1 (IP address 172.25.1.1) and two CPUs on node2 (172.25.1.2), the content of hostfile.txt will look like this:

```
172.25.1.1
172.25.1.2
172.25.1.2
```

By repeating the IP address of node2, I’m leveraging 2 CPU cores on that node.
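For larger clusters the repetition can be generated instead of typed. This is a minimal sketch that follows the one-line-per-core format shown above; `make_hostfile` and its `host:count` argument syntax are assumptions made for this example (it suits hostnames and IPv4 addresses, not IPv6).

```shell
# make_hostfile HOST:COUNT...
# Prints each HOST on COUNT separate lines, matching the hostfile format
# used above. (Hypothetical helper; the HOST:COUNT syntax is an assumption.)
make_hostfile() {
    for spec in "$@"; do
        host=${spec%%:*}     # text before the first colon
        count=${spec##*:}    # text after the last colon
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '%s\n' "$host"
            i=$((i + 1))
        done
    done
}

# Usage: one core on node1, two cores on node2
#   make_hostfile 172.25.1.1:1 172.25.1.2:2 > hostfile.txt
```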

Lastly I need to execute the following command to start the cluster test:

```
jp@jp-Ubuntu1804LTS:~/sombrero>./sombrero.sh -H hostfile.txt -s small
[GEOMETRY] Global size is 32x24x24x24
[GEOMETRY] Proc grid is 1x1x1x1
[GEOMETRY] Global size is 32x24x24x24
[GEOMETRY] Proc grid is 1x1x1x1
[GEOMETRY] Local size is 32x24x24x24
[GEOMETRY] Local size is 32x24x24x24
[GEOMETRY] Global size is 32x24x24x24
[GEOMETRY] Global size is 32x24x24x24
[GEOMETRY] Proc grid is 1x1x1x1
[GEOMETRY] Proc grid is 1x1x1x1
[GEOMETRY] Local size is 32x24x24x24
[GEOMETRY] Local size is 32x24x24x24
[GEOMETRY] Gauge field: size 442368 nbuffer 0
[GEOMETRY] Spinor field (EO): size 442368 nbuffer 0
[GEOMETRY] Gauge field: size 442368 nbuffer 0
[GEOMETRY] Spinor field (EO): size 442368 nbuffer 0
[GEOMETRY] Gauge field: size 442368 nbuffer 0
[GEOMETRY] Spinor field (EO): size 442368 nbuffer 0
[GEOMETRY] Gauge field: size 442368 nbuffer 0
[GEOMETRY] Spinor field (EO): size 442368 nbuffer 0
[MAIN] Performing 50 conjugate gradient iterations
[MAIN] Case 1: 147.29e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 1: inf operations per byte
[MAIN] Performing 50 conjugate gradient iterations
[MAIN] Case 1: 147.29e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 1: inf operations per byte
[MAIN] Performing 50 conjugate gradient iterations
[MAIN] Case 1: 147.29e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 1: inf operations per byte
[MAIN] Performing 50 conjugate gradient iterations
[MAIN] Case 1: 147.29e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 1: inf operations per byte
[RESULT] Case 1 147.29 Gflops in 21.236862 seconds
[RESULT] Case 1 6.94 Gflops/seconds
[RESULT] Case 1 147.29 Gflops in 21.353416 seconds
[RESULT] Case 1 6.90 Gflops/seconds
[RESULT] Case 1 147.29 Gflops in 21.709023 seconds
[RESULT] Case 1 6.78 Gflops/seconds
[RESULT] Case 1 147.29 Gflops in 21.991657 seconds
[RESULT] Case 1 6.70 Gflops/seconds
[MAIN] Case 2: 224.23e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 2: inf operations per byte
[MAIN] Case 2: 224.23e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 2: inf operations per byte
[MAIN] Case 2: 224.23e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 2: inf operations per byte
[MAIN] Case 2: 224.23e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 2: inf operations per byte
[RESULT] Case 2 224.23 Gflops in 29.793526 seconds
[RESULT] Case 2 7.53 Gflops/seconds
[RESULT] Case 2 224.23 Gflops in 30.681017 seconds
[RESULT] Case 2 7.31 Gflops/seconds
[RESULT] Case 2 224.23 Gflops in 31.857018 seconds
[RESULT] Case 2 7.04 Gflops/seconds
[RESULT] Case 2 224.23 Gflops in 32.133274 seconds
[RESULT] Case 2 6.98 Gflops/seconds
[MAIN] Case 3: 303.73e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 3: inf operations per byte
[MAIN] Case 3: 303.73e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 3: inf operations per byte
[MAIN] Case 3: 303.73e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 3: inf operations per byte
[MAIN] Case 3: 303.73e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 3: inf operations per byte
[RESULT] Case 3 303.73 Gflops in 37.804207 seconds
[RESULT] Case 3 8.03 Gflops/seconds
[RESULT] Case 3 303.73 Gflops in 38.542305 seconds
[RESULT] Case 3 7.88 Gflops/seconds
[RESULT] Case 3 303.73 Gflops in 42.568699 seconds
[RESULT] Case 3 7.13 Gflops/seconds
[RESULT] Case 3 303.73 Gflops in 42.795631 seconds
[RESULT] Case 3 7.10 Gflops/seconds
[MAIN] Case 4: 515.52e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 4: inf operations per byte
[MAIN] Case 4: 515.52e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 4: inf operations per byte
[MAIN] Case 4: 515.52e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 4: inf operations per byte
[MAIN] Case 4: 515.52e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 4: inf operations per byte
[RESULT] Case 4 515.52 Gflops in 54.308731 seconds
[RESULT] Case 4 9.49 Gflops/seconds
[RESULT] Case 4 515.52 Gflops in 54.376144 seconds
[RESULT] Case 4 9.48 Gflops/seconds
[RESULT] Case 4 515.52 Gflops in 58.882633 seconds
[RESULT] Case 4 8.76 Gflops/seconds
[RESULT] Case 4 515.52 Gflops in 60.108974 seconds
[RESULT] Case 4 8.58 Gflops/seconds
[MAIN] Case 5: 1105.36e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 5: inf operations per byte
[MAIN] Case 5: 1105.36e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 5: inf operations per byte
[MAIN] Case 5: 1105.36e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 5: inf operations per byte
[MAIN] Case 5: 1105.36e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 5: inf operations per byte
[RESULT] Case 5 1105.36 Gflops in 109.629005 seconds
[RESULT] Case 5 10.08 Gflops/seconds
[RESULT] Case 5 1105.36 Gflops in 110.747162 seconds
[RESULT] Case 5 9.98 Gflops/seconds
[RESULT] Case 5 1105.36 Gflops in 122.147095 seconds
[RESULT] Case 5 9.05 Gflops/seconds
[RESULT] Case 5 1105.36 Gflops in 126.025604 seconds
[RESULT] Case 5 8.77 Gflops/seconds
[MAIN] Case 6: 2067.93e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 6: inf operations per byte
[MAIN] Case 6: 2067.93e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 6: inf operations per byte
[MAIN] Case 6: 2067.93e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 6: inf operations per byte
[MAIN] Case 6: 2067.93e9 floating point operations and 0.00e6 bytes communicated
[MAIN] Case 6: inf operations per byte
```
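The log above contains four rate lines per case, so averaging them makes the numbers easier to compare. A small sketch, assuming the `[RESULT]` line format shown above (`mean_rate_per_case` is a hypothetical name):

```shell
# mean_rate_per_case: reads Sombrero output on stdin and prints the average
# Gflops/seconds value for each case, one tab-separated line per case.
# Assumption: rate lines look like "[RESULT] Case N X Gflops/seconds".
mean_rate_per_case() {
    awk '
        /^\[RESULT\] Case [0-9]+ [0-9.]+ Gflops\/seconds/ {
            sum[$3] += $4
            cnt[$3]++
            if ($3 + 0 > max) max = $3 + 0
        }
        END {
            for (c = 1; c <= max; c++)
                if (cnt[c]) printf("Case %d\t%.2f\n", c, sum[c] / cnt[c])
        }
    '
}

# Usage:
#   ./sombrero.sh -H hostfile.txt -s small | mean_rate_per_case
```

For Case 1 in the log above (6.94, 6.90, 6.78 and 6.70) the average is 6.83 Gflops/seconds.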

While the cluster test was running, the load on node 1 looked like this:

The next picture shows the load on node 2:

**Summary:**

By using various benchmark tests you can not only check the current capabilities and limitations of a server or cluster, but also perform stress tests, which are very useful when testing stability.

You can, for example, overload one node in a cluster and check whether that node will be evicted.

I’ve found Sombrero, along with many similar tools, very helpful in my daily work, which is why I highly recommend it.
