
Switching

The cluster's proposed switch topology is as shown below:

Nemo Cluster Switching Topology

XFig image source file

This network design is oversubscribed at one or two locations, depending on the choice of core switch on the right. While it would be possible to design a network with no oversubscription, doing so would increase costs by approximately $120k, and we believe it would not improve the performance of the cluster on typical analysis codes.

This design differs from the original proposal, which was fully meshed and not oversubscribed. However, based on observation of current analysis codes, we believe it makes more sense to use an oversubscribed design and to spend the savings on additional nodes.

The network is oversubscribed at the edges by a factor of two; that is, if every node behind an edge switch drives its link toward the core at full rate simultaneously, each node sees at most half of its gigabit bandwidth. The edge switches are SMC 8505T (5-port) or SMC 8508T (8-port) unmanaged Layer 2 switches. They are based on a Broadcom chipset and cost approximately $62 and $89 each, respectively. These switches are non-blocking at all packet sizes and can handle jumbo frames up to 9 kB in size.

There are a handful of options for the core network switch. These are:

  • Cisco 6509E chassis. This has 9 slots. One must be occupied by a Supervisor 720 card. Five slots are populated with 48-port line cards from the 67XX series. Three slots are free. Neither the chassis nor the line cards run at full line rate, but they come close to line rate if we include distributed forwarding cards.
    The cost of this core switch is $117k including redundant power supplies and three years of service.

  • Cisco 6509E chassis, exactly as above but WITHOUT the distributed forwarding cards.
    The cost of this core switch is $94k.

  • Force10 E600 chassis. This has 7 slots. Five slots are populated with 48-port line cards. Two slots are free. Both the chassis and line cards are non-blocking at all packet sizes.
    The cost of this core switch is $126k.

  • Force10 E1200 chassis. This has 14 slots. Five slots are populated with 48-port line cards. Nine slots are free. Both the chassis and line cards are non-blocking at all packet sizes.
    The cost of this core switch is $141k.

We have tested the edge switches and on-board Broadcom NICs both for overall performance and for jumbo frame support. There are two choices of Linux driver for this NIC. NEED BCM RESULTS.
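
As a quick way to confirm which of the two drivers a given node is actually using, something like the following works; this is only a sketch, and the interface name eth0 is illustrative rather than taken from the cluster's configuration.

    # Report the driver (and its version/firmware) bound to the interface.
    ethtool -i eth0

    # List loaded kernel modules and look for the Broadcom drivers.
    lsmod | grep -i -e tg3 -e bcm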



The following netperf tests were performed using netperf-2.4.0, available at netperf's home page. For each test, I ran netserver on one machine (with the command netserver -v 4 -4 -d, where -v 4 turns up the verbosity, -4 restricts the test to IPv4, and -d returns extra debug information), and then ran the following netperf commands from the remote machine. For the tests that measure bidirectional speeds, I ran a netserver and a netperf on each machine.
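
To make the setup concrete, the bidirectional runs were driven roughly as follows. This is a sketch assembled from the commands quoted above; the address of the second machine (192.168.168.2) and the backgrounding/ordering of the two netperf invocations are assumptions, not taken from the test logs.

    # On each machine, start the netperf daemon with verbose/debug output.
    netserver -v 4 -4 -d

    # On machine A (assumed here to be 192.168.168.2), drive traffic toward machine B...
    netperf -c -C -f K -l 60 -H 192.168.168.3 &

    # ...while machine B simultaneously drives traffic toward machine A
    # (run on machine B; A's address is an assumption).
    netperf -c -C -f K -l 60 -H 192.168.168.2 &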

  • tg3 driver, crossover cable - the tg3 driver supports frames larger than 9000 bytes (the exact size varied by payload; I saw 9010-9014 byte frames). I tested this by sending frames larger than 9000 bytes, sniffing the wire, and looking for the point at which fragmentation set in; for example, I issued ping -s 9000 192.168.168.3. (Note that the Ethernet header adds 14 bytes to IP datagrams.)

    netperf -c -C -f K -l 60 -H 192.168.168.3
    ("-c" show local CPU, "-C" show remote CPU, "-f K" report in KB/s, "-l 60" = 60 sec. test)
    1. with 9000 MTU, yields 120111.19 KB/s 7.24% local CPU 9.00% remote CPU
    2. with 1500 MTU, yields 114905.40 KB/s 18.50% local CPU 40.83% remote CPU
    3. with 9000 MTU, running the test in each direction simultaneously, yields 94587.40/94568.85 KB/s, 13.08%/14.37% local CPU, 14.37%/13.10% remote CPU

  • tg3 driver, SMC 8505T - the switch only processes frames of <=8996 bytes (tested by sending ICMP echo requests of increasing size until I hit a ceiling; the largest packet that got through was sent with "ping -s 8954 192.168.168.3"). Note that because of the switch's 8996-byte limit, I had to set the MTU to 8982 via ifconfig for my netperf tests (see the sketch after this list).

    netperf -c -C -f K -l 60 -H 192.168.168.3
    ("-c" show local CPU, "-C" show remote CPU, "-f K" report in KB/s, "-l 60" = 60 sec. test)
    1. with 8982 MTU, yields 96132.05 KB/s 5.97% local CPU 6.42% remote CPU
    2. with 1500 MTU, yields 114902.26 KB/s 18.37% local CPU 39.56% remote CPU
    3. with 8982 MTU, running the test in each direction simultaneously, yields 49415.38/47241.83 KB/s, 5.79%/6.72% local CPU, 6.73%/5.78% remote CPU. This test was repeated MANY times, with the total throughput falling between 95-98 MB/s (not always evenly distributed).
    4. with 1500 MTU, running the test in each direction simultaneously, yields 111594.47/110839.93 KB/s, 69.32%/70.52% local CPU, 70.78%/69.55% remote CPU. This test was repeated several times, with the total throughput falling between 222.3-222.8 MB/s.

  • tg3 driver, SMC 8508T - the switch only processes frames of <=8996 bytes (tested by sending ICMP echo requests of increasing size until I hit a ceiling; the largest packet that got through was sent with "ping -s 8954 192.168.168.3"). Note that because of the switch's 8996-byte limit, I had to set the MTU to 8982 via ifconfig for my netperf tests.

    netperf -c -C -f K -l 60 -H 192.168.168.3
    ("-c" show local CPU, "-C" show remote CPU, "-f K" report in KB/s, "-l 60" = 60 sec. test)
    1. with 8982 MTU, yields 96013.90 KB/s 6.11% local CPU 6.39% remote CPU
    2. with 1500 MTU, yields 114899.13 KB/s 18.45% local CPU 40.26% remote CPU
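
The jumbo-frame ceiling probe and the MTU adjustment described above look roughly like the following sketch. The interface name eth0 is illustrative, and ping's -M do option (prohibit fragmentation) is offered as an alternative to the packet-sniffing approach actually used; it is not how the original tests were run.

    # Raise the interface MTU so the stack will emit jumbo frames.
    ifconfig eth0 mtu 8982

    # Probe the largest payload the path carries unfragmented:
    # 8954 B ICMP payload + 8 B ICMP header + 20 B IP header + 14 B Ethernet header
    # = 8996 B on the wire, the ceiling observed on the SMC switches.
    ping -M do -s 8954 192.168.168.3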




    This graph shows the total throughput of two machines, attached via crossover cable, running netperf. The bottom plot shows the average CPU % of the two machines, while the top plot shows total throughput:

    TCP Benchmark Results


    This graph shows the total throughput of two machines, attached via the 8505T switch, running netperf. The bottom plot shows the average CPU % of the two machines, while the top plot shows total throughput:

    TCP Benchmark Results


    This graph shows the total throughput of two machines, attached via crossover cable, running netperf (as described above). The top plot is the total throughput at the given MTU. The bottom plot shows the CPU % of the Sunfire x2100, using the Nvidia driver. The middle plot is the CPU % of the machine controlling the test. I find the results strange in their uniformity, so I am trying to reproduce/retest:

    TCP Benchmark Results

SMC 8505T Benchmarking
This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the 8505T switch and running netperf. The bottom plot shows the average CPU % of the two machines, while the top plot shows total throughput:
-Average CPU Usage


This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the 8505T switch and running netperf. The bottom plots show each machine's CPU usage, while the top plot shows total throughput.
-Individual control and slave CPU usage


This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the 8505T switch and running netperf. The bottom (blue) plot shows both machines' CPU usage, the top (red) plot shows total throughput, and the middle (violet) plot shows instructions per byte.
-Instructions per byte
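
For reference, the instructions-per-byte metric plotted above is just the ratio of instructions retired to bytes moved over the test interval. The page does not record the raw counter values or the tool used to collect them, so the following is only a worked definition of the metric, using one of the throughput figures quoted earlier:

    # instructions per byte = (instructions retired during the test)
    #                         / (netperf throughput in KB/s * 1024 * test length in s)
    #
    # e.g. for a 60 s run reported at 96132.05 KB/s, the denominator is
    #   96132.05 * 1024 * 60  ~=  5.9e9 bytes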



SMC 8508T Benchmarking
This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the 8508T switch and running netperf. The bottom plot shows the average CPU % of the two machines, while the top plot shows total throughput:
-Average CPU Usage



This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the 8508T switch and running netperf. The bottom plots show each machine's CPU usage, while the top plot shows total throughput.
-Individual control and slave CPU usage



This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the 8508T switch and running netperf. The bottom (blue) plot shows both machines' CPU usage, the top (red) plot shows total throughput, and the middle (violet) plot shows instructions per byte.
-Instructions per byte


D-Link DGS-108 Benchmarking
This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the DGS-108 switch and running netperf. The bottom plot shows the average CPU % of the two machines, while the top plot shows total throughput:
-Average CPU Usage



This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the DGS-108 switch and running netperf. The bottom plots show each machine's CPU usage, while the top plot shows total throughput.
-Individual control and slave CPU usage



This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the DGS-108 switch and running netperf. The bottom (blue) plot shows both machines' CPU usage, the top (red) plot shows total throughput, and the middle (violet) plot shows instructions per byte.
-Instructions per byte



SMCGS5 Benchmarking
This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the SMCGS5 switch and running netperf. The bottom plot shows the average CPU % of the two machines, while the top plot shows total throughput:
-Average CPU Usage


This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the SMCGS5 switch and running netperf. The bottom plots show each machine's CPU usage, while the top plot shows total throughput.
-Individual control and slave CPU usage


This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the SMCGS5 switch and running netperf. The bottom (blue) plot shows both machines' CPU usage, the top (red) plot shows total throughput, and the middle (violet) plot shows instructions per byte.
-Instructions per byte



SMCGS8 Benchmarking
This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the SMCGS8 switch and running netperf. The bottom plot shows the average CPU % of the two machines, while the top plot shows total throughput:
-Average CPU Usage


This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the SMCGS8 switch and running netperf. The bottom plots show each machine's CPU usage, while the top plot shows total throughput.
-Individual control and slave CPU usage


This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via the SMCGS8 switch and running netperf. The bottom (blue) plot shows both machines' CPU usage, the top (red) plot shows total throughput, and the middle (violet) plot shows instructions per byte.
-Instructions per byte




Crossover Cable Benchmarking
This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via a crossover cable and running netperf. The bottom plot shows the average CPU % of the two machines, while the top plot shows total throughput:
-Average CPU Usage


This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via a crossover cable and running netperf. The bottom plots show each machine's CPU usage, while the top plot shows total throughput.
-Individual control and slave CPU usage


This graph shows the total throughput of two machines, both Supermicro H8DAR-T motherboards with TCP settings not optimized, attached via a crossover cable and running netperf. The bottom (blue) plot shows both machines' CPU usage, the top (red) plot shows total throughput, and the middle (violet) plot shows instructions per byte.
-Instructions per byte




Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.