Act 1 – The Problem
Here be known our Complicating Incident.
Look at that fancy new server just installed in the rack with eight HDR 200Gb/s InfiniBand HCAs. WOW. That is going to make your data scientist stakeholders very happy. Linux is installed and it’s just dying to consume input.
This particular node needs to connect to NFS and a parallel filesystem (PFS), as well as a head node for workload management. For simplicity’s sake, all HCA NICs are configured on the same IPv4 subnet, 10.10.0.0/16:
- ib0 – 10.10.1.10/16
- ib1 – 10.10.1.11/16
- ib2 – 10.10.1.12/16
- ib3 – 10.10.1.13/16
- ib4 – 10.10.1.14/16
- ib5 – 10.10.1.15/16
- ib6 – 10.10.1.16/16
- ib7 – 10.10.1.17/16
It’s ready to rock! Hold your horses, not yet. The following was executed from a peer node at 10.10.1.20/16:
$ sudo ping 10.10.1.10
PING 10.10.1.10 (10.10.1.10) 56(84) bytes of data.
64 bytes from 10.10.1.10: icmp_seq=1 ttl=64 time=0.102 ms
64 bytes from 10.10.1.10: icmp_seq=2 ttl=64 time=0.112 ms
64 bytes from 10.10.1.10: icmp_seq=3 ttl=64 time=0.113 ms
^C
--- 10.10.1.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 4120ms
rtt min/avg/max/mdev = 0.102/0.111/0.118/0.010 ms
That looks very good. And now we run into trouble:
$ sudo ping 10.10.1.11
PING 10.10.1.11 (10.10.1.11) 56(84) bytes of data.
From 10.10.1.11 icmp_seq=1 Destination Host Unreachable
From 10.10.1.11 icmp_seq=2 Destination Host Unreachable
From 10.10.1.11 icmp_seq=3 Destination Host Unreachable
^C
--- 10.10.1.11 ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4101ms
pipe 4
Uh oh. ib_read_bw also fails:
$ ib_read_bw -a -R -F 10.10.1.11
Unexpected CM event bl blka 8
Unable to perform rdma_client function
Unable to init the socket connection
That’s not too surprising. An ICMP failure is a fairly good indication that RDMA CM will also fail.
What is going on with this hot mess? “ip a” indicates that all of the IPoIB devices are up, connected and configured with an IP address. So, you check the network configs. They look OK. Then you probably don’t trust that and go through a process of checking link lights on the switch and changing out known-good cables for suspect ones. That doesn’t work. Then you complain about this issue on Slack while you’re pulling your hair out and your colleague hips you to “policy-based routing”.
Act 2 – Policy-Based Routing
What is policy-based routing? It is what will solve the problem at hand. The trouble you are encountering is that the Linux default IP routing policy doesn’t understand how to route IP packets across multiple interfaces configured on the same IPv4 subnet: all traffic for 10.10.0.0/16 matches the first such route in the main table, so replies sourced from the other addresses try to leave through the wrong interface.
It seems like Linux should understand this, but it doesn’t.
For reference, here is a document that describes this in detail: https://www.usenix.org/system/files/login/articles/login_summer16_10_anderson.pdf
TL;DR: the node must be configured with per-NIC routing tables, routes and rules that tell the IP stack which NIC to send packets through, keyed on the packet’s source address.
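Before wiring up persistent config, it can help to see the moving parts as plain iproute2 commands. This is a sketch using this article’s example addresses and numeric table IDs (not the author’s exact procedure); by default it only prints what it would run, and APPLY=1 executes for real (as root, on the node):

```shell
# Sketch: the non-persistent iproute2 equivalent of the persistent
# configuration described below. Interfaces/addresses are the article's
# example values. Prints commands by default; APPLY=1 executes them.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$@"; fi; }

run ip route add 10.10.0.0/16 dev ib0 scope link src 10.10.1.10 table 100
run ip rule add from 10.10.1.10 table 100
run ip route add 10.10.0.0/16 dev ib1 scope link src 10.10.1.11 table 101
run ip rule add from 10.10.1.11 table 101
```

Runtime changes like these vanish on reboot, which is exactly why the rest of this article deals in config files.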
Tables, Rules and Routes
For simplicity’s sake, this example will configure two NICs with policy-based routing. For all Linux:
/etc/iproute2/rt_tables:

# append these definitions to the existing config
100 t1
101 t2
This configuration adds human-readable names for routing tables indexed by a number. Granted, the names I used are fairly obtuse; your configuration may use whatever names seem fit. The names and numbers are arbitrary but must be unique within rt_tables.
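Since rt_tables is a system file, it can be safer to stage the edit against a scratch copy and install it afterwards. A sketch (the scratch-copy workflow is my suggestion, not part of the original instructions):

```shell
# Sketch: append the table definitions idempotently to a scratch copy of
# rt_tables, review it, then install it with root privileges.
scratch=$(mktemp)
cp /etc/iproute2/rt_tables "$scratch" 2>/dev/null || : > "$scratch"
for def in '100 t1' '101 t2'; do
  # only append a definition if that exact line is not already present
  grep -qxF "$def" "$scratch" || printf '%s\n' "$def" >> "$scratch"
done
cat "$scratch"
# then, on the node: sudo cp "$scratch" /etc/iproute2/rt_tables
```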
Next, configure the routes. These instructions are for RHEL-style systems using ifcfg files under network-scripts; see the Addendum for the Netplan configuration. For each IPoIB interface (aka NIC), create a route file named for the NIC, e.g. /etc/sysconfig/network-scripts/route-ib0 and route-ib1:
10.10.0.0/16 dev ib0 scope link src 10.10.1.10 table t1
10.10.0.0/16 dev ib1 scope link src 10.10.1.11 table t2
This gives ib0 and ib1 each a link-scoped route to the 10.10.0.0/16 subnet in its own table, with the correct source address.
Note that the route files are incompatible with NetworkManager, so ifcfg-ib0 etc. need an “NM_CONTROLLED=no” directive to make this work.
We are almost there. An additional rule file must be added per NIC, e.g. /etc/sysconfig/network-scripts/rule-ib0 and rule-ib1:
table t1 from 10.10.1.10
table t2 from 10.10.1.11
These rules direct any packet sourced from a NIC’s IP address into that NIC’s routing table. The per-NIC tables, rules and routes are now configured. The last remaining step is to configure ARP.
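On a node with eight HCAs, writing these files by hand gets tedious. A small generator along these lines can stamp them out (my sketch, using this article’s example names; extend the list for ib2..ib7, then review the output before copying into /etc/sysconfig/network-scripts/):

```shell
# Sketch: generate per-NIC route-<nic> and rule-<nic> files into a
# scratch directory for review before installing them.
OUTDIR="${OUTDIR:-$(mktemp -d)}"
i=1
for spec in 'ib0 10.10.1.10' 'ib1 10.10.1.11'; do
  set -- $spec
  nic=$1; addr=$2
  printf '10.10.0.0/16 dev %s scope link src %s table t%d\n' \
    "$nic" "$addr" "$i" > "$OUTDIR/route-$nic"
  printf 'table t%d from %s\n' "$i" "$addr" > "$OUTDIR/rule-$nic"
  i=$((i + 1))
done
ls "$OUTDIR"
```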
ARP behavior must be adjusted so that each interface announces and answers only for its own address, forcing responses out the originating interface. Take the following template and write it to /etc/sysctl.d/50-multihoming-arp.conf:
#ib0 ARP config
net.ipv4.conf.ib0.rp_filter = 1
net.ipv4.conf.ib0.arp_filter = 1
net.ipv4.conf.ib0.arp_announce = 2
net.ipv4.conf.ib0.arp_ignore = 2

#ib1 ARP config
net.ipv4.conf.ib1.rp_filter = 1
net.ipv4.conf.ib1.arp_filter = 1
net.ipv4.conf.ib1.arp_announce = 2
net.ipv4.conf.ib1.arp_ignore = 2
Add directives for any additional interfaces.
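The sysctl block is also easy to generate. A sketch that emits the four directives per interface; redirect the output into /etc/sysctl.d/50-multihoming-arp.conf and load it with `sysctl --system` (the helper name `arp_conf` is mine):

```shell
# Sketch: print the ARP sysctl stanza for each named interface.
arp_conf() {
  for nic in "$@"; do
    printf '#%s ARP config\n' "$nic"
    for kv in 'rp_filter = 1' 'arp_filter = 1' \
              'arp_announce = 2' 'arp_ignore = 2'; do
      printf 'net.ipv4.conf.%s.%s\n' "$nic" "$kv"
    done
  done
}

arp_conf ib0 ib1 ib2 ib3 ib4 ib5 ib6 ib7
```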
Act 3 – To Test, or Not to Test
If you have followed these instructions with due diligence, a reboot will bring the node up with every configured NIC reachable via both IP and RDMA CM.
Hopefully the node in question has a KVM interface in case it is remote and the IP config got totally whacked. Maybe that should have been stated earlier?
First-blush diagnosis is performed via “ip route show table main”. There should be an entry for each IPoIB device that was configured with route and rule files.
$ ip route show table main
default via 10.10.100.1 dev eno1
10.12.0.0/16 dev eno1 proto kernel scope link src 10.12.101.10
10.10.0.0/16 dev ib0 proto kernel scope link src 10.10.1.10
10.10.0.0/16 dev ib1 proto kernel scope link src 10.10.1.11
169.254.0.0/16 dev eno1 scope link metric 1002
169.254.0.0/16 dev ib0 scope link metric 1004
169.254.0.0/16 dev ib1 scope link metric 1005
And individual tables:
$ ip route show table t1
10.10.0.0/16 dev ib0 scope link src 10.10.1.10

$ ip route show table t2
10.10.0.0/16 dev ib1 scope link src 10.10.1.11
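The source-based rules can be spot-checked as well with `ip rule show`. A sketch, run on the configured node (the grep pattern uses this article’s example addresses):

```shell
# List the policy rules and flag the expected per-NIC entries; on a
# correctly configured node each IPoIB address appears as
# "from <addr> lookup <table>".
ip rule show | grep -E 'from 10\.10\.1\.(10|11) lookup' || echo "rules missing"
```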
If that looks reasonable, then test with ping from a remote node:
$ sudo ping 10.10.1.11
PING 10.10.1.11 (10.10.1.11) 56(84) bytes of data.
64 bytes from 10.10.1.11: icmp_seq=1 ttl=64 time=0.149 ms
64 bytes from 10.10.1.11: icmp_seq=2 ttl=64 time=0.147 ms
64 bytes from 10.10.1.11: icmp_seq=3 ttl=64 time=0.098 ms
^C
--- 10.10.1.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2028ms
rtt min/avg/max/mdev = 0.098/0.131/0.149/0.025 ms
Test each of the node’s interfaces to verify connectivity.
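Pinging eight interfaces by hand is error-prone, so a small sweep loop from the peer node helps (my sketch using the article’s example addresses; the `PING` override exists so the loop can be rehearsed off-cluster):

```shell
# Sketch: sweep the example IPoIB addresses and report which respond.
# PING is overridable for rehearsal; defaults to a 1-packet, 1s-timeout ping.
PING="${PING:-ping -c 1 -W 1}"
sweep() {
  for ip in "$@"; do
    if $PING "$ip" > /dev/null 2>&1; then
      echo "$ip OK"
    else
      echo "$ip FAILED"
    fi
  done
}

sweep 10.10.1.10 10.10.1.11 10.10.1.12 10.10.1.13 \
      10.10.1.14 10.10.1.15 10.10.1.16 10.10.1.17
```

Any FAILED line points at the interface whose route, rule, or ARP config deserves a second look.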
Now test RDMA CM. Start a server on the node:
$ ib_read_bw -a -R -F
And launch the client on the remote peer:
$ ib_read_bw -a -R -F 10.10.1.11
ib_read_bw should report maximum bandwidth around 10 GB/s on an EDR fabric. HDR performance will depend upon whether your nodes have the HCAs in PCIe 3 x16 or PCIe 4 x16 slots (PCIe 3 x16 is limited to ~126Gbps). Any connection failure at this point suggests the policy-based routing is misconfigured. Or a firewall may be in the way; disable it only if your cluster network is otherwise secured, or open the ports that RDMA CM and the perftest tools use.
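A quick sanity check on that PCIe ceiling: PCIe 3.0 signals at 8 GT/s per lane with 128b/130b encoding, so sixteen lanes carry well under HDR’s 200Gb/s link rate:

```shell
# PCIe 3.0: 8 GT/s per lane x 16 lanes x 128/130 encoding efficiency
awk 'BEGIN { printf "%.1f Gb/s\n", 8 * 16 * 128 / 130 }'
```

which matches the ~126Gbps figure above; hence an HDR HCA only reaches full rate in a PCIe 4 x16 slot.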
Addendum – Netplan
A Netplan configuration that approximates the previous configuration steps would resemble:
network:
  version: 2
  ethernets:
    eno1:
      dhcp4: true
    ib0:
      dhcp4: false
      addresses: [10.10.1.10/16]
      routes:
        - to: 10.10.0.0/16
          table: 100
          scope: link
      routing-policy:
        - from: 10.10.1.10
          table: 100
          priority: 100
      optional: true
    ib1:
      dhcp4: false
      addresses: [10.10.1.11/16]
      routes:
        - to: 10.10.0.0/16
          table: 101
          scope: link
      routing-policy:
        - from: 10.10.1.11
          table: 101
          priority: 101
      optional: true
Policy-based routing is key to driving traffic to multiple Linux NICs that cooperate in a single IPv4 subnet, be they Ethernet or InfiniBand IPoIB. The Linux routing tables must be configured to understand how to communicate over multiple NICs on the same subnet. This configuration is a detailed process and must be tested to ensure that connectivity exists between all cluster NICs.
Hope this helps. Avail me of your HPC travels at firstname.lastname@example.org.