Act 1 – The Problem
Here be known our Complicating Incident.
Look at that fancy new server just installed in the rack with eight HDR 200Gb/s InfiniBand HCAs. WOW. That is going to make your data scientist stakeholders very happy. Linux is installed and it’s just dying to consume input.
This particular node needs to connect to NFS and a parallel filesystem (PFS), as well as a head node for workload management. For simplicity’s sake, all HCA NICs are configured on the same IPv4 subnet, 10.10.0.0/16:
- ib0 – 10.10.1.10/16
- ib1 – 10.10.1.11/16
- ib2 – 10.10.1.12/16
- ib3 – 10.10.1.13/16
- ib4 – 10.10.1.14/16
- ib5 – 10.10.1.15/16
- ib6 – 10.10.1.16/16
- ib7 – 10.10.1.17/16
It’s ready to rock! Hold your horses, not yet. The following was executed from a peer node at 10.10.1.20/16:
$ sudo ping 10.10.1.10
PING 10.10.1.10 (10.10.1.10) 56(84) bytes of data.
64 bytes from 10.10.1.10: icmp_seq=1 ttl=64 time=0.102 ms
64 bytes from 10.10.1.10: icmp_seq=2 ttl=64 time=0.112 ms
64 bytes from 10.10.1.10: icmp_seq=3 ttl=64 time=0.113 ms
^C
--- 10.10.1.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 4120ms
rtt min/avg/max/mdev = 0.102/0.111/0.118/0.010 ms
That looks very good. And now we run into trouble:
$ sudo ping 10.10.1.11
PING 10.10.1.11 (10.10.1.11) 56(84) bytes of data.
From 10.10.1.11 icmp_seq=1 Destination Host Unreachable
From 10.10.1.11 icmp_seq=2 Destination Host Unreachable
From 10.10.1.11 icmp_seq=3 Destination Host Unreachable
^C
--- 10.10.1.11 ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4101ms
pipe 4
Uh oh. ib_read_bw also fails:
$ ib_read_bw -a -R -F 10.10.1.11
Unexpected CM event bl blka 8
Unable to perform rdma_client function
Unable to init the socket connection
That’s not too surprising. An ICMP failure is a fairly good indication that RDMA CM will also fail.
What is going on with this hot mess? “ip a” indicates that all of the IPoIB devices are up, connected and configured with an IP address. So, you check the network configs. They look OK. Then you probably don’t trust that and go through a process of checking link lights on the switch and changing out known-good cables for suspect ones. That doesn’t work. Then you complain about this issue on Slack while you’re pulling your hair out and your colleague hips you to “policy-based routing”.
Act 2 – Policy-Based Routing
What is policy-based routing? It is what will solve the problem at hand. The trouble you are encountering is that the Linux default IP routing policy doesn’t understand how to route IP packets across multiple interfaces configured on the same IPv4 subnet: all traffic for 10.10.0.0/16 matches the first such route in the main table, so replies sourced from the other addresses try to leave through the wrong interface.
It seems like Linux should understand this, but it doesn’t.
For reference, here is a document that describes this in detail: https://www.usenix.org/system/files/login/articles/login_summer16_10_anderson.pdf
TL;DR: the node must be configured with per-NIC routing tables, routes and rules that tell the IP stack which NIC to send packets through, keyed on the packet’s source address.
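Before wiring up persistent config, it can help to see the moving parts as plain iproute2 commands. This is a sketch using this article’s example addresses and numeric table IDs (not the author’s exact procedure); by default it only prints what it would run, and APPLY=1 executes for real (as root, on the node):

```shell
# Sketch: the non-persistent iproute2 equivalent of the persistent
# configuration described below. Interfaces/addresses are the article's
# example values. Prints commands by default; APPLY=1 executes them.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$@"; fi; }

run ip route add 10.10.0.0/16 dev ib0 scope link src 10.10.1.10 table 100
run ip rule add from 10.10.1.10 table 100
run ip route add 10.10.0.0/16 dev ib1 scope link src 10.10.1.11 table 101
run ip rule add from 10.10.1.11 table 101
```

Runtime changes like these vanish on reboot, which is exactly why the rest of this article deals in config files.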
Tables, Rules and Routes
For simplicity’s sake, this example will configure two NICs with policy-based routing. For all Linux:
/etc/iproute2/rt_tables:

# append these definitions to the existing config
100 t1
101 t2
This configuration adds human-readable names for routing tables indexed by a number. Granted, the names I used are fairly obtuse; your configuration may use whatever names seem fit. The names and numbers are arbitrary but must be unique within rt_tables.
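Since rt_tables is a system file, it can be safer to stage the edit against a scratch copy and install it afterwards. A sketch (the scratch-copy workflow is my suggestion, not part of the original instructions):

```shell
# Sketch: append the table definitions idempotently to a scratch copy of
# rt_tables, review it, then install it with root privileges.
scratch=$(mktemp)
cp /etc/iproute2/rt_tables "$scratch" 2>/dev/null || : > "$scratch"
for def in '100 t1' '101 t2'; do
  # only append a definition if that exact line is not already present
  grep -qxF "$def" "$scratch" || printf '%s\n' "$def" >> "$scratch"
done
cat "$scratch"
# then, on the node: sudo cp "$scratch" /etc/iproute2/rt_tables
```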
Next, configure the routes. These instructions are for RHEL-style systems using ifcfg files under network-scripts; see the Addendum for the Netplan configuration. For each IPoIB interface (aka NIC), create a route file named for the NIC, e.g. /etc/sysconfig/network-scripts/route-ib0 and route-ib1:
10.10.0.0/16 dev ib0 scope link src 10.10.1.10 table t1
10.10.0.0/16 dev ib1 scope link src 10.10.1.11 table t2
This gives ib0 and ib1 each a link-scoped route to the 10.10.0.0/16 subnet in its own table, with the correct source address.
Note that the route files are incompatible with NetworkManager, so ifcfg-ib0 etc. need an “NM_CONTROLLED=no” directive to make this work.
We are almost there. An additional rule file must be added per NIC, e.g. /etc/sysconfig/network-scripts/rule-ib0 and rule-ib1:
table t1 from 10.10.1.10
table t2 from 10.10.1.11
These rules direct any packet sourced from a NIC’s IP address into that NIC’s routing table. The per-NIC tables, rules and routes are now configured. The last remaining step is to configure ARP.
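On a node with eight HCAs, writing these files by hand gets tedious. A small generator along these lines can stamp them out (my sketch, using this article’s example names; extend the list for ib2..ib7, then review the output before copying into /etc/sysconfig/network-scripts/):

```shell
# Sketch: generate per-NIC route-<nic> and rule-<nic> files into a
# scratch directory for review before installing them.
OUTDIR="${OUTDIR:-$(mktemp -d)}"
i=1
for spec in 'ib0 10.10.1.10' 'ib1 10.10.1.11'; do
  set -- $spec
  nic=$1; addr=$2
  printf '10.10.0.0/16 dev %s scope link src %s table t%d\n' \
    "$nic" "$addr" "$i" > "$OUTDIR/route-$nic"
  printf 'table t%d from %s\n' "$i" "$addr" > "$OUTDIR/rule-$nic"
  i=$((i + 1))
done
ls "$OUTDIR"
```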
ARP behavior must be adjusted so that each interface announces and answers only for its own address, forcing responses out the originating interface. Take the following template and write it to /etc/sysctl.d/50-multihoming-arp.conf:
#ib0 ARP config
net.ipv4.conf.ib0.rp_filter = 1
net.ipv4.conf.ib0.arp_filter = 1
net.ipv4.conf.ib0.arp_announce = 2
net.ipv4.conf.ib0.arp_ignore = 2

#ib1 ARP config
net.ipv4.conf.ib1.rp_filter = 1
net.ipv4.conf.ib1.arp_filter = 1
net.ipv4.conf.ib1.arp_announce = 2
net.ipv4.conf.ib1.arp_ignore = 2
Add directives for any additional interfaces.
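The sysctl block is also easy to generate. A sketch that emits the four directives per interface; redirect the output into /etc/sysctl.d/50-multihoming-arp.conf and load it with `sysctl --system` (the helper name `arp_conf` is mine):

```shell
# Sketch: print the ARP sysctl stanza for each named interface.
arp_conf() {
  for nic in "$@"; do
    printf '#%s ARP config\n' "$nic"
    for kv in 'rp_filter = 1' 'arp_filter = 1' \
              'arp_announce = 2' 'arp_ignore = 2'; do
      printf 'net.ipv4.conf.%s.%s\n' "$nic" "$kv"
    done
  done
}

arp_conf ib0 ib1 ib2 ib3 ib4 ib5 ib6 ib7
```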
Act 3 – To Test, or Not to Test
If you have followed these instructions with due diligence, a reboot will bring the node up with every configured NIC reachable via both IP and RDMA CM.
Hopefully the node in question has a KVM interface in case it is remote and the IP config got totally whacked. Maybe that should have been stated earlier?
First-blush diagnosis is performed via “ip route show table main”. There should be an entry for each IPoIB device that was configured with route and rule files.
$ ip route show table main
default via 10.10.100.1 dev eno1
10.12.0.0/16 dev eno1 proto kernel scope link src 10.12.101.10
10.10.0.0/16 dev ib0 proto kernel scope link src 10.10.1.10
10.10.0.0/16 dev ib1 proto kernel scope link src 10.10.1.11
169.254.0.0/16 dev eno1 scope link metric 1002
169.254.0.0/16 dev ib0 scope link metric 1004
169.254.0.0/16 dev ib1 scope link metric 1005
And individual tables:
$ ip route show table t1
10.10.0.0/16 dev ib0 scope link src 10.10.1.10

$ ip route show table t2
10.10.0.0/16 dev ib1 scope link src 10.10.1.11
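The source-based rules can be spot-checked as well with `ip rule show`. A sketch, run on the configured node (the grep pattern uses this article’s example addresses):

```shell
# List the policy rules and flag the expected per-NIC entries; on a
# correctly configured node each IPoIB address appears as
# "from <addr> lookup <table>".
ip rule show | grep -E 'from 10\.10\.1\.(10|11) lookup' || echo "rules missing"
```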
If that looks reasonable, then test with ping from a remote node:
$ sudo ping 10.10.1.11
PING 10.10.1.11 (10.10.1.11) 56(84) bytes of data.
64 bytes from 10.10.1.11: icmp_seq=1 ttl=64 time=0.149 ms
64 bytes from 10.10.1.11: icmp_seq=2 ttl=64 time=0.147 ms
64 bytes from 10.10.1.11: icmp_seq=3 ttl=64 time=0.098 ms
^C
--- 10.10.1.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2028ms
rtt min/avg/max/mdev = 0.098/0.131/0.149/0.025 ms
Test each of the node’s interfaces to verify connectivity.
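Pinging eight interfaces by hand is error-prone, so a small sweep loop from the peer node helps (my sketch using the article’s example addresses; the `PING` override exists so the loop can be rehearsed off-cluster):

```shell
# Sketch: sweep the example IPoIB addresses and report which respond.
# PING is overridable for rehearsal; defaults to a 1-packet, 1s-timeout ping.
PING="${PING:-ping -c 1 -W 1}"
sweep() {
  for ip in "$@"; do
    if $PING "$ip" > /dev/null 2>&1; then
      echo "$ip OK"
    else
      echo "$ip FAILED"
    fi
  done
}

sweep 10.10.1.10 10.10.1.11 10.10.1.12 10.10.1.13 \
      10.10.1.14 10.10.1.15 10.10.1.16 10.10.1.17
```

Any FAILED line points at the interface whose route, rule, or ARP config deserves a second look.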
Now test RDMA CM. Start a server on the node:
$ ib_read_bw -a -R -F
And launch the client on the remote peer:
$ ib_read_bw -a -R -F 10.10.1.11
ib_read_bw should report maximum bandwidth around 10 GB/s on an EDR fabric. HDR performance will depend upon whether your nodes have the HCAs in PCIe 3 x16 or PCIe 4 x16 slots (PCIe 3 x16 is limited to ~126Gbps). Any connection failure at this point suggests the policy-based routing is misconfigured. Or a firewall may be in the way; disable it only if your cluster network is otherwise secured, or open the ports that RDMA CM and the perftest tools use.
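A quick sanity check on that PCIe ceiling: PCIe 3.0 signals at 8 GT/s per lane with 128b/130b encoding, so sixteen lanes carry well under HDR’s 200Gb/s link rate:

```shell
# PCIe 3.0: 8 GT/s per lane x 16 lanes x 128/130 encoding efficiency
awk 'BEGIN { printf "%.1f Gb/s\n", 8 * 16 * 128 / 130 }'
```

which matches the ~126Gbps figure above; hence an HDR HCA only reaches full rate in a PCIe 4 x16 slot.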
Addendum – Netplan
A Netplan configuration that approximates the previous configuration steps would resemble:
network:
  version: 2
  ethernets:
    eno1:
      dhcp4: true
    ib0:
      dhcp4: false
      addresses: [10.10.1.10/16]
      routes:
        - to: 10.10.0.0/16
          table: 100
          scope: link
      routing-policy:
        - from: 10.10.1.10
          table: 100
          priority: 100
      optional: true
    ib1:
      dhcp4: false
      addresses: [10.10.1.11/16]
      routes:
        - to: 10.10.0.0/16
          table: 101
          scope: link
      routing-policy:
        - from: 10.10.1.11
          table: 101
          priority: 101
      optional: true
Policy-based routing is key to driving traffic to multiple Linux NICs that cooperate in a single IPv4 subnet, be they Ethernet or InfiniBand IPoIB. The Linux routing tables must be configured to understand how to communicate over multiple NICs on the same subnet. This configuration is a detailed process and must be tested to ensure that connectivity exists between all cluster NICs.
Hope this helps. Avail me of your HPC travels at firstname.lastname@example.org.