Most HPC administrators have had to cut their teeth on SCSI RDMA Protocol (SRP) at some point. And most of them have torn their hair out getting SRP to work. SRP is a fine remote block storage protocol, one of the rocks HPC was built upon, if a little dated. Troubleshooting SRP can be a nightmare. Let’s call it “obtuse”.
Allow me to describe the situation du jour. I have two SRP initiators in my lab running CentOS 8.5, MOFED 5.9, and ConnectX-6 HCAs that were working perfectly with SRP block devices provided by a NetApp E-Series target. These devices are very fast and very reliable. Due to the requirements of a recent project, I turned off those mounts and began using local storage. In the intervening time, we brought two more SRP initiators online for a separate project. Those also work extremely well.
The fly in the ointment arose today when I tried to reconnect my two original SRP initiators to the NetApp. After restarting srp_daemon and waiting a few minutes, multipath did not indicate that the SRP block devices were available. Well, that’s strange. My first assumption was that my colleague had borked my target config; after confirming that was not the case, I felt I owed him an apology for an imagined crime.
Analysis of syslog brought this problem to my attention:
[ 397.190547] scsi host11: ib_srp: REJ received
[ 397.190548] scsi host11: REJ reason 0x3
[ 397.190575] scsi host11: ib_srp: Connection 0/44 to fe80:0000:0000:0000:ec0d:9a03:0064:ef5c failed
Remember what I stated about this technology being a little obtuse? ib_srp is the kernel module that provides the SRP initiator services. fe80:blah-blah-blah is the port GID of the NetApp’s HCA (the low 64 bits are its GUID). “REJ reason 0x3” is utter Greek.
One might believe that 0x3 refers to the Linux errno ESRCH (No such process). That doesn’t make any sense for diagnosing this problem. Examination of the target logs doesn’t reveal any errors or interesting information. Digging deeper, I found SRP_LOGIN_REJ_UNABLE_ASSOCIATE_CHANNEL=0x00010003 in /usr/src/mlnx-ofa-kernel-5.9/includes/scsi/srp.h. OK, now we’re getting somewhere, but I still don’t have a root cause or a solution.
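The connection between the logged “0x3” and that srp.h constant appears to be that the kernel prints only the low 16 bits of the SRP login-reject code. A minimal sketch of that assumption, using the two values from above:

```shell
# Assumption: the logged "REJ reason" is the low 16 bits of the
# SRP login-reject constant found in srp.h.
FULL=$(( 0x00010003 ))          # SRP_LOGIN_REJ_UNABLE_ASSOCIATE_CHANNEL
REASON=$(( FULL & 0xffff ))     # mask off the high word
printf 'REJ reason 0x%x\n' "$REASON"
```

Masking 0x00010003 with 0xffff yields 0x3, which matches the syslog line, so the reject code lines up with “unable to associate channel”.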
ChatGPT was sheepishly consulted about this, and it told me to refer to an InfiniBand expert. I’m feeling a lot better about my job security versus AI.
Let’s bring in the old hatchet of Google search. After some tooling around, I found this support article: https://kb.netapp.com/onprem/E-Series/Hardware/E-Series_Infiniband_SRP_hosts_lose_access_to_storage_array
Perfect! At first I was a little skeptical that this was the solution. Exhaustion of RDMA channels caused by four SRP initiators? That doesn’t sound right. It turns out it was exactly right.
The E-Series allows 128 RDMA channels per port. On the initiator side, the number of channels consumed is controlled by ib_srp’s ch_count module parameter. “modinfo ib_srp” tells me that the default value is 0, and “cat /sys/module/ib_srp/parameters/ch_count” confirms it. What does 0 mean? The default resolves to the lesser of the number of CPU cores and the count of completion vectors for the HCA.
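For anyone following along, here is a sketch of that lookup, guarded so it degrades gracefully on a box where ib_srp isn’t loaded (the sysfs path only exists once the module is in memory):

```shell
# Read the live ch_count if ib_srp is loaded; otherwise report the
# documented default of 0 (which means min(cores, completion vectors)).
PARAM=/sys/module/ib_srp/parameters/ch_count
if [ -r "$PARAM" ]; then
  CH_COUNT=$(cat "$PARAM")
else
  CH_COUNT=0   # module not loaded here; 0 is the shipped default
fi
echo "ch_count=$CH_COUNT"
```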
Great. I know how many cores each machine has: 44. That would certainly exhaust 128 channels with more than two nodes. As for the quantity of completion vectors per HCA, that was a mystery to me. A little more experimentation with the IB tools yields:
$ ibv_devinfo -v | grep num_comp_vectors
Mystery solved! Each node consumes a default of 44 RDMA channels per port on the NetApp SRP target. How do we get four initiators onto the same SRP target? That turned out to be the easiest part of this exercise.
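The arithmetic behind the failure is worth spelling out, using the two numbers from the post (128 channels per E-Series port, 44 channels per node at the default ch_count):

```shell
# Channel-budget math at the defaults.
PORT_LIMIT=128    # RDMA channels per E-Series port
CH_PER_NODE=44    # default ch_count resolved to the 44-core count
MAX_NODES=$(( PORT_LIMIT / CH_PER_NODE ))
NEEDED=$(( CH_PER_NODE * 4 ))
echo "nodes that fit: $MAX_NODES; channels needed for 4 nodes: $NEEDED"
```

Only two nodes fit within the port’s budget; four nodes would need 176 channels, so the third and fourth logins come back with the REJ seen in syslog.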
options ib_srp ch_count=8
This configuration, dropped into a file under /etc/modprobe.d/, limits each node to 8 RDMA channels, which allows 16 initiators to use the same SRP target. Configure, reboot, and the nodes are all connecting.
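The 16-initiator figure falls straight out of the same budget arithmetic with the new setting:

```shell
# Channel-budget math with ch_count pinned to 8.
PORT_LIMIT=128   # RDMA channels per E-Series port
CH_COUNT=8       # value set via "options ib_srp ch_count=8"
MAX_INITIATORS=$(( PORT_LIMIT / CH_COUNT ))
echo "initiators that fit per port: $MAX_INITIATORS"
```

16 × 8 = 128 uses the port’s budget exactly, so the current four nodes connect with plenty of headroom for growth.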
What are the implications of this new configuration? Lowering ch_count has the potential to reduce initiator performance, so the change should be validated by benchmarking the block device with fio, elbencho, or your tool of choice. In my case, it is more important to get the initiators online than to obtain ultimate throughput. I’m sure there will be some twiddling of ch_count down the road.
Good luck with your HPC adventures. Let me know how your exploration goes at firstname.lastname@example.org.