Infiniband SRP boot
2012-12-05, 01:23
Post: #1
Infiniband SRP boot
Hi, I have 2 Mellanox adapters directly connected with a CX4 cable.

After a few trials I got InfiniBand working with SRP/RDMA between Windows and Linux (opensm). My next task is to sanboot the Windows machine over the InfiniBand connection. My iPXE build (from a recent version of the source) does not recognise the InfiniBand connection during boot. net1 (the connected port) shows the following (typed from a smartphone picture of the screen):

Using mt25210 on PCI04:00.0 (closed)
Link Down TX:0 TXE:0 RX:0 RXE:4
Status : Initialising (ipxe.org/1a136101)
RXE 4 x "Operation cancelled (ipxe.org/0b1360a0)"

The remaining two connections also show link down: one InfiniBand port with no cable, and the Ethernet port (cable connected and working normally), which was working with iPXE prior to installing the InfiniBand card.

Any suggestions on what I can do next to debug this? I have tried the suggestion of killall -HUP opensm without luck (using opensm 3.3.15 on Ubuntu 12.04). It seems to be scanning the fabric every 10 secs, reporting 0x02 -> SUBNET UP in the log.
|||
2012-12-05, 09:50
Post: #2
RE: Infiniband SRP boot
If you build with DEBUG=infiniband you might gain some more insight. I know next to nothing about Infiniband, unfortunately.
You can use the name of any .c file in the source tree to enable debugging for it. To specify several, separate them with commas, like this:

DEBUG=infiniband,iscsi,scsi:3

See the topic "Debug builds" on the download page, http://ipxe.org/download.
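As a rough sketch of what such a debug build might look like on the command line: the wrapper name build_ipxe_debug is made up for illustration, and the default target bin/ipxe.kpxe and the assumption that you run it from the src directory of an iPXE source tree are mine, not something stated in this post.

```shell
# Hypothetical helper (not part of iPXE): build an image with debugging
# enabled for a comma-separated list of objects, i.e. names of .c files,
# each optionally followed by :LEVEL.
# Assumption: invoked from the src/ directory of an iPXE source tree.
build_ipxe_debug() {
    target="${2:-bin/ipxe.kpxe}"   # assumed default target
    make "$target" DEBUG="$1"
}

# Example (uncomment to run a real build):
# build_ipxe_debug infiniband,arbel
```

Keeping the object list as a single argument means it is passed to make exactly as it would be typed after DEBUG=.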
|||
2012-12-05, 12:12
Post: #3
RE: Infiniband SRP boot
Is the link detected on the other end of the connection?
I can't find support for mt25210 cards in the iPXE code. Did you mean mt25218? If so, DEBUG=arbel should get you the debugging output from the card driver.
|||
2012-12-06, 01:36
Post: #4
RE: Infiniband SRP boot
I am sorry, you are right: it was mt25218 (my typo).

Initially, when I was getting the original message in the first post, the SM log was pretty sparse:

cat /var/log/opensm.0x0002c90200210dad.log | grep "Dec 04 07:"
0x01 -> osm_prtn_make_partitions: Partition configuration /etc/opensm/partitions.conf is not accessible (No such file or directory)
0x02 -> SUBNET UP
0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:1 GID:ff12:1405:ffff::3333:1:2
0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:1 GID:ff12:401b:ffff::fc0:988f
0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:1 GID:ff12:401b:ffff::fb
0x01 -> osm_prtn_make_partitions: Partition configuration /etc/opensm/partitions.conf is not accessible (No such file or directory)
0x02 -> SUBNET UP
0x01 -> osm_prtn_make_partitions: Partition configuration /etc/opensm/partitions.conf is not accessible (No such file or directory)
0x02 -> SUBNET UP
0x01 -> osm_prtn_make_partitions: Partition configuration /etc/opensm/partitions.conf is not accessible (No such file or directory)

I am running the subnet manager without a config, just specifying the port that has a cable. I was fed up with the partitions.conf missing spam every 10 secs, so I read the man page and created the following as /etc/opensm/partitions.conf:

Default=0x7fff,ipoib:ALL=full

I rebuilt ipxe.kpxe with DEBUG=infiniband and got many, many messages, which scrolled off the console too fast to read. How should I capture them? Will remote syslog be OK, or will I need to source a serial cable?

cat /var/log/opensm.0x0002c90200210dad.log | grep "Dec 05 23:"
0x80 -> SM port is up
0x01 -> log_send_error: ERR 5411: DR SMP Send completed with error (IB_TIMEOUT) -- dropping
0x01 -> Received SMP on a 1 hop path: Initial path = 0,1, Return path = 0,0
Dec 05 23:00:02 953051 [53662700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x2f9e7
0x01 -> sm_mad_ctrl_send_err_cb: ERR 3120 Timeout while getting attribute 0x11 (NodeInfo); Possible mis-set mkey?
0x80 -> Entering MASTER state
0x80 -> SUBNET UP
<Repeats lots of times>

I will try again tonight with DEBUG=arbel and remove /etc/opensm/partitions.conf to see if that is a red herring.
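Since the log repeats a lot, a small tally of the OpenSM error codes can make it easier to see what changes between runs. This is only a sketch: the function name opensm_err_summary is hypothetical, and it assumes error codes always appear in the log as "ERR" followed by a space and a hex code, as in the lines quoted in this post.

```shell
# Hypothetical helper: tally OpenSM "ERR nnnn" codes read from stdin.
# Prints one "CODE COUNT" pair per line, sorted by code.
# Assumption: codes appear as "ERR " followed by hex digits.
opensm_err_summary() {
    grep -o 'ERR [0-9A-Fa-f][0-9A-Fa-f]*' |
        sort |
        uniq -c |
        awk '{ print $3, $1 }'
}

# Example, using the log file name from this post:
# opensm_err_summary < /var/log/opensm.0x0002c90200210dad.log
```

Run before and after removing partitions.conf, the two summaries can be compared directly to see whether the timeout errors depend on that file.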
|||
2012-12-06, 13:06
Post: #5
RE: Infiniband SRP boot
|||
2012-12-07, 03:09
(This post was last modified: 2012-12-07 03:13 by johnp12345.)
Post: #6
RE: Infiniband SRP boot
It seems to cut off debugging pretty soon after starting the InfiniBand stuff. I will try to get a serial cable, as there is not much to go on in the logs received by remote syslog; the screen seems to have much more detailed info whizzing by.

Dec 7 09:19:53 storage-pc OpenSM[17544]: SM port is down#012
Dec 7 09:19:53 storage-pc OpenSM[17544]: Entering DISCOVERING state#012
Dec 7 09:20:33 main-pc.home.int ipxe: Hello World
Dec 7 09:20:36 main-pc.home.int ipxe: Press Ctrl-B for the iPXE command line...&& shell#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010
Dec 7 09:20:36 main-pc.home.int ipxe: #010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010#010 #010got past line
Dec 7 09:20:54 main-pc.home.int ipxe: iPXE> ne#010 #010#010 #010ifopen net1
Dec 7 09:20:54 main-pc.home.int ipxe: Arbel 0xc33b4 issuing command 0004
Dec 7 09:20:54 main-pc.home.int ipxe: Arbel 0xc33b4 firmware version 5.3.0
Dec 7 09:20:54 main-pc.home.int ipxe: Arbel 0xc33b4 requires 5136 kB for firmware
Dec 7 09:20:55 main-pc.home.int ipxe: iPXE> Arbel 0xc33b4 issuing command 0024
Dec 7 09:20:55 main-pc.home.int ipxe: Arbel 0xc33b4 issuing command 0024
Dec 7 09:21:18 main-pc.home.int ipxe: last message repeated 550 times
Dec 7 09:21:18 main-pc.home.int ipxe:
Dec 7 09:21:18 main-pc.home.int ipxe: iPXE> Arbel 0xc33b4 issuing command 0024
Dec 7 09:21:18 main-pc.home.int ipxe: Arbel 0xc33b4 issuing command 0024
Dec 7 09:21:19 main-pc.home.int ipxe: last message repeated 17 times
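Because syslog collapses repeats into "last message repeated N times", it is hard to see at a glance how often each Arbel command was issued. The sketch below tallies them; the function name arbel_cmd_counts is hypothetical, and it assumes one syslog record per line and that a "repeated" line refers to the record immediately before it.

```shell
# Hypothetical helper: count how often each Arbel command appears in a
# syslog capture read from stdin, folding in the
# "last message repeated N times" lines.
# Assumption: one syslog record per line.
arbel_cmd_counts() {
    awk '
    /issuing command [0-9a-fA-F]+/ {
        # Remember the command code so a following "repeated" line
        # can be credited to it.
        match($0, /issuing command [0-9a-fA-F]+/)
        last = substr($0, RSTART + 16, RLENGTH - 16)
        count[last]++
        next
    }
    /last message repeated [0-9]+ times/ {
        if (last != "") {
            match($0, /repeated [0-9]+ times/)
            count[last] += substr($0, RSTART + 9, RLENGTH - 9) + 0
        }
        next
    }
    # Any other record breaks the run of repeats.
    { last = "" }
    END { for (cmd in count) print cmd, count[cmd] }
    ' | sort
}

# Example (file name is an assumption):
# arbel_cmd_counts < /var/log/syslog
```

The output is one "COMMAND COUNT" pair per line, which makes it easy to see which command dominates the capture.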
|||
2012-12-07, 04:22
(This post was last modified: 2012-12-07 07:45 by johnp12345.)
Post: #7
RE: Infiniband SRP boot
Interestingly, my cards are memfree, and looking through the code (I have not looked at any form of C for 10+ years) command 0024 appears to be related to setting up the card's local memory. I will try commenting out the following lines in arbel.c:

/* Enable locally-attached memory. Ignore failure; there may
 * be no attached memory.
 */
arbel_cmd_enable_lam ( arbel, &lam );

to see if this makes a difference (my card is an InfiniBand: Mellanox Technologies MT25208 [InfiniHost III Ex] (rev a0)).

I have looked through the Linux driver for this card and there seem to be a few places that mention memory free and wrap an if statement around it. Is it possible to use that as a reference if I get further issues? No license conflicts?
|||
2013-07-03, 10:26
Post: #8
RE: Infiniband SRP boot
Hello!!
I have an MT25208 card. I load iPXE from CD-ROM. There is no indication that the card is initialized: there are no messages on the screen and the LED on the card is not lit. Can anyone tell me anything?
|||
2013-07-03, 19:09
Post: #9
RE: Infiniband SRP boot
Forgot to point out the error. Error is http://ipxe.org/err/2c2260
|||
2013-07-05, 14:22
Post: #10
RE: Infiniband SRP boot
(2012-12-07 04:22)johnp12345 Wrote: I have looked through the linux driver for this card and there seems to be a few places that mention memory free and wrap an if statement around it. Is it possible to use that as reference if I get further issues? No license conflicts?

That almost certainly isn't the problem. All development was carried out on memfree cards; that command is expected to fail (hence the comment in the source code).

You may be doomed to failure using Arbel cards for SRP. It works fine with Hermon (aka ConnectX) cards. With Arbel, I was never able to fix an unknown problem which caused an almost immediate reboot:

http://git.ipxe.org/ipxe.git/commitdiff/7a84cc5

Your current problem is earlier than that, since you're not getting as far as link-up. Have you established that your subnet manager is working: for example, can you get link-up from something that isn't running iPXE on the same subnet?

To debug link-up problems from within iPXE, you could try building with

DEBUG=infiniband,ib_mi,ib_smc,ib_sma

Michael
|||
2013-07-05, 14:24
Post: #11
RE: Infiniband SRP boot
(2013-07-03 19:09)CGen Wrote: Forgot to point out the error. Error is http://ipxe.org/err/2c2260

Please start a separate thread; this isn't the same problem.

Michael