iPXE iSCSI sanboot unreliable with 82573L at 1Gbit
2014-02-06, 14:15
Post: #1
iPXE iSCSI sanboot unreliable with 82573L at 1Gbit

First of all, iPXE is a really cool and fancy bootloader. Good work, devs.

But we have a big problem when we try to boot Windows (7, 8, 2008 (R2) or 2012 (R2)) on a mainboard equipped with an Intel 82573L NIC chipset.
Here are the symptoms.
We boot into an iSCSI target LUN and install Windows on it; everything is fine and the installation succeeds. But then, sometimes, an iSCSI boot fails. The next iSCSI boot works fine again, on the same machine, with the same target and the same iPXE loader. It is impossible to predict whether a given boot will succeed or fail.

Deeper investigation showed that the problem only occurs if the link speed of the iSCSI client interface is 1 Gbit/s. It also only occurs on mainboards with the 82573L NIC; any other Intel NIC works without problems. One workaround is to drop the link speed to 100 Mbit at boot time: as long as the speed is 100 Mbit while Windows boots, it is safe to go back up to 1 Gbit once Windows is running.
Furthermore, we could see in tcpdump captures that, when the boot fails, the iSCSI traffic suddenly stops for no apparent reason. The iSCSI client stops sending iSCSI read requests.

So, as discussed in it would be a fix for us to force the link speed to 100 Mbit at iPXE boot time. Or does anyone have another idea for killing this bug, or a better workaround? I hope it's obvious that changing the port speed at switch level every time a boot occurs can't be a fix.
I'm looking forward to hearing from anyone who wants to help me :)

If any further information is needed, I will provide it.
2014-02-11, 09:47
Post: #2
RE: iPXE iSCSI sanboot unreliable with 82573L at 1Gbit
I'd like to understand exactly _when_ the iSCSI boot fails. Is it while iPXE is still in play, or is it after the Windows iSCSI client has taken over control (via the iBFT)? If it is the latter, your Windows Intel driver might be having a problem taking over after iPXE for some reason. Are you using the latest version of the Intel NIC driver in Windows? Do you actually get any kind of error message, or are things just hanging?

If the last iSCSI traffic you see during boot is from iPXE then that is quite normal. The iPXE iSCSI client will time out when Windows boots, because the Windows iSCSI client takes over and iPXE has no way to shut down cleanly. If you look closely at the packet trace you should be able to find out whether the iSCSI packets are coming from iPXE or Windows. You could enable DEBUG=intel,iscsi,scsi,int13 to try to figure out if iPXE is behaving strangely during its run stage. You should also try to run through the tests in to ensure the Intel NIC is actually well-behaved on its own.
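
For reference, a debug-enabled image could be built roughly like this. It is only a sketch: it assumes a checkout of the iPXE source tree with the commands run from its src/ directory, and the bin/ipxe.pxe target is an assumption about how the image gets loaded, so substitute whatever target you normally build.

```shell
# Debug objects suggested above: NIC driver, iSCSI, SCSI and INT13 layers.
DEBUG_OBJECTS="intel,iscsi,scsi,int13"

# Assumption: bin/ipxe.pxe is the image you boot; change it to your usual target.
TARGET="bin/ipxe.pxe"

# Only attempt the build when run from a directory that has a Makefile
# (i.e. the src/ directory of the iPXE tree).
if [ -f Makefile ]; then
    make "$TARGET" DEBUG="$DEBUG_OBJECTS"
fi
```

The debug messages then show up on the iPXE console while the image runs.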
2014-02-11, 10:21
Post: #3
RE: iPXE iSCSI sanboot unreliable with 82573L at 1Gbit
Hello robinsmidsrod,

and thanks in advance for taking the time to help me :)

I compared two kinds of tcpdump captures: ones where the iSCSI boot works perfectly, and ones where the Intel 82573L was used and the problem occurred. In the "problem" dumps I could see that the last iSCSI read request was never fully served. For example, for a specific block address the target transferred 21 packets to a working client but only 5 packets to the Intel NIC. After those 5 packets no further packets were transferred, and of course the target closed the session after a timeout.

I know that there is a handover from the iPXE stack to the Windows stack, but that handover is never reached when the problem occurs.

Furthermore, last night I managed to build a custom iPXE from the current iPXE version together with the last "full" Intel driver pack, the one used by iPXE in branch 45e0327. I patched the Intel driver to force 100 Mbit half duplex. Yes, it's ugly :) But it's working perfectly now.

So, the quick answers to your questions:
1.) The boot fails before the Windows iSCSI client takes over, so iPXE is still in control.
2.) After 15-20 minutes without any transfer, an "unspecified I/O error" message is displayed.

At the moment I'm preparing a debug environment, so I will provide more information as soon as I'm able to.
Furthermore, I will perform the driver tests as you recommended.

I will update as soon as possible.

Thanks again!
2014-02-14, 10:27
Post: #4
RE: iPXE iSCSI sanboot unreliable with 82573L at 1Gbit
Also, you might be hitting some kind of issue related to iSCSI keep-alive (NOP-In) packets. iPXE does not support them, and when it receives one, it'll drop the connection and reattach. It could be that your iSCSI target is having issues with this behavior. If you're able to disable NOP-In packets (keep-alive) on the target you might have better luck. Can't really understand how this relates to the NIC having to be forced to 100Mbit though...

Awaiting your results from the driver tests. The loopback and high-MTU tests are especially good at exposing bugs. But it does seem like your problem is even earlier than that.

Aah, you're trying to use the old Intel driver (which was based on the Linux driver). I wouldn't recommend that you continue using that, as it is no longer supported and any question arising would just be met with the suggestion to upgrade to the latest version (with the new Intel driver). I'd suggest you try to patch the new driver to force it into 100Mbit half-duplex instead and see if that is also stable.

Also, the normal DEBUG=intel should give you some register output right after the 'dhcp' command that should shed some light, together with the output of the ifstat and route commands.
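
One way to capture that output is to embed a tiny script in the image. This is just a sketch under a few assumptions: the file name debug.ipxe is arbitrary, the commands are run from the src/ directory of an iPXE checkout, and bin/ipxe.pxe stands in for whatever target you actually boot.

```shell
# Write a small iPXE script that configures the NIC, prints interface
# statistics and routing information, then drops to the iPXE shell.
cat > debug.ipxe <<'EOF'
#!ipxe
dhcp
ifstat
route
shell
EOF

# EMBED= bakes the script into the image; DEBUG=intel enables the
# Intel driver's debug messages. Only build when a Makefile is present
# (i.e. when run from the src/ directory of the iPXE tree).
if [ -f Makefile ]; then
    make bin/ipxe.pxe EMBED=debug.ipxe DEBUG=intel
fi
```

Dropping to the shell at the end leaves the output on screen and lets you retry the sanboot by hand.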