iPXE discussion forum

ipxe http download hangs
I've been on IRC trying to solve this problem but I think it's easier if I post it here :

I'm using iPXE to boot several Linux machines simultaneously (currently 2). Those machines are rebooted many times a day. Each one has to download an initrd file of around 500 MB over a 10/100 Mbit/s Ethernet switch.

About one time in five, the download hangs partway through, at varying percentages, and Linux never boots.

I have an HTTP debug build, and the last thing it shows is: HTTP 0x16eb4 start of data

My web server is Apache on CentOS 5.3. The logs show nothing (no error, but no success entry either).

A wireshark capture shows that the transfer stops at exactly the same moment for the 2 linux machines that are http downloading with ipxe. There are a few TCP retransmits - over 20 seconds, then nothing at all.

Trying to narrow down the issue, I made another interesting discovery: while a download is running from iPXE, if I start an infinite wget loop on the same file from another Linux machine, the iPXE download stalls. It only resumes once I stop wget, and then it completes. I observed this on a bare-metal PC and also in a virtual machine.
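For anyone who wants to reproduce the competing-download load without wget, here is a minimal Python sketch of the same idea: fetch one file over and over and discard the body. The function name and the URL in the usage comment are hypothetical, and this is only an approximation of what the wget loop does, not the exact command used above.

```python
import urllib.request


def download_repeatedly(url, iterations, chunk_size=64 * 1024):
    """Fetch `url` `iterations` times, discarding the body.

    Returns the total number of bytes read across all iterations.
    """
    total = 0
    for _ in range(iterations):
        with urllib.request.urlopen(url) as response:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:  # empty read means end of body
                    break
                total += len(chunk)
    return total


# Usage (hypothetical URL, substitute the real initrd location):
# download_repeatedly("http://webserver.example/initrd.img", iterations=1000)
```

Running this on a second machine while iPXE is downloading should generate roughly the same kind of contention on the web server's link.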

Could it be that iPXE's HTTP does not work well in a congested environment? That iPXE and wget (or 2 instances of iPXE) compete with each other, and iPXE loses?

I've also tested many builds of iPXE, going as far back as October 2010. Same results observed.

I am now using tftp for the initrd file, which is really slow. I'd really appreciate some help. Thank you

I'm having the same problem. Did anybody successfully solve this?

A wireshark or tcpdump trace file from the perspective of the web server might help narrow down the problem. See http://ipxe.org/howto/pcap for details on how to create one. Remember that a packet capture file may contain sensitive information, like passwords, so you might want to "clean" it before you upload it, or run it on a test-only network segment.

Bringing up this old story (I am the OP).

The hanging HTTP download problem did not occur with commit 149b5023, so I originally settled on that commit for the iPXE build we use in production. We kickstart Linux with large initrd files (several hundred MB) many times a day, on a congested network with a lot of broadcast traffic (around 100 packets/s). We typically kickstart several machines (e.g. 10) at the same time from the same HTTP server.

More recently we extended our use of iPXE and needed the menu functions, which were introduced later. So I compiled the HEAD of iPXE and found that the hanging HTTP download issue was back: it would hang at a random percentage. Hitting Ctrl-C would restart the download. It occurred very frequently.

Then I tried to bisect the issue by hand and I found that:

* Commit a87c0c (Jul 20, 2012) had the menu functionality I needed and did not show the hang behaviour. This is the one I am using in production now. I had to cherry-pick 1ac62b and eb5a2ba to fix some compiler warnings on gcc 4.9.
* Later commit 71727 (Nov 21, 2012) did not hang, but HTTP download was much slower.
* Similar behaviour with later commit e523 (Aug 14, 2013): no hang, but slower download.
* Current HEAD (e905cdc) shows frequent hangs as described above.

Based on this, it seems that http download performance of ipxe has degraded since 2012, first by download speed slowing down, then hanging altogether.
There are quite a lot of commits between a87c0c (jul 20 2012) and 71727 (nov 21, 2012), 103 in fact. Could you do a precise bisect (following http://ipxe.org/howto/bisect) between these two to figure out exactly where the HTTP downloads started slowing down?

I see two commits that might have something to do with it, but bisecting should make sure we get the right one.
* 8f7cd88 - [http] Fix HTTP SAN booting (2 years, 7 months ago)
* 501527d - [http] Treat any unexpected connection close as an error (2 years, 7 months ago)

For reference, I'm adding in the other commits you mentioned above with messages.
* a87c0c4 - [isa] Avoid spurious compiler warning on gcc 4.7 (2 years, 8 months ago)
* 717279a - [efi] Include product short name in EFI SNP device names (2 years, 4 months ago)
* e52380f - [uri] Allow URIs to incorporate a parameter list (1 year, 7 months ago)
* e905cdc - [xhci] Undo PCH-specific quirk fixes when removing device (3 days ago)
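To give a feel for how little work a precise bisect over that range is, here is a small Python sketch of the binary search that a bisect performs. It is not the git implementation: the commit labels are placeholders, and `is_bad` stands in for the manual step of building that commit and checking whether the HTTP download is slow.

```python
def bisect_first_bad(commits, is_bad):
    """Find the first bad commit by binary search.

    `commits` is ordered oldest to newest; commits[0] is assumed known good
    and commits[-1] is assumed known bad. `is_bad` is a hypothetical callback
    that builds and tests one commit and returns True if it misbehaves.
    Returns (first_bad_commit, number_of_commits_tested).
    """
    lo, hi = 0, len(commits) - 1  # lo is always good, hi is always bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], tests


# Placeholder labels: one known-good commit followed by 103 later commits.
commits = ["c%d" % i for i in range(104)]
first_bad, tests = bisect_first_bad(commits, lambda c: int(c[1:]) >= 60)
print(first_bad, tests)
```

With a range of about 103 commits the search needs at most 7 build-and-test rounds, whichever commit turns out to be the culprit.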
Just to chime in, I'm seeing what seems to be the same or a similar issue. PXE booting around 10-40 nodes, ~100 MB of images downloaded by each node. When it does manifest, it's usually only 2-4 nodes that fail, and if rebooted they PXE boot like normal. It's peculiar, but this is only reproducible in a smaller subset of our environments, so I'm wondering whether this problem could be more prevalent on more congested networks?

I have this random problem on HEAD, but I don't start several machines: only one is PXE booting. I just restart the process and it works.