2016-06-01, 13:26
I started discussing this issue on IRC last week; here's a recap for everyone else. If you want to skip the background story and get right to my final point/suggestion, scroll down to the last paragraph (tl;dr: on our network, DHCP requests without the broadcast flag set are unreliable).
I could not get the DHCP client to work when booting ipxe.usb from a USB key. A packet sniffer placed in front of the client showed that no DHCP offers ever reached it. Patching iPXE to set the broadcast flag in its DHCP requests fixed the issue: the reply was then delivered as a broadcast and everything worked fine. Since the onboard classic PXE on the two physical machines I had here also sent its DHCP requests with the broadcast flag set, I assumed this was the proper way to do it (but I was wrong).
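For anyone unfamiliar with the flag: in the fixed BOOTP/DHCP header (RFC 2131), "flags" is a 16-bit field at byte offset 10, and its most significant bit asks the server or relay to broadcast the reply. The following is only a rough sketch of what setting that bit amounts to, working on a raw packet buffer. The function and constant names are made up for illustration and are not the names used in iPXE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed BOOTP header layout (RFC 2131): op, htype, hlen, hops (1 byte each),
 * xid (4 bytes), secs (2 bytes), then the 2-byte flags field at offset 10. */
enum {
    BOOTP_FLAGS_OFFSET = 10,
    BOOTP_FIXED_HDR_LEN = 236 /* fixed header, before the options area */
};

/* The broadcast flag is the top bit of the big-endian flags field,
 * i.e. 0x80 in the first of its two bytes. */
#define BOOTP_BROADCAST_HI_BIT 0x80u

/* Hypothetical helper: mark a request as "please broadcast the reply". */
void dhcp_set_broadcast(uint8_t *pkt, size_t len)
{
    assert(pkt != NULL && len >= BOOTP_FIXED_HDR_LEN);
    pkt[BOOTP_FLAGS_OFFSET] |= BOOTP_BROADCAST_HI_BIT;
}

/* Hypothetical helper: check whether a packet has the broadcast flag set. */
int dhcp_wants_broadcast(const uint8_t *pkt, size_t len)
{
    assert(pkt != NULL && len >= BOOTP_FIXED_HDR_LEN);
    return (pkt[BOOTP_FLAGS_OFFSET] & BOOTP_BROADCAST_HI_BIT) != 0;
}
```

Working on bytes rather than a uint16_t avoids any byte-order conversion, since the field is big-endian on the wire.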
Next I tried to boot Ubuntu from a USB key, and lo and behold, its DHCP client did *not* set the broadcast flag, and yet it worked.
The network here is somewhat complex. The DHCP server is on another subnet, so there are relay agents running on the routers. We have several dozen subnets spread across the city, VLANs, private and public IP ranges, and probably a handful of hardware manufacturers/models for all the routers and switches in use.
I finally got one of the network guys to take half a day to debug this issue with me, as the problem seemed interesting enough to him.
And things were quite crazy. If the machine had booted up successfully and you quickly switched to the iPXE stick, it would boot just fine, which suggests that some MAC or ARP table somewhere along the way was still populated, and that things behaved weirdly otherwise. But other than that, we couldn't find anything overly suspicious, and the fact that every other client that requests unicast replies succeeds doesn't help much either. At some point I was desperate enough to hack up iPXE so much that its DHCP discover looked exactly like the one Ubuntu sends, but still no luck. We did realize, however, that the problem only occurs on some network ports, while on others it works, even if both lead to the same switch and both are configured on the same VLAN.
And best of all, today I could not reproduce the problem at all anymore, so I asked whether they had changed anything or had any idea what happened, but they didn't.
So finally what I'm trying to get to:
We're building a USB key that will boot a system from our servers. We want to hand it out as some kind of demo, so we want it to work anywhere someone has a DHCP server running and access to the internet. But I have no clue how common this problem would be in the wild. Maybe we're a one-in-a-million case. Maybe this is why classic PXE also sets the broadcast flag.
So we could maintain a private patch that always sets the broadcast flag, but it'd be nice if that could just be in mainline iPXE. It could easily be made a setting in config/dhcp.h (the ISC client has that too). A more sophisticated approach would be to set the flag automatically if the first one or two requests timed out.
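To make the two options a bit more concrete, here is a minimal sketch of the decision logic only. It is not a patch against iPXE: the macro DHCP_UNICAST_ATTEMPTS, the struct, and the function names are all hypothetical, and the threshold of two timeouts is just the "one or two requests" guess from above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build-time knob: how many discovers may time out before
 * falling back to the broadcast flag. Not an existing iPXE setting. */
#ifndef DHCP_UNICAST_ATTEMPTS
#define DHCP_UNICAST_ATTEMPTS 2
#endif

/* Value of the BOOTP flags field with only the broadcast bit set (RFC 2131). */
#define DHCP_FLAGS_BROADCAST 0x8000u

/* Hypothetical per-session retry state. */
struct dhcp_retry_state {
    unsigned int timeouts;   /* discovers that got no offer so far */
    bool always_broadcast;   /* option 1: a config switch forcing the flag */
};

/* Record that one more discover went unanswered. */
void dhcp_note_timeout(struct dhcp_retry_state *st)
{
    assert(st != NULL);
    st->timeouts++;
}

/* Flags (host byte order) to use for the next discover: broadcast if forced
 * by config, or (option 2) once enough unicast attempts have timed out. */
uint16_t dhcp_next_flags(const struct dhcp_retry_state *st)
{
    assert(st != NULL);
    if (st->always_broadcast || st->timeouts >= DHCP_UNICAST_ATTEMPTS)
        return DHCP_FLAGS_BROADCAST;
    return 0;
}
```

With this shape, networks where unicast replies work see no change in behaviour, and only clients that would otherwise keep timing out switch to requesting broadcast replies.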
Any thoughts on that? I'd be willing to create a patch for whatever solution sounds reasonable, if desired.