iPXE discussion forum

Full Version: DHCP issues with unicast
So I started discussing this issue on IRC last week; here's a recap for other people. If you don't care about the background story and want to get right to my final point/suggestion, scroll down to the last paragraph (tl;dr: on our network, DHCP requests without the broadcast flag set are unreliable).

I could not get the DHCP client to work when booting ipxe.usb from a USB key. A packet sniffer placed in front of the client showed that no DHCP offers ever reached the client. Patching iPXE to set the broadcast flag in the DHCP requests fixed the issue, as the reply was then delivered as a broadcast and everything worked fine. Seeing that the two physical machines I had here had the onboard classic PXE also send their DHCP requests with the broadcast flag set, I assumed this would be the proper way to do it (but I was wrong ;)).
Next I tried to boot Ubuntu from a USB key, and lo and behold, its DHCP client did *not* set the broadcast flag, and yet it worked.
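
For anyone not familiar with the flag in question: per RFC 2131 it is the most significant bit of the 16-bit flags field in the fixed BOOTP header. A minimal sketch in Python (not iPXE code; the helper names `bootp_header` and `wants_broadcast` are made up for illustration) that packs a request header with or without the bit and reads it back:

```python
import struct

# RFC 2131: the broadcast flag is the top bit of the 16-bit 'flags' field.
BOOTP_BROADCAST = 0x8000

def bootp_header(xid, chaddr, broadcast):
    """Pack the fixed 236-byte BOOTP header of a client request (no options)."""
    flags = BOOTP_BROADCAST if broadcast else 0
    return struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,                        # op: BOOTREQUEST
        1,                        # htype: Ethernet
        6,                        # hlen: MAC address length
        0,                        # hops
        xid,                      # transaction ID
        0,                        # secs
        flags,                    # broadcast flag lives here
        b"\0" * 4,                # ciaddr: client has no address yet
        b"\0" * 4,                # yiaddr
        b"\0" * 4,                # siaddr
        b"\0" * 4,                # giaddr: filled in by a relay agent
        chaddr.ljust(16, b"\0"),  # client hardware address, zero padded
        b"\0" * 64,               # sname
        b"\0" * 128,              # file
    )

def wants_broadcast(packet):
    """Return True if the packet asks for replies to be broadcast."""
    (flags,) = struct.unpack_from("!H", packet, 10)  # flags start at byte 10
    return bool(flags & BOOTP_BROADCAST)
```

In a packet capture this is the two bytes at offset 10 of the BOOTP payload: `80 00` with the flag set, `00 00` without.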

The network here is somewhat complex: the DHCP server is on another subnet, so there are relay agents running on the routers. We have several dozen subnets spread across the city, VLANs, private and public IP ranges, and probably a handful of hardware manufacturers/models for all the routers and switches in use.

So I could finally get one of the network guys to take half a day to debug this issue with me, as the problem seemed interesting enough to him. :)
And things were quite crazy. If the machine was booted up successfully and you quickly switched to the ipxe stick, it would boot just fine, which makes you assume that some mac or arp table somewhere along the way was still filled, and behaved weird otherwise. But other than that, we couldn't find anything overly suspicious, and the fact that every other client that does request unicast replies succeeds doesn't help too much either. At some point I was desperate enough to hack up ipxe so much that its dhcp discover looks exactly like the one ubuntu sends, but still no luck. We did realize however that the problem only occurs on some network ports, while on others it works, even if they both lead to the same switch, with both ports configured on the same vlan.
And best of all, today I could not reproduce the problem at all anymore, so I asked if they changed anything or have an idea what happened, but they don't.

So finally what I'm trying to get to:
We're building a USB key that will boot a system from our servers. We want to hand it out as some kind of demo, so we want it to work anywhere someone has a DHCP server running and access to the internet. But I have no clue how common this problem would be in the wild. Maybe we're a one-in-a-million case. Maybe this is why classic PXE also sets the broadcast flag.
So we could maintain a private patch that always sets the broadcast flag, but it'd be nice if that could just be in mainline iPXE. It could easily be made a setting in config/dhcp.h (the ISC client has that too). A more sophisticated approach would be to set the flag automatically if the first one or two requests timed out.
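
To make the automatic variant concrete, here is a rough sketch of the retry logic in Python rather than iPXE's C. All names (`discover_flags`, `run_discovery`, `send_discover`) are hypothetical, and this is not how iPXE's DHCP state machine is actually structured; it only illustrates "try unicast replies first, fall back to the broadcast flag after N timeouts":

```python
BOOTP_BROADCAST = 0x8000  # RFC 2131 broadcast flag

def discover_flags(attempt, unicast_attempts=2):
    """Flags for the Nth DHCPDISCOVER; 'attempt' counts from 0."""
    # The first few attempts leave the flag clear, later ones set it.
    return 0 if attempt < unicast_attempts else BOOTP_BROADCAST

def run_discovery(send_discover, max_attempts=4, unicast_attempts=2):
    """Retry discovery, switching to broadcast replies after timeouts.

    'send_discover' is a hypothetical callable that transmits one
    DHCPDISCOVER with the given flags and returns the offer, or None
    if the attempt timed out.
    """
    for attempt in range(max_attempts):
        flags = discover_flags(attempt, unicast_attempts)
        offer = send_discover(flags)
        if offer is not None:
            return offer, flags
    return None, None
```

On a well-behaved network the first attempt succeeds and nothing changes; on a network like ours the cost would be the first two timeouts before the fallback kicks in.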
Any thoughts on that? I'd be willing to create a patch for whatever solution sounds reasonable, if desired.
(2016-06-01 13:26) simon wrote: At some point I was desperate enough to hack up ipxe so much that its dhcp discover looks exactly like the one ubuntu sends, but still no luck. We did realize however that the problem only occurs on some network ports, while on others it works, even if they both lead to the same switch, with both ports configured on the same vlan.
And best of all, today I could not reproduce the problem at all anymore, so I asked if they changed anything or have an idea what happened, but they don't.

This sounds like a bug in the network infrastructure. When the problem can be reproduced, how far does the DHCP response packet get before it gets dropped? (Does it get dropped, or does it just get sent to the wrong egress switch port?)

I don't really want to add a mainline patch to work around what currently looks most likely to be a configuration problem specific to your infrastructure, unless the root cause is understood and it can be shown to be something that is likely to affect other people as well.

Michael
(2016-06-03 16:44) mcb30 wrote: This sounds like a bug in the network infrastructure. When the problem can be reproduced, how far does the DHCP response packet get before it gets dropped? (Does it get dropped, or does it just get sent to the wrong egress switch port?)
The DHCP server is directly hooked up to our central H3C router. On the port where the relay agent should send out the relayed OFFER, we always see an ARP request for the IP address that was in the OFFER, which obviously doesn't get a reply. We see this ARP request in both cases: where the relayed packet appears on the wire (ISC client) and where it does not appear on the wire (iPXE). That's pretty much how far we got. The next day it started working, and still does today.
Unfortunately I could not convince them to try to get logs or captures on the router itself, since the router is about a year old now, and the guy helping me out wasn't too familiar with its interface. So unfortunately no idea if the packet might have ended up on a wrong port. I was already glad he agreed to help me at all, since from the networking department's point of view, this is a corner case issue that doesn't impact anyone else anywhere.
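
For reference, this is the last-hop delivery rule as I understand it from RFC 2131 section 4.1, sketched in Python. The function name and the string representation of addresses are made up for illustration, and it says nothing about what the router firmware actually does:

```python
BOOTP_BROADCAST = 0x8000  # RFC 2131 broadcast flag

def last_hop_destination(flags, ciaddr, yiaddr, chaddr):
    """Return (ip_dst, l2_dst) for a reply sent on the client's own segment.

    l2_dst is None when the sender can resolve it normally via ARP.
    """
    if ciaddr != "0.0.0.0":
        # Client already has a working address, so ordinary ARP works.
        return ciaddr, None
    if flags & BOOTP_BROADCAST:
        # Client asked for broadcast replies.
        return "255.255.255.255", "ff:ff:ff:ff:ff:ff"
    # Unicast to the offered address, using the hardware address taken
    # from the request: the client cannot answer ARP for yiaddr yet.
    return yiaddr, chaddr
```

With the flag clear and no ciaddr, the link-layer destination comes straight from chaddr, which is why an ARP request for the offered address has no way of being answered by the client at that point.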

Quote: I don't really want to add a mainline patch to work around what currently looks most likely to be a configuration problem specific to your infrastructure, unless the root cause is understood and it can be shown to be something that is likely to affect other people as well.
Understood. Was hoping a small on/off patch would be ok, but I see where you are coming from. We'll keep patching locally since we don't want to risk ending up giving the USB key to anyone having the same issues.

Oh and since the networking guy didn't want to blame it on a faulty config (I wonder why :)) he considered the possibility of a firmware bug in the router, and started ranting about how Alcatel devices had the strangest bugs in the past and always required multiple updates...