iPXE discussion forum

Full Version: iPXE ESXi 6.0u2 on UEFI -> PSOD
I’m trying to solve a PSOD (purple screen) issue on a ProLiant DL380 Gen9 while installing ESXi via PXE in UEFI mode. The only consistent thing is that the system always PSODs at the same point (the kernel stack is always the same).
Every DL380 Gen9 I could get my hands on crashes, and it crashes on every ESXi release.

The PSOD indicates a hardware issue, but installing via the ISO media works just fine. I found an advisory that describes the issue really well, but it relates to running ESXi servers, not to the ESXi installation, and the suggested workaround doesn’t work.

• I’m focusing on 2 DL380 Gen9 servers: 1x LFF (SATA) based, 1x SFF (SAS) based (both panic with the same error at the same spot)
o iLO is 2.44 (latest from the web)
o System ROM is 2.22_07-18-2016 (latest from the web)
o SPP 2016.04 is loaded
• I have the PSOD with ALL ESXi releases
o Custom HPE 5.5 update 2 / 5.5 update 3 / 6.0 / 6.0 update 1 / 6.0 update 2
o Original VMware release (6.0 update 2)

• The following advisory describes the issue pretty closely (except that I’m getting the PSOD during installation)
o http://h20564.www2.hpe.com/portal/site/h...-c04912076
o YES, the ESXi panic code is: LINT1/NMI (Motherboard nonmaskable interrupt) undiagnosed
o YES, the IML says: Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 0, Function 0, Error status 0x00000000)
Note: the advisory covers Error status 0x00100000. Here we have 0x00000000.

• Applied the advisory’s workaround (kernelopt in boot.cfg: iovDisableIR=TRUE) but ALL systems STILL panic
• It isn’t a hardware issue. Installing from ISO works just fine; I removed all PCI cards (even the internal Smart Array controller) and it still panics.
• The issue occurs only in UEFI mode. BIOS mode works fine

o The system starts BOOTX64.EFI … it loads all the modules listed in BOOT.CFG.
o The yellow startup screen comes up and the gray progress bar at the bottom of the screen advances.
o The last message from ESXi is “VM Kernel loaded successfully”, then the system panics


Anyone care to reproduce ? … I’ve got a feeling I hit some undocumented bug.

• The iPXE script looks like:
#!ipxe
kernel http://<ip-address>/netboot/vmware/6.0_u2/boot/bootx64.efi -c http://<ip-address>/netboot/vmware/6.0_u2/boot/boot.cfg
boot

• Boot.cfg looks like:
bootstate=0
title=Loading ESXi installer
timeout=5
prefix=http://<ip-address>/netboot/vmware/6.0_u2/depot
kernel=tboot.b00
kernelopt=runweasel iovDisableIR=true
modules=b.b00 --- jumpstrt.gz --- useropts.gz --- k.b00 --- chardevs.b00 --- a.b00 --- user.b00 --- uc_intel.b00 --- uc_amd.b00 --- sb.v00 --- s.v00 --- scsi_mpt.v00 --- net_tg3.v00 --- elxnet.v00 --- ima_be2i.v00 --- lpfc.v00 --- scsi_be2.v00 --- amsHelpe.v00 --- conrep.v00 --- hpbootcf.v00 --- hpe_buil.v00 --- hpe_esxi.v00 --- hpe_ilo.v00 --- hpe_smx_.v00 --- hponcfg.v00 --- hpssacli.v00 --- hptestev.v00 --- char_hpc.v00 --- hpnmi.v00 --- scsi_hpd.v00 --- scsi_hps.v00 --- scsi_hpv.v00 --- intelcim.v00 --- net_i40e.v00 --- net_igb.v00 --- net_ixgb.v00 --- nmlx4_co.v00 --- nmlx4_en.v00 --- misc_cni.v00 --- net_bnx2.v00 --- net_bnx2.v01 --- net_cnic.v00 --- net_nx_n.v00 --- net_qlcn.v00 --- qlnative.v00 --- scsi_bfa.v00 --- scsi_bnx.v00 --- scsi_bnx.v01 --- scsi_qla.v00 --- mtip32xx.v00 --- ata_pata.v00 --- ata_pata.v01 --- ata_pata.v02 --- ata_pata.v03 --- ata_pata.v04 --- ata_pata.v05 --- ata_pata.v06 --- ata_pata.v07 --- block_cc.v00 --- ehci_ehc.v00 --- emulex_e.v00 --- weaselin.t00 --- esx_dvfi.v00 --- esx_ui.v00 --- ima_qla4.v00 --- ipmi_ipm.v00 --- ipmi_ipm.v01 --- ipmi_ipm.v02 --- lsi_mr3.v00 --- lsi_msgp.v00 --- lsu_hp_h.v00 --- lsu_lsi_.v00 --- lsu_lsi_.v01 --- lsu_lsi_.v02 --- lsu_lsi_.v03 --- lsu_lsi_.v04 --- misc_dri.v00 --- net_e100.v00 --- net_e100.v01 --- net_enic.v00 --- net_forc.v00 --- net_mlx4.v00 --- net_mlx4.v01 --- net_vmxn.v00 --- nmlx4_rd.v00 --- nvme.v00 --- ohci_usb.v00 --- rste.v00 --- sata_ahc.v00 --- sata_ata.v00 --- sata_sat.v00 --- sata_sat.v01 --- sata_sat.v02 --- sata_sat.v03 --- sata_sat.v04 --- scsi_aac.v00 --- scsi_adp.v00 --- scsi_aic.v00 --- scsi_fni.v00 --- scsi_ips.v00 --- scsi_meg.v00 --- scsi_meg.v01 --- scsi_meg.v02 --- scsi_mpt.v01 --- scsi_mpt.v02 --- uhci_usb.v00 --- vsan.v00 --- vsanheal.v00 --- vsanmgmt.v00 --- xhci_xhc.v00 --- tools.t00 --- nmst.v00 --- xorg.v00 --- imgdb.tgz --- imgpayld.tgz
build=
updated=0
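To double-check a setup like the one above, it can help to expand boot.cfg into the exact URLs the loader will request, so each one can be fetched by hand. What follows is only a minimal sketch in Python, assuming modules are listed without per-module arguments (as in the file above). The function names and the 192.0.2.10 address are made up for illustration; they are not part of ESXi or iPXE.

```python
def parse_boot_cfg(text):
    """Parse boot.cfg 'key=value' lines into a dict, splitting on the first '='."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        cfg[key.strip()] = value.strip()
    return cfg


def module_urls(cfg):
    """Return the full location of the kernel followed by every module."""
    prefix = cfg.get("prefix", "").rstrip("/")
    modules = [m.strip() for m in cfg.get("modules", "").split("---") if m.strip()]
    names = [cfg["kernel"]] + modules
    return [f"{prefix}/{name}" if prefix else name for name in names]


# Shortened sample; the address is a placeholder, not a real server.
SAMPLE = """\
bootstate=0
title=Loading ESXi installer
prefix=http://192.0.2.10/netboot/vmware/6.0_u2/depot
kernel=tboot.b00
kernelopt=runweasel iovDisableIR=true
modules=b.b00 --- jumpstrt.gz --- useropts.gz
"""

if __name__ == "__main__":
    for url in module_urls(parse_boot_cfg(SAMPLE)):
        print(url)
```

Splitting on the first "=" only keeps values such as kernelopt=runweasel iovDisableIR=true intact.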
Exactly the same here on a DL360 Gen9.

iLO: 2.40
System ROM: 2.00 and 1.50 (tested with both)

Tried custom HP image 6.0 with U1 and U2.

I've got EXACTLY the same errors as you have and a nearly identical boot.cfg

Not found a workaround or solution yet.
I installed ESXi manually on the physical server from a USB drive.
Then I tried to install a nested ESXi as a VM on the physical server.
The VM is configured to boot with EFI.

From there I tried the PXE boot and the install went fine...
So I am pretty sure my config is OK. It seems the problem is probably related to the hardware?
I tried to install with the 6.0 U2 HP ISO connected to the iLO Virtual Drive. Works like a charm.

Looks like something (network) hardware related.
I tried to build a more recent HPE 6.0 U2 bundle for installation.
Only new drivers were a HPSA driver and a QLogic CNA driver... did not help...

On to the next brilliant idea... (which has yet to bubble up :) )
I had a successful installation when I configured the network installation server from the Embedded UEFI Shell,
following page 33 of this guide:
http://h20564.www2.hpe.com/hpsc/doc/publ...=c04565930

Specify a static NIC IP address configuration, for example:
Shell> sysconfig -s PreBootNetwork=Auto Dhcpv4=Disabled
Ipv4Address=192.168.1.105 Ipv4SubnetMask=255.255.255.0
Ipv4Gateway=192.168.1.1 Ipv4PrimaryDNS=192.168.0.2
Ipv4SecondaryDNS=192.168.10.3
UrlBootfile=http://boot.server.com/iso/vmware6u2.iso

That command adds an entry to the one time boot menu.
When you select that entry, the installation works properly.

But still, this does not help me with the iPXE configuration.
Another update on this issue:

We’ve come one step closer to understanding the issue, however, things have also become more complex.
Is it our Gen9? iPXE? Or VMware? Who can tell.

One thing to note, and this is interesting: if we set up DHCP to hand out BOOTX64.EFI from the ESXi DVD directly (so skipping iPXE), then the system doesn’t panic. So that’s a YEAH, but also a surprise. Do note that I’ve tested this with 6.0 update 1. It didn’t PSOD, but it failed later on because it lacked the 334i network card driver (but that’s another issue).

So yes I can confirm:
UEFI_PXE > iPXE > vmware -> PSOD
UEFI_PXE > vmware -> no PSOD

I have it all set up using TFTP only, just to keep the HTTP question out of the way, so it boils down to whether iPXE is in the middle or not. So is iPXE at fault? I wish it were that simple.
It is the unique combination of Gen9_UEFI / iPXE / BOOTX64.efi. If one component is out of the mix = success. If all three come into play = problem.

So far we can conclude:

GEN9_UEFI / IPXE / NON_VMWARE ---> yes, we can boot Windows and Linux ---> (so is it an ESXi issue?)
NON_GEN9_UEFI / iPXE / VMWARE ---> yes, we can install it on non-ProLiant UEFI systems ---> (so is it a ProLiant issue?)
GEN9_UEFI / VMWARE ---> yes, this works too ---> (so is iPXE at fault?)

So at this point, I still cannot pinpoint this to a Gen9 issue, or iPXE, or VMware.
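The same elimination can be written down mechanically. This is just a small Python sketch of the reasoning, with the component labels ("gen9_uefi", "ipxe", "esxi") and the function name chosen here for illustration. It looks for a component that appears in every failing boot chain and in no working one.

```python
# Each observation: (components in the boot chain, did the install work?)
OBSERVATIONS = [
    ({"gen9_uefi", "ipxe", "esxi"}, False),  # PSOD
    ({"gen9_uefi", "ipxe"}, True),           # Windows and Linux boot fine
    ({"ipxe", "esxi"}, True),                # non-ProLiant UEFI installs fine
    ({"gen9_uefi", "esxi"}, True),           # firmware PXE straight into ESXi
]


def single_culprits(observations):
    """Components present in every failing chain and in no working chain."""
    failing = [chain for chain, ok in observations if not ok]
    working = [chain for chain, ok in observations if ok]
    if not failing:
        return set()
    suspects = set.intersection(*failing)
    for chain in working:
        suspects -= chain
    return suspects


if __name__ == "__main__":
    print(single_culprits(OBSERVATIONS))
```

With the four observations listed, the result is the empty set: every component also appears in at least one working chain, so only the combination of all three fails.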

One could say, well, forget about iPXE, but skipping iPXE isn’t an option. iPXE is our central PXE engine. With it, we boot the system using the F12 button, which gives us a selection menu where we choose what to install: Windows, RHEL, SLES, or VMware. It works excellently except for UEFI with VMware. Without iPXE there is no selection menu, no Windows, no RHEL, no SLES. And basically we want our DHCP/PXE server to do more than just VMware.
I found this thread while searching for a similar issue over the last couple of days. I am trying a similar process using ipxe.efi or snponly.efi (fresh build from git). I'm not getting any PSOD; all that happens is that the ESXi 6.0 installer happily loads all the modules over HTTP, then freezes. No further messages or movement. It's at the part where it would show "relocating modules and starting kernel", but that never appears.

Have tried adding ignoreHeadless=TRUE to boot options, doesn't help. Like you, have seen that booting bootx64.efi (viz mboot.efi) does work, although I guess it's downloading modules via TFTP. At least it does boot into the installer prompts.

Since there's no diagnostic output, I can't tell if it hates the video environment somehow, or it's dying initializing the modules that were loaded through ipxe.

I've only tried the HP ESXi 6.0 update 2 so far... I'll try U1 for completeness. But you're definitely not imagining this.
-Alan
Please try this. It works for me.
I think HTTP should be the same as TFTP.

#!ipxe
dhcp
:MENU
menu
item --gap-- -------ipxe UEFI boot menu-----
item vmware vmware
:vmware
kernel tftp://<ip-address>/netboot/vmware/6.0_u2/boot/bootx64.efi
initrd -n boot.cfg tftp://<ip-address>/netboot/vmware/6.0_u2/boot/boot.cfg boot.cfg
boot

• Boot.cfg looks like:
bootstate=0
title=Loading ESXi installer
timeout=5
prefix=tftp://<ip-address>/netboot/vmware/6.0_u2/depot
kernel=tboot.b00
kernelopt=runweasel iovDisableIR=true
modules=b.b00 --- jumpstrt.gz --- useropts.gz --- k.b00 --- chardevs.b00 --- a.b00 --- user.b00 --- uc_intel.b00 --- uc_amd.b00 --- sb.v00 --- s.v00 --- scsi_mpt.v00 --- net_tg3.v00 --- elxnet.v00 --- ima_be2i.v00 --- lpfc.v00 --- scsi_be2.v00 --- amsHelpe.v00 --- conrep.v00 --- hpbootcf.v00 --- hpe_buil.v00 --- hpe_esxi.v00 --- hpe_ilo.v00 --- hpe_smx_.v00 --- hponcfg.v00 --- hpssacli.v00 --- hptestev.v00 --- char_hpc.v00 --- hpnmi.v00 --- scsi_hpd.v00 --- scsi_hps.v00 --- scsi_hpv.v00 --- intelcim.v00 --- net_i40e.v00 --- net_igb.v00 --- net_ixgb.v00 --- nmlx4_co.v00 --- nmlx4_en.v00 --- misc_cni.v00 --- net_bnx2.v00 --- net_bnx2.v01 --- net_cnic.v00 --- net_nx_n.v00 --- net_qlcn.v00 --- qlnative.v00 --- scsi_bfa.v00 --- scsi_bnx.v00 --- scsi_bnx.v01 --- scsi_qla.v00 --- mtip32xx.v00 --- ata_pata.v00 --- ata_pata.v01 --- ata_pata.v02 --- ata_pata.v03 --- ata_pata.v04 --- ata_pata.v05 --- ata_pata.v06 --- ata_pata.v07 --- block_cc.v00 --- ehci_ehc.v00 --- emulex_e.v00 --- weaselin.t00 --- esx_dvfi.v00 --- esx_ui.v00 --- ima_qla4.v00 --- ipmi_ipm.v00 --- ipmi_ipm.v01 --- ipmi_ipm.v02 --- lsi_mr3.v00 --- lsi_msgp.v00 --- lsu_hp_h.v00 --- lsu_lsi_.v00 --- lsu_lsi_.v01 --- lsu_lsi_.v02 --- lsu_lsi_.v03 --- lsu_lsi_.v04 --- misc_dri.v00 --- net_e100.v00 --- net_e100.v01 --- net_enic.v00 --- net_forc.v00 --- net_mlx4.v00 --- net_mlx4.v01 --- net_vmxn.v00 --- nmlx4_rd.v00 --- nvme.v00 --- ohci_usb.v00 --- rste.v00 --- sata_ahc.v00 --- sata_ata.v00 --- sata_sat.v00 --- sata_sat.v01 --- sata_sat.v02 --- sata_sat.v03 --- sata_sat.v04 --- scsi_aac.v00 --- scsi_adp.v00 --- scsi_aic.v00 --- scsi_fni.v00 --- scsi_ips.v00 --- scsi_meg.v00 --- scsi_meg.v01 --- scsi_meg.v02 --- scsi_mpt.v01 --- scsi_mpt.v02 --- uhci_usb.v00 --- vsan.v00 --- vsanheal.v00 --- vsanmgmt.v00 --- xhci_xhc.v00 --- tools.t00 --- nmst.v00 --- xorg.v00 --- imgdb.tgz --- imgpayld.tgz
build=
updated=0
Just tried the Update 1 version of this as a comparison; same behavior for me: it all freezes up right after the last module load (in my case, imgpayld.tgz).
And the definite problem is using HTTP to download the modules. While it looks like it works OK, something goes badly wrong somewhere, invisibly. I did the following test, starting from a DL360 G9 booted into the snponly.efi command line, where I chained to an ipxe.cfg script on a web server I could edit:

Failed:
boot http://10.10.10.16/dists/esxi/6.0/mboot.efi -c http://10.10.10.16/dists/esxi/6.0/boot.cfg ks=http://10.10.10.16/01-ec-b1-d7-77-42-90.cfg

Succeeded:
boot tftp://10.10.10.16/dists/esxi/6.0/mboot.efi -c tftp://10.10.10.16/dists/esxi/6.0/boot.cfg ks=http://10.10.10.16/01-ec-b1-d7-77-42-90.cfg

So, why would downloading the installer modules via HTTP be so much worse than pulling them via TFTP?
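For anyone wanting to repeat this HTTP versus TFTP comparison with the module downloads as well, the prefix= line in boot.cfg can be switched between the two schemes. The following is only a sketch in Python under that assumption; the helper names and the 192.0.2.10 address are hypothetical, and it touches nothing but the prefix line.

```python
from urllib.parse import urlsplit, urlunsplit


def with_scheme(url, scheme):
    """Return the same URL with only its scheme replaced."""
    parts = urlsplit(url)
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, parts.fragment))


def switch_prefix_scheme(boot_cfg_text, scheme):
    """Rewrite the scheme of the 'prefix=' line, leaving all other lines as they are."""
    out = []
    for line in boot_cfg_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "prefix":
            line = f"{key}={with_scheme(value.strip(), scheme)}"
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        "bootstate=0\n"
        "prefix=http://192.0.2.10/netboot/vmware/6.0_u2/depot\n"  # placeholder address
        "kernel=tboot.b00\n"
    )
    print(switch_prefix_scheme(sample, "tftp"), end="")
```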
These are just guesses, since I have not tested VMware at all.
The server may send .gz files with a header that makes the client (iPXE) decompress them, and that throws the loader off (this does not happen with TFTP). To test this, check the headers with packet dumps, change the filenames of the *.gz files, or change the HTTP server config.

One other thing worth knowing is that iPXE does not decompress compressed files before passing them to the kernel (it does if the HTTP server sends headers telling it to, but that happens before the files are passed on to the kernel). However, since it works with TFTP, we can also rule this out.

Has anyone tested with NFS to see if it works?
(2016-09-27 19:01) NiKiZe wrote: These are just guesses, since I have not tested VMware at all.
The server may send .gz files with a header that makes the client (iPXE) decompress them, and that throws the loader off (this does not happen with TFTP). To test this, check the headers with packet dumps, change the filenames of the *.gz files, or change the HTTP server config.

What is meant by "change the http server config?" Disabling some header?
(2016-09-27 19:19) webminster wrote: What is meant by "change the http server config?" Disabling some header?

If the HTTP server sends headers for .gz files that iPXE interprets as server-side compression, you might want to disable that.
If I'm not mistaken, that would show up in the Content-Encoding header of the reply.

It still does not explain why it works on some hardware but not on other hardware.
Maybe building iPXE with DEBUG=httpcore could shed some light on how the connection is used.
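As a quick way to look at that header without a packet dump, here is a rough Python sketch using only the standard library. The function names are made up for this example, and it only reports what the web server answers to a plain HEAD request, which may differ from what it sends to iPXE.

```python
import urllib.request


def declares_content_encoding(headers):
    """True if the headers carry a Content-Encoding other than 'identity'."""
    for name, value in headers.items():
        if name.lower() == "content-encoding":
            return value.strip().lower() not in ("", "identity")
    return False


def fetch_headers(url, timeout=10):
    """Send a HEAD request and return the response headers as a dict."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())


def report(urls):
    """Print one line per URL saying whether a Content-Encoding is declared."""
    for url in urls:
        flagged = declares_content_encoding(fetch_headers(url))
        print(f"{url}: {'Content-Encoding set' if flagged else 'no Content-Encoding'}")
```

Calling report() with the URLs of the .gz and .tgz modules from boot.cfg would show whether the server labels any of them as encoded.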
Tried using the latest and greatest, VMware ESXi 6.5 HPE Custom ISO on a DL360 Gen9. Did not help at all, exactly the same error message.

Tried booting from the ISO in an iLO virtual drive; that does work on exactly the same hardware.
Has anyone tried to contact VMware about this issue? This is probably something they would have an interest in getting to work and should test on their end.
Have there been any updates? We just switched over to UEFI and ran into this problem.
Tried using TFTP but still getting the PSOD.
Also tried Update 3 and still no change.
The server was updated to HPE SPP Oct 2016.


EDIT:
I switched ipxe.efi for snponly.efi and it seems to be working completely over HTTP. :oD
@fdge, what hardware do you use?
I am trying this on a ProLiant Gen10 server, in both Legacy and UEFI mode, for testing purposes.

My learnings:

1) Legacy mode: compile undionly.kpxe with COMBOOT enabled, as suggested in
https://www.reversengineered.com/2015/02...i-in-ipxe/
Failure to do this will trigger an Illegal Opcode immediately (full red screen with the text ILLEGAL OPCODE in the top left corner).

2) UEFI mode: indeed, use snponly.efi instead of the full ipxe.efi.
When using the full ipxe.efi, ESXi starts loading files, but when it switches to the yellow and gray screen, it gives a PSOD, reboots, and most of the time reports a UMCE in the server's IML.


In my environment, I am chainloading via a hardcoded script, and everything is being pulled from HTTP.