Post Reply 
 
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
[SOLVED] iSCSI read errors during iPXE boot
2019-02-13, 23:19 (This post was last modified: 2019-02-17 21:10 by tlum.)
Post: #1
[SOLVED] iSCSI read errors during iPXE boot
EDIT: Substantially updated 2019-02-16 to include additional troubleshooting information.

I've had iPXE and iSCSI environments in place for years now, but for the first time I'm attempting to do an iSCSI boot and the iPXE is having a problem with the conversation with the iSCSI target.

The storage server
Code:
CentOS Linux release 7.6.1810 (Core)
Linux san1srvp01.********.net 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
zfs-0.7.12-1.el7_6.x86_64


The backing block instance
Code:
Disk /dev/zpool1/jane: 8422 MB, 8422687232 bytes, 16450561 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0xa9a554b4

           Device Boot      Start         End      Blocks   Id  System
/dev/zpool1/jane1              63       80324       40131   12  Compaq diagnostics
/dev/zpool1/jane2   *       80325    16434494     8177085    7  HPFS/NTFS/exFAT


The backing fileio instance
Code:
Disk /zpool1/nas1/Media/c0d0.img: 8422 MB, 8422686720 bytes, 16450560 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0xa9a554b4

                      Device Boot      Start         End      Blocks   Id  System
/zpool1/nas1/Media/c0d0.img1              63       80324       40131   12  Compaq diagnostics
/zpool1/nas1/Media/c0d0.img2   *       80325    16434494     8177085    7  HPFS/NTFS/exFAT


Simplified iSCSI Target configuration. This is an example of block using ZFS zvol, and I've also tried fileio which behaves no differently.
Code:
{
  "fabric_modules": [
    {
      "discovery_enable_auth": true,
      "discovery_password": "********************************",
      "discovery_userid": "san1srvp01",
      "name": "iscsi"
    }
  ],
  "storage_objects": [
    {
      "alua_tpgs": [
        {
          "alua_access_state": 0,
          "alua_access_status": 0,
          "alua_access_type": 3,
          "alua_support_active_nonoptimized": 1,
          "alua_support_active_optimized": 1,
          "alua_support_offline": 1,
          "alua_support_standby": 1,
          "alua_support_transitioning": 1,
          "alua_support_unavailable": 1,
          "alua_write_metadata": 0,
          "implicit_trans_secs": 0,
          "name": "default_tg_pt_gp",
          "nonop_delay_msecs": 100,
          "preferred": 0,
          "tg_pt_gp_id": 0,
          "trans_delay_msecs": 0
        }
      ],
      "attributes": {
        "block_size": 512,
        "emulate_3pc": 1,
        "emulate_caw": 1,
        "emulate_dpo": 1,
        "emulate_fua_read": 1,
        "emulate_fua_write": 1,
        "emulate_model_alias": 1,
        "emulate_rest_reord": 0,
        "emulate_tas": 1,
        "emulate_tpu": 0,
        "emulate_tpws": 0,
        "emulate_ua_intlck_ctrl": 0,
        "emulate_write_cache": 0,
        "enforce_pr_isids": 1,
        "force_pr_aptpl": 0,
        "is_nonrot": 1,
        "max_unmap_block_desc_count": 1,
        "max_unmap_lba_count": 262144,
        "max_write_same_len": 65535,
        "optimal_sectors": 32768,
        "pi_prot_format": 0,
        "pi_prot_type": 0,
        "queue_depth": 128,
        "unmap_granularity": 16,
        "unmap_granularity_alignment": 0,
        "unmap_zeroes_data": 0
      },
      "dev": "/dev/zpool1/jane",
      "name": "jane",
      "plugin": "block",
      "readonly": false,
      "write_back": false,
      "wwn": "8688850f-7200-48a0-ad32-0f4f9397a836"
    }
  ],
  "targets": [
    {
      "fabric": "iscsi",
      "tpgs": [
        {
          "attributes": {
            "authentication": 0,
            "cache_dynamic_acls": 0,
            "default_cmdsn_depth": 64,
            "default_erl": 0,
            "demo_mode_discovery": 1,
            "demo_mode_write_protect": 1,
            "fabric_prot_type": 0,
            "generate_node_acls": 0,
            "login_timeout": 15,
            "netif_timeout": 2,
            "prod_mode_write_protect": 0,
            "t10_pi": 0,
            "tpg_enabled_sendtargets": 1
          },
          "enable": true,
          "luns": [
            {
              "alias": "414d07d6b4",
              "alua_tg_pt_gp_name": "default_tg_pt_gp",
              "index": 2,
              "storage_object": "/backstores/block/jane"
            }
          ],
          "node_acls": [
            {
              "attributes": {
                "dataout_timeout": 3,
                "dataout_timeout_retries": 5,
                "default_erl": 0,
                "nopin_response_timeout": 30,
                "nopin_timeout": 15,
                "random_datain_pdu_offsets": 0,
                "random_datain_seq_offsets": 0,
                "random_r2t_offsets": 0
              },
              "chap_mutual_password": "****************",
              "chap_mutual_userid": "san1srvp01",
              "chap_password": "****************",
              "chap_userid": "jane",
              "mapped_luns": [
                {
                  "alias": "c8ce872be3",
                  "index": 2,
                  "tpg_lun": 2,
                  "write_protect": false
                }
              ],
              "node_wwn": "iqn.1999-10.net.********:jane"
            }
          ],
          "parameters": {
            "AuthMethod": "CHAP,None",
            "DataDigest": "CRC32C,None",
            "DataPDUInOrder": "Yes",
            "DataSequenceInOrder": "Yes",
            "DefaultTime2Retain": "20",
            "DefaultTime2Wait": "2",
            "ErrorRecoveryLevel": "0",
            "FirstBurstLength": "65536",
            "HeaderDigest": "CRC32C,None",
            "IFMarkInt": "2048~65535",
            "IFMarker": "No",
            "ImmediateData": "Yes",
            "InitialR2T": "Yes",
            "MaxBurstLength": "262144",
            "MaxConnections": "1",
            "MaxOutstandingR2T": "1",
            "MaxRecvDataSegmentLength": "8192",
            "MaxXmitDataSegmentLength": "262144",
            "OFMarkInt": "2048~65535",
            "OFMarker": "No",
            "TargetAlias": "LIO Target"
          },
          "portals": [
            {
              "ip_address": "192.168.40.1",
              "iser": false,
              "offload": false,
              "port": 3260
            }
          ],
          "tag": 1
        }
      ],
      "wwn": "iqn.1999-10.net.********:san1srvp01"
    }
  ]
}


The PXE/iPXE/TFTP/HTTP server
Code:
CentOS release 6.10 (Final)
Linux sy1srvp01.********.net 2.6.32-754.10.1.el6.i686 #1 SMP Tue Jan 15 17:33:10 UTC 2019 i686 i686 i386 GNU/Linux
tftp-0.49-8.el6.i686


The iPXE implementation will first hand off to scripts matching host name, uuid, or mac in that order. This is the individual iPXE boot script for this mac mac-0007e90feaf5.ipxe
Code:
set username jane
set password ****************
set reverse-username san1srvp01
set reverse-password ****************
set initiator-iqn iqn.1999-10.net.********:jane
sanboot iscsi:192.168.40.1::::iqn.1999-10.net.********:san1srvp01


The initiator
Code:
Compaq ML370 (Generation 0)
BIOS P17 (12/18/2002)
Processor 866/133 Mhz with 256k Cache
RAM 1 GB
Intel Boot Agent GE v1.2.22


The PXE -> iPXE chain load
Code:
PXE 2.1 Build 084 (WfM 2.0), RPL V1.25

PX->EB: PXE! at 9CC2:0070, entry point at 9CC2:0106
            UNDI code segment 9CC2:0000, data segment 969B:0000 (602-628kB)
            UNDI device is PCI 00:06.0, type DIX+802.3
            602kB free base memory after PXE unload
            
iPXE 1.0.0+ -- Open Source Network Boot Firmware -- http://ipxe.org
Features: DNS HTTP iSCSI TFTP AoE ELF MBOOT PXE bzImage Menu PXEXT


I've used a network trace to follow the iPXE sanboot iSCSI boot process. From a high level it is:

  1. Login Command (CHAP)
  2. Test Unit Ready
  3. Read Capacity(10)
  4. Read(10) <- Failure

First, the Read Capacity(10) return an unexpected value of 63 for LBA, instead of 16450560. It then attempts a Read(10) at LBA 64 which predictably fails with Logical Block Address Out Of Range. Testing indicates this is caused by specific interaction between iPXE and LIO, but the exact cause is unknown.

Read Capacity(10) - Request
Code:
Frame 27: 114 bytes on wire (912 bits), 114 bytes captured (912 bits)
Ethernet II, Src: Intel_0f:ea:f5 (00:07:e9:0f:ea:f5), Dst: SuperMic_6c:a9:82 (00:25:90:6c:a9:82)
Internet Protocol Version 4, Src: 192.168.4.13, Dst: 192.168.40.1
Transmission Control Protocol, Src Port: cifs (3020), Dst Port: iscsi-target (3260), Seq: 773, Ack: 637, Len: 48
iSCSI (SCSI Command)
Flags: 0xc1, F, R, Attr: Simple
SCSI CDB Read Capacity(10)
    [LUN: 0x0000]
    [Command Set:Direct Access Device (0x00) (Using default commandset)]
    [Response in: 29]
    Opcode: Read Capacity(10) (0x25)
    Control: 0x00

Read Capacity(10) - Response
Code:
Frame 29: 74 bytes on wire (592 bits), 74 bytes captured (592 bits)
Ethernet II, Src: SuperMic_6c:a9:82 (00:25:90:6c:a9:82), Dst: Intel_0f:ea:f5 (00:07:e9:0f:ea:f5)
Internet Protocol Version 4, Src: 192.168.40.1, Dst: 192.168.4.13
Transmission Control Protocol, Src Port: iscsi-target (3260), Dst Port: cifs (3020), Seq: 685, Ack: 821, Len: 8
[2 Reassembled TCP Segments (56 bytes): #28(48), #29(8)]
iSCSI (SCSI Data In)
SCSI Payload (Read Capacity(10) Response Data)
    [LUN: 0x0000]
    [Command Set:Direct Access Device (0x00) (Using default commandset)]
    [SBC Opcode: Read Capacity(10) (0x25)]
    [Request in: 27]
    [Response in: 29]
    LBA: 63 (0 MB)
    Block size in bytes: 512
SCSI Response (Read Capacity(10))
    [LUN: 0x0000]
    [Command Set:Direct Access Device (0x00) (Using default commandset)]
    [SBC Opcode: Read Capacity(10) (0x25)]
    [Request in: 27]
    [Time from request: 0.000252000 seconds]
    [Status: Good (0x00)]

Read(10) - Request
Code:
Frame 32: 114 bytes on wire (912 bits), 114 bytes captured (912 bits)
Ethernet II, Src: Intel_0f:ea:f5 (00:07:e9:0f:ea:f5), Dst: SuperMic_6c:a9:82 (00:25:90:6c:a9:82)
Internet Protocol Version 4, Src: 192.168.4.13, Dst: 192.168.40.1
Transmission Control Protocol, Src Port: cifs (3020), Dst Port: iscsi-target (3260), Seq: 821, Ack: 693, Len: 48
iSCSI (SCSI Command)
Flags: 0xc1, F, R, Attr: Simple
SCSI CDB Read(10)
    [LUN: 0x0000]
    [Command Set:Direct Access Device (0x00) (Using default commandset)]
    [Response in: 33]
    Opcode: Read(10) (0x28)
    Flags: 0x00
    Logical Block Address (LBA): 64
    ...0 0000 = Group: 0x00
    Transfer Length: 4
    Control: 0x00

Read(10) - Response
Code:
Frame 33: 214 bytes on wire (1712 bits), 214 bytes captured (1712 bits)
Ethernet II, Src: SuperMic_6c:a9:82 (00:25:90:6c:a9:82), Dst: Intel_0f:ea:f5 (00:07:e9:0f:ea:f5)
Internet Protocol Version 4, Src: 192.168.40.1, Dst: 192.168.4.13
Transmission Control Protocol, Src Port: iscsi-target (3260), Dst Port: cifs (3020), Seq: 693, Ack: 869, Len: 148
iSCSI (SCSI Response)
Flags: 0x80
SCSI: SNS Info
    [LUN: 0x0000]
    .111 0000 = SNS Error Type: Current Error (0x70)
    Valid: 112
    0... .... = Filemark: False
    .0.. .... = EOM: False
    ..0. .... = ILI: False
    .... 0101 = Sense Key: Illegal Request (0x5)
    Sense Info: 0x00000000
    Additional Sense Length: 10
    Command-Specific Information: 00000000
    Additional Sense Code+Qualifier: Logical Block Address Out Of Range (0x2100)
    Field Replaceable Unit Code: 0x00
    0... .... = SKSV: False
    .000 0000 0000 0000 0000 0000 = Sense Key Specific: 0x000000


The iPXE response to the console
Code:
Could not open SAN device: Input/output error (http://ipxe.org/1d704039
Could not boot image: Input/output error (http://ipxe.org/1d704039

The message logged on the iSCSI target
Code:
Feb 13 09:17:41 san1srvp01 kernel: cmd exceeds last lba 64 (lba 64, sectors 4)


Testing and Troubleshooting
  • Identical behavior is observed when attempting to boot a VM using this iSCSI LUN, ruling out the physical machine and its network card.
  • The iSCSI device behaves correctly when mounted using the native Linux initiator, and the device was initially populated over an iSCSI mount using dd to copy the image file to it.
  • I created a custom build of iPXE that forces Read Capacity(16) and Read(16), to no avail.
  • I have found one documented instance of similar behavior which was determined to be caused by the operational parameters supplied (or not) during the Login stage, in that case. In response, I created a custom build with the same parameters and parameter values as those observed during a working mount using the native Linux initiator, to no avail.
  • I have tried moving the fileio backing store image from ZFS to an xfs volume, to no avail.
  • I have tried the iPXE sanboot with the block device initialized with zeros, to no avail. This suggests the problem is not related to partitions or block device content.

  • Can anyone attest to having a working setup equivalent to this?
  • Does anyone know of definite problems with this setup?
  • Does anyone know what would cause LIO to behave in this way?
  • And the hardest question of all, does anyone know how to fix it?

-TIA
Find all posts by this user
Quote this message in a reply
2019-02-17, 21:10 (This post was last modified: 2019-02-17 21:11 by tlum.)
Post: #2
RE: [SOLVED] iSCSI read errors during iPXE boot
The volume is LUN 2 on the storage server and iPXE needs to address LUN 2 explicitly. The storage server dosen't even have a LUN 0 so it not clear why the Test Unit Ready succeeded. Somewhat embarrassing that it was something so stupid simple.
Find all posts by this user
Quote this message in a reply
2019-02-17, 21:21
Post: #3
RE: [SOLVED] iSCSI read errors during iPXE boot
Great that you solved it!

It would maybe have helped if you have started with the iPXE error message - i really had to search to find the relevant bits in your post, but finally found this:
Could not open SAN device: Input/output error (http://ipxe.org/1d704039)

and the first suggestion on that page:
Quote:Make sure you connect to the correct LUN on the iscsi target. See sanuri for the correct URI syntax for specifying the LUN.

Use GitHub Discussions
VRAM bin
Visit this user's website Find all posts by this user
Quote this message in a reply
2019-02-17, 21:48 (This post was last modified: 2019-02-17 21:56 by tlum.)
Post: #4
RE: [SOLVED] iSCSI read errors during iPXE boot
I did read that, and more, and it didn't help, since I believed it was correct, since the target was responding with a "Test Unit Ready" Success, and other indications that I was hitting the correct LUN, and in addition to expecting it to cry if it's wrong because that initiators acls only has that that one LUN.

What adds to the confusion is that the Target has a dozen LUNs, but the Initiators all do a discovery and then automatically create sequential block devices, like:

client1
LUN3 = /dev/sda, LUN7 = /dev/sdb

client2
LUN5 = /dev/sda, LUN8 = /dev/sdb

So, it's easy to fall into the trap of thinking of the first device as being relative to the first LUN assigned to the initiators iqn. But that's not the case, iPXE wants the actual Target LUN #, and not a 0 based index relative to the initiators iqn acl. So in order to, "Make sure you connect to the correct LUN on the iscsi target", you first have to be exceptionally clear on exactly what the means.
Find all posts by this user
Quote this message in a reply
2019-02-17, 21:52
Post: #5
RE: [SOLVED] iSCSI read errors during iPXE boot
Ok, so the issue here could maybe have been avoided, or simplified if iPXE had used LUN index, instead of actual LUN # ?
Is that how the targets are defined in other tools as well, and this could/should be considered a bug in iPXE?

Use GitHub Discussions
VRAM bin
Visit this user's website Find all posts by this user
Quote this message in a reply
2019-02-17, 23:08
Post: #6
RE: [SOLVED] iSCSI read errors during iPXE boot
I think iPXE works as designed, so it's not a bug. The "L" in LUN is part of the confusion, since it's a "Logical" unit that can be remapped along the way.

As a new feature, what I might do is enumerate the LUNs the target is exposing, check what the user is passing against it, and throw up and error and the list if they're asking for something that's not available.

If I were to take that a step further I might implement an iPXE command that would dump a list in the console, so if you needed to you could interrogate the target and see what it looks like from iPXE's perspective.
Find all posts by this user
Quote this message in a reply
2019-02-18, 18:15
Post: #7
RE: [SOLVED] iSCSI read errors during iPXE boot
In all fairness to iPXE, the largest amount of grief was caused by the LIO Target. For days I was hammering away at a non-existent LUN, but the Target reported that it was 63 Logical Blocks in size, with a sector size of 512, the TEST UNIT READY was succeeding, and when you tried to read from it the complaint was about a non-existent LBA... all of this while it was really a non-existent LUN. It was that deception that kept me distracted. Had it behaved more like a missing LUN, with a Service Unavailable fault, this issue would have been over before it even started. So, I'm not sure the degree to which iPXE ought to take on the weight of dealing LIOs confusion.
Find all posts by this user
Quote this message in a reply
2019-02-18, 20:02
Post: #8
RE: [SOLVED] iSCSI read errors during iPXE boot
The error message page is updated, I think it is the 3rd or 4th time this pops up even tho the error page tries to enforce checking LUN.

Use GitHub Discussions
VRAM bin
Visit this user's website Find all posts by this user
Quote this message in a reply
Post Reply 




User(s) browsing this thread: 3 Guest(s)