target-devel.vger.kernel.org archive mirror
* Unable to recover from DataOut timeout while in ERL=0
@ 2024-03-05 10:36 Holger Amberg
  2024-03-05 16:55 ` Mike Christie
  0 siblings, 1 reply; 7+ messages in thread
From: Holger Amberg @ 2024-03-05 10:36 UTC (permalink / raw)
  To: target-devel

Hello, everyone, 

I am running a pair of Ubuntu 22.04 servers (AMD EPYC 9124) in a DRBD
Primary/Secondary plus iSCSI target configuration to serve VMware ESXi
hosts. Everything had been working for a few months, but I have recently
had two complete freezes within 24 hours caused by stuck iSCSI targets.
Only a complete reboot or a failover to the other node fixes the issue.
The problem starts with "Unable to recover from DataOut timeout" and then
cascades into many other errors, hung_task timeouts and kernel stack
traces at the end. Any idea what might cause this problem?

Environment:
* Ubuntu 22.04
* Kernel: 5.15.0-94-generic x86_64
* DRBD: 9.2.7
* 1x AMD EPYC 9124 CPU, 32 GB of RAM
* Asus RS500A-E12-RS12U/K14PA-U24
* 4 targets of 5 TB each, backed by DRBD and LVM

Mar  4 18:23:08 km38033 kernel: [32853.081205] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus iqn.1998-01.com.vmware:km38112-6f3c301b,i,0x00023d000001,iqn.2001-09.com.customer:target2,t,0x01

Mar  4 18:23:13 km38033 kernel: [32857.897105] ABORT_TASK: Found referenced iSCSI task_tag: 756708339
Mar  4 18:23:13 km38033 kernel: [32857.897144] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 756708339
Mar  4 18:23:14 km38033 kernel: [32859.382095] Detected MISCOMPARE at offset 20
Mar  4 18:23:21 km38033 kernel: [32866.464749] Detected MISCOMPARE at offset 56
Mar  4 18:23:22 km38033 kernel: [32867.481244] Detected MISCOMPARE at offset 13
Mar  4 18:23:23 km38033 kernel: [32868.000291] Detected MISCOMPARE at offset 13
Mar  4 18:23:23 km38033 kernel: [32868.335822] Detected MISCOMPARE at offset 13
Mar  4 18:23:24 km38033 kernel: [32869.176568] Detected MISCOMPARE at offset 13
Mar  4 18:23:28 km38033 kernel: [32873.047201] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 157801725
Mar  4 18:23:29 km38033 kernel: [32873.901843] Detected MISCOMPARE at offset 13
Mar  4 18:23:35 km38033 kernel: [32880.495731] Detected MISCOMPARE at offset 28
Mar  4 18:23:36 km38033 kernel: [32881.012692] Detected MISCOMPARE at offset 56
Mar  4 18:23:36 km38033 kernel: [32881.013909] Detected MISCOMPARE at offset 14
Mar  4 18:23:38 km38033 kernel: [32882.886672] Detected MISCOMPARE at offset 14
Mar  4 18:27:46 km38033 kernel: [33130.878094] ABORT_TASK: Found referenced iSCSI task_tag: 263906787
Mar  4 18:27:46 km38033 kernel: [33130.878118] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 263906787
Mar  4 18:27:47 km38033 kernel: [33131.879914] ABORT_TASK: Found referenced iSCSI task_tag: 756717268
Mar  4 18:27:47 km38033 kernel: [33131.879943] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 756717268
Mar  4 18:27:47 km38033 kernel: [33131.880389] ABORT_TASK: Found referenced iSCSI task_tag: 756717269
Mar  4 18:27:47 km38033 kernel: [33132.179400] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 756717269
Mar  4 18:27:47 km38033 kernel: [33132.180525] ABORT_TASK: Found referenced iSCSI task_tag: 756717304
Mar  4 18:27:47 km38033 kernel: [33132.180748] ABORT_TASK: Found referenced iSCSI task_tag: 209408054
Mar  4 18:27:47 km38033 kernel: [33132.180783] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 209408054
Mar  4 18:27:47 km38033 kernel: [33132.181353] ABORT_TASK: Found referenced iSCSI task_tag: 209408082
Mar  4 18:27:48 km38033 kernel: [33133.234828] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 756717304
Mar  4 18:27:48 km38033 kernel: [33133.234932] Detected MISCOMPARE at offset 20
Mar  4 18:27:49 km38033 kernel: [33133.568484] ABORT_TASK: Found referenced iSCSI task_tag: 18127
Mar  4 18:27:49 km38033 kernel: [33133.568507] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 18127
Mar  4 18:27:49 km38033 kernel: [33133.575026] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 209408082
Mar  4 18:27:49 km38033 kernel: [33133.845133] ABORT_TASK: Found referenced iSCSI task_tag: 1434712138
Mar  4 18:27:49 km38033 kernel: [33133.845159] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1434712138
Mar  4 18:27:49 km38033 kernel: [33133.845273] ABORT_TASK: Found referenced iSCSI task_tag: 157803718
Mar  4 18:27:49 km38033 kernel: [33133.845307] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 157803718
Mar  4 18:27:49 km38033 kernel: [33133.845344] ABORT_TASK: Found referenced iSCSI task_tag: 286437689
Mar  4 18:27:49 km38033 kernel: [33133.845522] ABORT_TASK: Found referenced iSCSI task_tag: 157803719
Mar  4 18:27:49 km38033 kernel: [33133.845531] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 286437689
Mar  4 18:27:49 km38033 kernel: [33133.845967] ABORT_TASK: Found referenced iSCSI task_tag: 286437691
Mar  4 18:27:51 km38033 kernel: [33135.654498] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 286437691
Mar  4 18:27:51 km38033 kernel: [33135.654987] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 157803719
Mar  4 18:27:51 km38033 kernel: [33135.655087] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 286437697
Mar  4 18:27:52 km38033 kernel: [33137.235524] ABORT_TASK: Found referenced iSCSI task_tag: 263906810
Mar  4 18:27:52 km38033 kernel: [33137.408856] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 263906810
Mar  4 18:27:52 km38033 kernel: [33137.409000] Detected MISCOMPARE at offset 56
Mar  4 18:27:52 km38033 kernel: [33137.409320] ABORT_TASK: Found referenced iSCSI task_tag: 263906811
Mar  4 18:27:52 km38033 kernel: [33137.409522] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 263906811
Mar  4 18:27:53 km38033 kernel: [33137.584808] Detected MISCOMPARE at offset 56
Mar  4 18:27:53 km38033 kernel: [33137.584843] Detected MISCOMPARE at offset 56
Mar  4 18:27:53 km38033 kernel: [33137.584877] Detected MISCOMPARE at offset 56
Mar  4 18:27:53 km38033 kernel: [33138.289566] Detected MISCOMPARE at offset 56
Mar  4 18:27:53 km38033 kernel: [33138.289721] Detected MISCOMPARE at offset 56
Mar  4 18:27:53 km38033 kernel: [33138.289903] Detected MISCOMPARE at offset 56
Mar  4 18:27:57 km38033 kernel: [33141.720127] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1434712401
Mar  4 18:27:58 km38033 kernel: [33142.940133] Detected MISCOMPARE at offset 56
Mar  4 18:27:59 km38033 kernel: [33144.323760] Detected MISCOMPARE at offset 56
Mar  4 18:27:59 km38033 kernel: [33144.323965] Detected MISCOMPARE at offset 56
Mar  4 18:27:59 km38033 kernel: [33144.498569] Detected MISCOMPARE at offset 56
Mar  4 18:28:01 km38033 kernel: [33146.244787] Detected MISCOMPARE at offset 56
Mar  4 18:28:01 km38033 kernel: [33146.246893] Detected MISCOMPARE at offset 56
Mar  4 18:28:01 km38033 kernel: [33146.246931] Detected MISCOMPARE at offset 56
Mar  4 18:28:02 km38033 kernel: [33147.476180] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus iqn.1998-01.com.vmware:km38116-369efe55,i,0x00023d000001,iqn.2001-09.com.customer:target1,t,0x01
Mar  4 18:39:28 km38033 kernel: [33833.544517] INFO: task iscsi_np:27427 blocked for more than 241 seconds.
Mar  4 18:39:28 km38033 kernel: [33833.544952]       Tainted: G           OE     5.15.0-94-generic #104-Ubuntu
Mar  4 18:39:28 km38033 kernel: [33833.545254] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  4 18:39:28 km38033 kernel: [33833.545547] task:iscsi_np        state:D stack:    0 pid:27427 ppid:     2 flags:0x00004004
Mar  4 18:39:28 km38033 kernel: [33833.545550] Call Trace:
Mar  4 18:39:28 km38033 kernel: [33833.545552]  <TASK>
Mar  4 18:39:28 km38033 kernel: [33833.545554]  __schedule+0x24e/0x590
Mar  4 18:39:28 km38033 kernel: [33833.545560]  schedule+0x69/0x110
Mar  4 18:39:28 km38033 kernel: [33833.545562]  schedule_timeout+0x105/0x140
Mar  4 18:39:28 km38033 kernel: [33833.545564]  ? __kmalloc_track_caller+0x181/0x340
Mar  4 18:39:28 km38033 kernel: [33833.545569]  ? iscsi_update_param_value+0x28/0x70 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545578]  __wait_for_common+0xae/0x150
Mar  4 18:39:28 km38033 kernel: [33833.545579]  ? usleep_range_state+0x90/0x90
Mar  4 18:39:28 km38033 kernel: [33833.545581]  wait_for_completion+0x24/0x30
Mar  4 18:39:28 km38033 kernel: [33833.545582]  iscsit_stop_session+0x18f/0x1b0 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545589]  iscsi_check_for_session_reinstatement+0x1df/0x280 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545596]  iscsi_target_handle_csg_one+0x128/0x150 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545603]  iscsi_target_do_login+0x71/0xd0 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545609]  iscsi_target_start_negotiation+0x56/0x100 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545615]  __iscsi_target_login_thread+0x2f0/0x620 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545621]  iscsi_target_login_thread+0x24/0x60 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545627]  ? __iscsi_target_login_thread+0x620/0x620 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545633]  kthread+0x12a/0x150
Mar  4 18:39:28 km38033 kernel: [33833.545636]  ? set_kthread_struct+0x50/0x50
Mar  4 18:39:28 km38033 kernel: [33833.545638]  ret_from_fork+0x22/0x30
Mar  4 18:39:28 km38033 kernel: [33833.545641]  </TASK>
Mar  4 18:41:29 km38033 kernel: [33954.376305] INFO: task iscsi_trx:588744 blocked for more than 120 seconds.
Mar  4 18:41:29 km38033 kernel: [33954.376560]       Tainted: G           OE     5.15.0-94-generic #104-Ubuntu
Mar  4 18:41:29 km38033 kernel: [33954.376811] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  4 18:41:29 km38033 kernel: [33954.377065] task:iscsi_trx       state:D stack:    0 pid:588744 ppid:     2 flags:0x00004004
Mar  4 18:41:29 km38033 kernel: [33954.377067] Call Trace:
Mar  4 18:41:29 km38033 kernel: [33954.377067]  <TASK>
Mar  4 18:41:29 km38033 kernel: [33954.377068]  __schedule+0x24e/0x590
Mar  4 18:41:29 km38033 kernel: [33954.377070]  schedule+0x69/0x110
Mar  4 18:41:29 km38033 kernel: [33954.377071]  schedule_timeout+0x87/0x140
Mar  4 18:41:29 km38033 kernel: [33954.377072]  ? __bpf_trace_tick_stop+0x20/0x20
Mar  4 18:41:29 km38033 kernel: [33954.377075]  __wait_for_common+0xae/0x150
Mar  4 18:41:29 km38033 kernel: [33954.377076]  ? usleep_range_state+0x90/0x90
Mar  4 18:41:29 km38033 kernel: [33954.377078]  wait_for_completion_timeout+0x1d/0x30
Mar  4 18:41:29 km38033 kernel: [33954.377079]  __transport_wait_for_tasks+0xd3/0x140 [target_core_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377088]  transport_generic_free_cmd+0x14c/0x190 [target_core_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377096]  iscsit_free_cmd+0x58/0xd0 [iscsi_target_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377103]  iscsit_release_commands_from_conn+0x1cd/0x210 [iscsi_target_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377109]  iscsit_close_connection+0x3bf/0x720 [iscsi_target_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377115]  iscsit_take_action_for_connection_exit+0x86/0x110 [iscsi_target_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377122]  iscsi_target_rx_thread+0xba/0x110 [iscsi_target_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377128]  ? iscsi_target_tx_thread+0x220/0x220 [iscsi_target_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377134]  kthread+0x12a/0x150
Mar  4 18:41:29 km38033 kernel: [33954.377135]  ? set_kthread_struct+0x50/0x50
Mar  4 18:41:29 km38033 kernel: [33954.377137]  ret_from_fork+0x22/0x30
Mar  4 18:41:29 km38033 kernel: [33954.377139]  </TASK>

Many thanks in advance.

Best regards,
Holger Amberg



* Re: Unable to recover from DataOut timeout while in ERL=0
  2024-03-05 10:36 Unable to recover from DataOut timeout while in ERL=0 Holger Amberg
@ 2024-03-05 16:55 ` Mike Christie
  0 siblings, 0 replies; 7+ messages in thread
From: Mike Christie @ 2024-03-05 16:55 UTC (permalink / raw)
  To: Holger Amberg, target-devel

On 3/5/24 4:36 AM, Holger Amberg wrote:
> * Kernel: 5.15.0-94-generic x86_64


> Mar  4 18:28:02 km38033 kernel: [33147.476180] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus iqn.1998-01.com.vmware:km38116-369efe55,i,0x00023d000001,iqn.2001-09.com.customer:target1,t,0x01

This should be fixed in the current kernels with this patch:

Author: Dmitry Bogdanov <d.bogdanov@yadro.com>
Date:   Sat Mar 18 20:56:20 2023 -0500

    scsi: target: iscsi: Handle abort for WRITE_PENDING cmds

For upstream kernels I think it was merged around 6.3.
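
If you want to double check whether a particular tree already carries it,
grepping the git log for the commit subject is usually enough - a rough
sketch, run against whichever tree/branch you actually care about (for the
Ubuntu 5.15 kernel that would be Ubuntu's kernel git, to see whether they
backported it):

    # inside a kernel source tree, on the branch or tag in question
    git log --oneline --grep='Handle abort for WRITE_PENDING cmds'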



* Re: Unable to recover from DataOut timeout while in ERL=0
  2022-07-26 11:12     ` Dmitry Bogdanov
@ 2022-07-26 11:21       ` Nick Couchman
  0 siblings, 0 replies; 7+ messages in thread
From: Nick Couchman @ 2022-07-26 11:21 UTC (permalink / raw)
  To: Dmitry Bogdanov; +Cc: target-devel

On Tue, Jul 26, 2022 at 7:13 AM Dmitry Bogdanov <d.bogdanov@yadro.com> wrote:
>
> >
> > Is this a bug that needs to be raised with VMware? Or is patching the
> > Linux Target driver really the way to go? I'm happy to put in a case
> > with VMware if that's desirable.
>
> Most likely such behavior of ESXi is allowed by the RFC. Even the Linux
> iSCSI initiator does the same - it stops sending DataOut PDUs for an
> aborted command.

Ah, okay, makes sense.

>
> > > > I'm wondering if anyone can provide tips on how best to mitigate
> > > > this - is there any tuning that can change the timeout or throttle
> > > > the iSCSI traffic, or is it indicative of a lack of available RAM for
> > > > buffering (I'm not seeing a lot of RAM pressure, but it's possible
> > > > I'm just not catching it)?
> > > >
> > > I may just send you a patch for the target that fixes the hang. ESXi
> > > will reconnect to the target and continue working with it without a
> > > reboot.
> > >
> >
> > I got the patch - I had to tweak it a bit for the CentOS Stream 8
> > kernel I'm running against, but I've added it to the RPM and am
> > rebuilding the packages, now. Hopefully will get it tested in the next
> > couple of days.
>
> Did you manage to test the patch? Could you please give some feedback?

Not yet - I have Secure Boot enabled on most of the systems where I'm
currently running it, which makes compiling a custom kernel a bit more
difficult. I'll try turning Secure Boot off on one of those systems and
give it a shot again. Hopefully I'll have some news for you soon.
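
The other option I may look at is leaving Secure Boot on and enrolling a
local MOK so the rebuilt module can be signed - roughly the following, if
I understand the shim/mokutil flow correctly (just a sketch; the key names
and module path are placeholders):

    # check whether Secure Boot is actually enforcing
    mokutil --sb-state

    # generate a local signing key/cert and enroll the cert as a MOK
    # (prompts for a password that has to be confirmed at the next boot)
    openssl req -new -x509 -newkey rsa:2048 -nodes -days 3650 \
        -subj "/CN=Local module signing/" \
        -keyout MOK.priv -outform DER -out MOK.der
    mokutil --import MOK.der

    # sign the rebuilt module before installing it
    /usr/src/kernels/$(uname -r)/scripts/sign-file sha256 \
        MOK.priv MOK.der iscsi_target_mod.ko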

-Nick


* Re: Unable to recover from DataOut timeout while in ERL=0
  2022-07-14  1:47   ` Nick Couchman
@ 2022-07-26 11:12     ` Dmitry Bogdanov
  2022-07-26 11:21       ` Nick Couchman
  0 siblings, 1 reply; 7+ messages in thread
From: Dmitry Bogdanov @ 2022-07-26 11:12 UTC (permalink / raw)
  To: Nick Couchman; +Cc: target-devel

Hi Nick,

On Wed, Jul 13, 2022 at 09:47:40PM -0400, Nick Couchman wrote:
> 
> On Wed, Jul 13, 2022 at 4:40 PM Dmitry Bogdanov <d.bogdanov@yadro.com> wrote:
> >
> > Hi Nick,
> >
> > On Wed, Jul 13, 2022 at 03:04:12PM -0400, Nick Couchman wrote:
> > >
> > > (Apologies if this ends up as a double-post, re-sending in Plain Text Mode)
> > >
> > > Hello, everyone,
> > > Hopefully this is the correct place to ask a general
> > > usage/troubleshooting question regarding the Linux Target iSCSI
> > > system.
> > >
> > > I'm using the Linux iSCSI target on a pair of CentOS 8 Stream VMs that
> > > are configured with DRBD to synchronize data between two ESXi hosts,
> > > and then present that disk back to the ESXi hosts via iSCSI. Basically
> > > I'm attempting to achieve a vSAN-like configuration, where I have
> > > "shared storage" backed by the underlying physical storage of the
> > > individual hosts.
> > >
> > > It's worth noting that, at present, I'm not using an Active/Active
> > > configuration (DRBD dual-primary), but each of the VMs has the DRBD
> > > configuration and iSCSI configuration, and I can fail the primary and
> > > iSCSI service back and forth between the nodes.
> > >
> > > I'm running into a situation where, once I get the system under
> > > moderate I/O load (installing Linux in another VM, for example), I
> > > start seeing the following errors in dmesg and/or journalctl on the
> > > active node:
> > >
> > > Unable to recover from DataOut timeout while in ERL=0, closing iSCSI
> > > connection for I_T Nexus
> > > iqn.1998-01.com.vmware:esx01-18f91cf9,i,0x00023d000001,iqn.1902-01.com.example.site:drbd1,t,0x01
> > >
> > > This gets repeated a couple of dozen or so times, and then I/O to the
> > > iSCSI LUN from the ESXi host halts, the path to the LUN shows as
> > > "Dead", and I have to reboot the active node and fail over to the
> > > other node, at which point VMware picks back up and continues.
> > >
> > > I've searched around the web to try to find assistance with this
> > > error, but it doesn't seem all that common - in one case it appears to
> > > be a bug from several years ago that was patched, and beyond that not
> > > much relevant has turned up. Based on the error message, it almost
> > > seems as if the target system is trying to say that it couldn't write
> > > its data out to the disk in a timely fashion (which might be because
> > > DRBD can't sync as quickly as is expected?), but it isn't all that
> > > clear from the error.
> > We have been encountering the same issue with ESXi. For some reason it
> > may stop sending the I/O data (iSCSI DataOut PDUs) for a SCSI WRITE
> > command it has already issued and instead send an ABORT for that
> > command. The Linux Target Core does not abort a SCSI command for which
> > it has not yet collected all of the I/O data, so the iSCSI DataOut
> > timer expires and triggers a connection reinstatement.
> > But during that reinstatement iSCSI hangs waiting for the aborted WRITE
> > command to complete. The unfinished logout prevents a new login from
> > the same initiator, and only a target reboot resolves that condition.
> 
> Is this a bug that needs to be raised with VMware? Or is patching the
> Linux Target driver really the way to go? I'm happy to put in a case
> with VMware if that's desirable.

Most likely such behavior of ESXi is allowed by the RFC. Even the Linux
iSCSI initiator does the same - it stops sending DataOut PDUs for an
aborted command.

> > > I'm wondering if anyone can provide tips on how best to mitigate
> > > this - is there any tuning that can change the timeout or throttle
> > > the iSCSI traffic, or is it indicative of a lack of available RAM for
> > > buffering (I'm not seeing a lot of RAM pressure, but it's possible
> > > I'm just not catching it)?
> > >
> > I may just send you a patch for the target that fixes the hang. ESXi
> > will reconnect to the target and continue working with it without a
> > reboot.
> >
> 
> I got the patch - I had to tweak it a bit for the CentOS Stream 8
> kernel I'm running against, but I've added it to the RPM and am
> rebuilding the packages, now. Hopefully will get it tested in the next
> couple of days.

Did you manage to test the patch? Could you please give some feedback?

BR,
 Dmitry


* Re: Unable to recover from DataOut timeout while in ERL=0
  2022-07-13 20:40 ` Dmitry Bogdanov
@ 2022-07-14  1:47   ` Nick Couchman
  2022-07-26 11:12     ` Dmitry Bogdanov
  0 siblings, 1 reply; 7+ messages in thread
From: Nick Couchman @ 2022-07-14  1:47 UTC (permalink / raw)
  To: Dmitry Bogdanov; +Cc: target-devel

On Wed, Jul 13, 2022 at 4:40 PM Dmitry Bogdanov <d.bogdanov@yadro.com> wrote:
>
> Hi Nick,
>
> On Wed, Jul 13, 2022 at 03:04:12PM -0400, Nick Couchman wrote:
> >
> > (Apologies if this ends up as a double-post, re-sending in Plain Text Mode)
> >
> > Hello, everyone,
> > Hopefully this is the correct place to ask a general
> > usage/troubleshooting question regarding the Linux Target iSCSI
> > system.
> >
> > I'm using the Linux iSCSI target on a pair of CentOS 8 Stream VMs that
> > are configured with DRBD to synchronize data between two ESXi hosts,
> > and then present that disk back to the ESXi hosts via iSCSI. Basically
> > I'm attempting to achieve a vSAN-like configuration, where I have
> > "shared storage" backed by the underlying physical storage of the
> > individual hosts.
> >
> > It's worth noting that, at present, I'm not using an Active/Active
> > configuration (DRBD dual-primary), but each of the VMs has the DRBD
> > configuration and iSCSI configuration, and I can fail the primary and
> > iSCSI service back and forth between the nodes.
> >
> > I'm running into a situation where, once I get the system under
> > moderate I/O load (installing Linux in another VM, for example), I
> > start seeing the following errors in dmesg and/or journalctl on the
> > active node:
> >
> > Unable to recover from DataOut timeout while in ERL=0, closing iSCSI
> > connection for I_T Nexus
> > iqn.1998-01.com.vmware:esx01-18f91cf9,i,0x00023d000001,iqn.1902-01.com.example.site:drbd1,t,0x01
> >
> > This gets repeated a couple of dozen or so times, and then I/O to the
> > iSCSI LUN from the ESXi host halts, the path to the LUN shows as
> > "Dead", and I have to reboot the active node and fail over to the
> > other node, at which point VMware picks back up and continues.
> >
> > I've searched around the web to try to find assistance with this
> > error, but it doesn't seem all that common - in one case it appears to
> > be a bug from several years ago that was patched, and beyond that not
> > much relevant has turned up. Based on the error message, it almost
> > seems as if the target system is trying to say that it couldn't write
> > its data out to the disk in a timely fashion (which might be because
> > DRBD can't sync as quickly as is expected?), but it isn't all that
> > clear from the error.
> We have been encountering the same issue with ESXi. For some reason it
> may stop sending the I/O data (iSCSI DataOut PDUs) for a SCSI WRITE
> command it has already issued and instead send an ABORT for that
> command. The Linux Target Core does not abort a SCSI command for which
> it has not yet collected all of the I/O data, so the iSCSI DataOut
> timer expires and triggers a connection reinstatement.
> But during that reinstatement iSCSI hangs waiting for the aborted WRITE
> command to complete. The unfinished logout prevents a new login from
> the same initiator, and only a target reboot resolves that condition.

Is this a bug that needs to be raised with VMware? Or is patching the
Linux Target driver really the way to go? I'm happy to put in a case
with VMware if that's desirable.

>
> >
> > I'm wondering if anyone can provide tips on how best to mitigate
> > this - is there any tuning that can change the timeout or throttle
> > the iSCSI traffic, or is it indicative of a lack of available RAM for
> > buffering (I'm not seeing a lot of RAM pressure, but it's possible
> > I'm just not catching it)?
> >
> I may just send you a patch for the target that fixes the hang. ESXi
> will reconnect to the target and continue working with it without a
> reboot.
>

I got the patch - I had to tweak it a bit for the CentOS Stream 8
kernel I'm running against, but I've added it to the RPM and am
rebuilding the packages, now. Hopefully will get it tested in the next
couple of days.

Thanks - Nick


* Re: Unable to recover from DataOut timeout while in ERL=0
  2022-07-13 19:04 Nick Couchman
@ 2022-07-13 20:40 ` Dmitry Bogdanov
  2022-07-14  1:47   ` Nick Couchman
  0 siblings, 1 reply; 7+ messages in thread
From: Dmitry Bogdanov @ 2022-07-13 20:40 UTC (permalink / raw)
  To: Nick Couchman; +Cc: target-devel

Hi Nick,

On Wed, Jul 13, 2022 at 03:04:12PM -0400, Nick Couchman wrote:
> 
> (Apologies if this ends up as a double-post, re-sending in Plain Text Mode)
> 
> Hello, everyone,
> Hopefully this is the correct place to ask a general
> usage/troubleshooting question regarding the Linux Target iSCSI
> system.
> 
> I'm using the Linux iSCSI target on a pair of CentOS 8 Stream VMs that
> are configured with DRBD to synchronize data between two ESXi hosts,
> and then present that disk back to the ESXi hosts via iSCSI. Basically
> I'm attempting to achieve a vSAN-like configuration, where I have
> "shared storage" backed by the underlying physical storage of the
> individual hosts.
> 
> It's worth noting that, at present, I'm not using an Active/Active
> configuration (DRBD dual-primary), but each of the VMs has the DRBD
> configuration and iSCSI configuration, and I can fail the primary and
> iSCSI service back and forth between the nodes.
> 
> I'm running into a situation where, once I get the system under
> moderate I/O load (installing Linux in another VM, for example), I
> start seeing the following errors in dmesg and/or journalctl on the
> active node:
> 
> Unable to recover from DataOut timeout while in ERL=0, closing iSCSI
> connection for I_T Nexus
> iqn.1998-01.com.vmware:esx01-18f91cf9,i,0x00023d000001,iqn.1902-01.com.example.site:drbd1,t,0x01
> 
> This gets repeated a couple of dozen or so times, and then I/O to the
> iSCSI LUN from the ESXi host halts, the path to the LUN shows as
> "Dead", and I have to reboot the active node and fail over to the
> other node, at which point VMware picks back up and continues.
> 
> I've searched around the web to try to find assistance with this
> error, but it doesn't seem all that common - in one case it appears to
> be a bug from several years ago that was patched, and beyond that not
> much relevant has turned up. Based on the error message, it almost
> seems as if the target system is trying to say that it couldn't write
> its data out to the disk in a timely fashion (which might be because
> DRBD can't sync as quickly as is expected?), but it isn't all that
> clear from the error.
We have been encountering the same issue with ESXi. For some reason it
may stop sending the I/O data (iSCSI DataOut PDUs) for a SCSI WRITE
command it has already issued and instead send an ABORT for that
command. The Linux Target Core does not abort a SCSI command for which
it has not yet collected all of the I/O data, so the iSCSI DataOut
timer expires and triggers a connection reinstatement.
But during that reinstatement iSCSI hangs waiting for the aborted WRITE
command to complete. The unfinished logout prevents a new login from
the same initiator, and only a target reboot resolves that condition.
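
You can see that state on a live target before the hung-task watchdog
fires - the affected iSCSI threads sit in uninterruptible sleep. Something
along these lines should show them (a rough sketch; reading the kernel
stacks needs root):

    # iSCSI target threads stuck in D state
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/ && $4 ~ /iscsi/'

    # kernel stack of one of the stuck threads
    cat /proc/<pid>/stack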

> 
> I'm wondering if anyone can provide tips on how best to mitigate
> this - is there any tuning that can change the timeout or throttle
> the iSCSI traffic, or is it indicative of a lack of available RAM for
> buffering (I'm not seeing a lot of RAM pressure, but it's possible
> I'm just not catching it)?
> 
I may just send you a patch for the target that fixes the hang. ESXi
will reconnect to the target and continue working with it without a
reboot.
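
As for the timeout tuning you asked about: if I remember correctly the
DataOut timer is exposed as a per-initiator (NodeACL) attribute in
configfs, so you can at least experiment with raising it - a sketch; the
IQNs are placeholders, it assumes an explicitly configured ACL, and new
values apply to new sessions:

    ACL=/sys/kernel/config/target/iscsi/<target-iqn>/tpgt_1/acls/<initiator-iqn>/attrib
    cat $ACL/dataout_timeout           # seconds before the DataOut timer fires
    cat $ACL/dataout_timeout_retries   # expirations tolerated before recovery
    echo 10 > $ACL/dataout_timeout

But that only postpones the connection reinstatement; it does not avoid
the hang once the abort has arrived.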

> Environment:
> * CentOS 8 Stream
> * Kernel: 4.18.0-394.el8.x86_64
> * DRBD: 9.1.7
> * 2 CPUs, 4 GB of RAM per VM
> * Shared block device is 1 TB
> 
> Thanks - Nick




* Unable to recover from DataOut timeout while in ERL=0
@ 2022-07-13 19:04 Nick Couchman
  2022-07-13 20:40 ` Dmitry Bogdanov
  0 siblings, 1 reply; 7+ messages in thread
From: Nick Couchman @ 2022-07-13 19:04 UTC (permalink / raw)
  To: target-devel

(Apologies if this ends up as a double-post, re-sending in Plain Text Mode)

Hello, everyone,
Hopefully this is the correct place to ask a general
usage/troubleshooting question regarding the Linux Target iSCSI
system.

I'm using the Linux iSCSI target on a pair of CentOS 8 Stream VMs that
are configured with DRBD to synchronize data between two ESXi hosts,
and then present that disk back to the ESXi hosts via iSCSI. Basically
I'm attempting to achieve a vSAN-like configuration, where I have
"shared storage" backed by the underlying physical storage of the
individual hosts.

It's worth noting that, at present, I'm not using an Active/Active
configuration (DRBD dual-primary), but each of the VMs has the DRBD
configuration and iSCSI configuration, and I can fail the primary and
iSCSI service back and forth between the nodes.

I'm running into a situation where, once I get the system under
moderate I/O load (installing Linux in another VM, for example), I
start seeing the following errors in dmesg and/or journalctl on the
active node:

Unable to recover from DataOut timeout while in ERL=0, closing iSCSI
connection for I_T Nexus
iqn.1998-01.com.vmware:esx01-18f91cf9,i,0x00023d000001,iqn.1902-01.com.example.site:drbd1,t,0x01

This gets repeated a couple of dozen or so times, and then I/O to the
iSCSI LUN from the ESXi host halts, the path to the LUN shows as
"Dead", and I have to reboot the active node and fail over to the
other node, at which point VMware picks back up and continues.

I've searched around the web to try to find assistance with this
error, but it doesn't seem all that common - in one case it appears to
be a bug from several years ago that was patched, and beyond that not
much relevant has turned up. Based on the error message, it almost
seems as if the target system is trying to say that it couldn't write
its data out to the disk in a timely fashion (which might be because
DRBD can't sync as quickly as is expected?), but it isn't all that
clear from the error.

I'm wondering if anyone can provide tips on how best to mitigate
this - is there any tuning that can change the timeout or throttle
the iSCSI traffic, or is it indicative of a lack of available RAM for
buffering (I'm not seeing a lot of RAM pressure, but it's possible
I'm just not catching it)?

Environment:
* CentOS 8 Stream
* Kernel: 4.18.0-394.el8.x86_64
* DRBD: 9.1.7
* 2 CPUs, 4 GB of RAM per VM
* Shared block device is 1 TB

Thanks - Nick

