* Need some pointers to debug a target hang
@ 2016-09-02 14:14 Johannes Thumshirn
  2016-10-18  5:57 ` Nicholas A. Bellinger
  0 siblings, 1 reply; 11+ messages in thread
From: Johannes Thumshirn @ 2016-09-02 14:14 UTC (permalink / raw)
  To: Nicholas A. Bellinger; +Cc: linux-scsi, target-devel

Hi Nick et al,

I'm having an "interesting" problem with the kernel's iSCSI target and
could use a debugging hint.

My target uses an iblock backstore on a dm-linear target. When I now
get I/O from the initiator (I used a simple dd if=/dev/sda
of=/dev/null) and call 'dmsetup suspend $backstore', it takes about
15 seconds for the iscsi_ttx kernel thread to disappear, and the
iscsi_trx and iscsi_np threads hang in 'D' state.

From iscsi_trx's stack I see it's waiting in
__transport_wait_for_tasks(). The last thing I see in dmesg is the
'ABORT_TASK: Found referenced %s task_tag: %llu' printk, but the
'ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: %llu' printk
from core_tmr_abort_task() is missing. As there's a
transport_wait_for_tasks() call in between, I _think_ it is stuck
aborting this one task and none of the
complete(&se_cmd->t_transport_stop_comp) callers is reached. What
puzzles me a bit is that right after transport_wait_for_tasks() in
core_tmr_abort_task() there's a call to transport_cmd_finish_abort(),
which in turn calls transport_cmd_check_stop_to_fabric() ->
transport_cmd_check_stop() ->
complete_all(&cmd->t_transport_stop_comp).

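For reference, here is a minimal, self-contained sketch of the completion
handshake described above (the struct and function names are made up for
illustration and are not the actual target-core code): one side blocks in
wait_for_completion() on the stop completion, while the command-stop path
is expected to wake it with complete_all().

#include <linux/completion.h>

struct demo_cmd {
	unsigned int transport_state;	/* e.g. CMD_T_ACTIVE | CMD_T_STOP */
	struct completion t_stop_comp;	/* stands in for t_transport_stop_comp */
};

static void demo_cmd_init(struct demo_cmd *cmd)
{
	init_completion(&cmd->t_stop_comp);
}

/* TMR/abort side: blocks (uninterruptible, 'D' state) until the cmd stops. */
static void demo_wait_for_stop(struct demo_cmd *cmd)
{
	wait_for_completion(&cmd->t_stop_comp);
}

/* Command-stop side: wakes every waiter once the command is quiesced. */
static void demo_signal_stopped(struct demo_cmd *cmd)
{
	complete_all(&cmd->t_stop_comp);
}

If demo_signal_stopped() is never reached for the aborted command,
demo_wait_for_stop() blocks forever, which matches the iscsi_trx thread
being stuck in __transport_wait_for_tasks().
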
Doing 

--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -2739,7 +2739,7 @@ __transport_wait_for_tasks(struct se_cmd
 
        spin_unlock_irqrestore(&cmd->t_state_lock, *flags);
 
-       wait_for_completion(&cmd->t_transport_stop_comp);
+       wait_for_completion_interruptible_timeout(&cmd->t_transport_stop_comp, 5 * HZ);
 
        spin_lock_irqsave(&cmd->t_state_lock, *flags);
        cmd->transport_state &= ~(CMD_T_ACTIVE | CMD_T_STOP);

"resolves" the bug, but I don't think this is correct.

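For completeness, if a timed wait were used even just as an experiment, the
return value would also need checking; a rough sketch of the wait site (not
a proposed fix):

	long ret;

	ret = wait_for_completion_interruptible_timeout(&cmd->t_transport_stop_comp,
							5 * HZ);
	if (!ret)
		pr_debug("timed out waiting for t_transport_stop_comp\n");
	else if (ret < 0)
		pr_debug("wait interrupted: %ld\n", ret);
	/* ret > 0 means the completion arrived; ret is the remaining jiffies. */
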
This is all easily reproducible with v4.8-rc4 in qemu (for instance).

Any advice is appreciated.

Thanks in advance,
       Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* Re: Need some pointers to debug a target hang
  2016-09-02 14:14 Need some pointers to debug a target hang Johannes Thumshirn
@ 2016-10-18  5:57 ` Nicholas A. Bellinger
  2016-10-18  7:01   ` Johannes Thumshirn
  0 siblings, 1 reply; 11+ messages in thread
From: Nicholas A. Bellinger @ 2016-10-18  5:57 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-scsi, target-devel

Hello Johannes,

Apologies for the long-delayed follow-up on this bug report.

On Fri, 2016-09-02 at 16:14 +0200, Johannes Thumshirn wrote:
> Hi Nick et al,
> 
> I'm having an "interesting" problem with the kernel's iSCSI target and
> could use a debugging hint.
> 
> My target uses an iblock backstore on a dm-linear target. When I now
> get I/O from the initiator (I used a simple dd if=/dev/sda
> of=/dev/null) and call 'dmsetup suspend $backstore', it takes about
> 15 seconds for the iscsi_ttx kernel thread to disappear, and the
> iscsi_trx and iscsi_np threads hang in 'D' state.
> 
> From iscsi_trx's stack I see it's waiting in
> __transport_wait_for_tasks(). The last thing I see in dmesg is the
> 'ABORT_TASK: Found referenced %s task_tag: %llu' printk, but the
> 'ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: %llu' printk
> from core_tmr_abort_task() is missing. As there's a
> transport_wait_for_tasks() call in between, I _think_ it is stuck
> aborting this one task and none of the
> complete(&se_cmd->t_transport_stop_comp) callers is reached. What
> puzzles me a bit is that right after transport_wait_for_tasks() in
> core_tmr_abort_task() there's a call to transport_cmd_finish_abort(),
> which in turn calls transport_cmd_check_stop_to_fabric() ->
> transport_cmd_check_stop() ->
> complete_all(&cmd->t_transport_stop_comp).
> 
> Doing 
> 
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -2739,7 +2739,7 @@ __transport_wait_for_tasks(struct se_cmd
>  
>         spin_unlock_irqrestore(&cmd->t_state_lock, *flags);
>  
> -       wait_for_completion(&cmd->t_transport_stop_comp);
> +       wait_for_completion_interruptible_timeout(&cmd->t_transport_stop_comp, 5 * HZ);
>  
>         spin_lock_irqsave(&cmd->t_state_lock, *flags);
>         cmd->transport_state &= ~(CMD_T_ACTIVE | CMD_T_STOP);
> 
> "resolves" the bug, but I don't think this is correct.
> 
> This is all easily reproducible with v4.8-rc4 in qemu (for instance).
> 
> Any advice is appreciated.
> 

This is likely the missing SCF_ACK_KREF assignment in >= v4.1.y:

http://www.spinics.net/lists/target-devel/msg13530.html

At your earliest convenience, please verify with this patch the scenario
of a TMR ABORT_TASK issued while target-core backend I/O is still
outstanding, combined with failed iscsi session reinstatement and the
resulting repeated iscsi login timeouts.

Also once target-core backend I/O has (finally) been completed back to
fabric driver code, the iscsi_np configfs group shutdown is allowed to
proceed.
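
For reference, a minimal sketch of the kref pattern the SCF_ACK_KREF /
cmd_kref discussion refers to; the names below are illustrative only and
not the actual target-core implementation:

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct demo_se_cmd {
	struct kref cmd_kref;
	bool ack_kref;		/* extra reference held until the fabric acks */
};

static void demo_se_cmd_release(struct kref *kref)
{
	struct demo_se_cmd *cmd = container_of(kref, struct demo_se_cmd, cmd_kref);

	kfree(cmd);
}

static struct demo_se_cmd *demo_se_cmd_alloc(void)
{
	struct demo_se_cmd *cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);

	if (cmd)
		kref_init(&cmd->cmd_kref);	/* reference count starts at 1 */
	return cmd;
}

/* Submission path: take the extra reference that the ack path will drop. */
static void demo_se_cmd_set_ack_kref(struct demo_se_cmd *cmd)
{
	if (!cmd->ack_kref) {
		kref_get(&cmd->cmd_kref);
		cmd->ack_kref = true;
	}
}

/* Fabric ack path: drop the extra reference taken above. */
static void demo_se_cmd_ack(struct demo_se_cmd *cmd)
{
	if (cmd->ack_kref)
		kref_put(&cmd->cmd_kref, demo_se_cmd_release);
}

The real target-core semantics are more involved; this only illustrates the
reference counting the flag is meant to keep balanced.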


* Re: Need some pointers to debug a target hang
  2016-10-18  5:57 ` Nicholas A. Bellinger
@ 2016-10-18  7:01   ` Johannes Thumshirn
  2016-10-18 17:35     ` Johannes Thumshirn
  0 siblings, 1 reply; 11+ messages in thread
From: Johannes Thumshirn @ 2016-10-18  7:01 UTC (permalink / raw)
  To: Nicholas A. Bellinger; +Cc: linux-scsi, target-devel, Robert LeBlanc

[+Cc Robert LeBlanc <robert@leblancnet.us> and full quote for context ]

On Mon, Oct 17, 2016 at 10:57:55PM -0700, Nicholas A. Bellinger wrote:
> Hello Johannes,
> 
> Apologies for the long-delayed follow-up on this bug report.
> 
> On Fri, 2016-09-02 at 16:14 +0200, Johannes Thumshirn wrote:
> > Hi Nick et al,
> > 
> > I'm having an "interesting" problem with the kernel's iSCSI target and
> > could use a debugging hint.
> > 
> > My target uses an iblock backstore on a dm-linear target. When I now
> > get I/O from the initiator (I used a simple dd if=/dev/sda
> > of=/dev/null) and call 'dmsetup suspend $backstore', it takes about
> > 15 seconds for the iscsi_ttx kernel thread to disappear, and the
> > iscsi_trx and iscsi_np threads hang in 'D' state.
> > 
> > From iscsi_trx's stack I see it's waiting in
> > __transport_wait_for_tasks(). The last thing I see in dmesg is the
> > 'ABORT_TASK: Found referenced %s task_tag: %llu' printk, but the
> > 'ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: %llu' printk
> > from core_tmr_abort_task() is missing. As there's a
> > transport_wait_for_tasks() call in between, I _think_ it is stuck
> > aborting this one task and none of the
> > complete(&se_cmd->t_transport_stop_comp) callers is reached. What
> > puzzles me a bit is that right after transport_wait_for_tasks() in
> > core_tmr_abort_task() there's a call to transport_cmd_finish_abort(),
> > which in turn calls transport_cmd_check_stop_to_fabric() ->
> > transport_cmd_check_stop() ->
> > complete_all(&cmd->t_transport_stop_comp).
> > 
> > Doing 
> > 
> > --- a/drivers/target/target_core_transport.c
> > +++ b/drivers/target/target_core_transport.c
> > @@ -2739,7 +2739,7 @@ __transport_wait_for_tasks(struct se_cmd
> >  
> >         spin_unlock_irqrestore(&cmd->t_state_lock, *flags);
> >  
> > -       wait_for_completion(&cmd->t_transport_stop_comp);
> > +       wait_for_completion_interruptible_timeout(&cmd->t_transport_stop_comp, 5 * HZ);
> >  
> >         spin_lock_irqsave(&cmd->t_state_lock, *flags);
> >         cmd->transport_state &= ~(CMD_T_ACTIVE | CMD_T_STOP);
> > 
> > "resolves" the bug, but I don't think this is correct.
> > 
> > This is all easily reproducible with v4.8-rc4 in qemu (for instance).
> > 
> > Any advice is appreciated.
> > 
> 
> This is likely the missing SCF_ACK_KREF assignment in >= v4.1.y:
> 
> http://www.spinics.net/lists/target-devel/msg13530.html
> 
> At your earliest convenience, please verify with this patch the scenario
> of a TMR ABORT_TASK issued while target-core backend I/O is still
> outstanding, combined with failed iscsi session reinstatement and the
> resulting repeated iscsi login timeouts.
> 
> Also once target-core backend I/O has (finally) been completed back to
> fabric driver code, the iscsi_np configfs group shutdown is allowed to
> proceed.
> 

Hi Nic, 

Thanks for the heads up, I'll give it a try.

Robert has sent a similar bug report on
http://www.spinics.net/lists/linux-rdma/msg41296.html so I CCed him as well.

Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* Re: Need some pointers to debug a target hang
  2016-10-18  7:01   ` Johannes Thumshirn
@ 2016-10-18 17:35     ` Johannes Thumshirn
  2016-10-19  6:29       ` Nicholas A. Bellinger
  0 siblings, 1 reply; 11+ messages in thread
From: Johannes Thumshirn @ 2016-10-18 17:35 UTC (permalink / raw)
  To: Nicholas A. Bellinger; +Cc: linux-scsi, target-devel, Robert LeBlanc

On Tue, Oct 18, 2016 at 09:01:34AM +0200, Johannes Thumshirn wrote:

[...]

> > 
> > This is likely the missing SCF_ACK_KREF assignment in >= v4.1.y:
> > 
> > http://www.spinics.net/lists/target-devel/msg13530.html

Sorry to disappoint you, but it didn't fix my issue. Is there any debug data I
can give you, or do you have any advice on where I could start looking?

I have to admit I only tested the patch on our downstream kernel and not the
upstream kernel; I'll repeat the tests on 4.8 final tomorrow.

Thanks,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* Re: Need some pointers to debug a target hang
  2016-10-18 17:35     ` Johannes Thumshirn
@ 2016-10-19  6:29       ` Nicholas A. Bellinger
  2016-10-29 22:53         ` Nicholas A. Bellinger
  0 siblings, 1 reply; 11+ messages in thread
From: Nicholas A. Bellinger @ 2016-10-19  6:29 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-scsi, target-devel, Robert LeBlanc

On Tue, 2016-10-18 at 19:35 +0200, Johannes Thumshirn wrote:
> On Tue, Oct 18, 2016 at 09:01:34AM +0200, Johannes Thumshirn wrote:
> 
> [...]
> 
> > > 
> > > This is likely the missing SCF_ACK_KREF assignment in >= v4.1.y:
> > > 
> > > http://www.spinics.net/lists/target-devel/msg13530.html
> 
> Sorry to disappoint you, but it didn't fix my issue. Is there any debug data I
> can give you, or do you have any advice on where I could start looking?
> 
> I have to admit I only tested the patch on our downstream kernel and not the
> upstream kernel; I'll repeat the tests on 4.8 final tomorrow.
> 

Is it possible to generate vmcore to have a look in crash + gdb..?


* Re: Need some pointers to debug a target hang
  2016-10-19  6:29       ` Nicholas A. Bellinger
@ 2016-10-29 22:53         ` Nicholas A. Bellinger
  2016-10-29 23:03           ` Johannes Thumshirn
  2016-10-30 19:51           ` Johannes Thumshirn
  0 siblings, 2 replies; 11+ messages in thread
From: Nicholas A. Bellinger @ 2016-10-29 22:53 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-scsi, target-devel, Zhu Lingshan

Hi Johannes & Zhu,

On Tue, 2016-10-18 at 23:29 -0700, Nicholas A. Bellinger wrote:
> On Tue, 2016-10-18 at 19:35 +0200, Johannes Thumshirn wrote:
> > On Tue, Oct 18, 2016 at 09:01:34AM +0200, Johannes Thumshirn wrote:
> > 
> > [...]
> > 
> > > > 
> > > > This is likely the missing SCF_ACK_KREF assignment in >= v4.1.y:
> > > > 
> > > > http://www.spinics.net/lists/target-devel/msg13530.html
> > 
> > Sorry to disappoint you, but it didn't fix my issue. Is there any debug data I
> > can give you, or do you have any advice on where I could start looking?
> > 
> > I have to admit I only tested the patch on our downstream kernel and not the
> > upstream kernel; I'll repeat the tests on 4.8 final tomorrow.
> > 
> 
> Is it possible to generate vmcore to have a look in crash + gdb..?
> 
> 

Just curious if you've been able to verify the above patch for
v4.8.y iscsi-target ports on your specific config(s)..?

To confirm, using v4.1.y + patch I've not run into any further
se_cmd->cmd_kref leaks and/or hung tasks with iscsi-target ports due to
high backend device I/O latency, which results in host timeouts + ABORT_TASK
+ session reinstatement while waiting for outstanding se_cmd
backend I/O to complete.


* Re: Need some pointers to debug a target hang
  2016-10-29 22:53         ` Nicholas A. Bellinger
@ 2016-10-29 23:03           ` Johannes Thumshirn
  2016-10-31  8:51             ` Zhu Lingshan
  2016-10-30 19:51           ` Johannes Thumshirn
  1 sibling, 1 reply; 11+ messages in thread
From: Johannes Thumshirn @ 2016-10-29 23:03 UTC (permalink / raw)
  To: Nicholas A. Bellinger; +Cc: linux-scsi, target-devel, Zhu Lingshan

On Sat, Oct 29, 2016 at 03:53:25PM -0700, Nicholas A. Bellinger wrote:
> Hi Johannes & Zhu,
> 
> On Tue, 2016-10-18 at 23:29 -0700, Nicholas A. Bellinger wrote:
> > On Tue, 2016-10-18 at 19:35 +0200, Johannes Thumshirn wrote:
> > > On Tue, Oct 18, 2016 at 09:01:34AM +0200, Johannes Thumshirn wrote:
> > > 
> > > [...]
> > > 
> > > > > 
> > > > > This is likely the missing SCF_ACK_KREF assignment in >= v4.1.y:
> > > > > 
> > > > > http://www.spinics.net/lists/target-devel/msg13530.html
> > > 
> > > Sorry to disappoint you, but it didn't fix my issue. Is there any debug data I
> > > can give you, or do you have any advice on where I could start looking?
> > > 
> > > I have to admit I only tested the patch on our downstream kernel and not the
> > > upstream kernel; I'll repeat the tests on 4.8 final tomorrow.
> > > 
> > 
> > Is it possible to generate vmcore to have a look in crash + gdb..?
> > 
> > 
> 
> Just curious if you've been able to verify the above patch for
> v4.8.y iscsi-target ports on your specific config(s)..?
> 
> To confirm, using v4.1.y + patch I've not run into any further
> se_cmd->cmd_kref leaks and/or hung tasks with iscsi-target ports due to
> high backend device I/O latency, resulting in host timeouts + ABORT_TASK
> + session reinstatement to occur while waiting for outstanding se_cmd
> backend I/O to complete.

Hi Nic,

I'm sorry I haven't gotten around to testing it, but I do have my test VMs with
me and should have time tomorrow to do it. Sorry for not giving you any updates
on this.

Thanks (and sorry),
       Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* Re: Need some pointers to debug a target hang
  2016-10-29 22:53         ` Nicholas A. Bellinger
  2016-10-29 23:03           ` Johannes Thumshirn
@ 2016-10-30 19:51           ` Johannes Thumshirn
  1 sibling, 0 replies; 11+ messages in thread
From: Johannes Thumshirn @ 2016-10-30 19:51 UTC (permalink / raw)
  To: Nicholas A. Bellinger; +Cc: linux-scsi, target-devel, Zhu Lingshan

On Sat, Oct 29, 2016 at 03:53:25PM -0700, Nicholas A. Bellinger wrote:
> Hi Johannes & Zhu,
> 
> On Tue, 2016-10-18 at 23:29 -0700, Nicholas A. Bellinger wrote:
> > On Tue, 2016-10-18 at 19:35 +0200, Johannes Thumshirn wrote:
> > > On Tue, Oct 18, 2016 at 09:01:34AM +0200, Johannes Thumshirn wrote:
> > > 
> > > [...]
> > > 
> > > > > 
> > > > > This is likely the missing SCF_ACK_KREF assignment in >= v4.1.y:
> > > > > 
> > > > > http://www.spinics.net/lists/target-devel/msg13530.html
> > > 
> > > Sorry to disappoint you, but it didn't fix my issue. Is there any debug data I
> > > can give you, or do you have any advice on where I could start looking?
> > > 
> > > I have to admit I only tested the patch on our downstream kernel and not the
> > > upstream kernel; I'll repeat the tests on 4.8 final tomorrow.
> > > 
> > 
> > Is it possible to generate vmcore to have a look in crash + gdb..?
> > 
> > 
> 
> Just curious if you've been able to verify the above patch for
> v4.8.y iscsi-target ports on your specific config(s)..?
> 
> To confirm, using v4.1.y + patch I've not run into any further
> se_cmd->cmd_kref leaks and/or hung tasks with iscsi-target ports due to
> high backend device I/O latency, resulting in host timeouts + ABORT_TASK
> + session reinstatement to occur while waiting for outstanding se_cmd
> backend I/O to complete.

Unfortunately v4.8.5, which has the above patch included (I double-checked to
be sure), shows the same effects.

Here's my "reproducer":

 VM1 (target)                       VM2 (Initiator)
 ============                       ===============
                                    dd if=/dev/sda of=/dev/null bs=1 (given sda is the iscsi LUN)

		Wait a little bit (~10 secs)

 dmsetup suspend dm-0
 watch -n1 'ps aux |\
  grep -E "\[iscsi" | grep -v grep'

I'll build a kernel with all the target code built in and full debug symbols.
Let's see if I can find out something.

Thanks,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* Re: Need some pointers to debug a target hang
  2016-10-29 23:03           ` Johannes Thumshirn
@ 2016-10-31  8:51             ` Zhu Lingshan
  2016-11-04  6:52               ` Nicholas A. Bellinger
  0 siblings, 1 reply; 11+ messages in thread
From: Zhu Lingshan @ 2016-10-31  8:51 UTC (permalink / raw)
  To: Johannes Thumshirn, Nicholas A. Bellinger; +Cc: linux-scsi, target-devel

Hi Nicholas,

(sorry, this will be a long mail)

Sorry for the delay; I spent some time on testing and debugging. I find that
the patch http://www.spinics.net/lists/target-devel/msg13530.html can solve
two issues:
(a) iscsit_stop_session() on top of the iscsi_np stack.
(b) iscsi_check_for_session_reinstatement() on top of the iscsi_np stack;
a reboot is required to recover.

But I also found three more issues.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Please let me explain my setup:
I have a target server, kernel version 4.4.21, with a SATA HDD as a LUN
(only one LUN).
I have two initiators, both logged in to the target.
I created two partitions on the LUN, and each initiator mounts its own
partition: initiator1 mounts /dev/sdc1, initiator2 mounts /dev/sdc2.
It looks like this:
lszhu_DEV:~ # lsscsi
[0:0:0:0]    cd/dvd  HL-DT-ST DVD+-RW GHB0N    A1C0  /dev/sr0
[4:0:0:0]    disk    ATA      TOSHIBA DT01ACA2 MX4O  /dev/sda
[9:0:0:0]    disk    LIO-ORG  IBLOCK           4.0   /dev/sdc
lszhu_DEV:~ # lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0   1.8T  0 disk
├─sda1   8:1    0    16G  0 part [SWAP]
├─sda2   8:2    0   200G  0 part /
└─sda3   8:3    0   1.6T  0 part /home
sdc      8:32   0 465.8G  0 disk
├─sdc1   8:33   0   200G  0 part
└─sdc2   8:34   0 265.8G  0 part /mnt
sr0     11:0    1  1024M  0 rom

bj-ha-5:~ # lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 465.8G  0 disk
├─sda1   8:1    0     8G  0 part [SWAP]
├─sda2   8:2    0   100G  0 part /
└─sda3   8:3    0 357.8G  0 part /home
sdb      8:16   0 465.8G  0 disk
sdc      8:32   0 465.8G  0 disk
├─sdc1   8:33   0   200G  0 part /mnt
└─sdc2   8:34   0 265.8G  0 part
sr0     11:0    1  1024M  0 rom

So you can see that each initiator reads/writes only its own partition
of the LUN.

The partitions are mounted like this:
mount -o dioread_nolock /dev/sdc1 /mnt
The dioread_nolock option helps us reproduce this bug a little quicker.

Then run fio on both initiators like this:
fio -filename=/mnt/file1 -direct=1 -rw=randread -iodepth=64
-ioengine=libaio -bs=64K -size=5G -numjobs=24 -runtime=60000 -time_based
-group_reporting -name=init1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Here are the three more issues I found (kernel version 4.4.21):
(1) The first one looks like this. It is quite rare (seen only once); I failed
to get stack information for iscsi_trx and it is hard to reproduce. The call
stack of iscsi_np is here:

bj-ha-3:~ # ps -aux | grep iscsi
root     10501  0.0  0.0      0     0 ?        D    16:45   0:00 [iscsi_np]
root     10519  0.4  0.0      0     0 ?        S    16:49   0:02 [iscsi_ttx]
root     10520  2.2  0.0      0     0 ?        S    16:49   0:14 [iscsi_trx]
root     10533  0.9  0.0      0     0 ?        D    16:54   0:03 [iscsi_trx]
root     10547  0.0  0.0  10508  1552 pts/0    S+   16:59   0:00 grep 
--color=auto iscsi
bj-ha-3:~ # cat /proc/10501/stack
[<ffffffff815334c9>] inet_csk_accept+0x269/0x2e0
[<ffffffff8155f4aa>] inet_accept+0x2a/0x100
[<ffffffff814d2a88>] kernel_accept+0x48/0xa0
[<ffffffffa06fe871>] iscsit_accept_np+0x31/0x230 [iscsi_target_mod]
[<ffffffffa06fefab>] iscsi_target_login_thread+0xeb/0xfd0 [iscsi_target_mod]
[<ffffffff810996bd>] kthread+0xbd/0xe0
[<ffffffff815e15bf>] ret_from_fork+0x3f/0x70
[<ffffffff81099600>] kthread+0x0/0xe0
[<ffffffffffffffff>] 0xffffffffffffffff
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(2) The second issue I found has iscsi_check_for_session_reinstatement()
on top of the iscsi_np call stack, but it is different from case (b)
mentioned before. In case (b), we must reboot the target to recover, but
in this case it will auto-recover. Here are the details:
dmesg:
[  100.487421] iSCSI/iqn.1996-04.de.suse:01:faad5846cde9: Unsupported 
SCSI Opcode 0xa3, sending CHECK_CONDITION.
[  182.582323] ABORT_TASK: Found referenced iSCSI task_tag: 15
[  197.616800] Unexpected ret: -32 send data 48
[  197.712278] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 15
[  263.603305] ABORT_TASK: Found referenced iSCSI task_tag: 268435537
[  278.640172] Unexpected ret: -32 send data 48
[  278.671299] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 
268435537
[  341.584329] ABORT_TASK: Found referenced iSCSI task_tag: 536871038
[  357.423710] Unexpected ret: -32 send data 48
[  373.660476] iSCSI Login timeout on Network Portal 192.168.100.233:3260

call stacks:
bj-ha-3:~ # ps -aux | grep iscsi
root      3063  0.0  0.0      0     0 ?        D    15:36   0:00 [iscsi_np]
root      3073  0.3  0.0      0     0 ?        S    15:36   0:01 [iscsi_ttx]
root      3074  0.4  0.0      0     0 ?        S    15:36   0:01 [iscsi_trx]
root      3094  0.3  0.0      0     0 ?        D    15:39   0:00 [iscsi_trx]
root      3116  0.0  0.0  10508  1616 pts/0    S+   15:42   0:00 grep 
--color=auto iscsi
bj-ha-3:~ # cat /proc/3063/stack
[<ffffffffa03fcd97>] iscsi_check_for_session_reinstatement+0x1d7/0x270 
[iscsi_target_mod]
[<ffffffffa03ff971>] iscsi_target_do_login+0x111/0x5f0 [iscsi_target_mod]
[<ffffffffa0400b17>] iscsi_target_start_negotiation+0x17/0xa0 
[iscsi_target_mod]
[<ffffffffa03fe932>] iscsi_target_login_thread+0xa72/0xfd0 
[iscsi_target_mod]
[<ffffffff810997cd>] kthread+0xbd/0xe0
[<ffffffff815e177f>] ret_from_fork+0x3f/0x70
[<ffffffff81099710>] kthread+0x0/0xe0
[<ffffffffffffffff>] 0xffffffffffffffff
bj-ha-3:~ # cat /proc/3094/stack
[<ffffffffa03adcc6>] transport_generic_free_cmd+0x76/0x110 [target_core_mod]
[<ffffffffa0404917>] iscsit_free_cmd+0x67/0x130 [iscsi_target_mod]
[<ffffffffa040b929>] iscsit_close_connection+0x4a9/0x860 [iscsi_target_mod]
[<ffffffffa040b0b3>] iscsi_target_rx_thread+0x93/0xa0 [iscsi_target_mod]
[<ffffffff810997cd>] kthread+0xbd/0xe0
[<ffffffff815e177f>] ret_from_fork+0x3f/0x70
[<ffffffff81099710>] kthread+0x0/0xe0
[<ffffffffffffffff>] 0xffffffffffffffff

At this moment, we can see I/O errors on one initiator side; the other
will survive. Once we kill the surviving side's FIO processes or wait a
moment, the target will auto-recover, which looks like this:

bj-ha-3:~ # ps -aux | grep iscsi
root      3063  0.0  0.0      0     0 ?        S    15:36   0:00 [iscsi_np]
root      3142  0.0  0.0      0     0 ?        S    15:43   0:00 [iscsi_ttx]
root      3143  0.0  0.0      0     0 ?        S    15:43   0:00 [iscsi_trx]
root      3146  0.6  0.0      0     0 ?        S    15:43   0:00 [iscsi_ttx]
root      3147  0.8  0.0      0     0 ?        S    15:43   0:00 [iscsi_trx]
root      3149  0.0  0.0  10508  1636 pts/0    S+   15:43   0:00 grep 
--color=auto iscsi

dmesg at this moment:
[   90.507364] iSCSI/iqn.1996-04.de.suse:01:4039926d2313: Unsupported 
SCSI Opcode 0xa3, sending CHECK_CONDITION.
[  100.487421] iSCSI/iqn.1996-04.de.suse:01:faad5846cde9: Unsupported 
SCSI Opcode 0xa3, sending CHECK_CONDITION.
[  182.582323] ABORT_TASK: Found referenced iSCSI task_tag: 15
[  197.616800] Unexpected ret: -32 send data 48
[  197.712278] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 15
[  263.603305] ABORT_TASK: Found referenced iSCSI task_tag: 268435537
[  278.640172] Unexpected ret: -32 send data 48
[  278.671299] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 
268435537
[  341.584329] ABORT_TASK: Found referenced iSCSI task_tag: 536871038
[  357.423710] Unexpected ret: -32 send data 48
[  373.660476] iSCSI Login timeout on Network Portal 192.168.100.233:3260
[  535.852903] Unexpected ret: -32 send data 48
[  535.923803] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 
536871038
[  535.923811] ABORT_TASK: Found referenced iSCSI task_tag: 76
[  535.923815] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 76
[  535.924325] Unexpected ret: -104 send data 360
[  535.924329] tx_data returned -32, expecting 360.
[  535.924489] iSCSI Login negotiation failed.
[  535.924916] Unexpected ret: -104 send data 360
[  535.924918] tx_data returned -32, expecting 360.
[  535.924943] iSCSI Login negotiation failed.
[  535.925424] Unexpected ret: -104 send data 360
[  535.925428] tx_data returned -32, expecting 360.
[  535.925650] iSCSI Login negotiation failed.
[  535.926095] Unexpected ret: -104 send data 360
[  535.926097] tx_data returned -32, expecting 360.
[  535.926132] iSCSI Login negotiation failed.
[  535.926349] Unexpected ret: -104 send data 360
[  535.926351] tx_data returned -32, expecting 360.
[  535.926369] iSCSI Login negotiation failed.
[  535.926576] Unexpected ret: -104 send data 360
[  535.926577] tx_data returned -32, expecting 360.
[  535.926593] iSCSI Login negotiation failed.
[  535.926750] Unexpected ret: -104 send data 360
[  535.926752] tx_data returned -32, expecting 360.
[  535.926768] iSCSI Login negotiation failed.
[  535.926905] Unexpected ret: -104 send data 360
[  535.926906] tx_data returned -32, expecting 360.
[  535.926925] iSCSI Login negotiation failed.
[  535.927064] Unexpected ret: -104 send data 360
[  535.927065] tx_data returned -32, expecting 360.
[  535.927082] iSCSI Login negotiation failed.
[  535.927219] Unexpected ret: -104 send data 360
[  535.927220] tx_data returned -32, expecting 360.
[  535.927237] iSCSI Login negotiation failed.

I am still debugging this case. I have core dumps; I think the dump
files are too big for a mailing list, but I can send them off-list if needed.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(3) In this case, the dmesg output and call stacks look like this:
bj-ha-3:~ # dmesg
[  803.570868] ABORT_TASK: Found referenced iSCSI task_tag: 96
[  818.584038] Unexpected ret: -32 send data 48
[  818.620210] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 96
[  882.555923] ABORT_TASK: Found referenced iSCSI task_tag: 268435496
[  897.587925] Unexpected ret: -32 send data 48
[  898.845646] Unexpected ret: -32 send data 48
[  898.916090] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 
268435496
[  898.916095] ABORT_TASK: Found referenced iSCSI task_tag: 45
[  898.916377] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 45
[  962.785419] ABORT_TASK: Found referenced iSCSI task_tag: 268435489

bj-ha-3:~ # ps -aux | grep iscsi
root      3063  0.0  0.0      0     0 ?        D    15:36   0:00 [iscsi_np]
root      3296  1.9  0.0      0     0 ?        S    15:49   0:01 [iscsi_ttx]
root      3297  1.9  0.0      0     0 ?        R    15:49   0:01 [iscsi_trx]
root      3299  0.0  0.0      0     0 ?        D    15:49   0:00 [iscsi_trx]
root      3506  0.0  0.0  10508  1516 pts/0    S+   15:51   0:00 grep 
--color=auto iscsi


bj-ha-3:~ # cat /proc/3063/stack
[<ffffffffa03fcd97>] iscsi_check_for_session_reinstatement+0x1d7/0x270 
[iscsi_target_mod]
[<ffffffffa03ff971>] iscsi_target_do_login+0x111/0x5f0 [iscsi_target_mod]
[<ffffffffa0400b17>] iscsi_target_start_negotiation+0x17/0xa0 
[iscsi_target_mod]
[<ffffffffa03fe932>] iscsi_target_login_thread+0xa72/0xfd0 
[iscsi_target_mod]
[<ffffffff810997cd>] kthread+0xbd/0xe0
[<ffffffff815e177f>] ret_from_fork+0x3f/0x70
[<ffffffff81099710>] kthread+0x0/0xe0
[<ffffffffffffffff>] 0xffffffffffffffff
bj-ha-3:~ # cat /proc/3099/stack
[<ffffffffa03ac784>] __transport_wait_for_tasks+0xb4/0x190 [target_core_mod]
[<ffffffffa03ac905>] transport_wait_for_tasks+0x45/0x60 [target_core_mod]
[<ffffffffa03aa034>] core_tmr_abort_task+0xf4/0x160 [target_core_mod]
[<ffffffffa03aca43>] target_tmr_work+0x123/0x140 [target_core_mod]
[<ffffffff81093ace>] process_one_work+0x14e/0x410
[<ffffffff81094326>] worker_thread+0x116/0x490
[<ffffffff810997cd>] kthread+0xbd/0xe0
[<ffffffff815e177f>] ret_from_fork+0x3f/0x70
[<ffffffff81099710>] kthread+0x0/0xe0
[<ffffffffffffffff>] 0xffffffffffffffff

This case is also rare (seen only once), but I got the core dump files.
-------------------------------------------------------------------------------------------------------------------------------------------------
Sorry again for the delay and for sending such a long mail. I am still working
on these cases and will keep the list updated once I find anything new.


Thanks,
BR
Zhu Lingshan


On 10/30/2016 07:03 AM, Johannes Thumshirn wrote:
> On Sat, Oct 29, 2016 at 03:53:25PM -0700, Nicholas A. Bellinger wrote:
>> Hi Johannes & Zhu,
>>
>> On Tue, 2016-10-18 at 23:29 -0700, Nicholas A. Bellinger wrote:
>>> On Tue, 2016-10-18 at 19:35 +0200, Johannes Thumshirn wrote:
>>>> On Tue, Oct 18, 2016 at 09:01:34AM +0200, Johannes Thumshirn wrote:
>>>>
>>>> [...]
>>>>
>>>>>> This is likely the missing SCF_ACK_KREF assignment in >= v4.1.y:
>>>>>>
>>>>>> http://www.spinics.net/lists/target-devel/msg13530.html
>>>> Sorry to disappoint you, but it didn't fix my issue. Is there any debug data I
>>>> can give you, or do you have any advice on where I could start looking?
>>>>
>>>> I have to admit I only tested the patch on our downstream kernel and not the
>>>> upstream kernel; I'll repeat the tests on 4.8 final tomorrow.
>>>>
>>> Is it possible to generate vmcore to have a look in crash + gdb..?
>>>
>>>
>> Just curious if you've been able to verify the above patch for
>> v4.8.y iscsi-target ports on your specific config(s)..?
>>
>> To confirm, using v4.1.y + patch I've not run into any further
>> se_cmd->cmd_kref leaks and/or hung tasks with iscsi-target ports due to
>> high backend device I/O latency, resulting in host timeouts + ABORT_TASK
>> + session reinstatement to occur while waiting for outstanding se_cmd
>> backend I/O to complete.
> Hi Nic,
>
> I'm sorry I haven't come around to testing it but I do have my test VMs with
> me and I should have time tomorrow to do it. Sorry for not giving you any updates
> on this.
>
> Thanks (and sorry),
>         Johannes



* Re: Need some pointers to debug a target hang
  2016-10-31  8:51             ` Zhu Lingshan
@ 2016-11-04  6:52               ` Nicholas A. Bellinger
  2016-11-09  3:42                 ` Zhu Lingshan
  0 siblings, 1 reply; 11+ messages in thread
From: Nicholas A. Bellinger @ 2016-11-04  6:52 UTC (permalink / raw)
  To: Zhu Lingshan; +Cc: Johannes Thumshirn, linux-scsi, target-devel

Hi Zhu & Co,

Thanks for the detailed logs.  Comments below.

On Mon, 2016-10-31 at 16:51 +0800, Zhu Lingshan wrote:
> Hi Nicholas,
> 
> (sorry it would be a long mail)
> 
> Sorry for the delay, I spent some test and debug work. I find the patch 
> http://www.spinics.net/lists/target-devel/msg13530.html can solve two 
> issues:
> (a). iscsit_stop_session() on the top of iscsi_np stack.
> (b).iscsi_check_for_session_reinstatement() on the top of iscsi_np 
> stack, it is a must to reboot to recover.
> 

Great, thanks for confirming the patch above.

The key wrt this scenario, and the other scenarios below, is that once
target-core ABORT_TASK logic locates a se_cmd descriptor with IBLOCK backend
I/O still outstanding, both iscsi session reinstatement in iscsi_np context
and iscsi session shutdown in iscsi_t[t,r]x context are blocked until the
specific outstanding se_cmd I/Os are completed back to the target-core
backend driver.

Note this is expected behavior during target-core ABORT_TASK and
iscsi-target session reinstatement + session shutdown.

> But I also find two more issues.
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Please let me explain my setup:
> I have a target server, kernel version 4.4.21, it has a SATA HDD as a 
> LUN, only one LUN.
> I have two initiators, both login to the target.
> Create two partitions on the LUN, each initiator mount a certain 
> partition. Like initiator1 mount /dev/sdc1, initiator2 mount /dev/sdc2.
> looks like this:
> lszhu_DEV:~ # lsscsi
> [0:0:0:0]    cd/dvd  HL-DT-ST DVD+-RW GHB0N    A1C0  /dev/sr0
> [4:0:0:0]    disk    ATA      TOSHIBA DT01ACA2 MX4O  /dev/sda
> [9:0:0:0]    disk    LIO-ORG  IBLOCK           4.0   /dev/sdc
> lszhu_DEV:~ # lsblk
> NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> sda      8:0    0   1.8T  0 disk
> ├─sda1   8:1    0    16G  0 part [SWAP]
> ├─sda2   8:2    0   200G  0 part /
> └─sda3   8:3    0   1.6T  0 part /home
> sdc      8:32   0 465.8G  0 disk
> ├─sdc1   8:33   0   200G  0 part
> └─sdc2   8:34   0 265.8G  0 part /mnt
> sr0     11:0    1  1024M  0 rom
> 
> bj-ha-5:~ # lsblk
> NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> sda      8:0    0 465.8G  0 disk
> ├─sda1   8:1    0     8G  0 part [SWAP]
> ├─sda2   8:2    0   100G  0 part /
> └─sda3   8:3    0 357.8G  0 part /home
> sdb      8:16   0 465.8G  0 disk
> sdc      8:32   0 465.8G  0 disk
> ├─sdc1   8:33   0   200G  0 part /mnt
> └─sdc2   8:34   0 265.8G  0 part
> sr0     11:0    1  1024M  0 rom
> 
> so you can see each initiator will read / write on their own partition 
> from the LUN.
> 
> mount the partition like this:
> mount -o dioread_nolock /dev/sdc1 /mnt,  option dioread_nolock can help 
> we reproduce this bug a little quicker.
> 
> Then run fio on both initiators like this:
> fio -filename=/mnt/file1 -direct=1 -rw=randread -iodepth=64 
> -ioengin=libaio -bs=64K -size=5G -numjobs=24 -runtime=60000 -time_based 
> -group_reporting -name=init1
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> Here are the three more issues I found, kernel version 4.4.21
> (1) The first one looks like this, it is quite rare, only once, I failed 
> to get stack information for iscsi_trx and hard to reproduce, call stack 
> of iscsi_np is here:
> 
> bj-ha-3:~ # ps -aux | grep iscsi
> root     10501  0.0  0.0      0     0 ?        D    16:45   0:00 [iscsi_np]
> root     10519  0.4  0.0      0     0 ?        S    16:49   0:02 [iscsi_ttx]
> root     10520  2.2  0.0      0     0 ?        S    16:49   0:14 [iscsi_trx]
> root     10533  0.9  0.0      0     0 ?        D    16:54   0:03 [iscsi_trx]
> root     10547  0.0  0.0  10508  1552 pts/0    S+   16:59   0:00 grep 
> --color=auto iscsi
> bj-ha-3:~ # cat /proc/10501/stack
> [<ffffffff815334c9>] inet_csk_accept+0x269/0x2e0
> [<ffffffff8155f4aa>] inet_accept+0x2a/0x100
> [<ffffffff814d2a88>] kernel_accept+0x48/0xa0
> [<ffffffffa06fe871>] iscsit_accept_np+0x31/0x230 [iscsi_target_mod]
> [<ffffffffa06fefab>] iscsi_target_login_thread+0xeb/0xfd0 [iscsi_target_mod]
> [<ffffffff810996bd>] kthread+0xbd/0xe0
> [<ffffffff815e15bf>] ret_from_fork+0x3f/0x70
> [<ffffffff81099600>] kthread+0x0/0xe0
> [<ffffffffffffffff>] 0xffffffffffffffff

The iscsi_np here looks like it's doing a normal Linux/NET accept.

Is the uninterruptible sleep for PID=10533 waiting for outstanding I/O
to complete during iscsit_close_connection() shutdown after
ABORT_TASK..?

> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> (2) The second issue I found has iscsi_check_for_session_reinstatement() 
> on the top of iscsi_np call stack, but different from case (b) we 
> mentioned before. In case (b), we must reboot the target to recover, but 
> in this case, it will auto recover, here is the details:
> dmesg:
> [  100.487421] iSCSI/iqn.1996-04.de.suse:01:faad5846cde9: Unsupported 
> SCSI Opcode 0xa3, sending CHECK_CONDITION.
> [  182.582323] ABORT_TASK: Found referenced iSCSI task_tag: 15
> [  197.616800] Unexpected ret: -32 send data 48
> [  197.712278] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 15
> [  263.603305] ABORT_TASK: Found referenced iSCSI task_tag: 268435537
> [  278.640172] Unexpected ret: -32 send data 48
> [  278.671299] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 
> 268435537
> [  341.584329] ABORT_TASK: Found referenced iSCSI task_tag: 536871038
> [  357.423710] Unexpected ret: -32 send data 48
> [  373.660476] iSCSI Login timeout on Network Portal 192.168.100.233:3260
> 

Great, recovering from uninterruptible sleep once the backend (finally)
completes outstanding I/O here confirms the earlier SCF_ACK_KREF patch.

So the backend device in question looks like it's repeatedly taking
extended time to complete I/O: at least 15 seconds plus the host
initiator side ABORT_TASK timeout.

Is the device failing..?

> call stacks:
> bj-ha-3:~ # ps -aux | grep iscsi
> root      3063  0.0  0.0      0     0 ?        D    15:36   0:00 [iscsi_np]
> root      3073  0.3  0.0      0     0 ?        S    15:36   0:01 [iscsi_ttx]
> root      3074  0.4  0.0      0     0 ?        S    15:36   0:01 [iscsi_trx]
> root      3094  0.3  0.0      0     0 ?        D    15:39   0:00 [iscsi_trx]
> root      3116  0.0  0.0  10508  1616 pts/0    S+   15:42   0:00 grep 
> --color=auto iscsi
> bj-ha-3:~ # cat /proc/3063/stack
> [<ffffffffa03fcd97>] iscsi_check_for_session_reinstatement+0x1d7/0x270 
> [iscsi_target_mod]
> [<ffffffffa03ff971>] iscsi_target_do_login+0x111/0x5f0 [iscsi_target_mod]
> [<ffffffffa0400b17>] iscsi_target_start_negotiation+0x17/0xa0 
> [iscsi_target_mod]
> [<ffffffffa03fe932>] iscsi_target_login_thread+0xa72/0xfd0 
> [iscsi_target_mod]
> [<ffffffff810997cd>] kthread+0xbd/0xe0
> [<ffffffff815e177f>] ret_from_fork+0x3f/0x70
> [<ffffffff81099710>] kthread+0x0/0xe0
> [<ffffffffffffffff>] 0xffffffffffffffff
> bj-ha-3:~ # cat /proc/3094/stack
> [<ffffffffa03adcc6>] transport_generic_free_cmd+0x76/0x110 [target_core_mod]
> [<ffffffffa0404917>] iscsit_free_cmd+0x67/0x130 [iscsi_target_mod]
> [<ffffffffa040b929>] iscsit_close_connection+0x4a9/0x860 [iscsi_target_mod]
> [<ffffffffa040b0b3>] iscsi_target_rx_thread+0x93/0xa0 [iscsi_target_mod]
> [<ffffffff810997cd>] kthread+0xbd/0xe0
> [<ffffffff815e177f>] ret_from_fork+0x3f/0x70
> [<ffffffff81099710>] kthread+0x0/0xe0
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> At this moment, we can see IO errors at one initiator side, the other 
> will survive. Once we killed the survival side FIO processes or wait a 
> moment, the target would auto recover, looks like this:
> 
> bj-ha-3:~ # ps -aux | grep iscsi
> root      3063  0.0  0.0      0     0 ?        S    15:36   0:00 [iscsi_np]
> root      3142  0.0  0.0      0     0 ?        S    15:43   0:00 [iscsi_ttx]
> root      3143  0.0  0.0      0     0 ?        S    15:43   0:00 [iscsi_trx]
> root      3146  0.6  0.0      0     0 ?        S    15:43   0:00 [iscsi_ttx]
> root      3147  0.8  0.0      0     0 ?        S    15:43   0:00 [iscsi_trx]
> root      3149  0.0  0.0  10508  1636 pts/0    S+   15:43   0:00 grep 
> --color=auto iscsi
> 
> dmesg at this moment:
> [   90.507364] iSCSI/iqn.1996-04.de.suse:01:4039926d2313: Unsupported 
> SCSI Opcode 0xa3, sending CHECK_CONDITION.
> [  100.487421] iSCSI/iqn.1996-04.de.suse:01:faad5846cde9: Unsupported 
> SCSI Opcode 0xa3, sending CHECK_CONDITION.
> [  182.582323] ABORT_TASK: Found referenced iSCSI task_tag: 15
> [  197.616800] Unexpected ret: -32 send data 48
> [  197.712278] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 15
> [  263.603305] ABORT_TASK: Found referenced iSCSI task_tag: 268435537
> [  278.640172] Unexpected ret: -32 send data 48
> [  278.671299] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 
> 268435537
> [  341.584329] ABORT_TASK: Found referenced iSCSI task_tag: 536871038
> [  357.423710] Unexpected ret: -32 send data 48
> [  373.660476] iSCSI Login timeout on Network Portal 192.168.100.233:3260
> [  535.852903] Unexpected ret: -32 send data 48
> [  535.923803] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 
> 536871038
> [  535.923811] ABORT_TASK: Found referenced iSCSI task_tag: 76
> [  535.923815] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 76
> [  535.924325] Unexpected ret: -104 send data 360
> [  535.924329] tx_data returned -32, expecting 360.
> [  535.924489] iSCSI Login negotiation failed.
> [  535.924916] Unexpected ret: -104 send data 360
> [  535.924918] tx_data returned -32, expecting 360.
> [  535.924943] iSCSI Login negotiation failed.
> [  535.925424] Unexpected ret: -104 send data 360
> [  535.925428] tx_data returned -32, expecting 360.
> [  535.925650] iSCSI Login negotiation failed.
> [  535.926095] Unexpected ret: -104 send data 360
> [  535.926097] tx_data returned -32, expecting 360.
> [  535.926132] iSCSI Login negotiation failed.
> [  535.926349] Unexpected ret: -104 send data 360
> [  535.926351] tx_data returned -32, expecting 360.
> [  535.926369] iSCSI Login negotiation failed.
> [  535.926576] Unexpected ret: -104 send data 360
> [  535.926577] tx_data returned -32, expecting 360.
> [  535.926593] iSCSI Login negotiation failed.
> [  535.926750] Unexpected ret: -104 send data 360
> [  535.926752] tx_data returned -32, expecting 360.
> [  535.926768] iSCSI Login negotiation failed.
> [  535.926905] Unexpected ret: -104 send data 360
> [  535.926906] tx_data returned -32, expecting 360.
> [  535.926925] iSCSI Login negotiation failed.
> [  535.927064] Unexpected ret: -104 send data 360
> [  535.927065] tx_data returned -32, expecting 360.
> [  535.927082] iSCSI Login negotiation failed.
> [  535.927219] Unexpected ret: -104 send data 360
> [  535.927220] tx_data returned -32, expecting 360.
> [  535.927237] iSCSI Login negotiation failed.
> 
> I am still debugging on this case, I have core dumps, I think the dump 
> files are too big for a mail list, I can send it off the list if needed.

AFAICT, this is all expected behavior from target-core + iscsi-target
when the backend device is not completing I/O for extended periods of
time.

> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> (3) In this case, dmesg and call stack looks like this:
> bj-ha-3:~ # dmesg
> [  803.570868] ABORT_TASK: Found referenced iSCSI task_tag: 96
> [  818.584038] Unexpected ret: -32 send data 48
> [  818.620210] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 96
> [  882.555923] ABORT_TASK: Found referenced iSCSI task_tag: 268435496
> [  897.587925] Unexpected ret: -32 send data 48
> [  898.845646] Unexpected ret: -32 send data 48
> [  898.916090] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 
> 268435496
> [  898.916095] ABORT_TASK: Found referenced iSCSI task_tag: 45
> [  898.916377] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 45
> [  962.785419] ABORT_TASK: Found referenced iSCSI task_tag: 268435489
> 
> bj-ha-3:~ # ps -aux | grep iscsi
> root      3063  0.0  0.0      0     0 ?        D    15:36   0:00 [iscsi_np]
> root      3296  1.9  0.0      0     0 ?        S    15:49   0:01 [iscsi_ttx]
> root      3297  1.9  0.0      0     0 ?        R    15:49   0:01 [iscsi_trx]
> root      3299  0.0  0.0      0     0 ?        D    15:49   0:00 [iscsi_trx]
> root      3506  0.0  0.0  10508  1516 pts/0    S+   15:51   0:00 grep 
> --color=auto iscsi
> 
> 
> bj-ha-3:~ # cat /proc/3063/stack
> [<ffffffffa03fcd97>] iscsi_check_for_session_reinstatement+0x1d7/0x270 
> [iscsi_target_mod]
> [<ffffffffa03ff971>] iscsi_target_do_login+0x111/0x5f0 [iscsi_target_mod]
> [<ffffffffa0400b17>] iscsi_target_start_negotiation+0x17/0xa0 
> [iscsi_target_mod]
> [<ffffffffa03fe932>] iscsi_target_login_thread+0xa72/0xfd0 
> [iscsi_target_mod]
> [<ffffffff810997cd>] kthread+0xbd/0xe0
> [<ffffffff815e177f>] ret_from_fork+0x3f/0x70
> [<ffffffff81099710>] kthread+0x0/0xe0
> [<ffffffffffffffff>] 0xffffffffffffffff
> bj-ha-3:~ # cat /proc/3099/stack
> [<ffffffffa03ac784>] __transport_wait_for_tasks+0xb4/0x190 [target_core_mod]
> [<ffffffffa03ac905>] transport_wait_for_tasks+0x45/0x60 [target_core_mod]
> [<ffffffffa03aa034>] core_tmr_abort_task+0xf4/0x160 [target_core_mod]
> [<ffffffffa03aca43>] target_tmr_work+0x123/0x140 [target_core_mod]
> [<ffffffff81093ace>] process_one_work+0x14e/0x410
> [<ffffffff81094326>] worker_thread+0x116/0x490
> [<ffffffff810997cd>] kthread+0xbd/0xe0
> [<ffffffff815e177f>] ret_from_fork+0x3f/0x70
> [<ffffffff81099710>] kthread+0x0/0xe0
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> This case is also rare, only once, but I got the core dump files.

Looks the same as #2.

> -------------------------------------------------------------------------------------------------------------------------------------------------
> Sorry again for the delay and send such a long mail, I am still working 
> on there cases, I will keep the list update once I find anything new.

Thanks again for confirming the earlier patch.



* Re: Need some pointers to debug a target hang
  2016-11-04  6:52               ` Nicholas A. Bellinger
@ 2016-11-09  3:42                 ` Zhu Lingshan
  0 siblings, 0 replies; 11+ messages in thread
From: Zhu Lingshan @ 2016-11-09  3:42 UTC (permalink / raw)
  To: Nicholas A. Bellinger; +Cc: Johannes Thumshirn, linux-scsi, target-devel



On 11/04/2016 02:52 PM, Nicholas A. Bellinger wrote:
> Hi Zhu & Co,
>
> Thanks for the detailed logs.  Comments below.
>
> On Mon, 2016-10-31 at 16:51 +0800, Zhu Lingshan wrote:
>> Hi Nicholas,
>>
>> (sorry it would be a long mail)
>>
>> Sorry for the delay, I spent some test and debug work. I find the patch
>> http://www.spinics.net/lists/target-devel/msg13530.html can solve two
>> issues:
>> (a). iscsit_stop_session() on the top of iscsi_np stack.
>> (b).iscsi_check_for_session_reinstatement() on the top of iscsi_np
>> stack, it is a must to reboot to recover.
>>
> Great, thanks for confirming the patch above.
>
> The key wrt this scenario, and other scenarios below is once target-core
> ABORT_TASK logic locates a se_cmd descriptor for IBLOCK backend I/O
> still outstanding, both iscsi session reinstatement in iscsi_np context
> and iscsi session shutdown in iscsi_t[t,r]x context are blocked until
> the specific outstanding se_cmd I/Os are completed, back to target-core
> backend driver.
>
> Note this is expected behavior during target-core ABORT_TASK and
> iscsi-target session reinstatement + session shutdown.
>
>> But I also find two more issues.
>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> Please let me explain my setup:
>> I have a target server, kernel version 4.4.21, it has a SATA HDD as a
>> LUN, only one LUN.
>> I have two initiators, both login to the target.
>> Create two partitions on the LUN, each initiator mount a certain
>> partition. Like initiator1 mount /dev/sdc1, initiator2 mount /dev/sdc2.
>> looks like this:
>> lszhu_DEV:~ # lsscsi
>> [0:0:0:0]    cd/dvd  HL-DT-ST DVD+-RW GHB0N    A1C0  /dev/sr0
>> [4:0:0:0]    disk    ATA      TOSHIBA DT01ACA2 MX4O  /dev/sda
>> [9:0:0:0]    disk    LIO-ORG  IBLOCK           4.0   /dev/sdc
>> lszhu_DEV:~ # lsblk
>> NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
>> sda      8:0    0   1.8T  0 disk
>> ├─sda1   8:1    0    16G  0 part [SWAP]
>> ├─sda2   8:2    0   200G  0 part /
>> └─sda3   8:3    0   1.6T  0 part /home
>> sdc      8:32   0 465.8G  0 disk
>> ├─sdc1   8:33   0   200G  0 part
>> └─sdc2   8:34   0 265.8G  0 part /mnt
>> sr0     11:0    1  1024M  0 rom
>>
>> bj-ha-5:~ # lsblk
>> NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
>> sda      8:0    0 465.8G  0 disk
>> ├─sda1   8:1    0     8G  0 part [SWAP]
>> ├─sda2   8:2    0   100G  0 part /
>> └─sda3   8:3    0 357.8G  0 part /home
>> sdb      8:16   0 465.8G  0 disk
>> sdc      8:32   0 465.8G  0 disk
>> ├─sdc1   8:33   0   200G  0 part /mnt
>> └─sdc2   8:34   0 265.8G  0 part
>> sr0     11:0    1  1024M  0 rom
>>
>> so you can see each initiator will read / write on their own partition
>> from the LUN.
>>
>> mount the partition like this:
>> mount -o dioread_nolock /dev/sdc1 /mnt,  option dioread_nolock can help
>> we reproduce this bug a little quicker.
>>
>> Then run fio on both initiators like this:
>> fio -filename=/mnt/file1 -direct=1 -rw=randread -iodepth=64
>> -ioengin=libaio -bs=64K -size=5G -numjobs=24 -runtime=60000 -time_based
>> -group_reporting -name=init1
>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Here are the three more issues I found, kernel version 4.4.21
>> (1) The first one looks like this, it is quite rare, only once, I failed
>> to get stack information for iscsi_trx and hard to reproduce, call stack
>> of iscsi_np is here:
>>
>> bj-ha-3:~ # ps -aux | grep iscsi
>> root     10501  0.0  0.0      0     0 ?        D    16:45   0:00 [iscsi_np]
>> root     10519  0.4  0.0      0     0 ?        S    16:49   0:02 [iscsi_ttx]
>> root     10520  2.2  0.0      0     0 ?        S    16:49   0:14 [iscsi_trx]
>> root     10533  0.9  0.0      0     0 ?        D    16:54   0:03 [iscsi_trx]
>> root     10547  0.0  0.0  10508  1552 pts/0    S+   16:59   0:00 grep
>> --color=auto iscsi
>> bj-ha-3:~ # cat /proc/10501/stack
>> [<ffffffff815334c9>] inet_csk_accept+0x269/0x2e0
>> [<ffffffff8155f4aa>] inet_accept+0x2a/0x100
>> [<ffffffff814d2a88>] kernel_accept+0x48/0xa0
>> [<ffffffffa06fe871>] iscsit_accept_np+0x31/0x230 [iscsi_target_mod]
>> [<ffffffffa06fefab>] iscsi_target_login_thread+0xeb/0xfd0 [iscsi_target_mod]
>> [<ffffffff810996bd>] kthread+0xbd/0xe0
>> [<ffffffff815e15bf>] ret_from_fork+0x3f/0x70
>> [<ffffffff81099600>] kthread+0x0/0xe0
>> [<ffffffffffffffff>] 0xffffffffffffffff
> The iscsi_np here is looks like it's doing normal Linux/NET accept.
>
> Is the uninterruptible sleep for PID=10533 waiting for outstanding I/O
> to complete during iscsit_close_connection() shutdown after
> ABORT_TASK..?

Sorry, I have been trying to reproduce this case for a week, but I cannot
reproduce it anymore; it is quite rare.

>
>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> (2) The second issue I found has iscsi_check_for_session_reinstatement()
>> on the top of iscsi_np call stack, but different from case (b) we
>> mentioned before. In case (b), we must reboot the target to recover, but
>> in this case, it will auto recover, here is the details:
>> dmesg:
>> [  100.487421] iSCSI/iqn.1996-04.de.suse:01:faad5846cde9: Unsupported
>> SCSI Opcode 0xa3, sending CHECK_CONDITION.
>> [  182.582323] ABORT_TASK: Found referenced iSCSI task_tag: 15
>> [  197.616800] Unexpected ret: -32 send data 48
>> [  197.712278] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 15
>> [  263.603305] ABORT_TASK: Found referenced iSCSI task_tag: 268435537
>> [  278.640172] Unexpected ret: -32 send data 48
>> [  278.671299] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
>> 268435537
>> [  341.584329] ABORT_TASK: Found referenced iSCSI task_tag: 536871038
>> [  357.423710] Unexpected ret: -32 send data 48
>> [  373.660476] iSCSI Login timeout on Network Portal 192.168.100.233:3260
>>
> Great, recovering from uninterruptible sleep once the backend (finally)
> completes outstanding I/O here confirms the earlier SCF_ACK_KREF patch.
>
> So the backend device in question looks like it's repeatably taking
> extended time to complete I/O.  At least 15 seconds plus the host
> initiator side ABORT_TASK timeout.
>
> Is the device failing..?

I tried many times (more than 50) and the I_T nexus always stays alive. After
killing the surviving node (we have two initiator nodes, one failing and one
kept running), I see the former 'D' threads on the target side go back to 'S'
state, and I can read or write the LUN; the mount point is still active. The
target is fully functional! Thanks for the patch!
But I still see messages like "[  887.448142] sd 8:0:0:0: rejecting I/O
to offline device" in the failed initiator's dmesg. I guess the target cannot
respond to the failed initiator during that time because it is fully occupied
by the other initiator, and I guess HDD performance is the bottleneck.
But I think there should not be an "offline device": yes, I/O traffic at the
failed initiator is stuck, but we still have plenty of free network resources
(FIO shows I/O traffic < 20MB/s), as well as CPU and RAM, so I think it should
still be possible to respond at the protocol layer rather than ending up with
an "offline device".
Is it possible to investigate this further (I will do that)? Maybe on the
initiator side; any suggestions?
>
>> call stacks:
>> bj-ha-3:~ # ps -aux | grep iscsi
>> root      3063  0.0  0.0      0     0 ?        D    15:36   0:00 [iscsi_np]
>> root      3073  0.3  0.0      0     0 ?        S    15:36   0:01 [iscsi_ttx]
>> root      3074  0.4  0.0      0     0 ?        S    15:36   0:01 [iscsi_trx]
>> root      3094  0.3  0.0      0     0 ?        D    15:39   0:00 [iscsi_trx]
>> root      3116  0.0  0.0  10508  1616 pts/0    S+   15:42   0:00 grep
>> --color=auto iscsi
>> bj-ha-3:~ # cat /proc/3063/stack
>> [<ffffffffa03fcd97>] iscsi_check_for_session_reinstatement+0x1d7/0x270
>> [iscsi_target_mod]
>> [<ffffffffa03ff971>] iscsi_target_do_login+0x111/0x5f0 [iscsi_target_mod]
>> [<ffffffffa0400b17>] iscsi_target_start_negotiation+0x17/0xa0
>> [iscsi_target_mod]
>> [<ffffffffa03fe932>] iscsi_target_login_thread+0xa72/0xfd0
>> [iscsi_target_mod]
>> [<ffffffff810997cd>] kthread+0xbd/0xe0
>> [<ffffffff815e177f>] ret_from_fork+0x3f/0x70
>> [<ffffffff81099710>] kthread+0x0/0xe0
>> [<ffffffffffffffff>] 0xffffffffffffffff
>> bj-ha-3:~ # cat /proc/3094/stack
>> [<ffffffffa03adcc6>] transport_generic_free_cmd+0x76/0x110 [target_core_mod]
>> [<ffffffffa0404917>] iscsit_free_cmd+0x67/0x130 [iscsi_target_mod]
>> [<ffffffffa040b929>] iscsit_close_connection+0x4a9/0x860 [iscsi_target_mod]
>> [<ffffffffa040b0b3>] iscsi_target_rx_thread+0x93/0xa0 [iscsi_target_mod]
>> [<ffffffff810997cd>] kthread+0xbd/0xe0
>> [<ffffffff815e177f>] ret_from_fork+0x3f/0x70
>> [<ffffffff81099710>] kthread+0x0/0xe0
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> At this point we can see I/O errors on one initiator side while the other
>> survives. Once we kill the surviving side's FIO processes, or simply wait
>> a while, the target recovers on its own and looks like this:
>>
>> bj-ha-3:~ # ps -aux | grep iscsi
>> root      3063  0.0  0.0      0     0 ?        S    15:36   0:00 [iscsi_np]
>> root      3142  0.0  0.0      0     0 ?        S    15:43   0:00 [iscsi_ttx]
>> root      3143  0.0  0.0      0     0 ?        S    15:43   0:00 [iscsi_trx]
>> root      3146  0.6  0.0      0     0 ?        S    15:43   0:00 [iscsi_ttx]
>> root      3147  0.8  0.0      0     0 ?        S    15:43   0:00 [iscsi_trx]
>> root      3149  0.0  0.0  10508  1636 pts/0    S+   15:43   0:00 grep
>> --color=auto iscsi
>>
>> dmesg at this moment:
>> [   90.507364] iSCSI/iqn.1996-04.de.suse:01:4039926d2313: Unsupported
>> SCSI Opcode 0xa3, sending CHECK_CONDITION.
>> [  100.487421] iSCSI/iqn.1996-04.de.suse:01:faad5846cde9: Unsupported
>> SCSI Opcode 0xa3, sending CHECK_CONDITION.
>> [  182.582323] ABORT_TASK: Found referenced iSCSI task_tag: 15
>> [  197.616800] Unexpected ret: -32 send data 48
>> [  197.712278] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 15
>> [  263.603305] ABORT_TASK: Found referenced iSCSI task_tag: 268435537
>> [  278.640172] Unexpected ret: -32 send data 48
>> [  278.671299] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
>> 268435537
>> [  341.584329] ABORT_TASK: Found referenced iSCSI task_tag: 536871038
>> [  357.423710] Unexpected ret: -32 send data 48
>> [  373.660476] iSCSI Login timeout on Network Portal 192.168.100.233:3260
>> [  535.852903] Unexpected ret: -32 send data 48
>> [  535.923803] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
>> 536871038
>> [  535.923811] ABORT_TASK: Found referenced iSCSI task_tag: 76
>> [  535.923815] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 76
>> [  535.924325] Unexpected ret: -104 send data 360
>> [  535.924329] tx_data returned -32, expecting 360.
>> [  535.924489] iSCSI Login negotiation failed.
>> [  535.924916] Unexpected ret: -104 send data 360
>> [  535.924918] tx_data returned -32, expecting 360.
>> [  535.924943] iSCSI Login negotiation failed.
>> [  535.925424] Unexpected ret: -104 send data 360
>> [  535.925428] tx_data returned -32, expecting 360.
>> [  535.925650] iSCSI Login negotiation failed.
>> [  535.926095] Unexpected ret: -104 send data 360
>> [  535.926097] tx_data returned -32, expecting 360.
>> [  535.926132] iSCSI Login negotiation failed.
>> [  535.926349] Unexpected ret: -104 send data 360
>> [  535.926351] tx_data returned -32, expecting 360.
>> [  535.926369] iSCSI Login negotiation failed.
>> [  535.926576] Unexpected ret: -104 send data 360
>> [  535.926577] tx_data returned -32, expecting 360.
>> [  535.926593] iSCSI Login negotiation failed.
>> [  535.926750] Unexpected ret: -104 send data 360
>> [  535.926752] tx_data returned -32, expecting 360.
>> [  535.926768] iSCSI Login negotiation failed.
>> [  535.926905] Unexpected ret: -104 send data 360
>> [  535.926906] tx_data returned -32, expecting 360.
>> [  535.926925] iSCSI Login negotiation failed.
>> [  535.927064] Unexpected ret: -104 send data 360
>> [  535.927065] tx_data returned -32, expecting 360.
>> [  535.927082] iSCSI Login negotiation failed.
>> [  535.927219] Unexpected ret: -104 send data 360
>> [  535.927220] tx_data returned -32, expecting 360.
>> [  535.927237] iSCSI Login negotiation failed.
>>
>> I am still debugging this case. I have core dumps, but I think the dump
>> files are too big for a mailing list; I can send them off-list if needed.
> AFAICT, this is all expected behavior from target-core + iscsi-target
> when the backend device is not completing I/O for extended periods of
> time.
>
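Understood. To double-check that it really is the backend device
stalling here rather than anything in the fabric path, my plan is to
watch the backing device while the hang is in progress. A minimal
sketch, where sdb is only a placeholder for whatever block device backs
the LUN and sysrq is assumed to be enabled:

# dump all blocked (D state) tasks to dmesg while the hang is live
echo w > /proc/sysrq-trigger

# in-flight reads/writes on the backing device; counters that stay
# non-zero while nothing completes point at the backend, not the fabric
watch -n1 cat /sys/block/sdb/inflight

# per-device latency and queue depth over the same window
iostat -x 1 sdb

If the in-flight counters drain normally while the iscsi threads stay in
D, I will come back with the blocked-task dump from sysrq.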
>> ------------------------------------------------------------------------
>> (3) In this case, the dmesg output and call stacks look like this:
>> bj-ha-3:~ # dmesg
>> [  803.570868] ABORT_TASK: Found referenced iSCSI task_tag: 96
>> [  818.584038] Unexpected ret: -32 send data 48
>> [  818.620210] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 96
>> [  882.555923] ABORT_TASK: Found referenced iSCSI task_tag: 268435496
>> [  897.587925] Unexpected ret: -32 send data 48
>> [  898.845646] Unexpected ret: -32 send data 48
>> [  898.916090] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
>> 268435496
>> [  898.916095] ABORT_TASK: Found referenced iSCSI task_tag: 45
>> [  898.916377] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 45
>> [  962.785419] ABORT_TASK: Found referenced iSCSI task_tag: 268435489
>>
>> bj-ha-3:~ # ps -aux | grep iscsi
>> root      3063  0.0  0.0      0     0 ?        D    15:36   0:00 [iscsi_np]
>> root      3296  1.9  0.0      0     0 ?        S    15:49   0:01 [iscsi_ttx]
>> root      3297  1.9  0.0      0     0 ?        R    15:49   0:01 [iscsi_trx]
>> root      3299  0.0  0.0      0     0 ?        D    15:49   0:00 [iscsi_trx]
>> root      3506  0.0  0.0  10508  1516 pts/0    S+   15:51   0:00 grep
>> --color=auto iscsi
>>
>>
>> bj-ha-3:~ # cat /proc/3063/stack
>> [<ffffffffa03fcd97>] iscsi_check_for_session_reinstatement+0x1d7/0x270
>> [iscsi_target_mod]
>> [<ffffffffa03ff971>] iscsi_target_do_login+0x111/0x5f0 [iscsi_target_mod]
>> [<ffffffffa0400b17>] iscsi_target_start_negotiation+0x17/0xa0
>> [iscsi_target_mod]
>> [<ffffffffa03fe932>] iscsi_target_login_thread+0xa72/0xfd0
>> [iscsi_target_mod]
>> [<ffffffff810997cd>] kthread+0xbd/0xe0
>> [<ffffffff815e177f>] ret_from_fork+0x3f/0x70
>> [<ffffffff81099710>] kthread+0x0/0xe0
>> [<ffffffffffffffff>] 0xffffffffffffffff
>> bj-ha-3:~ # cat /proc/3099/stack
>> [<ffffffffa03ac784>] __transport_wait_for_tasks+0xb4/0x190 [target_core_mod]
>> [<ffffffffa03ac905>] transport_wait_for_tasks+0x45/0x60 [target_core_mod]
>> [<ffffffffa03aa034>] core_tmr_abort_task+0xf4/0x160 [target_core_mod]
>> [<ffffffffa03aca43>] target_tmr_work+0x123/0x140 [target_core_mod]
>> [<ffffffff81093ace>] process_one_work+0x14e/0x410
>> [<ffffffff81094326>] worker_thread+0x116/0x490
>> [<ffffffff810997cd>] kthread+0xbd/0xe0
>> [<ffffffff815e177f>] ret_from_fork+0x3f/0x70
>> [<ffffffff81099710>] kthread+0x0/0xe0
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> This case is also rare (it happened only once), but I got the core dump files.
> Looks the same as #2.
>
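Since this one looks the same as #2, I will start from the core dumps I
already have. Assuming they are regular kdump vmcores, a rough sketch of
how I plan to open them (the vmlinux path below is only an example for
this kernel):

# open the dump against a vmlinux that carries debug info
crash /usr/lib/debug/boot/vmlinux-4.4.73-5-default vmcore

# commands I plan to start with inside crash:
#   ps                       task list; UN marks uninterruptible (D) tasks
#   bt 3063                  back-trace of the stuck iscsi_np thread
#   mod -s target_core_mod   load the target_core_mod debug symbols
#   struct se_cmd <address>  inspect the command ABORT_TASK is waiting on

I will report back once I have walked through the se_cmd state.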
>> ------------------------------------------------------------------------
>> Sorry again for the delay and for sending such a long mail. I am still
>> working on these cases and will keep the list updated once I find
>> anything new.
> Thanks again for confirming the earlier patch.
>
>
Thanks a lot for your help!
BR
Zhu Lingshan
