All of lore.kernel.org
 help / color / mirror / Atom feed
From: Doug Ledford <dledford@redhat.com>
To: Robert LeBlanc <robert@leblancnet.us>,
	"Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Zhu Lingshan <lszhu@suse.com>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	linux-scsi@vger.kernel.org, Sagi Grimberg <sagi@grimberg.me>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: iscsi_trx going into D state
Date: Thu, 22 Dec 2016 14:15:43 -0500	[thread overview]
Message-ID: <6beeb461-0391-cb29-8cf7-74f4c8feda34@redhat.com> (raw)
In-Reply-To: <CAANLjFpoWruzYEDohQrAKCxp9NSGhZM3mv=x+9+L7x1+fwYN6w@mail.gmail.com>


[-- Attachment #1.1: Type: text/plain, Size: 9202 bytes --]

On 12/21/2016 6:39 PM, Robert LeBlanc wrote:
> I hit a new backtrace today, hopefully it adds something.
> 
> # cat /proc/19659/stack
> [<ffffffff815304d1>] iscsit_stop_session+0x1b1/0x1c0
> [<ffffffff81521c62>] iscsi_check_for_session_reinstatement+0x1e2/0x270
> [<ffffffff81524660>] iscsi_target_check_for_existing_instances+0x30/0x40
> [<ffffffff815247a8>] iscsi_target_do_login+0x138/0x630
> [<ffffffff815259be>] iscsi_target_start_negotiation+0x4e/0xa0
> [<ffffffff8152355e>] __iscsi_target_login_thread+0x83e/0xf20
> [<ffffffff81523c64>] iscsi_target_login_thread+0x24/0x30
> [<ffffffff810a3059>] kthread+0xd9/0xf0
> [<ffffffff817732d5>] ret_from_fork+0x25/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> # cat /proc/21342/stack
> [<ffffffffa0292b10>] __ib_drain_sq+0x190/0x1c0 [ib_core]
> [<ffffffffa0292b65>] ib_drain_sq+0x25/0x30 [ib_core]
> [<ffffffffa0292d72>] ib_drain_qp+0x12/0x30 [ib_core]
> [<ffffffffa062c5ff>] isert_wait_conn+0x5f/0x2d0 [ib_isert]
> [<ffffffff815309b7>] iscsit_close_connection+0x157/0x860
> [<ffffffff8151f10b>] iscsit_take_action_for_connection_exit+0x7b/0xf0
> [<ffffffff81530265>] iscsi_target_rx_thread+0x95/0xa0
> [<ffffffff810a3059>] kthread+0xd9/0xf0
> [<ffffffff817732d5>] ret_from_fork+0x25/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> # ps aux | grep iscsi | grep D
> root     19659  0.0  0.0      0     0 ?        D    16:12   0:00 [iscsi_np]
> root     21342  0.0  0.0      0     0 ?        D    16:29   0:00 [iscsi_trx]
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

That looks suspiciously like the __ib_drain_sq is stuck forever waiting
on a completion that never comes.

> 
> On Thu, Dec 15, 2016 at 1:38 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>> Nicholas,
>>
>> I've found that the kernels I used were not able to be inspected using
>> crash and I could not build the debug info for them. So I built a new
>> 4.9 kernel and verified that I could inspect the crash. It is located
>> at [1].
>>
>> [1] http://mirrors.betterservers.com/trace/crash2.tar.xz
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Mon, Dec 12, 2016 at 4:57 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>>> Nicholas,
>>>
>>> After lots of set backs and having to give up trying to get kernel
>>> dumps on our "production" systems, I've been able to work out the
>>> issues we had with kdump and replicate the issue on my dev boxes. I
>>> have dumps from 4.4.30 and 4.9-rc8 (makedumpfile would not dump, so it
>>> is a straight copy of /proc/vmcore from the crash kernel). In each
>>> crash directory, I put a details.txt file that has the process IDs
>>> that were having problems and a brief description of the set-up at the
>>> time. This was mostly replicated by starting fio and pulling the
>>> Infiniband cable until fio gave up. This hardware also has Mellanox
>>> ConnectX4-LX cards and I also replicated the issue over RoCE using 4.9
>>> since it has the drivers in-box. Please let me know if you need more
>>> info, I can test much faster now. The cores/kernels/modules are
>>> located at [1].
>>>
>>> [1] http://mirrors.betterservers.com/trace/crash.tar.xz
>>>
>>> Thanks,
>>> Robert
>>> ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Fri, Nov 4, 2016 at 3:57 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>>>> We hit this yesterday, this time it was on the tx thread (the other
>>>> ones before seem to be on the rx thread). We weren't able to get a
>>>> kernel dump on this. We'll try to get one next time.
>>>>
>>>> # ps axuw | grep "D.*iscs[i]"
>>>> root     12383  0.0  0.0      0     0 ?        D    Nov03   0:04 [iscsi_np]
>>>> root     23016  0.0  0.0      0     0 ?        D    Nov03   0:00 [iscsi_ttx]
>>>> root     23018  0.0  0.0      0     0 ?        D    Nov03   0:00 [iscsi_ttx]
>>>> # cat /proc/12383/stack
>>>> [<ffffffff814f24af>] iscsit_stop_session+0x19f/0x1d0
>>>> [<ffffffff814e3c66>] iscsi_check_for_session_reinstatement+0x1e6/0x270
>>>> [<ffffffff814e6620>] iscsi_target_check_for_existing_instances+0x30/0x40
>>>> [<ffffffff814e6770>] iscsi_target_do_login+0x140/0x640
>>>> [<ffffffff814e7b0c>] iscsi_target_start_negotiation+0x1c/0xb0
>>>> [<ffffffff814e585b>] iscsi_target_login_thread+0xa9b/0xfc0
>>>> [<ffffffff8109d7c8>] kthread+0xd8/0xf0
>>>> [<ffffffff81721a8f>] ret_from_fork+0x3f/0x70
>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>> # cat /proc/23016/stack
>>>> [<ffffffff814ce0d9>] target_wait_for_sess_cmds+0x49/0x1a0
>>>> [<ffffffffa058b92b>] isert_wait_conn+0x1ab/0x2f0 [ib_isert]
>>>> [<ffffffff814f2642>] iscsit_close_connection+0x162/0x870
>>>> [<ffffffff814e110f>] iscsit_take_action_for_connection_exit+0x7f/0x100
>>>> [<ffffffff814f122a>] iscsi_target_tx_thread+0x1aa/0x1d0
>>>> [<ffffffff8109d7c8>] kthread+0xd8/0xf0
>>>> [<ffffffff81721a8f>] ret_from_fork+0x3f/0x70
>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>> # cat /proc/23018/stack
>>>> [<ffffffff814ce0d9>] target_wait_for_sess_cmds+0x49/0x1a0
>>>> [<ffffffffa058b92b>] isert_wait_conn+0x1ab/0x2f0 [ib_isert]
>>>> [<ffffffff814f2642>] iscsit_close_connection+0x162/0x870
>>>> [<ffffffff814e110f>] iscsit_take_action_for_connection_exit+0x7f/0x100
>>>> [<ffffffff814f122a>] iscsi_target_tx_thread+0x1aa/0x1d0
>>>> [<ffffffff8109d7c8>] kthread+0xd8/0xf0
>>>> [<ffffffff81721a8f>] ret_from_fork+0x3f/0x70
>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>
>>>> From dmesg:
>>>> [  394.476332] INFO: rcu_sched self-detected stall on CPU
>>>> [  394.476334]  20-...: (23976 ticks this GP)
>>>> idle=edd/140000000000001/0 softirq=292/292 fqs=18788
>>>> [  394.476336]   (t=24003 jiffies g=3146 c=3145 q=0)
>>>> [  394.476337] Task dump for CPU 20:
>>>> [  394.476338] kworker/u68:2   R  running task        0 12906      2 0x00000008
>>>> [  394.476345] Workqueue: isert_comp_wq isert_cq_work [ib_isert]
>>>> [  394.476346]  ffff883f2fe38000 00000000f805705e ffff883f7fd03da8
>>>> ffffffff810ac8ff
>>>> [  394.476347]  0000000000000014 ffffffff81adb680 ffff883f7fd03dc0
>>>> ffffffff810af239
>>>> [  394.476348]  0000000000000015 ffff883f7fd03df0 ffffffff810e1cd0
>>>> ffff883f7fd17b80
>>>> [  394.476348] Call Trace:
>>>> [  394.476354]  <IRQ>  [<ffffffff810ac8ff>] sched_show_task+0xaf/0x110
>>>> [  394.476355]  [<ffffffff810af239>] dump_cpu_task+0x39/0x40
>>>> [  394.476357]  [<ffffffff810e1cd0>] rcu_dump_cpu_stacks+0x80/0xb0
>>>> [  394.476359]  [<ffffffff810e6100>] rcu_check_callbacks+0x540/0x820
>>>> [  394.476360]  [<ffffffff810afe11>] ? account_system_time+0x81/0x110
>>>> [  394.476363]  [<ffffffff810faa60>] ? tick_sched_do_timer+0x50/0x50
>>>> [  394.476364]  [<ffffffff810eb599>] update_process_times+0x39/0x60
>>>> [  394.476365]  [<ffffffff810fa815>] tick_sched_handle.isra.17+0x25/0x60
>>>> [  394.476366]  [<ffffffff810faa9d>] tick_sched_timer+0x3d/0x70
>>>> [  394.476368]  [<ffffffff810ec182>] __hrtimer_run_queues+0x102/0x290
>>>> [  394.476369]  [<ffffffff810ec668>] hrtimer_interrupt+0xa8/0x1a0
>>>> [  394.476372]  [<ffffffff81052c65>] local_apic_timer_interrupt+0x35/0x60
>>>> [  394.476374]  [<ffffffff8172423d>] smp_apic_timer_interrupt+0x3d/0x50
>>>> [  394.476376]  [<ffffffff817224f7>] apic_timer_interrupt+0x87/0x90
>>>> [  394.476379]  <EOI>  [<ffffffff810d71be>] ? console_unlock+0x41e/0x4e0
>>>> [  394.476380]  [<ffffffff810d757c>] vprintk_emit+0x2fc/0x500
>>>> [  394.476382]  [<ffffffff810d78ff>] vprintk_default+0x1f/0x30
>>>> [  394.476384]  [<ffffffff81174dde>] printk+0x5d/0x74
>>>> [  394.476388]  [<ffffffff814bce21>] transport_lookup_cmd_lun+0x1d1/0x200
>>>> [  394.476390]  [<ffffffff814ee8c0>] iscsit_setup_scsi_cmd+0x230/0x540
>>>> [  394.476392]  [<ffffffffa058dbf3>] isert_rx_do_work+0x3f3/0x7f0 [ib_isert]
>>>> [  394.476394]  [<ffffffffa058e174>] isert_cq_work+0x184/0x770 [ib_isert]
>>>> [  394.476396]  [<ffffffff8109740f>] process_one_work+0x14f/0x400
>>>> [  394.476397]  [<ffffffff81097c84>] worker_thread+0x114/0x470
>>>> [  394.476398]  [<ffffffff8171d32a>] ? __schedule+0x34a/0x7f0
>>>> [  394.476399]  [<ffffffff81097b70>] ? rescuer_thread+0x310/0x310
>>>> [  394.476400]  [<ffffffff8109d7c8>] kthread+0xd8/0xf0
>>>> [  394.476402]  [<ffffffff8109d6f0>] ? kthread_park+0x60/0x60
>>>> [  394.476403]  [<ffffffff81721a8f>] ret_from_fork+0x3f/0x70
>>>> [  394.476404]  [<ffffffff8109d6f0>] ? kthread_park+0x60/0x60
>>>> [  405.716632] Unexpected ret: -104 send data 360
>>>> [  405.721711] tx_data returned -32, expecting 360.
>>>> ----------------
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

When you combine this trace with the newest one, it really makes me
thing there is something of a bad interaction between the new drain cq
API and the iser/isert implementation to use said API.  Sagi, Christoph?

-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

  reply	other threads:[~2016-12-22 19:15 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-30 17:14 iscsi_trx going into D state Robert LeBlanc
     [not found] ` <CAANLjFoj9-qscJOSf2jtKYt2+4cQxMHNJ9q2QTey4wyG5OTSAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-10-04  7:55   ` Johannes Thumshirn
     [not found]     ` <20161004075545.j52mg3a2jckrchlp-qw2SdCWA0PpjqqEj2zc+bA@public.gmane.org>
2016-10-04  9:11       ` Hannes Reinecke
2016-10-04 11:46         ` Christoph Hellwig
     [not found]           ` <20161004114642.GA2377-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2016-10-04 16:39             ` Robert LeBlanc
2016-10-05 17:40           ` Robert LeBlanc
2016-10-05 18:03             ` Christoph Hellwig
2016-10-05 18:19               ` Robert LeBlanc
2016-10-08  2:59 ` Zhu Lingshan
2016-10-17 16:32   ` Robert LeBlanc
     [not found]     ` <CAANLjFobXiBO2tXxTBB-8BQjM8FC0wmxdxQvEd6Rp=1LZkrvpA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-10-17 19:03       ` Robert LeBlanc
2016-10-17 19:11       ` Robert LeBlanc
     [not found]         ` <CAANLjFoh+C8QE=qcPKqUUG3SnH2EMmS7DWZ5D4AD7yWMxoK0Zw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-10-18  3:06           ` Zhu Lingshan
     [not found]             ` <4fc72e32-26fb-96bd-8a0d-814eef712b43-IBi9RG/b67k@public.gmane.org>
2016-10-18  4:42               ` Robert LeBlanc
2016-10-18  7:05                 ` Nicholas A. Bellinger
2016-10-18  7:52                   ` Nicholas A. Bellinger
     [not found]                   ` <1476774332.8490.43.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2016-10-18 22:13                     ` Robert LeBlanc
     [not found]                       ` <CAANLjFqXt5r=c9F75vjeK=_zLa8zCS1priLuZo=A1ZSHKZ=1Bw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-10-19  6:25                         ` Nicholas A. Bellinger
     [not found]                           ` <1476858359.8490.97.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2016-10-19 16:41                             ` Robert LeBlanc
     [not found]                               ` <CAANLjFoGEi29goybqsvEg6trystEkurVz52P8SwqGUSNV1jdSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-10-29 22:29                                 ` Nicholas A. Bellinger
     [not found]                                   ` <1477780190.22703.47.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2016-10-31 16:34                                     ` Robert LeBlanc
     [not found]                                       ` <CAANLjFpkEVmO83r5YWh=hCnN=AUf9bvrrCyVJHc-=CRpc3P0vQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-11-04 21:57                                         ` Robert LeBlanc
     [not found]                                           ` <CAANLjFqoHuSq2SsNZ4J2uvAQGPg0F1tpxeJuAQT1oM1hXQ0wew-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-12 23:57                                             ` Robert LeBlanc
     [not found]                                               ` <CAANLjFpYT62G86w-r00+shJUyrPd68BS64y8f9OZemz_5kojzg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-15 20:38                                                 ` Robert LeBlanc
     [not found]                                                   ` <CAANLjFon+re7eMriFjnFfR-4SnzxR4LLSb2qcwhfkb7ODbuTwg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-21 23:39                                                     ` Robert LeBlanc
2016-12-22 19:15                                                       ` Doug Ledford [this message]
2016-12-27 20:22                                                         ` Robert LeBlanc
     [not found]                                                           ` <CAANLjFq2ib0H+W3RFVAdqvWF8_qDOkM5mvmAhVh0x4Usha2dOg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-27 20:58                                                             ` Robert LeBlanc
     [not found]                                                               ` <CAANLjFqRskoM7dn_zj_-V=uUb5KYq0OLLdLLuC4Uuba4+mq5Vw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-28 20:39                                                                 ` Robert LeBlanc
2016-12-28 20:58                                                                   ` Robert LeBlanc
     [not found]                                                                     ` <CAANLjFpbE9-B8qWtU5nDfg4+t+kD8TSVy0JOfN+zuFYsZ05_Dg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-29 21:23                                                                       ` Robert LeBlanc
     [not found]                                                                         ` <CAANLjFpEpJ4647u9R-7phf68fw--pOfThbp5Sntd4c7DdRSwwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-29 23:57                                                                           ` Robert LeBlanc
     [not found]                                                                             ` <CAANLjFooGrt51a9rOy8TKMyXyxBYmGEPm=h1YJm81Nj6YS=5yg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-30 23:07                                                                               ` Robert LeBlanc
     [not found]                                                                                 ` <CAANLjFrZrTPUuzP_NjkgG5h_YwwYKEWT-KzVjTvuXZ1d04z6Fg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-01-03 20:07                                                                                   ` Robert LeBlanc
     [not found]                                                                                     ` <CAANLjFpSnQ7ApOK5HDRHXQQeQNGWLUv4e+2N=_e-zBeziYm5tw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-01-04  0:11                                                                                       ` Robert LeBlanc
2017-01-06 17:06                                                                                         ` Laurence Oberman
2017-01-06 19:12                                                                                           ` Robert LeBlanc
2017-01-12 21:22                                                                                             ` Robert LeBlanc
2017-01-12 21:26                                                                                               ` Robert LeBlanc
2017-01-13 15:10                                                                                                 ` Laurence Oberman
     [not found]                                                                                                   ` <1449740553.15880491.1484320214006.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-01-13 23:38                                                                                                     ` Robert LeBlanc
     [not found]                                                                                                       ` <CAANLjFrFxasp6e=jWq4FwPFjRLgX-nwHc5n+eYRTz9EjTCAQ5g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-01-15 18:15                                                                                                         ` Laurence Oberman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6beeb461-0391-cb29-8cf7-74f4c8feda34@redhat.com \
    --to=dledford@redhat.com \
    --cc=hch@lst.de \
    --cc=linux-rdma@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=lszhu@suse.com \
    --cc=nab@linux-iscsi.org \
    --cc=robert@leblancnet.us \
    --cc=sagi@grimberg.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.