linux-nvme.lists.infradead.org archive mirror
From: Hannes Reinecke <hare@suse.de>
To: Sagi Grimberg <sagi@grimberg.me>,
	Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: "linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
Subject: Re: blktests failures with v6.4
Date: Thu, 13 Jul 2023 10:41:39 +0200	[thread overview]
Message-ID: <142962ad-55fc-d193-de7a-949f290d3fdf@suse.de> (raw)
In-Reply-To: <152f0684-4bcd-699f-e0e3-40189be4b80a@grimberg.me>

On 7/13/23 09:48, Sagi Grimberg wrote:
> 
>>>> #3: nvme/003 (fabrics transport)
>>>>
>>>>      When nvme test group is run with trtype=rdma or tcp, the test 
>>>> case fails
>>>>      due to lockdep WARNING "possible circular locking dependency 
>>>> detected".
>>>>      Reported in May/2023. Bart suggested a fix for trtype=rdma [4] but it
>>>>      needs more discussion.
>>>>
>>>>      [4] 
>>>> https://lore.kernel.org/linux-nvme/20230511150321.103172-1-bvanassche@acm.org/
>>>
>>> This patch is unfortunately incorrect and buggy.
>>>
>>> This will likely make the issue go away, but adds another
>>> old issue where a client can DDOS a target by bombarding it
>>> with connect/disconnect. When releases are async and we don't
>>> have any back-pressure, it is likely to happen.
>>> -- 
>>> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
>>> index 4597bca43a6d..8b4f4aa48206 100644
>>> --- a/drivers/nvme/target/rdma.c
>>> +++ b/drivers/nvme/target/rdma.c
>>> @@ -1582,11 +1582,6 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>>>                  goto put_device;
>>>          }
>>>
>>> -       if (queue->host_qid == 0) {
>>> -               /* Let inflight controller teardown complete */
>>> -               flush_workqueue(nvmet_wq);
>>> -       }
>>> -
>>>          ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>>>          if (ret) {
>>>                  /*
>>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>>> index 868aa4de2e4c..c8cfa19e11c7 100644
>>> --- a/drivers/nvme/target/tcp.c
>>> +++ b/drivers/nvme/target/tcp.c
>>> @@ -1844,11 +1844,6 @@ static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
>>>          struct nvmet_tcp_queue *queue =
>>>                  container_of(sq, struct nvmet_tcp_queue, nvme_sq);
>>>
>>> -       if (sq->qid == 0) {
>>> -               /* Let inflight controller teardown complete */
>>> -               flush_workqueue(nvmet_wq);
>>> -       }
>>> -
>>>          queue->nr_cmds = sq->size * 2;
>>>          if (nvmet_tcp_alloc_cmds(queue))
>>>                  return NVME_SC_INTERNAL;
>>> -- 
>>
>> Thanks Sagi, I tried the patch above and confirmed the lockdep WARN 
>> disappears
>> for both rdma and tcp. It indicates that the flush_workqueue(nvmet_wq)
>> introduced the circular lock dependency.
> 
> Thanks for confirming. This was expected.
> 
>> I also found the two commits below
>> record why the flush_workqueue(nvmet_wq) was introduced.
>>
>>   777dc82395de ("nvmet-rdma: occasionally flush ongoing controller teardown")
>>   8832cf922151 ("nvmet: use a private workqueue instead of the system workqueue")
> 
> The second patch is unrelated, before we used a global workqueue and
> fundamentally had the same issue.
> 
>> The left question is how to avoid both the connect/disconnect 
>> bombarding DDOS
>> and the circular lock possibility related to the nvmet_wq completion.
> 
> I don't see any way to synchronize connects with releases without moving
> connect sequences to a dedicated thread, which in my mind is undesirable.
> 
> The only solution I can think of is to fail a host connect expecting the
> host to reconnect and throttle this way, but that would lead to spurious
> connect failures (at least from the host PoV).
> 
> Maybe we can add a NOT_READY connect error code in nvme for that...
> 
You know, I have been seeing the very same lockdep warning during TLS 
testing; wasn't sure, though, if it's a generic issue or something I've 
introduced with the TLS logic.

I guess we can solve this by adding another NVMET_TCP_Q_INIT state 
before NVMET_TCP_Q_CONNECTING; then we can flip the state from INIT to 
CONNECTING in nvmet_tcp_alloc_queue():

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index ed98df72c76b..e6f699a44128 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -121,6 +121,7 @@ struct nvmet_tcp_cmd {
  };

  enum nvmet_tcp_queue_state {
+       NVMET_TCP_Q_INIT,
         NVMET_TCP_Q_CONNECTING,
         NVMET_TCP_Q_LIVE,
         NVMET_TCP_Q_DISCONNECTING,
@@ -1274,7 +1275,8 @@ static int nvmet_tcp_try_recv(struct nvmet_tcp_queue *queue,
  static void nvmet_tcp_schedule_release_queue(struct nvmet_tcp_queue *queue)
  {
         spin_lock(&queue->state_lock);
-       if (queue->state != NVMET_TCP_Q_DISCONNECTING) {
+       if (queue->state != NVMET_TCP_Q_INIT &&
+           queue->state != NVMET_TCP_Q_DISCONNECTING) {
                 queue->state = NVMET_TCP_Q_DISCONNECTING;
                 queue_work(nvmet_wq, &queue->release_work);
         }
@@ -1625,7 +1627,7 @@ static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
         queue->port = port;
         queue->nr_cmds = 0;
         spin_lock_init(&queue->state_lock);
-       queue->state = NVMET_TCP_Q_CONNECTING;
+       queue->state = NVMET_TCP_Q_INIT;
         INIT_LIST_HEAD(&queue->free_list);
         init_llist_head(&queue->resp_list);
         INIT_LIST_HEAD(&queue->resp_send_list);
@@ -1832,10 +1834,12 @@ static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
         struct nvmet_tcp_queue *queue =
                 container_of(sq, struct nvmet_tcp_queue, nvme_sq);

-       if (sq->qid == 0) {
-               /* Let inflight controller teardown complete */
-               flush_workqueue(nvmet_wq);
+       spin_lock(&queue->state_lock);
+       if (queue->state != NVMET_TCP_Q_INIT) {
+               spin_unlock(&queue->state_lock);
+               return NVME_SC_NOT_READY;
         }
+       spin_unlock(&queue->state_lock);

         queue->nr_cmds = sq->size * 2;
         if (nvmet_tcp_alloc_cmds(queue))

With that we'll return 'not ready' whenever we hit this condition, but 
that should be fine as we would've crashed anyway with the old code.

Hmm?

Cheers,

Hannes


Thread overview: 6+ messages
2023-07-07  7:27 blktests failures with v6.4 Shinichiro Kawasaki
2023-07-09 14:32 ` Sagi Grimberg
2023-07-13  1:22   ` Shinichiro Kawasaki
2023-07-13  7:48     ` Sagi Grimberg
2023-07-13  8:41       ` Hannes Reinecke [this message]
2023-07-13 10:16         ` Sagi Grimberg
