From: sagi@grimberg.me (Sagi Grimberg)
Subject: [PATCH v1 0/5] Fix keep-alive mechanism for fabrics
Date: Wed, 11 Apr 2018 16:04:52 +0300
Message-ID: <512640d9-9e1d-5e52-d7be-5d8203381023@grimberg.me>
In-Reply-To: <1523380689-17151-1-git-send-email-maxg@mellanox.com>


> Hi all,
> I've been debugging the KA mechanism lately and found a lack of
> coordination between the target and host implementations.
> 
> Johannes,
> Sorry for reverting your commit - I'll use nvme_start_keep_alive
> for my fix.
> 
> I've noticed that there is no clear definition in the NVMe spec of
> what the keep-alive mechanism is associated with. IMO, it should be
> a property of the admin queue and should be triggered as soon as
> the admin queue is configured successfully.
> 
> Idan/Christoph/Sagi,
> Any thoughts on this proposal?

I don't understand what you mean by a "property of the admin queue";
there is no such definition in the spec. In any event, the keep-alive
mechanism was defined to detect host/controller health on a periodic
basis.

The spec clearly states that the host needs to send keep-alives at
a faster rate than the keep-alive timeout, accounting for things such
as transport round-trip times, transport delays, keep-alive timer
granularity and so on.
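
To put rough numbers on that budget, here is a standalone sketch of
the arithmetic (the halving policy and all values are illustrative,
not taken from the spec):

#include <stdio.h>

int main(void)
{
	unsigned int kato_ms = 15000;	/* negotiated Keep Alive Timeout */
	unsigned int rtt_ms  = 200;	/* worst-case transport round trip */
	unsigned int gran_ms = 1000;	/* keep-alive timer granularity */

	/*
	 * Send at half the remaining budget so a single late or lost
	 * keep-alive does not immediately blow through the timeout.
	 */
	unsigned int send_ms = (kato_ms - rtt_ms - gran_ms) / 2;

	printf("send a keep-alive every %u ms\n", send_ms);
	return 0;
}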

To me, this is a clear case where your use-case requires a larger
keep-alive timeout. I'm not sure how sprinkling the keep-alive
execution around solves or helps anything. To me this looks completely
redundant (sorry).

We could theoretically make NVME_KATO_GRACE configurable, but I
seriously doubt it's useful at all...
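
For reference, the grace is currently a fixed pad: the host advertises
kato + NVME_KATO_GRACE (in milliseconds) to the controller at Connect
time while arming its own keep-alive timer at the shorter kato period.
A standalone demo of that arithmetic (the kato value is made up; this
is a paraphrase, not the kernel code itself):

#include <stdio.h>

#define NVME_KATO_GRACE	10	/* seconds, as in the host driver */

int main(void)
{
	unsigned int kato = 15;	/* host keep-alive period, seconds */

	/* The controller's timeout gets the grace folded in. */
	unsigned int advertised_ms = (kato + NVME_KATO_GRACE) * 1000;

	printf("host sends a KA every %u s, controller expires after %u ms\n",
	       kato, advertised_ms);
	return 0;
}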

As a side note, this should be solved by the "Traffic Based Keep
Alive" TP, which is approaching ratification. This is yet another
example showing that having the keep-alive timer fire regardless of
command execution is hurtful.
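
The gist of that TP, sketched with made-up names (this shows the idea
only, not the TP's wording or any kernel implementation):

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct ka_state {
	time_t last_traffic;	/* last command completion seen */
	time_t kato;		/* keep-alive timeout, seconds */
};

/*
 * Called from the periodic keep-alive timer: recent I/O already
 * proved both sides alive, so only send an explicit Keep Alive
 * command when the association has been idle.
 */
static bool need_explicit_keep_alive(const struct ka_state *ka, time_t now)
{
	return now - ka->last_traffic >= ka->kato / 2;
}

int main(void)
{
	struct ka_state ka = { .last_traffic = time(NULL), .kato = 15 };

	printf("need explicit KA: %d\n",
	       need_explicit_keep_alive(&ka, time(NULL)));
	return 0;
}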

> This patchset was tested using the RDMA transport only:
> I created 20 subsystems with 5 namespaces per subsystem and exposed
> them all through 8 portals (160 controllers created in total) on 1
> target. I used 1 initiator (host) and connected successfully.
> Later on I destroyed the target and caused a reconnection flow on
> the initiator side.
> After ~30-50 seconds, I configured the target again, but the
> initiator couldn't reconnect to it (after many retries).
> The reason for this was that the keep-alive timer expired on the
> target side, causing a controller fatal error, and the I/O queue
> connect then failed to find the controller. This loop never
> converged.
> 
> With the patches below, the test passed successfully after 1-2
> reconnection attempts.

Hence my comment above. As I said, I think you should either increase
the keep-alive timeout for this use-case, or wait for TBKAS to be
ratified.
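
For completeness, the timeout is already tunable per association via
the keep_alive_tmo fabrics option. A userspace sketch of requesting a
larger value through /dev/nvme-fabrics (the transport address and NQN
below are placeholders):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Placeholder address/NQN; keep_alive_tmo is the knob here. */
	const char *opts =
		"transport=rdma,traddr=192.168.1.1,trsvcid=4420,"
		"nqn=testnqn,keep_alive_tmo=60";
	int fd = open("/dev/nvme-fabrics", O_RDWR);

	if (fd < 0) {
		perror("open /dev/nvme-fabrics");
		return 1;
	}
	if (write(fd, opts, strlen(opts)) < 0)
		perror("connect");
	close(fd);
	return 0;
}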


Thread overview: 23+ messages
2018-04-10 17:18 [PATCH v1 0/5] Fix keep-alive mechanism for fabrics Max Gurtovoy
2018-04-10 17:18 ` [PATCH v1 1/5] Revert "nvme: unexport nvme_start_keep_alive" Max Gurtovoy
2018-04-10 17:18 ` [PATCH v1 2/5] nvme: remove association between ctrl and keep-alive Max Gurtovoy
2018-04-13 17:01   ` Christoph Hellwig
2018-04-10 17:18 ` [PATCH v1 3/5] nvme-rdma: add keep-alive mechanism as admin_q property Max Gurtovoy
2018-04-10 17:18 ` [PATCH v1 4/5] nvme-fc: " Max Gurtovoy
2018-04-10 17:18 ` [PATCH 5/5] nvme-loop: " Max Gurtovoy
2018-04-11 13:04 ` Sagi Grimberg [this message]
2018-04-11 13:38   ` [PATCH v1 0/5] Fix keep-alive mechanism for fabrics Max Gurtovoy
2018-04-11 14:07     ` Sagi Grimberg
2018-04-11 14:40       ` Max Gurtovoy
2018-04-11 16:48         ` James Smart
2018-04-12  8:49           ` Max Gurtovoy
2018-04-12 12:34             ` Sagi Grimberg
2018-04-12 17:28               ` James Smart
2018-04-12 12:29           ` Sagi Grimberg
2018-04-12 13:32             ` Max Gurtovoy
2018-04-12 15:17               ` Sagi Grimberg
2018-04-12 16:43                 ` Max Gurtovoy
2018-04-12 12:14         ` Sagi Grimberg
2018-04-12 12:18           ` Max Gurtovoy
2018-04-12 12:31             ` Sagi Grimberg
2018-04-13 17:06 ` Christoph Hellwig
