From: Jason Gunthorpe <jgg@nvidia.com>
To: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: <dledford@redhat.com>, <linux-rdma@vger.kernel.org>
Subject: Re: [PATCH for-rc] IB/hfi1: Correct an interlock issue for TID RDMA WRITE request
Date: Tue, 18 Aug 2020 13:34:02 -0300
Message-ID: <20200818163402.GD1990081@nvidia.com>
In-Reply-To: <20200811174931.191210.84093.stgit@awfm-01.aw.intel.com>

On Tue, Aug 11, 2020 at 01:49:31PM -0400, Mike Marciniszyn wrote:
> From: Kaike Wan <kaike.wan@intel.com>
> 
> The following message occurs when running an AI application
> with TID RDMA enabled:
> 
> hfi1 0000:7f:00.0: hfi1_0: [QP74] hfi1_tid_timeout 4084
> hfi1 0000:7f:00.0: hfi1_0: [QP70] hfi1_tid_timeout 4084
> 
> The issue happens when a TID RDMA WRITE request is followed by an
> IB_WR_RDMA_WRITE_WITH_IMM request: the latter can be completed
> first on the responder side. As a result, no ACK packet for the
> latter can be sent because the TID RDMA WRITE request is still
> being processed on the responder side.
> 
> When the TID RDMA WRITE request is eventually completed, the requester
> will wait for the IB_WR_RDMA_WRITE_WITH_IMM request to be acknowledged.
> 
> If the next request is another TID RDMA WRITE request, no
> TID RDMA WRITE DATA packet could be sent because the preceding
> IB_WR_RDMA_WRITE_WITH_IMM request is not completed yet.
> 
> Consequently, the IB_WR_RDMA_WRITE_WITH_IMM request will be retried,
> but it will be ignored on the responder side because the responder
> thinks it has already been completed. Eventually the retries will
> be exhausted and the QP will be put into the error state on the
> requester side. On the responder side, the TID resource timer will
> eventually expire because no TID RDMA WRITE DATA packets will be
> received for the second TID RDMA WRITE request. There is also a risk
> of write-after-write memory corruption due to the issue.
> 
> Fix by adding a requester-side interlock to prevent any potential
> data corruption and TID RDMA protocol errors.
> 
> Fixes: a0b34f75ec20 ("IB/hfi1: Add interlock between a TID RDMA request and other requests")
> Cc: <stable@vger.kernel.org> # 5.4.x+
> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Signed-off-by: Kaike Wan <kaike.wan@intel.com>
> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
> ---
>  drivers/infiniband/hw/hfi1/tid_rdma.c |    1 +
>  1 file changed, 1 insertion(+)
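
For readers following the thread, the requester-side interlock described in the commit message can be pictured roughly as below. This is only a minimal sketch and not the patch itself: the `sketch_*` types, opcodes, and functions are hypothetical names invented for illustration, and the segment counters are an assumed way of tracking whether a TID RDMA WRITE is fully acknowledged. The idea shown is that a plain or immediate RDMA WRITE is held back while the preceding TID RDMA WRITE still has unacknowledged segments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical opcode set for this sketch; not the driver's enum. */
enum sketch_opcode {
	SKETCH_RDMA_WRITE,
	SKETCH_RDMA_WRITE_WITH_IMM,
	SKETCH_TID_RDMA_WRITE,
};

/* Hypothetical send-queue entry (assumed fields, illustration only). */
struct sketch_wqe {
	enum sketch_opcode opcode;
	unsigned int total_segs; /* segments in a TID RDMA WRITE */
	unsigned int acked_segs; /* segments acknowledged so far */
};

/* True if wqe is a TID RDMA WRITE with unacknowledged segments. */
static bool sketch_tid_write_pending(const struct sketch_wqe *wqe)
{
	assert(wqe->acked_segs <= wqe->total_segs);
	return wqe->opcode == SKETCH_TID_RDMA_WRITE &&
	       wqe->acked_segs != wqe->total_segs;
}

/*
 * True if the requester should hold back `cur` because the previous
 * request `prev` is a TID RDMA WRITE that is not yet fully acknowledged.
 * Only the write-after-TID-write direction is modeled here.
 */
static bool sketch_must_interlock(const struct sketch_wqe *prev,
				  const struct sketch_wqe *cur)
{
	if (prev == NULL)
		return false;

	switch (cur->opcode) {
	case SKETCH_RDMA_WRITE:
	case SKETCH_RDMA_WRITE_WITH_IMM:
		return sketch_tid_write_pending(prev);
	default:
		return false;
	}
}
```

In this model, a send engine would call `sketch_must_interlock()` before posting each request and defer the request while it returns true, so the responder never sees a write with immediate data overtake an in-flight TID RDMA WRITE.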

Applied to for-rc, thanks

Jason
