All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: Leon Romanovsky <leon@kernel.org>
Cc: Doug Ledford <dledford@redhat.com>,
	Leon Romanovsky <leonro@mellanox.com>,
	<linux-rdma@vger.kernel.org>
Subject: Re: [PATCH rdma-next v1 03/10] RDMA/mlx5: Issue FW command to destroy SRQ on reentry
Date: Wed, 2 Sep 2020 21:31:15 -0300	[thread overview]
Message-ID: <20200903003115.GA1480685@nvidia.com> (raw)
In-Reply-To: <20200830084010.102381-4-leon@kernel.org>

On Sun, Aug 30, 2020 at 11:40:03AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
> 
> The HW release can fail and leave the system in limbo state,
> where SRQ is removed from the table, but can't be destroyed later.
> In every reentry, the initial xa_erase_irq() check will fail.
> 
> Rewrite the erase logic to keep index, but don't store the entry
> itself. By doing it, we can safely reinsert entry back in the case
> of destroy failure and be safe from any xa_store_irq() error.
> 
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
>  drivers/infiniband/hw/mlx5/srq_cmd.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/mlx5/srq_cmd.c b/drivers/infiniband/hw/mlx5/srq_cmd.c
> index 37aaacebd3f2..c6d807f04d9d 100644
> +++ b/drivers/infiniband/hw/mlx5/srq_cmd.c
> @@ -596,13 +596,22 @@ void mlx5_cmd_destroy_srq(struct mlx5_ib_dev *dev, struct mlx5_core_srq *srq)
>  	struct mlx5_core_srq *tmp;
>  	int err;
>  
> -	tmp = xa_erase_irq(&table->array, srq->srqn);
> -	if (!tmp || tmp != srq)
> +	/* Delete entry, but leave index occupied */
> +	tmp = xa_store_irq(&table->array, srq->srqn, NULL, 0);
> +	if (WARN_ON(!tmp || tmp != srq))
>  		return;

This isn't an allocating xarray:

	xa_init_flags(&table->array, XA_FLAGS_LOCK_IRQ);

So storing NULL actually does delete the entry and clean up the memory
and the store below could fail.

I think this should be written as

   xa_cmpxchg_irq(&table->array, srq->srqn, srq, XA_ZERO_ENTRY, 0);

And the undo below would be

   xa_cmpxchg_irq(&table->array, srq->srqn, XA_ZERO_ENTRY, srq 0);

> +	xa_erase_irq(&table->array, srq->srqn);

And this is racy since the FW could have reallocated the same srqn and
already set it in the xarray.

It needs to be xa_release_irq(), which looks like it needs to be
added to match xa_reserve_irq()

Jason

  reply	other threads:[~2020-09-03  0:31 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-30  8:40 [PATCH rdma-next v1 00/10] Restore failure of destroy commands Leon Romanovsky
2020-08-30  8:40 ` [PATCH rdma-next v1 01/10] RDMA: Restore ability to fail on PD deallocate Leon Romanovsky
2020-08-30  8:40 ` [PATCH rdma-next v1 02/10] RDMA: Restore ability to fail on AH destroy Leon Romanovsky
2020-08-30  8:40 ` [PATCH rdma-next v1 03/10] RDMA/mlx5: Issue FW command to destroy SRQ on reentry Leon Romanovsky
2020-09-03  0:31   ` Jason Gunthorpe [this message]
2020-09-03  5:08     ` Leon Romanovsky
2020-09-03 11:54       ` Jason Gunthorpe
2020-08-30  8:40 ` [PATCH rdma-next v1 04/10] RDMA/mlx5: Fix potential race between destroy and CQE poll Leon Romanovsky
2020-09-03 13:42   ` Jason Gunthorpe
2020-08-30  8:40 ` [PATCH rdma-next v1 05/10] RDMA: Restore ability to fail on SRQ destroy Leon Romanovsky
2020-09-03  0:08   ` Jason Gunthorpe
2020-09-03  5:11     ` Leon Romanovsky
2020-09-03 11:55       ` Jason Gunthorpe
2020-09-03  0:18   ` Jason Gunthorpe
2020-09-03  5:28     ` Leon Romanovsky
2020-09-03 12:22       ` Jason Gunthorpe
2020-09-03 13:12         ` Jason Gunthorpe
2020-08-30  8:40 ` [PATCH rdma-next v1 06/10] RDMA/core: Delete function indirection for alloc/free kernel CQ Leon Romanovsky
2020-09-03  0:20   ` Jason Gunthorpe
2020-09-03  5:35     ` Leon Romanovsky
2020-09-03 12:24       ` Jason Gunthorpe
2020-08-30  8:40 ` [PATCH rdma-next v1 07/10] RDMA: Allow fail of destroy CQ Leon Romanovsky
2020-08-30  8:40 ` [PATCH rdma-next v1 08/10] RDMA: Change XRCD destroy return value Leon Romanovsky
2020-08-30  8:40 ` [PATCH rdma-next v1 09/10] RDMA: Restore ability to return error for destroy WQ Leon Romanovsky
2020-08-30  8:40 ` [PATCH rdma-next v1 10/10] RDMA: Make counters destroy symmetrical Leon Romanovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200903003115.GA1480685@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=dledford@redhat.com \
    --cc=leon@kernel.org \
    --cc=leonro@mellanox.com \
    --cc=linux-rdma@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.