From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 15 Apr 2022 16:50:14 +0800
From: Yanjun Zhu
To: Bob Pearson, Yanjun Zhu, jgg@ziepe.ca, leon@kernel.org,
 linux-rdma@vger.kernel.org
Subject: Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
Message-ID:
In-Reply-To:
References: <20220415195630.279510-1-yanjun.zhu@linux.dev>
 <0d88246e-c29a-27c0-95c5-da73f52e6a59@linux.dev>
 <726e75b0-c165-92f8-c367-1a5a777bc8b1@gmail.com>
 <5597e6eb-1543-6895-77fc-426e0a601338@linux.dev>
 <72d96848-3414-b665-d800-759986540aed@gmail.com>
 <11ec6705-a810-6ec9-86f1-584d29b656c6@linux.dev>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Mailing-List: linux-rdma@vger.kernel.org

On 2022/4/15 15:44, Bob Pearson wrote:
> On 4/15/22 02:35, Bob Pearson wrote:
>> On 4/15/22 02:32, Yanjun Zhu wrote:
>>>
>>> On 2022/4/15 15:22, Bob Pearson wrote:
>>>> On 4/15/22 01:49, Yanjun Zhu wrote:
>>>>> On 2022/4/15 14:35, Bob Pearson wrote:
>>>>>> On 4/15/22 00:54, Yanjun Zhu wrote:
>>>>>>> On 2022/4/15 13:37, Bob Pearson wrote:
>>>>>>>> On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
>>>>>>>>> From: Zhu Yanjun
>>>>>>>>>
>>>>>>>>> This is a deadlock problem.
>>>>>>>>> The xa_lock is first acquired in this path:
>>>>>>>>>
>>>>>>>>> {SOFTIRQ-ON-W} state was registered at:
>>>>>>>>>
>>>>>>>>>      lock_acquire+0x1d2/0x5a0
>>>>>>>>>      _raw_spin_lock+0x33/0x80
>>>>>>>>>      __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
>>>>>>>>>      __ib_alloc_pd+0xf9/0x550 [ib_core]
>>>>>>>>>      ib_mad_init_device+0x2d9/0xd20 [ib_core]
>>>>>>>>>      add_client_context+0x2fa/0x450 [ib_core]
>>>>>>>>>      enable_device_and_get+0x1b7/0x350 [ib_core]
>>>>>>>>>      ib_register_device+0x757/0xaf0 [ib_core]
>>>>>>>>>      rxe_register_device+0x2eb/0x390 [rdma_rxe]
>>>>>>>>>      rxe_net_add+0x83/0xc0 [rdma_rxe]
>>>>>>>>>      rxe_newlink+0x76/0x90 [rdma_rxe]
>>>>>>>>>      nldev_newlink+0x245/0x3e0 [ib_core]
>>>>>>>>>      rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
>>>>>>>>>      rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
>>>>>>>>>      netlink_unicast+0x43b/0x640
>>>>>>>>>      netlink_sendmsg+0x7eb/0xc40
>>>>>>>>>      sock_sendmsg+0xe0/0x110
>>>>>>>>>      __sys_sendto+0x1d7/0x2b0
>>>>>>>>>      __x64_sys_sendto+0xdd/0x1b0
>>>>>>>>>      do_syscall_64+0x37/0x80
>>>>>>>>>      entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>>>> There is a separate xarray for each object pool. So this one is
>>>>>>>> rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>>>>>>>>
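For context, each rxe object pool embeds its own xarray, so each pool has its
own xa_lock instance. A simplified sketch of the layout (fields abridged and
illustrative, not the exact rxe_pool.h/rxe.h definitions):

#include <linux/xarray.h>

struct rxe_pool {
	struct rxe_dev	*rxe;
	struct xarray	xa;		/* maps index -> elem; guarded by xa.xa_lock */
	struct xa_limit	limit;		/* index range for this pool */
	u32		next;		/* cursor for cyclic index allocation */
	atomic_t	num_elem;
	/* ... more fields elided ... */
};

struct rxe_dev {
	/* ... */
	struct rxe_pool	pd_pool;	/* xa_lock taken in process context */
	struct rxe_pool	ah_pool;	/* xa_lock also taken from softirq context */
	/* ... */
};

Note that lockdep prints the lock class name "&xa->xa_lock#15" for both
acquisitions; the "#15" suffix only distinguishes classes that share a textual
name, so the same printed name does not by itself prove that the two traces
hit the same lock instance. That is the point of disagreement in this thread.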
>>>>>>>>> Then the xa_lock is acquired in this path:
>>>>>>>>>
>>>>>>>>> {IN-SOFTIRQ-W}:
>>>>>>>>>
>>>>>>>>> Call Trace:
>>>>>>>>>     <IRQ>
>>>>>>>>>      dump_stack_lvl+0x44/0x57
>>>>>>>>>      mark_lock.part.52.cold.79+0x3c/0x46
>>>>>>>>>      __lock_acquire+0x1565/0x34a0
>>>>>>>>>      lock_acquire+0x1d2/0x5a0
>>>>>>>>>      _raw_spin_lock_irqsave+0x42/0x90
>>>>>>>>>      rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
>>>>>>>>>      rxe_get_av+0x168/0x2a0 [rdma_rxe]
>>>>>>>>>      rxe_requester+0x75b/0x4a90 [rdma_rxe]
>>>>>>>>>      rxe_do_task+0x134/0x230 [rdma_rxe]
>>>>>>>>>      tasklet_action_common.isra.12+0x1f7/0x2d0
>>>>>>>>>      __do_softirq+0x1ea/0xa4c
>>>>>>>>>      run_ksoftirqd+0x32/0x60
>>>>>>>>>      smpboot_thread_fn+0x503/0x860
>>>>>>>>>      kthread+0x29b/0x340
>>>>>>>>>      ret_from_fork+0x1f/0x30
>>>>>>>> And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
>>>>>>>> in the process of sending a UD packet from a work request
>>>>>>>> which contains the index of the AH.
>>>>>>>>
>>>>>>>> For your story to work there needs to be another ah_pool.xa.xa_lock somewhere.
>>>>>>>> Let's assume it is there somewhere and it's from (a different) add_to_pool call;
>>>>>>>> then the add_to_pool routine should disable interrupts when it gets the lock
>>>>>>>> with spin_lock_xxx, but only for AH objects.
>>>>>>>>
>>>>>>>> This may be old news.
>>>>>>> What do you mean? Please check the call trace in the bug.
>>>>>> I mean the trace you show here shows an instance of xa_lock being
>>>>>> acquired from the pd pool, followed by an instance of xa_lock being
>>>>>> acquired from rxe_pool_get_index from the ah pool. They are different
>>>>>> locks. They can't deadlock against each other. So there must be
>>>>>> some other trace (not shown) that also gets xa_lock from the ah pool.
>>>>> Please check the bug report mail. The link is news://nntp.lore.kernel.org:119/CAHj4cs-MT13RiEsWXUAcX_H5jEtjsebuZgSeUcfptNVuELgjYQ@mail.gmail.com
>>>>>
>>>>> BTW, what is the update on the wr crash caused by your xarray patches?
>>>>>
>>>>> Zhu Yanjun
>>>>>
>>>>>>> Zhu Yanjun
>>>>>>>
>>>>>>>>>     </IRQ>
>>>>>>>>>
>>>>>>>>> From the above, xa_lock is acquired in the function __rxe_add_to_pool.
>>>>>>>>> Then __rxe_add_to_pool is interrupted by a softirq. The function
>>>>>>>>> rxe_pool_get_index also acquires xa_lock.
>>>>>>>>>
>>>>>>>>> Finally, the deadlock appears.
>>>>>>>>>
>>>>>>>>> [  296.806097]        CPU0
>>>>>>>>> [  296.808550]        ----
>>>>>>>>> [  296.811003]   lock(&xa->xa_lock#15);  <----- __rxe_add_to_pool
>>>>>>>>> [  296.814583]   <Interrupt>
>>>>>>>>> [  296.817209]     lock(&xa->xa_lock#15); <---- rxe_pool_get_index
>>>>>>>>> [  296.820961]
>>>>>>>>>                    *** DEADLOCK ***
>>>>>>>>>
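The scenario the splat describes is the classic process-context versus softirq
deadlock on a single CPU. A minimal, generic sketch of the pattern follows
(illustrative names only, not the rxe code), with the conventional fix noted
in the comments:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(pool_lock);

/* Process context, e.g. a verbs allocation path. */
static void add_to_pool_example(void)
{
	spin_lock(&pool_lock);		/* BAD: softirqs are still enabled here */
	/* ... insert an element ... */
	spin_unlock(&pool_lock);
}

/* Softirq context, e.g. a tasklet looking an object up by index. */
static void lookup_example(void)
{
	unsigned long flags;

	spin_lock_irqsave(&pool_lock, flags);
	/* ... look up an element ... */
	spin_unlock_irqrestore(&pool_lock, flags);
}

/*
 * If the softirq runs on the CPU that is already inside
 * add_to_pool_example() holding pool_lock, it spins forever and the
 * interrupted task never gets to release the lock. The fix is for the
 * process-context side to disable interrupts (or at least bottom halves)
 * while it holds the lock, which is what the patch does for the xarray's
 * built-in lock via xa_lock_irq()/xa_lock_irqsave() plus XA_FLAGS_LOCK_IRQ.
 */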
>>>>>>>>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
>>>>>>>>> Reported-and-tested-by: Yi Zhang
>>>>>>>>> Signed-off-by: Zhu Yanjun
>>>>>>>>> ---
>>>>>>>>> V3->V4: xa_lock_irq locks are used.
>>>>>>>>> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
>>>>>>>>>         GFP_ATOMIC is used in __rxe_add_to_pool.
>>>>>>>>> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
>>>>>>>>> ---
>>>>>>>>>  drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
>>>>>>>>>  1 file changed, 14 insertions(+), 6 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>>> index 87066d04ed18..f1f06dc7e64f 100644
>>>>>>>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>>> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>>>>>>>>
>>>>>>>>>      atomic_set(&pool->num_elem, 0);
>>>>>>>>>
>>>>>>>>> -    xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
>>>>>>>>> +    xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
>>>>>>>>>      pool->limit.min = info->min_index;
>>>>>>>>>      pool->limit.max = info->max_index;
>>>>>>>>>  }
>>>>>>>>> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>>>>>      elem->obj = obj;
>>>>>>>>>      kref_init(&elem->ref_cnt);
>>>>>>>>>
>>>>>>>>> -    err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>> -                  &pool->next, GFP_KERNEL);
>>>>>>>>> +    xa_lock_irq(&pool->xa);
>>>>>>>>> +    err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>> +                &pool->next, GFP_KERNEL);
>>>>>>>>> +    xa_unlock_irq(&pool->xa);
>>>>>>>>>      if (err)
>>>>>>>>>          goto err_free;
>>>>>>>>>
>>>>>>>>> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>>>>>  int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>>>>>  {
>>>>>>>>>      int err;
>>>>>>>>> +    unsigned long flags;
>>>>>>>>>
>>>>>>>>>      if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
>>>>>>>>>          return -EINVAL;
>>>>>>>>> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>>>>>      elem->obj = (u8 *)elem - pool->elem_offset;
>>>>>>>>>      kref_init(&elem->ref_cnt);
>>>>>>>>>
>>>>>>>>> -    err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>> -                  &pool->next, GFP_KERNEL);
>>>>>>>>> +    xa_lock_irqsave(&pool->xa, flags);
>>>>>>>>> +    err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>> +                &pool->next, GFP_ATOMIC);
>>>>>>>>> +    xa_unlock_irqrestore(&pool->xa, flags);
>>>>>>>>>      if (err)
>>>>>>>>>          goto err_cnt;
>>>>>>>>>
>>>>>>>>> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
>>>>>>>>>  {
>>>>>>>>>      struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>>>>>>>>>      struct rxe_pool *pool = elem->pool;
>>>>>>>>> +    unsigned long flags;
>>>>>>>>>
>>>>>>>>> -    xa_erase(&pool->xa, elem->index);
>>>>>>>>> +    xa_lock_irqsave(&pool->xa, flags);
>>>>>>>>> +    __xa_erase(&pool->xa, elem->index);
>>>>>>>>> +    xa_unlock_irqrestore(&pool->xa, flags);
>>>>>>>>>
>>>>>>>>>      if (pool->cleanup)
>>>>>>>>>          pool->cleanup(elem);
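For readers less familiar with the xarray interfaces used in the diff: the
double-underscore entry points (__xa_alloc_cyclic(), __xa_erase()) expect the
caller to already hold the array's internal xa_lock, here taken with the
IRQ-safe variants. A small self-contained sketch of the same pattern with
made-up names (demo_xa, demo_insert and demo_remove are hypothetical, not rxe
functions):

#include <linux/xarray.h>

/* Array that hands out indices; its lock may be taken in IRQ-off context. */
static DEFINE_XARRAY_FLAGS(demo_xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
static u32 demo_next;

/* Process context known to have interrupts enabled: plain _irq variant. */
static int demo_insert(void *obj, u32 *index)
{
	int err;

	xa_lock_irq(&demo_xa);
	/* 0 on success, 1 if the index wrapped, negative errno on failure. */
	err = __xa_alloc_cyclic(&demo_xa, index, obj, xa_limit_32b,
				&demo_next, GFP_ATOMIC);
	xa_unlock_irq(&demo_xa);

	return err;
}

/* May be reached from process or softirq context: save/restore IRQ state. */
static void demo_remove(u32 index)
{
	unsigned long flags;

	xa_lock_irqsave(&demo_xa, flags);
	__xa_erase(&demo_xa, index);
	xa_unlock_irqrestore(&demo_xa, flags);
}

XA_FLAGS_LOCK_IRQ records that the array's lock is acquired with interrupts
disabled, so the xarray code can pick the matching lock variant whenever it
has to drop and re-take the lock internally, for example while allocating
nodes.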
>>>> Here is my output. Everything passes; there are no bugs or unexpected warnings in the kernel trace.
>>>
>>> If I understand you correctly, you mean that the bug reported by Zhang Yi does not exist?
>
> Not any more on my rxe tree.
>>>
>>> I can reproduce this bug with rping.
>
> My tree does not cause this bug any more in rping. It used to, but it was fixed a few days ago.
> But there remains a fairly rare race condition in rping, which I described in a previous note, related
> to retry timeouts occurring for RDMA read operations. It is caused by a spurious retry timer firing
> and something wrong in the error path code that I am working on trying to isolate.

Hi, Bob

We should be on the same kernel tree. Based on Linux 5.18-rc2, there is a
wr NULL crash caused by the xarray patches.

Do you know the root cause of this problem, and can you fix it?
If yes, please show us the root cause and the fix.

Zhu Yanjun

>>>
>>> You cannot reproduce this bug, but that does not mean the bug does not exist.
>>>
>>> And with rping, I also found another wr NULL bug. From that mail, you can verify this wr NULL bug as well.
>>>
>>> Let us focus on this wr NULL bug, OK?
>>>
>>> Zhu Yanjun
>>>
>>>>
>>>> bob@ubuntu-21:~/src/blktests$ sudo ./check -q srp
>>>>
>>>> srp/001 (Create and remove LUNs)                             [passed]
>>>>      runtime  3.402s  ...  2.753s
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>>      runtime  34.431s  ...  34.328s
>>>> srp/003 (File I/O on top of multipath concurrently with logout and login (sq)) [not run]
>>>>      legacy device mapper support is missing
>>>> srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-mq)) [not run]
>>>>      legacy device mapper support is missing
>>>> srp/005 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>>>      runtime  14.332s  ...  12.919s
>>>> srp/006 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>>>      runtime  13.361s  ...  12.959s
>>>> srp/007 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=4M) [passed]
>>>>      runtime  14.293s  ...  12.912s
>>>> srp/008 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=8M) [passed]
>>>>      runtime  13.369s  ...  13.165s
>>>> srp/009 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>>>      runtime  13.636s  ...  14.201s
>>>> srp/010 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>>>      runtime  13.361s  ...  12.909s
>>>> srp/011 (Block I/O on top of multipath concurrently with logout and login) [passed]
>>>>      runtime  33.706s  ...  33.571s
>>>> srp/012 (dm-mpath on top of multiple I/O schedulers)         [passed]
>>>>      runtime  13.592s  ...  14.138s
>>>> srp/013 (Direct I/O using a discontiguous buffer)            [passed]
>>>>      runtime  3.230s  ...  3.513s
>>>> srp/014 (Run sg_reset while I/O is ongoing)                  [passed]
>>>>      runtime  33.070s  ...  33.059s
>>>> srp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) driver) [passed]
>>>
>>> You are using SoftiWARP (siw)?
>>
>> Not me. It is just the normal behavior of the srp/015 test case; it has always done that. My rdma-core
>> does support siw.
>>
>>>
>>>>
>>>>      runtime  35.148s  ...  34.974s
>>>>
>>>> bob@ubuntu-21:~/src/blktests$
>>>>
>>>> Bob
>>
>