* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 19:56 [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem yanjun.zhu
@ 2022-04-15 5:37 ` Bob Pearson
2022-04-15 5:54 ` Yanjun Zhu
` (2 more replies)
2022-04-15 19:56 ` [PATCH 2/2] RDMA/rxe: Use different xa locks on different path yanjun.zhu
1 sibling, 3 replies; 17+ messages in thread
From: Bob Pearson @ 2022-04-15 5:37 UTC (permalink / raw)
To: yanjun.zhu, jgg, leon, linux-rdma
On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>
> This is a dead lock problem.
> The xa_lock first is acquired in this:
>
> {SOFTIRQ-ON-W} state was registered at:
>
> lock_acquire+0x1d2/0x5a0
> _raw_spin_lock+0x33/0x80
> __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
> __ib_alloc_pd+0xf9/0x550 [ib_core]
> ib_mad_init_device+0x2d9/0xd20 [ib_core]
> add_client_context+0x2fa/0x450 [ib_core]
> enable_device_and_get+0x1b7/0x350 [ib_core]
> ib_register_device+0x757/0xaf0 [ib_core]
> rxe_register_device+0x2eb/0x390 [rdma_rxe]
> rxe_net_add+0x83/0xc0 [rdma_rxe]
> rxe_newlink+0x76/0x90 [rdma_rxe]
> nldev_newlink+0x245/0x3e0 [ib_core]
> rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
> rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
> netlink_unicast+0x43b/0x640
> netlink_sendmsg+0x7eb/0xc40
> sock_sendmsg+0xe0/0x110
> __sys_sendto+0x1d7/0x2b0
> __x64_sys_sendto+0xdd/0x1b0
> do_syscall_64+0x37/0x80
> entry_SYSCALL_64_after_hwframe+0x44/0xae
There is a separate xarray for each object pool. So this one is
rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>
> Then xa_lock is acquired in this:
>
> {IN-SOFTIRQ-W}:
>
> Call Trace:
> <TASK>
> dump_stack_lvl+0x44/0x57
> mark_lock.part.52.cold.79+0x3c/0x46
> __lock_acquire+0x1565/0x34a0
> lock_acquire+0x1d2/0x5a0
> _raw_spin_lock_irqsave+0x42/0x90
> rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
> rxe_get_av+0x168/0x2a0 [rdma_rxe]
> rxe_requester+0x75b/0x4a90 [rdma_rxe]
> rxe_do_task+0x134/0x230 [rdma_rxe]
> tasklet_action_common.isra.12+0x1f7/0x2d0
> __do_softirq+0x1ea/0xa4c
> run_ksoftirqd+0x32/0x60
> smpboot_thread_fn+0x503/0x860
> kthread+0x29b/0x340
> ret_from_fork+0x1f/0x30
And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
in the process of sending a UD packet from a work request
which contains the index of the ah.
For your story to work there needs to be an another ah_pool.xa.xa_lock somewhere.
Let's assume it is there somewhere and it's from (a different) add_to_pool call
then the add_to_pool_ routine should disable interrupts when it gets the lock
with spin_lock_xxx. But only for AH objects.
This may be old news.
> </TASK>
>
> From the above, in the function __rxe_add_to_pool,
> xa_lock is acquired. Then the function __rxe_add_to_pool
> is interrupted by softirq. The function
> rxe_pool_get_index will also acquire xa_lock.
>
> Finally, the dead lock appears.
>
> [ 296.806097] CPU0
> [ 296.808550] ----
> [ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
> [ 296.814583] <Interrupt>
> [ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
> [ 296.820961]
> *** DEADLOCK ***
>
> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
> Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
> ---
> V3->V4: xa_lock_irq locks are used.
> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
> GFP_ATOMIC is used in __rxe_add_to_pool.
> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
> ---
> drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
> 1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
> index 87066d04ed18..f1f06dc7e64f 100644
> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>
> atomic_set(&pool->num_elem, 0);
>
> - xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
> + xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
> pool->limit.min = info->min_index;
> pool->limit.max = info->max_index;
> }
> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
> elem->obj = obj;
> kref_init(&elem->ref_cnt);
>
> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
> - &pool->next, GFP_KERNEL);
> + xa_lock_irq(&pool->xa);
> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
> + &pool->next, GFP_KERNEL);
> + xa_unlock_irq(&pool->xa);
> if (err)
> goto err_free;
>
> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
> int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
> {
> int err;
> + unsigned long flags;
>
> if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
> return -EINVAL;
> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
> elem->obj = (u8 *)elem - pool->elem_offset;
> kref_init(&elem->ref_cnt);
>
> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
> - &pool->next, GFP_KERNEL);
> + xa_lock_irqsave(&pool->xa, flags);
> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
> + &pool->next, GFP_ATOMIC);
> + xa_unlock_irqrestore(&pool->xa, flags);
> if (err)
> goto err_cnt;
>
> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
> {
> struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
> struct rxe_pool *pool = elem->pool;
> + unsigned long flags;
>
> - xa_erase(&pool->xa, elem->index);
> + xa_lock_irqsave(&pool->xa, flags);
> + __xa_erase(&pool->xa, elem->index);
> + xa_unlock_irqrestore(&pool->xa, flags);
>
> if (pool->cleanup)
> pool->cleanup(elem);
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 5:37 ` Bob Pearson
@ 2022-04-15 5:54 ` Yanjun Zhu
2022-04-15 6:35 ` Bob Pearson
2022-04-15 10:25 ` Yanjun Zhu
2022-04-15 15:19 ` Yanjun Zhu
2 siblings, 1 reply; 17+ messages in thread
From: Yanjun Zhu @ 2022-04-15 5:54 UTC (permalink / raw)
To: Bob Pearson, jgg, leon, linux-rdma
在 2022/4/15 13:37, Bob Pearson 写道:
> On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>
>> This is a dead lock problem.
>> The xa_lock first is acquired in this:
>>
>> {SOFTIRQ-ON-W} state was registered at:
>>
>> lock_acquire+0x1d2/0x5a0
>> _raw_spin_lock+0x33/0x80
>> __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
>> __ib_alloc_pd+0xf9/0x550 [ib_core]
>> ib_mad_init_device+0x2d9/0xd20 [ib_core]
>> add_client_context+0x2fa/0x450 [ib_core]
>> enable_device_and_get+0x1b7/0x350 [ib_core]
>> ib_register_device+0x757/0xaf0 [ib_core]
>> rxe_register_device+0x2eb/0x390 [rdma_rxe]
>> rxe_net_add+0x83/0xc0 [rdma_rxe]
>> rxe_newlink+0x76/0x90 [rdma_rxe]
>> nldev_newlink+0x245/0x3e0 [ib_core]
>> rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
>> rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
>> netlink_unicast+0x43b/0x640
>> netlink_sendmsg+0x7eb/0xc40
>> sock_sendmsg+0xe0/0x110
>> __sys_sendto+0x1d7/0x2b0
>> __x64_sys_sendto+0xdd/0x1b0
>> do_syscall_64+0x37/0x80
>> entry_SYSCALL_64_after_hwframe+0x44/0xae
> There is a separate xarray for each object pool. So this one is
> rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>
>> Then xa_lock is acquired in this:
>>
>> {IN-SOFTIRQ-W}:
>>
>> Call Trace:
>> <TASK>
>> dump_stack_lvl+0x44/0x57
>> mark_lock.part.52.cold.79+0x3c/0x46
>> __lock_acquire+0x1565/0x34a0
>> lock_acquire+0x1d2/0x5a0
>> _raw_spin_lock_irqsave+0x42/0x90
>> rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
>> rxe_get_av+0x168/0x2a0 [rdma_rxe]
>> rxe_requester+0x75b/0x4a90 [rdma_rxe]
>> rxe_do_task+0x134/0x230 [rdma_rxe]
>> tasklet_action_common.isra.12+0x1f7/0x2d0
>> __do_softirq+0x1ea/0xa4c
>> run_ksoftirqd+0x32/0x60
>> smpboot_thread_fn+0x503/0x860
>> kthread+0x29b/0x340
>> ret_from_fork+0x1f/0x30
> And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
> in the process of sending a UD packet from a work request
> which contains the index of the ah.
>
> For your story to work there needs to be an another ah_pool.xa.xa_lock somewhere.
> Let's assume it is there somewhere and it's from (a different) add_to_pool call
> then the add_to_pool_ routine should disable interrupts when it gets the lock
> with spin_lock_xxx. But only for AH objects.
>
> This may be old news.
What do you mean? Please check the call trace in the bug.
Zhu Yanjun
>
>> </TASK>
>>
>> From the above, in the function __rxe_add_to_pool,
>> xa_lock is acquired. Then the function __rxe_add_to_pool
>> is interrupted by softirq. The function
>> rxe_pool_get_index will also acquire xa_lock.
>>
>> Finally, the dead lock appears.
>>
>> [ 296.806097] CPU0
>> [ 296.808550] ----
>> [ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
>> [ 296.814583] <Interrupt>
>> [ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
>> [ 296.820961]
>> *** DEADLOCK ***
>>
>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
>> Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>> ---
>> V3->V4: xa_lock_irq locks are used.
>> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
>> GFP_ATOMIC is used in __rxe_add_to_pool.
>> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
>> ---
>> drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>> index 87066d04ed18..f1f06dc7e64f 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>
>> atomic_set(&pool->num_elem, 0);
>>
>> - xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
>> + xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
>> pool->limit.min = info->min_index;
>> pool->limit.max = info->max_index;
>> }
>> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
>> elem->obj = obj;
>> kref_init(&elem->ref_cnt);
>>
>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>> - &pool->next, GFP_KERNEL);
>> + xa_lock_irq(&pool->xa);
>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>> + &pool->next, GFP_KERNEL);
>> + xa_unlock_irq(&pool->xa);
>> if (err)
>> goto err_free;
>>
>> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
>> int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>> {
>> int err;
>> + unsigned long flags;
>>
>> if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
>> return -EINVAL;
>> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>> elem->obj = (u8 *)elem - pool->elem_offset;
>> kref_init(&elem->ref_cnt);
>>
>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>> - &pool->next, GFP_KERNEL);
>> + xa_lock_irqsave(&pool->xa, flags);
>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>> + &pool->next, GFP_ATOMIC);
>> + xa_unlock_irqrestore(&pool->xa, flags);
>> if (err)
>> goto err_cnt;
>>
>> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
>> {
>> struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>> struct rxe_pool *pool = elem->pool;
>> + unsigned long flags;
>>
>> - xa_erase(&pool->xa, elem->index);
>> + xa_lock_irqsave(&pool->xa, flags);
>> + __xa_erase(&pool->xa, elem->index);
>> + xa_unlock_irqrestore(&pool->xa, flags);
>>
>> if (pool->cleanup)
>> pool->cleanup(elem);
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 5:54 ` Yanjun Zhu
@ 2022-04-15 6:35 ` Bob Pearson
2022-04-15 6:49 ` Yanjun Zhu
2022-04-20 16:41 ` Jason Gunthorpe
0 siblings, 2 replies; 17+ messages in thread
From: Bob Pearson @ 2022-04-15 6:35 UTC (permalink / raw)
To: Yanjun Zhu, jgg, leon, linux-rdma
On 4/15/22 00:54, Yanjun Zhu wrote:
>
> 在 2022/4/15 13:37, Bob Pearson 写道:
>> On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>
>>> This is a dead lock problem.
>>> The xa_lock first is acquired in this:
>>>
>>> {SOFTIRQ-ON-W} state was registered at:
>>>
>>> lock_acquire+0x1d2/0x5a0
>>> _raw_spin_lock+0x33/0x80
>>> __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
>>> __ib_alloc_pd+0xf9/0x550 [ib_core]
>>> ib_mad_init_device+0x2d9/0xd20 [ib_core]
>>> add_client_context+0x2fa/0x450 [ib_core]
>>> enable_device_and_get+0x1b7/0x350 [ib_core]
>>> ib_register_device+0x757/0xaf0 [ib_core]
>>> rxe_register_device+0x2eb/0x390 [rdma_rxe]
>>> rxe_net_add+0x83/0xc0 [rdma_rxe]
>>> rxe_newlink+0x76/0x90 [rdma_rxe]
>>> nldev_newlink+0x245/0x3e0 [ib_core]
>>> rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
>>> rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
>>> netlink_unicast+0x43b/0x640
>>> netlink_sendmsg+0x7eb/0xc40
>>> sock_sendmsg+0xe0/0x110
>>> __sys_sendto+0x1d7/0x2b0
>>> __x64_sys_sendto+0xdd/0x1b0
>>> do_syscall_64+0x37/0x80
>>> entry_SYSCALL_64_after_hwframe+0x44/0xae
>> There is a separate xarray for each object pool. So this one is
>> rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>>
>>> Then xa_lock is acquired in this:
>>>
>>> {IN-SOFTIRQ-W}:
>>>
>>> Call Trace:
>>> <TASK>
>>> dump_stack_lvl+0x44/0x57
>>> mark_lock.part.52.cold.79+0x3c/0x46
>>> __lock_acquire+0x1565/0x34a0
>>> lock_acquire+0x1d2/0x5a0
>>> _raw_spin_lock_irqsave+0x42/0x90
>>> rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
>>> rxe_get_av+0x168/0x2a0 [rdma_rxe]
>>> rxe_requester+0x75b/0x4a90 [rdma_rxe]
>>> rxe_do_task+0x134/0x230 [rdma_rxe]
>>> tasklet_action_common.isra.12+0x1f7/0x2d0
>>> __do_softirq+0x1ea/0xa4c
>>> run_ksoftirqd+0x32/0x60
>>> smpboot_thread_fn+0x503/0x860
>>> kthread+0x29b/0x340
>>> ret_from_fork+0x1f/0x30
>> And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
>> in the process of sending a UD packet from a work request
>> which contains the index of the ah.
>>
>> For your story to work there needs to be an another ah_pool.xa.xa_lock somewhere.
>> Let's assume it is there somewhere and it's from (a different) add_to_pool call
>> then the add_to_pool_ routine should disable interrupts when it gets the lock
>> with spin_lock_xxx. But only for AH objects.
>>
>> This may be old news.
>
> What do you mean? Please check the call trace in the bug.
I mean the trace you show here shows an instance of xa_lock being
acquired from the pd pool followed by an instance of xa_lock being
acquired from rxe_pool_get_index from the ah pool. They are different
locks. They can't deadlock against each other. So there must be
some other trace (not shown) that also gets xa_lock from the ah pool.
>
> Zhu Yanjun
>
>>
>>> </TASK>
>>>
>>> From the above, in the function __rxe_add_to_pool,
>>> xa_lock is acquired. Then the function __rxe_add_to_pool
>>> is interrupted by softirq. The function
>>> rxe_pool_get_index will also acquire xa_lock.
>>>
>>> Finally, the dead lock appears.
>>>
>>> [ 296.806097] CPU0
>>> [ 296.808550] ----
>>> [ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
>>> [ 296.814583] <Interrupt>
>>> [ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
>>> [ 296.820961]
>>> *** DEADLOCK ***
>>>
>>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
>>> Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>>> ---
>>> V3->V4: xa_lock_irq locks are used.
>>> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
>>> GFP_ATOMIC is used in __rxe_add_to_pool.
>>> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
>>> ---
>>> drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
>>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>>> index 87066d04ed18..f1f06dc7e64f 100644
>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>>> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>> atomic_set(&pool->num_elem, 0);
>>> - xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
>>> + xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
>>> pool->limit.min = info->min_index;
>>> pool->limit.max = info->max_index;
>>> }
>>> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
>>> elem->obj = obj;
>>> kref_init(&elem->ref_cnt);
>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>> - &pool->next, GFP_KERNEL);
>>> + xa_lock_irq(&pool->xa);
>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>> + &pool->next, GFP_KERNEL);
>>> + xa_unlock_irq(&pool->xa);
>>> if (err)
>>> goto err_free;
>>> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
>>> int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>> {
>>> int err;
>>> + unsigned long flags;
>>> if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
>>> return -EINVAL;
>>> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>> elem->obj = (u8 *)elem - pool->elem_offset;
>>> kref_init(&elem->ref_cnt);
>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>> - &pool->next, GFP_KERNEL);
>>> + xa_lock_irqsave(&pool->xa, flags);
>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>> + &pool->next, GFP_ATOMIC);
>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>> if (err)
>>> goto err_cnt;
>>> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
>>> {
>>> struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>>> struct rxe_pool *pool = elem->pool;
>>> + unsigned long flags;
>>> - xa_erase(&pool->xa, elem->index);
>>> + xa_lock_irqsave(&pool->xa, flags);
>>> + __xa_erase(&pool->xa, elem->index);
>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>> if (pool->cleanup)
>>> pool->cleanup(elem);
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 6:35 ` Bob Pearson
@ 2022-04-15 6:49 ` Yanjun Zhu
2022-04-15 7:22 ` Bob Pearson
2022-04-20 16:41 ` Jason Gunthorpe
1 sibling, 1 reply; 17+ messages in thread
From: Yanjun Zhu @ 2022-04-15 6:49 UTC (permalink / raw)
To: Bob Pearson, Yanjun Zhu, jgg, leon, linux-rdma
在 2022/4/15 14:35, Bob Pearson 写道:
> On 4/15/22 00:54, Yanjun Zhu wrote:
>>
>> 在 2022/4/15 13:37, Bob Pearson 写道:
>>> On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>
>>>> This is a dead lock problem.
>>>> The xa_lock first is acquired in this:
>>>>
>>>> {SOFTIRQ-ON-W} state was registered at:
>>>>
>>>> lock_acquire+0x1d2/0x5a0
>>>> _raw_spin_lock+0x33/0x80
>>>> __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
>>>> __ib_alloc_pd+0xf9/0x550 [ib_core]
>>>> ib_mad_init_device+0x2d9/0xd20 [ib_core]
>>>> add_client_context+0x2fa/0x450 [ib_core]
>>>> enable_device_and_get+0x1b7/0x350 [ib_core]
>>>> ib_register_device+0x757/0xaf0 [ib_core]
>>>> rxe_register_device+0x2eb/0x390 [rdma_rxe]
>>>> rxe_net_add+0x83/0xc0 [rdma_rxe]
>>>> rxe_newlink+0x76/0x90 [rdma_rxe]
>>>> nldev_newlink+0x245/0x3e0 [ib_core]
>>>> rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
>>>> rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
>>>> netlink_unicast+0x43b/0x640
>>>> netlink_sendmsg+0x7eb/0xc40
>>>> sock_sendmsg+0xe0/0x110
>>>> __sys_sendto+0x1d7/0x2b0
>>>> __x64_sys_sendto+0xdd/0x1b0
>>>> do_syscall_64+0x37/0x80
>>>> entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> There is a separate xarray for each object pool. So this one is
>>> rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>>>
>>>> Then xa_lock is acquired in this:
>>>>
>>>> {IN-SOFTIRQ-W}:
>>>>
>>>> Call Trace:
>>>> <TASK>
>>>> dump_stack_lvl+0x44/0x57
>>>> mark_lock.part.52.cold.79+0x3c/0x46
>>>> __lock_acquire+0x1565/0x34a0
>>>> lock_acquire+0x1d2/0x5a0
>>>> _raw_spin_lock_irqsave+0x42/0x90
>>>> rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
>>>> rxe_get_av+0x168/0x2a0 [rdma_rxe]
>>>> rxe_requester+0x75b/0x4a90 [rdma_rxe]
>>>> rxe_do_task+0x134/0x230 [rdma_rxe]
>>>> tasklet_action_common.isra.12+0x1f7/0x2d0
>>>> __do_softirq+0x1ea/0xa4c
>>>> run_ksoftirqd+0x32/0x60
>>>> smpboot_thread_fn+0x503/0x860
>>>> kthread+0x29b/0x340
>>>> ret_from_fork+0x1f/0x30
>>> And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
>>> in the process of sending a UD packet from a work request
>>> which contains the index of the ah.
>>>
>>> For your story to work there needs to be an another ah_pool.xa.xa_lock somewhere.
>>> Let's assume it is there somewhere and it's from (a different) add_to_pool call
>>> then the add_to_pool_ routine should disable interrupts when it gets the lock
>>> with spin_lock_xxx. But only for AH objects.
>>>
>>> This may be old news.
>>
>> What do you mean? Please check the call trace in the bug.
>
> I mean the trace you show here shows an instance of xa_lock being
> acquired from the pd pool followed by an instance of xa_lock being
> acquired from rxe_pool_get_index from the ah pool. They are different
> locks. They can't deadlock against each other. So there must be
> some other trace (not shown) that also gets xa_lock from the ah pool.
Please check the bug report mail. The link is
news://nntp.lore.kernel.org:119/CAHj4cs-MT13RiEsWXUAcX_H5jEtjsebuZgSeUcfptNVuELgjYQ@mail.gmail.com
BTW, what is the update about wr crash caused by your xarray patches?
Zhu Yanjun
>
>>
>> Zhu Yanjun
>>
>>>
>>>> </TASK>
>>>>
>>>> From the above, in the function __rxe_add_to_pool,
>>>> xa_lock is acquired. Then the function __rxe_add_to_pool
>>>> is interrupted by softirq. The function
>>>> rxe_pool_get_index will also acquire xa_lock.
>>>>
>>>> Finally, the dead lock appears.
>>>>
>>>> [ 296.806097] CPU0
>>>> [ 296.808550] ----
>>>> [ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
>>>> [ 296.814583] <Interrupt>
>>>> [ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
>>>> [ 296.820961]
>>>> *** DEADLOCK ***
>>>>
>>>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
>>>> Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
>>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>> ---
>>>> V3->V4: xa_lock_irq locks are used.
>>>> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
>>>> GFP_ATOMIC is used in __rxe_add_to_pool.
>>>> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
>>>> ---
>>>> drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
>>>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>> index 87066d04ed18..f1f06dc7e64f 100644
>>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>>> atomic_set(&pool->num_elem, 0);
>>>> - xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
>>>> + xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
>>>> pool->limit.min = info->min_index;
>>>> pool->limit.max = info->max_index;
>>>> }
>>>> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>> elem->obj = obj;
>>>> kref_init(&elem->ref_cnt);
>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>> - &pool->next, GFP_KERNEL);
>>>> + xa_lock_irq(&pool->xa);
>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>> + &pool->next, GFP_KERNEL);
>>>> + xa_unlock_irq(&pool->xa);
>>>> if (err)
>>>> goto err_free;
>>>> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>> int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>> {
>>>> int err;
>>>> + unsigned long flags;
>>>> if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
>>>> return -EINVAL;
>>>> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>> elem->obj = (u8 *)elem - pool->elem_offset;
>>>> kref_init(&elem->ref_cnt);
>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>> - &pool->next, GFP_KERNEL);
>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>> + &pool->next, GFP_ATOMIC);
>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>> if (err)
>>>> goto err_cnt;
>>>> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
>>>> {
>>>> struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>>>> struct rxe_pool *pool = elem->pool;
>>>> + unsigned long flags;
>>>> - xa_erase(&pool->xa, elem->index);
>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>> + __xa_erase(&pool->xa, elem->index);
>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>> if (pool->cleanup)
>>>> pool->cleanup(elem);
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 6:49 ` Yanjun Zhu
@ 2022-04-15 7:22 ` Bob Pearson
2022-04-15 7:32 ` Yanjun Zhu
0 siblings, 1 reply; 17+ messages in thread
From: Bob Pearson @ 2022-04-15 7:22 UTC (permalink / raw)
To: Yanjun Zhu, jgg, leon, linux-rdma
On 4/15/22 01:49, Yanjun Zhu wrote:
> 在 2022/4/15 14:35, Bob Pearson 写道:
>> On 4/15/22 00:54, Yanjun Zhu wrote:
>>>
>>> 在 2022/4/15 13:37, Bob Pearson 写道:
>>>> On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
>>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>
>>>>> This is a dead lock problem.
>>>>> The xa_lock first is acquired in this:
>>>>>
>>>>> {SOFTIRQ-ON-W} state was registered at:
>>>>>
>>>>> lock_acquire+0x1d2/0x5a0
>>>>> _raw_spin_lock+0x33/0x80
>>>>> __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
>>>>> __ib_alloc_pd+0xf9/0x550 [ib_core]
>>>>> ib_mad_init_device+0x2d9/0xd20 [ib_core]
>>>>> add_client_context+0x2fa/0x450 [ib_core]
>>>>> enable_device_and_get+0x1b7/0x350 [ib_core]
>>>>> ib_register_device+0x757/0xaf0 [ib_core]
>>>>> rxe_register_device+0x2eb/0x390 [rdma_rxe]
>>>>> rxe_net_add+0x83/0xc0 [rdma_rxe]
>>>>> rxe_newlink+0x76/0x90 [rdma_rxe]
>>>>> nldev_newlink+0x245/0x3e0 [ib_core]
>>>>> rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
>>>>> rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
>>>>> netlink_unicast+0x43b/0x640
>>>>> netlink_sendmsg+0x7eb/0xc40
>>>>> sock_sendmsg+0xe0/0x110
>>>>> __sys_sendto+0x1d7/0x2b0
>>>>> __x64_sys_sendto+0xdd/0x1b0
>>>>> do_syscall_64+0x37/0x80
>>>>> entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> There is a separate xarray for each object pool. So this one is
>>>> rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>>>>
>>>>> Then xa_lock is acquired in this:
>>>>>
>>>>> {IN-SOFTIRQ-W}:
>>>>>
>>>>> Call Trace:
>>>>> <TASK>
>>>>> dump_stack_lvl+0x44/0x57
>>>>> mark_lock.part.52.cold.79+0x3c/0x46
>>>>> __lock_acquire+0x1565/0x34a0
>>>>> lock_acquire+0x1d2/0x5a0
>>>>> _raw_spin_lock_irqsave+0x42/0x90
>>>>> rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
>>>>> rxe_get_av+0x168/0x2a0 [rdma_rxe]
>>>>> rxe_requester+0x75b/0x4a90 [rdma_rxe]
>>>>> rxe_do_task+0x134/0x230 [rdma_rxe]
>>>>> tasklet_action_common.isra.12+0x1f7/0x2d0
>>>>> __do_softirq+0x1ea/0xa4c
>>>>> run_ksoftirqd+0x32/0x60
>>>>> smpboot_thread_fn+0x503/0x860
>>>>> kthread+0x29b/0x340
>>>>> ret_from_fork+0x1f/0x30
>>>> And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
>>>> in the process of sending a UD packet from a work request
>>>> which contains the index of the ah.
>>>>
>>>> For your story to work there needs to be an another ah_pool.xa.xa_lock somewhere.
>>>> Let's assume it is there somewhere and it's from (a different) add_to_pool call
>>>> then the add_to_pool_ routine should disable interrupts when it gets the lock
>>>> with spin_lock_xxx. But only for AH objects.
>>>>
>>>> This may be old news.
>>>
>>> What do you mean? Please check the call trace in the bug.
>>
>> I mean the trace you show here shows an instance of xa_lock being
>> acquired from the pd pool followed by an instance of xa_lock being
>> acquired from rxe_pool_get_index from the ah pool. They are different
>> locks. They can't deadlock against each other. So there must be
>> some other trace (not shown) that also gets xa_lock from the ah pool.
>
> Please check the bug report mail. The link is news://nntp.lore.kernel.org:119/CAHj4cs-MT13RiEsWXUAcX_H5jEtjsebuZgSeUcfptNVuELgjYQ@mail.gmail.com
>
> BTW, what is the update about wr crash caused by your xarray patches?
>
> Zhu Yanjun
>
>>
>>>
>>> Zhu Yanjun
>>>
>>>>
>>>>> </TASK>
>>>>>
>>>>> From the above, in the function __rxe_add_to_pool,
>>>>> xa_lock is acquired. Then the function __rxe_add_to_pool
>>>>> is interrupted by softirq. The function
>>>>> rxe_pool_get_index will also acquire xa_lock.
>>>>>
>>>>> Finally, the dead lock appears.
>>>>>
>>>>> [ 296.806097] CPU0
>>>>> [ 296.808550] ----
>>>>> [ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
>>>>> [ 296.814583] <Interrupt>
>>>>> [ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
>>>>> [ 296.820961]
>>>>> *** DEADLOCK ***
>>>>>
>>>>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
>>>>> Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
>>>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>> ---
>>>>> V3->V4: xa_lock_irq locks are used.
>>>>> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
>>>>> GFP_ATOMIC is used in __rxe_add_to_pool.
>>>>> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
>>>>> ---
>>>>> drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
>>>>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>> index 87066d04ed18..f1f06dc7e64f 100644
>>>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>>>> atomic_set(&pool->num_elem, 0);
>>>>> - xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
>>>>> + xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
>>>>> pool->limit.min = info->min_index;
>>>>> pool->limit.max = info->max_index;
>>>>> }
>>>>> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>> elem->obj = obj;
>>>>> kref_init(&elem->ref_cnt);
>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>> - &pool->next, GFP_KERNEL);
>>>>> + xa_lock_irq(&pool->xa);
>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>> + &pool->next, GFP_KERNEL);
>>>>> + xa_unlock_irq(&pool->xa);
>>>>> if (err)
>>>>> goto err_free;
>>>>> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>> int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>> {
>>>>> int err;
>>>>> + unsigned long flags;
>>>>> if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
>>>>> return -EINVAL;
>>>>> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>> elem->obj = (u8 *)elem - pool->elem_offset;
>>>>> kref_init(&elem->ref_cnt);
>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>> - &pool->next, GFP_KERNEL);
>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>> + &pool->next, GFP_ATOMIC);
>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>> if (err)
>>>>> goto err_cnt;
>>>>> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
>>>>> {
>>>>> struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>>>>> struct rxe_pool *pool = elem->pool;
>>>>> + unsigned long flags;
>>>>> - xa_erase(&pool->xa, elem->index);
>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>> + __xa_erase(&pool->xa, elem->index);
>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>> if (pool->cleanup)
>>>>> pool->cleanup(elem);
>>
>
Here is my output. Everything passes there are no bugs or unexpected warnings in the kernel trace.
bob@ubuntu-21:~/src/blktests$ sudo ./check -q srp
srp/001 (Create and remove LUNs) [passed]
runtime 3.402s ... 2.753s
srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]time 34.431s ...
runtime 34.431s ... 34.328s
srp/003 (File I/O on top of multipath concurrently with logout and login (sq)) [not run]
legacy device mapper support is missing
srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-mq)) [not run]
legacy device mapper support is missing
srp/005 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
runtime 14.332s ... 12.919s
srp/006 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
runtime 13.361s ... 12.959s
srp/007 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=4M) [passed]
runtime 14.293s ... 12.912s
srp/008 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=8M) [passed]
runtime 13.369s ... 13.165s
srp/009 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
runtime 13.636s ... 14.201s
srp/010 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
runtime 13.361s ... 12.909s
srp/011 (Block I/O on top of multipath concurrently with logout and login) [passed]
runtime 33.706s ... 33.571s
srp/012 (dm-mpath on top of multiple I/O schedulers) [passed]
runtime 13.592s ... 14.138s
srp/013 (Direct I/O using a discontiguous buffer) [passed]
runtime 3.230s ... 3.513s
srp/014 (Run sg_reset while I/O is ongoing) [passed]
runtime 33.070s ... 33.059s
srp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) dsrp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) driver) [passed].148s ...
runtime 35.148s ... 34.974s
bob@ubuntu-21:~/src/blktests$
Bob
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 7:22 ` Bob Pearson
@ 2022-04-15 7:32 ` Yanjun Zhu
2022-04-15 7:35 ` Bob Pearson
0 siblings, 1 reply; 17+ messages in thread
From: Yanjun Zhu @ 2022-04-15 7:32 UTC (permalink / raw)
To: Bob Pearson, jgg, leon, linux-rdma
在 2022/4/15 15:22, Bob Pearson 写道:
> On 4/15/22 01:49, Yanjun Zhu wrote:
>> 在 2022/4/15 14:35, Bob Pearson 写道:
>>> On 4/15/22 00:54, Yanjun Zhu wrote:
>>>> 在 2022/4/15 13:37, Bob Pearson 写道:
>>>>> On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
>>>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>
>>>>>> This is a dead lock problem.
>>>>>> The xa_lock first is acquired in this:
>>>>>>
>>>>>> {SOFTIRQ-ON-W} state was registered at:
>>>>>>
>>>>>> lock_acquire+0x1d2/0x5a0
>>>>>> _raw_spin_lock+0x33/0x80
>>>>>> __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
>>>>>> __ib_alloc_pd+0xf9/0x550 [ib_core]
>>>>>> ib_mad_init_device+0x2d9/0xd20 [ib_core]
>>>>>> add_client_context+0x2fa/0x450 [ib_core]
>>>>>> enable_device_and_get+0x1b7/0x350 [ib_core]
>>>>>> ib_register_device+0x757/0xaf0 [ib_core]
>>>>>> rxe_register_device+0x2eb/0x390 [rdma_rxe]
>>>>>> rxe_net_add+0x83/0xc0 [rdma_rxe]
>>>>>> rxe_newlink+0x76/0x90 [rdma_rxe]
>>>>>> nldev_newlink+0x245/0x3e0 [ib_core]
>>>>>> rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
>>>>>> rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
>>>>>> netlink_unicast+0x43b/0x640
>>>>>> netlink_sendmsg+0x7eb/0xc40
>>>>>> sock_sendmsg+0xe0/0x110
>>>>>> __sys_sendto+0x1d7/0x2b0
>>>>>> __x64_sys_sendto+0xdd/0x1b0
>>>>>> do_syscall_64+0x37/0x80
>>>>>> entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> There is a separate xarray for each object pool. So this one is
>>>>> rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>>>>>
>>>>>> Then xa_lock is acquired in this:
>>>>>>
>>>>>> {IN-SOFTIRQ-W}:
>>>>>>
>>>>>> Call Trace:
>>>>>> <TASK>
>>>>>> dump_stack_lvl+0x44/0x57
>>>>>> mark_lock.part.52.cold.79+0x3c/0x46
>>>>>> __lock_acquire+0x1565/0x34a0
>>>>>> lock_acquire+0x1d2/0x5a0
>>>>>> _raw_spin_lock_irqsave+0x42/0x90
>>>>>> rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
>>>>>> rxe_get_av+0x168/0x2a0 [rdma_rxe]
>>>>>> rxe_requester+0x75b/0x4a90 [rdma_rxe]
>>>>>> rxe_do_task+0x134/0x230 [rdma_rxe]
>>>>>> tasklet_action_common.isra.12+0x1f7/0x2d0
>>>>>> __do_softirq+0x1ea/0xa4c
>>>>>> run_ksoftirqd+0x32/0x60
>>>>>> smpboot_thread_fn+0x503/0x860
>>>>>> kthread+0x29b/0x340
>>>>>> ret_from_fork+0x1f/0x30
>>>>> And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
>>>>> in the process of sending a UD packet from a work request
>>>>> which contains the index of the ah.
>>>>>
>>>>> For your story to work there needs to be an another ah_pool.xa.xa_lock somewhere.
>>>>> Let's assume it is there somewhere and it's from (a different) add_to_pool call
>>>>> then the add_to_pool_ routine should disable interrupts when it gets the lock
>>>>> with spin_lock_xxx. But only for AH objects.
>>>>>
>>>>> This may be old news.
>>>> What do you mean? Please check the call trace in the bug.
>>> I mean the trace you show here shows an instance of xa_lock being
>>> acquired from the pd pool followed by an instance of xa_lock being
>>> acquired from rxe_pool_get_index from the ah pool. They are different
>>> locks. They can't deadlock against each other. So there must be
>>> some other trace (not shown) that also gets xa_lock from the ah pool.
>> Please check the bug report mail. The link is news://nntp.lore.kernel.org:119/CAHj4cs-MT13RiEsWXUAcX_H5jEtjsebuZgSeUcfptNVuELgjYQ@mail.gmail.com
>>
>> BTW, what is the update about wr crash caused by your xarray patches?
>>
>> Zhu Yanjun
>>
>>>> Zhu Yanjun
>>>>
>>>>>> </TASK>
>>>>>>
>>>>>> From the above, in the function __rxe_add_to_pool,
>>>>>> xa_lock is acquired. Then the function __rxe_add_to_pool
>>>>>> is interrupted by softirq. The function
>>>>>> rxe_pool_get_index will also acquire xa_lock.
>>>>>>
>>>>>> Finally, the dead lock appears.
>>>>>>
>>>>>> [ 296.806097] CPU0
>>>>>> [ 296.808550] ----
>>>>>> [ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
>>>>>> [ 296.814583] <Interrupt>
>>>>>> [ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
>>>>>> [ 296.820961]
>>>>>> *** DEADLOCK ***
>>>>>>
>>>>>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
>>>>>> Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
>>>>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>> ---
>>>>>> V3->V4: xa_lock_irq locks are used.
>>>>>> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
>>>>>> GFP_ATOMIC is used in __rxe_add_to_pool.
>>>>>> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
>>>>>> ---
>>>>>> drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
>>>>>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>> index 87066d04ed18..f1f06dc7e64f 100644
>>>>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>>>>> atomic_set(&pool->num_elem, 0);
>>>>>> - xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
>>>>>> + xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
>>>>>> pool->limit.min = info->min_index;
>>>>>> pool->limit.max = info->max_index;
>>>>>> }
>>>>>> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>> elem->obj = obj;
>>>>>> kref_init(&elem->ref_cnt);
>>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>> - &pool->next, GFP_KERNEL);
>>>>>> + xa_lock_irq(&pool->xa);
>>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>> + &pool->next, GFP_KERNEL);
>>>>>> + xa_unlock_irq(&pool->xa);
>>>>>> if (err)
>>>>>> goto err_free;
>>>>>> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>> int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>> {
>>>>>> int err;
>>>>>> + unsigned long flags;
>>>>>> if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
>>>>>> return -EINVAL;
>>>>>> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>> elem->obj = (u8 *)elem - pool->elem_offset;
>>>>>> kref_init(&elem->ref_cnt);
>>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>> - &pool->next, GFP_KERNEL);
>>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>> + &pool->next, GFP_ATOMIC);
>>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>>> if (err)
>>>>>> goto err_cnt;
>>>>>> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
>>>>>> {
>>>>>> struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>>>>>> struct rxe_pool *pool = elem->pool;
>>>>>> + unsigned long flags;
>>>>>> - xa_erase(&pool->xa, elem->index);
>>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>>> + __xa_erase(&pool->xa, elem->index);
>>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>>> if (pool->cleanup)
>>>>>> pool->cleanup(elem);
> Here is my output. Everything passes there are no bugs or unexpected warnings in the kernel trace.
If I understand you correctly, you mean that the bug reported by Zhang
Yi does not exist?
I can reproduce this bug with rping.
You can not reproduce this bug. It does not mean that this bug does not
exist.
And with rping, I also found another wr NULL bug. From the mail, you can
also verify this wr NULL bug.
Let us foucus on this wr NULL bug. OK?
Zhu Yanjun
>
> bob@ubuntu-21:~/src/blktests$ sudo ./check -q srp
>
> srp/001 (Create and remove LUNs) [passed]
>
> runtime 3.402s ... 2.753s
>
> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]time 34.431s ...
>
> runtime 34.431s ... 34.328s
>
> srp/003 (File I/O on top of multipath concurrently with logout and login (sq)) [not run]
>
> legacy device mapper support is missing
>
> srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-mq)) [not run]
>
> legacy device mapper support is missing
>
> srp/005 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>
> runtime 14.332s ... 12.919s
>
> srp/006 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>
> runtime 13.361s ... 12.959s
>
> srp/007 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=4M) [passed]
>
> runtime 14.293s ... 12.912s
>
> srp/008 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=8M) [passed]
>
> runtime 13.369s ... 13.165s
>
> srp/009 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>
> runtime 13.636s ... 14.201s
>
> srp/010 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>
> runtime 13.361s ... 12.909s
>
> srp/011 (Block I/O on top of multipath concurrently with logout and login) [passed]
>
> runtime 33.706s ... 33.571s
>
> srp/012 (dm-mpath on top of multiple I/O schedulers) [passed]
>
> runtime 13.592s ... 14.138s
>
> srp/013 (Direct I/O using a discontiguous buffer) [passed]
>
> runtime 3.230s ... 3.513s
>
> srp/014 (Run sg_reset while I/O is ongoing) [passed]
>
> runtime 33.070s ... 33.059s
>
> srp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) dsrp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) driver) [passed].148s ...
you are using SoftiWARP (siw)?
>
> runtime 35.148s ... 34.974s
>
> bob@ubuntu-21:~/src/blktests$
>
> Bob
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 7:32 ` Yanjun Zhu
@ 2022-04-15 7:35 ` Bob Pearson
2022-04-15 7:42 ` Yanjun Zhu
2022-04-15 7:44 ` Bob Pearson
0 siblings, 2 replies; 17+ messages in thread
From: Bob Pearson @ 2022-04-15 7:35 UTC (permalink / raw)
To: Yanjun Zhu, jgg, leon, linux-rdma
On 4/15/22 02:32, Yanjun Zhu wrote:
>
> 在 2022/4/15 15:22, Bob Pearson 写道:
>> On 4/15/22 01:49, Yanjun Zhu wrote:
>>> 在 2022/4/15 14:35, Bob Pearson 写道:
>>>> On 4/15/22 00:54, Yanjun Zhu wrote:
>>>>> 在 2022/4/15 13:37, Bob Pearson 写道:
>>>>>> On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
>>>>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>>
>>>>>>> This is a dead lock problem.
>>>>>>> The xa_lock first is acquired in this:
>>>>>>>
>>>>>>> {SOFTIRQ-ON-W} state was registered at:
>>>>>>>
>>>>>>> lock_acquire+0x1d2/0x5a0
>>>>>>> _raw_spin_lock+0x33/0x80
>>>>>>> __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
>>>>>>> __ib_alloc_pd+0xf9/0x550 [ib_core]
>>>>>>> ib_mad_init_device+0x2d9/0xd20 [ib_core]
>>>>>>> add_client_context+0x2fa/0x450 [ib_core]
>>>>>>> enable_device_and_get+0x1b7/0x350 [ib_core]
>>>>>>> ib_register_device+0x757/0xaf0 [ib_core]
>>>>>>> rxe_register_device+0x2eb/0x390 [rdma_rxe]
>>>>>>> rxe_net_add+0x83/0xc0 [rdma_rxe]
>>>>>>> rxe_newlink+0x76/0x90 [rdma_rxe]
>>>>>>> nldev_newlink+0x245/0x3e0 [ib_core]
>>>>>>> rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
>>>>>>> rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
>>>>>>> netlink_unicast+0x43b/0x640
>>>>>>> netlink_sendmsg+0x7eb/0xc40
>>>>>>> sock_sendmsg+0xe0/0x110
>>>>>>> __sys_sendto+0x1d7/0x2b0
>>>>>>> __x64_sys_sendto+0xdd/0x1b0
>>>>>>> do_syscall_64+0x37/0x80
>>>>>>> entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>> There is a separate xarray for each object pool. So this one is
>>>>>> rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>>>>>>
>>>>>>> Then xa_lock is acquired in this:
>>>>>>>
>>>>>>> {IN-SOFTIRQ-W}:
>>>>>>>
>>>>>>> Call Trace:
>>>>>>> <TASK>
>>>>>>> dump_stack_lvl+0x44/0x57
>>>>>>> mark_lock.part.52.cold.79+0x3c/0x46
>>>>>>> __lock_acquire+0x1565/0x34a0
>>>>>>> lock_acquire+0x1d2/0x5a0
>>>>>>> _raw_spin_lock_irqsave+0x42/0x90
>>>>>>> rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
>>>>>>> rxe_get_av+0x168/0x2a0 [rdma_rxe]
>>>>>>> rxe_requester+0x75b/0x4a90 [rdma_rxe]
>>>>>>> rxe_do_task+0x134/0x230 [rdma_rxe]
>>>>>>> tasklet_action_common.isra.12+0x1f7/0x2d0
>>>>>>> __do_softirq+0x1ea/0xa4c
>>>>>>> run_ksoftirqd+0x32/0x60
>>>>>>> smpboot_thread_fn+0x503/0x860
>>>>>>> kthread+0x29b/0x340
>>>>>>> ret_from_fork+0x1f/0x30
>>>>>> And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
>>>>>> in the process of sending a UD packet from a work request
>>>>>> which contains the index of the ah.
>>>>>>
>>>>>> For your story to work there needs to be an another ah_pool.xa.xa_lock somewhere.
>>>>>> Let's assume it is there somewhere and it's from (a different) add_to_pool call
>>>>>> then the add_to_pool_ routine should disable interrupts when it gets the lock
>>>>>> with spin_lock_xxx. But only for AH objects.
>>>>>>
>>>>>> This may be old news.
>>>>> What do you mean? Please check the call trace in the bug.
>>>> I mean the trace you show here shows an instance of xa_lock being
>>>> acquired from the pd pool followed by an instance of xa_lock being
>>>> acquired from rxe_pool_get_index from the ah pool. They are different
>>>> locks. They can't deadlock against each other. So there must be
>>>> some other trace (not shown) that also gets xa_lock from the ah pool.
>>> Please check the bug report mail. The link is news://nntp.lore.kernel.org:119/CAHj4cs-MT13RiEsWXUAcX_H5jEtjsebuZgSeUcfptNVuELgjYQ@mail.gmail.com
>>>
>>> BTW, what is the update about wr crash caused by your xarray patches?
>>>
>>> Zhu Yanjun
>>>
>>>>> Zhu Yanjun
>>>>>
>>>>>>> </TASK>
>>>>>>>
>>>>>>> From the above, in the function __rxe_add_to_pool,
>>>>>>> xa_lock is acquired. Then the function __rxe_add_to_pool
>>>>>>> is interrupted by softirq. The function
>>>>>>> rxe_pool_get_index will also acquire xa_lock.
>>>>>>>
>>>>>>> Finally, the dead lock appears.
>>>>>>>
>>>>>>> [ 296.806097] CPU0
>>>>>>> [ 296.808550] ----
>>>>>>> [ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
>>>>>>> [ 296.814583] <Interrupt>
>>>>>>> [ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
>>>>>>> [ 296.820961]
>>>>>>> *** DEADLOCK ***
>>>>>>>
>>>>>>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
>>>>>>> Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
>>>>>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>> ---
>>>>>>> V3->V4: xa_lock_irq locks are used.
>>>>>>> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
>>>>>>> GFP_ATOMIC is used in __rxe_add_to_pool.
>>>>>>> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
>>>>>>> ---
>>>>>>> drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
>>>>>>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>> index 87066d04ed18..f1f06dc7e64f 100644
>>>>>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>>>>>> atomic_set(&pool->num_elem, 0);
>>>>>>> - xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
>>>>>>> + xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
>>>>>>> pool->limit.min = info->min_index;
>>>>>>> pool->limit.max = info->max_index;
>>>>>>> }
>>>>>>> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>>> elem->obj = obj;
>>>>>>> kref_init(&elem->ref_cnt);
>>>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>> - &pool->next, GFP_KERNEL);
>>>>>>> + xa_lock_irq(&pool->xa);
>>>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>> + &pool->next, GFP_KERNEL);
>>>>>>> + xa_unlock_irq(&pool->xa);
>>>>>>> if (err)
>>>>>>> goto err_free;
>>>>>>> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>>> int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>>> {
>>>>>>> int err;
>>>>>>> + unsigned long flags;
>>>>>>> if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
>>>>>>> return -EINVAL;
>>>>>>> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>>> elem->obj = (u8 *)elem - pool->elem_offset;
>>>>>>> kref_init(&elem->ref_cnt);
>>>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>> - &pool->next, GFP_KERNEL);
>>>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>> + &pool->next, GFP_ATOMIC);
>>>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>>>> if (err)
>>>>>>> goto err_cnt;
>>>>>>> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
>>>>>>> {
>>>>>>> struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>>>>>>> struct rxe_pool *pool = elem->pool;
>>>>>>> + unsigned long flags;
>>>>>>> - xa_erase(&pool->xa, elem->index);
>>>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>>>> + __xa_erase(&pool->xa, elem->index);
>>>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>>>> if (pool->cleanup)
>>>>>>> pool->cleanup(elem);
>> Here is my output. Everything passes there are no bugs or unexpected warnings in the kernel trace.
>
> If I understand you correctly, you mean that the bug reported by Zhang Yi does not exist?
>
> I can reproduce this bug with rping.
>
> You can not reproduce this bug. It does not mean that this bug does not exist.
>
> And with rping, I also found another wr NULL bug. From the mail, you can also verify this wr NULL bug.
>
> Let us foucus on this wr NULL bug. OK?
>
> Zhu Yanjun
>
>>
>> bob@ubuntu-21:~/src/blktests$ sudo ./check -q srp
>>
>> srp/001 (Create and remove LUNs) [passed]
>>
>> runtime 3.402s ... 2.753s
>>
>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]time 34.431s ...
>>
>> runtime 34.431s ... 34.328s
>>
>> srp/003 (File I/O on top of multipath concurrently with logout and login (sq)) [not run]
>>
>> legacy device mapper support is missing
>>
>> srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-mq)) [not run]
>>
>> legacy device mapper support is missing
>>
>> srp/005 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>
>> runtime 14.332s ... 12.919s
>>
>> srp/006 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>
>> runtime 13.361s ... 12.959s
>>
>> srp/007 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=4M) [passed]
>>
>> runtime 14.293s ... 12.912s
>>
>> srp/008 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=8M) [passed]
>>
>> runtime 13.369s ... 13.165s
>>
>> srp/009 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>
>> runtime 13.636s ... 14.201s
>>
>> srp/010 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>
>> runtime 13.361s ... 12.909s
>>
>> srp/011 (Block I/O on top of multipath concurrently with logout and login) [passed]
>>
>> runtime 33.706s ... 33.571s
>>
>> srp/012 (dm-mpath on top of multiple I/O schedulers) [passed]
>>
>> runtime 13.592s ... 14.138s
>>
>> srp/013 (Direct I/O using a discontiguous buffer) [passed]
>>
>> runtime 3.230s ... 3.513s
>>
>> srp/014 (Run sg_reset while I/O is ongoing) [passed]
>>
>> runtime 33.070s ... 33.059s
>>
>> srp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) dsrp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) driver) [passed].148s ...
>
> you are using SoftiWARP (siw)?
not me. it is just the normal behavior of the srp/015 test case. it has always done that. my rdma-core
does support siw.
>
>>
>> runtime 35.148s ... 34.974s
>>
>> bob@ubuntu-21:~/src/blktests$
>>
>> Bob
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 7:35 ` Bob Pearson
@ 2022-04-15 7:42 ` Yanjun Zhu
2022-04-15 7:49 ` Bob Pearson
2022-04-15 7:44 ` Bob Pearson
1 sibling, 1 reply; 17+ messages in thread
From: Yanjun Zhu @ 2022-04-15 7:42 UTC (permalink / raw)
To: Bob Pearson, Yanjun Zhu, jgg, leon, linux-rdma
在 2022/4/15 15:35, Bob Pearson 写道:
> On 4/15/22 02:32, Yanjun Zhu wrote:
>>
>> 在 2022/4/15 15:22, Bob Pearson 写道:
>>> On 4/15/22 01:49, Yanjun Zhu wrote:
>>>> 在 2022/4/15 14:35, Bob Pearson 写道:
>>>>> On 4/15/22 00:54, Yanjun Zhu wrote:
>>>>>> 在 2022/4/15 13:37, Bob Pearson 写道:
>>>>>>> On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
>>>>>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>>>
>>>>>>>> This is a dead lock problem.
>>>>>>>> The xa_lock first is acquired in this:
>>>>>>>>
>>>>>>>> {SOFTIRQ-ON-W} state was registered at:
>>>>>>>>
>>>>>>>> lock_acquire+0x1d2/0x5a0
>>>>>>>> _raw_spin_lock+0x33/0x80
>>>>>>>> __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
>>>>>>>> __ib_alloc_pd+0xf9/0x550 [ib_core]
>>>>>>>> ib_mad_init_device+0x2d9/0xd20 [ib_core]
>>>>>>>> add_client_context+0x2fa/0x450 [ib_core]
>>>>>>>> enable_device_and_get+0x1b7/0x350 [ib_core]
>>>>>>>> ib_register_device+0x757/0xaf0 [ib_core]
>>>>>>>> rxe_register_device+0x2eb/0x390 [rdma_rxe]
>>>>>>>> rxe_net_add+0x83/0xc0 [rdma_rxe]
>>>>>>>> rxe_newlink+0x76/0x90 [rdma_rxe]
>>>>>>>> nldev_newlink+0x245/0x3e0 [ib_core]
>>>>>>>> rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
>>>>>>>> rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
>>>>>>>> netlink_unicast+0x43b/0x640
>>>>>>>> netlink_sendmsg+0x7eb/0xc40
>>>>>>>> sock_sendmsg+0xe0/0x110
>>>>>>>> __sys_sendto+0x1d7/0x2b0
>>>>>>>> __x64_sys_sendto+0xdd/0x1b0
>>>>>>>> do_syscall_64+0x37/0x80
>>>>>>>> entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>>> There is a separate xarray for each object pool. So this one is
>>>>>>> rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>>>>>>>
>>>>>>>> Then xa_lock is acquired in this:
>>>>>>>>
>>>>>>>> {IN-SOFTIRQ-W}:
>>>>>>>>
>>>>>>>> Call Trace:
>>>>>>>> <TASK>
>>>>>>>> dump_stack_lvl+0x44/0x57
>>>>>>>> mark_lock.part.52.cold.79+0x3c/0x46
>>>>>>>> __lock_acquire+0x1565/0x34a0
>>>>>>>> lock_acquire+0x1d2/0x5a0
>>>>>>>> _raw_spin_lock_irqsave+0x42/0x90
>>>>>>>> rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
>>>>>>>> rxe_get_av+0x168/0x2a0 [rdma_rxe]
>>>>>>>> rxe_requester+0x75b/0x4a90 [rdma_rxe]
>>>>>>>> rxe_do_task+0x134/0x230 [rdma_rxe]
>>>>>>>> tasklet_action_common.isra.12+0x1f7/0x2d0
>>>>>>>> __do_softirq+0x1ea/0xa4c
>>>>>>>> run_ksoftirqd+0x32/0x60
>>>>>>>> smpboot_thread_fn+0x503/0x860
>>>>>>>> kthread+0x29b/0x340
>>>>>>>> ret_from_fork+0x1f/0x30
>>>>>>> And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
>>>>>>> in the process of sending a UD packet from a work request
>>>>>>> which contains the index of the ah.
>>>>>>>
>>>>>>> For your story to work there needs to be an another ah_pool.xa.xa_lock somewhere.
>>>>>>> Let's assume it is there somewhere and it's from (a different) add_to_pool call
>>>>>>> then the add_to_pool_ routine should disable interrupts when it gets the lock
>>>>>>> with spin_lock_xxx. But only for AH objects.
>>>>>>>
>>>>>>> This may be old news.
>>>>>> What do you mean? Please check the call trace in the bug.
>>>>> I mean the trace you show here shows an instance of xa_lock being
>>>>> acquired from the pd pool followed by an instance of xa_lock being
>>>>> acquired from rxe_pool_get_index from the ah pool. They are different
>>>>> locks. They can't deadlock against each other. So there must be
>>>>> some other trace (not shown) that also gets xa_lock from the ah pool.
>>>> Please check the bug report mail. The link is news://nntp.lore.kernel.org:119/CAHj4cs-MT13RiEsWXUAcX_H5jEtjsebuZgSeUcfptNVuELgjYQ@mail.gmail.com
>>>>
>>>> BTW, what is the update about wr crash caused by your xarray patches?
>>>>
>>>> Zhu Yanjun
>>>>
>>>>>> Zhu Yanjun
>>>>>>
>>>>>>>> </TASK>
>>>>>>>>
>>>>>>>> From the above, in the function __rxe_add_to_pool,
>>>>>>>> xa_lock is acquired. Then the function __rxe_add_to_pool
>>>>>>>> is interrupted by softirq. The function
>>>>>>>> rxe_pool_get_index will also acquire xa_lock.
>>>>>>>>
>>>>>>>> Finally, the dead lock appears.
>>>>>>>>
>>>>>>>> [ 296.806097] CPU0
>>>>>>>> [ 296.808550] ----
>>>>>>>> [ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
>>>>>>>> [ 296.814583] <Interrupt>
>>>>>>>> [ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
>>>>>>>> [ 296.820961]
>>>>>>>> *** DEADLOCK ***
>>>>>>>>
>>>>>>>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
>>>>>>>> Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
>>>>>>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>>> ---
>>>>>>>> V3->V4: xa_lock_irq locks are used.
>>>>>>>> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
>>>>>>>> GFP_ATOMIC is used in __rxe_add_to_pool.
>>>>>>>> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
>>>>>>>> ---
>>>>>>>> drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
>>>>>>>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>> index 87066d04ed18..f1f06dc7e64f 100644
>>>>>>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>>>>>>> atomic_set(&pool->num_elem, 0);
>>>>>>>> - xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
>>>>>>>> + xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
>>>>>>>> pool->limit.min = info->min_index;
>>>>>>>> pool->limit.max = info->max_index;
>>>>>>>> }
>>>>>>>> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>>>> elem->obj = obj;
>>>>>>>> kref_init(&elem->ref_cnt);
>>>>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>> - &pool->next, GFP_KERNEL);
>>>>>>>> + xa_lock_irq(&pool->xa);
>>>>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>> + &pool->next, GFP_KERNEL);
>>>>>>>> + xa_unlock_irq(&pool->xa);
>>>>>>>> if (err)
>>>>>>>> goto err_free;
>>>>>>>> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>>>> int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>>>> {
>>>>>>>> int err;
>>>>>>>> + unsigned long flags;
>>>>>>>> if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
>>>>>>>> return -EINVAL;
>>>>>>>> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>>>> elem->obj = (u8 *)elem - pool->elem_offset;
>>>>>>>> kref_init(&elem->ref_cnt);
>>>>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>> - &pool->next, GFP_KERNEL);
>>>>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>> + &pool->next, GFP_ATOMIC);
>>>>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>>>>> if (err)
>>>>>>>> goto err_cnt;
>>>>>>>> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
>>>>>>>> {
>>>>>>>> struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>>>>>>>> struct rxe_pool *pool = elem->pool;
>>>>>>>> + unsigned long flags;
>>>>>>>> - xa_erase(&pool->xa, elem->index);
>>>>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>>>>> + __xa_erase(&pool->xa, elem->index);
>>>>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>>>>> if (pool->cleanup)
>>>>>>>> pool->cleanup(elem);
>>> Here is my output. Everything passes there are no bugs or unexpected warnings in the kernel trace.
>>
>> If I understand you correctly, you mean that the bug reported by Zhang Yi does not exist?
>>
>> I can reproduce this bug with rping.
>>
>> You can not reproduce this bug. It does not mean that this bug does not exist.
>>
>> And with rping, I also found another wr NULL bug. From the mail, you can also verify this wr NULL bug.
>>
>> Let us foucus on this wr NULL bug. OK?
>>
>> Zhu Yanjun
>>
>>>
>>> bob@ubuntu-21:~/src/blktests$ sudo ./check -q srp
>>>
>>> srp/001 (Create and remove LUNs) [passed]
>>>
>>> runtime 3.402s ... 2.753s
>>>
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]time 34.431s ...
>>>
>>> runtime 34.431s ... 34.328s
>>>
>>> srp/003 (File I/O on top of multipath concurrently with logout and login (sq)) [not run]
>>>
>>> legacy device mapper support is missing
>>>
>>> srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-mq)) [not run]
>>>
>>> legacy device mapper support is missing
>>>
>>> srp/005 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>>
>>> runtime 14.332s ... 12.919s
>>>
>>> srp/006 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>>
>>> runtime 13.361s ... 12.959s
>>>
>>> srp/007 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=4M) [passed]
>>>
>>> runtime 14.293s ... 12.912s
>>>
>>> srp/008 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=8M) [passed]
>>>
>>> runtime 13.369s ... 13.165s
>>>
>>> srp/009 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>>
>>> runtime 13.636s ... 14.201s
>>>
>>> srp/010 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>>
>>> runtime 13.361s ... 12.909s
>>>
>>> srp/011 (Block I/O on top of multipath concurrently with logout and login) [passed]
>>>
>>> runtime 33.706s ... 33.571s
>>>
>>> srp/012 (dm-mpath on top of multiple I/O schedulers) [passed]
>>>
>>> runtime 13.592s ... 14.138s
>>>
>>> srp/013 (Direct I/O using a discontiguous buffer) [passed]
>>>
>>> runtime 3.230s ... 3.513s
>>>
>>> srp/014 (Run sg_reset while I/O is ongoing) [passed]
>>>
>>> runtime 33.070s ... 33.059s
>>>
>>> srp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) dsrp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) driver) [passed].148s ...
>>
>> you are using SoftiWARP (siw)?
>
> not me. it is just the normal behavior of the srp/015 test case. it has always done that. my rdma-core
> does support siw.
Fine.
Let us find the root cause of wr NULL problem.
I revert xarray patches and fell back to original source code.
This wr NULL problem does not exist.
I am working on it.
Hope we can fix this wr NULL problem very soon.
Zhu Yanjun
>
>>
>>>
>>> runtime 35.148s ... 34.974s
>>>
>>> bob@ubuntu-21:~/src/blktests$
>>>
>>> Bob
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 7:35 ` Bob Pearson
2022-04-15 7:42 ` Yanjun Zhu
@ 2022-04-15 7:44 ` Bob Pearson
2022-04-15 8:50 ` Yanjun Zhu
1 sibling, 1 reply; 17+ messages in thread
From: Bob Pearson @ 2022-04-15 7:44 UTC (permalink / raw)
To: Yanjun Zhu, jgg, leon, linux-rdma
On 4/15/22 02:35, Bob Pearson wrote:
> On 4/15/22 02:32, Yanjun Zhu wrote:
>>
>> 在 2022/4/15 15:22, Bob Pearson 写道:
>>> On 4/15/22 01:49, Yanjun Zhu wrote:
>>>> 在 2022/4/15 14:35, Bob Pearson 写道:
>>>>> On 4/15/22 00:54, Yanjun Zhu wrote:
>>>>>> 在 2022/4/15 13:37, Bob Pearson 写道:
>>>>>>> On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
>>>>>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>>>
>>>>>>>> This is a dead lock problem.
>>>>>>>> The xa_lock first is acquired in this:
>>>>>>>>
>>>>>>>> {SOFTIRQ-ON-W} state was registered at:
>>>>>>>>
>>>>>>>> lock_acquire+0x1d2/0x5a0
>>>>>>>> _raw_spin_lock+0x33/0x80
>>>>>>>> __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
>>>>>>>> __ib_alloc_pd+0xf9/0x550 [ib_core]
>>>>>>>> ib_mad_init_device+0x2d9/0xd20 [ib_core]
>>>>>>>> add_client_context+0x2fa/0x450 [ib_core]
>>>>>>>> enable_device_and_get+0x1b7/0x350 [ib_core]
>>>>>>>> ib_register_device+0x757/0xaf0 [ib_core]
>>>>>>>> rxe_register_device+0x2eb/0x390 [rdma_rxe]
>>>>>>>> rxe_net_add+0x83/0xc0 [rdma_rxe]
>>>>>>>> rxe_newlink+0x76/0x90 [rdma_rxe]
>>>>>>>> nldev_newlink+0x245/0x3e0 [ib_core]
>>>>>>>> rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
>>>>>>>> rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
>>>>>>>> netlink_unicast+0x43b/0x640
>>>>>>>> netlink_sendmsg+0x7eb/0xc40
>>>>>>>> sock_sendmsg+0xe0/0x110
>>>>>>>> __sys_sendto+0x1d7/0x2b0
>>>>>>>> __x64_sys_sendto+0xdd/0x1b0
>>>>>>>> do_syscall_64+0x37/0x80
>>>>>>>> entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>>> There is a separate xarray for each object pool. So this one is
>>>>>>> rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>>>>>>>
>>>>>>>> Then xa_lock is acquired in this:
>>>>>>>>
>>>>>>>> {IN-SOFTIRQ-W}:
>>>>>>>>
>>>>>>>> Call Trace:
>>>>>>>> <TASK>
>>>>>>>> dump_stack_lvl+0x44/0x57
>>>>>>>> mark_lock.part.52.cold.79+0x3c/0x46
>>>>>>>> __lock_acquire+0x1565/0x34a0
>>>>>>>> lock_acquire+0x1d2/0x5a0
>>>>>>>> _raw_spin_lock_irqsave+0x42/0x90
>>>>>>>> rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
>>>>>>>> rxe_get_av+0x168/0x2a0 [rdma_rxe]
>>>>>>>> rxe_requester+0x75b/0x4a90 [rdma_rxe]
>>>>>>>> rxe_do_task+0x134/0x230 [rdma_rxe]
>>>>>>>> tasklet_action_common.isra.12+0x1f7/0x2d0
>>>>>>>> __do_softirq+0x1ea/0xa4c
>>>>>>>> run_ksoftirqd+0x32/0x60
>>>>>>>> smpboot_thread_fn+0x503/0x860
>>>>>>>> kthread+0x29b/0x340
>>>>>>>> ret_from_fork+0x1f/0x30
>>>>>>> And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
>>>>>>> in the process of sending a UD packet from a work request
>>>>>>> which contains the index of the ah.
>>>>>>>
>>>>>>> For your story to work there needs to be an another ah_pool.xa.xa_lock somewhere.
>>>>>>> Let's assume it is there somewhere and it's from (a different) add_to_pool call
>>>>>>> then the add_to_pool_ routine should disable interrupts when it gets the lock
>>>>>>> with spin_lock_xxx. But only for AH objects.
>>>>>>>
>>>>>>> This may be old news.
>>>>>> What do you mean? Please check the call trace in the bug.
>>>>> I mean the trace you show here shows an instance of xa_lock being
>>>>> acquired from the pd pool followed by an instance of xa_lock being
>>>>> acquired from rxe_pool_get_index from the ah pool. They are different
>>>>> locks. They can't deadlock against each other. So there must be
>>>>> some other trace (not shown) that also gets xa_lock from the ah pool.
>>>> Please check the bug report mail. The link is news://nntp.lore.kernel.org:119/CAHj4cs-MT13RiEsWXUAcX_H5jEtjsebuZgSeUcfptNVuELgjYQ@mail.gmail.com
>>>>
>>>> BTW, what is the update about wr crash caused by your xarray patches?
>>>>
>>>> Zhu Yanjun
>>>>
>>>>>> Zhu Yanjun
>>>>>>
>>>>>>>> </TASK>
>>>>>>>>
>>>>>>>> From the above, in the function __rxe_add_to_pool,
>>>>>>>> xa_lock is acquired. Then the function __rxe_add_to_pool
>>>>>>>> is interrupted by softirq. The function
>>>>>>>> rxe_pool_get_index will also acquire xa_lock.
>>>>>>>>
>>>>>>>> Finally, the dead lock appears.
>>>>>>>>
>>>>>>>> [ 296.806097] CPU0
>>>>>>>> [ 296.808550] ----
>>>>>>>> [ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
>>>>>>>> [ 296.814583] <Interrupt>
>>>>>>>> [ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
>>>>>>>> [ 296.820961]
>>>>>>>> *** DEADLOCK ***
>>>>>>>>
>>>>>>>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
>>>>>>>> Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
>>>>>>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>>> ---
>>>>>>>> V3->V4: xa_lock_irq locks are used.
>>>>>>>> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
>>>>>>>> GFP_ATOMIC is used in __rxe_add_to_pool.
>>>>>>>> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
>>>>>>>> ---
>>>>>>>> drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
>>>>>>>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>> index 87066d04ed18..f1f06dc7e64f 100644
>>>>>>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>>>>>>> atomic_set(&pool->num_elem, 0);
>>>>>>>> - xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
>>>>>>>> + xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
>>>>>>>> pool->limit.min = info->min_index;
>>>>>>>> pool->limit.max = info->max_index;
>>>>>>>> }
>>>>>>>> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>>>> elem->obj = obj;
>>>>>>>> kref_init(&elem->ref_cnt);
>>>>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>> - &pool->next, GFP_KERNEL);
>>>>>>>> + xa_lock_irq(&pool->xa);
>>>>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>> + &pool->next, GFP_KERNEL);
>>>>>>>> + xa_unlock_irq(&pool->xa);
>>>>>>>> if (err)
>>>>>>>> goto err_free;
>>>>>>>> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>>>> int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>>>> {
>>>>>>>> int err;
>>>>>>>> + unsigned long flags;
>>>>>>>> if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
>>>>>>>> return -EINVAL;
>>>>>>>> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>>>> elem->obj = (u8 *)elem - pool->elem_offset;
>>>>>>>> kref_init(&elem->ref_cnt);
>>>>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>> - &pool->next, GFP_KERNEL);
>>>>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>> + &pool->next, GFP_ATOMIC);
>>>>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>>>>> if (err)
>>>>>>>> goto err_cnt;
>>>>>>>> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
>>>>>>>> {
>>>>>>>> struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>>>>>>>> struct rxe_pool *pool = elem->pool;
>>>>>>>> + unsigned long flags;
>>>>>>>> - xa_erase(&pool->xa, elem->index);
>>>>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>>>>> + __xa_erase(&pool->xa, elem->index);
>>>>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>>>>> if (pool->cleanup)
>>>>>>>> pool->cleanup(elem);
>>> Here is my output. Everything passes there are no bugs or unexpected warnings in the kernel trace.
>>
>> If I understand you correctly, you mean that the bug reported by Zhang Yi does not exist?
Not any more on my rxe tree.
>>
>> I can reproduce this bug with rping.
My tree does not cause this bug any more in rping. It used to but it was fixed a few days ago.
But there remains a fairly rare race condition in rping which I described in a previous note related
to retry timeouts occuring for RDMA read operations. It is caused by spurious retry timer firing
and something wrong in the error path code that I am working on trying to isolate.
>>
>> You can not reproduce this bug. It does not mean that this bug does not exist.
>>
>> And with rping, I also found another wr NULL bug. From the mail, you can also verify this wr NULL bug.
>>
>> Let us foucus on this wr NULL bug. OK?
>>
>> Zhu Yanjun
>>
>>>
>>> bob@ubuntu-21:~/src/blktests$ sudo ./check -q srp
>>>
>>> srp/001 (Create and remove LUNs) [passed]
>>>
>>> runtime 3.402s ... 2.753s
>>>
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]time 34.431s ...
>>>
>>> runtime 34.431s ... 34.328s
>>>
>>> srp/003 (File I/O on top of multipath concurrently with logout and login (sq)) [not run]
>>>
>>> legacy device mapper support is missing
>>>
>>> srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-mq)) [not run]
>>>
>>> legacy device mapper support is missing
>>>
>>> srp/005 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>>
>>> runtime 14.332s ... 12.919s
>>>
>>> srp/006 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>>
>>> runtime 13.361s ... 12.959s
>>>
>>> srp/007 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=4M) [passed]
>>>
>>> runtime 14.293s ... 12.912s
>>>
>>> srp/008 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=8M) [passed]
>>>
>>> runtime 13.369s ... 13.165s
>>>
>>> srp/009 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>>
>>> runtime 13.636s ... 14.201s
>>>
>>> srp/010 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>>
>>> runtime 13.361s ... 12.909s
>>>
>>> srp/011 (Block I/O on top of multipath concurrently with logout and login) [passed]
>>>
>>> runtime 33.706s ... 33.571s
>>>
>>> srp/012 (dm-mpath on top of multiple I/O schedulers) [passed]
>>>
>>> runtime 13.592s ... 14.138s
>>>
>>> srp/013 (Direct I/O using a discontiguous buffer) [passed]
>>>
>>> runtime 3.230s ... 3.513s
>>>
>>> srp/014 (Run sg_reset while I/O is ongoing) [passed]
>>>
>>> runtime 33.070s ... 33.059s
>>>
>>> srp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) dsrp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) driver) [passed].148s ...
>>
>> you are using SoftiWARP (siw)?
>
> not me. it is just the normal behavior of the srp/015 test case. it has always done that. my rdma-core
> does support siw.
>
>>
>>>
>>> runtime 35.148s ... 34.974s
>>>
>>> bob@ubuntu-21:~/src/blktests$
>>>
>>> Bob
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 7:42 ` Yanjun Zhu
@ 2022-04-15 7:49 ` Bob Pearson
2022-04-15 7:52 ` Yanjun Zhu
0 siblings, 1 reply; 17+ messages in thread
From: Bob Pearson @ 2022-04-15 7:49 UTC (permalink / raw)
To: Yanjun Zhu, jgg, leon, linux-rdma
On 4/15/22 02:42, Yanjun Zhu wrote:
> 在 2022/4/15 15:35, Bob Pearson 写道:
>> On 4/15/22 02:32, Yanjun Zhu wrote:
>>>
>>> 在 2022/4/15 15:22, Bob Pearson 写道:
>>>> On 4/15/22 01:49, Yanjun Zhu wrote:
>>>>> 在 2022/4/15 14:35, Bob Pearson 写道:
>>>>>> On 4/15/22 00:54, Yanjun Zhu wrote:
>>>>>>> 在 2022/4/15 13:37, Bob Pearson 写道:
>>>>>>>> On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
>>>>>>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>>>>
>>>>>>>>> This is a dead lock problem.
>>>>>>>>> The xa_lock first is acquired in this:
>>>>>>>>>
>>>>>>>>> {SOFTIRQ-ON-W} state was registered at:
>>>>>>>>>
>>>>>>>>> lock_acquire+0x1d2/0x5a0
>>>>>>>>> _raw_spin_lock+0x33/0x80
>>>>>>>>> __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
>>>>>>>>> __ib_alloc_pd+0xf9/0x550 [ib_core]
>>>>>>>>> ib_mad_init_device+0x2d9/0xd20 [ib_core]
>>>>>>>>> add_client_context+0x2fa/0x450 [ib_core]
>>>>>>>>> enable_device_and_get+0x1b7/0x350 [ib_core]
>>>>>>>>> ib_register_device+0x757/0xaf0 [ib_core]
>>>>>>>>> rxe_register_device+0x2eb/0x390 [rdma_rxe]
>>>>>>>>> rxe_net_add+0x83/0xc0 [rdma_rxe]
>>>>>>>>> rxe_newlink+0x76/0x90 [rdma_rxe]
>>>>>>>>> nldev_newlink+0x245/0x3e0 [ib_core]
>>>>>>>>> rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
>>>>>>>>> rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
>>>>>>>>> netlink_unicast+0x43b/0x640
>>>>>>>>> netlink_sendmsg+0x7eb/0xc40
>>>>>>>>> sock_sendmsg+0xe0/0x110
>>>>>>>>> __sys_sendto+0x1d7/0x2b0
>>>>>>>>> __x64_sys_sendto+0xdd/0x1b0
>>>>>>>>> do_syscall_64+0x37/0x80
>>>>>>>>> entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>>>> There is a separate xarray for each object pool. So this one is
>>>>>>>> rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>>>>>>>>
>>>>>>>>> Then xa_lock is acquired in this:
>>>>>>>>>
>>>>>>>>> {IN-SOFTIRQ-W}:
>>>>>>>>>
>>>>>>>>> Call Trace:
>>>>>>>>> <TASK>
>>>>>>>>> dump_stack_lvl+0x44/0x57
>>>>>>>>> mark_lock.part.52.cold.79+0x3c/0x46
>>>>>>>>> __lock_acquire+0x1565/0x34a0
>>>>>>>>> lock_acquire+0x1d2/0x5a0
>>>>>>>>> _raw_spin_lock_irqsave+0x42/0x90
>>>>>>>>> rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
>>>>>>>>> rxe_get_av+0x168/0x2a0 [rdma_rxe]
>>>>>>>>> rxe_requester+0x75b/0x4a90 [rdma_rxe]
>>>>>>>>> rxe_do_task+0x134/0x230 [rdma_rxe]
>>>>>>>>> tasklet_action_common.isra.12+0x1f7/0x2d0
>>>>>>>>> __do_softirq+0x1ea/0xa4c
>>>>>>>>> run_ksoftirqd+0x32/0x60
>>>>>>>>> smpboot_thread_fn+0x503/0x860
>>>>>>>>> kthread+0x29b/0x340
>>>>>>>>> ret_from_fork+0x1f/0x30
>>>>>>>> And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
>>>>>>>> in the process of sending a UD packet from a work request
>>>>>>>> which contains the index of the ah.
>>>>>>>>
>>>>>>>> For your story to work there needs to be an another ah_pool.xa.xa_lock somewhere.
>>>>>>>> Let's assume it is there somewhere and it's from (a different) add_to_pool call
>>>>>>>> then the add_to_pool_ routine should disable interrupts when it gets the lock
>>>>>>>> with spin_lock_xxx. But only for AH objects.
>>>>>>>>
>>>>>>>> This may be old news.
>>>>>>> What do you mean? Please check the call trace in the bug.
>>>>>> I mean the trace you show here shows an instance of xa_lock being
>>>>>> acquired from the pd pool followed by an instance of xa_lock being
>>>>>> acquired from rxe_pool_get_index from the ah pool. They are different
>>>>>> locks. They can't deadlock against each other. So there must be
>>>>>> some other trace (not shown) that also gets xa_lock from the ah pool.
>>>>> Please check the bug report mail. The link is news://nntp.lore.kernel.org:119/CAHj4cs-MT13RiEsWXUAcX_H5jEtjsebuZgSeUcfptNVuELgjYQ@mail.gmail.com
>>>>>
>>>>> BTW, what is the update about wr crash caused by your xarray patches?
>>>>>
>>>>> Zhu Yanjun
>>>>>
>>>>>>> Zhu Yanjun
>>>>>>>
>>>>>>>>> </TASK>
>>>>>>>>>
>>>>>>>>> From the above, in the function __rxe_add_to_pool,
>>>>>>>>> xa_lock is acquired. Then the function __rxe_add_to_pool
>>>>>>>>> is interrupted by softirq. The function
>>>>>>>>> rxe_pool_get_index will also acquire xa_lock.
>>>>>>>>>
>>>>>>>>> Finally, the dead lock appears.
>>>>>>>>>
>>>>>>>>> [ 296.806097] CPU0
>>>>>>>>> [ 296.808550] ----
>>>>>>>>> [ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
>>>>>>>>> [ 296.814583] <Interrupt>
>>>>>>>>> [ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
>>>>>>>>> [ 296.820961]
>>>>>>>>> *** DEADLOCK ***
>>>>>>>>>
>>>>>>>>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
>>>>>>>>> Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
>>>>>>>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>>>> ---
>>>>>>>>> V3->V4: xa_lock_irq locks are used.
>>>>>>>>> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
>>>>>>>>> GFP_ATOMIC is used in __rxe_add_to_pool.
>>>>>>>>> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
>>>>>>>>> ---
>>>>>>>>> drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
>>>>>>>>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>>> index 87066d04ed18..f1f06dc7e64f 100644
>>>>>>>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>>> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>>>>>>>> atomic_set(&pool->num_elem, 0);
>>>>>>>>> - xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
>>>>>>>>> + xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
>>>>>>>>> pool->limit.min = info->min_index;
>>>>>>>>> pool->limit.max = info->max_index;
>>>>>>>>> }
>>>>>>>>> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>>>>> elem->obj = obj;
>>>>>>>>> kref_init(&elem->ref_cnt);
>>>>>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>> - &pool->next, GFP_KERNEL);
>>>>>>>>> + xa_lock_irq(&pool->xa);
>>>>>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>> + &pool->next, GFP_KERNEL);
>>>>>>>>> + xa_unlock_irq(&pool->xa);
>>>>>>>>> if (err)
>>>>>>>>> goto err_free;
>>>>>>>>> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>>>>> int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>>>>> {
>>>>>>>>> int err;
>>>>>>>>> + unsigned long flags;
>>>>>>>>> if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
>>>>>>>>> return -EINVAL;
>>>>>>>>> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>>>>> elem->obj = (u8 *)elem - pool->elem_offset;
>>>>>>>>> kref_init(&elem->ref_cnt);
>>>>>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>> - &pool->next, GFP_KERNEL);
>>>>>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>> + &pool->next, GFP_ATOMIC);
>>>>>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>>>>>> if (err)
>>>>>>>>> goto err_cnt;
>>>>>>>>> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
>>>>>>>>> {
>>>>>>>>> struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>>>>>>>>> struct rxe_pool *pool = elem->pool;
>>>>>>>>> + unsigned long flags;
>>>>>>>>> - xa_erase(&pool->xa, elem->index);
>>>>>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>>>>>> + __xa_erase(&pool->xa, elem->index);
>>>>>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>>>>>> if (pool->cleanup)
>>>>>>>>> pool->cleanup(elem);
>>>> Here is my output. Everything passes there are no bugs or unexpected warnings in the kernel trace.
>>>
>>> If I understand you correctly, you mean that the bug reported by Zhang Yi does not exist?
>>>
>>> I can reproduce this bug with rping.
>>>
>>> You can not reproduce this bug. It does not mean that this bug does not exist.
>>>
>>> And with rping, I also found another wr NULL bug. From the mail, you can also verify this wr NULL bug.
>>>
>>> Let us foucus on this wr NULL bug. OK?
>>>
>>> Zhu Yanjun
>>>
>>>>
>>>> bob@ubuntu-21:~/src/blktests$ sudo ./check -q srp
>>>>
>>>> srp/001 (Create and remove LUNs) [passed]
>>>>
>>>> runtime 3.402s ... 2.753s
>>>>
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]time 34.431s ...
>>>>
>>>> runtime 34.431s ... 34.328s
>>>>
>>>> srp/003 (File I/O on top of multipath concurrently with logout and login (sq)) [not run]
>>>>
>>>> legacy device mapper support is missing
>>>>
>>>> srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-mq)) [not run]
>>>>
>>>> legacy device mapper support is missing
>>>>
>>>> srp/005 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>>>
>>>> runtime 14.332s ... 12.919s
>>>>
>>>> srp/006 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>>>
>>>> runtime 13.361s ... 12.959s
>>>>
>>>> srp/007 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=4M) [passed]
>>>>
>>>> runtime 14.293s ... 12.912s
>>>>
>>>> srp/008 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=8M) [passed]
>>>>
>>>> runtime 13.369s ... 13.165s
>>>>
>>>> srp/009 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>>>
>>>> runtime 13.636s ... 14.201s
>>>>
>>>> srp/010 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>>>
>>>> runtime 13.361s ... 12.909s
>>>>
>>>> srp/011 (Block I/O on top of multipath concurrently with logout and login) [passed]
>>>>
>>>> runtime 33.706s ... 33.571s
>>>>
>>>> srp/012 (dm-mpath on top of multiple I/O schedulers) [passed]
>>>>
>>>> runtime 13.592s ... 14.138s
>>>>
>>>> srp/013 (Direct I/O using a discontiguous buffer) [passed]
>>>>
>>>> runtime 3.230s ... 3.513s
>>>>
>>>> srp/014 (Run sg_reset while I/O is ongoing) [passed]
>>>>
>>>> runtime 33.070s ... 33.059s
>>>>
>>>> srp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) dsrp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) driver) [passed].148s ...
>>>
>>> you are using SoftiWARP (siw)?
>>
>> not me. it is just the normal behavior of the srp/015 test case. it has always done that. my rdma-core
>> does support siw.
>
> Fine.
> Let us find the root cause of wr NULL problem.
> I revert xarray patches and fell back to original source code.
> This wr NULL problem does not exist.
> I am working on it.
>
> Hope we can fix this wr NULL problem very soon.
I have to go to bed. But the mr == NULL bug was fixed by the last 10 rxe pool patches.
I am sure it was the 8/10 patch which fixed it. I have never seen it once all the
rxe_pool patches were applied.
>
> Zhu Yanjun
>
>>
>>>
>>>>
>>>> runtime 35.148s ... 34.974s
>>>>
>>>> bob@ubuntu-21:~/src/blktests$
>>>>
>>>> Bob
>>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 7:49 ` Bob Pearson
@ 2022-04-15 7:52 ` Yanjun Zhu
0 siblings, 0 replies; 17+ messages in thread
From: Yanjun Zhu @ 2022-04-15 7:52 UTC (permalink / raw)
To: Bob Pearson, jgg, leon, linux-rdma
在 2022/4/15 15:49, Bob Pearson 写道:
> On 4/15/22 02:42, Yanjun Zhu wrote:
>> 在 2022/4/15 15:35, Bob Pearson 写道:
>>> On 4/15/22 02:32, Yanjun Zhu wrote:
>>>> 在 2022/4/15 15:22, Bob Pearson 写道:
>>>>> On 4/15/22 01:49, Yanjun Zhu wrote:
>>>>>> 在 2022/4/15 14:35, Bob Pearson 写道:
>>>>>>> On 4/15/22 00:54, Yanjun Zhu wrote:
>>>>>>>> 在 2022/4/15 13:37, Bob Pearson 写道:
>>>>>>>>> On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
>>>>>>>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>>>>>
>>>>>>>>>> This is a dead lock problem.
>>>>>>>>>> The xa_lock first is acquired in this:
>>>>>>>>>>
>>>>>>>>>> {SOFTIRQ-ON-W} state was registered at:
>>>>>>>>>>
>>>>>>>>>> lock_acquire+0x1d2/0x5a0
>>>>>>>>>> _raw_spin_lock+0x33/0x80
>>>>>>>>>> __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
>>>>>>>>>> __ib_alloc_pd+0xf9/0x550 [ib_core]
>>>>>>>>>> ib_mad_init_device+0x2d9/0xd20 [ib_core]
>>>>>>>>>> add_client_context+0x2fa/0x450 [ib_core]
>>>>>>>>>> enable_device_and_get+0x1b7/0x350 [ib_core]
>>>>>>>>>> ib_register_device+0x757/0xaf0 [ib_core]
>>>>>>>>>> rxe_register_device+0x2eb/0x390 [rdma_rxe]
>>>>>>>>>> rxe_net_add+0x83/0xc0 [rdma_rxe]
>>>>>>>>>> rxe_newlink+0x76/0x90 [rdma_rxe]
>>>>>>>>>> nldev_newlink+0x245/0x3e0 [ib_core]
>>>>>>>>>> rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
>>>>>>>>>> rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
>>>>>>>>>> netlink_unicast+0x43b/0x640
>>>>>>>>>> netlink_sendmsg+0x7eb/0xc40
>>>>>>>>>> sock_sendmsg+0xe0/0x110
>>>>>>>>>> __sys_sendto+0x1d7/0x2b0
>>>>>>>>>> __x64_sys_sendto+0xdd/0x1b0
>>>>>>>>>> do_syscall_64+0x37/0x80
>>>>>>>>>> entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>>>>> There is a separate xarray for each object pool. So this one is
>>>>>>>>> rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>>>>>>>>>
>>>>>>>>>> Then xa_lock is acquired in this:
>>>>>>>>>>
>>>>>>>>>> {IN-SOFTIRQ-W}:
>>>>>>>>>>
>>>>>>>>>> Call Trace:
>>>>>>>>>> <TASK>
>>>>>>>>>> dump_stack_lvl+0x44/0x57
>>>>>>>>>> mark_lock.part.52.cold.79+0x3c/0x46
>>>>>>>>>> __lock_acquire+0x1565/0x34a0
>>>>>>>>>> lock_acquire+0x1d2/0x5a0
>>>>>>>>>> _raw_spin_lock_irqsave+0x42/0x90
>>>>>>>>>> rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
>>>>>>>>>> rxe_get_av+0x168/0x2a0 [rdma_rxe]
>>>>>>>>>> rxe_requester+0x75b/0x4a90 [rdma_rxe]
>>>>>>>>>> rxe_do_task+0x134/0x230 [rdma_rxe]
>>>>>>>>>> tasklet_action_common.isra.12+0x1f7/0x2d0
>>>>>>>>>> __do_softirq+0x1ea/0xa4c
>>>>>>>>>> run_ksoftirqd+0x32/0x60
>>>>>>>>>> smpboot_thread_fn+0x503/0x860
>>>>>>>>>> kthread+0x29b/0x340
>>>>>>>>>> ret_from_fork+0x1f/0x30
>>>>>>>>> And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
>>>>>>>>> in the process of sending a UD packet from a work request
>>>>>>>>> which contains the index of the ah.
>>>>>>>>>
>>>>>>>>> For your story to work there needs to be an another ah_pool.xa.xa_lock somewhere.
>>>>>>>>> Let's assume it is there somewhere and it's from (a different) add_to_pool call
>>>>>>>>> then the add_to_pool_ routine should disable interrupts when it gets the lock
>>>>>>>>> with spin_lock_xxx. But only for AH objects.
>>>>>>>>>
>>>>>>>>> This may be old news.
>>>>>>>> What do you mean? Please check the call trace in the bug.
>>>>>>> I mean the trace you show here shows an instance of xa_lock being
>>>>>>> acquired from the pd pool followed by an instance of xa_lock being
>>>>>>> acquired from rxe_pool_get_index from the ah pool. They are different
>>>>>>> locks. They can't deadlock against each other. So there must be
>>>>>>> some other trace (not shown) that also gets xa_lock from the ah pool.
>>>>>> Please check the bug report mail. The link is news://nntp.lore.kernel.org:119/CAHj4cs-MT13RiEsWXUAcX_H5jEtjsebuZgSeUcfptNVuELgjYQ@mail.gmail.com
>>>>>>
>>>>>> BTW, what is the update about wr crash caused by your xarray patches?
>>>>>>
>>>>>> Zhu Yanjun
>>>>>>
>>>>>>>> Zhu Yanjun
>>>>>>>>
>>>>>>>>>> </TASK>
>>>>>>>>>>
>>>>>>>>>> From the above, in the function __rxe_add_to_pool,
>>>>>>>>>> xa_lock is acquired. Then the function __rxe_add_to_pool
>>>>>>>>>> is interrupted by softirq. The function
>>>>>>>>>> rxe_pool_get_index will also acquire xa_lock.
>>>>>>>>>>
>>>>>>>>>> Finally, the dead lock appears.
>>>>>>>>>>
>>>>>>>>>> [ 296.806097] CPU0
>>>>>>>>>> [ 296.808550] ----
>>>>>>>>>> [ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
>>>>>>>>>> [ 296.814583] <Interrupt>
>>>>>>>>>> [ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
>>>>>>>>>> [ 296.820961]
>>>>>>>>>> *** DEADLOCK ***
>>>>>>>>>>
>>>>>>>>>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
>>>>>>>>>> Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
>>>>>>>>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>>>>> ---
>>>>>>>>>> V3->V4: xa_lock_irq locks are used.
>>>>>>>>>> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
>>>>>>>>>> GFP_ATOMIC is used in __rxe_add_to_pool.
>>>>>>>>>> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
>>>>>>>>>> ---
>>>>>>>>>> drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
>>>>>>>>>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>>>> index 87066d04ed18..f1f06dc7e64f 100644
>>>>>>>>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>>>> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>>>>>>>>> atomic_set(&pool->num_elem, 0);
>>>>>>>>>> - xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
>>>>>>>>>> + xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
>>>>>>>>>> pool->limit.min = info->min_index;
>>>>>>>>>> pool->limit.max = info->max_index;
>>>>>>>>>> }
>>>>>>>>>> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>>>>>> elem->obj = obj;
>>>>>>>>>> kref_init(&elem->ref_cnt);
>>>>>>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>>> - &pool->next, GFP_KERNEL);
>>>>>>>>>> + xa_lock_irq(&pool->xa);
>>>>>>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>>> + &pool->next, GFP_KERNEL);
>>>>>>>>>> + xa_unlock_irq(&pool->xa);
>>>>>>>>>> if (err)
>>>>>>>>>> goto err_free;
>>>>>>>>>> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>>>>>> int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>>>>>> {
>>>>>>>>>> int err;
>>>>>>>>>> + unsigned long flags;
>>>>>>>>>> if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
>>>>>>>>>> return -EINVAL;
>>>>>>>>>> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>>>>>> elem->obj = (u8 *)elem - pool->elem_offset;
>>>>>>>>>> kref_init(&elem->ref_cnt);
>>>>>>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>>> - &pool->next, GFP_KERNEL);
>>>>>>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>>>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>>> + &pool->next, GFP_ATOMIC);
>>>>>>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>>>>>>> if (err)
>>>>>>>>>> goto err_cnt;
>>>>>>>>>> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
>>>>>>>>>> {
>>>>>>>>>> struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>>>>>>>>>> struct rxe_pool *pool = elem->pool;
>>>>>>>>>> + unsigned long flags;
>>>>>>>>>> - xa_erase(&pool->xa, elem->index);
>>>>>>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>>>>>>> + __xa_erase(&pool->xa, elem->index);
>>>>>>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>>>>>>> if (pool->cleanup)
>>>>>>>>>> pool->cleanup(elem);
>>>>> Here is my output. Everything passes there are no bugs or unexpected warnings in the kernel trace.
>>>> If I understand you correctly, you mean that the bug reported by Zhang Yi does not exist?
>>>>
>>>> I can reproduce this bug with rping.
>>>>
>>>> You can not reproduce this bug. It does not mean that this bug does not exist.
>>>>
>>>> And with rping, I also found another wr NULL bug. From the mail, you can also verify this wr NULL bug.
>>>>
>>>> Let us foucus on this wr NULL bug. OK?
>>>>
>>>> Zhu Yanjun
>>>>
>>>>> bob@ubuntu-21:~/src/blktests$ sudo ./check -q srp
>>>>>
>>>>> srp/001 (Create and remove LUNs) [passed]
>>>>>
>>>>> runtime 3.402s ... 2.753s
>>>>>
>>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]time 34.431s ...
>>>>>
>>>>> runtime 34.431s ... 34.328s
>>>>>
>>>>> srp/003 (File I/O on top of multipath concurrently with logout and login (sq)) [not run]
>>>>>
>>>>> legacy device mapper support is missing
>>>>>
>>>>> srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-mq)) [not run]
>>>>>
>>>>> legacy device mapper support is missing
>>>>>
>>>>> srp/005 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>>>>
>>>>> runtime 14.332s ... 12.919s
>>>>>
>>>>> srp/006 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>>>>
>>>>> runtime 13.361s ... 12.959s
>>>>>
>>>>> srp/007 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=4M) [passed]
>>>>>
>>>>> runtime 14.293s ... 12.912s
>>>>>
>>>>> srp/008 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=8M) [passed]
>>>>>
>>>>> runtime 13.369s ... 13.165s
>>>>>
>>>>> srp/009 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>>>>
>>>>> runtime 13.636s ... 14.201s
>>>>>
>>>>> srp/010 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>>>>
>>>>> runtime 13.361s ... 12.909s
>>>>>
>>>>> srp/011 (Block I/O on top of multipath concurrently with logout and login) [passed]
>>>>>
>>>>> runtime 33.706s ... 33.571s
>>>>>
>>>>> srp/012 (dm-mpath on top of multiple I/O schedulers) [passed]
>>>>>
>>>>> runtime 13.592s ... 14.138s
>>>>>
>>>>> srp/013 (Direct I/O using a discontiguous buffer) [passed]
>>>>>
>>>>> runtime 3.230s ... 3.513s
>>>>>
>>>>> srp/014 (Run sg_reset while I/O is ongoing) [passed]
>>>>>
>>>>> runtime 33.070s ... 33.059s
>>>>>
>>>>> srp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) dsrp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) driver) [passed].148s ...
>>>> you are using SoftiWARP (siw)?
>>> not me. it is just the normal behavior of the srp/015 test case. it has always done that. my rdma-core
>>> does support siw.
>> Fine.
>> Let us find the root cause of wr NULL problem.
>> I revert xarray patches and fell back to original source code.
>> This wr NULL problem does not exist.
>> I am working on it.
>>
>> Hope we can fix this wr NULL problem very soon.
> I have to go to bed. But the mr == NULL bug was fixed by the last 10 rxe pool patches.
> I am sure it was the 8/10 patch which fixed it.
Please explain the root cause to us and send out the single patch.
Thanks a lot.
Zhu Yanjun
> I have never seen it once all the
> rxe_pool patches were applied.
>
>> Zhu Yanjun
>>
>>>>> runtime 35.148s ... 34.974s
>>>>>
>>>>> bob@ubuntu-21:~/src/blktests$
>>>>>
>>>>> Bob
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 7:44 ` Bob Pearson
@ 2022-04-15 8:50 ` Yanjun Zhu
0 siblings, 0 replies; 17+ messages in thread
From: Yanjun Zhu @ 2022-04-15 8:50 UTC (permalink / raw)
To: Bob Pearson, Yanjun Zhu, jgg, leon, linux-rdma
在 2022/4/15 15:44, Bob Pearson 写道:
> On 4/15/22 02:35, Bob Pearson wrote:
>> On 4/15/22 02:32, Yanjun Zhu wrote:
>>>
>>> 在 2022/4/15 15:22, Bob Pearson 写道:
>>>> On 4/15/22 01:49, Yanjun Zhu wrote:
>>>>> 在 2022/4/15 14:35, Bob Pearson 写道:
>>>>>> On 4/15/22 00:54, Yanjun Zhu wrote:
>>>>>>> 在 2022/4/15 13:37, Bob Pearson 写道:
>>>>>>>> On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
>>>>>>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>>>>
>>>>>>>>> This is a dead lock problem.
>>>>>>>>> The xa_lock first is acquired in this:
>>>>>>>>>
>>>>>>>>> {SOFTIRQ-ON-W} state was registered at:
>>>>>>>>>
>>>>>>>>> lock_acquire+0x1d2/0x5a0
>>>>>>>>> _raw_spin_lock+0x33/0x80
>>>>>>>>> __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
>>>>>>>>> __ib_alloc_pd+0xf9/0x550 [ib_core]
>>>>>>>>> ib_mad_init_device+0x2d9/0xd20 [ib_core]
>>>>>>>>> add_client_context+0x2fa/0x450 [ib_core]
>>>>>>>>> enable_device_and_get+0x1b7/0x350 [ib_core]
>>>>>>>>> ib_register_device+0x757/0xaf0 [ib_core]
>>>>>>>>> rxe_register_device+0x2eb/0x390 [rdma_rxe]
>>>>>>>>> rxe_net_add+0x83/0xc0 [rdma_rxe]
>>>>>>>>> rxe_newlink+0x76/0x90 [rdma_rxe]
>>>>>>>>> nldev_newlink+0x245/0x3e0 [ib_core]
>>>>>>>>> rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
>>>>>>>>> rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
>>>>>>>>> netlink_unicast+0x43b/0x640
>>>>>>>>> netlink_sendmsg+0x7eb/0xc40
>>>>>>>>> sock_sendmsg+0xe0/0x110
>>>>>>>>> __sys_sendto+0x1d7/0x2b0
>>>>>>>>> __x64_sys_sendto+0xdd/0x1b0
>>>>>>>>> do_syscall_64+0x37/0x80
>>>>>>>>> entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>>>> There is a separate xarray for each object pool. So this one is
>>>>>>>> rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>>>>>>>>
>>>>>>>>> Then xa_lock is acquired in this:
>>>>>>>>>
>>>>>>>>> {IN-SOFTIRQ-W}:
>>>>>>>>>
>>>>>>>>> Call Trace:
>>>>>>>>> <TASK>
>>>>>>>>> dump_stack_lvl+0x44/0x57
>>>>>>>>> mark_lock.part.52.cold.79+0x3c/0x46
>>>>>>>>> __lock_acquire+0x1565/0x34a0
>>>>>>>>> lock_acquire+0x1d2/0x5a0
>>>>>>>>> _raw_spin_lock_irqsave+0x42/0x90
>>>>>>>>> rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
>>>>>>>>> rxe_get_av+0x168/0x2a0 [rdma_rxe]
>>>>>>>>> rxe_requester+0x75b/0x4a90 [rdma_rxe]
>>>>>>>>> rxe_do_task+0x134/0x230 [rdma_rxe]
>>>>>>>>> tasklet_action_common.isra.12+0x1f7/0x2d0
>>>>>>>>> __do_softirq+0x1ea/0xa4c
>>>>>>>>> run_ksoftirqd+0x32/0x60
>>>>>>>>> smpboot_thread_fn+0x503/0x860
>>>>>>>>> kthread+0x29b/0x340
>>>>>>>>> ret_from_fork+0x1f/0x30
>>>>>>>> And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
>>>>>>>> in the process of sending a UD packet from a work request
>>>>>>>> which contains the index of the ah.
>>>>>>>>
>>>>>>>> For your story to work there needs to be an another ah_pool.xa.xa_lock somewhere.
>>>>>>>> Let's assume it is there somewhere and it's from (a different) add_to_pool call
>>>>>>>> then the add_to_pool_ routine should disable interrupts when it gets the lock
>>>>>>>> with spin_lock_xxx. But only for AH objects.
>>>>>>>>
>>>>>>>> This may be old news.
>>>>>>> What do you mean? Please check the call trace in the bug.
>>>>>> I mean the trace you show here shows an instance of xa_lock being
>>>>>> acquired from the pd pool followed by an instance of xa_lock being
>>>>>> acquired from rxe_pool_get_index from the ah pool. They are different
>>>>>> locks. They can't deadlock against each other. So there must be
>>>>>> some other trace (not shown) that also gets xa_lock from the ah pool.
>>>>> Please check the bug report mail. The link is news://nntp.lore.kernel.org:119/CAHj4cs-MT13RiEsWXUAcX_H5jEtjsebuZgSeUcfptNVuELgjYQ@mail.gmail.com
>>>>>
>>>>> BTW, what is the update about wr crash caused by your xarray patches?
>>>>>
>>>>> Zhu Yanjun
>>>>>
>>>>>>> Zhu Yanjun
>>>>>>>
>>>>>>>>> </TASK>
>>>>>>>>>
>>>>>>>>> From the above, in the function __rxe_add_to_pool,
>>>>>>>>> xa_lock is acquired. Then the function __rxe_add_to_pool
>>>>>>>>> is interrupted by softirq. The function
>>>>>>>>> rxe_pool_get_index will also acquire xa_lock.
>>>>>>>>>
>>>>>>>>> Finally, the dead lock appears.
>>>>>>>>>
>>>>>>>>> [ 296.806097] CPU0
>>>>>>>>> [ 296.808550] ----
>>>>>>>>> [ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
>>>>>>>>> [ 296.814583] <Interrupt>
>>>>>>>>> [ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
>>>>>>>>> [ 296.820961]
>>>>>>>>> *** DEADLOCK ***
>>>>>>>>>
>>>>>>>>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
>>>>>>>>> Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
>>>>>>>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>>>> ---
>>>>>>>>> V3->V4: xa_lock_irq locks are used.
>>>>>>>>> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
>>>>>>>>> GFP_ATOMIC is used in __rxe_add_to_pool.
>>>>>>>>> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
>>>>>>>>> ---
>>>>>>>>> drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
>>>>>>>>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>>> index 87066d04ed18..f1f06dc7e64f 100644
>>>>>>>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>>>>> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>>>>>>>> atomic_set(&pool->num_elem, 0);
>>>>>>>>> - xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
>>>>>>>>> + xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
>>>>>>>>> pool->limit.min = info->min_index;
>>>>>>>>> pool->limit.max = info->max_index;
>>>>>>>>> }
>>>>>>>>> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>>>>> elem->obj = obj;
>>>>>>>>> kref_init(&elem->ref_cnt);
>>>>>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>> - &pool->next, GFP_KERNEL);
>>>>>>>>> + xa_lock_irq(&pool->xa);
>>>>>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>> + &pool->next, GFP_KERNEL);
>>>>>>>>> + xa_unlock_irq(&pool->xa);
>>>>>>>>> if (err)
>>>>>>>>> goto err_free;
>>>>>>>>> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
>>>>>>>>> int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>>>>> {
>>>>>>>>> int err;
>>>>>>>>> + unsigned long flags;
>>>>>>>>> if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
>>>>>>>>> return -EINVAL;
>>>>>>>>> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>>>>>>>>> elem->obj = (u8 *)elem - pool->elem_offset;
>>>>>>>>> kref_init(&elem->ref_cnt);
>>>>>>>>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>> - &pool->next, GFP_KERNEL);
>>>>>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>>>>>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>>>>>>>>> + &pool->next, GFP_ATOMIC);
>>>>>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>>>>>> if (err)
>>>>>>>>> goto err_cnt;
>>>>>>>>> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
>>>>>>>>> {
>>>>>>>>> struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>>>>>>>>> struct rxe_pool *pool = elem->pool;
>>>>>>>>> + unsigned long flags;
>>>>>>>>> - xa_erase(&pool->xa, elem->index);
>>>>>>>>> + xa_lock_irqsave(&pool->xa, flags);
>>>>>>>>> + __xa_erase(&pool->xa, elem->index);
>>>>>>>>> + xa_unlock_irqrestore(&pool->xa, flags);
>>>>>>>>> if (pool->cleanup)
>>>>>>>>> pool->cleanup(elem);
>>>> Here is my output. Everything passes there are no bugs or unexpected warnings in the kernel trace.
>>>
>>> If I understand you correctly, you mean that the bug reported by Zhang Yi does not exist?
>
> Not any more on my rxe tree.
>>>
>>> I can reproduce this bug with rping.
>
> My tree does not cause this bug any more in rping. It used to but it was fixed a few days ago.
> But there remains a fairly rare race condition in rping which I described in a previous note related
> to retry timeouts occuring for RDMA read operations. It is caused by spurious retry timer firing
> and something wrong in the error path code that I am working on trying to isolate.
Hi, Bob
We should be in the same kernel tree.
Based on linux 5.18-rc2, a wr NULL caused by xarry patches.
To this problem, Do you know the root cause? And can you fix it?
If yes, please show us the root cause and fix.
Zhu Yanjun
>>>
>>> You can not reproduce this bug. It does not mean that this bug does not exist.
>>>
>>> And with rping, I also found another wr NULL bug. From the mail, you can also verify this wr NULL bug.
>>>
>>> Let us foucus on this wr NULL bug. OK?
>>>
>>> Zhu Yanjun
>>>
>>>>
>>>> bob@ubuntu-21:~/src/blktests$ sudo ./check -q srp
>>>>
>>>> srp/001 (Create and remove LUNs) [passed]
>>>>
>>>> runtime 3.402s ... 2.753s
>>>>
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]time 34.431s ...
>>>>
>>>> runtime 34.431s ... 34.328s
>>>>
>>>> srp/003 (File I/O on top of multipath concurrently with logout and login (sq)) [not run]
>>>>
>>>> legacy device mapper support is missing
>>>>
>>>> srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-mq)) [not run]
>>>>
>>>> legacy device mapper support is missing
>>>>
>>>> srp/005 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>>>
>>>> runtime 14.332s ... 12.919s
>>>>
>>>> srp/006 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>>>
>>>> runtime 13.361s ... 12.959s
>>>>
>>>> srp/007 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=4M) [passed]
>>>>
>>>> runtime 14.293s ... 12.912s
>>>>
>>>> srp/008 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=8M) [passed]
>>>>
>>>> runtime 13.369s ... 13.165s
>>>>
>>>> srp/009 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>>>
>>>> runtime 13.636s ... 14.201s
>>>>
>>>> srp/010 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>>>
>>>> runtime 13.361s ... 12.909s
>>>>
>>>> srp/011 (Block I/O on top of multipath concurrently with logout and login) [passed]
>>>>
>>>> runtime 33.706s ... 33.571s
>>>>
>>>> srp/012 (dm-mpath on top of multiple I/O schedulers) [passed]
>>>>
>>>> runtime 13.592s ... 14.138s
>>>>
>>>> srp/013 (Direct I/O using a discontiguous buffer) [passed]
>>>>
>>>> runtime 3.230s ... 3.513s
>>>>
>>>> srp/014 (Run sg_reset while I/O is ongoing) [passed]
>>>>
>>>> runtime 33.070s ... 33.059s
>>>>
>>>> srp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) dsrp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) driver) [passed].148s ...
>>>
>>> you are using SoftiWARP (siw)?
>>
>> not me. it is just the normal behavior of the srp/015 test case. it has always done that. my rdma-core
>> does support siw.
>>
>>>
>>>>
>>>> runtime 35.148s ... 34.974s
>>>>
>>>> bob@ubuntu-21:~/src/blktests$
>>>>
>>>> Bob
>>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 5:37 ` Bob Pearson
2022-04-15 5:54 ` Yanjun Zhu
@ 2022-04-15 10:25 ` Yanjun Zhu
2022-04-15 15:19 ` Yanjun Zhu
2 siblings, 0 replies; 17+ messages in thread
From: Yanjun Zhu @ 2022-04-15 10:25 UTC (permalink / raw)
To: Bob Pearson, yanjun.zhu, jgg, leon, linux-rdma
在 2022/4/15 13:37, Bob Pearson 写道:
> On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>
>> This is a dead lock problem.
>> The xa_lock first is acquired in this:
>>
>> {SOFTIRQ-ON-W} state was registered at:
>>
>> lock_acquire+0x1d2/0x5a0
>> _raw_spin_lock+0x33/0x80
>> __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
>> __ib_alloc_pd+0xf9/0x550 [ib_core]
>> ib_mad_init_device+0x2d9/0xd20 [ib_core]
>> add_client_context+0x2fa/0x450 [ib_core]
>> enable_device_and_get+0x1b7/0x350 [ib_core]
>> ib_register_device+0x757/0xaf0 [ib_core]
>> rxe_register_device+0x2eb/0x390 [rdma_rxe]
>> rxe_net_add+0x83/0xc0 [rdma_rxe]
>> rxe_newlink+0x76/0x90 [rdma_rxe]
>> nldev_newlink+0x245/0x3e0 [ib_core]
>> rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
>> rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
>> netlink_unicast+0x43b/0x640
>> netlink_sendmsg+0x7eb/0xc40
>> sock_sendmsg+0xe0/0x110
>> __sys_sendto+0x1d7/0x2b0
>> __x64_sys_sendto+0xdd/0x1b0
>> do_syscall_64+0x37/0x80
>> entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> There is a separate xarray for each object pool. So this one is
> rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>
>>
>> Then xa_lock is acquired in this:
>>
>> {IN-SOFTIRQ-W}:
>>
>> Call Trace:
>> <TASK>
>> dump_stack_lvl+0x44/0x57
>> mark_lock.part.52.cold.79+0x3c/0x46
>> __lock_acquire+0x1565/0x34a0
>> lock_acquire+0x1d2/0x5a0
>> _raw_spin_lock_irqsave+0x42/0x90
>> rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
>> rxe_get_av+0x168/0x2a0 [rdma_rxe]
>> rxe_requester+0x75b/0x4a90 [rdma_rxe]
>> rxe_do_task+0x134/0x230 [rdma_rxe]
>> tasklet_action_common.isra.12+0x1f7/0x2d0
>> __do_softirq+0x1ea/0xa4c
>> run_ksoftirqd+0x32/0x60
>> smpboot_thread_fn+0x503/0x860
>> kthread+0x29b/0x340
>> ret_from_fork+0x1f/0x30
>
> And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
> in the process of sending a UD packet from a work request
> which contains the index of the ah.
>
> For your story to work there needs to be an another ah_pool.xa.xa_lock somewhere.
> Let's assume it is there somewhere and it's from (a different) add_to_pool call
> then the add_to_pool_ routine should disable interrupts when it gets the lock
> with spin_lock_xxx. But only for AH objects.
If I understand you correctly, you are suspecting the above call trace.
If yes, please check the bug reported in the maillist.
And please based on 5.18-rc2 to discuss this bug because this bug
occurred in 5.18-rc2. And we are working on 5.18-rc2.
Zhu Yanjun
>
> This may be old news.
>
>> </TASK>
>>
>> From the above, in the function __rxe_add_to_pool,
>> xa_lock is acquired. Then the function __rxe_add_to_pool
>> is interrupted by softirq. The function
>> rxe_pool_get_index will also acquire xa_lock.
>>
>> Finally, the dead lock appears.
>>
>> [ 296.806097] CPU0
>> [ 296.808550] ----
>> [ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
>> [ 296.814583] <Interrupt>
>> [ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
>> [ 296.820961]
>> *** DEADLOCK ***
>>
>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
>> Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>> ---
>> V3->V4: xa_lock_irq locks are used.
>> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
>> GFP_ATOMIC is used in __rxe_add_to_pool.
>> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
>> ---
>> drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>> index 87066d04ed18..f1f06dc7e64f 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>
>> atomic_set(&pool->num_elem, 0);
>>
>> - xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
>> + xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
>> pool->limit.min = info->min_index;
>> pool->limit.max = info->max_index;
>> }
>> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
>> elem->obj = obj;
>> kref_init(&elem->ref_cnt);
>>
>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>> - &pool->next, GFP_KERNEL);
>> + xa_lock_irq(&pool->xa);
>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>> + &pool->next, GFP_KERNEL);
>> + xa_unlock_irq(&pool->xa);
>> if (err)
>> goto err_free;
>>
>> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
>> int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>> {
>> int err;
>> + unsigned long flags;
>>
>> if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
>> return -EINVAL;
>> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>> elem->obj = (u8 *)elem - pool->elem_offset;
>> kref_init(&elem->ref_cnt);
>>
>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>> - &pool->next, GFP_KERNEL);
>> + xa_lock_irqsave(&pool->xa, flags);
>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>> + &pool->next, GFP_ATOMIC);
>> + xa_unlock_irqrestore(&pool->xa, flags);
>> if (err)
>> goto err_cnt;
>>
>> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
>> {
>> struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>> struct rxe_pool *pool = elem->pool;
>> + unsigned long flags;
>>
>> - xa_erase(&pool->xa, elem->index);
>> + xa_lock_irqsave(&pool->xa, flags);
>> + __xa_erase(&pool->xa, elem->index);
>> + xa_unlock_irqrestore(&pool->xa, flags);
>>
>> if (pool->cleanup)
>> pool->cleanup(elem);
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 5:37 ` Bob Pearson
2022-04-15 5:54 ` Yanjun Zhu
2022-04-15 10:25 ` Yanjun Zhu
@ 2022-04-15 15:19 ` Yanjun Zhu
2 siblings, 0 replies; 17+ messages in thread
From: Yanjun Zhu @ 2022-04-15 15:19 UTC (permalink / raw)
To: Bob Pearson, yanjun.zhu, jgg, leon, linux-rdma
在 2022/4/15 13:37, Bob Pearson 写道:
> On 4/15/22 14:56, yanjun.zhu@linux.dev wrote:
>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>
>> This is a dead lock problem.
>> The xa_lock first is acquired in this:
>>
>> {SOFTIRQ-ON-W} state was registered at:
>>
>> lock_acquire+0x1d2/0x5a0
>> _raw_spin_lock+0x33/0x80
>> __rxe_add_to_pool+0x183/0x230 [rdma_rxe]
>> __ib_alloc_pd+0xf9/0x550 [ib_core]
>> ib_mad_init_device+0x2d9/0xd20 [ib_core]
>> add_client_context+0x2fa/0x450 [ib_core]
>> enable_device_and_get+0x1b7/0x350 [ib_core]
>> ib_register_device+0x757/0xaf0 [ib_core]
>> rxe_register_device+0x2eb/0x390 [rdma_rxe]
>> rxe_net_add+0x83/0xc0 [rdma_rxe]
>> rxe_newlink+0x76/0x90 [rdma_rxe]
>> nldev_newlink+0x245/0x3e0 [ib_core]
>> rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
>> rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
>> netlink_unicast+0x43b/0x640
>> netlink_sendmsg+0x7eb/0xc40
>> sock_sendmsg+0xe0/0x110
>> __sys_sendto+0x1d7/0x2b0
>> __x64_sys_sendto+0xdd/0x1b0
>> do_syscall_64+0x37/0x80
>> entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> There is a separate xarray for each object pool. So this one is
> rxe->pd_pool.xa.xa_lock from rxe_alloc_pd().
>
>>
>> Then xa_lock is acquired in this:
>>
>> {IN-SOFTIRQ-W}:
>>
>> Call Trace:
>> <TASK>
>> dump_stack_lvl+0x44/0x57
>> mark_lock.part.52.cold.79+0x3c/0x46
>> __lock_acquire+0x1565/0x34a0
>> lock_acquire+0x1d2/0x5a0
>> _raw_spin_lock_irqsave+0x42/0x90
>> rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
>> rxe_get_av+0x168/0x2a0 [rdma_rxe]
>> rxe_requester+0x75b/0x4a90 [rdma_rxe]
>> rxe_do_task+0x134/0x230 [rdma_rxe]
>> tasklet_action_common.isra.12+0x1f7/0x2d0
>> __do_softirq+0x1ea/0xa4c
>> run_ksoftirqd+0x32/0x60
>> smpboot_thread_fn+0x503/0x860
>> kthread+0x29b/0x340
>> ret_from_fork+0x1f/0x30
>
> And this one is rxe->ah_pool.xa.xa_lock from rxe_requester
> in the process of sending a UD packet from a work request
> which contains the index of the ah.
>
> For your story to work there needs to be an another ah_pool.xa.xa_lock somewhere.
> Let's assume it is there somewhere and it's from (a different) add_to_pool call
> then the add_to_pool_ routine should disable interrupts when it gets the lock
> with spin_lock_xxx. But only for AH objects.
Fortunately, I can reproduce this problem.
"
Apr 15 19:06:05 kernel: rdma_rxe: drivers/infiniband/sw/rxe/rxe_pool.c
+169, __rxe_add_to_pool, name:ah
Apr 15 19:06:05 kernel: rdma_rxe: drivers/infiniband/sw/rxe/rxe_pool.c
+189, rxe_pool_get_index, name:ah
"
Please check the above logs. Focus on the pool name.
Zhu Yanjun
>
> This may be old news.
>
>> </TASK>
>>
>> From the above, in the function __rxe_add_to_pool,
>> xa_lock is acquired. Then the function __rxe_add_to_pool
>> is interrupted by softirq. The function
>> rxe_pool_get_index will also acquire xa_lock.
>>
>> Finally, the dead lock appears.
>>
>> [ 296.806097] CPU0
>> [ 296.808550] ----
>> [ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
>> [ 296.814583] <Interrupt>
>> [ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
>> [ 296.820961]
>> *** DEADLOCK ***
>>
>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
>> Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>> ---
>> V3->V4: xa_lock_irq locks are used.
>> V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
>> GFP_ATOMIC is used in __rxe_add_to_pool.
>> V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
>> ---
>> drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>> index 87066d04ed18..f1f06dc7e64f 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>> @@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>
>> atomic_set(&pool->num_elem, 0);
>>
>> - xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
>> + xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
>> pool->limit.min = info->min_index;
>> pool->limit.max = info->max_index;
>> }
>> @@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
>> elem->obj = obj;
>> kref_init(&elem->ref_cnt);
>>
>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>> - &pool->next, GFP_KERNEL);
>> + xa_lock_irq(&pool->xa);
>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>> + &pool->next, GFP_KERNEL);
>> + xa_unlock_irq(&pool->xa);
>> if (err)
>> goto err_free;
>>
>> @@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
>> int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>> {
>> int err;
>> + unsigned long flags;
>>
>> if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
>> return -EINVAL;
>> @@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>> elem->obj = (u8 *)elem - pool->elem_offset;
>> kref_init(&elem->ref_cnt);
>>
>> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>> - &pool->next, GFP_KERNEL);
>> + xa_lock_irqsave(&pool->xa, flags);
>> + err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
>> + &pool->next, GFP_ATOMIC);
>> + xa_unlock_irqrestore(&pool->xa, flags);
>> if (err)
>> goto err_cnt;
>>
>> @@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
>> {
>> struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>> struct rxe_pool *pool = elem->pool;
>> + unsigned long flags;
>>
>> - xa_erase(&pool->xa, elem->index);
>> + xa_lock_irqsave(&pool->xa, flags);
>> + __xa_erase(&pool->xa, elem->index);
>> + xa_unlock_irqrestore(&pool->xa, flags);
>>
>> if (pool->cleanup)
>> pool->cleanup(elem);
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
@ 2022-04-15 19:56 yanjun.zhu
2022-04-15 5:37 ` Bob Pearson
2022-04-15 19:56 ` [PATCH 2/2] RDMA/rxe: Use different xa locks on different path yanjun.zhu
0 siblings, 2 replies; 17+ messages in thread
From: yanjun.zhu @ 2022-04-15 19:56 UTC (permalink / raw)
To: jgg, leon, linux-rdma, yanjun.zhu; +Cc: Yi Zhang
From: Zhu Yanjun <yanjun.zhu@linux.dev>
This is a dead lock problem.
The xa_lock first is acquired in this:
{SOFTIRQ-ON-W} state was registered at:
lock_acquire+0x1d2/0x5a0
_raw_spin_lock+0x33/0x80
__rxe_add_to_pool+0x183/0x230 [rdma_rxe]
__ib_alloc_pd+0xf9/0x550 [ib_core]
ib_mad_init_device+0x2d9/0xd20 [ib_core]
add_client_context+0x2fa/0x450 [ib_core]
enable_device_and_get+0x1b7/0x350 [ib_core]
ib_register_device+0x757/0xaf0 [ib_core]
rxe_register_device+0x2eb/0x390 [rdma_rxe]
rxe_net_add+0x83/0xc0 [rdma_rxe]
rxe_newlink+0x76/0x90 [rdma_rxe]
nldev_newlink+0x245/0x3e0 [ib_core]
rdma_nl_rcv_msg+0x2d4/0x790 [ib_core]
rdma_nl_rcv+0x1ca/0x3f0 [ib_core]
netlink_unicast+0x43b/0x640
netlink_sendmsg+0x7eb/0xc40
sock_sendmsg+0xe0/0x110
__sys_sendto+0x1d7/0x2b0
__x64_sys_sendto+0xdd/0x1b0
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Then xa_lock is acquired in this:
{IN-SOFTIRQ-W}:
Call Trace:
<TASK>
dump_stack_lvl+0x44/0x57
mark_lock.part.52.cold.79+0x3c/0x46
__lock_acquire+0x1565/0x34a0
lock_acquire+0x1d2/0x5a0
_raw_spin_lock_irqsave+0x42/0x90
rxe_pool_get_index+0x72/0x1d0 [rdma_rxe]
rxe_get_av+0x168/0x2a0 [rdma_rxe]
rxe_requester+0x75b/0x4a90 [rdma_rxe]
rxe_do_task+0x134/0x230 [rdma_rxe]
tasklet_action_common.isra.12+0x1f7/0x2d0
__do_softirq+0x1ea/0xa4c
run_ksoftirqd+0x32/0x60
smpboot_thread_fn+0x503/0x860
kthread+0x29b/0x340
ret_from_fork+0x1f/0x30
</TASK>
From the above, in the function __rxe_add_to_pool,
xa_lock is acquired. Then the function __rxe_add_to_pool
is interrupted by softirq. The function
rxe_pool_get_index will also acquire xa_lock.
Finally, the dead lock appears.
[ 296.806097] CPU0
[ 296.808550] ----
[ 296.811003] lock(&xa->xa_lock#15); <----- __rxe_add_to_pool
[ 296.814583] <Interrupt>
[ 296.817209] lock(&xa->xa_lock#15); <---- rxe_pool_get_index
[ 296.820961]
*** DEADLOCK ***
Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by carrays")
Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
V3->V4: xa_lock_irq locks are used.
V2->V3: __rxe_add_to_pool is between spin_lock and spin_unlock, so
GFP_ATOMIC is used in __rxe_add_to_pool.
V1->V2: Replace GFP_KERNEL with GFP_ATOMIC
---
drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 87066d04ed18..f1f06dc7e64f 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -106,7 +106,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
atomic_set(&pool->num_elem, 0);
- xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
+ xa_init_flags(&pool->xa, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
pool->limit.min = info->min_index;
pool->limit.max = info->max_index;
}
@@ -138,8 +138,10 @@ void *rxe_alloc(struct rxe_pool *pool)
elem->obj = obj;
kref_init(&elem->ref_cnt);
- err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
- &pool->next, GFP_KERNEL);
+ xa_lock_irq(&pool->xa);
+ err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
+ &pool->next, GFP_KERNEL);
+ xa_unlock_irq(&pool->xa);
if (err)
goto err_free;
@@ -155,6 +157,7 @@ void *rxe_alloc(struct rxe_pool *pool)
int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
{
int err;
+ unsigned long flags;
if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
return -EINVAL;
@@ -166,8 +169,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
elem->obj = (u8 *)elem - pool->elem_offset;
kref_init(&elem->ref_cnt);
- err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
- &pool->next, GFP_KERNEL);
+ xa_lock_irqsave(&pool->xa, flags);
+ err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
+ &pool->next, GFP_ATOMIC);
+ xa_unlock_irqrestore(&pool->xa, flags);
if (err)
goto err_cnt;
@@ -200,8 +205,11 @@ static void rxe_elem_release(struct kref *kref)
{
struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
struct rxe_pool *pool = elem->pool;
+ unsigned long flags;
- xa_erase(&pool->xa, elem->index);
+ xa_lock_irqsave(&pool->xa, flags);
+ __xa_erase(&pool->xa, elem->index);
+ xa_unlock_irqrestore(&pool->xa, flags);
if (pool->cleanup)
pool->cleanup(elem);
--
2.27.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 2/2] RDMA/rxe: Use different xa locks on different path
2022-04-15 19:56 [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem yanjun.zhu
2022-04-15 5:37 ` Bob Pearson
@ 2022-04-15 19:56 ` yanjun.zhu
1 sibling, 0 replies; 17+ messages in thread
From: yanjun.zhu @ 2022-04-15 19:56 UTC (permalink / raw)
To: jgg, leon, linux-rdma, yanjun.zhu
From: Zhu Yanjun <yanjun.zhu@linux.dev>
The function __rxe_add_to_pool is called on different path, and the
requirement of the locks is different. The function rxe_create_ah
requires xa_lock_irqsave/irqrestore while others only require xa_lock_irq.
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
drivers/infiniband/sw/rxe/rxe_mw.c | 2 +-
drivers/infiniband/sw/rxe/rxe_pool.c | 20 ++++++++++++++------
drivers/infiniband/sw/rxe/rxe_pool.h | 4 ++--
drivers/infiniband/sw/rxe/rxe_verbs.c | 12 ++++++------
4 files changed, 23 insertions(+), 15 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_mw.c b/drivers/infiniband/sw/rxe/rxe_mw.c
index c86b2efd58f2..9d72dcc9060d 100644
--- a/drivers/infiniband/sw/rxe/rxe_mw.c
+++ b/drivers/infiniband/sw/rxe/rxe_mw.c
@@ -14,7 +14,7 @@ int rxe_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata)
rxe_get(pd);
- ret = rxe_add_to_pool(&rxe->mw_pool, mw);
+ ret = rxe_add_to_pool(&rxe->mw_pool, mw, GFP_KERNEL);
if (ret) {
rxe_put(pd);
return ret;
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index f1f06dc7e64f..e64c9433ab0e 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -154,10 +154,9 @@ void *rxe_alloc(struct rxe_pool *pool)
return NULL;
}
-int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
+int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem, gfp_t gfp)
{
int err;
- unsigned long flags;
if (WARN_ON(pool->flags & RXE_POOL_ALLOC))
return -EINVAL;
@@ -169,10 +168,19 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
elem->obj = (u8 *)elem - pool->elem_offset;
kref_init(&elem->ref_cnt);
- xa_lock_irqsave(&pool->xa, flags);
- err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
- &pool->next, GFP_ATOMIC);
- xa_unlock_irqrestore(&pool->xa, flags);
+ if (gfp & GFP_ATOMIC) { /* for rxe_create_ah */
+ unsigned long flags;
+
+ xa_lock_irqsave(&pool->xa, flags);
+ err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem,
+ pool->limit, &pool->next, GFP_ATOMIC);
+ xa_unlock_irqrestore(&pool->xa, flags);
+ } else if (gfp & GFP_KERNEL) {
+ xa_lock_irq(&pool->xa);
+ err = __xa_alloc_cyclic(&pool->xa, &elem->index, elem,
+ pool->limit, &pool->next, GFP_KERNEL);
+ xa_unlock_irq(&pool->xa);
+ }
if (err)
goto err_cnt;
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index 24bcc786c1b3..12986622088b 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -62,9 +62,9 @@ void rxe_pool_cleanup(struct rxe_pool *pool);
void *rxe_alloc(struct rxe_pool *pool);
/* connect already allocated object to pool */
-int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem);
+int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem, gfp_t gfp);
-#define rxe_add_to_pool(pool, obj) __rxe_add_to_pool(pool, &(obj)->elem)
+#define rxe_add_to_pool(pool, obj, gfp) __rxe_add_to_pool(pool, &(obj)->elem, gfp)
/* lookup an indexed object from index. takes a reference on object */
void *rxe_pool_get_index(struct rxe_pool *pool, u32 index);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 67184b0281a0..20b133f857f5 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -108,7 +108,7 @@ static int rxe_alloc_ucontext(struct ib_ucontext *ibuc, struct ib_udata *udata)
struct rxe_dev *rxe = to_rdev(ibuc->device);
struct rxe_ucontext *uc = to_ruc(ibuc);
- return rxe_add_to_pool(&rxe->uc_pool, uc);
+ return rxe_add_to_pool(&rxe->uc_pool, uc, GFP_KERNEL);
}
static void rxe_dealloc_ucontext(struct ib_ucontext *ibuc)
@@ -142,7 +142,7 @@ static int rxe_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
struct rxe_dev *rxe = to_rdev(ibpd->device);
struct rxe_pd *pd = to_rpd(ibpd);
- return rxe_add_to_pool(&rxe->pd_pool, pd);
+ return rxe_add_to_pool(&rxe->pd_pool, pd, GFP_KERNEL);
}
static int rxe_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
@@ -176,7 +176,7 @@ static int rxe_create_ah(struct ib_ah *ibah,
if (err)
return err;
- err = rxe_add_to_pool(&rxe->ah_pool, ah);
+ err = rxe_add_to_pool(&rxe->ah_pool, ah, GFP_ATOMIC);
if (err)
return err;
@@ -299,7 +299,7 @@ static int rxe_create_srq(struct ib_srq *ibsrq, struct ib_srq_init_attr *init,
if (err)
goto err1;
- err = rxe_add_to_pool(&rxe->srq_pool, srq);
+ err = rxe_add_to_pool(&rxe->srq_pool, srq, GFP_KERNEL);
if (err)
goto err1;
@@ -431,7 +431,7 @@ static int rxe_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init,
qp->is_user = false;
}
- err = rxe_add_to_pool(&rxe->qp_pool, qp);
+ err = rxe_add_to_pool(&rxe->qp_pool, qp, GFP_KERNEL);
if (err)
return err;
@@ -800,7 +800,7 @@ static int rxe_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
if (err)
return err;
- return rxe_add_to_pool(&rxe->cq_pool, cq);
+ return rxe_add_to_pool(&rxe->cq_pool, cq, GFP_KERNEL);
}
static int rxe_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
--
2.27.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem
2022-04-15 6:35 ` Bob Pearson
2022-04-15 6:49 ` Yanjun Zhu
@ 2022-04-20 16:41 ` Jason Gunthorpe
1 sibling, 0 replies; 17+ messages in thread
From: Jason Gunthorpe @ 2022-04-20 16:41 UTC (permalink / raw)
To: Bob Pearson; +Cc: Yanjun Zhu, leon, linux-rdma
On Fri, Apr 15, 2022 at 01:35:58AM -0500, Bob Pearson wrote:
> I mean the trace you show here shows an instance of xa_lock being
> acquired from the pd pool followed by an instance of xa_lock being
> acquired from rxe_pool_get_index from the ah pool. They are different
> locks. They can't deadlock against each other. So there must be
> some other trace (not shown) that also gets xa_lock from the ah pool.
This is because lockdep groups locks by allocation point into the same
'lock class' so as far as lockdep is concerned the AH and PD's are all
the same lock and you'll get reports like the above.
The same issue will show up with AH only, you just need to hit a path
that allocates an AH from a process context, like uverbs.
Jason
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2022-04-20 16:41 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-15 19:56 [PATCHv4 1/2] RDMA/rxe: Fix a dead lock problem yanjun.zhu
2022-04-15 5:37 ` Bob Pearson
2022-04-15 5:54 ` Yanjun Zhu
2022-04-15 6:35 ` Bob Pearson
2022-04-15 6:49 ` Yanjun Zhu
2022-04-15 7:22 ` Bob Pearson
2022-04-15 7:32 ` Yanjun Zhu
2022-04-15 7:35 ` Bob Pearson
2022-04-15 7:42 ` Yanjun Zhu
2022-04-15 7:49 ` Bob Pearson
2022-04-15 7:52 ` Yanjun Zhu
2022-04-15 7:44 ` Bob Pearson
2022-04-15 8:50 ` Yanjun Zhu
2022-04-20 16:41 ` Jason Gunthorpe
2022-04-15 10:25 ` Yanjun Zhu
2022-04-15 15:19 ` Yanjun Zhu
2022-04-15 19:56 ` [PATCH 2/2] RDMA/rxe: Use different xa locks on different path yanjun.zhu
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.