All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bob Pearson <rpearsonhpe@gmail.com>
To: Bart Van Assche <bvanassche@acm.org>,
	"Pearson, Robert B" <robert.pearson2@hpe.com>,
	"jgg@nvidia.com" <jgg@nvidia.com>,
	"zyjzyj2000@gmail.com" <zyjzyj2000@gmail.com>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"mie@igel.co.jp" <mie@igel.co.jp>,
	Xiao Yang <yangx.jy@fujitsu.com>
Subject: Re: [PATCH for-rc v3 0/6] RDMA/rxe: Various bug fixes.
Date: Sun, 12 Sep 2021 09:41:02 -0500	[thread overview]
Message-ID: <557a5fd9-2a30-5752-d09b-05183ab3c43b@gmail.com> (raw)
In-Reply-To: <918787c7-de06-ef67-80ac-ae2e7643dd61@acm.org>

On 9/10/21 5:07 PM, Bart Van Assche wrote:
> On 9/10/21 2:47 PM, Bob Pearson wrote:
>> OK I checked out the kernel with the SHA number above and applied the patch series
>> and rebuilt and reinstalled the kernel. I checked out v36.0 of rdma-core and rebuilt
>> that. rdma is version 5.9.0 but I doubt that will have any effect. My startup script
>> is
>>
>>      export LD_LIBRARY_PATH=/home/bob/src/rdma-core/build/lib/:/usr/local/lib:/usr/lib
>>
>>
>>
>>      sudo ip link set dev enp0s3 mtu 8500
>>
>>      sudo ip addr add dev enp0s3 fe80::0a00:27ff:fe94:8a69/64
>>
>>      sudo rdma link add rxe0 type rxe netdev enp0s3
>>
>>
>> I am running on a Virtualbox VM instance of Ubuntu 21.04 with 20 cores and 8GB of RAM.
>>
>> The test looks like
>>
>>      sudo ./check -q srp/001
>>
>>      srp/001 (Create and remove LUNs)                             [passed]
>>
>>          runtime  1.174s  ...  1.236s
>>
>> There were no issues.
>>
>> Any guesses what else to look at?
> 
> The test I ran is different. I did not run any of the ip link / ip addr /
> rdma link commands since the blktests scripts already run the rdma link
> command. The bug I reported in my previous email is reproducible and
> triggers a VM halt.
> 
> Are we using the same kernel config? I attached my kernel config to my
> previous email. The source code location of the crash address is as
> follows:
> 
> (gdb) list *(rxe_completer+0x96d)
> 0x228d is in rxe_completer (drivers/infiniband/sw/rxe/rxe_comp.c:149).
> 144              */
> 145             wqe = queue_head(qp->sq.queue, QUEUE_TYPE_FROM_CLIENT);
> 146             *wqe_p = wqe;
> 147
> 148             /* no WQE or requester has not started it yet */
> 149             if (!wqe || wqe->state == wqe_state_posted)
> 150                     return pkt ? COMPST_DONE : COMPST_EXIT;
> 151
> 152             /* WQE does not require an ack */
> 153             if (wqe->state == wqe_state_done)
> 
> The disassembly output is as follows:
> 
> drivers/infiniband/sw/rxe/rxe_comp.c:
> 149             if (!wqe || wqe->state == wqe_state_posted)
>    0x0000000000002277 <+2391>:  test   %r12,%r12
>    0x000000000000227a <+2394>:  je     0x2379 <rxe_completer+2649>
>    0x0000000000002280 <+2400>:  lea    0x94(%r12),%rdi
>    0x0000000000002288 <+2408>:  call   0x228d <rxe_completer+2413>
>    0x000000000000228d <+2413>:  mov    0x94(%r12),%eax
>    0x0000000000002295 <+2421>:  test   %eax,%eax
>    0x0000000000002297 <+2423>:  je     0x237c <rxe_completer+2652>
> 
> So the instruction that triggers the crash is "mov 0x94(%r12),%eax".
> Does consumer_addr() perhaps return an invalid address under certain
> circumstances?
> 
> Thanks,
> 
> Bart.

The most likely cause of this was fixed by a patch submitted 8/20/2021 by Xiao Yang. It is copied here

From: Xiao Yang <yangx.jy@fujitsu.com>
To: <linux-rdma@vger.kernel.org>
Cc: <aglo@umich.edu>, <rpearsonhpe@gmail.com>, <zyjzyj2000@gmail.com>,
	<jgg@nvidia.com>, <leon@kernel.org>,
	Xiao Yang <yangx.jy@fujitsu.com>
Subject: [PATCH] RDMA/rxe: Zero out index member of struct rxe_queue
Date: Fri, 20 Aug 2021 19:15:09 +0800	[thread overview]
Message-ID: <20210820111509.172500-1-yangx.jy@fujitsu.com> (raw)

1) New index member of struct rxe_queue is introduced but not zeroed
   so the initial value of index may be random.
2) Current index is not masked off to index_mask.
In such case, producer_addr() and consumer_addr() will get an invalid
address by the random index and then accessing the invalid address
triggers the following panic:
"BUG: unable to handle page fault for address: ffff9ae2c07a1414"

Fix the issue by using kzalloc() to zero out index member.

Fixes: 5bcf5a59c41e ("RDMA/rxe: Protext kernel index from user space")
Signed-off-by: Xiao Yang <yangx.jy@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_queue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_queue.c b/drivers/infiniband/sw/rxe/rxe_queue.c
index 85b812586ed4..72d95398e604 100644
--- a/drivers/infiniband/sw/rxe/rxe_queue.c
+++ b/drivers/infiniband/sw/rxe/rxe_queue.c
@@ -63,7 +63,7 @@ struct rxe_queue *rxe_queue_init(struct rxe_dev *rxe, int *num_elem,
 	if (*num_elem < 0)
 		goto err1;
 
-	q = kmalloc(sizeof(*q), GFP_KERNEL);
+	q = kzalloc(sizeof(*q), GFP_KERNEL);
 	if (!q)
 		goto err1;
 
-- 
2.25.1

If kmalloc returns a dirty block of memory you could get random values in the q index which could
easily give a page fault. Once the rxe driver writes a new value it will be masked before storing
and should always be in the allocated buffer. I am not seeing this error perhaps because I am
running in a VM. I just don't know. It should be added to the other fixes.

Bob

  reply	other threads:[~2021-09-12 14:41 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-09 20:44 [PATCH for-rc v3 0/6] RDMA/rxe: Various bug fixes Bob Pearson
2021-09-09 20:44 ` [PATCH for-rc v3 1/6] RDMA/rxe: Add memory barriers to kernel queues Bob Pearson
2021-09-10  1:19   ` Zhu Yanjun
2021-09-10  4:01     ` Bob Pearson
2021-09-14  6:04   ` 回复: " yangx.jy
2021-09-14 15:47     ` Bob Pearson
2021-09-09 20:44 ` [PATCH for-rc v3 2/6] RDMA/rxe: Fix memory allocation while locked Bob Pearson
2021-09-09 20:44 ` [PATCH for-rc v3 3/6] RDMA/rxe: Cleanup MR status and type enums Bob Pearson
2021-09-09 20:44 ` [PATCH for-rc v3 4/6] RDMA/rxe: Separate HW and SW l/rkeys Bob Pearson
2021-09-09 20:44 ` [PATCH for-rc v3 5/6] RDMA/rxe: Create duplicate mapping tables for FMRs Bob Pearson
2021-09-09 20:44 ` [PATCH for-rc v3 6/6] RDMA/rxe: Only allow invalidate for appropriate MRs Bob Pearson
2021-09-09 21:52 ` [PATCH for-rc v3 0/6] RDMA/rxe: Various bug fixes Bart Van Assche
2021-09-10 19:38   ` Pearson, Robert B
2021-09-10 20:23     ` Bart Van Assche
2021-09-10 21:16       ` Bob Pearson
2021-09-10 21:47       ` Bob Pearson
2021-09-10 21:50         ` Bob Pearson
2021-09-10 22:07         ` Bart Van Assche
2021-09-12 14:41           ` Bob Pearson [this message]
2021-09-14  3:26             ` Bart Van Assche
2021-09-14  4:18               ` Bob Pearson
2021-09-12 14:42           ` Bob Pearson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=557a5fd9-2a30-5752-d09b-05183ab3c43b@gmail.com \
    --to=rpearsonhpe@gmail.com \
    --cc=bvanassche@acm.org \
    --cc=jgg@nvidia.com \
    --cc=linux-rdma@vger.kernel.org \
    --cc=mie@igel.co.jp \
    --cc=robert.pearson2@hpe.com \
    --cc=yangx.jy@fujitsu.com \
    --cc=zyjzyj2000@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.