[PATCHv2 1/1] RDMA/core: avoid kernel NULL pointer error

* [PATCHv2 1/1] RDMA/core: avoid kernel NULL pointer error
@ 2019-11-25  4:24 Zhu Yanjun
  2019-11-25  4:29 ` zhuyj
  2019-11-27 20:07 ` Jason Gunthorpe
  0 siblings, 2 replies; 4+ messages in thread
From: Zhu Yanjun @ 2019-11-25  4:24 UTC
  To: dledford, jgg, michael.j.ruhl, ira.weiny, rostedt, leon,
	kamalheib1, zyjzyj2000, linux-rdma

When the interface associated with an IB device is repeatedly set
down and up, the following call trace appears.
"
 Call Trace:
  [<ffffffffa039ff8d>] ib_mad_completion_handler+0x7d/0xa0 [ib_mad]
  [<ffffffff810a1a41>] process_one_work+0x151/0x4b0
  [<ffffffff810a1ec0>] worker_thread+0x120/0x480
  [<ffffffff810a709e>] kthread+0xce/0xf0
  [<ffffffff816e9962>] ret_from_fork+0x42/0x70

 RIP  [<ffffffffa039f926>] ib_mad_recv_done_handler+0x26/0x610 [ib_mad]
"
The vmcore shows the following:
"
crash7lates> struct ib_mad_list_head ffff881fb3713400
struct ib_mad_list_head {
  list = {
    next = 0xffff881fb3713800,
    prev = 0xffff881fe01395c0
  },
  mad_queue = 0x0
}
"

Before the call trace appears, a large number of ib_cancel_mad packets
are sent to the sender. It is therefore necessary to check mad_queue in
struct ib_mad_list_head to avoid the "kernel NULL pointer" error.
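
For illustration only (not part of the patch), the guard can be sketched
in userspace as below. This is a minimal sketch under stated assumptions:
all *_stub types and functions, and self_check(), are hypothetical
stand-ins that do not exist in the kernel, and the real
ib_mad_recv_done() does considerably more work than incrementing a
counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct qp_info_stub {
	int handled;	/* number of completions processed */
};

struct mad_queue_stub {
	struct qp_info_stub *qp_info;
};

struct mad_list_head_stub {
	struct mad_queue_stub *mad_queue;	/* 0x0 in the vmcore dump */
};

/*
 * Same shape as the proposed check: return before mad_queue is
 * dereferenced. Returns true if the completion was processed.
 */
static bool recv_done_stub(struct mad_list_head_stub *mad_list)
{
	struct qp_info_stub *qp_info;

	if (!mad_list->mad_queue)
		return false;	/* nothing to dequeue, skip */

	qp_info = mad_list->mad_queue->qp_info;
	qp_info->handled++;
	return true;
}

/* Exercises both paths; returns 0 when they behave as described. */
static int self_check(void)
{
	struct qp_info_stub qp = { .handled = 0 };
	struct mad_queue_stub queue = { .qp_info = &qp };
	struct mad_list_head_stub ok = { .mad_queue = &queue };
	struct mad_list_head_stub stale = { .mad_queue = NULL };
	bool first = recv_done_stub(&ok);	/* normal completion */
	bool second = recv_done_stub(&stale);	/* NULL mad_queue */

	assert(first && !second);
	return (first && !second && qp.handled == 1) ? 0 : 1;
}
```

Without the early return, the second call would dereference a NULL
mad_queue, which is the same kind of access that faults in the call
trace above.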

According to a new customer report, the above call trace also appears
when something is wrong with the IB HW/FW, so faulty IB HW/FW seems
able to cause this problem as well.

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
V1->V2: Add new bug symptoms.
---
 drivers/infiniband/core/mad.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 9947d16..43f596c 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2279,6 +2279,17 @@ static void ib_mad_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 		return;
 	}
 
+	if (unlikely(!mad_list->mad_queue)) {
+		/*
+		 * When the interface associated with the IB device is
+		 * repeatedly set down/up, many ib_cancel_mad packets are sent
+		 * to the sender, and the sender cancels its MAD packets. The
+		 * receiver can then find mad_queue NULL; without this check
+		 * it would crash with a "kernel NULL pointer" error.
+		 */
+		return;
+	}
+
 	qp_info = mad_list->mad_queue->qp_info;
 	dequeue_mad(mad_list);
 
-- 
2.7.4

