Linux-RDMA Archive on lore.kernel.org
* [PATCHv2 1/1] RDMA/core: avoid kernel NULL pointer error
@ 2019-11-25  4:24 Zhu Yanjun
  2019-11-25  4:29 ` zhuyj
  2019-11-27 20:07 ` Jason Gunthorpe
  0 siblings, 2 replies; 4+ messages in thread
From: Zhu Yanjun @ 2019-11-25  4:24 UTC (permalink / raw)
  To: dledford, jgg, michael.j.ruhl, ira.weiny, rostedt, leon,
	kamalheib1, zyjzyj2000, linux-rdma

When the interface associated with an IB device is repeatedly set
down and up, the following call trace appears.
"
 Call Trace:
  [<ffffffffa039ff8d>] ib_mad_completion_handler+0x7d/0xa0 [ib_mad]
  [<ffffffff810a1a41>] process_one_work+0x151/0x4b0
  [<ffffffff810a1ec0>] worker_thread+0x120/0x480
  [<ffffffff810a709e>] kthread+0xce/0xf0
  [<ffffffff816e9962>] ret_from_fork+0x42/0x70

 RIP  [<ffffffffa039f926>] ib_mad_recv_done_handler+0x26/0x610 [ib_mad]
"
From vmcore, we can find the following:
"
crash7lates> struct ib_mad_list_head ffff881fb3713400
struct ib_mad_list_head {
  list = {
    next = 0xffff881fb3713800,
    prev = 0xffff881fe01395c0
  },
  mad_queue = 0x0
}
"

Before the call trace, many ib_cancel_mad requests are sent to the sender.
So it is necessary to check mad_queue in struct ib_mad_list_head to avoid
the "kernel NULL pointer" error.

According to a new customer report, the above call trace also appears when
something is wrong with the IB HW/FW. It seems that bad IB HW/FW can cause
this problem.

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
V1->V2: Add new bug symptoms.
---
 drivers/infiniband/core/mad.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 9947d16..43f596c 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2279,6 +2279,17 @@ static void ib_mad_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 		return;
 	}
 
+	if (unlikely(!mad_list->mad_queue)) {
+		/*
+		 * When the interface related with IB device is set to down/up,
+		 * a lot of ib_cancel_mad packets are sent to the sender. In
+		 * sender, the mad packets are cancelled.  The receiver will
+		 * find mad_queue NULL. If the receiver does not test mad_queue,
+		 * the receiver will crash with "kernel NULL pointer" error.
+		 */
+		return;
+	}
+
 	qp_info = mad_list->mad_queue->qp_info;
 	dequeue_mad(mad_list);
 
-- 
2.7.4



* Re: [PATCHv2 1/1] RDMA/core: avoid kernel NULL pointer error
  2019-11-25  4:24 [PATCHv2 1/1] RDMA/core: avoid kernel NULL pointer error Zhu Yanjun
@ 2019-11-25  4:29 ` zhuyj
  2019-11-27 20:07 ` Jason Gunthorpe
  1 sibling, 0 replies; 4+ messages in thread
From: zhuyj @ 2019-11-25  4:29 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: dledford, jgg, michael.j.ruhl, ira.weiny, rostedt, leon,
	kamalheib1, linux-rdma

This problem is probably caused by the IB HW/FW. When the IB device is set
down/up several times, or the IB HW/FW is bad, a similar problem appears.
In the future, a developer who runs into a similar problem can give this
patch a try.

Zhu Yanjun



* Re: [PATCHv2 1/1] RDMA/core: avoid kernel NULL pointer error
  2019-11-25  4:24 [PATCHv2 1/1] RDMA/core: avoid kernel NULL pointer error Zhu Yanjun
  2019-11-25  4:29 ` zhuyj
@ 2019-11-27 20:07 ` Jason Gunthorpe
  2019-11-28  5:06   ` Zhu Yanjun
  1 sibling, 1 reply; 4+ messages in thread
From: Jason Gunthorpe @ 2019-11-27 20:07 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: dledford, michael.j.ruhl, ira.weiny, rostedt, leon, kamalheib1,
	zyjzyj2000, linux-rdma

On Sun, Nov 24, 2019 at 11:24:35PM -0500, Zhu Yanjun wrote:
> When the interface related with IB device is set to down/up over and
> over again, the following call trace will pop out.
> "
>  Call Trace:
>   [<ffffffffa039ff8d>] ib_mad_completion_handler+0x7d/0xa0 [ib_mad]
>   [<ffffffff810a1a41>] process_one_work+0x151/0x4b0
>   [<ffffffff810a1ec0>] worker_thread+0x120/0x480
>   [<ffffffff810a709e>] kthread+0xce/0xf0
>   [<ffffffff816e9962>] ret_from_fork+0x42/0x70
> 
>  RIP  [<ffffffffa039f926>] ib_mad_recv_done_handler+0x26/0x610 [ib_mad]
> "
> From vmcore, we can find the following:
> "
> crash7lates> struct ib_mad_list_head ffff881fb3713400
> struct ib_mad_list_head {
>   list = {
>     next = 0xffff881fb3713800,
>     prev = 0xffff881fe01395c0
>   },
>   mad_queue = 0x0
> }
> "
> 
> Before the call trace, a lot of ib_cancel_mad is sent to the sender.
> So it is necessary to check mad_queue in struct ib_mad_list_head to avoid
> "kernel NULL pointer" error.
> 
> From the new customer report, when there is something wrong with IB HW/FW,
> the above call trace will appear. It seems that bad IB HW/FW will cause
> this problem.
> 
> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
> V1->V2: Add new bug symptoms.
>  drivers/infiniband/core/mad.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> index 9947d16..43f596c 100644
> +++ b/drivers/infiniband/core/mad.c
> @@ -2279,6 +2279,17 @@ static void ib_mad_recv_done(struct ib_cq *cq, struct ib_wc *wc)
>  		return;
>  	}
>  
> +	if (unlikely(!mad_list->mad_queue)) {
> +		/*
> +		 * When the interface related with IB device is set to down/up,
> +		 * a lot of ib_cancel_mad packets are sent to the sender. In
> +		 * sender, the mad packets are cancelled.  The receiver will
> +		 * find mad_queue NULL. If the receiver does not test mad_queue,
> +		 * the receiver will crash with "kernel NULL pointer" error.
> +		 */
> +		return;
> +	}

I feel like this patch was sent already? 

It is not possible for mad_queue to be NULL here without another bug,
so this can't be the right fix.

This is because:

		mad_priv->header.mad_list.mad_queue = recv_queue;
		mad_priv->header.mad_list.cqe.done = ib_mad_recv_done;
		recv_wr.wr_cqe = &mad_priv->header.mad_list.cqe;

And then we do

	struct ib_mad_list_head *mad_list =
		container_of(wc->wr_cqe, struct ib_mad_list_head, cqe);

So there is no point where the mad_list could be legitimately NULL'd
before getting here. Something else must be happening; you must figure
out and describe how the NULL is happening.

Jason
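
The argument above can be sketched in userspace C. This is only an
illustration under stated assumptions, not the kernel code: the struct
layouts are simplified, and the names post_recv(), recv_done() and
roundtrip() are hypothetical. It shows that the completion handler recovers,
via container_of(), the very object whose mad_queue was set at post time,
so a NULL there has to come from somewhere outside this path.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct wc;

/* Simplified, hypothetical mirrors of the structures in the thread. */
struct cqe {
	void (*done)(struct wc *wc);
};

struct wc {
	struct cqe *wr_cqe;
};

struct mad_queue {
	int count;
};

struct mad_list_head {
	struct mad_list_head *next, *prev;
	struct mad_queue *mad_queue;
	struct cqe cqe;
};

/* Records what the completion handler observed. */
static struct mad_queue *seen_queue;

/* Completion side: walk back from the embedded cqe to its container. */
static void recv_done(struct wc *wc)
{
	struct mad_list_head *mad_list =
		container_of(wc->wr_cqe, struct mad_list_head, cqe);

	seen_queue = mad_list->mad_queue;
}

/* Post side: mad_queue and the cqe are set up on the same object. */
static void post_recv(struct mad_list_head *mad_list,
		      struct mad_queue *queue, struct wc *wc)
{
	mad_list->mad_queue = queue;
	mad_list->cqe.done = recv_done;
	wc->wr_cqe = &mad_list->cqe;
}

/* Post one receive, complete it, and return the queue the handler saw. */
static struct mad_queue *roundtrip(struct mad_queue *queue)
{
	struct mad_list_head mad_list = { 0 };
	struct wc wc = { 0 };

	seen_queue = NULL;
	post_recv(&mad_list, queue, &wc);
	wc.wr_cqe->done(&wc);
	return seen_queue;
}
```

In this sketch, roundtrip(&q) returns &q for any queue q, because nothing
between post and completion touches mad_queue. A NULL observed in the real
handler would therefore point at memory corruption, a stale or duplicated
completion, or some other bug outside this path.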


* Re: [PATCHv2 1/1] RDMA/core: avoid kernel NULL pointer error
  2019-11-27 20:07 ` Jason Gunthorpe
@ 2019-11-28  5:06   ` Zhu Yanjun
  0 siblings, 0 replies; 4+ messages in thread
From: Zhu Yanjun @ 2019-11-28  5:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Zhu Yanjun, dledford, michael.j.ruhl, ira.weiny, rostedt, leon,
	kamalheib1, linux-rdma

On Thu, Nov 28, 2019 at 4:07 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Sun, Nov 24, 2019 at 11:24:35PM -0500, Zhu Yanjun wrote:
> > When the interface related with IB device is set to down/up over and
> > over again, the following call trace will pop out.
> > "
> >  Call Trace:
> >   [<ffffffffa039ff8d>] ib_mad_completion_handler+0x7d/0xa0 [ib_mad]
> >   [<ffffffff810a1a41>] process_one_work+0x151/0x4b0
> >   [<ffffffff810a1ec0>] worker_thread+0x120/0x480
> >   [<ffffffff810a709e>] kthread+0xce/0xf0
> >   [<ffffffff816e9962>] ret_from_fork+0x42/0x70
> >
> >  RIP  [<ffffffffa039f926>] ib_mad_recv_done_handler+0x26/0x610 [ib_mad]
> > "
> > From vmcore, we can find the following:
> > "
> > crash7lates> struct ib_mad_list_head ffff881fb3713400
> > struct ib_mad_list_head {
> >   list = {
> >     next = 0xffff881fb3713800,
> >     prev = 0xffff881fe01395c0
> >   },
> >   mad_queue = 0x0
> > }
> > "
> >
> > Before the call trace, a lot of ib_cancel_mad is sent to the sender.
> > So it is necessary to check mad_queue in struct ib_mad_list_head to avoid
> > "kernel NULL pointer" error.
> >
> > From the new customer report, when there is something wrong with IB HW/FW,
> > the above call trace will appear. It seems that bad IB HW/FW will cause
> > this problem.
> >
> > Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
> > V1->V2: Add new bug symptoms.
> >  drivers/infiniband/core/mad.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> > index 9947d16..43f596c 100644
> > +++ b/drivers/infiniband/core/mad.c
> > @@ -2279,6 +2279,17 @@ static void ib_mad_recv_done(struct ib_cq *cq, struct ib_wc *wc)
> >               return;
> >       }
> >
> > +     if (unlikely(!mad_list->mad_queue)) {
> > +             /*
> > +              * When the interface related with IB device is set to down/up,
> > +              * a lot of ib_cancel_mad packets are sent to the sender. In
> > +              * sender, the mad packets are cancelled.  The receiver will
> > +              * find mad_queue NULL. If the receiver does not test mad_queue,
> > +              * the receiver will crash with "kernel NULL pointer" error.
> > +              */
> > +             return;
> > +     }
>
> I feel like this patch was sent already?
>
> It is not possible for mad_queue to be NULL here without another bug,
> so this can't be the right fix.
>
> This is because:
>
>                 mad_priv->header.mad_list.mad_queue = recv_queue;
>                 mad_priv->header.mad_list.cqe.done = ib_mad_recv_done;
>                 recv_wr.wr_cqe = &mad_priv->header.mad_list.cqe;
>
> And then we do
>
>         struct ib_mad_list_head *mad_list =
>                 container_of(wc->wr_cqe, struct ib_mad_list_head, cqe);
>
> So there is no point where the mad_list could be legimiately NULL'd
> before getting here, something else must be happening, you must figure
> out and describe how the NULL is happening.

Yes. From the kernel source code, this bug should not occur. But judging
from the symptoms, it is possible that the bug is caused by the HW/FW.
According to the commit log, the bug occurs in 2 scenarios: one is
setting the IB interface down/up over and over again, the other is a bad
IB device.
After the bad IB device was replaced, the bug did not appear again.

The reason I sent this patch again is to prompt developers to suspect
the HW/FW when this bug occurs again. I cannot check the HW/FW since I
have no access to these IB devices or the FW source code.

This is just a reminder; I do not expect this patch to be merged into mainline.

Zhu Yanjun

>
> Jason
