bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer

linux-arm-msm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer
@ 2021-08-06  4:20 Paul Davey
  2021-08-06  9:43 ` Loic Poulain
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Davey @ 2021-08-06  4:20 UTC (permalink / raw)
  To: linux-arm-msm

Hi linux-arm-msm list,

We have been using the mhi driver with a Sierra EM9191 5G modem module
and have seen an occasional issue where the kernel would crash with
messages showing "BUG: Bad page state" which we debugged further and
found it was due to mhi_net_ul_callback freeing the same skb multiple
times, further debugging tracked this down to a case where
parse_xfer_event computed a dev_rp from the passed event's ev_tre
which does not lie within the region of valid "in flight" transfers
for the tre_ring.  See the patch below for how this was detected.

I believe that receiving such an event results in the loop which runs
completion events for the transfers to re-run some completion
callbacks as it walks all the way around the ring again to reach the
invalid dev_rp position.  

What could cause parse_xfer_event to receive a transfer event with a
tre pointer which would be outside the range of in-flight transfers?
For example receiving events where the tre pointers do not only
increase or receive a second event of types MHI_EV_CC_OVERFLOW,
MHI_EV_CC_EOB, or MHI_EV_CC_EOT for a previous tre.

The existing mhi driver code appears to assume that transfer events
are received strictly in order such that you can never receive a
transfer completion event for a transfer descriptor outside the
current set of "in flight" transfers in the tre ring (those between
the read pointer and write pointer).

Thanks,
Paul Davey

----8<----

From bf3e3821a90b89758a30cefed662d32a78dcb766 Mon Sep 17 00:00:00 2001
From: Paul Davey <paul.davey@alliedtelesis.co.nz>
Date: Fri, 6 Aug 2021 15:36:05 +1200
Subject: [PATCH] bus: mhi: Detect invalid xfer event dev_rp

Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
---
 drivers/bus/mhi/core/main.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
index c67fd001ded1..238689a5dfc7 100644
--- a/drivers/bus/mhi/core/main.c
+++ b/drivers/bus/mhi/core/main.c
@@ -596,6 +596,7 @@ static int parse_xfer_event(struct mhi_controller *mhi_cntrl,
 		void *dev_rp;
 		struct mhi_buf_info *buf_info;
 		u16 xfer_len;
+		bool rp_valid;

 		if (!is_valid_ring_ptr(tre_ring, ptr)) {
 			dev_err(&mhi_cntrl->mhi_dev->dev,
@@ -609,6 +610,15 @@ static int parse_xfer_event(struct mhi_controller *mhi_cntrl,
 		if (dev_rp >= (tre_ring->base + tre_ring->len))
 			dev_rp = tre_ring->base;

+		if (tre_ring->rp < tre_ring->wp)
+			rp_valid = (dev_rp <= tre_ring->wp && dev_rp > tre_ring->rp);
+		else
+			rp_valid = !(dev_rp <= tre_ring->rp && dev_rp > tre_ring->wp);
+
+		WARN_ON(!rp_valid);
+		if (!rp_valid)
+			goto end_process_tx_event;
+
 		result.dir = mhi_chan->dir;

 		local_rp = tre_ring->rp;
-- 
2.32.0

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer
  2021-08-06  4:20 bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer Paul Davey
@ 2021-08-06  9:43 ` Loic Poulain
  2021-08-13 22:55   ` Hemant Kumar
  0 siblings, 1 reply; 11+ messages in thread
From: Loic Poulain @ 2021-08-06  9:43 UTC (permalink / raw)
  To: Manivannan Sadhasivam, Hemant Kumar, Bhaumik Bhatt
  Cc: linux-arm-msm, Paul Davey

+ MHI people

On Fri, 6 Aug 2021 at 06:20, Paul Davey <Paul.Davey@alliedtelesis.co.nz> wrote:
>
> Hi linux-arm-msm list,
>
> We have been using the mhi driver with a Sierra EM9191 5G modem module
> and have seen an occasional issue where the kernel would crash with
> messages showing "BUG: Bad page state" which we debugged further and
> found it was due to mhi_net_ul_callback freeing the same skb multiple
> times, further debugging tracked this down to a case where
> parse_xfer_event computed a dev_rp from the passed event's ev_tre
> which does not lie within the region of valid "in flight" transfers
> for the tre_ring.  See the patch below for how this was detected.
>
> I believe that receiving such an event results in the loop which runs
> completion events for the transfers to re-run some completion
> callbacks as it walks all the way around the ring again to reach the
> invalid dev_rp position.
>
> What could cause parse_xfer_event to receive a transfer event with a
> tre pointer which would be outside the range of in-flight transfers?
> For example receiving events where the tre pointers do not only
> increase or receive a second event of types MHI_EV_CC_OVERFLOW,
> MHI_EV_CC_EOB, or MHI_EV_CC_EOT for a previous tre.
>
> The existing mhi driver code appears to assume that transfer events
> are received strictly in order such that you can never receive a
> transfer completion event for a transfer descriptor outside the
> current set of "in flight" transfers in the tre ring (those between
> the read pointer and write pointer).
>
> Thanks,
> Paul Davey
>
> ----8<----
>
>
> From bf3e3821a90b89758a30cefed662d32a78dcb766 Mon Sep 17 00:00:00 2001
> From: Paul Davey <paul.davey@alliedtelesis.co.nz>
> Date: Fri, 6 Aug 2021 15:36:05 +1200
> Subject: [PATCH] bus: mhi: Detect invalid xfer event dev_rp
>
> Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
> ---
>  drivers/bus/mhi/core/main.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
> index c67fd001ded1..238689a5dfc7 100644
> --- a/drivers/bus/mhi/core/main.c
> +++ b/drivers/bus/mhi/core/main.c
> @@ -596,6 +596,7 @@ static int parse_xfer_event(struct mhi_controller *mhi_cntrl,
>                 void *dev_rp;
>                 struct mhi_buf_info *buf_info;
>                 u16 xfer_len;
> +               bool rp_valid;
>
>                 if (!is_valid_ring_ptr(tre_ring, ptr)) {
>                         dev_err(&mhi_cntrl->mhi_dev->dev,
> @@ -609,6 +610,15 @@ static int parse_xfer_event(struct mhi_controller *mhi_cntrl,
>                 if (dev_rp >= (tre_ring->base + tre_ring->len))
>                         dev_rp = tre_ring->base;
>
> +               if (tre_ring->rp < tre_ring->wp)
> +                       rp_valid = (dev_rp <= tre_ring->wp && dev_rp > tre_ring->rp);
> +               else
> +                       rp_valid = !(dev_rp <= tre_ring->rp && dev_rp > tre_ring->wp);
> +
> +               WARN_ON(!rp_valid);
> +               if (!rp_valid)
> +                       goto end_process_tx_event;
> +
>                 result.dir = mhi_chan->dir;
>
>                 local_rp = tre_ring->rp;
> --
> 2.32.0
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer
  2021-08-06  9:43 ` Loic Poulain
@ 2021-08-13 22:55   ` Hemant Kumar
  2021-08-13 23:10     ` Hemant Kumar
  2021-08-15 23:30     ` Paul Davey
  0 siblings, 2 replies; 11+ messages in thread
From: Hemant Kumar @ 2021-08-13 22:55 UTC (permalink / raw)
  To: Loic Poulain, Manivannan Sadhasivam, Bhaumik Bhatt
  Cc: linux-arm-msm, Paul Davey

Hi Paul,

On 8/6/2021 2:43 AM, Loic Poulain wrote:
> + MHI people
> 
> On Fri, 6 Aug 2021 at 06:20, Paul Davey <Paul.Davey@alliedtelesis.co.nz> wrote:
>>
>> Hi linux-arm-msm list,
>>
>> We have been using the mhi driver with a Sierra EM9191 5G modem module
>> and have seen an occasional issue where the kernel would crash with
>> messages showing "BUG: Bad page state" which we debugged further and
>> found it was due to mhi_net_ul_callback freeing the same skb multiple
>> times, further debugging tracked this down to a case where
>> parse_xfer_event computed a dev_rp from the passed event's ev_tre
>> which does not lie within the region of valid "in flight" transfers
>> for the tre_ring.  See the patch below for how this was detected.
>>
>> I believe that receiving such an event results in the loop which runs
>> completion events for the transfers to re-run some completion
>> callbacks as it walks all the way around the ring again to reach the
>> invalid dev_rp position.
Do you have a log which prints the TRE being processed? Basically i am 
trying understand this : by the time you get double free issue, is there 
any pattern with respect to the TRE that is being processed. For example
when host processed the given TRE for the first time with RP1, stale TRE 
was posted by Event RP2 right after RP1

->RP1 [TRE1]
->RP2 [TRE1]

or occurrence of stale TRE event is random?

>>
>> What could cause parse_xfer_event to receive a transfer event with a
>> tre pointer which would be outside the range of in-flight transfers?
>> For example receiving events where the tre pointers do not only
>> increase or receive a second event of types MHI_EV_CC_OVERFLOW,
>> MHI_EV_CC_EOB, or MHI_EV_CC_EOT for a previous tre.
In theory this is not suppose to happen. once a xfer completion event is 
posted on event ring TRE belongs to Host MHI, Device is not suppose to
work on this TRE any more.
>>
>> The existing mhi driver code appears to assume that transfer events
>> are received strictly in order such that you can never receive a
>> transfer completion event for a transfer descriptor outside the
>> current set of "in flight" transfers in the tre ring (those between
>> the read pointer and write pointer).
This assumption is as per MHI spec.

I am checking internally if there is any know issue on device side. This 
model seems to be Qualcomm® Snapdragon™ X55 ?
[..]

Thanks,
Hemant
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum, a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer
  2021-08-13 22:55   ` Hemant Kumar
@ 2021-08-13 23:10     ` Hemant Kumar
  2021-08-16 13:48       ` Jeffrey Hugo
  2021-08-15 23:30     ` Paul Davey
  1 sibling, 1 reply; 11+ messages in thread
From: Hemant Kumar @ 2021-08-13 23:10 UTC (permalink / raw)
  To: Loic Poulain, Manivannan Sadhasivam, Bhaumik Bhatt
  Cc: linux-arm-msm, Paul Davey

One more thing to add

On 8/13/2021 3:55 PM, Hemant Kumar wrote:
> Hi Paul,
> 
> On 8/6/2021 2:43 AM, Loic Poulain wrote:
>> + MHI people
>>
>> On Fri, 6 Aug 2021 at 06:20, Paul Davey 
>> <Paul.Davey@alliedtelesis.co.nz> wrote:
>>>
>>> Hi linux-arm-msm list,
>>>
>>> We have been using the mhi driver with a Sierra EM9191 5G modem module
>>> and have seen an occasional issue where the kernel would crash with
>>> messages showing "BUG: Bad page state" which we debugged further and
>>> found it was due to mhi_net_ul_callback freeing the same skb multiple
>>> times, further debugging tracked this down to a case where
>>> parse_xfer_event computed a dev_rp from the passed event's ev_tre
>>> which does not lie within the region of valid "in flight" transfers
>>> for the tre_ring.  See the patch below for how this was detected.
>>>
>>> I believe that receiving such an event results in the loop which runs
>>> completion events for the transfers to re-run some completion
>>> callbacks as it walks all the way around the ring again to reach the
>>> invalid dev_rp position.
> Do you have a log which prints the TRE being processed? Basically i am 
> trying understand this : by the time you get double free issue, is there 
> any pattern with respect to the TRE that is being processed. For example
> when host processed the given TRE for the first time with RP1, stale TRE 
> was posted by Event RP2 right after RP1
> 
> ->RP1 [TRE1]
> ->RP2 [TRE1]
> 
> or occurrence of stale TRE event is random?
If you can log all the events you are processing, so that we can check 
when second event arrives for already processed TRE, is the transfer 
length same as originally processed TRE or it is different. In case it 
is different length, is the length matching to the TRE which was queue 
but not processed yet. You can print the mhi_queue_skb TRE content while 
queuing skb. How easy to reproduce this issue ? Is this showing up in 
high throughput use case or it is random? any specific step to reproduce 
this issue?

Thanks,
Hemant
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum, a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer
  2021-08-13 22:55   ` Hemant Kumar
  2021-08-13 23:10     ` Hemant Kumar
@ 2021-08-15 23:30     ` Paul Davey
  2021-08-23  6:47       ` Paul Davey
  1 sibling, 1 reply; 11+ messages in thread
From: Paul Davey @ 2021-08-15 23:30 UTC (permalink / raw)
  To: bbhatt, loic.poulain, hemantk, manivannan.sadhasivam; +Cc: linux-arm-msm

Hi Hemant,

On Fri, 2021-08-13 at 15:55 -0700, Hemant Kumar wrote:
> Hi Paul,
> 
> On 8/6/2021 2:43 AM, Loic Poulain wrote:
> > + MHI people
> > 
> > On Fri, 6 Aug 2021 at 06:20, Paul Davey <
> > Paul.Davey@alliedtelesis.co.nz> wrote:
> > > 
> > > Hi linux-arm-msm list,
> > > 
> > > [..]
> 
> Do you have a log which prints the TRE being processed? Basically i
> am 
> trying understand this : by the time you get double free issue, is
> there 
> any pattern with respect to the TRE that is being processed. For
> example
> when host processed the given TRE for the first time with RP1, stale
> TRE 
> was posted by Event RP2 right after RP1
> 
> ->RP1 [TRE1]
> ->RP2 [TRE1]
> 
> or occurrence of stale TRE event is random?

I have not logged all the TRE events yet, the incidence of processing
an event where the dev_rp inferred from the event (ev_tre + 1) is not
within the software tre_ring->rp and tre_ring->wp seems to be random or
at least inconsistent, but I will need to collect more debug to tell
what the sequence of events looks like.

I suspect the double free mostly stems from the fact that if the
computed dev_rp in this function is not between tre_ring->rp and
tre_ring->wp then the only way for the loop to reach the termination
case is to run through the whole ring.

> If you can log all the events you are processing, so that we can
> check 
> when second event arrives for already processed TRE, is the transfer 
> length same as originally processed TRE or it is different. In case
> it 
> is different length, is the length matching to the TRE which was
> queue 
> but not processed yet. You can print the mhi_queue_skb TRE content
> while 
> queuing skb. How easy to reproduce this issue ? Is this showing up
> in 
> high throughput use case or it is random? any specific step to
> reproduce 
> this issue?

I can try to collect a history of the TREs that can be logged when the
event occurs.  

The issue seems somewhat resistant to reproduction I am unsure of all
the factors required for reproduction.  This is during high throughput
testing, we are using the Sierra module's dataloopback mode to test.

The test being used is setting the module into dataloopback mode and
then sending a crafted UDP stream into an ethernet interface on the
device where the destination IP matches the IP address on the mhi
network interface and the source address has a static ARP on that input
interface so the returning traffic will be output again.

A colleague had done some experimentation to see how to make the issue
more likely and it seemed that the combination of the following did so:

 * Setting the IP_HW0_MBIM channel ring lengths to 3000 instead of 128
   while leaving the associated event rings at length 2048.
 * Setting the MHI_MBIM_DEFAULT_MRU to 7500 rather than 3500.

Also, while I thought using the check given in the original email to
avoid processing xfer events with a dev_rp outside the "in-flight"
region would avoid the issue, we have since seen an issue despite this.

In addition to the above I was zeroing out most fields of the buf_info
struct for the TRE before calling the transfer callback and checking if
the cb_buf addr was non NULL before actually calling the callback and
logging if it was ever NULL.  We have seen this even with the default
ring sizes and an MRU of 32768.  Though it is always the upload side
ring that seems to experience it.

> > > [..]
> 
> In theory this is not suppose to happen. once a xfer completion event
> is 
> posted on event ring TRE belongs to Host MHI, Device is not suppose
> to
> work on this TRE any more.

Is there any way it could post these events "out of order"?

> > > [..]
> 
> This assumption is as per MHI spec.
> 
> I am checking internally if there is any know issue on device side.
> This 
> model seems to be Qualcomm® Snapdragon™ X55 ?
> 
I believe this module uses this SoC yes.

Thanks,
Paul

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer
  2021-08-13 23:10     ` Hemant Kumar
@ 2021-08-16 13:48       ` Jeffrey Hugo
  0 siblings, 0 replies; 11+ messages in thread
From: Jeffrey Hugo @ 2021-08-16 13:48 UTC (permalink / raw)
  To: Hemant Kumar, Loic Poulain, Manivannan Sadhasivam, Bhaumik Bhatt
  Cc: linux-arm-msm, Paul Davey

On 8/13/2021 5:10 PM, Hemant Kumar wrote:
> One more thing to add
> 
> On 8/13/2021 3:55 PM, Hemant Kumar wrote:
>> Hi Paul,
>>
>> On 8/6/2021 2:43 AM, Loic Poulain wrote:
>>> + MHI people
>>>
>>> On Fri, 6 Aug 2021 at 06:20, Paul Davey 
>>> <Paul.Davey@alliedtelesis.co.nz> wrote:
>>>>
>>>> Hi linux-arm-msm list,
>>>>
>>>> We have been using the mhi driver with a Sierra EM9191 5G modem module
>>>> and have seen an occasional issue where the kernel would crash with
>>>> messages showing "BUG: Bad page state" which we debugged further and
>>>> found it was due to mhi_net_ul_callback freeing the same skb multiple
>>>> times, further debugging tracked this down to a case where
>>>> parse_xfer_event computed a dev_rp from the passed event's ev_tre
>>>> which does not lie within the region of valid "in flight" transfers
>>>> for the tre_ring.  See the patch below for how this was detected.
>>>>
>>>> I believe that receiving such an event results in the loop which runs
>>>> completion events for the transfers to re-run some completion
>>>> callbacks as it walks all the way around the ring again to reach the
>>>> invalid dev_rp position.
>> Do you have a log which prints the TRE being processed? Basically i am 
>> trying understand this : by the time you get double free issue, is 
>> there any pattern with respect to the TRE that is being processed. For 
>> example
>> when host processed the given TRE for the first time with RP1, stale 
>> TRE was posted by Event RP2 right after RP1
>>
>> ->RP1 [TRE1]
>> ->RP2 [TRE1]
>>
>> or occurrence of stale TRE event is random?
> If you can log all the events you are processing, so that we can check 
> when second event arrives for already processed TRE, is the transfer 
> length same as originally processed TRE or it is different. In case it 
> is different length, is the length matching to the TRE which was queue 
> but not processed yet. You can print the mhi_queue_skb TRE content while 
> queuing skb. How easy to reproduce this issue ? Is this showing up in 
> high throughput use case or it is random? any specific step to reproduce 
> this issue?

I would wonder, what is the codebase being testing?  Are the latest MHI 
patches included?  When we saw something similar on AIC100, it was 
addressed by the sanity check changes I upstreamed.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer
  2021-08-15 23:30     ` Paul Davey
@ 2021-08-23  6:47       ` Paul Davey
  2021-08-26 15:54         ` Jeffrey Hugo
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Davey @ 2021-08-23  6:47 UTC (permalink / raw)
  To: quic_jhugo, bbhatt, loic.poulain, hemantk, manivannan.sadhasivam
  Cc: linux-arm-msm

Hi Hemant, Jeffery

I have some more information after some testing.

> > Do you have a log which prints the TRE being processed? Basically i
> > am 
> > trying understand this : by the time you get double free issue, is
> > there 
> > any pattern with respect to the TRE that is being processed. For
> > example
> > when host processed the given TRE for the first time with RP1,
> > stale
> > TRE 
> > was posted by Event RP2 right after RP1
> > 
> > ->RP1 [TRE1]
> > ->RP2 [TRE1]
> > 
> > or occurrence of stale TRE event is random?
> 
I have now collected some information by adding buffers which record
some of the information desired and searching or printing this
information only when the issue is detected in order to avoid constant
verbose debug information and potential slowdowns.

From this information I can report that when this issue happens two
consecutive transfer completion events occur with the same TRE pointer
in them, I did not record events which are not transfer completion
events or the event ring RP during processing.

So the event is as follows:

mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
77c94780
mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
77c94780

Where the ptr value in the event being parsed is this value. 

I have also seen a case where a completion event would attempt to run
possibly more than once but is not caught by the check I posted in my
initial question. 

I added code to zero out some buf_info fields when processing
completion events and produce logs if the cb_buf that would be passed
to the completion callback is NULL.  

I am investigating this further and will collect similar information as
provided above for this case.

> 
> > If you can log all the events you are processing, so that we can
> > check 
> > when second event arrives for already processed TRE, is the
> > transfer 
> > length same as originally processed TRE or it is different. In case
> > it 
> > is different length, is the length matching to the TRE which was
> > queue 
> > but not processed yet. You can print the mhi_queue_skb TRE content
> > while 
> > queuing skb. How easy to reproduce this issue ? Is this showing up
> > in 
> > high throughput use case or it is random? any specific step to
> > reproduce 
> > this issue?

This TRE address matches one in the buffer of TREs added in gen_tre:

mhi mhi0: (IP_HW0_MBIM-Up) Event TRE addr: 8000000077c94780, ev_len:
5e2, tre_len: 5e2

And here we see the length's match between these.

> I would wonder, what is the codebase being testing?  Are the latest
> MHI 
> patches included?  When we saw something similar on AIC100, it was 
> addressed by the sanity check changes I upstreamed.

This is a bit complicated to answer, the codebase being tested is our
kernel, based off 5.7.19 with the mhi drivers being updated to the
upstream mainline kernel state as of around late June this year by
cherry-picking all commits to relevant paths.  Additionally the change
I submitted to this list for making the driver function on big endian
systems and changes to the PCI driver to customize the channels to the
EM9191 module and some changes to the MBIM net driver MRU for
performance reasons.  Additionally I have added one change to increase
the time that is waited at poweron for the modem to enter ready state
without increasing the timeout as this interfered with latency at
shutdown but I suspect using the upstream patch to resolve the shutdown
latency will remove the need for this and we can just increase the
timeout limit for the modem.

I can confirm that the tested codebase has the following commit
included:
ec32332df764 bus: mhi: core: Sanity check values from remote device
before use

Thanks,
Paul

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer
  2021-08-23  6:47       ` Paul Davey
@ 2021-08-26 15:54         ` Jeffrey Hugo
  2021-08-27  4:51           ` Paul Davey
  0 siblings, 1 reply; 11+ messages in thread
From: Jeffrey Hugo @ 2021-08-26 15:54 UTC (permalink / raw)
  To: Paul Davey, bbhatt, loic.poulain, hemantk, manivannan.sadhasivam
  Cc: linux-arm-msm

On 8/23/2021 12:47 AM, Paul Davey wrote:
> Hi Hemant, Jeffery
> 
> I have some more information after some testing.
> 
>>> Do you have a log which prints the TRE being processed? Basically i
>>> am
>>> trying understand this : by the time you get double free issue, is
>>> there
>>> any pattern with respect to the TRE that is being processed. For
>>> example
>>> when host processed the given TRE for the first time with RP1,
>>> stale
>>> TRE
>>> was posted by Event RP2 right after RP1
>>>
>>> ->RP1 [TRE1]
>>> ->RP2 [TRE1]
>>>
>>> or occurrence of stale TRE event is random?
>>
> I have now collected some information by adding buffers which record
> some of the information desired and searching or printing this
> information only when the issue is detected in order to avoid constant
> verbose debug information and potential slowdowns.
> 
>  From this information I can report that when this issue happens two
> consecutive transfer completion events occur with the same TRE pointer
> in them, I did not record events which are not transfer completion
> events or the event ring RP during processing.
> 
> So the event is as follows:
> 
> mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
> 77c94780
> mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
> 77c94780

This isn't good.  I would suspect that the device is glitching then, 
which should be fixed on the device side, but that doesn't help you here 
and now.

I'm thinking your change is probably a good idea based on this, but I 
have additional questions.

Can you check the address of the completion events in the shared memory 
(basically the event ring) when you see this?  I want to rule out the 
possibility that host is double processing the same event, and this is 
truly a case of the device duplicating an event.

I hope that makes sense to you.

> Where the ptr value in the event being parsed is this value.
> 
> I have also seen a case where a completion event would attempt to run
> possibly more than once but is not caught by the check I posted in my
> initial question.
> 
> I added code to zero out some buf_info fields when processing
> completion events and produce logs if the cb_buf that would be passed
> to the completion callback is NULL.
> 
> I am investigating this further and will collect similar information as
> provided above for this case.
> 
>>
>>> If you can log all the events you are processing, so that we can
>>> check
>>> when second event arrives for already processed TRE, is the
>>> transfer
>>> length same as originally processed TRE or it is different. In case
>>> it
>>> is different length, is the length matching to the TRE which was
>>> queue
>>> but not processed yet. You can print the mhi_queue_skb TRE content
>>> while
>>> queuing skb. How easy to reproduce this issue ? Is this showing up
>>> in
>>> high throughput use case or it is random? any specific step to
>>> reproduce
>>> this issue?
> 
> This TRE address matches one in the buffer of TREs added in gen_tre:
> 
> mhi mhi0: (IP_HW0_MBIM-Up) Event TRE addr: 8000000077c94780, ev_len:
> 5e2, tre_len: 5e2
> 
> And here we see the length's match between these.
> 
>> I would wonder, what is the codebase being testing?  Are the latest
>> MHI
>> patches included?  When we saw something similar on AIC100, it was
>> addressed by the sanity check changes I upstreamed.
> 
> This is a bit complicated to answer, the codebase being tested is our
> kernel, based off 5.7.19 with the mhi drivers being updated to the
> upstream mainline kernel state as of around late June this year by
> cherry-picking all commits to relevant paths.  Additionally the change
> I submitted to this list for making the driver function on big endian
> systems and changes to the PCI driver to customize the channels to the
> EM9191 module and some changes to the MBIM net driver MRU for
> performance reasons.  Additionally I have added one change to increase
> the time that is waited at poweron for the modem to enter ready state
> without increasing the timeout as this interfered with latency at
> shutdown but I suspect using the upstream patch to resolve the shutdown
> latency will remove the need for this and we can just increase the
> timeout limit for the modem.
> 
> I can confirm that the tested codebase has the following commit
> included:
> ec32332df764 bus: mhi: core: Sanity check values from remote device
> before use

Ok.  Sounds like it wouldn't catch the issue you describe above anyways. 
  Thanks for confirming though, so we can eliminate what I saw as a 
possible cause.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer
  2021-08-26 15:54         ` Jeffrey Hugo
@ 2021-08-27  4:51           ` Paul Davey
  2021-09-01  5:17             ` Paul Davey
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Davey @ 2021-08-27  4:51 UTC (permalink / raw)
  To: quic_jhugo, bbhatt, loic.poulain, hemantk, manivannan.sadhasivam
  Cc: linux-arm-msm

On Thu, 2021-08-26 at 09:54 -0600, Jeffrey Hugo wrote:
> On 8/23/2021 12:47 AM, Paul Davey wrote:
> > Hi Hemant, Jeffery
> > 
> > I have some more information after some testing.
> > 
> > > > Do you have a log which prints the TRE being processed?
> > > > Basically i
> > > > am
> > > > trying understand this : by the time you get double free issue,
> > > > is
> > > > there
> > > > any pattern with respect to the TRE that is being processed.
> > > > For
> > > > example
> > > > when host processed the given TRE for the first time with RP1,
> > > > stale
> > > > TRE
> > > > was posted by Event RP2 right after RP1
> > > > 
> > > > ->RP1 [TRE1]
> > > > ->RP2 [TRE1]
> > > > 
> > > > or occurrence of stale TRE event is random?
> > 
> > I have now collected some information by adding buffers which
> > record
> > some of the information desired and searching or printing this
> > information only when the issue is detected in order to avoid
> > constant
> > verbose debug information and potential slowdowns.
> > 
> >  From this information I can report that when this issue happens
> > two
> > consecutive transfer completion events occur with the same TRE
> > pointer
> > in them, I did not record events which are not transfer completion
> > events or the event ring RP during processing.
> > 
> > So the event is as follows:
> > 
> > mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2
> > ptr:
> > 77c94780
> > mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2
> > ptr:
> > 77c94780
> 
> This isn't good.  I would suspect that the device is glitching then, 
> which should be fixed on the device side, but that doesn't help you
> here 
> and now.
> 
> I'm thinking your change is probably a good idea based on this, but
> I 
> have additional questions.
> 
> Can you check the address of the completion events in the shared
> memory 
> (basically the event ring) when you see this?  I want to rule out
> the 
> possibility that host is double processing the same event, and this
> is 
> truly a case of the device duplicating an event.
> 
> I hope that makes sense to you.

I have added recording of the event tre address in my debug collecting
so that should answer this question when the results from that test
come back, which will run over the weekend.

Additionally I have another update with some results that occurred
since my last mail.

Two incidents occurred but I am not sure about their relative timing
the TRE addresses suggest they are not immediately after eachother.

First my check for cb_buf in buf_info is NULL went off but nothing
otherwise untoward seemed to be happening.  I have added a check to
parse_xfer_event to check if the wp in the buf_info matches the TRE
address being processed to try and catch the rings somehow getting out
of sync but aside from that I can only think of concurrency related
issues causing this problem.  Perhaps I should check if the ring in
question is full or something to give any insight into this.

Secondly we saw the following pattern in completion events:

mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
7c4004e0
mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
7c400520
mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
7c4004c0
mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
7c4004b0
mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
7c4004a0

Here we can see that instead of a completion event for 7c4004d0 we have
one for 7c400520 which is significantly ahead of the other point and
from the list of TREs I store in mhi_gen_tre I suspect that 7c400520 is
the next TRE to be used in the TRE ring at this time, as the other
information shows it would be the oldest entry in that list.  I am not
sure what could have caused this but this is a different case to the
modem repeating the same TRE in a completion event.

Thanks,
Paul

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer
  2021-08-27  4:51           ` Paul Davey
@ 2021-09-01  5:17             ` Paul Davey
  2021-09-02 21:49               ` Hemant Kumar
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Davey @ 2021-09-01  5:17 UTC (permalink / raw)
  To: quic_jhugo, bbhatt, loic.poulain, hemantk, manivannan.sadhasivam
  Cc: linux-arm-msm

On Fri, 2021-08-27 at 16:51 +1200, Paul Davey wrote:
> On Thu, 2021-08-26 at 09:54 -0600, Jeffrey Hugo wrote:
> > On 8/23/2021 12:47 AM, Paul Davey wrote:
> > > Hi Hemant, Jeffery
> > > 
> > > I have some more information after some testing.
> > > 
> > > > > Do you have a log which prints the TRE being processed?
> > > > > Basically i
> > > > > am
> > > > > trying understand this : by the time you get double free
> > > > > issue,
> > > > > is
> > > > > there
> > > > > any pattern with respect to the TRE that is being processed.
> > > > > For
> > > > > example
> > > > > when host processed the given TRE for the first time with
> > > > > RP1,
> > > > > stale
> > > > > TRE
> > > > > was posted by Event RP2 right after RP1
> > > > > 
> > > > > ->RP1 [TRE1]
> > > > > ->RP2 [TRE1]
> > > > > 
> > > > > or occurrence of stale TRE event is random?
> > > 
> > > [...]
> > 
> 
> Secondly we saw the following pattern in completion events:
> 
> mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
> 7c4004e0
> mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
> 7c400520
> mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
> 7c4004c0
> mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
> 7c4004b0
> mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
> 7c4004a0
> 
> Here we can see that instead of a completion event for 7c4004d0 we
> have
> one for 7c400520 which is significantly ahead of the other point and
> from the list of TREs I store in mhi_gen_tre I suspect that 7c400520
> is
> the next TRE to be used in the TRE ring at this time, as the other
> information shows it would be the oldest entry in that list.  I am
> not
> sure what could have caused this but this is a different case to the
> modem repeating the same TRE in a completion event.
> 

I have considered the code further, and while I have seen cases of
identical TRE completion events occurring, I do not think these result
in the double free case because if the event is actually the same as
the last one then the new dev_rp which parse_xfer_event will attempt to
advance the local_rp to will already be equal to the local_rp and the
whole loop will be skipped in the first place.  The troublesome
behaviour comes from the case I describe above where a jump to a
"future" TRE's completion event seems to occur followed by a
continuation of the order, this results in the tre_ring rp being
advanced to that future TRE and then the next completion event
following the previous ordered pattern would be before that rp location
and the loop will run through the entire tre_ring to reach the new rp
location. 

I do have another question though, the driver code seems to in some
cases take the mhi_chan->lock when updating the doorbell register, but
not when queueing new transfers.  What is the actual purpose of this
lock and why does it seem so inconsistently used?  Is there any chance
that some of my problems may be the result of queueing new transfers
racing somehow with completion event processing?

Thanks,
Paul

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer
  2021-09-01  5:17             ` Paul Davey
@ 2021-09-02 21:49               ` Hemant Kumar
  0 siblings, 0 replies; 11+ messages in thread
From: Hemant Kumar @ 2021-09-02 21:49 UTC (permalink / raw)
  To: Paul Davey, quic_jhugo, bbhatt, loic.poulain, manivannan.sadhasivam
  Cc: linux-arm-msm

Hi Paul,

On 8/31/2021 10:17 PM, Paul Davey wrote:
> On Fri, 2021-08-27 at 16:51 +1200, Paul Davey wrote:
>> On Thu, 2021-08-26 at 09:54 -0600, Jeffrey Hugo wrote:
>>> On 8/23/2021 12:47 AM, Paul Davey wrote:
>>>> Hi Hemant, Jeffery
>>>>
>>>> I have some more information after some testing.
>>>>
>>>>>> Do you have a log which prints the TRE being processed?
>>>>>> Basically i
>>>>>> am
>>>>>> trying understand this : by the time you get double free
>>>>>> issue,
>>>>>> is
>>>>>> there
>>>>>> any pattern with respect to the TRE that is being processed.
>>>>>> For
>>>>>> example
>>>>>> when host processed the given TRE for the first time with
>>>>>> RP1,
>>>>>> stale
>>>>>> TRE
>>>>>> was posted by Event RP2 right after RP1
>>>>>>
>>>>>> ->RP1 [TRE1]
>>>>>> ->RP2 [TRE1]
>>>>>>
>>>>>> or occurrence of stale TRE event is random?
>>>>
>>>> [...]
>>>
>>
>> Secondly we saw the following pattern in completion events:
>>
>> mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
>> 7c4004e0
>> mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
>> 7c400520
>> mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
>> 7c4004c0
>> mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
>> 7c4004b0
>> mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
>> 7c4004a0
>>
>> Here we can see that instead of a completion event for 7c4004d0 we
>> have
>> one for 7c400520 which is significantly ahead of the other point and
>> from the list of TREs I store in mhi_gen_tre I suspect that 7c400520
>> is
>> the next TRE to be used in the TRE ring at this time, as the other
>> information shows it would be the oldest entry in that list.  I am
>> not
>> sure what could have caused this but this is a different case to the
>> modem repeating the same TRE in a completion event.
>>
> 
> 
> I have considered the code further, and while I have seen cases of
> identical TRE completion events occurring, I do not think these result
> in the double free case because if the event is actually the same as
> the last one then the new dev_rp which parse_xfer_event will attempt to
> advance the local_rp to will already be equal to the local_rp and the
> whole loop will be skipped in the first place.  The troublesome
> behaviour comes from the case I describe above where a jump to a
> "future" TRE's completion event seems to occur followed by a
> continuation of the order, this results in the tre_ring rp being
> advanced to that future TRE and then the next completion event
> following the previous ordered pattern would be before that rp location
> and the loop will run through the entire tre_ring to reach the new rp
> location.
> 
> 
> I do have another question though, the driver code seems to in some
> cases take the mhi_chan->lock when updating the doorbell register, but
> not when queueing new transfers.  What is the actual purpose of this
> lock and why does it seem so inconsistently used?  Is there any chance
> that some of my problems may be the result of queueing new transfers
> racing somehow with completion event processing?
Are you pointing to MHI_EV_CC_OOB and MHI_EV_CC_DB_MODE completion code 
? To prove that race condition you disable burst mode as an experiment

In pci_generic.c
#define MHI_CHANNEL_CONFIG_HW_UL(ch_num, ch_name, el_count, ev_ring)

- .doorbell = MHI_DB_BRST_ENABLE
+ .doorbell = MHI_DB_BRST_DISABLE

I agree that locking is an issue in that case as we are taking 
read/write channel lock in parse_xfer_event and using only pm_lock in 
queue API.

Thanks,
Hemant
> 
> Thanks,
> Paul
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum, a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-09-02 21:50 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-06  4:20 bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer Paul Davey
2021-08-06  9:43 ` Loic Poulain
2021-08-13 22:55   ` Hemant Kumar
2021-08-13 23:10     ` Hemant Kumar
2021-08-16 13:48       ` Jeffrey Hugo
2021-08-15 23:30     ` Paul Davey
2021-08-23  6:47       ` Paul Davey
2021-08-26 15:54         ` Jeffrey Hugo
2021-08-27  4:51           ` Paul Davey
2021-09-01  5:17             ` Paul Davey
2021-09-02 21:49               ` Hemant Kumar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).