Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer

From: Paul Davey <Paul.Davey@alliedtelesis.co.nz>
To: "quic_jhugo@quicinc.com" <quic_jhugo@quicinc.com>,
	"bbhatt@codeaurora.org" <bbhatt@codeaurora.org>,
	"loic.poulain@linaro.org" <loic.poulain@linaro.org>,
	"hemantk@codeaurora.org" <hemantk@codeaurora.org>,
	"manivannan.sadhasivam@linaro.org"
	<manivannan.sadhasivam@linaro.org>
Cc: "linux-arm-msm@vger.kernel.org" <linux-arm-msm@vger.kernel.org>
Subject: Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer
Date: Mon, 23 Aug 2021 06:47:03 +0000	[thread overview]
Message-ID: <2b1cbecf50a57a229e30d1bff060d0e241e2841a.camel@alliedtelesis.co.nz> (raw)
In-Reply-To: <ce8082f8ecd6bea2961d8841ea6eb1c14b5a34dd.camel@alliedtelesis.co.nz>

Hi Hemant, Jeffery

I have some more information after some testing.

> > Do you have a log which prints the TRE being processed? Basically i
> > am 
> > trying understand this : by the time you get double free issue, is
> > there 
> > any pattern with respect to the TRE that is being processed. For
> > example
> > when host processed the given TRE for the first time with RP1,
> > stale
> > TRE 
> > was posted by Event RP2 right after RP1
> > 
> > ->RP1 [TRE1]
> > ->RP2 [TRE1]
> > 
> > or occurrence of stale TRE event is random?
> 
I have now collected some information by adding buffers which record
some of the information desired and searching or printing this
information only when the issue is detected in order to avoid constant
verbose debug information and potential slowdowns.

From this information I can report that when this issue happens two
consecutive transfer completion events occur with the same TRE pointer
in them, I did not record events which are not transfer completion
events or the event ring RP during processing.

So the event is as follows:

mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
77c94780
mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
77c94780

Where the ptr value in the event being parsed is this value. 

I have also seen a case where a completion event would attempt to run
possibly more than once but is not caught by the check I posted in my
initial question. 

I added code to zero out some buf_info fields when processing
completion events and produce logs if the cb_buf that would be passed
to the completion callback is NULL.  

I am investigating this further and will collect similar information as
provided above for this case.

> 
> > If you can log all the events you are processing, so that we can
> > check 
> > when second event arrives for already processed TRE, is the
> > transfer 
> > length same as originally processed TRE or it is different. In case
> > it 
> > is different length, is the length matching to the TRE which was
> > queue 
> > but not processed yet. You can print the mhi_queue_skb TRE content
> > while 
> > queuing skb. How easy to reproduce this issue ? Is this showing up
> > in 
> > high throughput use case or it is random? any specific step to
> > reproduce 
> > this issue?

This TRE address matches one in the buffer of TREs added in gen_tre:

mhi mhi0: (IP_HW0_MBIM-Up) Event TRE addr: 8000000077c94780, ev_len:
5e2, tre_len: 5e2

And here we see the length's match between these.

> I would wonder, what is the codebase being testing?  Are the latest
> MHI 
> patches included?  When we saw something similar on AIC100, it was 
> addressed by the sanity check changes I upstreamed.

This is a bit complicated to answer, the codebase being tested is our
kernel, based off 5.7.19 with the mhi drivers being updated to the
upstream mainline kernel state as of around late June this year by
cherry-picking all commits to relevant paths.  Additionally the change
I submitted to this list for making the driver function on big endian
systems and changes to the PCI driver to customize the channels to the
EM9191 module and some changes to the MBIM net driver MRU for
performance reasons.  Additionally I have added one change to increase
the time that is waited at poweron for the modem to enter ready state
without increasing the timeout as this interfered with latency at
shutdown but I suspect using the upstream patch to resolve the shutdown
latency will remove the need for this and we can just increase the
timeout limit for the modem.

I can confirm that the tested codebase has the following commit
included:
ec32332df764 bus: mhi: core: Sanity check values from remote device
before use

Thanks,
Paul