Re: BUG report: usb: dwc3: Link TRB triggered an intterupt without IOC being setted

From: alex zheng <tc0721@gmail.com>
To: Mathias Nyman <mathias.nyman@linux.intel.com>
Cc: Felipe Balbi <felipe.balbi@linux.intel.com>,
	David Laight <David.Laight@aculab.com>,
	"linux-usb@vger.kernel.org" <linux-usb@vger.kernel.org>,
	"xiaowei.zheng@dji.com" <xiaowei.zheng@dji.com>
Subject: Re: BUG report: usb: dwc3: Link TRB triggered an intterupt without IOC being setted
Date: Thu, 26 Sep 2019 18:38:32 +0800
Message-ID: <CADGPSwgJMKfQChfxMNU4S_xv1vfHr7_GY6rGwgeDOVuW6+mpVg@mail.gmail.com>
In-Reply-To: <52a7b158-ab76-432a-4d2c-7b731dc9c2a2@linux.intel.com>

Hi,

Mathias Nyman <mathias.nyman@linux.intel.com> wrote on Thu, Sep 26, 2019 at 4:19 PM:
>
> On 26.9.2019 8.45, Felipe Balbi wrote:
> >
> > Hi,
> >
> > David Laight <David.Laight@ACULAB.COM> writes:
> >> From: Mathias Nyman
> >>> Sent: 25 September 2019 15:48
> >>>
> >>> On 24.9.2019 17.45, alex zheng wrote:
> >>>> Hi Mathias,
> >> ...
> >>> Logs show your transfer ring has four segments, but hardware fails to
> >>> jump from the last segment back to first)
> >>>
> >>> Last TRB (LINK TRB) of each segment points to the next segment,
> >>> last segments link trb points back to first segment.
> >>>
> >>> In your case:
> >>> 0x1d117000 -> 0x1eb09000 -> 0x1eb0a000 -> 0x1dbda000 -> (back to 0x1d117000)
> >>>
> >>> For some reason your hardware doesn't treat the last TRB at the last segment
> >>> as a LINK TRB, instead it just issues a transfer event for it, and continues to
> >>> the next address instead of jumping back to first segment:
> >>
> >> That could be a cache coherency (or flushing (etc)) issue.
>
> The Link TRB is written very early, right after the ring segment is allocated,
> and before any other TRBs. 255 other TRBs were written and handled by hw
> on this segment after this, so not very likely a flushing/cache coherency issue.
>
I added a flush_cache_all() call after every queue_trb(), but it made
no difference. So it does not seem to be a flushing/cache coherency
issue.

The flush was added like this:
     inc_enq(xhci, ring, more_trbs_coming);
+    flush_cache_all();

> >
> > XHCI has a HW-configurable maximum number of segments in a ring. AFAICT,
> > xhci driver doesn't take that into consideration today. Perhaps the HW
> > in question doesn't like more than 3 segments.
> >
> > Mathias, what was the register to check this? Do you remember?
> >
>
> I only recall a limit for the event ring in the HCSPARAMS2 register (ERST Max),
> not for transfer rings.
>
> Other things to look at would be
>
> - check that Toggle Cycle bit is correct for last segments link TRB (incomplete logs)

I dumped another error log; for more complete logs, see the attached
file (transfer_error_0926.cap). In that log, the failing link TRB is:
0x1d00dff0: TRB 000000001d068000 status 'Invalid' len 0 slot 0 ep 0 type 'Link' flags e:c
and the last segment's link TRB is:
0x1eb0aff0: TRB 000000001d00d000 status 'Invalid' len 0 slot 0 ep 0 type 'Link' flags e:C
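
For anyone checking the Link TRB flags by hand, here is a minimal sketch
of how the control dword (the fourth dword of the TRB) could be decoded.
The bit positions follow the xHCI specification layout for a Link TRB.
The macro names, the struct and the helper decode_link_control() are
hypothetical and only for illustration; this is not the driver's own
code, and the sample value in the comment is made up, not taken from the
log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Control-dword bits of a Link TRB (xHCI spec layout). */
#define TRB_CYCLE       (1u << 0)   /* Cycle bit */
#define TRB_TC          (1u << 1)   /* Toggle Cycle */
#define TRB_CHAIN       (1u << 4)   /* Chain bit */
#define TRB_IOC         (1u << 5)   /* Interrupt On Completion */
#define TRB_TYPE_SHIFT  10
#define TRB_TYPE_MASK   (0x3fu << TRB_TYPE_SHIFT)
#define TRB_TYPE_LINK   6u          /* TRB type ID of a Link TRB */

/* Hypothetical helper struct: one flag per bit we care about. */
struct link_trb_flags {
	bool is_link;
	bool cycle;
	bool toggle_cycle;
	bool chain;
	bool ioc;
};

/* Decode the control dword of a TRB, e.g. 0x1803 (illustrative value). */
static struct link_trb_flags decode_link_control(uint32_t control)
{
	struct link_trb_flags f;

	f.is_link = ((control & TRB_TYPE_MASK) >> TRB_TYPE_SHIFT) == TRB_TYPE_LINK;
	f.cycle = (control & TRB_CYCLE) != 0;
	f.toggle_cycle = (control & TRB_TC) != 0;
	f.chain = (control & TRB_CHAIN) != 0;
	f.ioc = (control & TRB_IOC) != 0;
	return f;
}
```

With this, the link TRB of the last segment would be expected to decode
with is_link and toggle_cycle both set, and ioc clear if no interrupt
was requested for it.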

> - some old xHCI hardware needed the Chain bit set in link TRB for some isoc rings
The xHCI version is 1.1:
[    6.888570] c1 46 (kworker/u8:1) xhci-hcd xhci-hcd.0.auto: HCIVERSION: 0x110

> - was ring recently expanded?, usually rings start with only two segments
The extra segments are added after the raw data test has run for a
while, especially once the RNDIS test (iperf3) starts running.
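
To make the expected layout after expansion explicit, below is a rough
model of a multi-segment ring. It assumes 256 TRBs of 16 bytes per
segment, which matches the 0x...ff0 link TRB addresses in the logs. The
helper names are hypothetical, and the rule that only the last segment's
link TRB carries Toggle Cycle is my understanding of how the driver
links segments, not something read back from the hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define TRB_SIZE          16u    /* bytes per TRB */
#define TRBS_PER_SEGMENT  256u   /* assumed segment size */

/* Address of the Link TRB: the last TRB slot of a segment. */
static uint64_t link_trb_addr(uint64_t seg_base)
{
	return seg_base + (uint64_t)(TRBS_PER_SEGMENT - 1) * TRB_SIZE;
}

/*
 * Segment that the Link TRB of segment i should point to, given the
 * segment base addresses in ring order. The last one wraps to the first.
 */
static uint64_t link_target(const uint64_t *segs, size_t nsegs, size_t i)
{
	return segs[(i + 1) % nsegs];
}

/* Only the Link TRB of the last segment is expected to toggle the cycle. */
static int needs_toggle_cycle(size_t nsegs, size_t i)
{
	return i == nsegs - 1;
}
```

For the four-segment ring quoted above (0x1d117000, 0x1eb09000,
0x1eb0a000, 0x1dbda000), link_target() for index 3 gives 0x1d117000
again, and link_trb_addr(0x1eb0a000) gives 0x1eb0aff0, the address seen
in the new log.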

Other info:
1. This issue seems to happen only when the raw bulk data test and the
RNDIS test (on another pair of endpoints) run at the same time, and it
happens more often if we queue TRBs faster.
2. The raw bulk data test case is a libusb test that uses ep4 (in) and
ep3 (out) to transfer raw bulk data, and I use iperf3 (TCP) to test USB
RNDIS.
3. The attached log file only shows the ep4 (in) enqueue/dequeue log, to
keep it readable.
4. More test results:
           1)  running just one raw bulk data test  -->  (always fine)
           2)  running the raw bulk data test and the RNDIS test at the
same time  -->  (transfer error within 10 minutes)
           3)  running two raw bulk data tests at the same time (with
two pairs of endpoints)  -->  (transfer error within 10 minutes)
5. I tried modifying DWC3 hw registers such as the TX/RX FIFO sizes and
GTXTHRCFG/GRXTHRCFG, but that did not help either.
6. Related interface info:
             8801 I:* If#= 0 Alt= 0 #EPs= 1 Cls=e0(wlcon) Sub=01 Prot=03 Driver=rndis_host
             8802 E:  Ad=82(I) Atr=03(Int.) MxPS=   8 Ivl=32ms
             8803 I:* If#= 1 Alt= 0 #EPs= 2 Cls=0a(data ) Sub=00 Prot=00 Driver=rndis_host   -----> used in rndis test
             8804 E:  Ad=81(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms
             8805 E:  Ad=01(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms
             8809 I:* If#= 3 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=43 Prot=01 Driver=(none)       -----> used in raw bulk test
             8810 E:  Ad=03(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms
             8811 E:  Ad=84(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms
             8820 I:* If#= 7 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=43 Prot=01 Driver=(none)       -----> used in double raw bulk test
             8821 E:  Ad=06(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms
             8822 E:  Ad=88(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms

It seems there is some conflict when multiple endpoints are active at
the same time on our SoC. Is there anything else we can try?

>

> Mathias