Re: BUG report: usb: dwc3: Link TRB triggered an intterupt without IOC being setted

From: alex zheng <tc0721@gmail.com>
To: Mathias Nyman <mathias.nyman@linux.intel.com>,
	Felipe Balbi <felipe.balbi@linux.intel.com>,
	David Laight <David.Laight@aculab.com>
Cc: "linux-usb@vger.kernel.org" <linux-usb@vger.kernel.org>,
	"xiaowei.zheng@dji.com" <xiaowei.zheng@dji.com>
Subject: Re: BUG report: usb: dwc3: Link TRB triggered an intterupt without IOC being setted
Date: Wed, 23 Oct 2019 17:52:37 +0800
Message-ID: <CADGPSwiCY9=kUpKmcUwAhvCHmvGDSrxoBXEkzgQpEpiakKEv6A@mail.gmail.com>
In-Reply-To: <CADGPSwhCPvdu=KmQP6RHMJnh292UO0uBAt+KyJqqOWY5DWDc3w@mail.gmail.com>

Hi all,

We found that this is a known issue in the Synopsys DWC3 USB controller:
when PARKMODE_SS is enabled, the controller may hang or schedule TRBs
incorrectly under heavy load.

Setting DISABLE_PARKMODE_SS to 1 works around this bug.
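
For anyone who wants to try the same workaround, here is a minimal sketch
of the bit manipulation involved. It assumes that DISABLE_PARKMODE_SS is
bit 17 of the GUCTL1 register; that position is an assumption on our side
and should be checked against the databook for your core version. The
helper names are hypothetical, and the register is modeled as a plain
32-bit value instead of real MMIO.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * ASSUMPTION: DISABLE_PARKMODE_SS is bit 17 of GUCTL1.
 * Verify the bit position against the databook for your DWC3 version.
 */
#define GUCTL1_PARKMODE_DISABLE_SS (UINT32_C(1) << 17)

/*
 * Hypothetical helper: take a value read from GUCTL1 and return the value
 * to write back, with SuperSpeed park mode disabled and all other bits
 * left unchanged (read-modify-write).
 */
static inline uint32_t guctl1_disable_parkmode_ss(uint32_t guctl1)
{
	return guctl1 | GUCTL1_PARKMODE_DISABLE_SS;
}

/* Hypothetical helper: report whether the workaround bit is already set. */
static inline bool guctl1_parkmode_ss_disabled(uint32_t guctl1)
{
	return (guctl1 & GUCTL1_PARKMODE_DISABLE_SS) != 0;
}
```

In a driver, the value passed in would come from a register read done
during controller initialization, and the result would be written back
before any transfers are started.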

Thank you for your help.

alex zheng <tc0721@gmail.com> wrote on Thu, Sep 26, 2019 at 7:34 PM:
>
> Adding the log file.
>
> alex zheng <tc0721@gmail.com> wrote on Thu, Sep 26, 2019 at 6:38 PM:
> >
> > Hi,
> >
> > Mathias Nyman <mathias.nyman@linux.intel.com> wrote on Thu, Sep 26, 2019 at 4:19 PM:
> > >
> > > On 26.9.2019 8.45, Felipe Balbi wrote:
> > > >
> > > > Hi,
> > > >
> > > > David Laight <David.Laight@ACULAB.COM> writes:
> > > >> From: Mathias Nyman
> > > >>> Sent: 25 September 2019 15:48
> > > >>>
> > > >>> On 24.9.2019 17.45, alex zheng wrote:
> > > >>>> Hi Mathias,
> > > >> ...
> > > > >>> Logs show your transfer ring has four segments, but the hardware fails
> > > > >>> to jump from the last segment back to the first.
> > > > >>>
> > > > >>> The last TRB (Link TRB) of each segment points to the next segment;
> > > > >>> the last segment's Link TRB points back to the first segment.
> > > >>>
> > > >>> In your case:
> > > >>> 0x1d117000 -> 0x1eb09000 -> 0x1eb0a000 -> 0x1dbda000 -> (back to 0x1d117000)
> > > >>>
> > > >>> For some reason your hardware doesn't treat the last TRB at the last segment
> > > >>> as a LINK TRB, instead it just issues a transfer event for it, and continues to
> > > >>> the next address instead of jumping back to first segment:
> > > >>
> > > >> That could be a cache coherency (or flushing (etc)) issue.
> > >
> > > The Link TRB is written very early, right after the ring segment is allocated,
> > > and before any other TRBs. 255 other TRBs were written and handled by hw
> > > on this segment after this, so not very likely a flushing/cache coherency issue.
> > >
> > I added a flush_cache_all() after every queue_trb() call, but it made no
> > difference. It does not seem to be a flushing/cache coherency issue.
> >
> > The flush was added like this:
> >      inc_enq(xhci, ring, more_trbs_coming);
> >   + flush_cache_all();
> >
> > > >
> > > > XHCI has a HW-configurable maximum number of segments in a ring. AFAICT,
> > > > xhci driver doesn't take that into consideration today. Perhaps the HW
> > > > in question doesn't like more than 3 segments.
> > > >
> > > > Mathias, what was the register to check this? Do you remember?
> > > >
> > >
> > > I only recall a limit for the event ring in the HCSPARAMS2 register (ERST Max),
> > > not for transfer rings.
> > >
> > > Other things to look at would be
> > >
> > > - check that Toggle Cycle bit is correct for last segments link TRB (incomplete logs)
> >
> > I dumped another error log; for more complete logs see the attached
> > file (transfer_error_0926.cap). In that log,
> > the erroneous Link TRB is:
> > 0x1d00dff0: TRB 000000001d068000 status 'Invalid' len 0 slot 0 ep 0
> > type 'Link' flags e:c
> > and the last segment's Link TRB is:
> > 0x1eb0aff0: TRB 000000001d00d000 status 'Invalid' len 0 slot 0 ep 0
> > type 'Link' flags e:C
> >
> > > - some old xHCI hardware needed the Chain bit set in link TRB for some isoc rings
> > The xHCI version is 1.1:
> > [    6.888570] c1 46 (kworker/u8:1) xhci-hcd xhci-hcd.0.auto: HCIVERSION: 0x110
> >
> > > - was the ring recently expanded? Rings usually start with only two segments
> > The extra segments were added after the raw data test had run for a while,
> > especially once the RNDIS test (iperf3) started running.
> >
> > Other info:
> > 1. This issue seems to happen only when the raw bulk data test and the
> > RNDIS test (on another pair of endpoints) run at the same time, and it
> > happens more often if we queue TRBs faster.
> > 2. The raw bulk data test is a libusb test that uses ep4 (in) and ep3 (out)
> > to transfer raw bulk data; I use iperf3 (TCP) to test USB RNDIS.
> > 3. The attached log file shows only the ep4 (in) enqueue/dequeue log, to
> > keep it readable.
> > 4. More test results:
> >    1)  one raw bulk data test alone  -->  (always fine)
> >    2)  raw bulk data test + RNDIS test at the same
> >         time --> (transfer error within 10 minutes)
> >    3)  two raw bulk data tests at the same time (with
> >         two pairs of endpoints) --> (transfer error within 10 minutes)
> > 5. I tried modifying DWC3 hardware registers such as the TX/RX FIFO sizes
> >     and GTXTHRCFG/GRXTHRCFG, but that did not help either.
> > 6. Related interface info:
> >     8801 I:* If#= 0 Alt= 0 #EPs= 1 Cls=e0(wlcon) Sub=01 Prot=03 Driver=rndis_host
> >     8802 E:  Ad=82(I) Atr=03(Int.) MxPS=   8 Ivl=32ms
> >     8803 I:* If#= 1 Alt= 0 #EPs= 2 Cls=0a(data ) Sub=00 Prot=00 Driver=rndis_host  -----> used in rndis test
> >     8804 E:  Ad=81(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms
> >     8805 E:  Ad=01(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms
> >     8809 I:* If#= 3 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=43 Prot=01 Driver=(none)  -----> used in raw bulk test
> >     8810 E:  Ad=03(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms
> >     8811 E:  Ad=84(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms
> >     8820 I:* If#= 7 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=43 Prot=01 Driver=(none)  -----> used in double raw bulk test
> >     8821 E:  Ad=06(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms
> >     8822 E:  Ad=88(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms
> >
> > It seems that there are conflicts when multiple endpoints are active at
> > the same time on our SoC. Is there anything else we can try?
> >
> > >
> >
> > > Mathias