Re: [PATCH 1/2] usb: dwc3: core: Introduce dwc3_device_reinit()

From: Roger Quadros <rogerq@ti.com>
To: Felipe Balbi <balbi@kernel.org>, John Youn <John.Youn@synopsys.com>
Cc: "nsekhar@ti.com" <nsekhar@ti.com>,
	"linux-usb@vger.kernel.org" <linux-usb@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Shankar, Abhishek" <a.shankar@ti.com>,
	"B, Ravi" <ravibabu@ti.com>
Subject: Re: [PATCH 1/2] usb: dwc3: core: Introduce dwc3_device_reinit()
Date: Mon, 11 Apr 2016 15:37:24 +0300	[thread overview]
Message-ID: <570B9A84.1020607@ti.com> (raw)
In-Reply-To: <87inzx7q0l.fsf@intel.com>

On 04/04/16 11:10, Felipe Balbi wrote:
> 
> Hi,
> 
> Roger Quadros <rogerq@ti.com> writes:
>>>>>>>>>>>>> We will need this function for a workaround.
>>>>>>>>>>>>> The function issues a softreset only to the device
>>>>>>>>>>>>> controller and performs minimal re-initialization
>>>>>>>>>>>>> so that the device controller can be usable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As some code is similar to dwc3_core_init() take out
>>>>>>>>>>>>> common code into dwc3_get_gctl_quirks().
>>>>>>>>>>>>>
>>>>>>>>>>>>> We add a new member (prtcap_mode) to struct dwc3 to
>>>>>>>>>>>>> keep track of the current mode in the PRTCAPDIR register.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Roger Quadros <rogerq@ti.com>
>>>>>>>>>>>>
>>>>>>>>>>>> I must say, I don't like this at all :-p There's ONE known silicon which
>>>>>>>>>>>> needs this because of a poor silicon integration which took an IP with a
>>>>>>>>>>>> known erratum where it can't be made to work on lower speeds and STILL
>>>>>>>>>>>> was integrated without a superspeed PHY.
>>>>>>>>>>>>
>>>>>>>>>>>> There's a reason why I never tried to push this upstream myself ;-)
>>>>>>>>>>>>
>>>>>>>>>>>> I'm really thinking we might be better off adding a quirk flag to skip
>>>>>>>>>>>> the metastability workaround and allow this ONE silicon to set the
>>>>>>>>>>>> controller to lower speed.
>>>>>>>>>>>>
>>>>>>>>>>>> John, can you check with your colleagues if we would ever fall into
>>>>>>>>>>>> STAR#9000525659 if we set maximum speed to high speed during driver
>>>>>>>>>>>> probe and never touch it again ? I would assume we don't really fall
>>>>>>>>>>>> into the metastability workaround, right ? We're not doing any sort of
>>>>>>>>>>>> PM for dwc3...
>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Felipe,
>>>>>>>>>
>>>>>>>>> Do you mean to keep DCFG.speed to SS and set dwc->maximum_speed to HS?
>>>>>>>>> I don't see an issue with that as long as we always ignore
>>>>>>>>> dwc->maximum_speed when programming DCFG.speed for all affected
>>>>>>>>> versions of the core. As long as the DCFG.speed = SS, you should not
>>>>>>>>> hit the STAR.
>>>>>>>>
>>>>>>>> I actually mean changing DCFG.speed during driver probe and never
>>>>>>>> touching it again. Would that still cause problems ?
>>>>>>>>
>>>>>>>
>>>>>>> In that case I'm not sure. The engineer who would know is off until
>>>>>>> next week so I'll get back to you as soon as I can talk to him about
>>>>>>> it.
>>>>>>
>>>>>
>>>>> So the engineers said that this issue can occur while set to HS and
>>>>> the run/stop bit is changed so it seems that won't work.
>>>>
>>>> Thanks John.
>>>>
>>>> Felipe,
>>>>
>>>> any suggestion how we can fix this upstream?
>>>
>>> no idea, I don't have a lot of memory about this problem. I really don't
>>> remember the details about this, is there an openly available errata
>>> document which I could read ? /me goes search for it.
>>>
>>> I found [1] which tells me, the following:
>>>
>>>
>>> | i819        | A Device Control Bit Meta-Stability for USB3.0 Controller in USB2.0 Mode   |
>>> |-------------+----------------------------------------------------------------------------|
>>> | Criticality | Medium                                                                     |
>>> |             |                                                                            |
>>> | Descritiion | When USB3.0 controller core is programmed to be a USB 2.0-only device      |
>>> |             | hardware meta-stability on USB_DCTL[31]RUNSTOP bit causing the core to     |
>>> |             | attempt high speed as well as SuperSpeed connection or completely miss     |
>>> |             | the attach request.                                                        |
>>> |             |                                                                            |
>>> | Workaround  | If the requirement is to always function in USB 2.0 mode, there is no      |
>>> |             | workaround.                                                                |
>>> |             | Otherwise, you can always program the USB controller core to be SuperSpeed |
>>> |             | 3.0 capable (USB_DCFG[2:0]DEVSPD = 0x4)                                    |
>>> |             |                                                                            |
>>> | Revisions   | SR 1.1, 2.0                                                                |
>>> | Impacted    |                                                                            |
>>> |-------------+----------------------------------------------------------------------------|
>>>
>>> So, TI's own documentation says that there is _no_ workaround. My
>>
>> We are working on updating that document. Apparently Synopsis has suggested this workaround.
>> pasting verbatim
>>
>> "
>> -          As last step of device configuration we set the RUNSTOP bit in DCTL.
>>
>> -          Once we set the RUNSTOP bit, we need to monitor GDBGLTSSM for 100 ms until one of the two below happens:.
>>
>> 	o   We see the GDBGLTSSM.LTDB_LINK_STATE changing from 4
>>
>> 	o   We receive the USB 2.0 reset interrupt.
>>
>> If none of above happens then we can stop monitoring it.
>>
>> -          If state change from 4 occurs issue a SoftReset thru DCTL.CSftRst and reconfigure Device. This time it is guaranteed that no metastability will occur so no need to do the 100ms timeout.
>> "
> 
> I don't have this text anywhere so I don't know. Is this something TI
> came up with or Synopsys ? Unless I can see a document (preferrably from
> Synopsys) stating this, I can't really accept $subject.

OK. I'll try to find out if there is an official document about this.

> 
> Another question is: if all it takes is an extra SoftReset, why don't we
> just reset it during probe() if max_speed < SUPER and we're running on
> rev < 2.20a ? BTW, which revision of the IP is on AM57x/DRA7x ?

The issue might happen on any Run/Stop transition so not sure if doing it
SoftReset just at probe fixes anything.

On DRA7x it is rev 2.02a.

> 
>>> question is, then: How are you sure that resetting the device actually
>>> solves the issue ? Did you really hit the metastability problem and
>>> noted that it works after a soft-reset ? How did you verify that
>>
>> I don't know if it solves the issue or not. It was suggested by
>> Synopsis to TI's silicon team.
> 
> now that's a bummer ;-)
> 
>> I never hit the metastability problem detection condition in my
>> overnight tests (i.e. LTDB_LINK_STATE != 4).
> 
> overnight is not enough. You need to keep this running for weeks.

how many weeks is acceptable for you? I can run for that long, no problem.
And what if the issue doesn't happen in that time frame, would you still
consider this case?

> 
>> I have verified that things work after a soft-reset by faking that the
>> error happens.
> 
> but if you never hit the actual failure, you have verified that it works
> _without_ the workaround. I mean, if you can't be sure RUN/STOP went
> metastable, you can't really be sure you're working around the Erratum.
> 
>>> Run/Stop was in a metastable state, considering that Run/Stop signal is
>>> not visible outside the die ?
>>
>> LTDB_LINK_STATE != 4 within 100ms or RUNSTOP set is the condition to
>> detect that the issue occurred.
> 
> this doesn't prove anything. This just means that your 100ms timer
> expired. Unless you can verify that Run/Stop is in metastability, you
> cannot be sure this workaround works.
> 
> Did anybody run silicon simulation to verify this case ? It's really the
> only way to be sure.

AFAIK this wasn't reproducible during silicon simulation either.

> 
>>> It seems to me that resetting the IP is just as "dangerous" as setting
>>> the IP to High-speed in the first place. No ?
>>
>> The soft-reset is just a recovery mechanism if that error is ever hit.
> 
> but you don't know if that's a *proper* recovery mechanism because you
> never even *hit* the error.
> 
>> Putting the controller in reset state means it is in a known
>> state. Why do you say it would be dangerous?
> 
> Because you can't predict the systems' behavior. If the flip-flop didn't
> have time to settle into 0 or 1 state, you don't know what the
> combinatorial part of the IP will do with that bogus value. It's truly
> unpredictable. You also cannot know, for sure, that a SoftReset will be
> enough to bring that flip-flop out of metastability.

I'm not an expert in this area and can only follow the advice the Silicon team gives.

> 
>> The original workaround i.e. setting the High-speed instance to
>> Super-speed to avoid this errata is causing another side
>> effect. i.e. erratic interrupts happen and more than 2 seconds delay
> 
> this should have been an expected side-effect when you design a
> SuperSpeed controller without a SuperSpeed PHY and don't properly
> terminate inputs. What you have is a floating PIPE3 interface not
> properly terminated and capturing random noise (basically acting as a
> very poor antenna inside your die). Of course the IP will go bonkers and
> give you "erratic error" interrupts. It has no idea what the hell this
> "PHY" on the PIPE3 interface is doing.

We know that. The damage is already done. :)

> 
>> to enumerations. This problem is more serious and so we have to keep
>> the controller in High-speed and tackle the meta-stability condition
>> if it happens.
> 
> you have to tackle the meta-stability, sure, but we need guarantee that
> $subject IS indeed tackling that problem. Unless there's proof that this
> really solves the meta-stability issue, I won't take $subject. Sorry
> dude, but I don't want regressions elsewhere because of a badly
> validated patch.

I understand. I will see if someone from TI can provide me official documentation
about the workaround.

> 
> Bottomline, it's not enough to _state_ that it solves the problem. You
> need to first *trigger* the issue without the workaround, then apply
> workaround and trigger it again. Then, and only then, you can be certain
> you're fixing the problem.
> 
> After that, we will look into how to make sure this has no impact to
> other users.
> 

OK, Thanks.

cheers,
-roger