RE: bug: usb: gadget: FSL_UDC_CORE Corrupted request list leads to unrecoverable loop.

From: Eugene Bordenkircher <Eugene_Bordenkircher@selinc.com>
To: Thorsten Leemhuis <regressions@leemhuis.info>,
	Joakim Tjernlund <Joakim.Tjernlund@infinera.com>,
	"linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>,
	"linux-usb@vger.kernel.org" <linux-usb@vger.kernel.org>
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"balbi@kernel.org" <balbi@kernel.org>,
	"leoyang.li@nxp.com" <leoyang.li@nxp.com>
Subject: RE: bug: usb: gadget: FSL_UDC_CORE Corrupted request list leads to unrecoverable loop.
Date: Tue, 16 Nov 2021 19:11:29 +0000	[thread overview]
Message-ID: <MWHPR2201MB15209AA4F2457934BDD3293B91999@MWHPR2201MB1520.namprd22.prod.outlook.com> (raw)
In-Reply-To: <bb5c5d0f-2ae7-8426-0021-baeca8f7dd11@leemhuis.info>

On 02.11.21 22:15, Joakim Tjernlund wrote:
> On Sat, 2021-10-30 at 14:20 +0000, Joakim Tjernlund wrote:
>> On Fri, 2021-10-29 at 17:14 +0000, Eugene Bordenkircher wrote:
>
>>> We've discovered a situation where the FSL udc driver (drivers/usb/gadget/udc/fsl_udc_core.c) will enter a loop iterating over the request queue, but the queue has been corrupted at some point so it loops infinitely.  I believe we have narrowed into the offending code, but we are in need of assistance trying to find an appropriate fix for the problem.  The identified code appears to be in all versions of the Linux kernel the driver exists in.
>>>
>>> The problem appears to be when handling a USB_REQ_GET_STATUS request.  The driver gets this request and then calls the ch9getstatus() function.  In this function, it starts a request by "borrowing" the per device status_req, filling it in, and then queuing it with a call to list_add_tail() to add the request to the endpoint queue.  Right before it exits the function however, it's calling ep0_prime_status(), which is filling out that same status_req structure and then queuing it with another call to list_add_tail() to add the request to the endpoint queue.  This adds two instances of the exact same LIST_HEAD to the endpoint queue, which breaks the list since the prev and next pointers end up pointing to the wrong things.  This ends up causing a hard loop the next time nuke() gets called, which happens on the next setup IRQ.
>>>
>>> I'm not sure what the appropriate fix to this problem is, mostly due to my lack of expertise in USB and this driver stack.  The code has been this way in the kernel for a very long time, which suggests that it has been working, unless USB_REQ_GET_STATUS requests are never made.  This further suggests that there is something else going on that I don't understand.  Deleting the call to ep0_prime_status() and the following ep0stall() call appears, on the surface, to get the device working again, but may have side effects that I'm not seeing.
>>>
>>> I'm hopeful someone in the community can help provide some information on what I may be missing or help come up with a solution to the problem.  A big thank you to anyone who would like to help out.
>>>
>>> Eugene
>>
>> Run into this to a while ago. Found the bug and a few more fixes.
>> This is against 4.19 so you may have to tweak them a bit.
>> Feel free to upstream them.
>>
>>  Jocke
>
> Curious, did my patches help? Good to known once we upgrade as well.
>
>  Jocke

There's good news and bad news.

The good news is that this appears to stop the driver from entering an infinite loop, which prevents the Linux system from locking up and never recovering.  So I'm willing to say we've made the behavior better.

The bad news is that once we get past this point, there is new bad behavior.  What is on top of this driver in our system is the RNDIS gadget driver communicating to a Laptop running Win10 -1809.  Everything appears to work fine with the Linux system until there is a USB disconnect.  After the disconnect, the Linux side appears to continue on just fine, but the Windows side doesn't seem to recognize the disconnect, which causes the USB driver on that side to hang forever and eventually blue screen the box.  This doesn't happen on all machines, just a select few.   I think we can isolate the behavior to a specific antivirus/security software driver that is inserting itself into the USB stack and filtering the disconnect message, but we're still proving that.

I'm about 90% certain this is a different problem and we can call this patchset good, at least for our test setup.  My only hesitation is if the Linux side is sending a set of responses that are confusing the Windows side (specifically this antivirus) or not.  I'd be content calling that a separate defect though and letting this one close up with that patchset.

Eugene