Re: [PATCH] usb: core: Solve race condition in usb_kill_anchored_urbs

From: Eli Billauer <eli.billauer@gmail.com>
To: Oliver Neukum <oneukum@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>,
	gregkh@linuxfoundation.org, linux-usb@vger.kernel.org,
	hdegoede@redhat.com
Subject: Re: [PATCH] usb: core: Solve race condition in usb_kill_anchored_urbs
Date: Tue, 28 Jul 2020 12:47:30 +0300
Message-ID: <5F1FF432.9060509@gmail.com>
In-Reply-To: <1595885364.13408.44.camel@suse.de>

Oliver, Alan,

I understand that there's a disagreement about what is and isn't allowed
with the anchor API. In practice, that means I have to assume that driver
programmers will go either way. I have to admit that my own view was that
a driver must proactively make sure it doesn't submit further URBs to an
anchor while usb_kill_anchored_urbs() is running, through completers or
otherwise. I wrote the current patch accordingly.

To make things trickier, a driver might rely (correctly or not) on
usb_kill_urb() ensuring that a resubmission of a URB by the completer,
while usb_kill_urb() is killing it, will fail. Or at least that's what
the description of this function says.

And once again, the resubmitted URB will remain untouched if the said
race condition occurs. So a driver's programmer who relied on
usb_kill_urb() to prevent the resubmission might get the impression,
when testing the driver, that everything was done correctly. The kernel
panic will then happen rarely, and out of the programmer's sight.
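To make the two cases concrete, here is a small userspace toy model
(plain C, not kernel code). All toy_* names and fields are invented for
illustration. The only thing it borrows from the real code is the idea
that submission fails with -EPERM while usb_kill_urb() is running on
that URB, as the function's description says.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for struct urb; the fields are illustrative only. */
struct toy_urb {
    int reject;     /* > 0 while a kill is in progress on this URB */
    int in_flight;  /* 1 if the URB is currently submitted */
};

/* Submission is refused only while a kill is in progress. */
static int toy_submit(struct toy_urb *u)
{
    if (u->reject > 0)
        return -EPERM;
    u->in_flight = 1;
    return 0;
}

/* First half of a kill: block resubmission and cancel the URB. */
static void toy_kill_begin(struct toy_urb *u)
{
    u->reject++;
    u->in_flight = 0;
}

/* Second half of a kill: allow submissions again. */
static void toy_kill_end(struct toy_urb *u)
{
    assert(u->reject > 0);  /* must pair with toy_kill_begin() */
    u->reject--;
}
```

If the completer calls toy_submit() between toy_kill_begin() and
toy_kill_end(), it gets -EPERM. If the kill loop never saw the URB
because it had already been unanchored, nothing ever raised reject, so
toy_submit() returns 0 and the URB is back in flight without anyone
noticing.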

Writing an additional API without this problem is beyond the scope of
this discussion. I'm focused on resolving the problem with the current
one. The existing API must be safe to use, even if it's planned to be
phased out.

Given the discussion so far, I realized that the resubmission-by-completer
case must be handled properly as well. So I suggest modifying the patch
to something like

do {
     spin_lock_irq(&anchor->lock);
     while (!list_empty(&anchor->urb_list)) {
         /* URB kill loop */
     }
     spin_unlock_irq(&anchor->lock);
} while (unlikely(!usb_anchor_check_wakeup(anchor)));

The do-while loop will almost never make any difference. In the rare
event of the said race condition, however, it will spin, much like a
waiting spinlock, for as long as the completer callback executes.

And if the completer submitted a URB, it will be removed as well this
way. Recall that this loops only in the event of a race condition, so it
will NOT play cat-and-mouse with the completer callback, but will
finish up quickly.
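As a rough illustration of why the outer loop catches the resubmitted
URB, here is a single-threaded userspace sketch (not kernel code). The
toy_* names are made up, and "time passing" is simulated by advancing a
pretend completer one step per outer iteration. It assumes the wakeup
check means "no completer is running and the anchor's list is empty".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct usb_anchor; the fields are illustrative. */
struct toy_anchor {
    int anchored;         /* number of URBs on the anchor's list */
    int suspend_wakeups;  /* > 0 while a completer callback runs */
};

/* Models the wakeup check: no completer running and list empty. */
static bool toy_check_wakeup(const struct toy_anchor *a)
{
    return a->suspend_wakeups == 0 && a->anchored == 0;
}

/* Pretend completer: on its last step it resubmits one URB and returns. */
static void toy_completer_step(struct toy_anchor *a, int *steps_left)
{
    if (*steps_left == 0)
        return;  /* completer already finished */
    if (--*steps_left == 0) {
        assert(a->suspend_wakeups > 0);
        a->anchored++;         /* the resubmitted URB gets anchored */
        a->suspend_wakeups--;  /* completer callback has returned */
    }
}

/* Kill loop wrapped in the suggested do-while; returns URBs killed. */
static int toy_kill_anchored(struct toy_anchor *a, int *steps_left)
{
    int killed = 0;

    do {
        while (a->anchored > 0) {  /* the "URB kill loop" */
            a->anchored--;
            killed++;
        }
        toy_completer_step(a, steps_left);  /* simulated time passing */
    } while (!toy_check_wakeup(a));

    return killed;
}
```

Starting with two anchored URBs and a third one in the middle of its
completer, the inner loop alone would stop after killing two. The outer
loop keeps going until the completer has returned, and then picks up
the URB it resubmitted.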

And I've dropped the WARN(): If some people consider resubmission of a
URB to be OK, even while usb_kill_anchored_urbs() is running, no noise
should be made if that causes a rare but tricky situation.

And since I'm at it, I'll make the same change to 
usb_poison_anchored_urbs(), which suffers from exactly the same problem.

What do you think?

Thanks,
    Eli

On 28/07/20 00:29, Oliver Neukum wrote:
> Am Montag, den 27.07.2020, 10:43 -0400 schrieb Alan Stern:
>    
>> On Mon, Jul 27, 2020 at 03:58:05PM +0200, Oliver Neukum wrote:
>>      
>>> Am Montag, den 27.07.2020, 14:27 +0300 schrieb Eli Billauer:
>>>        
>>>> Hello, Oliver.
>>>>
>>>> On 27/07/20 13:14, Oliver Neukum wrote:
>>>>          
>>>>> That however is really a kludge we cannot have in usbcore.
>>>>> I am afraid as is the patch should _not_ be applied.
>>>>>
>>>>>            
>>>> Could you please explain further why the suggested patch is unsuitable?
>>>>          
>>> Hi,
>>>
>>> certainly.
>>>
>>> 1. timeouts are generally a bad idea, especially if the timeout does
>>> not come out of a spec.
>>>
>>> 2. That involves quoting you:
>>>
>>> Alternatively, if the driver submits URBs to the same anchor while
>>> usb_kill_anchored_urbs() is called, this timeout might be reached. This
>>>        
>> That would be a bug in the driver, though.  In such a situation, a WARN
>> is worth having.
>>      
> Well, it is an inherent race, certainly. You can do it, though. It is
> debatable whether it would ever make sense. Yet it is not a bug in the
> sense of, for example, writing beyond the end of a buffer or submitting
> an URB twofold.
>
>    
>>> could happen, for example, if the completer function that ran in the
>>> racy situation resubmits the URB. If that situation isn't cleared within
>>> 1000ms, it means that there's a URB in the system that the driver isn't
>>> aware of. Maybe that situation is worth more than a WARN.
>>>
>>> That is an entirely valid use case. And a bulk URB may take a potentially
>>> unbounded time to complete.
>>>        
>> It is _not_ a valid use case.  Since usb_kill_anchored_urbs() doesn't
>> specify whether it will kill URBs that are added to the anchor after it
>> is called (and before it returns), a driver that anchors URBs at such a
>> time is buggy.
>>      
> Yes, if you depend on it. Here we are getting into technicalities.
> The thing is that we are getting into areas where we should not need to
> go if the API were optimal.
>
> What drivers really want is a way to say, kill this group of URBs and
> make sure they stay dead no matter what.
>
>    
>> Maybe this should be mentioned in the kerneldoc for the routine: Drivers
>> must not add URBs to the anchor while the routine is running.
>>      
> True, yet this defeats one of the aims of the API.
>
>    
>>> My failure in this case is simply overengineering.
>>> If this line:
>>>
>>>          usb_unanchor_urb(urb);
>>>
>>> In __usb_hcd_giveback_urb(struct urb *urb) weren't there, the issue
>>> would not exist. I misdesigned the API in automatically unanchoring
>>> a completing URB.
>>> Simply removing it now is no longer possible, so we need to come up with
>>> a more complex solution.
>>>        
>> Given that this timeout-based API is already present and being used in a
>> separate context, I don't see anything wrong with using it here as well.
>>
>>      
> It is unnecessary and results in a much less useful API.
> The true error in its design is that it unconditionally unanchors the
> URBs it gives back. Stop doing that and it becomes much better.
>
> 	Regards
> 		Oliver
>
>
>