On 27.11.20 14:11, Jan Beulich wrote:
> On 25.11.2020 11:51, Juergen Gross wrote:
>> Two cpus entering evtchn_fifo_set_pending() for the same event channel
>> can race in case the first one gets interrupted after setting
>> EVTCHN_FIFO_PENDING and when the other one manages to set
>> EVTCHN_FIFO_LINKED before the first one is testing that bit. This can
>> lead to evtchn_check_pollers() being called before the event is put
>> properly into the queue, resulting eventually in the guest not seeing
>> the event pending and thus blocking forever afterwards.
>>
>> Note that commit 5f2df45ead7c1195 ("xen/evtchn: rework per event channel
>> lock") made the race just more obvious, while the fifo event channel
>> implementation had this race from the beginning when an unmask operation
>> was running in parallel with an event channel send operation.
> 
> I notice you've altered the Fixes: tag, but you still say "from
> the beginning" here.

Oh, indeed. Thanks for spotting.

> 
>> Using a spinlock for the per event channel lock is problematic due to
>> some paths needing to take the lock are called with interrupts off, so
>> the lock would need to disable interrupts, which in turn breaks some
>> use cases related to vm events.
> 
> This reads as if it got put here by mistake. May I suggest to start
> with "Using ... had turned out problematic ..." and then add
> something like "Therefore that lock was switched to an rw one"?

Fine with me.

> 
>> For avoiding this race the queue locking in evtchn_fifo_set_pending()
>> needs to be reworked to cover the test of EVTCHN_FIFO_PENDING,
>> EVTCHN_FIFO_MASKED and EVTCHN_FIFO_LINKED, too. Additionally when an
>> event channel needs to change queues both queues need to be locked
>> initially.
> 
> "... in order to avoid having a window with no lock held at all"?

Yes, this is better.

> 
>> @@ -204,17 +175,67 @@ static void evtchn_fifo_set_pending(struct vcpu *v, struct evtchn *evtchn)
>>           return;
>>       }
>>   
>> +    /*
>> +     * Lock all queues related to the event channel (in case of a queue change
>> +     * this might be two).
>> +     * It is mandatory to do that before setting and testing the PENDING bit
>> +     * and to hold the current queue lock until the event has put into the
> 
> "has been" or "was" I think?

"has been".

> 
> With adjustments along these lines (which I guess could again be
> done while committing) or reasons against supplied
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

> 
> One aspect which I wonder whether it wants adding to the description
> is that this change makes a bad situation worse: Back at the time
> per-channel locks were added to avoid the bottleneck of the per-
> domain event lock. While a per-queue lock's scope at least isn't the
> entire domain, their use on the send path still has been preventing
> full parallelism here. And this patch widens the lock holding region.

As the description already says that additional operations are to be
guarded by the lock I think it is rather clear that the lock holding
region is being widened.

OTOH I wouldn't reject such an addition to the commit message if you
think it is required.


Juergen