On 08.02.21 14:09, Julien Grall wrote: > Hi Juergen, > > On 08/02/2021 12:31, Jürgen Groß wrote: >> On 08.02.21 13:16, Julien Grall wrote: >>> >>> >>> On 08/02/2021 12:14, Jürgen Groß wrote: >>>> On 08.02.21 11:40, Julien Grall wrote: >>>>> Hi Juergen, >>>>> >>>>> On 08/02/2021 10:22, Jürgen Groß wrote: >>>>>> On 08.02.21 10:54, Julien Grall wrote: >>>>>>> ... I don't really see how the difference matter here. The idea >>>>>>> is to re-use what's already existing rather than trying to >>>>>>> re-invent the wheel with an extra lock (or whatever we can come up). >>>>>> >>>>>> The difference is that the race is occurring _before_ any IRQ is >>>>>> involved. So I don't see how modification of IRQ handling would help. >>>>> >>>>> Roughly our current IRQ handling flow (handle_eoi_irq()) looks like: >>>>> >>>>> if ( irq in progress ) >>>>> { >>>>>    set IRQS_PENDING >>>>>    return; >>>>> } >>>>> >>>>> do >>>>> { >>>>>    clear IRQS_PENDING >>>>>    handle_irq() >>>>> } while (IRQS_PENDING is set) >>>>> >>>>> IRQ handling flow like handle_fasteoi_irq() looks like: >>>>> >>>>> if ( irq in progress ) >>>>>    return; >>>>> >>>>> handle_irq() >>>>> >>>>> The latter flow would catch "spurious" interrupt and ignore them. >>>>> So it would handle nicely the race when changing the event affinity. >>>> >>>> Sure? Isn't "irq in progress" being reset way before our "lateeoi" is >>>> issued, thus having the same problem again? >>> >>> Sorry I can't parse this. >> >> handle_fasteoi_irq() will do nothing "if ( irq in progress )". When is >> this condition being reset again in order to be able to process another >> IRQ? > It is reset after the handler has been called. See handle_irq_event(). Right. And for us this is too early, as we want the next IRQ being handled only after we have called xen_irq_lateeoi(). > >> I believe this will be the case before our "lateeoi" handling is >> becoming active (more precise: when our IRQ handler is returning to >> handle_fasteoi_irq()), resulting in the possibility of the same race we >> are experiencing now. > > I am a bit confused what you mean by "lateeoi" handling is becoming > active. Can you clarify? See above: the next call of the handler should be allowed only after xen_irq_lateeoi() for the IRQ has been called. If the handler is being called earlier we have the race resulting in the WARN() splats. > Note that are are other IRQ flows existing. We should have a look at > them before trying to fix thing ourself. Fine with me, but it either needs to fit all use cases (interdomain, IPI, real interrupts) or we need to have a per-type IRQ flow. I think we should fix the issue locally first, then we can start to do a thorough rework planning. Its not as if the needed changes with the current flow would be so huge, and I'd really like to have a solution rather sooner than later. Changing the IRQ flow might have other side effects which need to be excluded by thorough testing. > Although, the other issue I can see so far is handle_irq_for_port() will > update info->{eoi_cpu, irq_epoch, eoi_time} without any locking. But it > is not clear this is what you mean by "becoming active". As long as a single event can't be handled on multiple cpus at the same time, there is no locking needed. Juergen