* [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. @ 2015-01-12 16:45 Konrad Rzeszutek Wilk 2015-01-12 17:27 ` Sander Eikelenboom ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-01-12 16:45 UTC (permalink / raw) To: jbeulich, andrew.cooper3, xen-devel, linux, malcolm.crossley Cc: Konrad Rzeszutek Wilk There is a race when we clear the STATE_SCHED in the softirq - which allows the 'raise_softirq_for' to progress and schedule another dpci. During that time the other CPU could receive an interrupt, call 'raise_softirq_for', and put the dpci on its per-cpu list. There would be two 'dpci_softirq' instances running at the same time (on different CPUs) where the dpci state is STATE_RUN (and STATE_SCHED is cleared). This ends up hitting: if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) BUG() Instead of that, put the dpci back on the per-cpu list to deal with later. The reason we can hit this is when an interrupt affinity is set over multiple CPUs. Another potential fix would be to add a guard in raise_softirq_for to check for the 'STATE_RUN' bit being set and not schedule the dpci until that bit has been cleared. Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- xen/drivers/passthrough/io.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c index ae050df..9b77334 100644 --- a/xen/drivers/passthrough/io.c +++ b/xen/drivers/passthrough/io.c @@ -804,7 +804,17 @@ static void dpci_softirq(void) d = pirq_dpci->dom; smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */ if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) - BUG(); + { + unsigned long flags; + + /* Put back on the list and retry. */ + local_irq_save(flags); + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); + local_irq_restore(flags); + + raise_softirq(HVM_DPCI_SOFTIRQ); + continue; + } /* * The one who clears STATE_SCHED MUST refcount the domain. */ -- 2.1.0 ^ permalink raw reply related [flat|nested] 27+ messages in thread
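The interleaving described in the patch above can be replayed in a few lines of stand-alone C. This is a hedged, single-threaded sketch - `test_and_set_bit`/`test_and_clear_bit` here are toy, non-atomic stand-ins for Xen's bitops, and `replay_race` is a made-up name - showing why the second softirq's `test_and_set_bit(STATE_RUN, ...)` observes the bit already set and would hit the `BUG()`:

```c
#include <assert.h>

enum { STATE_SCHED, STATE_RUN };

/* Toy, non-atomic stand-ins for Xen's bitops - sufficient for a
 * single-threaded replay of the interleaving. They return the old
 * value of the bit, like the real ones. */
static int test_and_set_bit(int bit, unsigned long *state)
{
    int old = (*state >> bit) & 1;
    *state |= 1UL << bit;
    return old;
}

static int test_and_clear_bit(int bit, unsigned long *state)
{
    int old = (*state >> bit) & 1;
    *state &= ~(1UL << bit);
    return old;
}

/* Replay: CPU1 runs dpci_softirq and clears STATE_SCHED while still
 * in STATE_RUN; CPU2 then schedules and runs the same dpci. Returns
 * what CPU2's test_and_set_bit(STATE_RUN) sees. */
int replay_race(void)
{
    unsigned long state = 1UL << STATE_SCHED;      /* dpci queued on CPU1 */

    /* CPU1: dpci_softirq dequeues the dpci and enters STATE_RUN. */
    assert(!test_and_set_bit(STATE_RUN, &state));
    assert(test_and_clear_bit(STATE_SCHED, &state));
    /* CPU1 is still inside hvm_dirq_assist; STATE_RUN stays set. */

    /* CPU2: an interrupt arrives; raise_softirq_for succeeds because
     * STATE_SCHED is clear again, so the dpci lands on CPU2's list. */
    assert(!test_and_set_bit(STATE_SCHED, &state));

    /* CPU2: dpci_softirq picks the dpci up - STATE_RUN is still set,
     * so this returns 1, which is the upstream BUG() condition. */
    return test_and_set_bit(STATE_RUN, &state);
}
```

With the patch above, a return of 1 leads to a requeue instead of a `BUG()`.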
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-01-12 16:45 [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU Konrad Rzeszutek Wilk @ 2015-01-12 17:27 ` Sander Eikelenboom 2015-01-12 17:35 ` Konrad Rzeszutek Wilk 2015-01-13 10:20 ` Jan Beulich 2 siblings, 0 replies; 27+ messages in thread From: Sander Eikelenboom @ 2015-01-12 17:27 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: andrew.cooper3, malcolm.crossley, jbeulich, xen-devel Monday, January 12, 2015, 5:45:30 PM, you wrote: > There is race when we clear the STATE_SCHED in the softirq > - which allows the 'raise_softirq_for' to progress and > schedule another dpci. During that time the other CPU could > receive an interrupt and calls 'raise_softirq_for' and put > the dpci on its per-cpu list. There would be two 'dpci_softirq' > running at the same time (on different CPUs) where the > dpci state is STATE_RUN (and STATE_SCHED is cleared). This > ends up hitting: > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) > BUG() > Instead of that put the dpci back on the per-cpu list to deal > with later. > The reason we can get his with this is when an interrupt > affinity is set over multiple CPUs. > Another potential fix would be to add a guard in the raise_softirq_for > to check for 'STATE_RUN' bit being set and not schedule the > dpci until that bit has been cleared. > Reported-by: Sander Eikelenboom <linux@eikelenboom.it> > Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Thanks again for the quick fix Konrad ! Tested it for the last half hour and everything seems fine, so you can also stick on a tested-by if you like. 
-- Sander > --- > xen/drivers/passthrough/io.c | 12 +++++++++++- > 1 file changed, 11 insertions(+), 1 deletion(-) > diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c > index ae050df..9b77334 100644 > --- a/xen/drivers/passthrough/io.c > +++ b/xen/drivers/passthrough/io.c > @@ -804,7 +804,17 @@ static void dpci_softirq(void) > d = pirq_dpci->dom; > smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */ > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) > - BUG(); > + { > + unsigned long flags; > + > + /* Put back on the list and retry. */ > + local_irq_save(flags); > + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); > + local_irq_restore(flags); > + > + raise_softirq(HVM_DPCI_SOFTIRQ); > + continue; > + } > /* > * The one who clears STATE_SCHED MUST refcount the domain. > */ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-01-12 16:45 [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU Konrad Rzeszutek Wilk 2015-01-12 17:27 ` Sander Eikelenboom @ 2015-01-12 17:35 ` Konrad Rzeszutek Wilk 2015-01-13 10:26 ` Jan Beulich 2015-01-13 10:20 ` Jan Beulich 2 siblings, 1 reply; 27+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-01-12 17:35 UTC (permalink / raw) To: jbeulich, andrew.cooper3, xen-devel, linux, malcolm.crossley On Mon, Jan 12, 2015 at 11:45:30AM -0500, Konrad Rzeszutek Wilk wrote: > There is race when we clear the STATE_SCHED in the softirq > - which allows the 'raise_softirq_for' to progress and > schedule another dpci. During that time the other CPU could > receive an interrupt and calls 'raise_softirq_for' and put > the dpci on its per-cpu list. There would be two 'dpci_softirq' > running at the same time (on different CPUs) where the > dpci state is STATE_RUN (and STATE_SCHED is cleared). This > ends up hitting: > > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) > BUG() > > Instead of that put the dpci back on the per-cpu list to deal > with later. > > The reason we can get his with this is when an interrupt > affinity is set over multiple CPUs. > > Another potential fix would be to add a guard in the raise_softirq_for > to check for 'STATE_RUN' bit being set and not schedule the > dpci until that bit has been cleared. 
> > Reported-by: Sander Eikelenboom <linux@eikelenboom.it> > Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > --- > xen/drivers/passthrough/io.c | 12 +++++++++++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c > index ae050df..9b77334 100644 > --- a/xen/drivers/passthrough/io.c > +++ b/xen/drivers/passthrough/io.c > @@ -804,7 +804,17 @@ static void dpci_softirq(void) > d = pirq_dpci->dom; > smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */ > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) > - BUG(); > + { > + unsigned long flags; > + > + /* Put back on the list and retry. */ > + local_irq_save(flags); > + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); I chatted with Jan on IRC about this, and one worry is that if we add to our per-cpu list a 'dpci' that is running on another CPU - if the other CPU runs 'list_del' it will go BOOM. However, I am not sure I can come up with a scenario in which this will be triggered - as by the time we get to checking the STATE_RUN condition the list removal has already been done. So adding it on the per-cpu list is OK - and since STATE_SCHED is set, it guards against double-list addition:

CPU1: softirq_dpci: list_del() from per-cpu-list
CPU1:               test-and-set(STATE_RUN)
CPU1:               test-and-clear(STATE_SCHED)
CPU1:               ..
CPU2: raise_softirq: test-and-set(STATE_SCHED)
CPU2: softirq_dpci: list_del from per-cpu-list
CPU2:               if-test-and-set(STATE_RUN) fails: list_add_tail(..) continue
CPU3: raise_softirq: test-and-set(STATE_SCHED) [fails, so no adding]
CPU2: .. softirq_dpci: list_del
CPU2:               if-test-and-set(STATE_RUN) fails: list_add_tail(..) [OK, we did the list_del] continue..
CPU1: clear_bit(STATE_RUN)
CPU2: softirq_dpci: list_del()
CPU2:               test-and-set(STATE_RUN)
CPU2:               test-and-clear(STATE_SCHED)

> + local_irq_restore(flags); > + > + raise_softirq(HVM_DPCI_SOFTIRQ); > + continue; > + } > /* > * The one who clears STATE_SCHED MUST refcount the domain. > */ > -- > 2.1.0 > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-01-12 17:35 ` Konrad Rzeszutek Wilk @ 2015-01-13 10:26 ` Jan Beulich 0 siblings, 0 replies; 27+ messages in thread From: Jan Beulich @ 2015-01-13 10:26 UTC (permalink / raw) To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux >>> On 12.01.15 at 18:35, <konrad.wilk@oracle.com> wrote: >> --- a/xen/drivers/passthrough/io.c >> +++ b/xen/drivers/passthrough/io.c >> @@ -804,7 +804,17 @@ static void dpci_softirq(void) >> d = pirq_dpci->dom; >> smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */ >> if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) >> - BUG(); >> + { >> + unsigned long flags; >> + >> + /* Put back on the list and retry. */ >> + local_irq_save(flags); >> + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); > > I chatted with Jan on IRC this, and one worry is that if we add > on our per-cpu list an 'dpci' that is running on another CPU - if > the other CPU runs 'list_del' it will go BOOM. Looking at this again, I don't think this particular case is a concern: The list_del() happens before setting STATE_RUN and clearing STATE_SCHED, and the race window only opens (and the potential list_add() can only happen) after this CPU cleared STATE_SCHED. Jan ^ permalink raw reply [flat|nested] 27+ messages in thread
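Jan's ordering argument above can be made concrete with a small, hedged C model (toy structures of my own, not Xen's): because `dpci_softirq` always does the `list_del()` before attempting `test_and_set_bit(STATE_RUN)`, any requeue done after observing STATE_RUN set necessarily happens on an already-dequeued entry, so no stale `list_del` can follow it:

```c
#include <assert.h>

/* Toy model: 'on_list' stands in for per-cpu list membership. */
struct toy_dpci {
    int on_list;
    int state_run;
    int state_sched;
};

/* dpci_softirq step 1: always dequeue first. */
static void softirq_dequeue(struct toy_dpci *d)
{
    assert(d->on_list);
    d->on_list = 0;            /* list_del() */
}

/* dpci_softirq step 2: try to enter STATE_RUN; on failure, requeue. */
static int softirq_try_run(struct toy_dpci *d)
{
    if (d->state_run) {        /* test_and_set_bit(STATE_RUN) failed */
        assert(!d->on_list);   /* list_del already happened: safe */
        d->on_list = 1;        /* list_add_tail() - no double add */
        return 0;
    }
    d->state_run = 1;
    return 1;
}

/* CPU2 dequeues and requeues while CPU1 still holds STATE_RUN.
 * Returns 1 when the dpci ends up queued exactly once. */
int model_requeue_is_safe(void)
{
    struct toy_dpci d = { .on_list = 1, .state_run = 1, .state_sched = 1 };

    softirq_dequeue(&d);
    if (!softirq_try_run(&d))
        return d.on_list;
    return -1;                 /* not reachable in this scenario */
}
```

The internal asserts encode the invariant: a failing STATE_RUN test is only ever seen on an entry this CPU has just taken off its own list.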
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-01-12 16:45 [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU Konrad Rzeszutek Wilk 2015-01-12 17:27 ` Sander Eikelenboom 2015-01-12 17:35 ` Konrad Rzeszutek Wilk @ 2015-01-13 10:20 ` Jan Beulich 2015-01-23 1:44 ` Konrad Rzeszutek Wilk 2015-02-02 14:29 ` Konrad Rzeszutek Wilk 2 siblings, 2 replies; 27+ messages in thread From: Jan Beulich @ 2015-01-13 10:20 UTC (permalink / raw) To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux >>> On 12.01.15 at 17:45, <konrad.wilk@oracle.com> wrote: > There is race when we clear the STATE_SCHED in the softirq > - which allows the 'raise_softirq_for' to progress and > schedule another dpci. During that time the other CPU could > receive an interrupt and calls 'raise_softirq_for' and put > the dpci on its per-cpu list. There would be two 'dpci_softirq' > running at the same time (on different CPUs) where the > dpci state is STATE_RUN (and STATE_SCHED is cleared). This > ends up hitting: > > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) > BUG() > > Instead of that put the dpci back on the per-cpu list to deal > with later. > > The reason we can get his with this is when an interrupt > affinity is set over multiple CPUs. > > Another potential fix would be to add a guard in the raise_softirq_for > to check for 'STATE_RUN' bit being set and not schedule the > dpci until that bit has been cleared. I indeed think this should be investigated, because it would make explicit what ... > --- a/xen/drivers/passthrough/io.c > +++ b/xen/drivers/passthrough/io.c > @@ -804,7 +804,17 @@ static void dpci_softirq(void) > d = pirq_dpci->dom; > smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */ > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) > - BUG(); > + { > + unsigned long flags; > + > + /* Put back on the list and retry. 
*/ > + local_irq_save(flags); > + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); > + local_irq_restore(flags); > + > + raise_softirq(HVM_DPCI_SOFTIRQ); > + continue; > + } ... this does implicitly - spin until the bad condition cleared. Additionally I think it should be considered whether the bitmap approach of interpreting ->state is the right one, and we don't instead want a clean 3-state (idle, sched, run) model. Jan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-01-13 10:20 ` Jan Beulich @ 2015-01-23 1:44 ` Konrad Rzeszutek Wilk 2015-01-23 9:37 ` Jan Beulich 2015-02-02 14:29 ` Konrad Rzeszutek Wilk 1 sibling, 1 reply; 27+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-01-23 1:44 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux On Tue, Jan 13, 2015 at 10:20:00AM +0000, Jan Beulich wrote: > >>> On 12.01.15 at 17:45, <konrad.wilk@oracle.com> wrote: > > There is race when we clear the STATE_SCHED in the softirq > > - which allows the 'raise_softirq_for' to progress and > > schedule another dpci. During that time the other CPU could > > receive an interrupt and calls 'raise_softirq_for' and put > > the dpci on its per-cpu list. There would be two 'dpci_softirq' > > running at the same time (on different CPUs) where the > > dpci state is STATE_RUN (and STATE_SCHED is cleared). This > > ends up hitting: > > > > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) > > BUG() > > > > Instead of that put the dpci back on the per-cpu list to deal > > with later. > > > > The reason we can get his with this is when an interrupt > > affinity is set over multiple CPUs. > > > > Another potential fix would be to add a guard in the raise_softirq_for > > to check for 'STATE_RUN' bit being set and not schedule the > > dpci until that bit has been cleared. > > I indeed think this should be investigated, because it would make > explicit what ... > > > --- a/xen/drivers/passthrough/io.c > > +++ b/xen/drivers/passthrough/io.c > > @@ -804,7 +804,17 @@ static void dpci_softirq(void) > > d = pirq_dpci->dom; > > smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */ > > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) > > - BUG(); > > + { > > + unsigned long flags; > > + > > + /* Put back on the list and retry. 
*/ > > + local_irq_save(flags); > > + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); > > + local_irq_restore(flags); > > + > > + raise_softirq(HVM_DPCI_SOFTIRQ); > > + continue; > > + } > > ... this does implicitly - spin until the bad condition cleared. Here is a patch that does this. I don't yet have a setup to test the failing scenario (working on that). I removed the part in the softirq because with this patch I cannot see a way it would ever get there (in the softirq hitting the BUG). >From e53185762ce184e835121ab4fbf7897568c0cdc4 Mon Sep 17 00:00:00 2001 From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Date: Mon, 12 Jan 2015 11:35:03 -0500 Subject: [PATCH] dpci: Don't schedule the dpci when it is in STATE_RUN on any per-cpu list. There is a race when we clear the STATE_SCHED in the softirq - which allows the 'raise_softirq_for' to progress and schedule another dpci. During that time the other CPU could receive an interrupt, call 'raise_softirq_for', and put the dpci on its per-cpu list. There would be two 'dpci_softirq' instances running at the same time (on different CPUs) where the dpci state is STATE_RUN (and STATE_SCHED is cleared). This ends up hitting: if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) BUG() The reason we can hit this is when an interrupt affinity is set over multiple CPUs. Instead of the BUG() we could put the dpci back on the per-cpu list to deal with later (when the softirqs are activated again). However, putting the 'dpci' back on the per-cpu list basically amounts to spinning until the bad condition has cleared, which is not nice. Hence this patch adds a guard in raise_softirq_for to check for the 'STATE_RUN' bit being set and not schedule the dpci until that bit is cleared. We also moved the insertion of the dpci on the list into the case statement in 'raise_softirq_for' to make it easier to read. 
Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- xen/drivers/passthrough/io.c | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c index ae050df..802127a 100644 --- a/xen/drivers/passthrough/io.c +++ b/xen/drivers/passthrough/io.c @@ -64,16 +64,24 @@ static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) { unsigned long flags; - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) + switch ( cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED) ) + { + case (1 << STATE_SCHED): + case (1 << STATE_RUN): + case (1 << STATE_RUN) | (1 << STATE_SCHED): return; + case 0: + get_knownalive_domain(pirq_dpci->dom); - get_knownalive_domain(pirq_dpci->dom); + local_irq_save(flags); + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); + local_irq_restore(flags); - local_irq_save(flags); - list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); - local_irq_restore(flags); - - raise_softirq(HVM_DPCI_SOFTIRQ); + raise_softirq(HVM_DPCI_SOFTIRQ); + break; + default: + BUG(); + } } /* -- 2.1.0 > > Additionally I think it should be considered whether the bitmap > approach of interpreting ->state is the right one, and we don't > instead want a clean 3-state (idle, sched, run) model. Could you elaborate a bit more please? As in three different unsigned int (or bool_t) that set in what state we are in? > > Jan > ^ permalink raw reply related [flat|nested] 27+ messages in thread
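The cmpxchg dispatch in the patch above can be exercised stand-alone. Below is a hedged C sketch using GCC's `__sync_val_compare_and_swap` in place of Xen's `cmpxchg`; `toy_raise_softirq_for` is a made-up name and the list/refcount work is reduced to a return value. Only state 0 transitions to scheduled - any non-zero state (including STATE_RUN alone, the guard under discussion) swallows the raise:

```c
#include <assert.h>

enum { STATE_SCHED, STATE_RUN };

/* Returns 1 if the dpci was scheduled (the caller would then do the
 * list_add_tail and take a domain reference), 0 if the raise was
 * swallowed because the dpci is already scheduled and/or running. */
int toy_raise_softirq_for(unsigned long *state)
{
    switch (__sync_val_compare_and_swap(state, 0, 1UL << STATE_SCHED))
    {
    case 1UL << STATE_SCHED:
    case 1UL << STATE_RUN:
    case (1UL << STATE_RUN) | (1UL << STATE_SCHED):
        return 0;              /* already scheduled and/or running */
    case 0:
        return 1;              /* we won: queue it and refcount */
    default:
        assert(0);             /* no other bit pattern is legal */
        return 0;
    }
}
```

Note the behavioral consequence: a raise arriving while the state is purely STATE_RUN returns 0 and nothing re-queues the dpci - which is what the follow-up discussion probes.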
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-01-23 1:44 ` Konrad Rzeszutek Wilk @ 2015-01-23 9:37 ` Jan Beulich 2015-01-23 14:54 ` Konrad Rzeszutek Wilk 2015-03-17 17:44 ` Konrad Rzeszutek Wilk 0 siblings, 2 replies; 27+ messages in thread From: Jan Beulich @ 2015-01-23 9:37 UTC (permalink / raw) To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux >>> On 23.01.15 at 02:44, <konrad.wilk@oracle.com> wrote: > Here is a patch that does this. I don't yet have an setup to test > the failing scenario (working on that). I removed the part in > the softirq because with this patch I cannot see a way it would > ever get there (in the softirq hitting the BUG). Hmm, but how do you now schedule the second instance that needed ... > --- a/xen/drivers/passthrough/io.c > +++ b/xen/drivers/passthrough/io.c > @@ -64,16 +64,24 @@ static void raise_softirq_for(struct hvm_pirq_dpci > *pirq_dpci) > { > unsigned long flags; > > - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) > + switch ( cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED) ) > + { > + case (1 << STATE_SCHED): > + case (1 << STATE_RUN): ... in this case? >> Additionally I think it should be considered whether the bitmap >> approach of interpreting ->state is the right one, and we don't >> instead want a clean 3-state (idle, sched, run) model. > > Could you elaborate a bit more please? As in three different unsigned int > (or bool_t) that set in what state we are in? An enum { STATE_IDLE, STATE_SCHED, STATE_RUN }. Especially if my comment above turns out to be wrong, you'd have no real need for the SCHED and RUN flags to be set at the same time. Jan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-01-23 9:37 ` Jan Beulich @ 2015-01-23 14:54 ` Konrad Rzeszutek Wilk 2015-03-17 17:44 ` Konrad Rzeszutek Wilk 1 sibling, 0 replies; 27+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-01-23 14:54 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux On Fri, Jan 23, 2015 at 09:37:53AM +0000, Jan Beulich wrote: > >>> On 23.01.15 at 02:44, <konrad.wilk@oracle.com> wrote: > > Here is a patch that does this. I don't yet have an setup to test > > the failing scenario (working on that). I removed the part in > > the softirq because with this patch I cannot see a way it would > > ever get there (in the softirq hitting the BUG). > > Hmm, but how do you now schedule the second instance that needed ... > > > --- a/xen/drivers/passthrough/io.c > > +++ b/xen/drivers/passthrough/io.c > > @@ -64,16 +64,24 @@ static void raise_softirq_for(struct hvm_pirq_dpci > > *pirq_dpci) > > { > > unsigned long flags; > > > > - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) > > + switch ( cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED) ) > > + { > > + case (1 << STATE_SCHED): > > + case (1 << STATE_RUN): > > ... in this case? We do need that. I was under the false understanding that the old code would just drop it if it was running. However the old code (tasklet only) followed the logic that: a) If it is scheduled (on the list), then don't put it there. b) If it is running (t->running), then don't put it there. However once the function is done (in the do_tasklet_work) we detect that we tried to schedule it (t->scheduled_on is set to non-zero value) - and we put it back on the list. I need to think about this some more. > > >> Additionally I think it should be considered whether the bitmap > >> approach of interpreting ->state is the right one, and we don't > >> instead want a clean 3-state (idle, sched, run) model. 
> > > > Could you elaborate a bit more please? As in three different unsigned int > > (or bool_t) that set in what state we are in? > > An enum { STATE_IDLE, STATE_SCHED, STATE_RUN }. Especially > if my comment above turns out to be wrong, you'd have no real > need for the SCHED and RUN flags to be set at the same time. Seems we do need these four states: scheduled; running; scheduling while running; and idle. > > Jan > ^ permalink raw reply [flat|nested] 27+ messages in thread
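A four-state model of the kind Konrad describes could be sketched as below. This is my reconstruction of the idea, not code from the thread - the names are invented, and a real implementation would need the transitions done with cmpxchg loops rather than plain stores. The extra "scheduled while running" state is what lets a raise that arrives mid-`hvm_dirq_assist` be remembered instead of dropped:

```c
#include <assert.h>

enum dpci_state {
    DPCI_IDLE,
    DPCI_SCHED,            /* queued on a per-cpu list */
    DPCI_RUN,              /* hvm_dirq_assist in progress */
    DPCI_SCHED_WHILE_RUN   /* raise arrived while running: redo */
};

/* Raise side: returns 1 if the caller must queue the dpci. */
int toy_raise(enum dpci_state *s)
{
    switch (*s)
    {
    case DPCI_IDLE:
        *s = DPCI_SCHED;
        return 1;
    case DPCI_RUN:
        *s = DPCI_SCHED_WHILE_RUN;  /* remember the raise, don't drop it */
        return 0;
    default:
        return 0;                   /* already queued */
    }
}

/* Completion side, run by the softirq after hvm_dirq_assist:
 * returns 1 if the dpci must be re-queued. */
int toy_finish_run(enum dpci_state *s)
{
    if (*s == DPCI_SCHED_WHILE_RUN) {
        *s = DPCI_SCHED;
        return 1;
    }
    *s = DPCI_IDLE;
    return 0;
}
```

In this shape the raise is never lost and the softirq never sees RUN set on a freshly dequeued entry, at the cost of one extra state.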
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-01-23 9:37 ` Jan Beulich 2015-01-23 14:54 ` Konrad Rzeszutek Wilk @ 2015-03-17 17:44 ` Konrad Rzeszutek Wilk 2015-03-17 22:16 ` Sander Eikelenboom 2015-03-18 7:41 ` Jan Beulich 1 sibling, 2 replies; 27+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-03-17 17:44 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux > >> Additionally I think it should be considered whether the bitmap > >> approach of interpreting ->state is the right one, and we don't > >> instead want a clean 3-state (idle, sched, run) model. > > > > Could you elaborate a bit more please? As in three different unsigned int > > (or bool_t) that set in what state we are in? > > An enum { STATE_IDLE, STATE_SCHED, STATE_RUN }. Especially > if my comment above turns out to be wrong, you'd have no real > need for the SCHED and RUN flags to be set at the same time. I cobbled together what I believe is what you were thinking of. As you can see, to preserve the existing functionality - such as being able to schedule N interrupt injections for the N interrupts we might get - I modified '->masked' to be an atomic counter. The end result is that we can still live-lock. Unless we: - Drop on the floor the injection of N interrupts and just deliver at most one per VMX_EXIT (and not bother with interrupts arriving when we are in the VMX handler). - Alter the softirq code slightly - to have a variant which will only iterate once over the pending softirq bits per call. (so save a copy of the bitmap on the stack when entering the softirq handler - and use that. We could also xor it against the current one to catch any non-duplicate bits being set that we should deal with). Here is a compile-tested, but not run-time tested patch. 
>From e7d8bcd7c5d32c520554a4ad69c4716246036002 Mon Sep 17 00:00:00 2001 From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Date: Tue, 17 Mar 2015 13:31:52 -0400 Subject: [RFC PATCH] dpci: Switch to tristate instead of bitmap *TODO*: - Writeup. - Tests Suggested-by: Jan Beulich <jbuelich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- xen/drivers/passthrough/io.c | 140 ++++++++++++++++++++++++------------------- xen/include/xen/hvm/irq.h | 4 +- 2 files changed, 82 insertions(+), 62 deletions(-) diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c index ae050df..663e104 100644 --- a/xen/drivers/passthrough/io.c +++ b/xen/drivers/passthrough/io.c @@ -30,42 +30,28 @@ static DEFINE_PER_CPU(struct list_head, dpci_list); /* - * These two bit states help to safely schedule, deschedule, and wait until - * the softirq has finished. - * - * The semantics behind these two bits is as follow: - * - STATE_SCHED - whoever modifies it has to ref-count the domain (->dom). - * - STATE_RUN - only softirq is allowed to set and clear it. If it has - * been set hvm_dirq_assist will RUN with a saved value of the - * 'struct domain' copied from 'pirq_dpci->dom' before STATE_RUN was set. - * - * The usual states are: STATE_SCHED(set) -> STATE_RUN(set) -> - * STATE_SCHED(unset) -> STATE_RUN(unset). - * - * However the states can also diverge such as: STATE_SCHED(set) -> - * STATE_SCHED(unset) -> STATE_RUN(set) -> STATE_RUN(unset). That means - * the 'hvm_dirq_assist' never run and that the softirq did not do any - * ref-counting. - */ - -enum { - STATE_SCHED, - STATE_RUN -}; - -/* * This can be called multiple times, but the softirq is only raised once. - * That is until the STATE_SCHED state has been cleared. The state can be - * cleared by: the 'dpci_softirq' (when it has executed 'hvm_dirq_assist'), - * or by 'pt_pirq_softirq_reset' (which will try to clear the state before + * That is until state is in init. 
The state can be changed by: + * the 'dpci_softirq' (when it has executed 'hvm_dirq_assist'), + * or by 'pt_pirq_softirq_reset' (which will try to init the state before * the softirq had a chance to run). */ static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) { unsigned long flags; - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) + switch ( cmpxchg(&pirq_dpci->state, STATE_INIT, STATE_SCHED) ) + { + case STATE_RUN: + case STATE_SCHED: + /* + * The pirq_dpci->mapping has been incremented to let us know + * how many we have left to do. + */ return; + case STATE_INIT: + break; + } get_knownalive_domain(pirq_dpci->dom); @@ -85,7 +71,7 @@ static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) */ bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci) { - if ( pirq_dpci->state & ((1 << STATE_RUN) | (1 << STATE_SCHED)) ) + if ( pirq_dpci->state != STATE_INIT ) return 1; /* @@ -109,22 +95,22 @@ static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci) ASSERT(spin_is_locked(&d->event_lock)); - switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) ) + switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, STATE_INIT) ) { - case (1 << STATE_SCHED): + case STATE_SCHED: /* - * We are going to try to de-schedule the softirq before it goes in - * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'. + * We are going to try to de-schedule the softirq before it goes to + * running state. Whoever moves from sched MUST refcount the 'dom'. */ put_domain(d); /* fallthrough. */ - case (1 << STATE_RUN): - case (1 << STATE_RUN) | (1 << STATE_SCHED): + case STATE_RUN: + case STATE_INIT: /* - * The reason it is OK to reset 'dom' when STATE_RUN bit is set is due + * The reason it is OK to reset 'dom' when init or running is set is due * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom' - * in local variable before it sets STATE_RUN - and therefore will not - * dereference '->dom' which would crash. 
+ * in local variable before it sets state to running - and therefore + * will not dereference '->dom' which would crash. */ pirq_dpci->dom = NULL; break; @@ -135,7 +121,7 @@ static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci) * or never initialized it). Note that we hold the lock that * 'hvm_dirq_assist' could be spinning on. */ - pirq_dpci->masked = 0; + atomic_set(&pirq_dpci->masked, 0); } bool_t pt_irq_need_timer(uint32_t flags) @@ -149,7 +135,8 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci, if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT, &pirq_dpci->flags) ) { - pirq_dpci->masked = 0; + ASSERT(atomic_read(&pirq_dpci->masked) <= 1); + atomic_set(&pirq_dpci->masked, 0); pirq_dpci->pending = 0; pirq_guest_eoi(dpci_pirq(pirq_dpci)); } @@ -278,6 +265,8 @@ int pt_irq_create_bind( * As such on every 'pt_irq_create_bind' call we MUST set it. */ pirq_dpci->dom = d; + atomic_set(&pirq_dpci->masked, 0); + pirq_dpci->state = STATE_INIT; /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/ rc = pirq_guest_bind(d->vcpu[0], info, 0); if ( rc == 0 && pt_irq_bind->u.msi.gtable ) @@ -398,6 +387,8 @@ int pt_irq_create_bind( if ( pt_irq_need_timer(pirq_dpci->flags) ) init_timer(&pirq_dpci->timer, pt_irq_time_out, pirq_dpci, 0); /* Deal with gsi for legacy devices */ + atomic_set(&pirq_dpci->masked, 0); + pirq_dpci->state = STATE_INIT; rc = pirq_guest_bind(d->vcpu[0], info, share); if ( unlikely(rc) ) { @@ -576,6 +567,7 @@ bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci) if ( !dpci->flags && !pt_pirq_softirq_active(dpci) ) { dpci->dom = NULL; + dpci->state = STATE_INIT; return 1; } return 0; @@ -617,7 +609,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq) !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) ) return 0; - pirq_dpci->masked = 1; + atomic_inc(&pirq_dpci->masked); raise_softirq_for(pirq_dpci); return 1; } @@ -672,12 +664,13 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector) 
spin_unlock(&d->event_lock); } -static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci) +static bool_t hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci) { + int more; ASSERT(d->arch.hvm_domain.irq.dpci); spin_lock(&d->event_lock); - if ( test_and_clear_bool(pirq_dpci->masked) ) + while ( (more = !atomic_dec_and_test(&pirq_dpci->masked)) ) { struct pirq *pirq = dpci_pirq(pirq_dpci); const struct dev_intx_gsi_link *digl; @@ -687,17 +680,13 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci) send_guest_pirq(d, pirq); if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI ) - { - spin_unlock(&d->event_lock); - return; - } + break; } if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI ) { vmsi_deliver_pirq(d, pirq_dpci); - spin_unlock(&d->event_lock); - return; + break; } list_for_each_entry ( digl, &pirq_dpci->digl_list, list ) @@ -710,8 +699,7 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci) { /* for translated MSI to INTx interrupt, eoi as early as possible */ __msi_pirq_eoi(pirq_dpci); - spin_unlock(&d->event_lock); - return; + break; } /* @@ -725,6 +713,8 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci) set_timer(&pirq_dpci->timer, NOW() + PT_IRQ_TIME_OUT); } spin_unlock(&d->event_lock); + + return more; } static void __hvm_dpci_eoi(struct domain *d, @@ -781,7 +771,7 @@ unlock: } /* - * Note: 'pt_pirq_softirq_reset' can clear the STATE_SCHED before we get to + * Note: 'pt_pirq_softirq_reset' can move the state to init before we get to * doing it. If that is the case we let 'pt_pirq_softirq_reset' do ref-counting. 
*/ static void dpci_softirq(void) @@ -797,23 +787,53 @@ static void dpci_softirq(void) { struct hvm_pirq_dpci *pirq_dpci; struct domain *d; + bool_t again = 0; pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list); list_del(&pirq_dpci->softirq_list); d = pirq_dpci->dom; smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */ - if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) - BUG(); - /* - * The one who clears STATE_SCHED MUST refcount the domain. - */ - if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) ) + + switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, STATE_RUN) ) + { + case STATE_INIT: + /* pt_pirq_softirq_reset cleared it. Let it do the ref-counting. */ + continue; + case STATE_RUN: + again = 1; + /* Fall through. */ + case STATE_SCHED: + break; + } + if ( !again ) { - hvm_dirq_assist(d, pirq_dpci); + /* + * The one who changes sched to running MUST refcount the domain. + */ + again = hvm_dirq_assist(d, pirq_dpci); put_domain(d); + switch ( cmpxchg(&pirq_dpci->state, STATE_RUN, STATE_INIT) ) + { + case STATE_SCHED: + case STATE_INIT: + BUG(); + case STATE_RUN: + break; + } + } + if ( again ) + { + unsigned long flags; + + /* Put back on the list and retry. */ + local_irq_save(flags); + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); + local_irq_restore(flags); + + raise_softirq(HVM_DPCI_SOFTIRQ); + continue; } - clear_bit(STATE_RUN, &pirq_dpci->state); } } diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h index 3996f1f..48dbc51 100644 --- a/xen/include/xen/hvm/irq.h +++ b/xen/include/xen/hvm/irq.h @@ -93,8 +93,8 @@ struct hvm_irq_dpci { /* Machine IRQ to guest device/intx mapping. */ struct hvm_pirq_dpci { uint32_t flags; - unsigned int state; - bool_t masked; + enum { STATE_INIT, STATE_SCHED, STATE_RUN } state; + atomic_t masked; uint16_t pending; struct list_head digl_list; struct domain *dom; -- 2.1.0 > > Jan > ^ permalink raw reply related [flat|nested] 27+ messages in thread
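The '->masked' change in the patch above turns a boolean into a pending-interrupt count that `hvm_dirq_assist` drains. Here is a hedged, simplified C sketch of that drain's shape (C11 stdatomic stand-ins, not Xen's atomics, and not the patch's exact loop) - it shows why the counter delivers all N queued raises, and also why a producer that keeps incrementing can keep the loop from terminating, the live-lock Konrad mentions:

```c
#include <assert.h>
#include <stdatomic.h>

/* One hvm_dirq_assist-style drain: deliver once per pending raise.
 * Returns the number of deliveries performed. If another CPU keeps
 * incrementing 'masked' faster than we decrement it, this loop
 * never exits - the live-lock risk discussed in the thread. */
int toy_drain(atomic_int *masked)
{
    int delivered = 0;

    while (atomic_load(masked) > 0) {
        atomic_fetch_sub(masked, 1);
        delivered++;           /* send_guest_pirq() would go here */
    }
    return delivered;
}
```

A bounded variant (deliver at most K per softirq invocation and re-raise if the counter is still non-zero) would trade the live-lock for delivery latency, which is roughly the first alternative Konrad lists.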
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-03-17 17:44 ` Konrad Rzeszutek Wilk @ 2015-03-17 22:16 ` Sander Eikelenboom 2015-03-18 7:41 ` Jan Beulich 1 sibling, 0 replies; 27+ messages in thread From: Sander Eikelenboom @ 2015-03-17 22:16 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: andrew.cooper3, malcolm.crossley, Jan Beulich, xen-devel Tuesday, March 17, 2015, 6:44:54 PM, you wrote: >> >> Additionally I think it should be considered whether the bitmap >> >> approach of interpreting ->state is the right one, and we don't >> >> instead want a clean 3-state (idle, sched, run) model. >> > >> > Could you elaborate a bit more please? As in three different unsigned int >> > (or bool_t) that set in what state we are in? >> >> An enum { STATE_IDLE, STATE_SCHED, STATE_RUN }. Especially >> if my comment above turns out to be wrong, you'd have no real >> need for the SCHED and RUN flags to be set at the same time. > I cobbled together what I believe is what you were thinking of. > As you can see to preserve the existing functionality such as > being able to schedule N amount of interrupt injections > for the N interrupts we might get - I modified '->masked' > to be an atomic counter. > The end result is that we can still live-lock. Unless we: > - Drop on the floor the injection of N interrupts and > just deliver at max one per VMX_EXIT (and not bother > with interrupts arriving when we are in the VMX handler). > - Alter the softirq code slightly - to have a variant > which will only iterate once over the pending softirq > bits per call (so save a copy of the bitmap on the > stack when entering the softirq handler - and use that. > We could also xor it against the current one to catch any > non-duplicate bits being set that we should deal with). > Here is the compile-tested, but not run-time tested patch.
> From e7d8bcd7c5d32c520554a4ad69c4716246036002 Mon Sep 17 00:00:00 2001 > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > Date: Tue, 17 Mar 2015 13:31:52 -0400 > Subject: [RFC PATCH] dpci: Switch to tristate instead of bitmap > *TODO*: > - Writeup. > - Tests Done, and unfortunately it doesn't fly .. Some devices seem to work fine, others don't receive any interrupts shortly after boot like: 40: 3 0 0 0 xen-pirq-ioapic-level cx25821[1] Don't see any crashes or errors though, so it seems to silently lock somewhere. -- Sander > Suggested-by: Jan Beulich <jbuelich@suse.com> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > --- > xen/drivers/passthrough/io.c | 140 ++++++++++++++++++++++++------------------- > xen/include/xen/hvm/irq.h | 4 +- > 2 files changed, 82 insertions(+), 62 deletions(-) > diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c > index ae050df..663e104 100644 > --- a/xen/drivers/passthrough/io.c > +++ b/xen/drivers/passthrough/io.c > @@ -30,42 +30,28 @@ > static DEFINE_PER_CPU(struct list_head, dpci_list); > > /* > - * These two bit states help to safely schedule, deschedule, and wait until > - * the softirq has finished. > - * > - * The semantics behind these two bits is as follow: > - * - STATE_SCHED - whoever modifies it has to ref-count the domain (->dom). > - * - STATE_RUN - only softirq is allowed to set and clear it. If it has > - * been set hvm_dirq_assist will RUN with a saved value of the > - * 'struct domain' copied from 'pirq_dpci->dom' before STATE_RUN was set. > - * > - * The usual states are: STATE_SCHED(set) -> STATE_RUN(set) -> > - * STATE_SCHED(unset) -> STATE_RUN(unset). > - * > - * However the states can also diverge such as: STATE_SCHED(set) -> > - * STATE_SCHED(unset) -> STATE_RUN(set) -> STATE_RUN(unset). That means > - * the 'hvm_dirq_assist' never run and that the softirq did not do any > - * ref-counting. 
> - */ > - > -enum { > - STATE_SCHED, > - STATE_RUN > -}; > - > -/* > * This can be called multiple times, but the softirq is only raised once. > - * That is until the STATE_SCHED state has been cleared. The state can be > - * cleared by: the 'dpci_softirq' (when it has executed 'hvm_dirq_assist'), > - * or by 'pt_pirq_softirq_reset' (which will try to clear the state before > + * That is until state is in init. The state can be changed by: > + * the 'dpci_softirq' (when it has executed 'hvm_dirq_assist'), > + * or by 'pt_pirq_softirq_reset' (which will try to init the state before > * the softirq had a chance to run). > */ > static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) > { > unsigned long flags; > > - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) > + switch ( cmpxchg(&pirq_dpci->state, STATE_INIT, STATE_SCHED) ) > + { > + case STATE_RUN: > + case STATE_SCHED: > + /* > + * The pirq_dpci->mapping has been incremented to let us know > + * how many we have left to do. > + */ > return; > + case STATE_INIT: > + break; > + } > > get_knownalive_domain(pirq_dpci->dom); > > @@ -85,7 +71,7 @@ static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) > */ > bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci) > { > - if ( pirq_dpci->state & ((1 << STATE_RUN) | (1 << STATE_SCHED)) ) > + if ( pirq_dpci->state != STATE_INIT ) > return 1; > > /* > @@ -109,22 +95,22 @@ static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci) > > ASSERT(spin_is_locked(&d->event_lock)); > > - switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) ) > + switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, STATE_INIT) ) > { > - case (1 << STATE_SCHED): > + case STATE_SCHED: > /* > - * We are going to try to de-schedule the softirq before it goes in > - * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'. > + * We are going to try to de-schedule the softirq before it goes to > + * running state. 
Whoever moves from sched MUST refcount the 'dom'. > */ > put_domain(d); > /* fallthrough. */ > - case (1 << STATE_RUN): > - case (1 << STATE_RUN) | (1 << STATE_SCHED): > + case STATE_RUN: > + case STATE_INIT: > /* > - * The reason it is OK to reset 'dom' when STATE_RUN bit is set is due > + * The reason it is OK to reset 'dom' when init or running is set is due > * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom' > - * in local variable before it sets STATE_RUN - and therefore will not > - * dereference '->dom' which would crash. > + * in local variable before it sets state to running - and therefore > + * will not dereference '->dom' which would crash. > */ > pirq_dpci->dom = NULL; > break; > @@ -135,7 +121,7 @@ static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci) > * or never initialized it). Note that we hold the lock that > * 'hvm_dirq_assist' could be spinning on. > */ > - pirq_dpci->masked = 0; > + atomic_set(&pirq_dpci->masked, 0); > } > > bool_t pt_irq_need_timer(uint32_t flags) > @@ -149,7 +135,8 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci, > if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT, > &pirq_dpci->flags) ) > { > - pirq_dpci->masked = 0; > + ASSERT(atomic_read(&pirq_dpci->masked) <= 1); > + atomic_set(&pirq_dpci->masked, 0); > pirq_dpci->pending = 0; > pirq_guest_eoi(dpci_pirq(pirq_dpci)); > } > @@ -278,6 +265,8 @@ int pt_irq_create_bind( > * As such on every 'pt_irq_create_bind' call we MUST set it.
> */ > pirq_dpci->dom = d; > + atomic_set(&pirq_dpci->masked, 0); > + pirq_dpci->state = STATE_INIT; > /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/ > rc = pirq_guest_bind(d->vcpu[0], info, 0); > if ( rc == 0 && pt_irq_bind->u.msi.gtable ) > @@ -398,6 +387,8 @@ int pt_irq_create_bind( > if ( pt_irq_need_timer(pirq_dpci->flags) ) > init_timer(&pirq_dpci->timer, pt_irq_time_out, pirq_dpci, 0); > /* Deal with gsi for legacy devices */ > + atomic_set(&pirq_dpci->masked, 0); > + pirq_dpci->state = STATE_INIT; > rc = pirq_guest_bind(d->vcpu[0], info, share); > if ( unlikely(rc) ) > { > @@ -576,6 +567,7 @@ bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci) > if ( !dpci->flags && !pt_pirq_softirq_active(dpci) ) > { > dpci->dom = NULL; > + dpci->state = STATE_INIT; > return 1; > } > return 0; > @@ -617,7 +609,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq) > !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) ) > return 0; > - pirq_dpci->masked = 1; > + atomic_inc(&pirq_dpci->masked); > raise_softirq_for(pirq_dpci); > return 1; > } > @@ -672,12 +664,13 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector) > spin_unlock(&d->event_lock); > } > > -static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci) > +static bool_t hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci) > { > + int more; > ASSERT(d->arch.hvm_domain.irq.dpci); > > spin_lock(&d->event_lock); > - if ( test_and_clear_bool(pirq_dpci->masked) ) > + while ( (more = !atomic_dec_and_test(&pirq_dpci->masked)) ) > { > struct pirq *pirq = dpci_pirq(pirq_dpci); > const struct dev_intx_gsi_link *digl; > @@ -687,17 +680,13 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci) > send_guest_pirq(d, pirq); > > if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI ) > - { > - spin_unlock(&d->event_lock); > - return; > - } > + break; > } > > if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI ) > { > vmsi_deliver_pirq(d, pirq_dpci); > -
spin_unlock(&d->event_lock); > - return; > + break; > } > > list_for_each_entry ( digl, &pirq_dpci->digl_list, list ) > @@ -710,8 +699,7 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci) > { > /* for translated MSI to INTx interrupt, eoi as early as possible */ > __msi_pirq_eoi(pirq_dpci); > - spin_unlock(&d->event_lock); > - return; > + break; > } > > /* > @@ -725,6 +713,8 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci) > set_timer(&pirq_dpci->timer, NOW() + PT_IRQ_TIME_OUT); > } > spin_unlock(&d->event_lock); > + > + return more; > } > > static void __hvm_dpci_eoi(struct domain *d, > @@ -781,7 +771,7 @@ unlock: > } > > /* > - * Note: 'pt_pirq_softirq_reset' can clear the STATE_SCHED before we get to > + * Note: 'pt_pirq_softirq_reset' can move the state to init before we get to > * doing it. If that is the case we let 'pt_pirq_softirq_reset' do ref-counting. > */ > static void dpci_softirq(void) > @@ -797,23 +787,53 @@ static void dpci_softirq(void) > { > struct hvm_pirq_dpci *pirq_dpci; > struct domain *d; > + bool_t again = 0; > > pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list); > list_del(&pirq_dpci->softirq_list); > > d = pirq_dpci->dom; > smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */ > - if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) > - BUG(); > - /* > - * The one who clears STATE_SCHED MUST refcount the domain. > - */ > - if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) ) > + > + switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, STATE_RUN) ) > + { > + case STATE_INIT: > + /* pt_pirq_softirq_reset cleared it. Let it do the ref-counting. */ > + continue; > + case STATE_RUN: > + again = 1; > + /* Fall through. */ > + case STATE_SCHED: > + break; > + } > + if ( !again ) > { > - hvm_dirq_assist(d, pirq_dpci); > + /* > + * The one who changes sched to running MUST refcount the domain. 
> + */ > + again = hvm_dirq_assist(d, pirq_dpci); > put_domain(d); > + switch ( cmpxchg(&pirq_dpci->state, STATE_RUN, STATE_INIT) ) > + { > + case STATE_SCHED: > + case STATE_INIT: > + BUG(); > + case STATE_RUN: > + break; > + } > + } > + if ( again ) > + { > + unsigned long flags; > + > + /* Put back on the list and retry. */ > + local_irq_save(flags); > + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); > + local_irq_restore(flags); > + > + raise_softirq(HVM_DPCI_SOFTIRQ); > + continue; > } > - clear_bit(STATE_RUN, &pirq_dpci->state); > } > } > > diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h > index 3996f1f..48dbc51 100644 > --- a/xen/include/xen/hvm/irq.h > +++ b/xen/include/xen/hvm/irq.h > @@ -93,8 +93,8 @@ struct hvm_irq_dpci { > /* Machine IRQ to guest device/intx mapping. */ > struct hvm_pirq_dpci { > uint32_t flags; > - unsigned int state; > - bool_t masked; > + enum { STATE_INIT, STATE_SCHED, STATE_RUN } state; > + atomic_t masked; > uint16_t pending; > struct list_head digl_list; > struct domain *dom; ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-03-17 17:44 ` Konrad Rzeszutek Wilk 2015-03-17 22:16 ` Sander Eikelenboom @ 2015-03-18 7:41 ` Jan Beulich 2015-03-18 14:06 ` Konrad Rzeszutek Wilk 1 sibling, 1 reply; 27+ messages in thread From: Jan Beulich @ 2015-03-18 7:41 UTC (permalink / raw) To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux >>> On 17.03.15 at 18:44, <konrad.wilk@oracle.com> wrote: > As you can see to preserve the existing functionality such as > being able to schedule N amount of interrupt injections > for the N interrupts we might get - I modified '->masked' > to be an atomic counter. Why would that be? When an earlier interrupt wasn't fully handled, real hardware wouldn't latch more than one further instance either. > The end result is that we can still live-lock. Unless we: > - Drop on the floor the injection of N interrupts and > just deliever at max one per VMX_EXIT (and not bother > with interrupts arriving when we are in the VMX handler). I'm afraid I again don't see the point here. > - Alter the softirq code slightly - to have an variant > which will only iterate once over the pending softirq > bits per call. (so save an copy of the bitmap on the > stack when entering the softirq handler - and use that. > We could also xor it against the current to catch any > non-duplicate bits being set that we should deal with). That's clearly not an option: The solution should be isolated to DPCI code, i.e. without altering existing behavior in other (more generic) components. Jan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-03-18 7:41 ` Jan Beulich @ 2015-03-18 14:06 ` Konrad Rzeszutek Wilk 2015-03-18 16:43 ` Jan Beulich 0 siblings, 1 reply; 27+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-03-18 14:06 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux On Wed, Mar 18, 2015 at 07:41:55AM +0000, Jan Beulich wrote: > >>> On 17.03.15 at 18:44, <konrad.wilk@oracle.com> wrote: > > As you can see to preserve the existing functionality such as > > being able to schedule N amount of interrupt injections > > for the N interrupts we might get - I modified '->masked' > > to be an atomic counter. > > Why would that be? When an earlier interrupt wasn't fully handled, > real hardware wouldn't latch more than one further instance either. We acknowledge the interrupt in the hypervisor - as in we call ->ack on the handler (which for MSI is a nop anyhow). If the device is misconfigured and keeps on sending bursts of interrupts every 10 msec for 1 msec we can dead-lock. Either way we should tell the guest about those interrupts. > > > The end result is that we can still live-lock. Unless we: > > - Drop on the floor the injection of N interrupts and > > just deliver at max one per VMX_EXIT (and not bother > > with interrupts arriving when we are in the VMX handler). > > I'm afraid I again don't see the point here. I am basing all of this on the assumption that we have many interrupts for the same device coming in - and we have not been able to tell the guest about them (the guest could be descheduled, too slow, etc) so that it can do what it needs to silence the device. It might very well be a busted device - and if that is the case, so be it - but we must not dead-lock due to it. > > > - Alter the softirq code slightly - to have a variant > > which will only iterate once over the pending softirq > > bits per call.
(so save an copy of the bitmap on the > > stack when entering the softirq handler - and use that. > > We could also xor it against the current to catch any > > non-duplicate bits being set that we should deal with). > > That's clearly not an option: The solution should be isolated > to DPCI code, i.e. without altering existing behavior in other > (more generic) components. > > Jan > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-03-18 14:06 ` Konrad Rzeszutek Wilk @ 2015-03-18 16:43 ` Jan Beulich 2015-03-18 17:00 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 27+ messages in thread From: Jan Beulich @ 2015-03-18 16:43 UTC (permalink / raw) To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux >>> On 18.03.15 at 15:06, <konrad.wilk@oracle.com> wrote: > On Wed, Mar 18, 2015 at 07:41:55AM +0000, Jan Beulich wrote: >> >>> On 17.03.15 at 18:44, <konrad.wilk@oracle.com> wrote: >> > As you can see to preserve the existing functionality such as >> > being able to schedule N amount of interrupt injections >> > for the N interrupts we might get - I modified '->masked' >> > to be an atomic counter. >> >> Why would that be? When an earlier interrupt wasn't fully handled, >> real hardware wouldn't latch more than one further instance either. > > We acknowledge the interrupt in the hypervisor - as in we call > ->ack on the handler (which for MSI is an nop anyhow). The case where ->ack is a nop (for the purposes here) is specifically not a problem, as that means we defer ack-ing the LAPIC (hence further instances can't show up). > If the device is misconfigured and keeps on sending burst of > interrupts every 10 msec for 1msec we can dead-lock. How is this different from the hypervisor itself not being fast enough to handle one instance before the next one shows up? I've been trying to reconstruct the rationale for our current treatment of maskable MSI sources (in that we ack them at the LAPIC right away), but so far wasn't really successful (sadly commit 5f4c1bb65e lacks any word of description other than its title). (Ill behaved devices shouldn't be handed to guests anyway.) > Either way we should tell the guest about those interrupts. >> >> > The end result is that we can still live-lock. 
Unless we: >> > - Drop on the floor the injection of N interrupts and >> > just deliever at max one per VMX_EXIT (and not bother >> > with interrupts arriving when we are in the VMX handler). >> >> I'm afraid I again don't see the point here. > > I am basing all of this on the assumption that we have > many interrupts for the same device coming it - and we have > not been able to tell the guest about it (the guest could > be descheduled, too slow, etc) so that it can do what it > needs to silence the device. But that's the same as with the native hardware case: When there are new interrupt instances before the earlier one was acked, at most one will be seen at the point the interrupt becomes unmasked again. Jan ^ permalink raw reply [flat|nested] 27+ messages in thread
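The disagreement above is about latching semantics, and it can be made concrete with a toy model (hypothetical helpers, not Xen code): a bool_t-style `masked` coalesces any burst of arrivals into a single pending event, which is what Jan argues real hardware does across a masked period, while the atomic counter from the RFC patch records every arrival and therefore forces the softirq to loop once per interrupt.

```c
#include <assert.h>

/* Latch semantics, as in the original test_and_clear_bool(masked):
 * at most one pending event survives, no matter how many arrivals. */
static int pending_with_latch(int arrivals)
{
    int masked = 0;
    int i;

    for (i = 0; i < arrivals; i++)
        masked = 1;             /* hvm_do_IRQ_dpci: masked = 1 */
    return masked;              /* the handler clears it once */
}

/* Counter semantics, as in the RFC's atomic_inc(&masked): every
 * arrival is recorded and must be drained one by one by the
 * while/atomic_dec_and_test loop in hvm_dirq_assist. */
static int pending_with_counter(int arrivals)
{
    int masked = 0;
    int i;

    for (i = 0; i < arrivals; i++)
        masked++;               /* hvm_do_IRQ_dpci: atomic_inc */
    return masked;
}
```

With five arrivals during one masked window, the latch leaves one event to deliver and the counter leaves five - which is exactly where the live-lock concern in the rest of the thread comes from.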
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-03-18 16:43 ` Jan Beulich @ 2015-03-18 17:00 ` Konrad Rzeszutek Wilk 2015-03-19 7:15 ` Jan Beulich 0 siblings, 1 reply; 27+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-03-18 17:00 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux On Wed, Mar 18, 2015 at 04:43:40PM +0000, Jan Beulich wrote: > >>> On 18.03.15 at 15:06, <konrad.wilk@oracle.com> wrote: > > On Wed, Mar 18, 2015 at 07:41:55AM +0000, Jan Beulich wrote: > >> >>> On 17.03.15 at 18:44, <konrad.wilk@oracle.com> wrote: > >> > As you can see to preserve the existing functionality such as > >> > being able to schedule N amount of interrupt injections > >> > for the N interrupts we might get - I modified '->masked' > >> > to be an atomic counter. > >> > >> Why would that be? When an earlier interrupt wasn't fully handled, > >> real hardware wouldn't latch more than one further instance either. > > > > We acknowledge the interrupt in the hypervisor - as in we call > > ->ack on the handler (which for MSI is an nop anyhow). > > The case where ->ack is a nop (for the purposes here) is specifically > not a problem, as that means we defer ack-ing the LAPIC (hence > further instances can't show up). > > > If the device is misconfigured and keeps on sending burst of > > interrupts every 10 msec for 1msec we can dead-lock. > > How is this different from the hypervisor itself not being fast > enough to handle one instance before the next one shows up? If by 'handle' you mean process it to the guest (so update guest vAPIC and so on), then yes - this is exactly the case I am describing. > I've been trying to reconstruct the rationale for our current > treatment of maskable MSI sources (in that we ack them at the > LAPIC right away), but so far wasn't really successful (sadly > commit 5f4c1bb65e lacks any word of description other than > its title). 
> > (Ill behaved devices shouldn't be handed to guests anyway.) They might become ill-behaved if the guest OS becomes compromised. > > > Either way we should tell the guest about those interrupts. > >> > >> > The end result is that we can still live-lock. Unless we: > >> > - Drop on the floor the injection of N interrupts and > >> > just deliver at max one per VMX_EXIT (and not bother > >> > with interrupts arriving when we are in the VMX handler). > >> > >> I'm afraid I again don't see the point here. > > > > I am basing all of this on the assumption that we have > > many interrupts for the same device coming in - and we have > > not been able to tell the guest about it (the guest could > > be descheduled, too slow, etc) so that it can do what it > > needs to silence the device. > > But that's the same as with the native hardware case: When there > are new interrupt instances before the earlier one was acked, at > most one will be seen at the point the interrupt becomes unmasked > again. Correct. However we split the 'handling' of an interrupt into two stages. The first stage is Acking it and activating a softirq to process this dpci. The second stage is running the softirq handler (processing) - and right then we can get interrupted by the same interrupt (we have Acked it - so the device is OK to send another one). The interrupt handler (do_IRQ) will try to tell the softirq to process it. And in here - depending on which flavour of RFC patches I've posted - we could deadlock. The deadlocks arise if we explicitly wait for the softirq to finish in raise_softirq_for - as in we spin in raise_softirq_for for the dpci to be out of running - while we have just stomped over the softirq that was processing the dpci! The live-lock scenario is also possible - if the device sends an interrupt right as dpci_softirq is in hvm_dirq_assist - and it does so at such regular intervals that dpci_softirq ends up rescheduling its dpci every time.
^ permalink raw reply [flat|nested] 27+ messages in thread
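The live-lock scenario described above can be sketched as a toy simulation (illustrative only - the numbers and helper are made up, not taken from Xen): each softirq pass consumes one pending event, but if the device fires again during every pass, the requeue path keeps the dpci on the per-cpu list forever and the queue never drains.

```c
#include <assert.h>

/* Illustrative live-lock model. One pass of the softirq handles one
 * pending event, but 'fires_during_pass' new interrupts land while
 * hvm_dirq_assist is running and re-raise the dpci. Returns the number
 * of passes needed to drain, or -1 if 'limit' passes were not enough
 * (the live-lock case). */
static int passes_to_drain(int pending, int fires_during_pass, int limit)
{
    int passes = 0;

    while (pending > 0) {
        if (++passes > limit)
            return -1;                /* never drains: live-lock */
        pending--;                    /* consume one event */
        pending += fires_during_pass; /* interrupts landing mid-pass */
    }
    return passes;
}
```

A quiet device drains in exactly as many passes as it has pending events; a device that fires once per pass never drains, which is the regular-interval case Konrad describes.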
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-03-18 17:00 ` Konrad Rzeszutek Wilk @ 2015-03-19 7:15 ` Jan Beulich 0 siblings, 0 replies; 27+ messages in thread From: Jan Beulich @ 2015-03-19 7:15 UTC (permalink / raw) To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux >>> On 18.03.15 at 18:00, <konrad.wilk@oracle.com> wrote: > The deadlocks arise if we explicitly wait for the softirq to finish > in raise_softirq_for - as in we spin in raise_softirq_for for the > dpci to be out of running - while we have just stomped over the > softirq that was processing the dpci! Right - I finally see what you mean and why my alternative suggestion was bad. So let me go back look at the patch Sander has been testing for so long another time with this in mind. Jan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-01-13 10:20 ` Jan Beulich 2015-01-23 1:44 ` Konrad Rzeszutek Wilk @ 2015-02-02 14:29 ` Konrad Rzeszutek Wilk 2015-02-02 15:19 ` Jan Beulich 1 sibling, 1 reply; 27+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-02-02 14:29 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux On Tue, Jan 13, 2015 at 10:20:00AM +0000, Jan Beulich wrote: > >>> On 12.01.15 at 17:45, <konrad.wilk@oracle.com> wrote: > > There is race when we clear the STATE_SCHED in the softirq > > - which allows the 'raise_softirq_for' to progress and > > schedule another dpci. During that time the other CPU could > > receive an interrupt and calls 'raise_softirq_for' and put > > the dpci on its per-cpu list. There would be two 'dpci_softirq' > > running at the same time (on different CPUs) where the > > dpci state is STATE_RUN (and STATE_SCHED is cleared). This > > ends up hitting: > > > > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) > > BUG() > > > > Instead of that put the dpci back on the per-cpu list to deal > > with later. > > > > The reason we can get his with this is when an interrupt > > affinity is set over multiple CPUs. > > > > Another potential fix would be to add a guard in the raise_softirq_for > > to check for 'STATE_RUN' bit being set and not schedule the > > dpci until that bit has been cleared. > > I indeed think this should be investigated, because it would make > explicit what ... > > > --- a/xen/drivers/passthrough/io.c > > +++ b/xen/drivers/passthrough/io.c > > @@ -804,7 +804,17 @@ static void dpci_softirq(void) > > d = pirq_dpci->dom; > > smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */ > > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) > > - BUG(); > > + { > > + unsigned long flags; > > + > > + /* Put back on the list and retry. 
*/ > > + local_irq_save(flags); > > + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); > > + local_irq_restore(flags); > > + > > + raise_softirq(HVM_DPCI_SOFTIRQ); > > + continue; > > + } > > ... this does implicitly - spin until the bad condition cleared. I still haven't gotten to trying to reproduce the issues Malcolm saw, which is easier for me to do (use Intel super-fast storage in a Windows guest) - since I've one of those boxes. However in lieu of that, here is a patch that does pass my testing of SR-IOV, and I believe should work fine. I _think_ it covers what you want it to have, Jan, but of course please correct me if I made a mistake in the logic. From 20766b144e0b11daf3227ae978880bacd686ea2b Mon Sep 17 00:00:00 2001 From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Date: Mon, 12 Jan 2015 11:35:03 -0500 Subject: [PATCH] dpci: schedule properly the dpci when it is in STATE_RUN on any per-cpu list. There is a race when we clear the STATE_SCHED in the softirq - which allows the 'raise_softirq_for' to progress and schedule another dpci. During that time the other CPU could receive an interrupt and call 'raise_softirq_for' and put the dpci on its per-cpu list. There would be two 'dpci_softirq' running at the same time (on different CPUs) where the dpci state is STATE_RUN (and STATE_SCHED is cleared). This ends up hitting: if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) BUG() The reason we can hit this is when an interrupt affinity is set over multiple CPUs. Instead of the BUG() we can put the dpci back on the per-cpu list to deal with later (when the softirqs are activated again). This putting of the 'dpci' back on the per-cpu list is a spin until the bad condition clears. We also expand the test-and-set(STATE_SCHED) in raise_softirq_for to detect the 'STATE_RUN' bit being set and schedule the dpci.
The cases in which a dpci is NOT going to be scheduled are whenever the 'STATE_SCHED' bit is set (regardless of whether STATE_RUN is set or not). Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- xen/drivers/passthrough/io.c | 43 ++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 40 insertions(+), 3 deletions(-) diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c index ae050df..b61b73c 100644 --- a/xen/drivers/passthrough/io.c +++ b/xen/drivers/passthrough/io.c @@ -63,10 +63,37 @@ enum { static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) { unsigned long flags; + unsigned long old, new, val = pirq_dpci->state; - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) - return; + /* + * This cmpxchg is a clearer version of: + * if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) + * return; + * since it also checks for STATE_RUN conditions. + */ + for ( ;; ) + { + new = 1 << STATE_SCHED; + if ( val ) + new |= val; + old = cmpxchg(&pirq_dpci->state, val, new); + switch ( old ) + { + case (1 << STATE_SCHED): + case (1 << STATE_RUN) | (1 << STATE_SCHED): + /* + * Whenever STATE_SCHED is set we MUST not schedule it. + */ + return; + } + /* + * If the 'state' is 0 or (1 << STATE_RUN) we can schedule it. + */ + if ( old == val ) + break; + val = old; + } get_knownalive_domain(pirq_dpci->dom); local_irq_save(flags); @@ -804,7 +831,17 @@ static void dpci_softirq(void) d = pirq_dpci->dom; smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */ if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) - BUG(); + { + unsigned long flags; + + /* Put back on the list and retry later.
*/ + local_irq_save(flags); + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); + local_irq_restore(flags); + + raise_softirq(HVM_DPCI_SOFTIRQ); + continue; + } /* * The one who clears STATE_SCHED MUST refcount the domain. */ -- 2.1.0 ^ permalink raw reply related [flat|nested] 27+ messages in thread
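The net effect of the expanded raise_softirq_for() in the patch above can be captured as a truth table over the two state bits. The sketch below is a single-threaded model of the cmpxchg loop's outcome, not the Xen source: scheduling proceeds exactly when STATE_SCHED is clear, whether or not STATE_RUN is set.

```c
#include <assert.h>

#define BIT_SCHED 1u   /* 1 << STATE_SCHED */
#define BIT_RUN   2u   /* 1 << STATE_RUN   */

/* Single-threaded model of the cmpxchg loop's end result: back off when
 * STATE_SCHED is already set, otherwise set it (preserving STATE_RUN). */
static int try_schedule(unsigned int *state)
{
    if (*state & BIT_SCHED)
        return 0;              /* already scheduled: do nothing */
    *state |= BIT_SCHED;       /* 0 -> SCHED, or RUN -> RUN|SCHED */
    return 1;                  /* caller would ref-count and enqueue */
}

/* Helper returning the resulting state bitmap for a given start value. */
static unsigned int state_after(unsigned int start)
{
    unsigned int s = start;
    try_schedule(&s);
    return s;
}
```

The interesting row is the STATE_RUN-only one: the softirq is running the handler, STATE_SCHED has already been cleared, and a new interrupt is allowed to schedule the dpci again - which is what feeds the requeue path in dpci_softirq.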
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-02-02 14:29 ` Konrad Rzeszutek Wilk @ 2015-02-02 15:19 ` Jan Beulich 2015-02-02 15:31 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 27+ messages in thread From: Jan Beulich @ 2015-02-02 15:19 UTC (permalink / raw) To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux >>> On 02.02.15 at 15:29, <konrad.wilk@oracle.com> wrote: > On Tue, Jan 13, 2015 at 10:20:00AM +0000, Jan Beulich wrote: >> >>> On 12.01.15 at 17:45, <konrad.wilk@oracle.com> wrote: >> > There is race when we clear the STATE_SCHED in the softirq >> > - which allows the 'raise_softirq_for' to progress and >> > schedule another dpci. During that time the other CPU could >> > receive an interrupt and calls 'raise_softirq_for' and put >> > the dpci on its per-cpu list. There would be two 'dpci_softirq' >> > running at the same time (on different CPUs) where the >> > dpci state is STATE_RUN (and STATE_SCHED is cleared). This >> > ends up hitting: >> > >> > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) >> > BUG() >> > >> > Instead of that put the dpci back on the per-cpu list to deal >> > with later. >> > >> > The reason we can get his with this is when an interrupt >> > affinity is set over multiple CPUs. >> > >> > Another potential fix would be to add a guard in the raise_softirq_for >> > to check for 'STATE_RUN' bit being set and not schedule the >> > dpci until that bit has been cleared. >> >> I indeed think this should be investigated, because it would make >> explicit what ... >> >> > --- a/xen/drivers/passthrough/io.c >> > +++ b/xen/drivers/passthrough/io.c >> > @@ -804,7 +804,17 @@ static void dpci_softirq(void) >> > d = pirq_dpci->dom; >> > smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */ >> > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) >> > - BUG(); >> > + { >> > + unsigned long flags; >> > + >> > + /* Put back on the list and retry. 
*/ >> > + local_irq_save(flags); >> > + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); >> > + local_irq_restore(flags); >> > + >> > + raise_softirq(HVM_DPCI_SOFTIRQ); >> > + continue; >> > + } >> >> ... this does implicitly - spin until the bad condition cleared. > > I still haven't gotten to trying to reproduce the issues Malcolm > saw, which is easier for me to do (use Intel super-fast storage > in a Windows guest) - since I've one of those boxes. > > However in lieu of that, here is a patch that does pass my testing > of SR-IOV, and I believe should work fine. I _think_ it covers > what you want it to have, Jan, but of course please correct me > if I made a mistake in the logic. Since the code quoted above is still there in the new patch, the new patch can at best be half of what I suggested. > --- a/xen/drivers/passthrough/io.c > +++ b/xen/drivers/passthrough/io.c > @@ -63,10 +63,37 @@ enum { > static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) > { > unsigned long flags; > + unsigned long old, new, val = pirq_dpci->state; > > - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) > - return; > + /* + * This cmpxchg is a clearer version of: > + * if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) > + * return; > + * since it also checks for STATE_RUN conditions. > + */ > + for ( ;; ) > + { > + new = 1 << STATE_SCHED; > + if ( val ) > + new |= val; Why the if()? > > + old = cmpxchg(&pirq_dpci->state, val, new); > + switch ( old ) > + { > + case (1 << STATE_SCHED): > + case (1 << STATE_RUN) | (1 << STATE_SCHED): > + /* + * Whenever STATE_SCHED is set we MUST not schedule it. > + */ > + return; > + } > + /* + * If the 'state' is 0 or (1 << STATE_RUN) we can schedule it. > + */ Really? Wasn't the intention to _not_ schedule when STATE_RUN is set? (Also the above is a one-line comment, i.e. wants different style.) I.e. with what you do now you could as well keep the old code.
Jan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-02-02 15:19 ` Jan Beulich @ 2015-02-02 15:31 ` Konrad Rzeszutek Wilk 2015-02-02 15:48 ` Jan Beulich 0 siblings, 1 reply; 27+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-02-02 15:31 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux On Mon, Feb 02, 2015 at 03:19:33PM +0000, Jan Beulich wrote: > >>> On 02.02.15 at 15:29, <konrad.wilk@oracle.com> wrote: > > On Tue, Jan 13, 2015 at 10:20:00AM +0000, Jan Beulich wrote: > >> >>> On 12.01.15 at 17:45, <konrad.wilk@oracle.com> wrote: > >> > There is race when we clear the STATE_SCHED in the softirq > >> > - which allows the 'raise_softirq_for' to progress and > >> > schedule another dpci. During that time the other CPU could > >> > receive an interrupt and calls 'raise_softirq_for' and put > >> > the dpci on its per-cpu list. There would be two 'dpci_softirq' > >> > running at the same time (on different CPUs) where the > >> > dpci state is STATE_RUN (and STATE_SCHED is cleared). This > >> > ends up hitting: > >> > > >> > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) > >> > BUG() > >> > > >> > Instead of that put the dpci back on the per-cpu list to deal > >> > with later. > >> > > >> > The reason we can get his with this is when an interrupt > >> > affinity is set over multiple CPUs. > >> > > >> > Another potential fix would be to add a guard in the raise_softirq_for > >> > to check for 'STATE_RUN' bit being set and not schedule the > >> > dpci until that bit has been cleared. > >> > >> I indeed think this should be investigated, because it would make > >> explicit what ... > >> > >> > --- a/xen/drivers/passthrough/io.c > >> > +++ b/xen/drivers/passthrough/io.c > >> > @@ -804,7 +804,17 @@ static void dpci_softirq(void) > >> > d = pirq_dpci->dom; > >> > smp_mb(); /* 'd' MUST be saved before we set/clear the bits. 
*/ > >> > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) > >> > - BUG(); > >> > + { > >> > + unsigned long flags; > >> > + > >> > + /* Put back on the list and retry. */ > >> > + local_irq_save(flags); > >> > + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); > >> > + local_irq_restore(flags); > >> > + > >> > + raise_softirq(HVM_DPCI_SOFTIRQ); > >> > + continue; > >> > + } > >> > >> ... this does implicitly - spin until the bad condition cleared. > > > > I still haven't gotten to trying to reproduce the issues Malcolm > > saw which is easier for me to do (use Intel super-fast storage > > in a Windows guest) - since I've one of those boxes. > > > > However in lieu of that, here is a patch that does pass my testing > > of SR-IOV, and I believe should work fine. I _think_ it covers > > what you want it to have Jan, but of course please correct me > > if I made a mistake in the logic. > > Since the code quoted above is still there in the new patch, the new > patch can at best be half of what I suggested. > > > --- a/xen/drivers/passthrough/io.c > > +++ b/xen/drivers/passthrough/io.c > > @@ -63,10 +63,37 @@ enum { > > static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) > > { > > unsigned long flags; > > + unsigned long old, new, val = pirq_dpci->state; > > > > - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) > > - return; > > + /* > > + * This cmpxch is a more clear version of: > > + * if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) > > + * return; > > + * since it also checks for STATE_RUN conditions. > > + */ > > + for ( ;; ) > > + { > > + new = 1 << STATE_SCHED; > > + if ( val ) > > + new |= val; > > Why the if()? To 'or' the variable new with '1 << STATE_RUN' in case 'val' changed from the first read ('val = pirq_dpci->state') to the moment when we do the cmpxchg. 
> > > > > + old = cmpxchg(&pirq_dpci->state, val, new); > > + switch ( old ) > > + { > > + case (1 << STATE_SCHED): > > + case (1 << STATE_RUN) | (1 << STATE_SCHED): > > + /* > > + * Whenever STATE_SCHED is set we MUST not schedule it. > > + */ > > + return; > > + } > > + /* > > + * If the 'state' is 0 or (1 << STATE_RUN) we can schedule it. > > + */ > > Really? Wasn't the intention to _not_ schedule when STATE_RUN is > set? (Also the above is a only line comment, i.e. wants different style.) I must confess I must have misread your last review. You said: > Here is a patch that does this. I don't yet have an setup to test > the failing scenario (working on that). I removed the part in > the softirq because with this patch I cannot see a way it would > ever get there (in the softirq hitting the BUG). Hmm, but how do you now schedule the second instance that needed ... > + case (1 << STATE_RUN): ... in this case? which to me implied you still want to schedule an 'dpci' when STATE_RUN is set? My thinking is that we should still schedule an 'dpci' when STATE_RUN is set as that is inline with what the old tasklet code did. It would schedule the tasklet the moment said tasklet was off the global list (but it would not add it in the global list - that would be the job of the tasklet function wrapper to detect and insert said tasklet back on the global list). > > I.e. with what you do now you could as well keep the old code. > > Jan > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-02-02 15:31 ` Konrad Rzeszutek Wilk @ 2015-02-02 15:48 ` Jan Beulich 2015-02-02 17:44 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 27+ messages in thread From: Jan Beulich @ 2015-02-02 15:48 UTC (permalink / raw) To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux >>> On 02.02.15 at 16:31, <konrad.wilk@oracle.com> wrote: > On Mon, Feb 02, 2015 at 03:19:33PM +0000, Jan Beulich wrote: >> >>> On 02.02.15 at 15:29, <konrad.wilk@oracle.com> wrote: >> > --- a/xen/drivers/passthrough/io.c >> > +++ b/xen/drivers/passthrough/io.c >> > @@ -63,10 +63,37 @@ enum { >> > static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) >> > { >> > unsigned long flags; >> > + unsigned long old, new, val = pirq_dpci->state; >> > >> > - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) >> > - return; >> > + /* >> > + * This cmpxch is a more clear version of: >> > + * if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) >> > + * return; >> > + * since it also checks for STATE_RUN conditions. >> > + */ >> > + for ( ;; ) >> > + { >> > + new = 1 << STATE_SCHED; >> > + if ( val ) >> > + new |= val; >> >> Why the if()? > > To 'or' the variable new with '1 << STATE_RUN' in case 'val' changed from > the first read ('val = pirq_dpci->state') to the moment when > we do the cmpxchg. But if "val" is zero, the | simply will do nothing. >> > + old = cmpxchg(&pirq_dpci->state, val, new); >> > + switch ( old ) >> > + { >> > + case (1 << STATE_SCHED): >> > + case (1 << STATE_RUN) | (1 << STATE_SCHED): >> > + /* >> > + * Whenever STATE_SCHED is set we MUST not schedule it. >> > + */ >> > + return; >> > + } >> > + /* >> > + * If the 'state' is 0 or (1 << STATE_RUN) we can schedule it. >> > + */ >> >> Really? Wasn't the intention to _not_ schedule when STATE_RUN is >> set? (Also the above is a only line comment, i.e. wants different style.) 
> > I must confess I must have misread your last review. You said: > > > Here is a patch that does this. I don't yet have an setup to test > > the failing scenario (working on that). I removed the part in > > the softirq because with this patch I cannot see a way it would > > ever get there (in the softirq hitting the BUG). > > Hmm, but how do you now schedule the second instance that needed ... > > > + case (1 << STATE_RUN): > > ... in this case? > > which to me implied you still want to schedule an 'dpci' when STATE_RUN is > set? > > My thinking is that we should still schedule an 'dpci' when STATE_RUN is set > as that is inline with what the old tasklet code did. It would > schedule the tasklet the moment said tasklet was off the global list (but > it would not add it in the global list - that would be the job of the > tasklet function wrapper to detect and insert said tasklet back on > the global list). Didn't the original discussion (and issue) revolve around scheduling while STATE_RUN was set? Hence the intention to wait for the flag to clear - but preferably in an explicit rather than implicit (as your current and previous patch do) manner. Jan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-02-02 15:48 ` Jan Beulich @ 2015-02-02 17:44 ` Konrad Rzeszutek Wilk 2015-02-03 8:58 ` Jan Beulich 0 siblings, 1 reply; 27+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-02-02 17:44 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux On Mon, Feb 02, 2015 at 03:48:19PM +0000, Jan Beulich wrote: > >>> On 02.02.15 at 16:31, <konrad.wilk@oracle.com> wrote: > > On Mon, Feb 02, 2015 at 03:19:33PM +0000, Jan Beulich wrote: > >> >>> On 02.02.15 at 15:29, <konrad.wilk@oracle.com> wrote: > >> > --- a/xen/drivers/passthrough/io.c > >> > +++ b/xen/drivers/passthrough/io.c > >> > @@ -63,10 +63,37 @@ enum { > >> > static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) > >> > { > >> > unsigned long flags; > >> > + unsigned long old, new, val = pirq_dpci->state; > >> > > >> > - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) > >> > - return; > >> > + /* > >> > + * This cmpxch is a more clear version of: > >> > + * if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) > >> > + * return; > >> > + * since it also checks for STATE_RUN conditions. > >> > + */ > >> > + for ( ;; ) > >> > + { > >> > + new = 1 << STATE_SCHED; > >> > + if ( val ) > >> > + new |= val; > >> > >> Why the if()? > > > > To 'or' the variable new with '1 << STATE_RUN' in case 'val' changed from > > the first read ('val = pirq_dpci->state') to the moment when > > we do the cmpxchg. > > But if "val" is zero, the | simply will do nothing. Correct. Keep in mind that 'new' is set to '1 << STATE_SCHED' at every loop iteration - so it ends up old = cmpxchg(.., 0, 1 << STATE_SCHED) (and old == 0, val == 0, so we end up breaking out of the loop). 
> > >> > + old = cmpxchg(&pirq_dpci->state, val, new); > >> > + switch ( old ) > >> > + { > >> > + case (1 << STATE_SCHED): > >> > + case (1 << STATE_RUN) | (1 << STATE_SCHED): > >> > + /* > >> > + * Whenever STATE_SCHED is set we MUST not schedule it. > >> > + */ > >> > + return; > >> > + } > >> > + /* > >> > + * If the 'state' is 0 or (1 << STATE_RUN) we can schedule it. > >> > + */ > >> > >> Really? Wasn't the intention to _not_ schedule when STATE_RUN is > >> set? (Also the above is a only line comment, i.e. wants different style.) > > > > I must confess I must have misread your last review. You said: > > > > > Here is a patch that does this. I don't yet have an setup to test > > > the failing scenario (working on that). I removed the part in > > > the softirq because with this patch I cannot see a way it would > > > ever get there (in the softirq hitting the BUG). > > > > Hmm, but how do you now schedule the second instance that needed ... > > > > > + case (1 << STATE_RUN): > > > > ... in this case? > > > > which to me implied you still want to schedule an 'dpci' when STATE_RUN is > > set? > > > > My thinking is that we should still schedule an 'dpci' when STATE_RUN is set > > as that is inline with what the old tasklet code did. It would > > schedule the tasklet the moment said tasklet was off the global list (but > > it would not add it in the global list - that would be the job of the > > tasklet function wrapper to detect and insert said tasklet back on > > the global list). > > Didn't the original discussion (and issue) revolve around scheduling > while STATE_RUN was set? Hence the intention to wait for the flag Yes. > to clear - but preferably in an explicit rather than implicit (as your > current and previous patch do) manner. If we do explicitly we run risk of dead-lock. See below of an draft (not even compiled tested) of what I think you mean. The issue is that 'raise_softirq_for' ends up being called from do_IRQ. 
And we might have an IRQ coming in just as we are in the dpci_softirq having set the 1 << STATE_SCHED. Our spin-and-wait in raise_softirq_for (in this code below) will spin forever. One way to not get in that quagmire is to well, do what the previous patches did. diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c index ae050df..706a636 100644 --- a/xen/drivers/passthrough/io.c +++ b/xen/drivers/passthrough/io.c @@ -63,10 +63,35 @@ enum { static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) { unsigned long flags; + unsigned long old, new, val = pirq_dpci->state; - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) - return; - + for ( ;; ) + { + old = cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED); + switch ( old ) + { + case (1 << STATE_SCHED): + /* + * Whenever STATE_SCHED is set we MUST not schedule it. + */ + return; + case (1 << STATE_RUN) | (1 << STATE_SCHED): + case (1 << STATE_RUN): + /* Getting close to finish. Spin. */ + continue; + } + /* + * If the 'state' is 0 we can schedule it. + */ + if ( old == 0 ) + break; + } get_knownalive_domain(pirq_dpci->dom); local_irq_save(flags); > > Jan > ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-02-02 17:44 ` Konrad Rzeszutek Wilk @ 2015-02-03 8:58 ` Jan Beulich 2015-03-16 17:59 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 27+ messages in thread From: Jan Beulich @ 2015-02-03 8:58 UTC (permalink / raw) To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux >>> On 02.02.15 at 18:44, <konrad.wilk@oracle.com> wrote: > On Mon, Feb 02, 2015 at 03:48:19PM +0000, Jan Beulich wrote: >> >>> On 02.02.15 at 16:31, <konrad.wilk@oracle.com> wrote: >> > On Mon, Feb 02, 2015 at 03:19:33PM +0000, Jan Beulich wrote: >> >> >>> On 02.02.15 at 15:29, <konrad.wilk@oracle.com> wrote: >> >> > --- a/xen/drivers/passthrough/io.c >> >> > +++ b/xen/drivers/passthrough/io.c >> >> > @@ -63,10 +63,37 @@ enum { >> >> > static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) >> >> > { >> >> > unsigned long flags; >> >> > + unsigned long old, new, val = pirq_dpci->state; >> >> > >> >> > - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) >> >> > - return; >> >> > + /* >> >> > + * This cmpxch is a more clear version of: >> >> > + * if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) >> >> > + * return; >> >> > + * since it also checks for STATE_RUN conditions. >> >> > + */ >> >> > + for ( ;; ) >> >> > + { >> >> > + new = 1 << STATE_SCHED; >> >> > + if ( val ) >> >> > + new |= val; >> >> >> >> Why the if()? >> > >> > To 'or' the variable new with '1 << STATE_RUN' in case 'val' changed from >> > the first read ('val = pirq_dpci->state') to the moment when >> > we do the cmpxchg. >> >> But if "val" is zero, the | simply will do nothing. > > Correct. Keep in mind that 'new' is set to '1 << STATE_SCHED' at every > loop iteration - so it ends up old = cmpxchg(.., 0, 1 << STATE_SCHED) > (and old == 0, val == 0, so we end up breaking out of the loop). Not sure what you're trying to explain to me here. 
The code you have is equivalent to + new = (1 << STATE_SCHED) | val; no matter what. >> Didn't the original discussion (and issue) revolve around scheduling >> while STATE_RUN was set? Hence the intention to wait for the flag > > Yes. >> to clear - but preferably in an explicit rather than implicit (as your >> current and previous patch do) manner. > > If we do explicitly we run risk of dead-lock. See below of an draft > (not even compiled tested) of what I think you mean. That's no different from the code you proposed before, just that the live-lock is better hidden there: By re-raising a softirq from a softirq handler, you arrange for yourself to be called again right away. > --- a/xen/drivers/passthrough/io.c > +++ b/xen/drivers/passthrough/io.c > @@ -63,10 +63,35 @@ enum { > static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) > { > unsigned long flags; > + unsigned long old, new, val = pirq_dpci->state; > > - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) > - return; > - > + for ( ;; ) > + { > + old = cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED); > + switch ( old ) > + { > + case (1 << STATE_SCHED): > + /* > + * Whenever STATE_SCHED is set we MUST not schedule it. > + */ > + return; > + case (1 << STATE_RUN) | (1 << STATE_SCHED): > + case (1 << STATE_RUN): > + /* Getting close to finish. Spin. */ > + continue; > + } > + /* > + * If the 'state' is 0 we can schedule it. > + */ > + if ( old == 0 ) > + break; > + } > get_knownalive_domain(pirq_dpci->dom); > > local_irq_save(flags); Yes, this looks better. And I again wonder whether STATE_* couldn't simply become a tristate - dpci_softirq(), rather than setting STATE_RUN and then clearing STATE_SCHED, could simply cmpxchg(, STATE_SCHED, STATE_RUN), acting as necessary when that operation fails. Jan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-02-03 8:58 ` Jan Beulich @ 2015-03-16 17:59 ` Konrad Rzeszutek Wilk 2015-03-17 8:18 ` Jan Beulich 0 siblings, 1 reply; 27+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-03-16 17:59 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, linux, xen-devel > > If we do explicitly we run risk of dead-lock. See below of an draft > > (not even compiled tested) of what I think you mean. > > That's no different from the code you proposed before, just that > the live-lock is better hidden there: By re-raising a softirq from a > softirq handler, you arrange for yourself to be called again right > away. Interestingly enough this issue of live-lock has been there since the start (when this code was written). The tasklet (so Xen 4.5 and earlier) based implementation allows this: CPU0 CPU1 CPU2 hvm_do_IRQ_dpci() - tasklet_schedule() [t->scheduled_on = 0] if (!t->is_running) put tasklet on per-cpu list. raise softirq . do_softirq [do_tasklet_work] t->scheduled_on = -1 t->is_running = 1 unlock the lock -> call hvm_dirq_assist do_IRQ hvm_do_IRQ_dpci t->scheduled_on = 1; if (!t->is_running).. [it is 1, so skip] do_IRQ hvm_do_IRQ_dpci t->scheduled_on = 2; if (!t->is_running).. [it is 1, so skip] takes the lock if (t->scheduled_on >= 0) tasklet_enqueue: put tasklet on per-cpu list, raise softirq. [do_tasklet_work] .. repeat the story. Inject interrupts right as 'hvm_dirq_assist' it executing. With interrupts happening right as the dpci_work is being called we can cause an exact live-lock. It also suffers from "skipping" interrupts if they span more than on CPU. The implementation I had does not address this issue either (it only picks up the 'second' interrupt while the older picks the 'last' one). Anyhow the patch below makes the code from a behavior standpoint the same as what we have in Xen 4.5 and earlier (with warts). 
The difference is that the warts are so much faster so you don't seem them :-) > > > --- a/xen/drivers/passthrough/io.c > > +++ b/xen/drivers/passthrough/io.c > > @@ -63,10 +63,35 @@ enum { > > static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) > > { > > unsigned long flags; > > + unsigned long old, new, val = pirq_dpci->state; > > > > - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) > > - return; > > - > > + for ( ;; ) > > + { > > + old = cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED); > > + switch ( old ) > > + { > > + case (1 << STATE_SCHED): > > + /* > > + * Whenever STATE_SCHED is set we MUST not schedule it. > > + */ > > + return; > > + case (1 << STATE_RUN) | (1 << STATE_SCHED): > > + case (1 << STATE_RUN): > > + /* Getting close to finish. Spin. */ > > + continue; > > + } > > + /* > > + * If the 'state' is 0 we can schedule it. > > + */ > > + if ( old == 0 ) > > + break; > > + } > > get_knownalive_domain(pirq_dpci->dom); > > > > local_irq_save(flags); > > Yes, this looks better. And I again wonder whether STATE_* > couldn't simply become a tristate - dpci_softirq(), rather than setting > STATE_RUN and then clearing STATE_SCHED, could simply > cmpxchg(, STATE_SCHED, STATE_RUN), acting as necessary when > that operation fails. One issue that needs to be figured out is what to do when we have an interrupt mini-storm and need to queue it up. One option could be to convert the ->mapping to be an simple atomic counter - incremented every time hvm_do_IRQ_dpci is called (and hvm_dirq_assist would have an while loop decrementing this). However as previous reviews demonstrated there are a lot of subtle things that need to be taken care of: - Legacy interrupts and its 'lets-reset mapping to zero' when the interrupt timer has elapsed. - Deal with error states when assigning an PIRQ, or during the window time between creating/destroying loop and having to make sure that the state in the dpci_softirq is sane. 
That took me two months to get right and only with other folks testing and finding this. While I am totally up for restarting this to make it awesome - I am also aghast at my progress on that and fear that this might take _right_ past the feature freeze window. Hence was wondering if it would just be easier to put this patch in (see above) - with the benfit that folks have an faster interrupt passthrough experience and then I work on another variant of this with tristate cmpxchg and ->mapping atomic counter. Thoughts? > > Jan > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-03-16 17:59 ` Konrad Rzeszutek Wilk @ 2015-03-17 8:18 ` Jan Beulich 2015-03-17 8:42 ` Sander Eikelenboom 0 siblings, 1 reply; 27+ messages in thread From: Jan Beulich @ 2015-03-17 8:18 UTC (permalink / raw) To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux >>> On 16.03.15 at 18:59, <konrad.wilk@oracle.com> wrote: > Hence was wondering if it would just be easier to put > this patch in (see above) - with the benfit that folks have > an faster interrupt passthrough experience and then I work on another > variant of this with tristate cmpxchg and ->mapping atomic counter. Considering how long this issue has been pending I think we really need to get _something_ in (or revert); if this something is the patch in its most recent form, so be it (even if maybe not the simplest of all possible variants). So please submit as a proper non-RFC patch. Jan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-03-17 8:18 ` Jan Beulich @ 2015-03-17 8:42 ` Sander Eikelenboom 2015-03-17 14:54 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 27+ messages in thread From: Sander Eikelenboom @ 2015-03-17 8:42 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel Tuesday, March 17, 2015, 9:18:32 AM, you wrote: >>>> On 16.03.15 at 18:59, <konrad.wilk@oracle.com> wrote: >> Hence was wondering if it would just be easier to put >> this patch in (see above) - with the benfit that folks have >> an faster interrupt passthrough experience and then I work on another >> variant of this with tristate cmpxchg and ->mapping atomic counter. > Considering how long this issue has been pending I think we really > need to get _something_ in (or revert); if this something is the > patch in its most recent form, so be it (even if maybe not the > simplest of all possible variants). So please submit as a proper non- > RFC patch. > Jan I'm still running with this first simple stopgap patch from Konrad, and it has worked fine for me since. I will see if this new one also "works-for-me", somewhere today :-) -- Sander diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c index ae050df..d1421b0 100644 --- a/xen/drivers/passthrough/io.c +++ b/xen/drivers/passthrough/io.c @@ -804,7 +804,18 @@ static void dpci_softirq(void) d = pirq_dpci->dom; smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */ if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) - BUG(); + { + unsigned long flags; + + /* Put back on the list and retry. */ + local_irq_save(flags); + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); + local_irq_restore(flags); + + raise_softirq(HVM_DPCI_SOFTIRQ); + continue; + } + /* * The one who clears STATE_SCHED MUST refcount the domain. */ ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU. 2015-03-17 8:42 ` Sander Eikelenboom @ 2015-03-17 14:54 ` Konrad Rzeszutek Wilk 2015-03-17 16:01 ` Jan Beulich 0 siblings, 1 reply; 27+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-03-17 14:54 UTC (permalink / raw) To: Sander Eikelenboom Cc: andrew.cooper3, malcolm.crossley, Jan Beulich, xen-devel [-- Attachment #1: Type: text/plain, Size: 3639 bytes --] On Tue, Mar 17, 2015 at 09:42:21AM +0100, Sander Eikelenboom wrote: > > Tuesday, March 17, 2015, 9:18:32 AM, you wrote: > > >>>> On 16.03.15 at 18:59, <konrad.wilk@oracle.com> wrote: > >> Hence was wondering if it would just be easier to put > >> this patch in (see above) - with the benfit that folks have > >> an faster interrupt passthrough experience and then I work on another > >> variant of this with tristate cmpxchg and ->mapping atomic counter. > > > Considering how long this issue has been pending I think we really > > need to get _something_ in (or revert); if this something is the > > patch in its most recent form, so be it (even if maybe not the > > simplest of all possible variants). So please submit as a proper non- > > RFC patch. > > > Jan > > I'm still running with this first simple stopgap patch from Konrad, > and it has worked fine for me since. I believe the patch that Sander and Malcolm had been running is the best candidate. With the other ones I had been fiddling with - such as the one attached here - I cannot make myself comfortable that they will not hit a dead-lock. On Intel hardware the softirq is called from vmx_resume - which means that the whole 'interrupt guest' and event-delivery code runs between VMEXIT and VMENTER. But that does not preclude another interrupt destined for this same vCPU from coming in right as we are progressing through the softirqs - and dead-locking: in the vmx_resume stack we are in hvm_dirq_assist (called from dpci_softirq) and have not yet cleared STATE_SCHED, while in the IRQ stack we spin in raise_softirq_for waiting for STATE_SCHED to be cleared. A dead-lock avoidance could be added: save the CPU number of the softirq that is executing the dpci, and have 'raise_softirq_for' check it and bail out if (smp_processor_id == dpci_pirq->cpu). Naturally this means being very careful about _where_ we initialize the 'cpu' to -1, etc - which brings us back to carefully working out the corner cases and making sure we do the right thing - which can take time. Re-using the 'dpci' on the per-cpu list does the same exact thing the older tasklet code was doing. That is: if the function assigned to the tasklet was running, the softirq that ran said function (hvm_dirq_assist) would be responsible for putting the tasklet back on the per-cpu list. This allowed a running tasklet and a 'to-be-scheduled' tasklet to exist at the same time. And that is what we need. I will post a proper patch and also add Tested-by from Malcolm and Sander on it - as it did fix their test-cases and is unmodified (except an updated comment) from what they tested in 2014. > > I will see if this new one also "works-for-me", somewhere today :-) > > -- > Sander > > > > diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c > index ae050df..d1421b0 100644 > --- a/xen/drivers/passthrough/io.c > +++ b/xen/drivers/passthrough/io.c > @@ -804,7 +804,18 @@ static void dpci_softirq(void) > d = pirq_dpci->dom; > smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */ > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) > - BUG(); > + { > + unsigned long flags; > + > + /* Put back on the list and retry. 
*/ > + local_irq_save(flags); > + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list)); > + local_irq_restore(flags); > + > + raise_softirq(HVM_DPCI_SOFTIRQ); > + continue; > + } > + > /* > * The one who clears STATE_SCHED MUST refcount the domain. > */ > [-- Attachment #2: 0001-dpci-when-scheduling-spin-until-STATE_RUN-or-STATE_S.patch --] [-- Type: text/plain, Size: 3759 bytes --] >From 6b32dccfbe00518d3ca9cd94d19a6e007b2645d9 Mon Sep 17 00:00:00 2001 From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Date: Tue, 17 Mar 2015 09:46:09 -0400 Subject: [PATCH] dpci: when scheduling spin until STATE_RUN or STATE_SCHED has been cleared. There is race when we clear the STATE_SCHED in the softirq - which allows the 'raise_softirq_for' (on another CPU) to schedule the dpci. Specifically this can happen whenthe other CPU receives an interrupt, calls 'raise_softirq_for', and puts the dpci on its per-cpu list (same dpci structure). There would be two 'dpci_softirq' running at the same time (on different CPUs) where on one CPU it would be executing hvm_dirq_assist (so had cleared STATE_SCHED and set STATE_RUN) and on the other CPU it is trying to call: if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) ) BUG(); Since STATE_RUN is already set it would end badly. The reason we can get his with this is when an interrupt affinity is set over multiple CPUs. Potential solutions: a) Instead of the BUG() we can put the dpci back on the per-cpu list to deal with later (when the softirq are activated again). This putting the 'dpci' back on the per-cpu list is an spin until the bad condition clears. b) We could also expand the test-and-set(STATE_SCHED) in raise_softirq_for to detect for 'STATE_RUN' bit being set and schedule the dpci in a more safe manner (delay it). The dpci would stil not be scheduled when STATE_SCHED bit was set. c) This patch explores a third option - we will only schedule the dpci when the state is cleared (no STATE_SCHED and no STATE_RUN). 
We will spin if STATE_RUN is set (as it is in progress and will finish). If the STATE_SCHED is set (so hasn't run yet) we won't try to spin and just exit. This can cause an dead-lock if the interrupt comes when we are processing the dpci in the softirq. Interestingly the old ('tasklet') code used an a) mechanism. If the function assigned to the tasklet was running - the softirq that ran said function (hvm_dirq_assist) would be responsible for putting the tasklet back on the per-cpu list. This would allow to have an running tasklet and an 'to-be-scheduled' tasklet at the same time. This solution moves this 'to-be-scheduled' job to be done in 'raise_softirq_for' (instead of the 'softirq'). Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- xen/drivers/passthrough/io.c | 28 +++++++++++++++++++++++++--- 1 file changed, 25 insertions(+), 3 deletions(-) diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c index ae050df..9c30ebb 100644 --- a/xen/drivers/passthrough/io.c +++ b/xen/drivers/passthrough/io.c @@ -63,10 +63,32 @@ enum { static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) { unsigned long flags; + unsigned long old; - if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) - return; - + /* + * This cmpxchg spins until the state is zero (unused). + */ + for ( ;; ) + { + old = cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED); + switch ( old ) + { + case (1 << STATE_SCHED): + /* + * Whenever STATE_SCHED is set we MUST not schedule it. + */ + return; + case (1 << STATE_RUN) | (1 << STATE_SCHED): + case (1 << STATE_RUN): + /* Getting close to finish. Spin. */ + continue; + } + /* + * If the 'state' is 0 (not in use) we can schedule it. 
+         */
+        if ( old == 0 )
+            break;
+    }
     get_knownalive_domain(pirq_dpci->dom);
 
     local_irq_save(flags);
-- 
2.1.0


[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-03-17 14:54 ` Konrad Rzeszutek Wilk
@ 2015-03-17 16:01 ` Jan Beulich
  2015-03-17 16:09 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2015-03-17 16:01 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: andrew.cooper3, malcolm.crossley, xen-devel, Sander Eikelenboom

>>> On 17.03.15 at 15:54, <konrad.wilk@oracle.com> wrote:
> On Tue, Mar 17, 2015 at 09:42:21AM +0100, Sander Eikelenboom wrote:
>> I'm still running with this first simple stopgap patch from Konrad,
>> and it has worked fine for me since.
>
> I believe the patch that Sander and Malcolm had been running is the best
> candidate.

That's the one Sander had quoted, I suppose? I don't think this is any
better in terms of live locking, and we went through quite some hoops to
get to something that looked more like a fix than a quick workaround.
(If there's nothing we can agree to, we'll have to revert as we did for
4.5.)

Jan
* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-03-17 16:01 ` Jan Beulich
@ 2015-03-17 16:09 ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-03-17 16:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: andrew.cooper3, malcolm.crossley, xen-devel, Sander Eikelenboom

On Tue, Mar 17, 2015 at 04:01:49PM +0000, Jan Beulich wrote:
> >>> On 17.03.15 at 15:54, <konrad.wilk@oracle.com> wrote:
> > On Tue, Mar 17, 2015 at 09:42:21AM +0100, Sander Eikelenboom wrote:
> >> I'm still running with this first simple stopgap patch from Konrad,
> >> and it has worked fine for me since.
> >
> > I believe the patch that Sander and Malcolm had been running is the best
> > candidate.
>
> That's the one Sander had quoted, I suppose? I don't think this is

Correct.

> any better in terms of live locking, and we went through quite some
> hoops to get to something that looked more like a fix than a quick
> workaround. (If there's nothing we can agree to, we'll have to
> revert as we did for 4.5.)

The live-locking does get broken (other softirqs get activated, which
moves things along). Keep in mind that the live-locking scenario already
exists in Xen 4.x with the tasklet implementation.

>
> Jan
end of thread, other threads:[~2015-03-19  7:16 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-12 16:45 [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU Konrad Rzeszutek Wilk
2015-01-12 17:27 ` Sander Eikelenboom
2015-01-12 17:35 ` Konrad Rzeszutek Wilk
2015-01-13 10:26 ` Jan Beulich
2015-01-13 10:20 ` Jan Beulich
2015-01-23  1:44 ` Konrad Rzeszutek Wilk
2015-01-23  9:37 ` Jan Beulich
2015-01-23 14:54 ` Konrad Rzeszutek Wilk
2015-03-17 17:44 ` Konrad Rzeszutek Wilk
2015-03-17 22:16 ` Sander Eikelenboom
2015-03-18  7:41 ` Jan Beulich
2015-03-18 14:06 ` Konrad Rzeszutek Wilk
2015-03-18 16:43 ` Jan Beulich
2015-03-18 17:00 ` Konrad Rzeszutek Wilk
2015-03-19  7:15 ` Jan Beulich
2015-02-02 14:29 ` Konrad Rzeszutek Wilk
2015-02-02 15:19 ` Jan Beulich
2015-02-02 15:31 ` Konrad Rzeszutek Wilk
2015-02-02 15:48 ` Jan Beulich
2015-02-02 17:44 ` Konrad Rzeszutek Wilk
2015-02-03  8:58 ` Jan Beulich
2015-03-16 17:59 ` Konrad Rzeszutek Wilk
2015-03-17  8:18 ` Jan Beulich
2015-03-17  8:42 ` Sander Eikelenboom
2015-03-17 14:54 ` Konrad Rzeszutek Wilk
2015-03-17 16:01 ` Jan Beulich
2015-03-17 16:09 ` Konrad Rzeszutek Wilk