* [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
@ 2015-01-12 16:45 Konrad Rzeszutek Wilk
  2015-01-12 17:27 ` Sander Eikelenboom
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-01-12 16:45 UTC (permalink / raw)
  To: jbeulich, andrew.cooper3, xen-devel, linux, malcolm.crossley
  Cc: Konrad Rzeszutek Wilk

There is a race when we clear STATE_SCHED in the softirq
- which allows 'raise_softirq_for' to progress and
schedule another dpci. During that time the other CPU could
receive an interrupt, call 'raise_softirq_for', and put
the dpci on its per-cpu list. There would be two instances
of 'dpci_softirq' running at the same time (on different
CPUs) where the dpci state is STATE_RUN (and STATE_SCHED
is cleared). This ends up hitting:

 if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
	BUG()

Instead of that, put the dpci back on the per-cpu list to be
dealt with later.

The reason we can hit this is when an interrupt's
affinity is set over multiple CPUs.

Another potential fix would be to add a guard in 'raise_softirq_for'
to check for the 'STATE_RUN' bit being set and not schedule the
dpci until that bit has been cleared.

Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/passthrough/io.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ae050df..9b77334 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -804,7 +804,17 @@ static void dpci_softirq(void)
         d = pirq_dpci->dom;
         smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
         if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
-            BUG();
+        {
+            unsigned long flags;
+
+            /* Put back on the list and retry. */
+            local_irq_save(flags);
+            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+            local_irq_restore(flags);
+
+            raise_softirq(HVM_DPCI_SOFTIRQ);
+            continue;
+        }
         /*
          * The one who clears STATE_SCHED MUST refcount the domain.
          */
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-01-12 16:45 [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU Konrad Rzeszutek Wilk
@ 2015-01-12 17:27 ` Sander Eikelenboom
  2015-01-12 17:35 ` Konrad Rzeszutek Wilk
  2015-01-13 10:20 ` Jan Beulich
  2 siblings, 0 replies; 27+ messages in thread
From: Sander Eikelenboom @ 2015-01-12 17:27 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: andrew.cooper3, malcolm.crossley, jbeulich, xen-devel


Monday, January 12, 2015, 5:45:30 PM, you wrote:

> There is a race when we clear STATE_SCHED in the softirq
> - which allows 'raise_softirq_for' to progress and
> schedule another dpci. During that time the other CPU could
> receive an interrupt, call 'raise_softirq_for', and put
> the dpci on its per-cpu list. There would be two instances
> of 'dpci_softirq' running at the same time (on different
> CPUs) where the dpci state is STATE_RUN (and STATE_SCHED
> is cleared). This ends up hitting:

>  if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
>         BUG()

> Instead of that, put the dpci back on the per-cpu list to be
> dealt with later.

> The reason we can hit this is when an interrupt's
> affinity is set over multiple CPUs.

> Another potential fix would be to add a guard in 'raise_softirq_for'
> to check for the 'STATE_RUN' bit being set and not schedule the
> dpci until that bit has been cleared.

> Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
> Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thanks again for the quick fix, Konrad!

I tested it for the last half hour and everything seems fine,
so you can also stick on a Tested-by if you like.

--
Sander



> ---
>  xen/drivers/passthrough/io.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)

> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index ae050df..9b77334 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -804,7 +804,17 @@ static void dpci_softirq(void)
>          d = pirq_dpci->dom;
>          smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
>          if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> -            BUG();
> +        {
> +            unsigned long flags;
> +
> +            /* Put back on the list and retry. */
> +            local_irq_save(flags);
> +            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
> +            local_irq_restore(flags);
> +
> +            raise_softirq(HVM_DPCI_SOFTIRQ);
> +            continue;
> +        }
>          /*
>           * The one who clears STATE_SCHED MUST refcount the domain.
>           */


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-01-12 16:45 [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU Konrad Rzeszutek Wilk
  2015-01-12 17:27 ` Sander Eikelenboom
@ 2015-01-12 17:35 ` Konrad Rzeszutek Wilk
  2015-01-13 10:26   ` Jan Beulich
  2015-01-13 10:20 ` Jan Beulich
  2 siblings, 1 reply; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-01-12 17:35 UTC (permalink / raw)
  To: jbeulich, andrew.cooper3, xen-devel, linux, malcolm.crossley

On Mon, Jan 12, 2015 at 11:45:30AM -0500, Konrad Rzeszutek Wilk wrote:
> There is a race when we clear STATE_SCHED in the softirq
> - which allows 'raise_softirq_for' to progress and
> schedule another dpci. During that time the other CPU could
> receive an interrupt, call 'raise_softirq_for', and put
> the dpci on its per-cpu list. There would be two instances
> of 'dpci_softirq' running at the same time (on different
> CPUs) where the dpci state is STATE_RUN (and STATE_SCHED
> is cleared). This ends up hitting:
> 
>  if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> 	BUG()
> 
> Instead of that, put the dpci back on the per-cpu list to be
> dealt with later.
> 
> The reason we can hit this is when an interrupt's
> affinity is set over multiple CPUs.
> 
> Another potential fix would be to add a guard in 'raise_softirq_for'
> to check for the 'STATE_RUN' bit being set and not schedule the
> dpci until that bit has been cleared.
> 
> Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
> Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  xen/drivers/passthrough/io.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index ae050df..9b77334 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -804,7 +804,17 @@ static void dpci_softirq(void)
>          d = pirq_dpci->dom;
>          smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
>          if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> -            BUG();
> +        {
> +            unsigned long flags;
> +
> +            /* Put back on the list and retry. */
> +            local_irq_save(flags);
> +            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));

I chatted with Jan on IRC about this, and one worry is that if we add
to our per-cpu list a 'dpci' that is running on another CPU - if
the other CPU runs 'list_del' it will go BOOM.

However, I am not sure I can come up with a scenario in which
this would be triggered - as by the time we get to checking
the STATE_RUN condition the list removal has already been done.
So adding it on the per-cpu list is OK - and since STATE_SCHED
is set, it guards against double-list addition.


CPU1					CPU2					CPU3:
softirq_dpci:
  list_del() from per-cpu-list
  test-and-set(STATE_RUN)
  test-and-clear(STATE_SCHED)
   .. 
					raise_softirq
						test-and-set(STATE_SCHED)
					softirq_dpci
						list_del from per-cpu-list
						if-test-and-set(STATE_RUN) fails:
							list_add_tail(..)
							continue
										raise_softirq
											test-and-set(STATE_SCHED)
											[fails, so no adding]
  ..
					softirq_dpci
						list_del
						if-test-and-set(STATE_RUN) fails:
							list_add_tail(..) [OK, we did the list_del]
							continue..
  clear_bit(STATE_RUN)
					softirq_dpci
						list_del()
  						test-and-set(STATE_RUN)
						test-and-clear(STATE_SCHED)


> +            local_irq_restore(flags);
> +
> +            raise_softirq(HVM_DPCI_SOFTIRQ);
> +            continue;
> +        }
>          /*
>           * The one who clears STATE_SCHED MUST refcount the domain.
>           */
> -- 
> 2.1.0
> 


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-01-12 16:45 [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU Konrad Rzeszutek Wilk
  2015-01-12 17:27 ` Sander Eikelenboom
  2015-01-12 17:35 ` Konrad Rzeszutek Wilk
@ 2015-01-13 10:20 ` Jan Beulich
  2015-01-23  1:44   ` Konrad Rzeszutek Wilk
  2015-02-02 14:29   ` Konrad Rzeszutek Wilk
  2 siblings, 2 replies; 27+ messages in thread
From: Jan Beulich @ 2015-01-13 10:20 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

>>> On 12.01.15 at 17:45, <konrad.wilk@oracle.com> wrote:
> There is a race when we clear STATE_SCHED in the softirq
> - which allows 'raise_softirq_for' to progress and
> schedule another dpci. During that time the other CPU could
> receive an interrupt, call 'raise_softirq_for', and put
> the dpci on its per-cpu list. There would be two instances
> of 'dpci_softirq' running at the same time (on different
> CPUs) where the dpci state is STATE_RUN (and STATE_SCHED
> is cleared). This ends up hitting:
> 
>  if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> 	BUG()
> 
> Instead of that, put the dpci back on the per-cpu list to be
> dealt with later.
> 
> The reason we can hit this is when an interrupt's
> affinity is set over multiple CPUs.
> 
> Another potential fix would be to add a guard in 'raise_softirq_for'
> to check for the 'STATE_RUN' bit being set and not schedule the
> dpci until that bit has been cleared.

I indeed think this should be investigated, because it would make
explicit what ...

> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -804,7 +804,17 @@ static void dpci_softirq(void)
>          d = pirq_dpci->dom;
>          smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
>          if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> -            BUG();
> +        {
> +            unsigned long flags;
> +
> +            /* Put back on the list and retry. */
> +            local_irq_save(flags);
> +            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
> +            local_irq_restore(flags);
> +
> +            raise_softirq(HVM_DPCI_SOFTIRQ);
> +            continue;
> +        }

... this does implicitly - spin until the bad condition cleared.

Additionally I think it should be considered whether the bitmap
approach of interpreting ->state is the right one, and we don't
instead want a clean 3-state (idle, sched, run) model.

Jan


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-01-12 17:35 ` Konrad Rzeszutek Wilk
@ 2015-01-13 10:26   ` Jan Beulich
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Beulich @ 2015-01-13 10:26 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

>>> On 12.01.15 at 18:35, <konrad.wilk@oracle.com> wrote:
>> --- a/xen/drivers/passthrough/io.c
>> +++ b/xen/drivers/passthrough/io.c
>> @@ -804,7 +804,17 @@ static void dpci_softirq(void)
>>          d = pirq_dpci->dom;
>>          smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
>>          if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
>> -            BUG();
>> +        {
>> +            unsigned long flags;
>> +
>> +            /* Put back on the list and retry. */
>> +            local_irq_save(flags);
>> +            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
> 
> I chatted with Jan on IRC about this, and one worry is that if we add
> to our per-cpu list a 'dpci' that is running on another CPU - if
> the other CPU runs 'list_del' it will go BOOM.

Looking at this again, I don't think this particular case is a concern:
The list_del() happens before setting STATE_RUN and clearing
STATE_SCHED, and the race window only opens (and the potential
list_add() can only happen) after this CPU cleared STATE_SCHED.

Jan


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-01-13 10:20 ` Jan Beulich
@ 2015-01-23  1:44   ` Konrad Rzeszutek Wilk
  2015-01-23  9:37     ` Jan Beulich
  2015-02-02 14:29   ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-01-23  1:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

On Tue, Jan 13, 2015 at 10:20:00AM +0000, Jan Beulich wrote:
> >>> On 12.01.15 at 17:45, <konrad.wilk@oracle.com> wrote:
> > There is a race when we clear STATE_SCHED in the softirq
> > - which allows 'raise_softirq_for' to progress and
> > schedule another dpci. During that time the other CPU could
> > receive an interrupt, call 'raise_softirq_for', and put
> > the dpci on its per-cpu list. There would be two instances
> > of 'dpci_softirq' running at the same time (on different
> > CPUs) where the dpci state is STATE_RUN (and STATE_SCHED
> > is cleared). This ends up hitting:
> > 
> >  if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> > 	BUG()
> > 
> > Instead of that, put the dpci back on the per-cpu list to be
> > dealt with later.
> > 
> > The reason we can hit this is when an interrupt's
> > affinity is set over multiple CPUs.
> > 
> > Another potential fix would be to add a guard in 'raise_softirq_for'
> > to check for the 'STATE_RUN' bit being set and not schedule the
> > dpci until that bit has been cleared.
> 
> I indeed think this should be investigated, because it would make
> explicit what ...
> 
> > --- a/xen/drivers/passthrough/io.c
> > +++ b/xen/drivers/passthrough/io.c
> > @@ -804,7 +804,17 @@ static void dpci_softirq(void)
> >          d = pirq_dpci->dom;
> >          smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
> >          if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> > -            BUG();
> > +        {
> > +            unsigned long flags;
> > +
> > +            /* Put back on the list and retry. */
> > +            local_irq_save(flags);
> > +            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
> > +            local_irq_restore(flags);
> > +
> > +            raise_softirq(HVM_DPCI_SOFTIRQ);
> > +            continue;
> > +        }
> 
> ... this does implicitly - spin until the bad condition cleared.

Here is a patch that does this. I don't yet have a setup to test
the failing scenario (working on that). I removed the part in
the softirq because with this patch I cannot see a way it would
ever get there (i.e. the softirq hitting the BUG).

From e53185762ce184e835121ab4fbf7897568c0cdc4 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Mon, 12 Jan 2015 11:35:03 -0500
Subject: [PATCH] dpci: Don't schedule the dpci when it is in STATE_RUN on any
 per-cpu list.

There is a race when we clear STATE_SCHED in the softirq
- which allows 'raise_softirq_for' to progress and
schedule another dpci. During that time the other CPU could
receive an interrupt, call 'raise_softirq_for', and put
the dpci on its per-cpu list. There would be two instances
of 'dpci_softirq' running at the same time (on different
CPUs) where the dpci state is STATE_RUN (and STATE_SCHED
is cleared). This ends up hitting:

 if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
	BUG()

The reason we can hit this is when an interrupt's
affinity is set over multiple CPUs.

Instead of the BUG() we could put the dpci back on the per-cpu
list to be dealt with later (when the softirqs are activated
again). However, putting the 'dpci' back on the per-cpu list
basically spins until the bad condition clears, which is
not nice.

Hence this patch adds a guard in raise_softirq_for to
check for the 'STATE_RUN' bit being set and not schedule the dpci
until that bit is cleared.

We also moved the insertion of the dpci on the list in
'raise_softirq_for' into the case statement, to make it
easier to read.

Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/passthrough/io.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ae050df..802127a 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -64,16 +64,24 @@ static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
 {
     unsigned long flags;
 
-    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+    switch ( cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED) )
+    {
+    case (1 << STATE_SCHED):
+    case (1 << STATE_RUN):
+    case (1 << STATE_RUN) | (1 << STATE_SCHED):
         return;
+    case 0:
+        get_knownalive_domain(pirq_dpci->dom);
 
-    get_knownalive_domain(pirq_dpci->dom);
+        local_irq_save(flags);
+        list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+        local_irq_restore(flags);
 
-    local_irq_save(flags);
-    list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
-    local_irq_restore(flags);
-
-    raise_softirq(HVM_DPCI_SOFTIRQ);
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+        break;
+    default:
+        BUG();
+    }
 }
 
 /*
-- 
2.1.0

> 
> Additionally I think it should be considered whether the bitmap
> approach of interpreting ->state is the right one, and we don't
> instead want a clean 3-state (idle, sched, run) model.

Could you elaborate a bit more, please? As in three different unsigned int
(or bool_t) values that indicate what state we are in?
> 
> Jan
> 


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-01-23  1:44   ` Konrad Rzeszutek Wilk
@ 2015-01-23  9:37     ` Jan Beulich
  2015-01-23 14:54       ` Konrad Rzeszutek Wilk
  2015-03-17 17:44       ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 27+ messages in thread
From: Jan Beulich @ 2015-01-23  9:37 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

>>> On 23.01.15 at 02:44, <konrad.wilk@oracle.com> wrote:
> Here is a patch that does this. I don't yet have a setup to test
> the failing scenario (working on that). I removed the part in
> the softirq because with this patch I cannot see a way it would
> ever get there (i.e. the softirq hitting the BUG).

Hmm, but how do you now schedule the second instance that needed ...

> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -64,16 +64,24 @@ static void raise_softirq_for(struct hvm_pirq_dpci 
> *pirq_dpci)
>  {
>      unsigned long flags;
>  
> -    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> +    switch ( cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED) )
> +    {
> +    case (1 << STATE_SCHED):
> +    case (1 << STATE_RUN):

... in this case?

>> Additionally I think it should be considered whether the bitmap
>> approach of interpreting ->state is the right one, and we don't
>> instead want a clean 3-state (idle, sched, run) model.
> 
> Could you elaborate a bit more please? As in three different unsigned int
> (or bool_t) that set in what state we are in?

An enum { STATE_IDLE, STATE_SCHED, STATE_RUN }. Especially
if my comment above turns out to be wrong, you'd have no real
need for the SCHED and RUN flags to be set at the same time.

Jan


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-01-23  9:37     ` Jan Beulich
@ 2015-01-23 14:54       ` Konrad Rzeszutek Wilk
  2015-03-17 17:44       ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-01-23 14:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

On Fri, Jan 23, 2015 at 09:37:53AM +0000, Jan Beulich wrote:
> >>> On 23.01.15 at 02:44, <konrad.wilk@oracle.com> wrote:
> > Here is a patch that does this. I don't yet have an setup to test
> > the failing scenario (working on that). I removed the part in
> > the softirq because with this patch I cannot see a way it would
> > ever get there (in the softirq hitting the BUG). 
> 
> Hmm, but how do you now schedule the second instance that needed ...
> 
> > --- a/xen/drivers/passthrough/io.c
> > +++ b/xen/drivers/passthrough/io.c
> > @@ -64,16 +64,24 @@ static void raise_softirq_for(struct hvm_pirq_dpci 
> > *pirq_dpci)
> >  {
> >      unsigned long flags;
> >  
> > -    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> > +    switch ( cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED) )
> > +    {
> > +    case (1 << STATE_SCHED):
> > +    case (1 << STATE_RUN):
> 
> ... in this case?

We do need that. I was under the false understanding that the old
code would just drop it if it was running. However, the old
(tasklet-only) code followed the logic that:
 a) If it is scheduled (on the list), then don't put it there.
 b) If it is running (t->running), then don't put it there. However,
    once the function is done (in do_tasklet_work) we detect
    that we tried to schedule it (t->scheduled_on is set to a
    non-zero value) - and we put it back on the list.

I need to think about this some more.
> 
> >> Additionally I think it should be considered whether the bitmap
> >> approach of interpreting ->state is the right one, and we don't
> >> instead want a clean 3-state (idle, sched, run) model.
> > 
> > Could you elaborate a bit more please? As in three different unsigned int
> > (or bool_t) that set in what state we are in?
> 
> An enum { STATE_IDLE, STATE_SCHED, STATE_RUN }. Especially
> if my comment above turns out to be wrong, you'd have no real
> need for the SCHED and RUN flags to be set at the same time.

Seems we do need these four states: scheduled; running; scheduling while running; and idle.

> 
> Jan
> 


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-01-13 10:20 ` Jan Beulich
  2015-01-23  1:44   ` Konrad Rzeszutek Wilk
@ 2015-02-02 14:29   ` Konrad Rzeszutek Wilk
  2015-02-02 15:19     ` Jan Beulich
  1 sibling, 1 reply; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-02-02 14:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

On Tue, Jan 13, 2015 at 10:20:00AM +0000, Jan Beulich wrote:
> >>> On 12.01.15 at 17:45, <konrad.wilk@oracle.com> wrote:
> > There is a race when we clear STATE_SCHED in the softirq
> > - which allows 'raise_softirq_for' to progress and
> > schedule another dpci. During that time the other CPU could
> > receive an interrupt, call 'raise_softirq_for', and put
> > the dpci on its per-cpu list. There would be two instances
> > of 'dpci_softirq' running at the same time (on different
> > CPUs) where the dpci state is STATE_RUN (and STATE_SCHED
> > is cleared). This ends up hitting:
> > 
> >  if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> > 	BUG()
> > 
> > Instead of that, put the dpci back on the per-cpu list to be
> > dealt with later.
> > 
> > The reason we can hit this is when an interrupt's
> > affinity is set over multiple CPUs.
> > 
> > Another potential fix would be to add a guard in 'raise_softirq_for'
> > to check for the 'STATE_RUN' bit being set and not schedule the
> > dpci until that bit has been cleared.
> 
> I indeed think this should be investigated, because it would make
> explicit what ...
> 
> > --- a/xen/drivers/passthrough/io.c
> > +++ b/xen/drivers/passthrough/io.c
> > @@ -804,7 +804,17 @@ static void dpci_softirq(void)
> >          d = pirq_dpci->dom;
> >          smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
> >          if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> > -            BUG();
> > +        {
> > +            unsigned long flags;
> > +
> > +            /* Put back on the list and retry. */
> > +            local_irq_save(flags);
> > +            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
> > +            local_irq_restore(flags);
> > +
> > +            raise_softirq(HVM_DPCI_SOFTIRQ);
> > +            continue;
> > +        }
> 
> ... this does implicitly - spin until the bad condition cleared.

I still haven't gotten around to trying to reproduce the issues Malcolm
saw, which would be easier for me to do (using Intel super-fast storage
in a Windows guest) - since I have one of those boxes.

However, in lieu of that, here is a patch that does pass my testing
of SR-IOV, and I believe it should work fine.  I _think_ it covers
what you want it to, Jan, but of course please correct me
if I made a mistake in the logic.

From 20766b144e0b11daf3227ae978880bacd686ea2b Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Mon, 12 Jan 2015 11:35:03 -0500
Subject: [PATCH] dpci: schedule properly the dpci when it is in STATE_RUN on
 any per-cpu list.

There is a race when we clear STATE_SCHED in the softirq
- which allows 'raise_softirq_for' to progress and
schedule another dpci. During that time the other CPU could
receive an interrupt, call 'raise_softirq_for', and put
the dpci on its per-cpu list. There would be two instances
of 'dpci_softirq' running at the same time (on different
CPUs) where the dpci state is STATE_RUN (and STATE_SCHED
is cleared). This ends up hitting:

 if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
	BUG()

The reason we can hit this is when an interrupt's
affinity is set over multiple CPUs.

Instead of the BUG() we can put the dpci back on the per-cpu
list to deal with later (when the softirqs are activated again).
Putting the 'dpci' back on the per-cpu list effectively spins
until the bad condition clears.

We also expand the test-and-set(STATE_SCHED) in raise_softirq_for
to detect the 'STATE_RUN' bit being set and schedule the dpci.
The case in which a dpci is NOT going to be scheduled is
whenever the 'STATE_SCHED' bit is set (regardless of whether
STATE_RUN is set or not).

Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/passthrough/io.c | 43 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 40 insertions(+), 3 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ae050df..b61b73c 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -63,10 +63,37 @@ enum {
 static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
 {
     unsigned long flags;
+    unsigned long old, new, val = pirq_dpci->state;
 
-    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
-        return;
+    /*
+     * This cmpxchg is a clearer version of:
+     * if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+     *         return;
+     * since it also checks for STATE_RUN conditions.
+     */
+    for ( ;; )
+    {
+        new = 1 << STATE_SCHED;
+        if ( val )
+            new |= val;
 
+        old = cmpxchg(&pirq_dpci->state, val, new);
+        switch ( old )
+        {
+        case (1 << STATE_SCHED):
+        case (1 << STATE_RUN) | (1 << STATE_SCHED):
+            /*
+             * Whenever STATE_SCHED is set we MUST not schedule it.
+             */
+            return;
+        }
+        /*
+         * If the 'state' is 0 or (1 << STATE_RUN) we can schedule it.
+         */
+        if ( old == val )
+            break;
+        val = old;
+    }
     get_knownalive_domain(pirq_dpci->dom);
 
     local_irq_save(flags);
@@ -804,7 +831,17 @@ static void dpci_softirq(void)
         d = pirq_dpci->dom;
         smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
         if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
-            BUG();
+        {
+            unsigned long flags;
+
+            /* Put back on the list and retry later. */
+            local_irq_save(flags);
+            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+            local_irq_restore(flags);
+
+            raise_softirq(HVM_DPCI_SOFTIRQ);
+            continue;
+        }
         /*
          * The one who clears STATE_SCHED MUST refcount the domain.
          */
-- 
2.1.0


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-02-02 14:29   ` Konrad Rzeszutek Wilk
@ 2015-02-02 15:19     ` Jan Beulich
  2015-02-02 15:31       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2015-02-02 15:19 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

>>> On 02.02.15 at 15:29, <konrad.wilk@oracle.com> wrote:
> On Tue, Jan 13, 2015 at 10:20:00AM +0000, Jan Beulich wrote:
>> >>> On 12.01.15 at 17:45, <konrad.wilk@oracle.com> wrote:
>> > There is a race when we clear STATE_SCHED in the softirq
>> > - which allows 'raise_softirq_for' to progress and
>> > schedule another dpci. During that time the other CPU could
>> > receive an interrupt, call 'raise_softirq_for', and put
>> > the dpci on its per-cpu list. There would be two instances
>> > of 'dpci_softirq' running at the same time (on different
>> > CPUs) where the dpci state is STATE_RUN (and STATE_SCHED
>> > is cleared). This ends up hitting:
>> > 
>> >  if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
>> > 	BUG()
>> > 
>> > Instead of that, put the dpci back on the per-cpu list to be
>> > dealt with later.
>> > 
>> > The reason we can hit this is when an interrupt's
>> > affinity is set over multiple CPUs.
>> > 
>> > Another potential fix would be to add a guard in 'raise_softirq_for'
>> > to check for the 'STATE_RUN' bit being set and not schedule the
>> > dpci until that bit has been cleared.
>> 
>> I indeed think this should be investigated, because it would make
>> explicit what ...
>> 
>> > --- a/xen/drivers/passthrough/io.c
>> > +++ b/xen/drivers/passthrough/io.c
>> > @@ -804,7 +804,17 @@ static void dpci_softirq(void)
>> >          d = pirq_dpci->dom;
>> >          smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
>> >          if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
>> > -            BUG();
>> > +        {
>> > +            unsigned long flags;
>> > +
>> > +            /* Put back on the list and retry. */
>> > +            local_irq_save(flags);
>> > +            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
>> > +            local_irq_restore(flags);
>> > +
>> > +            raise_softirq(HVM_DPCI_SOFTIRQ);
>> > +            continue;
>> > +        }
>> 
>> ... this does implicitly - spin until the bad condition cleared.
> 
> I still haven't gotten to trying to reproduce the issues Malcolm
> saw which is easier for me to do (use Intel super-fast storage
> in a Windows guest) - since I've one of those boxes.
> 
> However in lieu of that, here is a patch that does pass my testing
> of SR-IOV, and I believe should work fine.  I _think_ it covers
> what you want it to have Jan, but of course please correct me
> if I made a mistake in the logic.

Since the code quoted above is still there in the new patch, the new
patch can at best be half of what I suggested.

> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -63,10 +63,37 @@ enum {
>  static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
>  {
>      unsigned long flags;
> +    unsigned long old, new, val = pirq_dpci->state;
>  
> -    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> -        return;
> +    /*
> +     * This cmpxch is a more clear version of:
> +     * if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> +     *         return;
> +     * since it also checks for STATE_RUN conditions.
> +     */
> +    for ( ;; )
> +    {
> +        new = 1 << STATE_SCHED;
> +        if ( val )
> +            new |= val;

Why the if()?

>  
> +        old = cmpxchg(&pirq_dpci->state, val, new);
> +        switch ( old )
> +        {
> +        case (1 << STATE_SCHED):
> +        case (1 << STATE_RUN) | (1 << STATE_SCHED):
> +            /*
> +             * Whenever STATE_SCHED is set we MUST not schedule it.
> +             */
> +            return;
> +        }
> +        /*
> +         * If the 'state' is 0 or (1 << STATE_RUN) we can schedule it.
> +         */

Really? Wasn't the intention to _not_ schedule when STATE_RUN is
set? (Also the above is a one-line comment, i.e. it wants a different style.)

I.e. with what you do now you could as well keep the old code.

Jan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-02-02 15:19     ` Jan Beulich
@ 2015-02-02 15:31       ` Konrad Rzeszutek Wilk
  2015-02-02 15:48         ` Jan Beulich
  0 siblings, 1 reply; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-02-02 15:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

On Mon, Feb 02, 2015 at 03:19:33PM +0000, Jan Beulich wrote:
> >>> On 02.02.15 at 15:29, <konrad.wilk@oracle.com> wrote:
> > On Tue, Jan 13, 2015 at 10:20:00AM +0000, Jan Beulich wrote:
> >> >>> On 12.01.15 at 17:45, <konrad.wilk@oracle.com> wrote:
> >> > There is race when we clear the STATE_SCHED in the softirq
> >> > - which allows the 'raise_softirq_for' to progress and
> >> > schedule another dpci. During that time the other CPU could
> >> > receive an interrupt and calls 'raise_softirq_for' and put
> >> > the dpci on its per-cpu list. There would be two 'dpci_softirq'
> >> > running at the same time (on different CPUs) where the
> >> > dpci state is STATE_RUN (and STATE_SCHED is cleared). This
> >> > ends up hitting:
> >> > 
> >> >  if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> >> > 	BUG()
> >> > 
> >> > Instead of that put the dpci back on the per-cpu list to deal
> >> > with later.
> >> > 
> >> > The reason we can get his with this is when an interrupt
> >> > affinity is set over multiple CPUs.
> >> > 
> >> > Another potential fix would be to add a guard in the raise_softirq_for
> >> > to check for 'STATE_RUN' bit being set and not schedule the
> >> > dpci until that bit has been cleared.
> >> 
> >> I indeed think this should be investigated, because it would make
> >> explicit what ...
> >> 
> >> > --- a/xen/drivers/passthrough/io.c
> >> > +++ b/xen/drivers/passthrough/io.c
> >> > @@ -804,7 +804,17 @@ static void dpci_softirq(void)
> >> >          d = pirq_dpci->dom;
> >> >          smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
> >> >          if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> >> > -            BUG();
> >> > +        {
> >> > +            unsigned long flags;
> >> > +
> >> > +            /* Put back on the list and retry. */
> >> > +            local_irq_save(flags);
> >> > +            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
> >> > +            local_irq_restore(flags);
> >> > +
> >> > +            raise_softirq(HVM_DPCI_SOFTIRQ);
> >> > +            continue;
> >> > +        }
> >> 
> >> ... this does implicitly - spin until the bad condition cleared.
> > 
> > I still haven't gotten to trying to reproduce the issues Malcolm
> > saw which is easier for me to do (use Intel super-fast storage
> > in a Windows guest) - since I've one of those boxes.
> > 
> > However in lieu of that, here is a patch that does pass my testing
> > of SR-IOV, and I believe should work fine.  I _think_ it covers
> > what you want it to have Jan, but of course please correct me
> > if I made a mistake in the logic.
> 
> Since the code quoted above is still there in the new patch, the new
> patch can at best be half of what I suggested.
> 
> > --- a/xen/drivers/passthrough/io.c
> > +++ b/xen/drivers/passthrough/io.c
> > @@ -63,10 +63,37 @@ enum {
> >  static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
> >  {
> >      unsigned long flags;
> > +    unsigned long old, new, val = pirq_dpci->state;
> >  
> > -    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> > -        return;
> > +    /*
> > +     * This cmpxch is a more clear version of:
> > +     * if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> > +     *         return;
> > +     * since it also checks for STATE_RUN conditions.
> > +     */
> > +    for ( ;; )
> > +    {
> > +        new = 1 << STATE_SCHED;
> > +        if ( val )
> > +            new |= val;
> 
> Why the if()?

To 'or' the variable new with '1 << STATE_RUN' in case 'val' changed from
the first read ('val = pirq_dpci->state') to the moment when
we do the cmpxchg.

> 
> >  
> > +        old = cmpxchg(&pirq_dpci->state, val, new);
> > +        switch ( old )
> > +        {
> > +        case (1 << STATE_SCHED):
> > +        case (1 << STATE_RUN) | (1 << STATE_SCHED):
> > +            /*
> > +             * Whenever STATE_SCHED is set we MUST not schedule it.
> > +             */
> > +            return;
> > +        }
> > +        /*
> > +         * If the 'state' is 0 or (1 << STATE_RUN) we can schedule it.
> > +         */
> 
> Really? Wasn't the intention to _not_ schedule when STATE_RUN is
> set? (Also the above is a only line comment, i.e. wants different style.)

I must confess I misread your last review. You said:

	> Here is a patch that does this. I don't yet have an setup to test
	> the failing scenario (working on that). I removed the part in
	> the softirq because with this patch I cannot see a way it would
	> ever get there (in the softirq hitting the BUG).

	Hmm, but how do you now schedule the second instance that needed ...

	> +    case (1 << STATE_RUN):

	... in this case?

which to me implied you still want to schedule a 'dpci' when STATE_RUN is set?

My thinking is that we should still schedule a 'dpci' when STATE_RUN is set,
as that is in line with what the old tasklet code did. It would
schedule the tasklet the moment said tasklet was off the global list (but
it would not add it to the global list - that would be the job of the
tasklet function wrapper to detect and insert said tasklet back on
the global list).


> 
> I.e. with what you do now you could as well keep the old code.
> 
> Jan
> 


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-02-02 15:31       ` Konrad Rzeszutek Wilk
@ 2015-02-02 15:48         ` Jan Beulich
  2015-02-02 17:44           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2015-02-02 15:48 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

>>> On 02.02.15 at 16:31, <konrad.wilk@oracle.com> wrote:
> On Mon, Feb 02, 2015 at 03:19:33PM +0000, Jan Beulich wrote:
>> >>> On 02.02.15 at 15:29, <konrad.wilk@oracle.com> wrote:
>> > --- a/xen/drivers/passthrough/io.c
>> > +++ b/xen/drivers/passthrough/io.c
>> > @@ -63,10 +63,37 @@ enum {
>> >  static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
>> >  {
>> >      unsigned long flags;
>> > +    unsigned long old, new, val = pirq_dpci->state;
>> >  
>> > -    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
>> > -        return;
>> > +    /*
>> > +     * This cmpxch is a more clear version of:
>> > +     * if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
>> > +     *         return;
>> > +     * since it also checks for STATE_RUN conditions.
>> > +     */
>> > +    for ( ;; )
>> > +    {
>> > +        new = 1 << STATE_SCHED;
>> > +        if ( val )
>> > +            new |= val;
>> 
>> Why the if()?
> 
> To 'or' the variable new with '1 << STATE_RUN' in case 'val' changed from
> the first read ('val = pirq_dpci->state') to the moment when
> we do the cmpxchg.

But if "val" is zero, the | simply will do nothing.

>> > +        old = cmpxchg(&pirq_dpci->state, val, new);
>> > +        switch ( old )
>> > +        {
>> > +        case (1 << STATE_SCHED):
>> > +        case (1 << STATE_RUN) | (1 << STATE_SCHED):
>> > +            /*
>> > +             * Whenever STATE_SCHED is set we MUST not schedule it.
>> > +             */
>> > +            return;
>> > +        }
>> > +        /*
>> > +         * If the 'state' is 0 or (1 << STATE_RUN) we can schedule it.
>> > +         */
>> 
>> Really? Wasn't the intention to _not_ schedule when STATE_RUN is
>> set? (Also the above is a only line comment, i.e. wants different style.)
> 
> I must confess I must have misread your last review. You said:
> 
> 	> Here is a patch that does this. I don't yet have an setup to test
> 	> the failing scenario (working on that). I removed the part in
> 	> the softirq because with this patch I cannot see a way it would
> 	> ever get there (in the softirq hitting the BUG).
> 
> 	Hmm, but how do you now schedule the second instance that needed ...
> 
> 	> +    case (1 << STATE_RUN):
> 
> 	... in this case?
> 
> which to me implied you still want to schedule an 'dpci' when STATE_RUN is 
> set?
> 
> My thinking is that we should still schedule an 'dpci' when STATE_RUN is set
> as that is inline with what the old tasklet code did. It would
> schedule the tasklet the moment said tasklet was off the global list (but
> it would not add it in the global list - that would be the job of the
> tasklet function wrapper to detect and insert said tasklet back on
> the global list).

Didn't the original discussion (and issue) revolve around scheduling
while STATE_RUN was set? Hence the intention to wait for the flag
to clear - but preferably in an explicit rather than implicit (as your
current and previous patch do) manner.

Jan


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-02-02 15:48         ` Jan Beulich
@ 2015-02-02 17:44           ` Konrad Rzeszutek Wilk
  2015-02-03  8:58             ` Jan Beulich
  0 siblings, 1 reply; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-02-02 17:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

On Mon, Feb 02, 2015 at 03:48:19PM +0000, Jan Beulich wrote:
> >>> On 02.02.15 at 16:31, <konrad.wilk@oracle.com> wrote:
> > On Mon, Feb 02, 2015 at 03:19:33PM +0000, Jan Beulich wrote:
> >> >>> On 02.02.15 at 15:29, <konrad.wilk@oracle.com> wrote:
> >> > --- a/xen/drivers/passthrough/io.c
> >> > +++ b/xen/drivers/passthrough/io.c
> >> > @@ -63,10 +63,37 @@ enum {
> >> >  static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
> >> >  {
> >> >      unsigned long flags;
> >> > +    unsigned long old, new, val = pirq_dpci->state;
> >> >  
> >> > -    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> >> > -        return;
> >> > +    /*
> >> > +     * This cmpxch is a more clear version of:
> >> > +     * if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> >> > +     *         return;
> >> > +     * since it also checks for STATE_RUN conditions.
> >> > +     */
> >> > +    for ( ;; )
> >> > +    {
> >> > +        new = 1 << STATE_SCHED;
> >> > +        if ( val )
> >> > +            new |= val;
> >> 
> >> Why the if()?
> > 
> > To 'or' the variable new with '1 << STATE_RUN' in case 'val' changed from
> > the first read ('val = pirq_dpci->state') to the moment when
> > we do the cmpxchg.
> 
> But if "val" is zero, the | simply will do nothing.

Correct. Keep in mind that 'new' is set to '1 << STATE_SCHED' at every
loop iteration - so it ends up as old = cmpxchg(.., 0, 1 << STATE_SCHED)
(and old == 0, val == 0, so we end up breaking out of the loop).


> 
> >> > +        old = cmpxchg(&pirq_dpci->state, val, new);
> >> > +        switch ( old )
> >> > +        {
> >> > +        case (1 << STATE_SCHED):
> >> > +        case (1 << STATE_RUN) | (1 << STATE_SCHED):
> >> > +            /*
> >> > +             * Whenever STATE_SCHED is set we MUST not schedule it.
> >> > +             */
> >> > +            return;
> >> > +        }
> >> > +        /*
> >> > +         * If the 'state' is 0 or (1 << STATE_RUN) we can schedule it.
> >> > +         */
> >> 
> >> Really? Wasn't the intention to _not_ schedule when STATE_RUN is
> >> set? (Also the above is a only line comment, i.e. wants different style.)
> > 
> > I must confess I must have misread your last review. You said:
> > 
> > 	> Here is a patch that does this. I don't yet have an setup to test
> > 	> the failing scenario (working on that). I removed the part in
> > 	> the softirq because with this patch I cannot see a way it would
> > 	> ever get there (in the softirq hitting the BUG).
> > 
> > 	Hmm, but how do you now schedule the second instance that needed ...
> > 
> > 	> +    case (1 << STATE_RUN):
> > 
> > 	... in this case?
> > 
> > which to me implied you still want to schedule an 'dpci' when STATE_RUN is 
> > set?
> > 
> > My thinking is that we should still schedule an 'dpci' when STATE_RUN is set
> > as that is inline with what the old tasklet code did. It would
> > schedule the tasklet the moment said tasklet was off the global list (but
> > it would not add it in the global list - that would be the job of the
> > tasklet function wrapper to detect and insert said tasklet back on
> > the global list).
> 
> Didn't the original discussion (and issue) revolve around scheduling
> while STATE_RUN was set? Hence the intention to wait for the flag

Yes.
> to clear - but preferably in an explicit rather than implicit (as your
> current and previous patch do) manner.

If we do it explicitly we run the risk of a dead-lock. See below for a draft
(not even compile-tested) of what I think you mean.

The issue is that 'raise_softirq_for' ends up being called from do_IRQ.
And we might have an IRQ coming in just as we are in dpci_softirq,
having just set STATE_SCHED. Our spin-and-wait in raise_softirq_for
(in the code below) will spin forever.

One way to not get into that quagmire is to, well, do what the previous
patches did.

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ae050df..706a636 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -63,10 +63,35 @@ enum {
 static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
 {
     unsigned long flags;
+    unsigned long old, new, val = pirq_dpci->state;
 
-    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
-        return;
-
+    for ( ;; )
+    {
+        old = cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED);
+        switch ( old )
+        {
+        case (1 << STATE_SCHED):
+            /*
+             * Whenever STATE_SCHED is set we MUST not schedule it.
+             */
+            return;
+        case (1 << STATE_RUN) | (1 << STATE_SCHED):
+        case (1 << STATE_RUN):
+            /* Getting close to finish. Spin. */
+            continue;
+        }
+        /*
+         * If the 'state' is 0 we can schedule it.
+         */
+        if ( old == 0 )
+            break;
+    }
     get_knownalive_domain(pirq_dpci->dom);
 
     local_irq_save(flags);
> 
> Jan
> 


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-02-02 17:44           ` Konrad Rzeszutek Wilk
@ 2015-02-03  8:58             ` Jan Beulich
  2015-03-16 17:59               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2015-02-03  8:58 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

>>> On 02.02.15 at 18:44, <konrad.wilk@oracle.com> wrote:
> On Mon, Feb 02, 2015 at 03:48:19PM +0000, Jan Beulich wrote:
>> >>> On 02.02.15 at 16:31, <konrad.wilk@oracle.com> wrote:
>> > On Mon, Feb 02, 2015 at 03:19:33PM +0000, Jan Beulich wrote:
>> >> >>> On 02.02.15 at 15:29, <konrad.wilk@oracle.com> wrote:
>> >> > --- a/xen/drivers/passthrough/io.c
>> >> > +++ b/xen/drivers/passthrough/io.c
>> >> > @@ -63,10 +63,37 @@ enum {
>> >> >  static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
>> >> >  {
>> >> >      unsigned long flags;
>> >> > +    unsigned long old, new, val = pirq_dpci->state;
>> >> >  
>> >> > -    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
>> >> > -        return;
>> >> > +    /*
>> >> > +     * This cmpxch is a more clear version of:
>> >> > +     * if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
>> >> > +     *         return;
>> >> > +     * since it also checks for STATE_RUN conditions.
>> >> > +     */
>> >> > +    for ( ;; )
>> >> > +    {
>> >> > +        new = 1 << STATE_SCHED;
>> >> > +        if ( val )
>> >> > +            new |= val;
>> >> 
>> >> Why the if()?
>> > 
>> > To 'or' the variable new with '1 << STATE_RUN' in case 'val' changed from
>> > the first read ('val = pirq_dpci->state') to the moment when
>> > we do the cmpxchg.
>> 
>> But if "val" is zero, the | simply will do nothing.
> 
> Correct. Keep in mind that 'new' is set to '1 << STATE_SCHED' at every
> loop iteration - so it ends up old = cmpxchg(.., 0, 1 << STATE_SCHED)
> (and old == 0, val == 0, so we end up breaking out of the loop).

Not sure what you're trying to explain to me here. The code you
have is equivalent to

+        new = (1 << STATE_SCHED) | val;

no matter what.
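A quick standalone check (illustrative plain C, same bit numbers as in
io.c) shows the two forms agree for every possible state value:

```c
#include <assert.h>

/* Bit numbers as used in xen/drivers/passthrough/io.c. */
enum { STATE_SCHED, STATE_RUN };

/* What the patch computes, with the redundant if(). */
static unsigned long new_with_if(unsigned long val)
{
    unsigned long new = 1 << STATE_SCHED;

    if ( val )
        new |= val;
    return new;
}

/* The equivalent form without the branch: OR-ing with zero is a no-op. */
static unsigned long new_without_if(unsigned long val)
{
    return (1 << STATE_SCHED) | val;
}
```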

>> Didn't the original discussion (and issue) revolve around scheduling
>> while STATE_RUN was set? Hence the intention to wait for the flag
> 
> Yes.
>> to clear - but preferably in an explicit rather than implicit (as your
>> current and previous patch do) manner.
> 
> If we do explicitly we run risk of dead-lock. See below of an draft
> (not even compiled tested) of what I think you mean.

That's no different from the code you proposed before, just that
the live-lock is better hidden there: By re-raising a softirq from a
softirq handler, you arrange for yourself to be called again right
away.

> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -63,10 +63,35 @@ enum {
>  static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
>  {
>      unsigned long flags;
> +    unsigned long old, new, val = pirq_dpci->state;
>  
> -    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> -        return;
> -
> +    for ( ;; )
> +    {
> +        old = cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED);
> +        switch ( old )
> +        {
> +        case (1 << STATE_SCHED):
> +            /*
> +             * Whenever STATE_SCHED is set we MUST not schedule it.
> +             */
> +            return;
> +        case (1 << STATE_RUN) | (1 << STATE_SCHED):
> +        case (1 << STATE_RUN):
> +            /* Getting close to finish. Spin. */
> +            continue;
> +        }
> +        /*
> +         * If the 'state' is 0 we can schedule it.
> +         */
> +        if ( old == 0 )
> +            break;
> +    }
>      get_knownalive_domain(pirq_dpci->dom);
>  
>      local_irq_save(flags);

Yes, this looks better. And I again wonder whether STATE_*
couldn't simply become a tristate - dpci_softirq(), rather than setting
STATE_RUN and then clearing STATE_SCHED, could simply
cmpxchg(, STATE_SCHED, STATE_RUN), acting as necessary when
that operation fails.
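That tristate could be sketched like this - a single-threaded mock with
illustrative names, where mock_cmpxchg() stands in for Xen's cmpxchg();
this is not actual Xen code, just the shape of the state machine:

```c
#include <assert.h>

/* Hypothetical tristate: the state holds exactly one of these values. */
enum { STATE_IDLE, STATE_SCHED, STATE_RUN };

/* Single-threaded stand-in for Xen's cmpxchg(): swap iff *p == old,
 * always return the value found. */
static int mock_cmpxchg(int *p, int old, int new)
{
    int found = *p;

    if ( found == old )
        *p = new;
    return found;
}

/* raise_softirq_for() path: only an idle dpci may be scheduled. */
static int try_schedule(int *state)
{
    return mock_cmpxchg(state, STATE_IDLE, STATE_SCHED) == STATE_IDLE;
}

/* dpci_softirq() path: move SCHED -> RUN in one step instead of setting
 * STATE_RUN and then clearing STATE_SCHED; a failure tells the caller
 * to act as necessary. */
static int try_run(int *state)
{
    return mock_cmpxchg(state, STATE_SCHED, STATE_RUN) == STATE_SCHED;
}
```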

Jan


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-02-03  8:58             ` Jan Beulich
@ 2015-03-16 17:59               ` Konrad Rzeszutek Wilk
  2015-03-17  8:18                 ` Jan Beulich
  0 siblings, 1 reply; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-03-16 17:59 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, linux, xen-devel

> > If we do explicitly we run risk of dead-lock. See below of an draft
> > (not even compiled tested) of what I think you mean.
> 
> That's no different from the code you proposed before, just that
> the live-lock is better hidden there: By re-raising a softirq from a
> softirq handler, you arrange for yourself to be called again right
> away.

Interestingly enough, this live-lock issue has been there since
the start (when this code was written).
The tasklet-based implementation (Xen 4.5 and earlier) allows this:


CPU0                            CPU1                     CPU2

hvm_do_IRQ_dpci()
 - tasklet_schedule()
   [t->scheduled_on = 0]
   if (!t->is_running)
	put tasklet on per-cpu list. raise softirq


 . do_softirq
    [do_tasklet_work]
    t->scheduled_on = -1
    t->is_running = 1
    unlock the lock
     -> call hvm_dirq_assist
                                do_IRQ
                                 hvm_do_IRQ_dpci
                                t->scheduled_on = 1;
				if (!t->is_running).. [it is 1, so skip]

                                			do_IRQ
                                 			 hvm_do_IRQ_dpci
                                			 t->scheduled_on = 2;
							 if (!t->is_running).. [it is 1, so skip]

   takes the lock
    if (t->scheduled_on >= 0)
         tasklet_enqueue:
          put tasklet on per-cpu list, raise softirq.

   [do_tasklet_work]
   .. repeat the story. Inject interrupts right as 'hvm_dirq_assist' is executing.

With interrupts happening right as the dpci work is being done we can
cause an exact live-lock.

It also suffers from "skipping" interrupts if they span more
than one CPU. The implementation I had does not address this issue
either (it only picks up the 'second' interrupt, while the older
code picks up the 'last' one).
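The tasklet behaviour in the trace above boils down to a few lines
(single-threaded mock, illustrative names; the real logic lives in
xen/common/tasklet.c, and the run is split into begin/end halves here
only so the mid-run raise can be simulated):

```c
#include <assert.h>

struct mock_tasklet {
    int scheduled_on;   /* CPU that asked for a (re)run, or -1 */
    int is_running;
    int enqueued;       /* on some per-cpu list, softirq raised */
    int runs;           /* how many times the function has run */
};

/* tasklet_schedule(): while the function runs, only note the request. */
static void mock_schedule(struct mock_tasklet *t, int cpu)
{
    t->scheduled_on = cpu;
    if ( !t->is_running )
        t->enqueued = 1;
}

/* First half of do_tasklet_work(): take it off the list and run it. */
static void mock_work_begin(struct mock_tasklet *t)
{
    t->enqueued = 0;
    t->scheduled_on = -1;
    t->is_running = 1;
    t->runs++;              /* hvm_dirq_assist() would run here */
}

/* Second half: re-enqueue if another request arrived in the meantime. */
static void mock_work_end(struct mock_tasklet *t)
{
    t->is_running = 0;
    if ( t->scheduled_on >= 0 )
        t->enqueued = 1;    /* tasklet_enqueue(): back on the list */
}
```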

Anyhow, the patch below makes the code, from a behavior standpoint,
the same as what we have in Xen 4.5 and earlier (with warts).
The difference is that the warts are so much faster that you don't
see them :-)
> 
> > --- a/xen/drivers/passthrough/io.c
> > +++ b/xen/drivers/passthrough/io.c
> > @@ -63,10 +63,35 @@ enum {
> >  static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
> >  {
> >      unsigned long flags;
> > +    unsigned long old, new, val = pirq_dpci->state;
> >  
> > -    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> > -        return;
> > -
> > +    for ( ;; )
> > +    {
> > +        old = cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED);
> > +        switch ( old )
> > +        {
> > +        case (1 << STATE_SCHED):
> > +            /*
> > +             * Whenever STATE_SCHED is set we MUST not schedule it.
> > +             */
> > +            return;
> > +        case (1 << STATE_RUN) | (1 << STATE_SCHED):
> > +        case (1 << STATE_RUN):
> > +            /* Getting close to finish. Spin. */
> > +            continue;
> > +        }
> > +        /*
> > +         * If the 'state' is 0 we can schedule it.
> > +         */
> > +        if ( old == 0 )
> > +            break;
> > +    }
> >      get_knownalive_domain(pirq_dpci->dom);
> >  
> >      local_irq_save(flags);
> 
> Yes, this looks better. And I again wonder whether STATE_*
> couldn't simply become a tristate - dpci_softirq(), rather than setting
> STATE_RUN and then clearing STATE_SCHED, could simply
> cmpxchg(, STATE_SCHED, STATE_RUN), acting as necessary when
> that operation fails.

One issue that needs to be figured out is what to do when
we have an interrupt mini-storm and need to queue it up.

One option could be to convert ->mapping to a
simple atomic counter - incremented every time
hvm_do_IRQ_dpci is called (and hvm_dirq_assist would
have a while loop decrementing it).
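Roughly like this - C11 atomics standing in for Xen's atomic ops, and the
names are hypothetical stand-ins, not the actual fields:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for pirq_dpci->mapping as an atomic counter. */
static atomic_int pending;
static int handled;     /* interrupts actually serviced */

/* hvm_do_IRQ_dpci() path: each incoming interrupt just bumps the counter. */
static void mock_raise(void)
{
    atomic_fetch_add(&pending, 1);
}

/* hvm_dirq_assist() path: drain the counter, so interrupts that arrive
 * mid-run are coalesced into extra loop turns instead of being lost. */
static void mock_assist(void)
{
    int n;

    while ( (n = atomic_exchange(&pending, 0)) > 0 )
        handled += n;
}
```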

However, as previous reviews demonstrated, there are a lot
of subtle things that need to be taken care of:
 - Legacy interrupts and their 'reset ->mapping to zero' behavior when
   the interrupt timer has elapsed.
 - Dealing with error states when assigning a PIRQ, or during
   the time window between creating/destroying, and having
   to make sure that the state in dpci_softirq is sane.

That took me two months to get right, and only with other
folks testing and finding issues. While I am totally up for
restarting this to make it awesome, I am also aghast at my
progress on it and fear that this might take us _right_ past
the feature freeze window.

Hence I was wondering if it would just be easier to put
this patch in (see above) - with the benefit that folks get
a faster interrupt passthrough experience - and then I work on another
variant of this with the tristate cmpxchg and the ->mapping atomic counter.

Thoughts?
> 
> Jan
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-03-16 17:59               ` Konrad Rzeszutek Wilk
@ 2015-03-17  8:18                 ` Jan Beulich
  2015-03-17  8:42                   ` Sander Eikelenboom
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2015-03-17  8:18 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

>>> On 16.03.15 at 18:59, <konrad.wilk@oracle.com> wrote:
> Hence was wondering if it would just be easier to put
> this patch in (see above) - with the benfit that folks have
> an faster interrupt passthrough experience and then I work on another
> variant of this with tristate cmpxchg and ->mapping atomic counter.

Considering how long this issue has been pending I think we really
need to get _something_ in (or revert); if this something is the
patch in its most recent form, so be it (even if maybe not the
simplest of all possible variants). So please submit as a proper non-
RFC patch.

Jan


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-03-17  8:18                 ` Jan Beulich
@ 2015-03-17  8:42                   ` Sander Eikelenboom
  2015-03-17 14:54                     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 27+ messages in thread
From: Sander Eikelenboom @ 2015-03-17  8:42 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel


Tuesday, March 17, 2015, 9:18:32 AM, you wrote:

>>>> On 16.03.15 at 18:59, <konrad.wilk@oracle.com> wrote:
>> Hence was wondering if it would just be easier to put
>> this patch in (see above) - with the benfit that folks have
>> an faster interrupt passthrough experience and then I work on another
>> variant of this with tristate cmpxchg and ->mapping atomic counter.

> Considering how long this issue has been pending I think we really
> need to get _something_ in (or revert); if this something is the
> patch in its most recent form, so be it (even if maybe not the
> simplest of all possible variants). So please submit as a proper non-
> RFC patch.

> Jan

I'm still running with this first simple stopgap patch from Konrad,
and it has worked fine for me since.

I will see if this new one also "works-for-me", somewhere today :-)

--
Sander



diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ae050df..d1421b0 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -804,7 +804,18 @@ static void dpci_softirq(void)
         d = pirq_dpci->dom;
         smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
         if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
-            BUG();
+        {
+            unsigned long flags;
+
+            /* Put back on the list and retry. */
+            local_irq_save(flags);
+            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+            local_irq_restore(flags);
+
+            raise_softirq(HVM_DPCI_SOFTIRQ);
+            continue;
+        }
+
         /*
          * The one who clears STATE_SCHED MUST refcount the domain.
          */


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-03-17  8:42                   ` Sander Eikelenboom
@ 2015-03-17 14:54                     ` Konrad Rzeszutek Wilk
  2015-03-17 16:01                       ` Jan Beulich
  0 siblings, 1 reply; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-03-17 14:54 UTC (permalink / raw)
  To: Sander Eikelenboom
  Cc: andrew.cooper3, malcolm.crossley, Jan Beulich, xen-devel

[-- Attachment #1: Type: text/plain, Size: 3639 bytes --]

On Tue, Mar 17, 2015 at 09:42:21AM +0100, Sander Eikelenboom wrote:
> 
> Tuesday, March 17, 2015, 9:18:32 AM, you wrote:
> 
> >>>> On 16.03.15 at 18:59, <konrad.wilk@oracle.com> wrote:
> >> Hence was wondering if it would just be easier to put
> >> this patch in (see above) - with the benfit that folks have
> >> an faster interrupt passthrough experience and then I work on another
> >> variant of this with tristate cmpxchg and ->mapping atomic counter.
> 
> > Considering how long this issue has been pending I think we really
> > need to get _something_ in (or revert); if this something is the
> > patch in its most recent form, so be it (even if maybe not the
> > simplest of all possible variants). So please submit as a proper non-
> > RFC patch.
> 
> > Jan
> 
> I'm still running with this first simple stopgap patch from Konrad,
> and it has worked fine for me since.

I believe the patch that Sander and Malcolm had been running is the best
candidate.

The other ones I had been fiddling with - such as the one attached
here - I cannot make myself comfortable that they will not hit
a dead-lock. On Intel hardware the softirq is called
from vmx_resume - which means that the whole 'interrupt guest' and
deliver-the-event code runs between VMEXIT and VMENTER. But that
does not preclude another interrupt destined for this same vCPU
coming right in as we are progressing through the softirqs - and
dead-locking: in the vmx_resume stack we are in hvm_dirq_assist
(called from dpci_softirq) and haven't cleared STATE_SCHED, while in
the IRQ stack we spin in raise_softirq_for waiting for STATE_SCHED to be cleared.

A dead-lock avoidance mechanism could be added: save the CPU on which
the softirq is executing the dpci, and then 'raise_softirq_for'
can check that and bail out if (smp_processor_id() == pirq_dpci->cpu).
Naturally this means being very careful about _where_ we initialize
the 'cpu' field to -1, etc. - which brings us back to carefully working
out the corner cases and making sure we do the right thing - which can
take time.
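The bail-out check itself would be tiny (a sketch with a hypothetical
'cpu' field and helper names - the corner cases above are the hard part):

```c
#include <assert.h>

/* Sketch: remember which CPU is running dpci_softirq() for this dpci and
 * refuse to spin when the raiser is that very CPU, i.e. when the IRQ
 * interrupted our own softirq handler. Field/helpers are hypothetical. */
struct mock_dpci {
    int cpu;            /* CPU currently executing the dpci, or -1 */
};

/* Would guard the spin loop in raise_softirq_for(); 'this_cpu' stands in
 * for smp_processor_id(). */
static int may_spin_waiting(const struct mock_dpci *d, int this_cpu)
{
    return d->cpu != this_cpu;
}
```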

Re-using the 'dpci' on the per-cpu list does the exact same
thing the older tasklet code was doing. That is:
if the function assigned to the tasklet was running, the softirq
that ran said function (hvm_dirq_assist) would be responsible for
putting the tasklet back on the per-cpu list. This allowed
having a running tasklet and a 'to-be-scheduled' tasklet
at the same time.

And that is what we need. I will post an proper patch
and also add Tested-by from Malcom and Sander on it - as it did
fix their test-cases and is unmodified (except an updated
comment) from what theytested in 2014.

> 
> I will see if this new one also "works-for-me", somewhere today :-)
> 
> --
> Sander
> 
> 
> 
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index ae050df..d1421b0 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -804,7 +804,18 @@ static void dpci_softirq(void)
>          d = pirq_dpci->dom;
>          smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
>          if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> -            BUG();
> +        {
> +            unsigned long flags;
> +
> +            /* Put back on the list and retry. */
> +            local_irq_save(flags);
> +            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
> +            local_irq_restore(flags);
> +
> +            raise_softirq(HVM_DPCI_SOFTIRQ);
> +            continue;
> +        }
> +
>          /*
>           * The one who clears STATE_SCHED MUST refcount the domain.
>           */
> 

[-- Attachment #2: 0001-dpci-when-scheduling-spin-until-STATE_RUN-or-STATE_S.patch --]
[-- Type: text/plain, Size: 3759 bytes --]

From 6b32dccfbe00518d3ca9cd94d19a6e007b2645d9 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Tue, 17 Mar 2015 09:46:09 -0400
Subject: [PATCH] dpci: when scheduling spin until STATE_RUN or STATE_SCHED has
 been cleared.

There is race when we clear the STATE_SCHED in the softirq
- which allows the 'raise_softirq_for' (on another CPU)
to schedule the dpci.

Specifically this can happen when the other CPU receives
an interrupt, calls 'raise_softirq_for', and puts the dpci
on its per-cpu list (same dpci structure).

There would be two 'dpci_softirq' running at the same time
(on different CPUs) where on one CPU it would be executing
hvm_dirq_assist (so had cleared STATE_SCHED and set STATE_RUN)
and on the other CPU it is trying to call:

   if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
    	BUG();

Since STATE_RUN is already set it would end badly.

The way we can hit this is when an interrupt
affinity is set over multiple CPUs.

Potential solutions:

a) Instead of the BUG() we can put the dpci back on the per-cpu
list to deal with later (when the softirqs are activated again).
Putting the 'dpci' back on the per-cpu list amounts to a spin
until the bad condition clears.

b) We could also expand the test-and-set(STATE_SCHED) in raise_softirq_for
to detect the 'STATE_RUN' bit being set and schedule the dpci
in a safer manner (delay it). The dpci would still not
be scheduled when the STATE_SCHED bit was set.

c) This patch explores a third option - we will only schedule
the dpci when the state is cleared (no STATE_SCHED and no STATE_RUN).

We will spin if STATE_RUN is set (as it is in progress and will
finish). If STATE_SCHED is set (so it hasn't run yet) we won't
try to spin and just exit. This can cause a dead-lock if the interrupt
comes in while we are processing the dpci in the softirq.

Interestingly, the old ('tasklet') code used mechanism a).
If the function assigned to the tasklet was running, the softirq
that ran said function (hvm_dirq_assist) would be responsible for
putting the tasklet back on the per-cpu list. This allowed having
a running tasklet and a 'to-be-scheduled' tasklet at the same
time. This solution moves the 'to-be-scheduled' job into
'raise_softirq_for' (instead of the softirq).

Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/passthrough/io.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ae050df..9c30ebb 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -63,10 +63,32 @@ enum {
 static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
 {
     unsigned long flags;
+    unsigned long old;
 
-    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
-        return;
-
+    /*
+     * This cmpxchg spins until the state is zero (unused).
+     */
+    for ( ;; )
+    {
+        old = cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED);
+        switch ( old )
+        {
+        case (1 << STATE_SCHED):
+            /*
+             * Whenever STATE_SCHED is set we MUST not schedule it.
+             */
+            return;
+        case (1 << STATE_RUN) | (1 << STATE_SCHED):
+        case (1 << STATE_RUN):
+            /* Getting close to finish. Spin. */
+            continue;
+        }
+        /*
+         * If the 'state' is 0 (not in use) we can schedule it.
+         */
+        if ( old == 0 )
+            break;
+    }
     get_knownalive_domain(pirq_dpci->dom);
 
     local_irq_save(flags);
-- 
2.1.0


[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-03-17 14:54                     ` Konrad Rzeszutek Wilk
@ 2015-03-17 16:01                       ` Jan Beulich
  2015-03-17 16:09                         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2015-03-17 16:01 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: andrew.cooper3, malcolm.crossley, xen-devel, Sander Eikelenboom

>>> On 17.03.15 at 15:54, <konrad.wilk@oracle.com> wrote:
> On Tue, Mar 17, 2015 at 09:42:21AM +0100, Sander Eikelenboom wrote:
>> I'm still running with this first simple stopgap patch from Konrad,
>> and it has worked fine for me since.
> 
> I believe the patch that Sander and Malcom had been running is the best
> candidate.

That's the one Sander had quoted I suppose? I don't think this is
any better in terms of live locking, and we jumped through quite some
hoops to get to something that looked more like a fix than a quick
workaround. (If there's nothing we can agree to, we'll have to
revert as we did for 4.5.)

Jan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-03-17 16:01                       ` Jan Beulich
@ 2015-03-17 16:09                         ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-03-17 16:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: andrew.cooper3, malcolm.crossley, xen-devel, Sander Eikelenboom

On Tue, Mar 17, 2015 at 04:01:49PM +0000, Jan Beulich wrote:
> >>> On 17.03.15 at 15:54, <konrad.wilk@oracle.com> wrote:
> > On Tue, Mar 17, 2015 at 09:42:21AM +0100, Sander Eikelenboom wrote:
> >> I'm still running with this first simple stopgap patch from Konrad,
> >> and it has worked fine for me since.
> > 
> > I believe the patch that Sander and Malcom had been running is the best
> > candidate.
> 
> That's the one Sander had quoted I suppose? I don't think this is

Correct.
> any better in terms of live locking, and we went quite some hoops
> to get to something that looked more like a fix than a quick
> workaround. (If there's nothing we can agree to, we'll have to
> revert as we did for 4.5.)

The live-locking does get broken (other softirqs get activated which
moves things along). Keep in mind that the live-locking scenario
exists already in Xen 4.x with the tasklet implementation.

> 
> Jan
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-01-23  9:37     ` Jan Beulich
  2015-01-23 14:54       ` Konrad Rzeszutek Wilk
@ 2015-03-17 17:44       ` Konrad Rzeszutek Wilk
  2015-03-17 22:16         ` Sander Eikelenboom
  2015-03-18  7:41         ` Jan Beulich
  1 sibling, 2 replies; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-03-17 17:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux


> >> Additionally I think it should be considered whether the bitmap
> >> approach of interpreting ->state is the right one, and we don't
> >> instead want a clean 3-state (idle, sched, run) model.
> > 
> > Could you elaborate a bit more please? As in three different unsigned int
> > (or bool_t) that set in what state we are in?
> 
> An enum { STATE_IDLE, STATE_SCHED, STATE_RUN }. Especially
> if my comment above turns out to be wrong, you'd have no real
> need for the SCHED and RUN flags to be set at the same time.

I cobbled together what I believe is what you were thinking of.

As you can see, to preserve the existing functionality - such as
being able to schedule N interrupt injections for the N interrupts
we might get - I modified '->masked' to be an atomic counter.

The end result is that we can still live-lock. Unless we:
 - Drop on the floor the injection of N interrupts and
   just deliver at most one per VMX_EXIT (and not bother
   with interrupts arriving while we are in the VMX handler).

 - Alter the softirq code slightly - to have a variant
   which will only iterate once over the pending softirq
   bits per call (so save a copy of the bitmap on the
   stack when entering the softirq handler - and use that.
   We could also XOR it against the current one to catch any
   non-duplicate bits being set that we should deal with).

Here is the compile, but not run-time tested patch.

From e7d8bcd7c5d32c520554a4ad69c4716246036002 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Tue, 17 Mar 2015 13:31:52 -0400
Subject: [RFC PATCH] dpci: Switch to tristate instead of bitmap

*TODO*:
 - Writeup.
 - Tests

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/passthrough/io.c | 140 ++++++++++++++++++++++++-------------------
 xen/include/xen/hvm/irq.h    |   4 +-
 2 files changed, 82 insertions(+), 62 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ae050df..663e104 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -30,42 +30,28 @@
 static DEFINE_PER_CPU(struct list_head, dpci_list);
 
 /*
- * These two bit states help to safely schedule, deschedule, and wait until
- * the softirq has finished.
- *
- * The semantics behind these two bits is as follow:
- *  - STATE_SCHED - whoever modifies it has to ref-count the domain (->dom).
- *  - STATE_RUN - only softirq is allowed to set and clear it. If it has
- *      been set hvm_dirq_assist will RUN with a saved value of the
- *      'struct domain' copied from 'pirq_dpci->dom' before STATE_RUN was set.
- *
- * The usual states are: STATE_SCHED(set) -> STATE_RUN(set) ->
- * STATE_SCHED(unset) -> STATE_RUN(unset).
- *
- * However the states can also diverge such as: STATE_SCHED(set) ->
- * STATE_SCHED(unset) -> STATE_RUN(set) -> STATE_RUN(unset). That means
- * the 'hvm_dirq_assist' never run and that the softirq did not do any
- * ref-counting.
- */
-
-enum {
-    STATE_SCHED,
-    STATE_RUN
-};
-
-/*
  * This can be called multiple times, but the softirq is only raised once.
- * That is until the STATE_SCHED state has been cleared. The state can be
- * cleared by: the 'dpci_softirq' (when it has executed 'hvm_dirq_assist'),
- * or by 'pt_pirq_softirq_reset' (which will try to clear the state before
+ * That is until state is in init. The state can be changed  by:
+ * the 'dpci_softirq' (when it has executed 'hvm_dirq_assist'),
+ * or by 'pt_pirq_softirq_reset' (which will try to init the state before
  * the softirq had a chance to run).
  */
 static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
 {
     unsigned long flags;
 
-    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+    switch ( cmpxchg(&pirq_dpci->state, STATE_INIT, STATE_SCHED) )
+    {
+    case STATE_RUN:
+    case STATE_SCHED:
+        /*
+         * The pirq_dpci->mapping has been incremented to let us know
+         * how many we have left to do.
+         */
         return;
+    case STATE_INIT:
+        break;
+    }
 
     get_knownalive_domain(pirq_dpci->dom);
 
@@ -85,7 +71,7 @@ static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
  */
 bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
 {
-    if ( pirq_dpci->state & ((1 << STATE_RUN) | (1 << STATE_SCHED)) )
+    if ( pirq_dpci->state != STATE_INIT )
         return 1;
 
     /*
@@ -109,22 +95,22 @@ static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
 
     ASSERT(spin_is_locked(&d->event_lock));
 
-    switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) )
+    switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, STATE_INIT) )
     {
-    case (1 << STATE_SCHED):
+    case STATE_SCHED:
         /*
-         * We are going to try to de-schedule the softirq before it goes in
-         * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
+         * We are going to try to de-schedule the softirq before it goes to
+         * running state. Whoever moves from sched MUST refcount the 'dom'.
          */
         put_domain(d);
         /* fallthrough. */
-    case (1 << STATE_RUN):
-    case (1 << STATE_RUN) | (1 << STATE_SCHED):
+    case STATE_RUN:
+    case STATE_INIT:
         /*
-         * The reason it is OK to reset 'dom' when STATE_RUN bit is set is due
+         * The reason it is OK to reset 'dom' when init or running is set is due
          * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom'
-         * in local variable before it sets STATE_RUN - and therefore will not
-         * dereference '->dom' which would crash.
+         * in local variable before it sets state to running - and therefore
+         * will not dereference '->dom' which would crash.
          */
         pirq_dpci->dom = NULL;
         break;
@@ -135,7 +121,7 @@ static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
      * or never initialized it). Note that we hold the lock that
      * 'hvm_dirq_assist' could be spinning on.
      */
-    pirq_dpci->masked = 0;
+    atomic_set(&pirq_dpci->masked, 0);
 }
 
 bool_t pt_irq_need_timer(uint32_t flags)
@@ -149,7 +135,8 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
     if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT,
                               &pirq_dpci->flags) )
     {
-        pirq_dpci->masked = 0;
+        ASSERT(atomic_read(&pirq_dpci->masked) <= 1);
+        atomic_set(&pirq_dpci->masked, 0);
         pirq_dpci->pending = 0;
         pirq_guest_eoi(dpci_pirq(pirq_dpci));
     }
@@ -278,6 +265,8 @@ int pt_irq_create_bind(
              * As such on every 'pt_irq_create_bind' call we MUST set it.
              */
             pirq_dpci->dom = d;
+            atomic_set(&pirq_dpci->masked, 0);
+            pirq_dpci->state = STATE_INIT;
             /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/
             rc = pirq_guest_bind(d->vcpu[0], info, 0);
             if ( rc == 0 && pt_irq_bind->u.msi.gtable )
@@ -398,6 +387,8 @@ int pt_irq_create_bind(
             if ( pt_irq_need_timer(pirq_dpci->flags) )
                 init_timer(&pirq_dpci->timer, pt_irq_time_out, pirq_dpci, 0);
             /* Deal with gsi for legacy devices */
+            atomic_set(&pirq_dpci->masked, 0);
+            pirq_dpci->state = STATE_INIT;
             rc = pirq_guest_bind(d->vcpu[0], info, share);
             if ( unlikely(rc) )
             {
@@ -576,6 +567,7 @@ bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
     if ( !dpci->flags && !pt_pirq_softirq_active(dpci) )
     {
         dpci->dom = NULL;
+        dpci->state = STATE_INIT;
         return 1;
     }
     return 0;
@@ -617,7 +609,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
          !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         return 0;
 
-    pirq_dpci->masked = 1;
+    atomic_inc(&pirq_dpci->masked);
     raise_softirq_for(pirq_dpci);
     return 1;
 }
@@ -672,12 +664,13 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
+static bool_t hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
 {
+    int more;
     ASSERT(d->arch.hvm_domain.irq.dpci);
 
     spin_lock(&d->event_lock);
-    if ( test_and_clear_bool(pirq_dpci->masked) )
+    while ( (more = !atomic_dec_and_test(&pirq_dpci->masked)) )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
         const struct dev_intx_gsi_link *digl;
@@ -687,17 +680,13 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
             send_guest_pirq(d, pirq);
 
             if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
-            {
-                spin_unlock(&d->event_lock);
-                return;
-            }
+                break;
         }
 
         if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
         {
             vmsi_deliver_pirq(d, pirq_dpci);
-            spin_unlock(&d->event_lock);
-            return;
+            break;
         }
 
         list_for_each_entry ( digl, &pirq_dpci->digl_list, list )
@@ -710,8 +699,7 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
         {
             /* for translated MSI to INTx interrupt, eoi as early as possible */
             __msi_pirq_eoi(pirq_dpci);
-            spin_unlock(&d->event_lock);
-            return;
+            break;
         }
 
         /*
@@ -725,6 +713,8 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
         set_timer(&pirq_dpci->timer, NOW() + PT_IRQ_TIME_OUT);
     }
     spin_unlock(&d->event_lock);
+
+    return more;
 }
 
 static void __hvm_dpci_eoi(struct domain *d,
@@ -781,7 +771,7 @@ unlock:
 }
 
 /*
- * Note: 'pt_pirq_softirq_reset' can clear the STATE_SCHED before we get to
+ * Note: 'pt_pirq_softirq_reset' can move the state to init before we get to
  * doing it. If that is the case we let 'pt_pirq_softirq_reset' do ref-counting.
  */
 static void dpci_softirq(void)
@@ -797,23 +787,53 @@ static void dpci_softirq(void)
     {
         struct hvm_pirq_dpci *pirq_dpci;
         struct domain *d;
+        bool_t again = 0;
 
         pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
         list_del(&pirq_dpci->softirq_list);
 
         d = pirq_dpci->dom;
         smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
-        if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
-            BUG();
-        /*
-         * The one who clears STATE_SCHED MUST refcount the domain.
-         */
-        if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
+
+        switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, STATE_RUN) )
+        {
+        case STATE_INIT:
+            /*  pt_pirq_softirq_reset cleared it. Let it do the ref-counting. */
+            continue;
+        case STATE_RUN:
+            again = 1;
+            /* Fall through. */
+        case STATE_SCHED:
+            break;
+        }
+        if ( !again )
         {
-            hvm_dirq_assist(d, pirq_dpci);
+            /*
+            * The one who changes sched to running MUST refcount the domain.
+            */
+            again = hvm_dirq_assist(d, pirq_dpci);
             put_domain(d);
+            switch ( cmpxchg(&pirq_dpci->state, STATE_RUN, STATE_INIT) )
+            {
+            case STATE_SCHED:
+            case STATE_INIT:
+                BUG();
+            case STATE_RUN:
+                break;
+            }
+        }
+        if ( again )
+        {
+            unsigned long flags;
+
+            /* Put back on the list and retry. */
+            local_irq_save(flags);
+            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+            local_irq_restore(flags);
+
+            raise_softirq(HVM_DPCI_SOFTIRQ);
+            continue;
         }
-        clear_bit(STATE_RUN, &pirq_dpci->state);
     }
 }
 
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 3996f1f..48dbc51 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -93,8 +93,8 @@ struct hvm_irq_dpci {
 /* Machine IRQ to guest device/intx mapping. */
 struct hvm_pirq_dpci {
     uint32_t flags;
-    unsigned int state;
-    bool_t masked;
+    enum { STATE_INIT, STATE_SCHED, STATE_RUN } state;
+    atomic_t masked;
     uint16_t pending;
     struct list_head digl_list;
     struct domain *dom;
-- 
2.1.0

> 
> Jan
> 

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-03-17 17:44       ` Konrad Rzeszutek Wilk
@ 2015-03-17 22:16         ` Sander Eikelenboom
  2015-03-18  7:41         ` Jan Beulich
  1 sibling, 0 replies; 27+ messages in thread
From: Sander Eikelenboom @ 2015-03-17 22:16 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: andrew.cooper3, malcolm.crossley, Jan Beulich, xen-devel


Tuesday, March 17, 2015, 6:44:54 PM, you wrote:


>> >> Additionally I think it should be considered whether the bitmap
>> >> approach of interpreting ->state is the right one, and we don't
>> >> instead want a clean 3-state (idle, sched, run) model.
>> > 
>> > Could you elaborate a bit more please? As in three different unsigned int
>> > (or bool_t) that set in what state we are in?
>> 
>> An enum { STATE_IDLE, STATE_SCHED, STATE_RUN }. Especially
>> if my comment above turns out to be wrong, you'd have no real
>> need for the SCHED and RUN flags to be set at the same time.

> I cobbled what I believe is what you were thinking off.

> As you can see to preserve the existing functionality such as
> being able to schedule N amount of interrupt injections
> for the N interrupts we might get - I modified '->masked'
> to be an atomic counter.

> The end result is that we can still live-lock. Unless we:
>  - Drop on the floor the injection of N interrupts and
>    just deliever at max one per VMX_EXIT (and not bother
>    with interrupts arriving when we are in the VMX handler).

>  - Alter the softirq code slightly - to have an variant
>    which will only iterate once over the pending softirq
>    bits per call. (so save an copy of the bitmap on the
>    stack when entering the softirq handler - and use that.
>    We could also xor it against the current to catch any
>    non-duplicate bits being set that we should deal with).

> Here is the compile, but not run-time tested patch.

> From e7d8bcd7c5d32c520554a4ad69c4716246036002 Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Date: Tue, 17 Mar 2015 13:31:52 -0400
> Subject: [RFC PATCH] dpci: Switch to tristate instead of bitmap

> *TODO*:
>  - Writeup.
>  - Tests

Done, and unfortunately it doesn't fly...
Some devices seem to work fine, but others don't receive any interrupts
shortly after boot, like:
 40:          3          0          0          0  xen-pirq-ioapic-level  cx25821[1]

Don't see any crashes or errors though, so it seems to silently lock somewhere.

--
Sander

> Suggested-by: Jan Beulich <jbuelich@suse.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  xen/drivers/passthrough/io.c | 140 ++++++++++++++++++++++++-------------------
>  xen/include/xen/hvm/irq.h    |   4 +-
>  2 files changed, 82 insertions(+), 62 deletions(-)

> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index ae050df..663e104 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -30,42 +30,28 @@
>  static DEFINE_PER_CPU(struct list_head, dpci_list);
>  
>  /*
> - * These two bit states help to safely schedule, deschedule, and wait until
> - * the softirq has finished.
> - *
> - * The semantics behind these two bits is as follow:
> - *  - STATE_SCHED - whoever modifies it has to ref-count the domain (->dom).
> - *  - STATE_RUN - only softirq is allowed to set and clear it. If it has
> - *      been set hvm_dirq_assist will RUN with a saved value of the
> - *      'struct domain' copied from 'pirq_dpci->dom' before STATE_RUN was set.
> - *
> - * The usual states are: STATE_SCHED(set) -> STATE_RUN(set) ->
> - * STATE_SCHED(unset) -> STATE_RUN(unset).
> - *
> - * However the states can also diverge such as: STATE_SCHED(set) ->
> - * STATE_SCHED(unset) -> STATE_RUN(set) -> STATE_RUN(unset). That means
> - * the 'hvm_dirq_assist' never run and that the softirq did not do any
> - * ref-counting.
> - */
> -
> -enum {
> -    STATE_SCHED,
> -    STATE_RUN
> -};
> -
> -/*
>   * This can be called multiple times, but the softirq is only raised once.
> - * That is until the STATE_SCHED state has been cleared. The state can be
> - * cleared by: the 'dpci_softirq' (when it has executed 'hvm_dirq_assist'),
> - * or by 'pt_pirq_softirq_reset' (which will try to clear the state before
> + * That is until state is in init. The state can be changed  by:
> + * the 'dpci_softirq' (when it has executed 'hvm_dirq_assist'),
> + * or by 'pt_pirq_softirq_reset' (which will try to init the state before
>   * the softirq had a chance to run).
>   */
>  static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
>  {
>      unsigned long flags;
>  
> -    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> +    switch ( cmpxchg(&pirq_dpci->state, STATE_INIT, STATE_SCHED) )
> +    {
> +    case STATE_RUN:
> +    case STATE_SCHED:
> +        /*
> +         * The pirq_dpci->mapping has been incremented to let us know
> +         * how many we have left to do.
> +         */
>          return;
> +    case STATE_INIT:
> +        break;
> +    }
>  
>      get_knownalive_domain(pirq_dpci->dom);
>  
> @@ -85,7 +71,7 @@ static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
>   */
>  bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
>  {
> -    if ( pirq_dpci->state & ((1 << STATE_RUN) | (1 << STATE_SCHED)) )
> +    if ( pirq_dpci->state != STATE_INIT )
>          return 1;
>  
>      /*
> @@ -109,22 +95,22 @@ static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
>  
>      ASSERT(spin_is_locked(&d->event_lock));
>  
> -    switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) )
> +    switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, STATE_INIT) )
>      {
> -    case (1 << STATE_SCHED):
> +    case STATE_SCHED:
>          /*
> -         * We are going to try to de-schedule the softirq before it goes in
> -         * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
> +         * We are going to try to de-schedule the softirq before it goes to
> +         * running state. Whoever moves from sched MUST refcount the 'dom'.
>           */
>          put_domain(d);
>          /* fallthrough. */
> -    case (1 << STATE_RUN):
> -    case (1 << STATE_RUN) | (1 << STATE_SCHED):
> +    case STATE_RUN:
> +    case STATE_INIT:
>          /*
> -         * The reason it is OK to reset 'dom' when STATE_RUN bit is set is due
> +         * The reason it is OK to reset 'dom' when init or running is set is due
>           * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom'
> -         * in local variable before it sets STATE_RUN - and therefore will not
> -         * dereference '->dom' which would crash.
> +         * in local variable before it sets state to running - and therefore
> +         * will not dereference '->dom' which would crash.
>           */
>          pirq_dpci->dom = NULL;
>          break;
> @@ -135,7 +121,7 @@ static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
>       * or never initialized it). Note that we hold the lock that
>       * 'hvm_dirq_assist' could be spinning on.
>       */
> -    pirq_dpci->masked = 0;
> +    atomic_set(&pirq_dpci->masked, 0);
>  }
>  
>  bool_t pt_irq_need_timer(uint32_t flags)
> @@ -149,7 +135,8 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
>      if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT,
>                                &pirq_dpci->flags) )
>      {
> -        pirq_dpci->masked = 0;
> +        ASSERT(atomic_read(&pirq_dpci->masked) <= 1);
> +        atomic_set(&pirq_dpci->masked, 0);
>          pirq_dpci->pending = 0;
>          pirq_guest_eoi(dpci_pirq(pirq_dpci));
>      }
> @@ -278,6 +265,8 @@ int pt_irq_create_bind(
>               * As such on every 'pt_irq_create_bind' call we MUST set it.
>               */
>              pirq_dpci->dom = d;
> +            atomic_set(&pirq_dpci->masked, 0);
> +            pirq_dpci->state = STATE_INIT;
>              /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/
>              rc = pirq_guest_bind(d->vcpu[0], info, 0);
>              if ( rc == 0 && pt_irq_bind->u.msi.gtable )
> @@ -398,6 +387,8 @@ int pt_irq_create_bind(
>              if ( pt_irq_need_timer(pirq_dpci->flags) )
>                  init_timer(&pirq_dpci->timer, pt_irq_time_out, pirq_dpci, 0);
>              /* Deal with gsi for legacy devices */
> +            atomic_set(&pirq_dpci->masked, 0);
> +            pirq_dpci->state = STATE_INIT;
>              rc = pirq_guest_bind(d->vcpu[0], info, share);
>              if ( unlikely(rc) )
>              {
> @@ -576,6 +567,7 @@ bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
>      if ( !dpci->flags && !pt_pirq_softirq_active(dpci) )
>      {
>          dpci->dom = NULL;
> +        dpci->state = STATE_INIT;
>          return 1;
>      }
>      return 0;
> @@ -617,7 +609,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
>           !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
>          return 0;
>  
> -    pirq_dpci->masked = 1;
> +    atomic_inc(&pirq_dpci->masked);
>      raise_softirq_for(pirq_dpci);
>      return 1;
>  }
> @@ -672,12 +664,13 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
>      spin_unlock(&d->event_lock);
>  }
>  
> -static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
> +static bool_t hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
>  {
> +    int more;
>      ASSERT(d->arch.hvm_domain.irq.dpci);
>  
>      spin_lock(&d->event_lock);
> -    if ( test_and_clear_bool(pirq_dpci->masked) )
> +    while ( (more = !atomic_dec_and_test(&pirq_dpci->masked)) )
>      {
>          struct pirq *pirq = dpci_pirq(pirq_dpci);
>          const struct dev_intx_gsi_link *digl;
> @@ -687,17 +680,13 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
>              send_guest_pirq(d, pirq);
>  
>              if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
> -            {
> -                spin_unlock(&d->event_lock);
> -                return;
> -            }
> +                break;
>          }
>  
>          if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
>          {
>              vmsi_deliver_pirq(d, pirq_dpci);
> -            spin_unlock(&d->event_lock);
> -            return;
> +            break;
>          }
>  
>          list_for_each_entry ( digl, &pirq_dpci->digl_list, list )
> @@ -710,8 +699,7 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
>          {
>              /* for translated MSI to INTx interrupt, eoi as early as possible */
>              __msi_pirq_eoi(pirq_dpci);
> -            spin_unlock(&d->event_lock);
> -            return;
> +            break;
>          }
>  
>          /*
> @@ -725,6 +713,8 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
>          set_timer(&pirq_dpci->timer, NOW() + PT_IRQ_TIME_OUT);
>      }
>      spin_unlock(&d->event_lock);
> +
> +    return more;
>  }
>  
>  static void __hvm_dpci_eoi(struct domain *d,
> @@ -781,7 +771,7 @@ unlock:
>  }
>  
>  /*
> - * Note: 'pt_pirq_softirq_reset' can clear the STATE_SCHED before we get to
> + * Note: 'pt_pirq_softirq_reset' can move the state to init before we get to
>   * doing it. If that is the case we let 'pt_pirq_softirq_reset' do ref-counting.
>   */
>  static void dpci_softirq(void)
> @@ -797,23 +787,53 @@ static void dpci_softirq(void)
>      {
>          struct hvm_pirq_dpci *pirq_dpci;
>          struct domain *d;
> +        bool_t again = 0;
>  
>          pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
>          list_del(&pirq_dpci->softirq_list);
>  
>          d = pirq_dpci->dom;
>          smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
> -        if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> -            BUG();
> -        /*
> -         * The one who clears STATE_SCHED MUST refcount the domain.
> -         */
> -        if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
> +
> +        switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, STATE_RUN) )
> +        {
> +        case STATE_INIT:
> +            /*  pt_pirq_softirq_reset cleared it. Let it do the ref-counting. */
> +            continue;
> +        case STATE_RUN:
> +            again = 1;
> +            /* Fall through. */
> +        case STATE_SCHED:
> +            break;
> +        }
> +        if ( !again )
>          {
> -            hvm_dirq_assist(d, pirq_dpci);
> +            /*
> +             * The one who changes sched to running MUST refcount the domain.
> +             */
> +            again = hvm_dirq_assist(d, pirq_dpci);
>              put_domain(d);
> +            switch ( cmpxchg(&pirq_dpci->state, STATE_RUN, STATE_INIT) )
> +            {
> +            case STATE_SCHED:
> +            case STATE_INIT:
> +                BUG();
> +            case STATE_RUN:
> +                break;
> +            }
> +        }
> +        if ( again )
> +        {
> +            unsigned long flags;
> +
> +            /* Put back on the list and retry. */
> +            local_irq_save(flags);
> +            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
> +            local_irq_restore(flags);
> +
> +            raise_softirq(HVM_DPCI_SOFTIRQ);
> +            continue;
>          }
> -        clear_bit(STATE_RUN, &pirq_dpci->state);
>      }
>  }
>  
> diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
> index 3996f1f..48dbc51 100644
> --- a/xen/include/xen/hvm/irq.h
> +++ b/xen/include/xen/hvm/irq.h
> @@ -93,8 +93,8 @@ struct hvm_irq_dpci {
>  /* Machine IRQ to guest device/intx mapping. */
>  struct hvm_pirq_dpci {
>      uint32_t flags;
> -    unsigned int state;
> -    bool_t masked;
> +    enum { STATE_INIT, STATE_SCHED, STATE_RUN } state;
> +    atomic_t masked;
>      uint16_t pending;
>      struct list_head digl_list;
>      struct domain *dom;


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-03-17 17:44       ` Konrad Rzeszutek Wilk
  2015-03-17 22:16         ` Sander Eikelenboom
@ 2015-03-18  7:41         ` Jan Beulich
  2015-03-18 14:06           ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2015-03-18  7:41 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

>>> On 17.03.15 at 18:44, <konrad.wilk@oracle.com> wrote:
> As you can see to preserve the existing functionality such as
> being able to schedule N amount of interrupt injections
> for the N interrupts we might get - I modified '->masked'
> to be an atomic counter.

Why would that be? When an earlier interrupt wasn't fully handled,
real hardware wouldn't latch more than one further instance either.

> The end result is that we can still live-lock. Unless we:
>  - Drop on the floor the injection of N interrupts and
>    just deliever at max one per VMX_EXIT (and not bother
>    with interrupts arriving when we are in the VMX handler).

I'm afraid I again don't see the point here.

>  - Alter the softirq code slightly - to have a variant
>    which will only iterate once over the pending softirq
>    bits per call. (so save a copy of the bitmap on the
>    stack when entering the softirq handler - and use that.
>    We could also xor it against the current to catch any
>    non-duplicate bits being set that we should deal with).

That's clearly not an option: The solution should be isolated
to DPCI code, i.e. without altering existing behavior in other
(more generic) components.

Jan


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-03-18  7:41         ` Jan Beulich
@ 2015-03-18 14:06           ` Konrad Rzeszutek Wilk
  2015-03-18 16:43             ` Jan Beulich
  0 siblings, 1 reply; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-03-18 14:06 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

On Wed, Mar 18, 2015 at 07:41:55AM +0000, Jan Beulich wrote:
> >>> On 17.03.15 at 18:44, <konrad.wilk@oracle.com> wrote:
> > As you can see to preserve the existing functionality such as
> > being able to schedule N amount of interrupt injections
> > for the N interrupts we might get - I modified '->masked'
> > to be an atomic counter.
> 
> Why would that be? When an earlier interrupt wasn't fully handled,
> real hardware wouldn't latch more than one further instance either.

We acknowledge the interrupt in the hypervisor - as in we call
->ack on the handler (which for MSI is an nop anyhow).

If the device is misconfigured and keeps on sending bursts of
interrupts every 10 msec for 1 msec we can dead-lock.

Either way we should tell the guest about those interrupts.
> 
> > The end result is that we can still live-lock. Unless we:
> >  - Drop on the floor the injection of N interrupts and
> >    just deliver at max one per VMX_EXIT (and not bother
> >    with interrupts arriving when we are in the VMX handler).
> 
> I'm afraid I again don't see the point here.

I am basing all of this on the assumption that we have
many interrupts for the same device coming in - and we have
not been able to tell the guest about it (the guest could
be descheduled, too slow, etc) so that it can do what it
needs to silence the device.

It might very well be a busted device - and if that is
the case, so be it - but we must not dead-lock due to it.
> 
> >  - Alter the softirq code slightly - to have a variant
> >    which will only iterate once over the pending softirq
> >    bits per call. (so save a copy of the bitmap on the
> >    stack when entering the softirq handler - and use that.
> >    We could also xor it against the current to catch any
> >    non-duplicate bits being set that we should deal with).
> 
> That's clearly not an option: The solution should be isolated
> to DPCI code, i.e. without altering existing behavior in other
> (more generic) components.
> 
> Jan
> 


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-03-18 14:06           ` Konrad Rzeszutek Wilk
@ 2015-03-18 16:43             ` Jan Beulich
  2015-03-18 17:00               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2015-03-18 16:43 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

>>> On 18.03.15 at 15:06, <konrad.wilk@oracle.com> wrote:
> On Wed, Mar 18, 2015 at 07:41:55AM +0000, Jan Beulich wrote:
>> >>> On 17.03.15 at 18:44, <konrad.wilk@oracle.com> wrote:
>> > As you can see to preserve the existing functionality such as
>> > being able to schedule N amount of interrupt injections
>> > for the N interrupts we might get - I modified '->masked'
>> > to be an atomic counter.
>> 
>> Why would that be? When an earlier interrupt wasn't fully handled,
>> real hardware wouldn't latch more than one further instance either.
> 
> We acknowledge the interrupt in the hypervisor - as in we call
> ->ack on the handler (which for MSI is an nop anyhow).

The case where ->ack is a nop (for the purposes here) is specifically
not a problem, as that means we defer ack-ing the LAPIC (hence
further instances can't show up).

> If the device is misconfigured and keeps on sending burst of
> interrupts every 10 msec for 1msec we can dead-lock.

How is this different from the hypervisor itself not being fast
enough to handle one instance before the next one shows up?
I've been trying to reconstruct the rationale for our current
treatment of maskable MSI sources (in that we ack them at the
LAPIC right away), but so far wasn't really successful (sadly
commit 5f4c1bb65e lacks any word of description other than
its title).

(Ill behaved devices shouldn't be handed to guests anyway.)

> Either way we should tell the guest about those interrupts.
>> 
>> > The end result is that we can still live-lock. Unless we:
>> >  - Drop on the floor the injection of N interrupts and
>> >    just deliver at max one per VMX_EXIT (and not bother
>> >    with interrupts arriving when we are in the VMX handler).
>> 
>> I'm afraid I again don't see the point here.
> 
> I am basing all of this on the assumption that we have
> many interrupts for the same device coming it - and we have
> not been able to tell the guest about it (the guest could
> be descheduled, too slow, etc) so that it can do what it
> needs to silence the device.

But that's the same as with the native hardware case: When there
are new interrupt instances before the earlier one was acked, at
most one will be seen at the point the interrupt becomes unmasked
again.

Jan


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-03-18 16:43             ` Jan Beulich
@ 2015-03-18 17:00               ` Konrad Rzeszutek Wilk
  2015-03-19  7:15                 ` Jan Beulich
  0 siblings, 1 reply; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-03-18 17:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

On Wed, Mar 18, 2015 at 04:43:40PM +0000, Jan Beulich wrote:
> >>> On 18.03.15 at 15:06, <konrad.wilk@oracle.com> wrote:
> > On Wed, Mar 18, 2015 at 07:41:55AM +0000, Jan Beulich wrote:
> >> >>> On 17.03.15 at 18:44, <konrad.wilk@oracle.com> wrote:
> >> > As you can see to preserve the existing functionality such as
> >> > being able to schedule N amount of interrupt injections
> >> > for the N interrupts we might get - I modified '->masked'
> >> > to be an atomic counter.
> >> 
> >> Why would that be? When an earlier interrupt wasn't fully handled,
> >> real hardware wouldn't latch more than one further instance either.
> > 
> > We acknowledge the interrupt in the hypervisor - as in we call
> > ->ack on the handler (which for MSI is an nop anyhow).
> 
> The case where ->ack is a nop (for the purposes here) is specifically
> not a problem, as that means we defer ack-ing the LAPIC (hence
> further instances can't show up).
> 
> > If the device is misconfigured and keeps on sending bursts of
> > interrupts every 10 msec for 1 msec we can dead-lock.
> 
> How is this different from the hypervisor itself not being fast
> enough to handle one instance before the next one shows up?

If by 'handle' you mean deliver it to the guest (i.e. update the guest
vAPIC and so on), then yes - this is exactly the case I am describing.

> I've been trying to reconstruct the rationale for our current
> treatment of maskable MSI sources (in that we ack them at the
> LAPIC right away), but so far wasn't really successful (sadly
> commit 5f4c1bb65e lacks any word of description other than
> its title).
> 
> (Ill behaved devices shouldn't be handed to guests anyway.)

They might become ill-behaved if the guest OS becomes
compromised.
> 
> > Either way we should tell the guest about those interrupts.
> >> 
> >> > The end result is that we can still live-lock. Unless we:
> >> >  - Drop on the floor the injection of N interrupts and
> >> >    just deliver at max one per VMX_EXIT (and not bother
> >> >    with interrupts arriving when we are in the VMX handler).
> >> 
> >> I'm afraid I again don't see the point here.
> > 
> > I am basing all of this on the assumption that we have
> > many interrupts for the same device coming in - and we have
> > not been able to tell the guest about it (the guest could
> > be descheduled, too slow, etc) so that it can do what it
> > needs to silence the device.
> 
> But that's the same as with the native hardware case: When there
> are new interrupt instances before the earlier one was acked, at
> most one will be seen at the point the interrupt becomes unmasked
> again.

Correct. However we split the 'handling' of an interrupt in two
stages. The first stage is acking it and activating a softirq to
process this dpci.

The second stage is running the softirq handler (processing)- and
right then we can get interrupted by the same interrupt (we have
Acked it - so the device is OK to send another one). The interrupt
handler (do_IRQ) will try to tell the softirq to process it.
And in here - depending on which flavour of RFC patches I've
posted - we could deadlock.

The deadlocks arise if we explicitly wait for the softirq to finish
in raise_softirq_for - as in we spin in raise_softirq_for for the
dpci to be out of running - while we have just stomped over the
softirq that was processing the dpci!

The live-lock scenario is also possible - if the device sends an
interrupt right as dpci_softirq is in hvm_dirq_assist - and it
does so at such regular intervals that dpci_softirq ends up
rescheduling its dpci every time.


* Re: [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU.
  2015-03-18 17:00               ` Konrad Rzeszutek Wilk
@ 2015-03-19  7:15                 ` Jan Beulich
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Beulich @ 2015-03-19  7:15 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: andrew.cooper3, malcolm.crossley, xen-devel, linux

>>> On 18.03.15 at 18:00, <konrad.wilk@oracle.com> wrote:
> The deadlocks arise if we explicitly wait for the softirq to finish
> in raise_softirq_for - as in we spin in raise_softirq_for for the
> dpci to be out of running - while we have just stomped over the
> softirq that was processing the dpci!

Right - I finally see what you mean and why my alternative
suggestion was bad. So let me go back look at the patch
Sander has been testing for so long another time with this
in mind.

Jan


end of thread, other threads:[~2015-03-19  7:16 UTC | newest]

Thread overview: 27+ messages
2015-01-12 16:45 [RFC PATCH] dpci: Put the dpci back on the list if running on another CPU Konrad Rzeszutek Wilk
2015-01-12 17:27 ` Sander Eikelenboom
2015-01-12 17:35 ` Konrad Rzeszutek Wilk
2015-01-13 10:26   ` Jan Beulich
2015-01-13 10:20 ` Jan Beulich
2015-01-23  1:44   ` Konrad Rzeszutek Wilk
2015-01-23  9:37     ` Jan Beulich
2015-01-23 14:54       ` Konrad Rzeszutek Wilk
2015-03-17 17:44       ` Konrad Rzeszutek Wilk
2015-03-17 22:16         ` Sander Eikelenboom
2015-03-18  7:41         ` Jan Beulich
2015-03-18 14:06           ` Konrad Rzeszutek Wilk
2015-03-18 16:43             ` Jan Beulich
2015-03-18 17:00               ` Konrad Rzeszutek Wilk
2015-03-19  7:15                 ` Jan Beulich
2015-02-02 14:29   ` Konrad Rzeszutek Wilk
2015-02-02 15:19     ` Jan Beulich
2015-02-02 15:31       ` Konrad Rzeszutek Wilk
2015-02-02 15:48         ` Jan Beulich
2015-02-02 17:44           ` Konrad Rzeszutek Wilk
2015-02-03  8:58             ` Jan Beulich
2015-03-16 17:59               ` Konrad Rzeszutek Wilk
2015-03-17  8:18                 ` Jan Beulich
2015-03-17  8:42                   ` Sander Eikelenboom
2015-03-17 14:54                     ` Konrad Rzeszutek Wilk
2015-03-17 16:01                       ` Jan Beulich
2015-03-17 16:09                         ` Konrad Rzeszutek Wilk
