* One (possible) x86 get_user_pages bug
@ 2011-01-27 13:05 Xiaowei Yang
  2011-01-27 13:56 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Xiaowei Yang @ 2011-01-27 13:05 UTC (permalink / raw)
  To: Nick Piggin, Peter Zijlstra, Jan Beulich
  Cc: Kenneth Lee, wangzhenguo, linqaingmin, fanhenglong, Wu Fengguang,
	linux-kernel, Kaushik Barde

Actually this bug was encountered with a SLES11 SP1 dom0 kernel 
(2.6.32.12-0.7-xen), and we still can't reproduce it with a native 
2.6.32 kernel. But since we suspect the native kernel might have the 
same issue, we are sending it to LKML for consultation.

At first the error message looks like this:
----------------------------------------------------------------
[201674.150162] BUG: Bad page state in process java  pfn:d13b8
[201674.151345] page:ffff8800075c7040 flags:4000000000200000 count:0 
mapcount:0        mapping:(null) index:7f093bdfd
[201674.152474] Pid: 14793, comm: java Tainted: G          N 
2.6.32.12-0.7-xen #2
[201674.153585] Call Trace:
[201674.154643]  [<ffffffff80009a75>] dump_trace+0x65/0x180
[201674.155686]  [<ffffffff80369f26>] dump_stack+0x69/0x73
[201674.156744]  [<ffffffff8009c0df>] bad_page+0xdf/0x160
[201674.157773]  [<ffffffff800614f1>] get_futex_key+0x71/0x1a0
[201674.158820]  [<ffffffff80061f72>] futex_wake+0x52/0x130
[201674.159852]  [<ffffffff8006414f>] do_futex+0x11f/0xc40
[201674.160875]  [<ffffffff80064cf2>] sys_futex+0x82/0x160
[201674.161907]  [<ffffffff8003aa26>] mm_release+0xb6/0x110
[201674.162960]  [<ffffffff8003f65e>] exit_mm+0x1e/0x150
[201674.163991]  [<ffffffff80040567>] do_exit+0x127/0x7e0
[201674.165028]  [<ffffffff80040c32>] sys_exit+0x12/0x20
[201674.166070]  [<ffffffff80007388>] system_call_fastpath+0x16/0x1b
[201674.167130]  [<00007f098db046b0>] 0x7f098db046b0
----------------------------------------------------------------

After the CONFIG_DEBUG_VM option is turned on (kind of), the faulting 
spot is captured -- get_page() in gup_pte_range() is called on an 
already-freed page and triggers a BUG_ON.

We created a scenario to reproduce the bug:
----------------------------------------------------------------
// proc1/proc1.2 are 2 threads sharing one page table.
// proc1 is the parent of proc2.

proc1               proc2          proc1.2
...                 ...            // in gup_pte_range()
...                 ...            pte = gup_get_pte()
...                 ...            page1 = pte_page(pte)  // (1)
do_wp_page(page1)   ...            ...
...                 exit_map()     ...
...                 ...            get_page(page1)        // (2)
-----------------------------------------------------------------

do_wp_page() and exit_map() cause page1 to be released into free list 
before get_page() in proc1.2 is called. The longer the delay between 
(1)&(2), the easier the BUG_ON shows.
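
The BUG_ON itself comes from get_page(): by the time (2) runs, page1's
refcount has already dropped to zero. Roughly (a from-memory sketch of the
include/linux/mm.h definition in this kernel generation, not verbatim --
the exact check varies between versions):

/* Sketch of get_page(), ~2.6.32 include/linux/mm.h (not verbatim). */
static inline void get_page(struct page *page)
{
	page = compound_head(page);
	/* fires at (2): the PTE read at (1) still named page1, but the
	 * page has been freed and its refcount is already zero */
	VM_BUG_ON(atomic_read(&page->_count) == 0);
	atomic_inc(&page->_count);
}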

An experimental patch is made to prevent the PTE being modified in the 
middle of gup_pte_range(). The BUG_ON disappears afterward.

However, from the comments embedded in gup.c, it seems deliberate to 
avoid the lock in the fast path. The question is: if so, how to avoid 
the above scenario?

Thanks,
xiaowei

--------------------------------------------------------------------
--- /usr/src/linux-2.6.32.12-0.7/arch/x86/mm/gup.c.org	2011-01-27 20:11:45.000000000 +0800
+++ /usr/src/linux-2.6.32.12-0.7/arch/x86/mm/gup.c	2011-01-27 20:11:22.000000000 +0800
@@ -72,17 +72,18 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
 	unsigned long mask;
 	pte_t *ptep;
+	spinlock_t *ptl;
 
 	mask = _PAGE_PRESENT|_PAGE_USER;
 	if (write)
 		mask |= _PAGE_RW;
-	ptep = pte_offset_map(&pmd, addr);
+	ptep = pte_offset_map_lock(current->mm, &pmd, addr, &ptl);
 	do {
 		pte_t pte = gup_get_pte(ptep);
 		struct page *page;
 
 		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
-			pte_unmap(ptep);
+			pte_unmap_unlock(ptep, ptl);
 			return 0;
 		}
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
@@ -90,8 +91,9 @@
 		get_page(page);
 		pages[*nr] = page;
 		(*nr)++;
 
 	} while (ptep++, addr += PAGE_SIZE, addr != end);
-	pte_unmap(ptep - 1);
+	pte_unmap_unlock(ptep - 1, ptl);
 
 	return 1;
 }



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-27 13:05 One (possible) x86 get_user_pages bug Xiaowei Yang
@ 2011-01-27 13:56 ` Peter Zijlstra
  2011-01-27 14:30   ` Jan Beulich
  2011-01-27 14:49   ` Jan Beulich
  2011-01-27 16:07   ` Jan Beulich
  2 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2011-01-27 13:56 UTC (permalink / raw)
  To: Xiaowei Yang
  Cc: Nick Piggin, Jan Beulich, Kenneth Lee, wangzhenguo, linqaingmin,
	fanhenglong, Wu Fengguang, linux-kernel, Kaushik Barde

On Thu, 2011-01-27 at 21:05 +0800, Xiaowei Yang wrote:
> 
> However, from the comments embedded in gup.c, it seems deliberate to 
> avoid the lock in the fast path. The question is: if so, how to avoid 
> the above scenario? 

Something like the below comes to mind... but I must say I haven't fully
considered the problem yet..

---
 arch/x86/mm/gup.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dbe34b9..6527933 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -89,10 +89,11 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		}
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
-		get_page(page);
-		SetPageReferenced(page);
-		pages[*nr] = page;
-		(*nr)++;
+		if (get_page_unless_zero(page)) {
+			SetPageReferenced(page);
+			pages[*nr] = page;
+			(*nr)++;
+		}
 
 	} while (ptep++, addr += PAGE_SIZE, addr != end);
 	pte_unmap(ptep - 1);
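
For reference, get_page_unless_zero() is the speculative get from the
lockless-pagecache work; roughly (from memory, not verbatim mm.h):

/* Sketch of get_page_unless_zero() as in include/linux/mm.h of this era. */
static inline int get_page_unless_zero(struct page *page)
{
	/* only succeeds while the refcount is still non-zero */
	return atomic_inc_not_zero(&page->_count);
}

It refuses the reference once the count has already hit zero, which is
exactly the window the trace above hits.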


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-27 13:56 ` Peter Zijlstra
@ 2011-01-27 14:30   ` Jan Beulich
  2011-01-28 10:51     ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2011-01-27 14:30 UTC (permalink / raw)
  To: Peter Zijlstra, Xiaowei Yang
  Cc: fanhenglong, Kaushik Barde, Kenneth Lee, linqaingmin,
	wangzhenguo, Wu Fengguang, Nick Piggin, linux-kernel

>>> On 27.01.11 at 14:56, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2011-01-27 at 21:05 +0800, Xiaowei Yang wrote:
>> 
>> However, from the comments embedded in gup.c, it seems deliberate to 
>> avoid the lock in the fast path. The question is: if so, how to avoid 
>> the above scenario? 
> 
> Something like the below comes to mind... but I must say I haven't fully
> considered the problem yet..

That doesn't seem to account for the possible case of the page
even managing to get allocated again to something else.

And I think you would need to drop out of gup_pte_range() in
that case.

I would think this needs to be get_page_unless_zero()
followed by re-checking of the page table entry (probably
not even requiring a second gup_get_pte()); I'm not sure
yet what the correct action would be for change in only the
accessed/dirty bits.
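
In code, that re-check might look something like this (purely a sketch;
gup_pte_try_grab() is a hypothetical helper, and it re-reads the PTE for
clarity even though, as said, that may not be needed):

/* Hypothetical sketch of the "speculative ref then re-check" step; not
 * actual kernel code.  It would replace the bare get_page() in
 * gup_pte_range(). */
static int gup_pte_try_grab(pte_t *ptep, pte_t pte, struct page *page)
{
	if (!get_page_unless_zero(page))
		return 0;			/* page already being freed */

	if (!pte_same(gup_get_pte(ptep), pte)) {
		/* PTE changed (or the page got reused) meanwhile: back off */
		put_page(page);
		return 0;
	}
	return 1;
}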

Jan

> ---
>  arch/x86/mm/gup.c |    9 +++++----
>  1 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index dbe34b9..6527933 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -89,10 +89,11 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
>  		}
>  		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  		page = pte_page(pte);
> -		get_page(page);
> -		SetPageReferenced(page);
> -		pages[*nr] = page;
> -		(*nr)++;
> +		if (get_page_unless_zero(page)) {
> +			SetPageReferenced(page);
> +			pages[*nr] = page;
> +			(*nr)++;
> +		}
>  
>  	} while (ptep++, addr += PAGE_SIZE, addr != end);
>  	pte_unmap(ptep - 1);




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-27 13:05 One (possible) x86 get_user_pages bug Xiaowei Yang
@ 2011-01-27 14:49   ` Jan Beulich
  2011-01-27 14:49   ` Jan Beulich
  2011-01-27 16:07   ` Jan Beulich
  2 siblings, 0 replies; 26+ messages in thread
From: Jan Beulich @ 2011-01-27 14:49 UTC (permalink / raw)
  To: Xiaowei Yang, Nick Piggin
  Cc: Peter Zijlstra, fanhenglong, Kaushik Barde, Kenneth Lee,
	linqaingmin, wangzhenguo, Wu Fengguang, xen-devel, linux-kernel

>>> On 27.01.11 at 14:05, Xiaowei Yang <xiaowei.yang@huawei.com> wrote:
> We created a scenario to reproduce the bug:
> ----------------------------------------------------------------
> // proc1/proc1.2 are 2 threads sharing one page table.
> // proc1 is the parent of proc2.
> 
> proc1               proc2          proc1.2
> ...                 ...            // in gup_pte_range()
> ...                 ...            pte = gup_get_pte()
> ...                 ...            page1 = pte_page(pte)  // (1)
> do_wp_page(page1)   ...            ...
> ...                 exit_map()     ...
> ...                 ...            get_page(page1)        // (2)
> -----------------------------------------------------------------
> 
> do_wp_page() and exit_map() cause page1 to be released into free list 
> before get_page() in proc1.2 is called. The longer the delay between 
> (1)&(2), the easier the BUG_ON shows.

The scenario indeed seems to apply independent of virtualization,
but the window obviously can be unbounded unless running
native.

However, going through all the comments in gup.c again I wonder
whether pv Xen guests don't violate the major assumption: There
is talk about interrupts being off preventing (or sufficiently
deferring) remote CPUs doing TLB flushes. In pv Xen guests,
however, non-local TLB flushes do not happen by sending IPIs -
the hypercall interface gets used instead. If that's indeed the
case, I would have expected quite a few bug reports, but I'm
unaware of any - Nick, am I overlooking something here?
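
The assumption in question is the interrupt-disabled walk in
get_user_pages_fast(); schematically (abbreviated and from memory, not
quoted verbatim from arch/x86/mm/gup.c):

	/*
	 * Schematic of get_user_pages_fast(), heavily abbreviated.  On
	 * native x86 the remote CPU's TLB-flush IPI cannot be serviced
	 * while interrupts are off here, which is what keeps page tables
	 * and pages from being freed under the walker; a PV guest that
	 * flushes via hypercall has no such barrier.
	 */
	local_irq_disable();
	pgdp = pgd_offset(mm, addr);
	do {
		pgd_t pgd = *pgdp;

		next = pgd_addr_end(addr, end);
		if (pgd_none(pgd))
			goto slow;
		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
			goto slow;
	} while (pgdp++, addr = next, addr != end);
	local_irq_enable();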

Jan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-27 14:49   ` Jan Beulich
  (?)
@ 2011-01-27 15:01   ` Peter Zijlstra
  -1 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2011-01-27 15:01 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xiaowei Yang, Nick Piggin, fanhenglong, Kaushik Barde,
	Kenneth Lee, linqaingmin, wangzhenguo, Wu Fengguang, xen-devel,
	linux-kernel

On Thu, 2011-01-27 at 14:49 +0000, Jan Beulich wrote:
> >>> On 27.01.11 at 14:05, Xiaowei Yang <xiaowei.yang@huawei.com> wrote:
> > We created a scenario to reproduce the bug:
> > ----------------------------------------------------------------
> > // proc1/proc1.2 are 2 threads sharing one page table.
> > // proc1 is the parent of proc2.
> > 
> > proc1               proc2          proc1.2
> > ...                 ...            // in gup_pte_range()
> > ...                 ...            pte = gup_get_pte()
> > ...                 ...            page1 = pte_page(pte)  // (1)
> > do_wp_page(page1)   ...            ...
> > ...                 exit_map()     ...
> > ...                 ...            get_page(page1)        // (2)
> > -----------------------------------------------------------------
> > 
> > do_wp_page() and exit_map() cause page1 to be released into free list 
> > before get_page() in proc1.2 is called. The longer the delay between 
> > (1)&(2), the easier the BUG_ON shows.
> 
> The scenario indeed seems to apply independent of virtualization,
> but the window obviously can be unbounded unless running
> native.
> 
> However, going through all the comments in gup.c again I wonder
> whether pv Xen guests don't violate the major assumption: There
> is talk about interrupts being off preventing (or sufficiently
> deferring) remote CPUs doing TLB flushes. In pv Xen guests,
> however, non-local TLB flushes do not happen by sending IPIs -
> the hypercall interface gets used instead. If that's indeed the
> case, I would have expected quite a few bug reports, but I'm
> unaware of any - Nick, am I overlooking something here?

Indeed, the delay of TLB flush IPIs should ensure that the pages aren't
freed and should cover the race with unmap.

If Xen violates this then Xen needs to fix this somehow..

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-27 13:05 One (possible) x86 get_user_pages bug Xiaowei Yang
@ 2011-01-27 16:07   ` Jan Beulich
  2011-01-27 14:49   ` Jan Beulich
  2011-01-27 16:07   ` Jan Beulich
  2 siblings, 0 replies; 26+ messages in thread
From: Jan Beulich @ 2011-01-27 16:07 UTC (permalink / raw)
  To: Xiaowei Yang, Nick Piggin
  Cc: Peter Zijlstra, fanhenglong, Kaushik Barde, Kenneth Lee,
	linqaingmin, wangzhenguo, Wu Fengguang, xen-devel, linux-kernel

>>> On 27.01.11 at 14:05, Xiaowei Yang <xiaowei.yang@huawei.com> wrote:
> We created a scenario to reproduce the bug:
> ----------------------------------------------------------------
> // proc1/proc1.2 are 2 threads sharing one page table.
> // proc1 is the parent of proc2.
> 
> proc1               proc2          proc1.2
> ...                 ...            // in gup_pte_range()
> ...                 ...            pte = gup_get_pte()
> ...                 ...            page1 = pte_page(pte)  // (1)
> do_wp_page(page1)   ...            ...
> ...                 exit_map()     ...
> ...                 ...            get_page(page1)        // (2)
> -----------------------------------------------------------------
> 
> do_wp_page() and exit_map() cause page1 to be released into free list 
> before get_page() in proc1.2 is called. The longer the delay between 
> (1)&(2), the easier the BUG_ON shows.

Other than what I responded initially, I don't think this can happen
outside of Xen: do_wp_page() won't reach page_cache_release() when
gup_pte_range() is running for the same mm on another CPU,
since it can't get past ptep_clear_flush() (waiting for the CPU
in get_user_pages_fast() to re-enable interrupts).
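
That is, on native the ordering is enforced by the flush itself; roughly
(an abbreviated, from-memory sketch of the COW-replacement step in
mm/memory.c:do_wp_page(), names simplified, not verbatim):

	entry = mk_pte(new_page, vma->vm_page_prot);
	ptep_clear_flush(vma, address, page_table);	/* IPIs + wait: cannot
							 * complete while another
							 * CPU runs gup_fast with
							 * interrupts disabled */
	set_pte_at(mm, address, page_table, entry);
	/* ... rmap and mmu-cache updates ... */
	page_cache_release(old_page);			/* old page only freed
							 * after the flush */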

> An experimental patch is made to prevent the PTE being modified in the 
> middle of gup_pte_range(). The BUG_ON disappears afterward.
> 
> However, from the comments embedded in gup.c, it seems deliberate to 
> avoid the lock in the fast path. The question is: if so, how to avoid 
> the above scenario?

Nick, based on your doing of the initial implementation, would
you be able to estimate whether disabling get_user_pages_fast()
altogether for Xen would be performing measurably worse than
adding the locks (but continuing to avoid acquiring mm->mmap_sem)
as suggested by Xiaowei? That's of course only if the latter is correct
at all, of which I haven't fully convinced myself yet.

Jan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-27 16:07   ` Jan Beulich
  (?)
@ 2011-01-27 16:25   ` Peter Zijlstra
  2011-01-27 16:41       ` Jan Beulich
  -1 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2011-01-27 16:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xiaowei Yang, Nick Piggin, fanhenglong, Kaushik Barde,
	Kenneth Lee, linqaingmin, wangzhenguo, Wu Fengguang, xen-devel,
	linux-kernel

On Thu, 2011-01-27 at 16:07 +0000, Jan Beulich wrote:
> 
> Nick, based on your doing of the initial implementation, would
> you be able to estimate whether disabling get_user_pages_fast()
> altogether for Xen would be performing measurably worse than
> adding the locks (but continuing to avoid acquiring mm->mmap_sem)
> as suggested by Xiaowei? That's of course only if the latter is correct
> at all, of which I haven't fully convinced myself yet. 

Adding the lock will result in deadlocks, __get_user_pages_fast() is
used from NMI context.

Then again, I've got no clue if Xen even has NMIs..

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-27 16:25   ` Peter Zijlstra
@ 2011-01-27 16:41       ` Jan Beulich
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Beulich @ 2011-01-27 16:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, fanhenglong, Kaushik Barde, Kenneth Lee,
	linqaingmin, wangzhenguo, Xiaowei Yang, Wu Fengguang,
	Nick Piggin, xen-devel, linux-kernel

>>> On 27.01.11 at 17:25, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2011-01-27 at 16:07 +0000, Jan Beulich wrote:
>> 
>> Nick, based on your doing of the initial implementation, would
>> you be able to estimate whether disabling get_user_pages_fast()
>> altogether for Xen would be performing measurably worse than
>> adding the locks (but continuing to avoid acquiring mm->mmap_sem)
>> as suggested by Xiaowei? That's of course only if the latter is correct
>> at all, of which I haven't fully convinced myself yet. 
> 
> Adding the lock will result in deadlocks, __get_user_pages_fast() is
> used from NMI context.
> 
> Then again, I've got no clue if Xen even has NMIs..

It does, but not in a way that would be usable for perf events
afaict, so that code path should be dead under Xen. (We don't
even build the offending source file in our kernels, but I'm not
sure how pv-ops deals with situations like this - Jeremy?)

Jan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-27 16:07   ` Jan Beulich
  (?)
  (?)
@ 2011-01-27 16:56   ` Peter Zijlstra
  -1 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2011-01-27 16:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xiaowei Yang, Nick Piggin, fanhenglong, Kaushik Barde,
	Kenneth Lee, linqaingmin, wangzhenguo, Wu Fengguang, xen-devel,
	linux-kernel

On Thu, 2011-01-27 at 16:07 +0000, Jan Beulich wrote:
> 
> Nick, based on your doing of the initial implementation, would
> you be able to estimate whether disabling get_user_pages_fast()
> altogether for Xen would be performing measurably worse than
> adding the locks (but continuing to avoid acquiring mm->mmap_sem)
> as suggested by Xiaowei? That's of course only if the latter is correct
> at all, of which I haven't fully convinced myself yet. 

Also, I don't think taking the pte_lock is sufficient in your case..
since x86 mmu_gather frees the page tables themselves too, your
dereference of the pgd/pud/pmd will be subject to the same deref after
free problems.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-27 14:49   ` Jan Beulich
@ 2011-01-27 18:27     ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 26+ messages in thread
From: Jeremy Fitzhardinge @ 2011-01-27 18:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xiaowei Yang, Nick Piggin, Peter Zijlstra, fanhenglong,
	Kaushik Barde, Kenneth Lee, linqaingmin, wangzhenguo,
	Wu Fengguang, xen-devel, linux-kernel, Marcelo Tosatti,
	Avi Kivity

On 01/27/2011 06:49 AM, Jan Beulich wrote:
> However, going through all the comments in gup.c again I wonder
> whether pv Xen guests don't violate the major assumption: There
> is talk about interrupts being off preventing (or sufficiently
> deferring) remote CPUs doing TLB flushes. In pv Xen guests,
> however, non-local TLB flushes do not happen by sending IPIs -
> the hypercall interface gets used instead

Yes, I was aware of that synchronization mechanism, and I think I'd
convinced myself we were OK.  But I can't think was that reasoning was -
perhaps it was something as simple as "gupf isn't used under Xen" (which
may have been true at the time, but isn't now).

As clever as it is, the whole "disable interrupts -> hold off IPI ->
prevent TLB flush -> delay freeing" chain seems pretty fragile.  I guess
its OK if we assume that x86 will forever have IPI-based cross-cpu TLB
flushes, but one could imagine some kind of "remote tlb shootdown using
bus transaction" appearing in the architecture.

And even just considering virtualization, having non-IPI-based tlb
shootdown is a measurable performance win, since a hypervisor can
optimise away a cross-VCPU shootdown if it knows no physical TLB
contains the target VCPU's entries.  I can imagine the KVM folks could
get some benefit from that as well.

So is there some way we can preserve the current scheme's benefits while
making it a bit more general?  (If anyone else has non-IPI-based
shootdown, it would be s390; is there some inspiration there?  An
instruction perhaps?)

    J

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-27 18:27     ` Jeremy Fitzhardinge
  (?)
@ 2011-01-27 19:27     ` Peter Zijlstra
  -1 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2011-01-27 19:27 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Jan Beulich, Xiaowei Yang, Nick Piggin, fanhenglong,
	Kaushik Barde, Kenneth Lee, linqaingmin, wangzhenguo,
	Wu Fengguang, xen-devel, linux-kernel, Marcelo Tosatti,
	Avi Kivity

On Thu, 2011-01-27 at 10:27 -0800, Jeremy Fitzhardinge wrote:
> 
> So is there some way we can preserve the current scheme's benefits while
> making it a bit more general?  (If anyone else has non-IPI-based
> shootdown, it would be s390; is there some inspiration there?  An
> instruction perhaps?) 

Well, you can provide a Xen gupf implementation based on RCU-freed
page-tables like powerpc, sparc and s390 have.

But you'll have to change the mmu_gather implementation of Xen and use
the get_page_unless_zero() thing mentioned before (or use
page_cache_get_speculative()).
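
Roughly the shape that would take (a hypothetical sketch modelled loosely
on the powerpc fast-gup of the time, not an actual patch; the caller holds
rcu_read_lock() and the arch's mmu_gather must free page tables via RCU):

static noinline int gup_pte_range_rcu(pmd_t pmd, unsigned long addr,
				      unsigned long end, int write,
				      struct page **pages, int *nr)
{
	pte_t *ptep = pte_offset_map(&pmd, addr);  /* table kept alive by RCU */

	do {
		pte_t pte = gup_get_pte(ptep);
		struct page *page = pte_page(pte);

		if (!page_cache_get_speculative(page)) {
			pte_unmap(ptep);
			return 0;		/* page already on its way out */
		}
		if (!pte_same(gup_get_pte(ptep), pte)) {
			put_page(page);		/* raced with unmap or reuse */
			pte_unmap(ptep);
			return 0;
		}
		pages[(*nr)++] = page;
	} while (ptep++, addr += PAGE_SIZE, addr != end);

	pte_unmap(ptep - 1);
	return 1;
}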

But I see no need to change the x86 implementation, if the architecture
ever changes the way it does tlb invalidation we need to change more
than just the gupf implementation.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-27 16:07   ` Jan Beulich
                     ` (2 preceding siblings ...)
  (?)
@ 2011-01-27 21:24   ` Nick Piggin
  2011-01-28  7:17       ` Xiaowei Yang
  -1 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2011-01-27 21:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xiaowei Yang, Nick Piggin, Peter Zijlstra, fanhenglong,
	Kaushik Barde, Kenneth Lee, linqaingmin, wangzhenguo,
	Wu Fengguang, xen-devel, linux-kernel

On Fri, Jan 28, 2011 at 3:07 AM, Jan Beulich <JBeulich@novell.com> wrote:
>>>> On 27.01.11 at 14:05, Xiaowei Yang <xiaowei.yang@huawei.com> wrote:
>> We created a scenario to reproduce the bug:
>> ----------------------------------------------------------------
>> // proc1/proc1.2 are 2 threads sharing one page table.
>> // proc1 is the parent of proc2.
>>
>> proc1               proc2          proc1.2
>> ...                 ...            // in gup_pte_range()
>> ...                 ...            pte = gup_get_pte()
>> ...                 ...            page1 = pte_page(pte)  // (1)
>> do_wp_page(page1)   ...            ...
>> ...                 exit_map()     ...
>> ...                 ...            get_page(page1)        // (2)
>> -----------------------------------------------------------------
>>
>> do_wp_page() and exit_map() cause page1 to be released into free list
>> before get_page() in proc1.2 is called. The longer the delay between
>> (1)&(2), the easier the BUG_ON shows.
>
> Other than what I responded initially, I don't think this can happen
> outside of Xen: do_wp_page() won't reach page_cache_release() when
> gup_pte_range() is running for the same mm on another CPU,
> since it can't get past ptep_clear_flush() (waiting for the CPU
> in get_user_pages_fast() to re-enable interrupts).

Yeah, this cannot happen on native.


>> An experimental patch is made to prevent the PTE being modified in the
>> middle of gup_pte_range(). The BUG_ON disappears afterward.
>>
>> However, from the comments embedded in gup.c, it seems deliberate to
>> avoid the lock in the fast path. The question is: if so, how to avoid
>> the above scenario?
>
> Nick, based on your doing of the initial implementation, would
> you be able to estimate whether disabling get_user_pages_fast()
> altogether for Xen would be performing measurably worse than
> adding the locks (but continuing to avoid acquiring mm->mmap_sem)
> as suggested by Xiaowei? That's of course only if the latter is correct
> at all, of which I haven't fully convinced myself yet.

You must have some way to guarantee existence of Linux page
tables when you walk them in order to resolve a TLB refill.

x86 does this with IPI and hardware fill that is atomic WRT interrupts.
So fast gup can disable interrupts to walk page tables; I don't think it
is fragile, it is absolutely tied to the system ISA (of course that can
change, but as Peter said, other things will have to change).

Other architectures use RCU for this, so fast gup uses a lockless-
pagecache-alike protocol for that.

If Xen is not using IPIs for flush, it should use whatever locks or
synchronization its TLB refill is using.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-27 21:24   ` Nick Piggin
@ 2011-01-28  7:17       ` Xiaowei Yang
  0 siblings, 0 replies; 26+ messages in thread
From: Xiaowei Yang @ 2011-01-28  7:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Beulich, Nick Piggin, Peter Zijlstra, fanhenglong,
	Kaushik Barde, Kenneth Lee, linqaingmin, wangzhenguo,
	Wu Fengguang, xen-devel, linux-kernel

On 2011-1-28 5:24, Nick Piggin wrote:
> On Fri, Jan 28, 2011 at 3:07 AM, Jan Beulich<JBeulich@novell.com>  wrote:
>>>>> On 27.01.11 at 14:05, Xiaowei Yang<xiaowei.yang@huawei.com>  wrote:
>>> We created a scenario to reproduce the bug:
>>> ----------------------------------------------------------------
>>> // proc1/proc1.2 are 2 threads sharing one page table.
>>> // proc1 is the parent of proc2.
>>>
>>> proc1               proc2          proc1.2
>>> ...                 ...            // in gup_pte_range()
>>> ...                 ...            pte = gup_get_pte()
>>> ...                 ...            page1 = pte_page(pte)  // (1)
>>> do_wp_page(page1)   ...            ...
>>> ...                 exit_map()     ...
>>> ...                 ...            get_page(page1)        // (2)
>>> -----------------------------------------------------------------
>>>
>>> do_wp_page() and exit_map() cause page1 to be released into free list
>>> before get_page() in proc1.2 is called. The longer the delay between
>>> (1)&(2), the easier the BUG_ON shows.
>>
>> Other than what I responded initially, I don't think this can happen
>> outside of Xen: do_wp_page() won't reach page_cache_release() when
>> gup_pte_range() is running for the same mm on another CPU,
>> since it can't get past ptep_clear_flush() (waiting for the CPU
>> in get_user_pages_fast() to re-enable interrupts).
>
> Yeah, this cannot happen on native.
>
>
>>> An experimental patch is made to prevent the PTE being modified in the
>>> middle of gup_pte_range(). The BUG_ON disappears afterward.
>>>
>>> However, from the comments embedded in gup.c, it seems deliberate to
>>> avoid the lock in the fast path. The question is: if so, how to avoid
>>> the above scenario?
>>
>> Nick, based on your doing of the initial implementation, would
>> you be able to estimate whether disabling get_user_pages_fast()
>> altogether for Xen would be performing measurably worse than
>> adding the locks (but continuing to avoid acquiring mm->mmap_sem)
>> as suggested by Xiaowei? That's of course only if the latter is correct
>> at all, of which I haven't fully convinced myself yet.
>
> You must have some way to guarantee existence of Linux page
> tables when you walk them in order to resolve a TLB refill.
>
> x86 does this with IPI and hardware fill that is atomic WRT interrupts.
> So fast gup can disable interrupts to walk page tables; I don't think it
> is fragile, it is absolutely tied to the system ISA (of course that can
> change, but as Peter said, other things will have to change).
>
> Other architectures use RCU for this, so fast gup uses a lockless-
> pagecache-alike protocol for that.
>
> If Xen is not using IPIs for flush, it should use whatever locks or
> synchronization its TLB refill is using.

Thanks everyone! It's very clear now that the problem only occurs on a Xen 
PV kernel, which doesn't use IPIs to flush the TLB and so lacks the implicit 
sync mechanism. For now we can disable the fast path as a temporary solution 
until a better one comes up -- compared to adding extra locks, the slow path 
may not be bad, and get_user_pages_fast() is actually not called that much 
in our environment.
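
For illustration, one minimal way to do that (purely a hypothetical sketch,
not the change we actually made, and assuming xen_pv_domain() is available
in this kernel) would be to make the fast path fall straight through to the
mmap_sem-protected slow path on Xen PV:

int get_user_pages_fast(unsigned long start, int nr_pages, int write,
			struct page **pages)
{
	struct mm_struct *mm = current->mm;
	int ret;

	if (xen_pv_domain()) {
		/* no IPI-based flush => the IRQ-off walk is unsafe here */
		down_read(&mm->mmap_sem);
		ret = get_user_pages(current, mm, start, nr_pages,
				     write, 0, pages, NULL);
		up_read(&mm->mmap_sem);
		return ret;
	}

	/* ... otherwise the existing interrupt-disabled walk runs here ... */
	return 0;
}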

However, this issue may raise another concern: could there be other 
places inside the Xen PV kernel that have the same sync problem?

Thanks,
Xiaowei

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-27 14:30   ` Jan Beulich
@ 2011-01-28 10:51     ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2011-01-28 10:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xiaowei Yang, fanhenglong, Kaushik Barde, Kenneth Lee,
	linqaingmin, wangzhenguo, Wu Fengguang, Nick Piggin,
	linux-kernel

On Thu, 2011-01-27 at 14:30 +0000, Jan Beulich wrote:
> >>> On 27.01.11 at 14:56, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Thu, 2011-01-27 at 21:05 +0800, Xiaowei Yang wrote:
> >> 
> >> However, from the comments embedded in gup.c, it seems deliberate to 
> >> avoid the lock in the fast path. The question is: if so, how to avoid 
> >> the above scenario? 
> > 
> > Something like the below comes to mind... but I must say I haven't fully
> > considered the problem yet..
> 
> That doesn't seem to account for the possible case of the page
> even managing to get allocated again to something else.
> 
> And I think you would need to drop out of gup_pte_range() in
> that case.
> 
> I would think this needs to be get_page_unless_zero()
> followed by re-checking of the page table entry (probably
> not even requiring a second gup_get_pte()); I'm not sure
> yet what the correct action would be for change in only the
> accessed/dirty bits.

Nah, if it's racy against unmap it's racy, and the caller needs to be able
to deal with whatever page comes its way. You either get the old page, the
new one, or no page.

The get_page_unless_zero() + RCU-freed page-tables approach should make Xen
work again; look at the powerpc gup_fast implementation.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-27 18:27     ` Jeremy Fitzhardinge
  (?)
  (?)
@ 2011-01-30 13:01     ` Avi Kivity
  2011-01-30 22:21         ` Kaushik Barde
  -1 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2011-01-30 13:01 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Jan Beulich, Xiaowei Yang, Nick Piggin, Peter Zijlstra,
	fanhenglong, Kaushik Barde, Kenneth Lee, linqaingmin,
	wangzhenguo, Wu Fengguang, xen-devel, linux-kernel,
	Marcelo Tosatti

On 01/27/2011 08:27 PM, Jeremy Fitzhardinge wrote:
> And even just considering virtualization, having non-IPI-based tlb
> shootdown is a measurable performance win, since a hypervisor can
> optimise away a cross-VCPU shootdown if it knows no physical TLB
> contains the target VCPU's entries.  I can imagine the KVM folks could
> get some benefit from that as well.

It's nice to avoid the IPI (and waking up a cpu if it happens to be 
asleep) but I think the risk of deviating too much from the baremetal 
arch is too large, as demonstrated by this bug.

(well, async page faults is a counterexample, I wonder if/when it will 
bite us)

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: One (possible) x86 get_user_pages bug
  2011-01-30 13:01     ` Avi Kivity
@ 2011-01-30 22:21         ` Kaushik Barde
  0 siblings, 0 replies; 26+ messages in thread
From: Kaushik Barde @ 2011-01-30 22:21 UTC (permalink / raw)
  To: 'Avi Kivity', 'Jeremy Fitzhardinge'
  Cc: 'Jan Beulich', 'Xiaowei Yang',
	'Nick Piggin', 'Peter Zijlstra',
	fanhenglong, 'Kenneth Lee', 'linqaingmin',
	wangzhenguo, 'Wu Fengguang',
	xen-devel, linux-kernel, 'Marcelo Tosatti'

I agree, i.e. deviating from the underlying arch is not a good idea.

Also, agreed, the hypervisor knows which page entries are ready for a TLB
flush across vCPUs.

But using the above knowledge along with an IPI-based TLB flush is a better
solution.  The ability to synchronize it with pCPU-based IPIs and TLB flushes
across vCPUs is key.

IPIs themselves should be a few hundred uSecs in terms of latency. Also, why
should a pCPU be in a sleep state for an active vCPU's scheduled page workload?

-Kaushik

-----Original Message-----
From: Avi Kivity [mailto:avi@redhat.com] 
Sent: Sunday, January 30, 2011 5:02 AM
To: Jeremy Fitzhardinge
Cc: Jan Beulich; Xiaowei Yang; Nick Piggin; Peter Zijlstra;
fanhenglong@huawei.com; Kaushik Barde; Kenneth Lee; linqaingmin;
wangzhenguo@huawei.com; Wu Fengguang; xen-devel@lists.xensource.com;
linux-kernel@vger.kernel.org; Marcelo Tosatti
Subject: Re: One (possible) x86 get_user_pages bug

On 01/27/2011 08:27 PM, Jeremy Fitzhardinge wrote:
> And even just considering virtualization, having non-IPI-based tlb
> shootdown is a measurable performance win, since a hypervisor can
> optimise away a cross-VCPU shootdown if it knows no physical TLB
> contains the target VCPU's entries.  I can imagine the KVM folks could
> get some benefit from that as well.

It's nice to avoid the IPI (and waking up a cpu if it happens to be 
asleep) but I think the risk of deviating too much from the baremetal 
arch is too large, as demonstrated by this bug.

(well, async page faults is a counterexample, I wonder if/when it will 
bite us)

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-30 22:21         ` Kaushik Barde
  (?)
@ 2011-01-31 18:04         ` Jeremy Fitzhardinge
  2011-01-31 20:10             ` Kaushik Barde
  -1 siblings, 1 reply; 26+ messages in thread
From: Jeremy Fitzhardinge @ 2011-01-31 18:04 UTC (permalink / raw)
  To: Kaushik Barde
  Cc: 'Avi Kivity', 'Jan Beulich',
	'Xiaowei Yang', 'Nick Piggin',
	'Peter Zijlstra', fanhenglong, 'Kenneth Lee',
	'linqaingmin', wangzhenguo, 'Wu Fengguang',
	xen-devel, linux-kernel, 'Marcelo Tosatti'

On 01/30/2011 02:21 PM, Kaushik Barde wrote:
> I agree i.e. deviation from underlying arch consideration is not a good
> idea.
>
> Also, agreed, hypervisor knows which page entries are ready for TLB flush
> across vCPUs. 
>
> But, using above knowledge, along with TLB flush based on IPI is a better
> solution.  Its ability to synchronize it with pCPU based IPI and TLB flush
> across vCPU. is key. 

I'm not sure I follow you here.  The issue with TLB flush IPIs is that
the hypervisor doesn't know the purpose of the IPI and ends up
(potentially) waking up a sleeping VCPU just to flush its tlb - but
since it was sleeping there were no stale TLB entries to flush.

Xen's TLB flush hypercalls can optimise that case by only sending a real
IPI to PCPUs which are actually running target VCPUs.  In other cases,
where a PCPU is known to have stale entries but it isn't running a
relevant VCPU, it can just mark a deferred TLB flush which gets executed
before the VCPU runs again.

In other words, Xen can take significant advantage of getting a
higher-level call ("flush these TLBs") compared to just a simple IPI.
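
Concretely, that higher-level call is the flush_tlb_others() mmu pv-op; the
Xen backend looks roughly like this (paraphrased from memory, with the
vcpumask plumbing and the actual HYPERVISOR_mmuext_op() call omitted, so
treat the details as approximate rather than verbatim source):

static void xen_flush_tlb_others(const struct cpumask *cpus,
				 struct mm_struct *mm, unsigned long va)
{
	struct mmuext_op op = {
		.cmd = (va == TLB_FLUSH_ALL) ? MMUEXT_TLB_FLUSH_MULTI
					     : MMUEXT_INVLPG_MULTI,
		.arg1.linear_addr = va,
	};

	/* the hypervisor is handed the vcpumask and decides which pCPUs
	 * (those actually running a target vCPU) really need an interrupt;
	 * descheduled vCPUs just get a deferred flush before running again */
}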

Are you suggesting that the hypervisor should export some kind of "known
dirty TLB" table to the guest, and have the guest work out which VCPUs
need IPIs sent to them?  How would that work?

> IPIs themselves should be in few hundred uSecs in terms latency. Also, why
> should pCPU be in sleep state for active vCPU scheduled page workload?

A "few hundred uSecs" is really very slow - that's nearly a
millisecond.  It's worth spending some effort to avoid those kinds of
delays.

    J

> -Kaushik
>
> -----Original Message-----
> From: Avi Kivity [mailto:avi@redhat.com] 
> Sent: Sunday, January 30, 2011 5:02 AM
> To: Jeremy Fitzhardinge
> Cc: Jan Beulich; Xiaowei Yang; Nick Piggin; Peter Zijlstra;
> fanhenglong@huawei.com; Kaushik Barde; Kenneth Lee; linqaingmin;
> wangzhenguo@huawei.com; Wu Fengguang; xen-devel@lists.xensource.com;
> linux-kernel@vger.kernel.org; Marcelo Tosatti
> Subject: Re: One (possible) x86 get_user_pages bug
>
> On 01/27/2011 08:27 PM, Jeremy Fitzhardinge wrote:
>> And even just considering virtualization, having non-IPI-based tlb
>> shootdown is a measurable performance win, since a hypervisor can
>> optimise away a cross-VCPU shootdown if it knows no physical TLB
>> contains the target VCPU's entries.  I can imagine the KVM folks could
>> get some benefit from that as well.
> It's nice to avoid the IPI (and waking up a cpu if it happens to be 
> asleep) but I think the risk of deviating too much from the baremetal 
> arch is too large, as demonstrated by this bug.
>
> (well, async page faults is a counterexample, I wonder if/when it will 
> bite us)
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: One (possible) x86 get_user_pages bug
  2011-01-31 18:04         ` Jeremy Fitzhardinge
@ 2011-01-31 20:10             ` Kaushik Barde
  0 siblings, 0 replies; 26+ messages in thread
From: Kaushik Barde @ 2011-01-31 20:10 UTC (permalink / raw)
  To: 'Jeremy Fitzhardinge'
  Cc: 'Avi Kivity', 'Jan Beulich',
	'Xiaowei Yang', 'Nick Piggin',
	'Peter Zijlstra', fanhenglong, 'Kenneth Lee',
	'linqaingmin', wangzhenguo, 'Wu Fengguang',
	xen-devel, linux-kernel, 'Marcelo Tosatti'

<< I'm not sure I follow you here.  The issue with TLB flush IPIs is that
the hypervisor doesn't know the purpose of the IPI and ends up
(potentially) waking up a sleeping VCPU just to flush its tlb - but
since it was sleeping there were no stale TLB entries to flush.>>

That's what I was trying to understand: what is "sleep" here? Is it ACPI sleep
or some internal scheduling state? If vCPUs are asynchronous to pCPUs in
terms of ACPI sleep state, then they need to be synced up. That's where the
entire ACPI modeling needs to be considered, and that's where KVM may not see
this issue. Maybe I am missing something here.

<< A "few hundred uSecs" is really very slow - that's nearly a
millisecond.  It's worth spending some effort to avoid those kinds of
delays.>>

Actually, I just checked: IPIs are usually 1000-1500 cycles long (comparable to
a VMEXIT). My point is that the ideal solution should be one where virtual
platform behavior is closer to bare metal: interrupts, memory, CPU state, etc.
How to do it? Well, that's what needs to be figured out :-)

-Kaushik


-----Original Message-----
From: Jeremy Fitzhardinge [mailto:jeremy@goop.org] 
Sent: Monday, January 31, 2011 10:05 AM
To: Kaushik Barde
Cc: 'Avi Kivity'; 'Jan Beulich'; 'Xiaowei Yang'; 'Nick Piggin'; 'Peter
Zijlstra'; fanhenglong@huawei.com; 'Kenneth Lee'; 'linqaingmin';
wangzhenguo@huawei.com; 'Wu Fengguang'; xen-devel@lists.xensource.com;
linux-kernel@vger.kernel.org; 'Marcelo Tosatti'
Subject: Re: One (possible) x86 get_user_pages bug

On 01/30/2011 02:21 PM, Kaushik Barde wrote:
> I agree i.e. deviation from underlying arch consideration is not a good
> idea.
>
> Also, agreed, hypervisor knows which page entries are ready for TLB flush
> across vCPUs. 
>
> But, using above knowledge, along with TLB flush based on IPI is a better
> solution.  Its ability to synchronize it with pCPU based IPI and TLB flush
> across vCPU. is key. 

I'm not sure I follow you here.  The issue with TLB flush IPIs is that
the hypervisor doesn't know the purpose of the IPI and ends up
(potentially) waking up a sleeping VCPU just to flush its tlb - but
since it was sleeping there were no stale TLB entries to flush.

Xen's TLB flush hypercalls can optimise that case by only sending a real
IPI to PCPUs which are actually running target VCPUs.  In other cases,
where a PCPU is known to have stale entries but it isn't running a
relevant VCPU, it can just mark a deferred TLB flush which gets executed
before the VCPU runs again.

In other words, Xen can take significant advantage of getting a
higher-level call ("flush these TLBs") compared to just a simple IPI.

Are you suggesting that the hypervisor should export some kind of "known
dirty TLB" table to the guest, and have the guest work out which VCPUs
need IPIs sent to them?  How would that work?

> IPIs themselves should be in few hundred uSecs in terms latency. Also, why
> should pCPU be in sleep state for active vCPU scheduled page workload?

A "few hundred uSecs" is really very slow - that's nearly a
millisecond.  It's worth spending some effort to avoid those kinds of
delays.

    J

> -Kaushik
>
> -----Original Message-----
> From: Avi Kivity [mailto:avi@redhat.com] 
> Sent: Sunday, January 30, 2011 5:02 AM
> To: Jeremy Fitzhardinge
> Cc: Jan Beulich; Xiaowei Yang; Nick Piggin; Peter Zijlstra;
> fanhenglong@huawei.com; Kaushik Barde; Kenneth Lee; linqaingmin;
> wangzhenguo@huawei.com; Wu Fengguang; xen-devel@lists.xensource.com;
> linux-kernel@vger.kernel.org; Marcelo Tosatti
> Subject: Re: One (possible) x86 get_user_pages bug
>
> On 01/27/2011 08:27 PM, Jeremy Fitzhardinge wrote:
>> And even just considering virtualization, having non-IPI-based tlb
>> shootdown is a measurable performance win, since a hypervisor can
>> optimise away a cross-VCPU shootdown if it knows no physical TLB
>> contains the target VCPU's entries.  I can imagine the KVM folks could
>> get some benefit from that as well.
> It's nice to avoid the IPI (and waking up a cpu if it happens to be 
> asleep) but I think the risk of deviating too much from the baremetal 
> arch is too large, as demonstrated by this bug.
>
> (well, async page faults is a counterexample, I wonder if/when it will 
> bite us)
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: One (possible) x86 get_user_pages bug
@ 2011-01-31 20:10             ` Kaushik Barde
  0 siblings, 0 replies; 26+ messages in thread
From: Kaushik Barde @ 2011-01-31 20:10 UTC (permalink / raw)
  To: 'Jeremy Fitzhardinge'
  Cc: xen-devel, 'Kenneth Lee', 'Peter Zijlstra',
	'Marcelo Tosatti', linux-kernel, 'Jan Beulich',
	wangzhenguo, 'Xiaowei Yang', 'linqaingmin',
	fanhenglong, 'Avi Kivity', 'Wu Fengguang',
	'Nick Piggin'

<< I'm not sure I follow you here.  The issue with TLB flush IPIs is that
the hypervisor doesn't know the purpose of the IPI and ends up
(potentially) waking up a sleeping VCPU just to flush its tlb - but
since it was sleeping there were no stale TLB entries to flush.>>

That's what I was trying understand, what is "Sleep" here? Is it ACPI sleep
or some internal scheduling state? If vCPUs  are asynchronous to pCPU in
terms of ACPI sleep state, then they need to synced-up. That's where entire
ACPI modeling needs to be considered. That's where KVM may not see this
issue. Maybe I am missing something here.

<< A "few hundred uSecs" is really very slow - that's nearly a
millisecond.  It's worth spending some effort to avoid those kinds of
delays.>>

Actually, I just checked: IPIs are usually 1000-1500 cycles long (comparable
to a VMEXIT). My point is that the ideal solution is one where the virtual
platform's behavior is closer to bare metal in terms of interrupts, memory,
CPU state, etc. How to do it? Well, that's what needs to be figured out :-)

-Kaushik


-----Original Message-----
From: Jeremy Fitzhardinge [mailto:jeremy@goop.org] 
Sent: Monday, January 31, 2011 10:05 AM
To: Kaushik Barde
Cc: 'Avi Kivity'; 'Jan Beulich'; 'Xiaowei Yang'; 'Nick Piggin'; 'Peter
Zijlstra'; fanhenglong@huawei.com; 'Kenneth Lee'; 'linqaingmin';
wangzhenguo@huawei.com; 'Wu Fengguang'; xen-devel@lists.xensource.com;
linux-kernel@vger.kernel.org; 'Marcelo Tosatti'
Subject: Re: One (possible) x86 get_user_pages bug

On 01/30/2011 02:21 PM, Kaushik Barde wrote:
> I agree, i.e. deviating from the underlying arch is not a good
> idea.
>
> Also, agreed, the hypervisor knows which page entries are ready for a TLB
> flush across vCPUs.
>
> But using the above knowledge along with an IPI-based TLB flush is a better
> solution.  Its ability to synchronize with pCPU-based IPIs and TLB flushes
> across vCPUs is key.

I'm not sure I follow you here.  The issue with TLB flush IPIs is that
the hypervisor doesn't know the purpose of the IPI and ends up
(potentially) waking up a sleeping VCPU just to flush its tlb - but
since it was sleeping there were no stale TLB entries to flush.

Xen's TLB flush hypercalls can optimise that case by only sending a real
IPI to PCPUs which are actually running target VCPUs.  In other cases,
where a PCPU is known to have stale entries but it isn't running a
relevant VCPU, it can just mark a deferred TLB flush which gets executed
before the VCPU runs again.

In other words, Xen can take significant advantage of getting a
higher-level call ("flush these TLBs") compared to just a simple IPI.

Are you suggesting that the hypervisor should export some kind of "known
dirty TLB" table to the guest, and have the guest work out which VCPUs
need IPIs sent to them?  How would that work?

> IPIs themselves should be in the few hundred uSecs range in terms of latency.
> Also, why should a pCPU be in a sleep state for an active vCPU's scheduled
> page workload?

A "few hundred uSecs" is really very slow - that's nearly a
millisecond.  It's worth spending some effort to avoid those kinds of
delays.

    J

> -Kaushik
>
> -----Original Message-----
> From: Avi Kivity [mailto:avi@redhat.com] 
> Sent: Sunday, January 30, 2011 5:02 AM
> To: Jeremy Fitzhardinge
> Cc: Jan Beulich; Xiaowei Yang; Nick Piggin; Peter Zijlstra;
> fanhenglong@huawei.com; Kaushik Barde; Kenneth Lee; linqaingmin;
> wangzhenguo@huawei.com; Wu Fengguang; xen-devel@lists.xensource.com;
> linux-kernel@vger.kernel.org; Marcelo Tosatti
> Subject: Re: One (possible) x86 get_user_pages bug
>
> On 01/27/2011 08:27 PM, Jeremy Fitzhardinge wrote:
>> And even just considering virtualization, having non-IPI-based tlb
>> shootdown is a measurable performance win, since a hypervisor can
>> optimise away a cross-VCPU shootdown if it knows no physical TLB
>> contains the target VCPU's entries.  I can imagine the KVM folks could
>> get some benefit from that as well.
> It's nice to avoid the IPI (and waking up a cpu if it happens to be 
> asleep) but I think the risk of deviating too much from the baremetal 
> arch is too large, as demonstrated by this bug.
>
> (well, async page faults is a counterexample, I wonder if/when it will 
> bite us)
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: One (possible) x86 get_user_pages bug
  2011-01-31 20:10             ` Kaushik Barde
  (?)
@ 2011-01-31 22:10             ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 26+ messages in thread
From: Jeremy Fitzhardinge @ 2011-01-31 22:10 UTC (permalink / raw)
  To: Kaushik Barde
  Cc: 'Avi Kivity', 'Jan Beulich',
	'Xiaowei Yang', 'Nick Piggin',
	'Peter Zijlstra', fanhenglong, 'Kenneth Lee',
	'linqaingmin', wangzhenguo, 'Wu Fengguang',
	xen-devel, linux-kernel, 'Marcelo Tosatti'

On 01/31/2011 12:10 PM, Kaushik Barde wrote:
> << I'm not sure I follow you here.  The issue with TLB flush IPIs is that
> the hypervisor doesn't know the purpose of the IPI and ends up
> (potentially) waking up a sleeping VCPU just to flush its tlb - but
> since it was sleeping there were no stale TLB entries to flush.>>
>
> That's what I was trying to understand: what is "sleep" here? Is it ACPI sleep
> or some internal scheduling state? If vCPUs are asynchronous to the pCPU in
> terms of ACPI sleep state, then they need to be synced up. That's where the
> entire ACPI modeling needs to be considered, and that's where KVM may not see
> this issue. Maybe I am missing something here.

No, nothing to do with ACPI.  Multiple virtual CPUs (VCPUs) can be
multiplexed onto a single physical CPU (PCPU), in much the same way as
tasks are scheduled onto CPUs (identically, in KVM's case).  If a VCPU
is not currently running - either because it is simply descheduled, or
because it is blocked (what I slightly misleadingly called "sleeping"
above) in a hypercall, then it is not currently using any physical CPU
resources, including the TLBs.  In that case, there's no need to flush
that VCPU's TLB entries, because there are none.
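
As a purely illustrative sketch (not Xen's actual implementation; every
name here is invented), the hypervisor's side of that decision could look
something like the following: a VCPU that isn't on any physical CPU owns
no physical TLB entries, so it never needs an IPI, at most a deferred
flush applied before it next runs.
----------------------------------------------------------------
#include <stdbool.h>

#define NR_VCPUS 8

struct vcpu {
	bool running;       /* currently scheduled on a physical CPU?   */
	int  pcpu;          /* which physical CPU, when running         */
	bool flush_pending; /* deferred flush, honoured before next run */
};

static struct vcpu vcpus[NR_VCPUS];

static void send_flush_ipi(int pcpu) { (void)pcpu; } /* illustrative stub */
static void local_tlb_flush(void) { }                /* illustrative stub */

/* Hypervisor side of a "flush these VCPUs' TLBs" request. */
static void flush_tlb_mask(const bool target[NR_VCPUS])
{
	for (int i = 0; i < NR_VCPUS; i++) {
		if (!target[i])
			continue;
		if (vcpus[i].running)
			send_flush_ipi(vcpus[i].pcpu); /* only running VCPUs cost an IPI */
		else
			vcpus[i].flush_pending = true; /* no physical TLB entries now;
							  flush lazily before next run */
	}
}

/* Called by the VCPU scheduler just before putting a VCPU on a PCPU. */
static void vcpu_prepare_to_run(struct vcpu *v)
{
	if (v->flush_pending) {
		local_tlb_flush();  /* the deferred flush, no cross-CPU IPI sent */
		v->flush_pending = false;
	}
}
----------------------------------------------------------------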

> << A "few hundred uSecs" is really very slow - that's nearly a
> millisecond.  It's worth spending some effort to avoid those kinds of
> delays.>>
>
> Actually, I just checked: IPIs are usually 1000-1500 cycles long (comparable
> to a VMEXIT). My point is that the ideal solution is one where the virtual
> platform's behavior is closer to bare metal in terms of interrupts, memory,
> CPU state, etc. How to do it? Well, that's what needs to be figured out :-)

The interesting number is not the raw cost of an IPI, but the overall
cost of the remote TLB flush.

    J

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2011-01-31 22:10 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-27 13:05 One (possible) x86 get_user_pages bug Xiaowei Yang
2011-01-27 13:56 ` Peter Zijlstra
2011-01-27 14:30   ` Jan Beulich
2011-01-28 10:51     ` Peter Zijlstra
2011-01-27 14:49 ` Jan Beulich
2011-01-27 14:49   ` Jan Beulich
2011-01-27 15:01   ` Peter Zijlstra
2011-01-27 18:27   ` Jeremy Fitzhardinge
2011-01-27 18:27     ` Jeremy Fitzhardinge
2011-01-27 19:27     ` Peter Zijlstra
2011-01-30 13:01     ` Avi Kivity
2011-01-30 22:21       ` Kaushik Barde
2011-01-30 22:21         ` Kaushik Barde
2011-01-31 18:04         ` Jeremy Fitzhardinge
2011-01-31 20:10           ` Kaushik Barde
2011-01-31 20:10             ` Kaushik Barde
2011-01-31 22:10             ` Jeremy Fitzhardinge
2011-01-27 16:07 ` Jan Beulich
2011-01-27 16:07   ` Jan Beulich
2011-01-27 16:25   ` Peter Zijlstra
2011-01-27 16:41     ` Jan Beulich
2011-01-27 16:41       ` Jan Beulich
2011-01-27 16:56   ` Peter Zijlstra
2011-01-27 21:24   ` Nick Piggin
2011-01-28  7:17     ` Xiaowei Yang
2011-01-28  7:17       ` Xiaowei Yang
