* Question about x86/mm/gup.c's use of disabled interrupts
@ 2009-03-18 19:17 ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 43+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-18 19:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Kernel Mailing List, Linux Memory Management List,
	Xen-devel, Jan Beulich, Ingo Molnar

Hi Nick,

The comment in arch/x86/mm/gup.c:gup_get_pte() says:

	[...] What
	 * we do have is the guarantee that a pte will only either go from not
	 * present to present, or present to not present or both -- it will not
	 * switch to a completely different present page without a TLB flush in
	 * between; something that we are blocking by holding interrupts off.


Disabling the interrupt will prevent the tlb flush IPI from coming in 
and flushing this cpu's tlb, but I don't see how it will prevent some 
other cpu from actually updating the pte in the pagetable, which is what 
we're concerned about here.  Is this the only reason to disable 
interrupts?  Would we need to do it for the !PAE cases?
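
(For reference, the PAE side of gup_get_pte() reads the pte in two
halves and retries if the low word changes underneath it -- roughly:

	static inline pte_t gup_get_pte(pte_t *ptep)
	{
		pte_t pte;

	retry:
		pte.pte_low = ptep->pte_low;
		smp_rmb();
		pte.pte_high = ptep->pte_high;
		smp_rmb();
		if (unlikely(pte.pte_low != ptep->pte_low))
			goto retry;

		return pte;
	}

so the question is what guarantees that the low/high pair it assembles
belongs to a single pte.)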

Also, assuming that disabling the interrupt is enough to get the 
guarantees we need here, there's a Xen problem because we don't use IPIs 
for cross-cpu tlb flushes (well, it happens within Xen).  I'll have to 
think a bit about how to deal with that, but I'm thinking that we could 
add a per-cpu "tlb flushes blocked" flag, and maintain some kind of 
per-cpu deferred tlb flush count so we can get around to doing the flush 
eventually.
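
Something like this, perhaps (names invented, completely untested):

	static DEFINE_PER_CPU(int, tlb_flush_blocked);
	static DEFINE_PER_CPU(int, tlb_flush_deferred);

	/* around the pagetable walk in gup_fast(): */
	__get_cpu_var(tlb_flush_blocked) = 1;
	/* ... walk and snapshot the ptes ... */
	__get_cpu_var(tlb_flush_blocked) = 0;
	if (__get_cpu_var(tlb_flush_deferred)) {
		__get_cpu_var(tlb_flush_deferred) = 0;
		__flush_tlb();	/* catch up on flushes we held off */
	}

where xen_flush_tlb_others() would set a cpu's tlb_flush_deferred
instead of flushing it while its tlb_flush_blocked is set.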

But I want to make sure I understand the exact algorithm here.

(I couldn't find an instance of a pte update going from present->present 
anyway; the only caller of set_pte_present is set_pte_vaddr which only 
operates on kernel mappings, so perhaps this is moot.  Oh, look, 
native_set_pte_present thinks it's only called on user mappings...  In 
fact set_pte_present seems to have completely lost its rationale for 
existing.)

Thanks,
    J

* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-18 19:17 ` Jeremy Fitzhardinge
@ 2009-03-18 21:13   ` Avi Kivity
  -1 siblings, 0 replies; 43+ messages in thread
From: Avi Kivity @ 2009-03-18 21:13 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar

Jeremy Fitzhardinge wrote:
> Hi Nick,
>
> The comment in arch/x86/mm/gup.c:gup_get_pte() says:
>
>     [...] What
>      * we do have is the guarantee that a pte will only either go from 
> not
>      * present to present, or present to not present or both -- it 
> will not
>      * switch to a completely different present page without a TLB 
> flush in
>      * between; something that we are blocking by holding interrupts off.
>
>
> Disabling the interrupt will prevent the tlb flush IPI from coming in 
> and flushing this cpu's tlb, but I don't see how it will prevent some 
> other cpu from actually updating the pte in the pagetable, which is 
> what we're concerned about here.  

The thread that cleared the pte holds the pte lock and is now waiting 
for the IPI.  The thread that wants to update the pte will wait for the 
pte lock, thus also waits on the IPI and gup_fast()'s 
local_irq_enable().  I think.
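
That is, assuming I have the ordering right:

	cpu0: local_irq_disable(); walks the ptes in gup_fast()
	cpu1: takes the pte lock; clears the pte; flush_tlb_others()
	      spins until cpu0 acks the IPI, i.e. local_irq_enable()
	cpu2: wants to install the new pte; blocks on the pte lock
	      that cpu1 holds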

> Is this the only reason to disable interrupts?  

Another comment says it also prevents pagetable teardown.

> Also, assuming that disabling the interrupt is enough to get the 
> guarantees we need here, there's a Xen problem because we don't use 
> IPIs for cross-cpu tlb flushes (well, it happens within Xen).  I'll 
> have to think a bit about how to deal with that, but I'm thinking that 
> we could add a per-cpu "tlb flushes blocked" flag, and maintain some 
> kind of per-cpu deferred tlb flush count so we can get around to doing 
> the flush eventually.

I was thinking about adding a hypercall for cross-vcpu tlb flushes.  
Guess I'll wait for you to clear up all the issues first.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-18 21:13   ` Avi Kivity
@ 2009-03-18 21:23     ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 43+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-18 21:23 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar

Avi Kivity wrote:
> Jeremy Fitzhardinge wrote:
>> Disabling the interrupt will prevent the tlb flush IPI from coming in 
>> and flushing this cpu's tlb, but I don't see how it will prevent some 
>> other cpu from actually updating the pte in the pagetable, which is 
>> what we're concerned about here.  
>
> The thread that cleared the pte holds the pte lock and is now waiting 
> for the IPI.  The thread that wants to update the pte will wait for 
> the pte lock, thus also waits on the IPI and gup_fast()'s 
> local_irq_enable().  I think.

But hasn't it already done the pte update at that point?

(I think this conversation really is moot because the kernel never does 
P->P pte updates any more; it's always P->N->P.)

>> Is this the only reason to disable interrupts?  
>
> Another comment says it also prevents pagetable teardown.

We could take a reference to the mm to get the same effect, no?

>> Also, assuming that disabling the interrupt is enough to get the 
>> guarantees we need here, there's a Xen problem because we don't use 
>> IPIs for cross-cpu tlb flushes (well, it happens within Xen).  I'll 
>> have to think a bit about how to deal with that, but I'm thinking 
>> that we could add a per-cpu "tlb flushes blocked" flag, and maintain 
>> some kind of per-cpu deferred tlb flush count so we can get around to 
>> doing the flush eventually.
>
> I was thinking about adding a hypercall for cross-vcpu tlb flushes.  
> Guess I'll wait for you to clear up all the issues first.

Typical...

    J


* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-18 21:23     ` Jeremy Fitzhardinge
@ 2009-03-18 21:40       ` Avi Kivity
  -1 siblings, 0 replies; 43+ messages in thread
From: Avi Kivity @ 2009-03-18 21:40 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar

Jeremy Fitzhardinge wrote:
>>> Disabling the interrupt will prevent the tlb flush IPI from coming 
>>> in and flushing this cpu's tlb, but I don't see how it will prevent 
>>> some other cpu from actually updating the pte in the pagetable, 
>>> which is what we're concerned about here.  
>>
>> The thread that cleared the pte holds the pte lock and is now waiting 
>> for the IPI.  The thread that wants to update the pte will wait for 
>> the pte lock, thus also waits on the IPI and gup_fast()'s 
>> local_irq_enable().  I think.
>
> But hasn't it already done the pte update at that point?
>
> (I think this conversation really is moot because the kernel never 
> does P->P pte updates any more; its always P->N->P.)

I thought you were concerned about cpu 0 doing a gup_fast(), cpu 1 doing 
P->N, and cpu 2 doing N->P.  In this case cpu 2 is waiting on the pte lock.

>>> Is this the only reason to disable interrupts?  
>>
>> Another comment says it also prevents pagetable teardown.
>
> We could take a reference to the mm to get the same effect, no?
>

Won't stop munmap().

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-18 21:40       ` Avi Kivity
@ 2009-03-18 22:14         ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 43+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-18 22:14 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar

Avi Kivity wrote:
> Jeremy Fitzhardinge wrote:
>>>> Disabling the interrupt will prevent the tlb flush IPI from coming 
>>>> in and flushing this cpu's tlb, but I don't see how it will prevent 
>>>> some other cpu from actually updating the pte in the pagetable, 
>>>> which is what we're concerned about here.  
>>>
>>> The thread that cleared the pte holds the pte lock and is now 
>>> waiting for the IPI.  The thread that wants to update the pte will 
>>> wait for the pte lock, thus also waits on the IPI and gup_fast()'s 
>>> local_irq_enable().  I think.
>>
>> But hasn't it already done the pte update at that point?
>>
>> (I think this conversation really is moot because the kernel never 
>> does P->P pte updates any more; its always P->N->P.)
>
> I thought you were concerned about cpu 0 doing a gup_fast(), cpu 1 
> doing P->N, and cpu 2 doing N->P.  In this case cpu 2 is waiting on 
> the pte lock.

The issue is that if cpu 0 is doing a gup_fast() and other cpus are 
doing P->P updates, then gup_fast() can potentially get a mix of old and 
new pte values - where P->P is any aggregate set of unsynchronized P->N 
and N->P operations on any number of other cpus.  Ah, but if every P->N 
is followed by a tlb flush, then disabling interrupts will hold off any 
following N->P, allowing gup_fast to get a consistent pte snapshot.
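
In other words, the interleaving being prevented is:

	cpu0: irqs off, reads pte_low		(old mapping)
	cpu1: P->N, tlb flush			<- blocks on cpu0's irqs-off
	cpu1: N->P				<- can't happen yet
	cpu0: reads pte_high

Without the flush waiting on cpu0, the low and high words could come
from two different present ptes.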

Hm, awkward if flush_tlb_others doesn't IPI...

> Won't stop munmap().

And I guess it does the tlb flush before freeing the pages, so disabling 
the interrupt helps here too.

Simplest fix is to make gup_get_pte() a pvop, but that does seem like 
putting a red flag in front of an inner-loop hotspot, or something...
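
(Where the pvop would be something along these lines -- hypothetical:

	pte_t (*gup_get_pte)(pte_t *ptep);	/* new hook in pv_mmu_ops */

with the native version being the current two-half read, and the Xen
one doing whatever extra exclusion it needs.)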

The per-cpu tlb-flush exclusion flag might really be the way to go.

    J

* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-18 22:14         ` Jeremy Fitzhardinge
@ 2009-03-18 22:41           ` Avi Kivity
  -1 siblings, 0 replies; 43+ messages in thread
From: Avi Kivity @ 2009-03-18 22:41 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar

Jeremy Fitzhardinge wrote:
>> I thought you were concerned about cpu 0 doing a gup_fast(), cpu 1 
>> doing P->N, and cpu 2 doing N->P.  In this case cpu 2 is waiting on 
>> the pte lock.
>
> The issue is that if cpu 0 is doing a gup_fast() and other cpus are 
> doing P->P updates, then gup_fast() can potentially get a mix of old 
> and new pte values - where P->P is any aggregate set of unsynchronized 
> P->N and N->P operations on any number of other cpus.  Ah, but if 
> every P->N is followed by a tlb flush, then disabling interrupts will 
> hold off any following N->P, allowing gup_fast to get a consistent pte 
> snapshot.
>

Right.

> Hm, awkward if flush_tlb_others doesn't IPI...
>

How can it avoid flushing the tlb on cpu [01]?  If it's gup_fast()ing 
a pte, it may as well load it into the tlb.

>
> Simplest fix is to make gup_get_pte() a pvop, but that does seem like 
> putting a red flag in front of an inner-loop hotspot, or something...
>
> The per-cpu tlb-flush exclusion flag might really be the way to go.

I don't see how it will work, without changing Xen to look at the flag?

local_irq_disable() is used here to lock out a remote cpu; I don't see 
why deferring the flush helps.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-18 22:41           ` Avi Kivity
@ 2009-03-18 22:55             ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 43+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-18 22:55 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar

Avi Kivity wrote:
>> Hm, awkward if flush_tlb_others doesn't IPI...
>>
>
> How can it avoid flushing the tlb on cpu [01]?  If it's 
> gup_fast()ing a pte, it may as well load it into the tlb.

xen_flush_tlb_others uses a hypercall rather than an IPI, so none of the 
logic which depends on there being an IPI will work.

>> Simplest fix is to make gup_get_pte() a pvop, but that does seem like 
>> putting a red flag in front of an inner-loop hotspot, or something...
>>
>> The per-cpu tlb-flush exclusion flag might really be the way to go.
>
> I don't see how it will work, without changing Xen to look at the flag?
>
> local_irq_disable() is used here to lock out a remote cpu, I don't see 
> why deferring the flush helps.

Well, no, not deferring.  Making xen_flush_tlb_others() spin waiting for 
"doing_gup" to clear on the target cpu.  Or add an explicit notion of a 
"pte update barrier" rather than implicitly relying on the tlb IPI 
(which is extremely convenient when available...).
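
A sketch of the former (hypothetical names, hand-waving the barriers):

	static DEFINE_PER_CPU(int, doing_gup);

	static void xen_flush_tlb_others(const struct cpumask *cpus,
					 struct mm_struct *mm,
					 unsigned long va)
	{
		int cpu;

		/* wait for each target cpu to leave gup_fast() */
		for_each_cpu(cpu, cpus)
			while (per_cpu(doing_gup, cpu))
				cpu_relax();

		/* ... then do the flush hypercall as now ... */
	}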

    J

* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-18 22:55             ` Jeremy Fitzhardinge
@ 2009-03-18 23:05               ` Avi Kivity
  -1 siblings, 0 replies; 43+ messages in thread
From: Avi Kivity @ 2009-03-18 23:05 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar

Jeremy Fitzhardinge wrote:
> Avi Kivity wrote:
>>> Hm, awkward if flush_tlb_others doesn't IPI...
>>>
>>
>> How can it avoid flushing the tlb on cpu [01]?  If it's 
>> gup_fast()ing a pte, it may as well load it into the tlb.
>
> xen_flush_tlb_others uses a hypercall rather than an IPI, so none of 
> the logic which depends on there being an IPI will work.

Right, of course, that's what we were talking about.  I was thinking of 
optimizations to avoid IPIs if an mm never visited a cpu.

>
>>> Simplest fix is to make gup_get_pte() a pvop, but that does seem 
>>> like putting a red flag in front of an inner-loop hotspot, or 
>>> something...
>>>
>>> The per-cpu tlb-flush exclusion flag might really be the way to go.
>>
>> I don't see how it will work, without changing Xen to look at the flag?
>>
>> local_irq_disable() is used here to lock out a remote cpu, I don't 
>> see why deferring the flush helps.
>
> Well, no, not deferring.  Making xen_flush_tlb_others() spin waiting 
> for "doing_gup" to clear on the target cpu.  Or add an explicit notion 
> of a "pte update barrier" rather than implicitly relying on the tlb 
> IPI (which is extremely convenient when available...).

Pick up a percpu flag from all cpus and spin on each?  Nasty.

You could use the irq enabled flag; it's available and what native spins 
on (but also means I'll need to add one if I implement this).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-18 23:05               ` Avi Kivity
@ 2009-03-18 23:32                 ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 43+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-18 23:32 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar

Avi Kivity wrote:
> Jeremy Fitzhardinge wrote:
>> Avi Kivity wrote:
>>>> Hm, awkward if flush_tlb_others doesn't IPI...
>>>>
>>>
>>> How can it avoid flushing the tlb on cpu [01]?  If it's 
>>> gup_fast()ing a pte, it may as well load it into the tlb.
>>
>> xen_flush_tlb_others uses a hypercall rather than an IPI, so none of 
>> the logic which depends on there being an IPI will work.
>
> Right, of course, that's what we were talking about.  I was thinking 
> of optimizations to avoid IPIs if an mm never visited a cpu.
>
>>
>>>> Simplest fix is to make gup_get_pte() a pvop, but that does seem 
>>>> like putting a red flag in front of an inner-loop hotspot, or 
>>>> something...
>>>>
>>>> The per-cpu tlb-flush exclusion flag might really be the way to go.
>>>
>>> I don't see how it will work, without changing Xen to look at the flag?
>>>
>>> local_irq_disable() is used here to lock out a remote cpu, I don't 
>>> see why deferring the flush helps.
>>
>> Well, no, not deferring.  Making xen_flush_tlb_others() spin waiting 
>> for "doing_gup" to clear on the target cpu.  Or add an explicit 
>> notion of a "pte update barrier" rather than implicitly relying on 
>> the tlb IPI (which is extremely convenient when available...).
>
> Pick up a percpu flag from all cpus and spin on each?  Nasty.

Yeah, not great.  Each of those flag fetches is likely to be cold, so a 
bunch of cache misses.  The only mitigating factor is that cross-cpu tlb 
flushes are expected to be expensive, but some workloads are apparently 
very sensitive to extra latency in that path.  And the hypercall could 
result in no Xen-level IPIs at all, so it could be very quick by 
comparison to an IPI-based Linux implementation, in which case the flag 
polling would be particularly harsh.

Also, the straightforward implementation of "poll until all target cpus' 
flags are clear" may never make progress, so you'd have to "scan flags, 
remove busy cpus from set, repeat until all cpus done".
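
Something like (made-up names again):

	struct cpumask todo;
	int cpu;

	cpumask_copy(&todo, to_flush);	/* cpus we still have to wait out */
	while (!cpumask_empty(&todo)) {
		for_each_cpu(cpu, &todo)
			if (!per_cpu(doing_gup, cpu))
				cpumask_clear_cpu(cpu, &todo);
		cpu_relax();
	}
	/* now safe to issue the flush for the whole set */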

All annoying because this race is pretty unlikely, and it seems a shame 
to slow down all tlb flushes to deal with it.  Some kind of global 
"doing gup_fast" counter would get flush_tlb_others bypass the check, at 
the cost of putting a couple of atomic ops around the outside of gup_fast.
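
i.e. (hypothetical, and modulo memory barriers):

	static atomic_t gup_fast_count = ATOMIC_INIT(0);

	/* in gup_fast(): */
	atomic_inc(&gup_fast_count);
	/* ... existing irqs-off walk ... */
	atomic_dec(&gup_fast_count);

	/* in xen_flush_tlb_others(): only poll the per-cpu flags if
	 * someone is actually in gup_fast */
	if (atomic_read(&gup_fast_count))
		wait_for_gup_flags();	/* made-up name */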

> You could use the irq enabled flag; it's available and what native 
> spins on (but also means I'll need to add one if I implement this).

Yes, but then we'd end up spuriously polling on cpus which happened to 
disable interrupts for any reason.  And if the vcpu is not running then 
we could end up polling for a long time.  (Same applies for things in 
gup_fast, but I'm assuming that's a lot less common than disabling 
interrupts in general).

    J


* Re: Question about x86/mm/gup.c's use of disabled interrupts
       [not found]               ` <70513aa50903181617r418ec23s744544dccfd812e8@mail.gmail.com>
@ 2009-03-18 23:37                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 43+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-18 23:37 UTC (permalink / raw)
  To: Shentino
  Cc: Avi Kivity, Nick Piggin, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar

Shentino wrote:
> But, does a CPU running a task in userspace effectively have a read 
> lock on the page tables?

No.  A process has its own user pagetable which the kernel maintains on 
its behalf.  The kernel will briefly take locks on it while doing 
modifications, mostly to deal with multithreaded usermode code running 
on multiple cpus.

    J

* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-18 19:17 ` Jeremy Fitzhardinge
@ 2009-03-19  1:32   ` Nick Piggin
  -1 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2009-03-19  1:32 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Avi Kivity
  Cc: Linux Kernel Mailing List, Linux Memory Management List,
	Xen-devel, Jan Beulich, Ingo Molnar

Hi Jeremy,

I think you got most of your questions already hashed out, but
I could make a suggestion...

On Thursday 19 March 2009 06:17:03 Jeremy Fitzhardinge wrote:
> Hi Nick,
>
> The comment in arch/x86/mm/gup.c:gup_get_pte() says:
>
> 	[...] What
> 	 * we do have is the guarantee that a pte will only either go from not
> 	 * present to present, or present to not present or both -- it will not
> 	 * switch to a completely different present page without a TLB flush in
> 	 * between; something that we are blocking by holding interrupts off.
>
>
> Disabling the interrupt will prevent the tlb flush IPI from coming in
> and flushing this cpu's tlb, but I don't see how it will prevent some
> other cpu from actually updating the pte in the pagetable, which is what
> we're concerned about here.

Yes, I don't believe it is possible to have a *new* pte installed until
the flush is done.


> Is this the only reason to disable
> interrupts?  Would we need to do it for the !PAE cases?

It has to pin page tables, and pin pages as well.


> Also, assuming that disabling the interrupt is enough to get the
> guarantees we need here, there's a Xen problem because we don't use IPIs
> for cross-cpu tlb flushes (well, it happens within Xen).  I'll have to
> think a bit about how to deal with that, but I'm thinking that we could
> add a per-cpu "tlb flushes blocked" flag, and maintain some kind of
> per-cpu deferred tlb flush count so we can get around to doing the flush
> eventually.
>
> But I want to make sure I understand the exact algorithm here.

FWIW, powerpc actually can flush tlbs without IPIs, and it also has
a gup_fast. powerpc RCU frees its page _tables_ so we can walk them,
and then I use speculative page references in order to be able to
take a reference on the page without having it pinned.

Turning gup_get_pte into a pvop would be a bit nasty because on !PAE
it is just a single load, and even on PAE it is pretty cheap.


* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-18 23:32                 ` Jeremy Fitzhardinge
@ 2009-03-19  9:46                   ` Avi Kivity
  -1 siblings, 0 replies; 43+ messages in thread
From: Avi Kivity @ 2009-03-19  9:46 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar

Jeremy Fitzhardinge wrote:
>>>
>>> Well, no, not deferring.  Making xen_flush_tlb_others() spin waiting 
>>> for "doing_gup" to clear on the target cpu.  Or add an explicit 
>>> notion of a "pte update barrier" rather than implicitly relying on 
>>> the tlb IPI (which is extremely convenient when available...).
>>
>> Pick up a percpu flag from all cpus and spin on each?  Nasty.
>
> Yeah, not great.  Each of those flag fetches is likely to be cold, so 
> a bunch of cache misses.  The only mitigating factor is that cross-cpu 
> tlb flushes are expected to be expensive, but some workloads are 
> apparently very sensitive to extra latency in that path.  

Right, and they'll do a bunch more cache misses, so in comparison it 
isn't too bad.

> And the hypercall could result in no Xen-level IPIs at all, so it 
> could be very quick by comparison to an IPI-based Linux 
> implementation, in which case the flag polling would be particularly 
> harsh.

Maybe we could bring these optimizations into Linux as well.  The only 
thing Xen knows that Linux doesn't is whether a vcpu is scheduled; all 
other information is shared.

>
> Also, the straightforward implementation of "poll until all target 
> cpu's flags are clear" may never make progress, so you'd have to "scan 
> flags, remove busy cpus from set, repeat until all cpus done".
>
> All annoying because this race is pretty unlikely, and it seems a 
> shame to slow down all tlb flushes to deal with it.  Some kind of 
> global "doing gup_fast" counter would get flush_tlb_others bypass the 
> check, at the cost of putting a couple of atomic ops around the 
> outside of gup_fast.

The nice thing about local_irq_disable() is that it scales so well.

>
>> You could use the irq enabled flag; it's available and what native 
>> spins on (but also means I'll need to add one if I implement this).
>
> Yes, but then we'd end up spuriously polling on cpus which happened to 
> disable interrupts for any reason.  And if the vcpu is not running 
> then we could end up polling for a long time.  (Same applies for 
> things in gup_fast, but I'm assuming that's a lot less common than 
> disabling interrupts in general).

Right.

-- 
error compiling committee.c: too many arguments to function


* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-19  9:46                   ` Avi Kivity
@ 2009-03-19 17:16                     ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 43+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-19 17:16 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar, Keir Fraser

Avi Kivity wrote:
>> And the hypercall could result in no Xen-level IPIs at all, so it 
>> could be very quick by comparison to an IPI-based Linux 
>> implementation, in which case the flag polling would be particularly 
>> harsh.
>
> Maybe we could bring these optimizations into Linux as well.  The only 
> thing Xen knows that Linux doesn't is if a vcpu is not scheduled; all 
> other information is shared.

I don't think there's a guarantee that just because a vcpu isn't running 
now, it won't need a tlb flush.  If a pcpu runs vcpu 1 -> idle -> 
vcpu 1, then there's no need for it to do a tlb flush, but the hypercall 
can force a flush when it reschedules vcpu 1 (if the tlb hasn't 
already been flushed by some other means).

(I'm not sure to what extent Xen implements this now, but I wouldn't 
want to over-constrain it.)

>> Also, the straightforward implementation of "poll until all target 
>> cpus' flags are clear" may never make progress, so you'd have to 
>> "scan flags, remove busy cpus from set, repeat until all cpus done".
>>
>> All annoying because this race is pretty unlikely, and it seems a 
>> shame to slow down all tlb flushes to deal with it.  Some kind of 
>> global "doing gup_fast" counter would get flush_tlb_others bypass the 
>> check, at the cost of putting a couple of atomic ops around the 
>> outside of gup_fast.
>
> The nice thing about local_irq_disable() is that it scales so well.

Right.  But it effectively puts the burden on the tlb-flusher to check 
the state (implicitly, by trying to send an interrupt).  Putting an 
explicit poll in gets the same effect, but it's pure overhead just to 
deal with the gup race.
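
For concreteness, a rough sketch of that counter variant (helper names 
hypothetical; the barriers are needed so the flusher sees the increment 
before any pte is read):

#include <asm/atomic.h>
#include <linux/irqflags.h>

/* Hypothetical global count of cpus currently inside gup_fast(). */
static atomic_t gup_fast_active = ATOMIC_INIT(0);

static inline void gup_fast_begin(void)
{
	atomic_inc(&gup_fast_active);
	smp_mb__after_atomic_inc();	/* inc visible before pte reads */
	local_irq_disable();
}

static inline void gup_fast_end(void)
{
	local_irq_enable();
	smp_mb__before_atomic_dec();	/* pte reads done before dec */
	atomic_dec(&gup_fast_active);
}

/* xen_flush_tlb_others() would only poll when this returns true. */
static inline int gup_fast_in_flight(void)
{
	return atomic_read(&gup_fast_active) != 0;
}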

I'll put a patch together and see how it looks.

    J

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-19  1:32   ` Nick Piggin
@ 2009-03-19 17:31     ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 43+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-19 17:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Avi Kivity, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar

Nick Piggin wrote:
>> Also, assuming that disabling the interrupt is enough to get the
>> guarantees we need here, there's a Xen problem because we don't use IPIs
>> for cross-cpu tlb flushes (well, it happens within Xen).  I'll have to
>> think a bit about how to deal with that, but I'm thinking that we could
>> add a per-cpu "tlb flushes blocked" flag, and maintain some kind of
>> per-cpu deferred tlb flush count so we can get around to doing the flush
>> eventually.
>>
>> But I want to make sure I understand the exact algorithm here.
>>     
>
> FWIW, powerpc actually can flush tlbs without IPIs, and it also has
> a gup_fast. powerpc RCU frees its page _tables_ so we can walk them,
> and then I use speculative page references in order to be able to
> take a reference on the page without having it pinned.
>   

Ah, interesting.  So disabling interrupts prevents the RCU free from 
happening, and non-atomic pte fetching is a non-issue.  So it doesn't 
address the PAE side of the problem.

> Turning gup_get_pte into a pvop would be a bit nasty because on !PAE
> it is just a single load, and even on PAE it is pretty cheap.
>   

Well, it wouldn't be too bad; for !PAE it would turn into something we 
could inline, so there'd be little to no cost.  For PAE it would be out 
of line, but a direct function call, which would be nicely cached and 
very predictable once we've gone through the loop once (and for Xen 
I think I'd just make it a cmpxchg8b-based implementation, assuming that 
the tlb flush hypercall would offset the cost of making gup_fast a bit 
slower).
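
For illustration, a minimal sketch of such a cmpxchg8b-based read 
(helper name hypothetical, not a proposed patch):

#include <asm/pgtable.h>
#include <asm/cmpxchg.h>

/*
 * Snapshot a PAE pte in one atomic operation.  cmpxchg64() compiles
 * to cmpxchg8b; with old == new == 0 it only ever stores to a pte
 * that is already zero (a no-op), and in all cases it returns the
 * current 64-bit contents atomically.
 */
static pte_t gup_get_pte_atomic(pte_t *ptep)
{
	pte_t pte;

	pte.pte = cmpxchg64(&ptep->pte, 0ULL, 0ULL);
	return pte;
}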

But it would be better if we can address it at a higher level.

    J

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-19 17:16                     ` Jeremy Fitzhardinge
@ 2009-03-19 17:33                       ` Avi Kivity
  -1 siblings, 0 replies; 43+ messages in thread
From: Avi Kivity @ 2009-03-19 17:33 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar, Keir Fraser

Jeremy Fitzhardinge wrote:
> Avi Kivity wrote:
>>> And the hypercall could result in no Xen-level IPIs at all, so it 
>>> could be very quick by comparison to an IPI-based Linux 
>>> implementation, in which case the flag polling would be particularly 
>>> harsh.
>>
>> Maybe we could bring these optimizations into Linux as well.  The 
>> only thing Xen knows that Linux doesn't is if a vcpu is not 
>> scheduled; all other information is shared.
>
> I don't think there's a guarantee that just because a vcpu isn't 
> running now, it won't need a tlb flush.  If a pcpu runs vcpu 1 -> 
> idle -> vcpu 1, then there's no need for it to do a tlb flush, but the 
> hypercall can force a flush when it reschedules vcpu 1 (if the 
> tlb hasn't already been flushed by some other means).

That's what I assumed you meant.  Also, if a vcpu has a different cr3 
loaded, the flush can be elided.  Looks like Linux does this 
(s/vcpu/process/).
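
Roughly what the native flush IPI handler does -- a simplified sketch, 
not verbatim kernel code:

#include <linux/smp.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>

/*
 * A cpu whose active mm differs (a different cr3 is, or will be,
 * loaded) skips the flush entirely; a lazy-TLB cpu drops the mm
 * instead of flushing, since reloading cr3 later flushes implicitly.
 */
static void flush_ipi(struct mm_struct *flush_mm)
{
	if (__get_cpu_var(cpu_tlbstate).active_mm != flush_mm)
		return;				/* different cr3: elided */

	if (__get_cpu_var(cpu_tlbstate).state == TLBSTATE_OK)
		local_flush_tlb();
	else
		leave_mm(smp_processor_id());	/* lazy: drop the mm */
}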

> (I'm not sure to what extent Xen implements this now, but I wouldn't 
> want to over-constrain it.)

Well, kvm does this.


>> The nice thing about local_irq_disable() is that it scales so well.
>
> Right.  But it effectively puts the burden on the tlb-flusher to check 
> the state (implicitly, by trying to send an interrupt).  Putting an 
> explicit poll in gets the same effect, but it's pure overhead just to 
> deal with the gup race.

I guess it hopes the flushes are much rarer.  Certainly for threaded 
databases doing O_DIRECT stuff, I'd expect lots of gup_fasts and no tlb flushes.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-19 17:31     ` Jeremy Fitzhardinge
@ 2009-03-20  4:40       ` Paul E. McKenney
  -1 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2009-03-20  4:40 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Avi Kivity, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar

On Thu, Mar 19, 2009 at 10:31:55AM -0700, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
>>> Also, assuming that disabling the interrupt is enough to get the
>>> guarantees we need here, there's a Xen problem because we don't use IPIs
>>> for cross-cpu tlb flushes (well, it happens within Xen).  I'll have to
>>> think a bit about how to deal with that, but I'm thinking that we could
>>> add a per-cpu "tlb flushes blocked" flag, and maintain some kind of
>>> per-cpu deferred tlb flush count so we can get around to doing the flush
>>> eventually.
>>>
>>> But I want to make sure I understand the exact algorithm here.
>>
>> FWIW, powerpc actually can flush tlbs without IPIs, and it also has
>> a gup_fast. powerpc RCU frees its page _tables_ so we can walk them,
>> and then I use speculative page references in order to be able to
>> take a reference on the page without having it pinned.
>
> Ah, interesting.  So disabling interrupts prevents the RCU free from 
> happening, and non-atomic pte fetching is a non-issue.  So it doesn't 
> address the PAE side of the problem.

This would be rcu_sched, correct?

							Thanx, Paul

>> Turning gup_get_pte into a pvop would be a bit nasty because on !PAE
>> it is just a single load, and even on PAE it is pretty cheap.
>>   
>
> Well, it wouldn't be too bad; for !PAE it would turn into something we 
> could inline, so there'd be little to no cost.  For PAE it would be out of 
> line, but a direct function call, which would be nicely cached and very 
> predictable once we've gone through the loop once (and for Xen I think 
> I'd just make it a cmpxchg8b-based implementation, assuming that the tlb 
> flush hypercall would offset the cost of making gup_fast a bit slower).
>
> But it would be better if we can address it at a higher level.
>
>    J
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-20  4:40       ` Paul E. McKenney
@ 2009-03-20 15:38         ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 43+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-20 15:38 UTC (permalink / raw)
  To: paulmck
  Cc: Nick Piggin, Avi Kivity, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar

Paul E. McKenney wrote:
>> Ah, interesting.  So disabling interrupts prevents the RCU free from 
>> happening, and non-atomic pte fetching is a non-issue.  So it doesn't 
>> address the PAE side of the problem.
>>     
>
> This would be rcu_sched, correct?
>   

I guess?  Whatever it is that ends up calling all the rcu callbacks 
after the idle.  A cpu with disabled interrupts can't go through idle, 
right?  Or is there an explicit way to hold off rcu?

    J

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Question about x86/mm/gup.c's use of disabled interrupts
  2009-03-20 15:38         ` Jeremy Fitzhardinge
@ 2009-03-20 15:57           ` Paul E. McKenney
  -1 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2009-03-20 15:57 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Avi Kivity, Linux Kernel Mailing List,
	Linux Memory Management List, Xen-devel, Jan Beulich,
	Ingo Molnar

On Fri, Mar 20, 2009 at 08:38:46AM -0700, Jeremy Fitzhardinge wrote:
> Paul E. McKenney wrote:
>>> Ah, interesting.  So disabling interrupts prevents the RCU free from 
>>> happening, and non-atomic pte fetching is a non-issue.  So it doesn't 
>>> address the PAE side of the problem.
>>
>> This would be rcu_sched, correct?
>
> I guess?  Whatever it is that ends up calling all the rcu callbacks after 
> the idle.  A cpu with disabled interrupts can't go through idle, right?  Or 
> is there an explicit way to hold off rcu?

For synchronize_rcu() and call_rcu(), the only guaranteed way to hold
off RCU is rcu_read_lock() and rcu_read_unlock().

For call_rcu_bh(), the only guaranteed way to hold off RCU is
rcu_read_lock_bh() and rcu_read_unlock_bh().

For synchronize_srcu(), the only guaranteed way to hold off RCU is
srcu_read_lock() and srcu_read_unlock().

For synchronize_sched() and call_rcu_sched(), anything that disables
preemption, including disabling irqs, holds off RCU.

Although disabling irqs can indeed hold off RCU in some other cases,
the only guarantee is for synchronize_sched() and call_rcu_sched().
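
To make that pairing concrete, a minimal sketch (the pt_page wrapper is 
hypothetical; the point is that only the _sched flavor is guaranteed to 
wait for a reader that merely has irqs disabled):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/gfp.h>
#include <linux/irqflags.h>

/* Hypothetical wrapper, just to keep the example self-contained. */
struct pt_page {
	struct rcu_head rcu;
	unsigned long *table;
};

static void pt_free_rcu(struct rcu_head *head)
{
	struct pt_page *pt = container_of(head, struct pt_page, rcu);

	free_page((unsigned long)pt->table);
	kfree(pt);
}

/* Freeing side: the sched flavor waits out irq-disabled readers. */
static void pt_retire(struct pt_page *pt)
{
	call_rcu_sched(&pt->rcu, pt_free_rcu);
}

/* Reading side: irqs off is a sched-RCU read-side critical section. */
static unsigned long pt_read_entry(struct pt_page *pt, int idx)
{
	unsigned long flags, val;

	local_irq_save(flags);
	val = pt->table[idx];	/* table cannot be freed until irqs on */
	local_irq_restore(flags);

	return val;
}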

							Thanx, Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* paravirtops kernel and HVM guests
  2009-03-18 23:32                 ` Jeremy Fitzhardinge
@ 2009-04-03  2:41                 ` Nitin A Kamble
  2009-04-03  3:37                   ` Jeremy Fitzhardinge
  -1 siblings, 1 reply; 43+ messages in thread
From: Nitin A Kamble @ 2009-04-03  2:41 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen-devel, Linux Kernel Mailing List

Jeremy,
  I was trying the latest bits from your paravirtops tree with
xen-unstable.  Dom0 is booting fine in a Fedora 10 environment.  I am
seeing a page fault handling error while trying to create an HVM guest.
Is anybody working on HVM support on top of the paravirtops kernel?

privcmd_fault: vma=ffff880068081210 7f7cb7cfe000-7f7cb7dfe000, pgoff=0,
uv=00007f7cb7cfe000, ptep=(null) 0000000000000000
Pid: 4253, comm: qemu-dm Not tainted 2.6.29-rc8-tip #6
Call Trace:
 [<ffffffff814357bf>] privcmd_fault+0x59/0x67
 [<ffffffff810d34ce>] __do_fault+0x50/0x349
 [<ffffffff8102912d>] ? __raw_callee_save_xen_pud_val+0x11/0x1e
 [<ffffffff810d5659>] handle_mm_fault+0x2e2/0x6e9
 [<ffffffff8102b9c2>] ? check_events+0x12/0x20
 [<ffffffff8102b9af>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff816ba9ba>] ? _spin_unlock_irqrestore+0x38/0x3d
 [<ffffffff816bcee9>] do_page_fault+0x2a5/0x2bc
 [<ffffffff816bad45>] page_fault+0x25/0x30

Thanks & Regards,
Nitin


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: paravirtops kernel and HVM guests
  2009-04-03  2:41                 ` paravirtops kernel and HVM guests Nitin A Kamble
@ 2009-04-03  3:37                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 43+ messages in thread
From: Jeremy Fitzhardinge @ 2009-04-03  3:37 UTC (permalink / raw)
  To: Nitin A Kamble; +Cc: Xen-devel, Linux Kernel Mailing List

Nitin A Kamble wrote:
> Jeremy,
>   I was trying the latest bits from your paravirtops tree with
> xen-unstable. Dom0 is booting fine in Fedora 10 environment. I am seeing
> page fault handing error while trying to create an HVM guest. 
> Is anybody working on the HVM support on top of paravirtops kernel?
>   

Yes, that's how it fails at the moment; that debug message is from the 
last time I tried to track down what's going wrong, but I haven't 
managed to make any progress beyond that.   Getting it working is a 
fairly high priority; I'll probably have another look at it next week.

    J

^ permalink raw reply	[flat|nested] 43+ messages in thread


Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-18 19:17 Question about x86/mm/gup.c's use of disabled interrupts Jeremy Fitzhardinge
2009-03-18 21:13 ` Avi Kivity
2009-03-18 21:23   ` Jeremy Fitzhardinge
2009-03-18 21:40     ` Avi Kivity
2009-03-18 22:14       ` Jeremy Fitzhardinge
2009-03-18 22:41         ` Avi Kivity
2009-03-18 22:55           ` Jeremy Fitzhardinge
2009-03-18 23:05             ` Avi Kivity
2009-03-18 23:32               ` Jeremy Fitzhardinge
2009-03-19  9:46                 ` Avi Kivity
2009-03-19 17:16                   ` Jeremy Fitzhardinge
2009-03-19 17:33                     ` Avi Kivity
2009-04-03  2:41                 ` paravirtops kernel and HVM guests Nitin A Kamble
2009-04-03  3:37                   ` Jeremy Fitzhardinge
     [not found]               ` <70513aa50903181617r418ec23s744544dccfd812e8@mail.gmail.com>
2009-03-18 23:37                 ` Question about x86/mm/gup.c's use of disabled interrupts Jeremy Fitzhardinge
2009-03-19  1:32 ` Nick Piggin
2009-03-19 17:31   ` Jeremy Fitzhardinge
2009-03-20  4:40     ` Paul E. McKenney
2009-03-20 15:38       ` Jeremy Fitzhardinge
2009-03-20 15:57         ` Paul E. McKenney
