Subject: tlb_start_vma() / tlb_end_vma() inefficiency (was Re: [PATCH 1/1] [ARM] Always do the full MM flush when unmapping VMA)
From: Aaro Koskinen
Date: 2009-03-03 18:19 UTC
To: linux-kernel

Hello,

Russell King - ARM Linux wrote:
> On Tue, Mar 03, 2009 at 06:23:55PM +0200, Aaro Koskinen wrote:
>> When unmapping N pages (e.g. shared memory), the number of TLB flushes
>> done is (N*PAGE_SIZE/ZAP_BLOCK_SIZE)*N, although it should be at most
>> N. With a PREEMPT kernel ZAP_BLOCK_SIZE is 8 pages, so there is a
>> noticeable performance penalty and the system spends its time in
>> flush_tlb_range().
>>
>> The problem is that tlb_end_vma() always passes the full VMA
>> range. The subrange that needs to be flushed would be available in
>> tlb_finish_mmu(), but the VMA is no longer available there. So always
>> do the full MM flush.
> 
> NAK.  If we're only unmapping a small VMA, this will result in us knocking
> out all TLB entries.  That's far from desirable.
> 
> The better solution is probably to change tlb_end_vma() so that
> it knows how much work to do; that needs a generic kernel change
> and therefore should be discussed on lkml.

Ok, fair enough, moving this to lkml.

So, there is a problem in the way tlb_start_vma() and tlb_end_vma() are 
currently used: unmap_page_range() can be called multiple times when 
unmapping a VMA, and each time it calls tlb_start_vma()/tlb_end_vma() 
with the full range, instead of the subrange it's actually unmapping.
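
Roughly, the call pattern looks like this (a simplified sketch of the
unmap_vmas()/unmap_page_range() flow, with locking and the actual page
table walk omitted -- not the exact code):

/*
 * Simplified sketch: unmap_vmas() hands out the work in
 * ZAP_BLOCK_SIZE chunks, and each chunk is bracketed by the
 * vma hooks, which see only the vma, not the chunk bounds.
 */
while (addr < vma->vm_end) {
        unsigned long block = min(addr + ZAP_BLOCK_SIZE, vma->vm_end);

        tlb_start_vma(tlb, vma);
        /* ... page tables for [addr, block) are zapped here ... */
        tlb_end_vma(tlb, vma);  /* knows vma, but not [addr, block) */

        addr = block;
}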

On ARM, flush_tlb_range() is called from tlb_end_vma(), and so, every 
time it needlessly walks the whole VMA range. If I unmap 2048 pages 
with PREEMPT enabled, that's 256*2048 flushes: the 2048 pages are 
zapped in 256 blocks of 8 pages, and each block flushes the full 
2048-page range. You don't even have to measure it to see an 
application freeze when it's unmapping a large area. (On some 
architectures this problem is not visible at all, since these routines 
can be no-ops.)
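
For reference, ARM's hook boils down to roughly the following
(paraphrased from arch/arm/include/asm/tlb.h, so treat it as a sketch
rather than a verbatim quote):

static inline void
tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
{
        /*
         * Flushes the whole VMA, even though the chunk just unmapped
         * may be only ZAP_BLOCK_SIZE bytes of it.
         */
        if (!tlb->fullmm)
                flush_tlb_range(vma, vma->vm_start, vma->vm_end);
}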

The question is how to fix this. There is currently no good way to 
implement these routines on architectures that do range-specific 
TLB flushes. As Russell suggested above, perhaps it would be 
reasonable to change the tlb_{start,end}_vma() API so that it would also 
pass on the range that is/was actually unmapped by unmap_page_range()?
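
Something like the following, perhaps (purely illustrative -- the extra
start/end parameters and the caller changes are hypothetical, not from
an actual patch):

/* Hypothetical interface: pass the subrange that was just unmapped. */
static inline void
tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma,
            unsigned long start, unsigned long end)
{
        if (!tlb->fullmm)
                flush_tlb_range(vma, start, end);
}

unmap_page_range() would then call tlb_end_vma(tlb, vma, addr, block)
with the bounds of the chunk it just unmapped, instead of just
(tlb, vma), and the flush cost would drop back to at most N pages.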

A.
