* Scalability problem (kmap_lock) with -aa kernels
@ 2002-03-19  4:25 Martin J. Bligh
  2002-03-19  8:58 ` Rik van Riel
  2002-03-20  1:40 ` Andrea Arcangeli
  0 siblings, 2 replies; 15+ messages in thread
From: Martin J. Bligh @ 2002-03-19  4:25 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

OK, I finally got the -aa kernel series running in conjunction with the
NUMA-Q discontigmem stuff. For some reason which I haven't debugged
yet 2.4.19-pre3-aa2 won't boot on the NUMA-Q even without the discontigmem
stuff in ... so I went back to 2.4.19-pre1-aa1, which I knew worked from
last time around (thanks again for that patch).

So just comparing aa+discontigmem to standard 2.4.18+discontigmem, I see
kernel compile times are about 35s vs 26.5s .... hmmm. Looking at the top
part of the profiles, I see this:

standard:

 23991 total                                      0.0257
  7679 default_idle                             147.6731
  3044 _text_lock_dcache                          8.7221
  2340 _text_lock_swap                           43.3333
  1160 do_anonymous_page                          3.4940
   776 d_lookup                                   2.8116
   650 __free_pages_ok                            1.2405
   627 lru_cache_add                              6.8152
   608 do_generic_file_read                       0.5468
   498 __generic_copy_from_user                   4.7885
   480 lru_cache_del                             21.8182
   437 atomic_dec_and_lock                        6.0694
   426 schedule                                   0.3017
   402 _text_lock_dec_and_lock                   16.7500
...   
   109 kmap_high                                  0.3028
    46 _text_lock_highmem                         0.4071

andrea:    
 38549 total                                      0.0405
 13102 _text_lock_highmem                       108.2810
  8627 default_idle                             165.9038
  2578 kunmap_high                               14.3222
  2556 kmap_high                                  6.0857
  1242 do_anonymous_page                          3.2684
  1052 _text_lock_swap                           22.8696
   942 _text_lock_dcache                          2.4987
   683 do_page_fault                              0.4337
   587 pte_alloc                                  1.2332
   535 __generic_copy_from_user                   5.1442
   518 d_lookup                                   1.8768
   443 __free_pages_ok                            0.7745
   422 lru_cache_add                              2.7763

_text_lock_highmem appears to be kmap_lock, looking at disassembly.
Recompiling with the trusty lockmeter, I see this (on -aa).

 33.4% 63.5%  5.4us(7893us)  155us(  16ms)(37.8%)   2551814 36.5% 63.5%    0%  kmap_lock_cacheline
 17.4% 64.9%  5.7us(7893us)  158us(  16ms)(19.7%)   1275907 35.1% 64.9%    0%    kmap_high+0x34
 16.0% 62.1%  5.2us( 982us)  152us(  13ms)(18.1%)   1275907 37.9% 62.1%    0%    kunmap_high+0x40

Ick. On a vaguely comparable mainline kernel we're looking at:

  1.6%  2.7%  0.5us(4208us)   28us(3885us)(0.14%)    716044 97.3%  2.7%    0%  kmap_lock
  1.2%  2.9%  0.9us(4208us)   35us(3885us)(0.09%)    358022 97.1%  2.9%    0%    kmap_high+0x10
 0.33%  2.5%  0.2us(  71us)   21us(2598us)(0.05%)    358022 97.5%  2.5%    0%    kunmap_high+0xc
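
(Aside, for anyone wondering where the _text_lock_* names come from: the
2.4 i386 spin_lock() inline asm emits its busy-wait loop out of line into
a per-object-file .text.lock section, so profile ticks taken while
spinning get attributed to _text_lock_<file> rather than to the caller.
Roughly - a simplified sketch from memory of include/asm-i386/spinlock.h,
not verbatim:

	#define spin_lock_string \
		"\n1:\t" \
		"lock ; decb %0\n\t"	/* try to grab the lock */ \
		"js 2f\n"		/* already held: go spin */ \
		".section .text.lock,\"ax\"\n" \
		"2:\t" \
		"cmpb $0,%0\n\t"	/* wait until it looks free */ \
		"rep;nop\n\t" \
		"jle 2b\n\t" \
		"jmp 1b\n"		/* then retry the locked decb */ \
		".previous"

That's why contention on kmap_lock shows up as _text_lock_highmem.)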

Andrea - is this your new highmem pte stuff doing this?
Or is that not even in your tree as yet? It would be a shame if that's
the problem, as I really want the highmem pte stuff - it allows me to
put processes' pagetables on their own nodes ....

Thanks,

Martin.



* Re: Scalability problem (kmap_lock) with -aa kernels
@ 2002-03-20 16:14 Martin J. Bligh
  2002-03-20 16:39 ` Andrea Arcangeli
  2002-03-20 18:15 ` Hugh Dickins
  0 siblings, 2 replies; 15+ messages in thread
From: Martin J. Bligh @ 2002-03-20 16:14 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

> Yep. What's the profile buffer size? Just make sure to boot with
> profile=2 so you'll have quite accurate precision.

Yup, I have profile=2.
 
> the frequency is higher during a kernel compile due to pte-highmem, but
> if you just change the workload and start 16 tasks simultaneously
> reading from a file in cache, you will get the very same frequency no
> matter whether pte-highmem is there or not. What you found is a
> scalability issue with the persistent kmaps, not one introduced by
> pte-highmem (however, with pte-highmem I have increased its visibility
> due to the additional uses of the persistent kmaps for certain
> pte-intensive workloads and with a larger pool to scan).

I understand that, but if the mechanism doesn't work well, let's not
use it any more than we have to ;-) And I have a crazy plan to fix all
this that I'll send out shortly in another email with a more appropriate
title, but that's a bigger change.
 
> Persistent kmaps aren't meant to scale, 

Indeed. If my thinking is correct, they scale as O(1/N^2) - the pool
size is ultimately fixed as we have a limited virtual address space;
more cpus means we use up the pool N times faster and the cost of
the global tlb_flush_all goes up by a factor of N as well.
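
To spell out where the global flush comes in: every time the pool wraps
around, the kmap code tears down all the expired entries and does a
flush_tlb_all(), which every CPU has to pay for. A simplified sketch of
the 2.4 logic (from memory of mm/highmem.c, not verbatim):

	static void flush_all_zero_pkmaps(void)
	{
		int i;

		flush_cache_all();

		for (i = 0; i < LAST_PKMAP; i++) {
			struct page *page;
			pte_t pte;

			/* count > 1: still kmapped; count == 0: already clean.
			 * Only count == 1 entries are unused but still mapped. */
			if (pkmap_count[i] != 1)
				continue;
			pkmap_count[i] = 0;

			/* tear down the virtual mapping for this slot */
			pte = ptep_get_and_clear(pkmap_page_table + i);
			page = pte_page(pte);
			page->virtual = NULL;
		}
		/* the expensive part: invalidate the stale mappings on ALL cpus */
		flush_tlb_all();
	}

With N CPUs the fixed-size pool is consumed N times faster, and each wrap
costs a cross-CPU flush that scales with N - hence the 1/N^2 above.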

> Correct. If shrinking the pool doesn't make significant difference (for

OK, here are the results of shrinking the kmap pool: compile times (with
lockmeter, so they may look different from before) go up from 40s (with
1024 pool) to 43s (with 128 pool) - presumably this is the extra cost
of the extra global TLB flushes.

The numbers from the profile I gave you yesterday were without lockmeter.
Looking at the profiles from both runs with lockmeter, the high cost of
kmap_high and kunmap_high themselves does indeed seem to be due to the
anomaly I noted earlier - we must be counting some of the spin time, and
the anomaly goes away with lockmeter installed. The profiles show no
measurable kunmap_high, and kmap_high goes from about 238 (with 1024
pool) to 334 (with 128 pool) - the time actually *increases*.

lockstat (1024 pool):

33.4% 63.5%  5.4us(7893us)  155us(  16ms)(37.8%)   2551814 36.5% 63.5%    0%  kmap_lock_cacheline
17.4% 64.9%  5.7us(7893us)  158us(  16ms)(19.7%)   1275907 35.1% 64.9%    0%    kmap_high+0x34

lockstat (128 pool) 

35.5% 67.9%  6.0us(1166us)  171us(  18ms)(43.3%)   2602716 32.1% 67.9%    0%  kmap_lock_cacheline
19.1% 69.6%  6.4us(1166us)  175us(  17ms)(22.7%)   1301358 30.4% 69.6%    0%    kmap_high+0x34

So (as expected from the previous sentence) lock times actually go up
by shrinking the pool.

I don't believe that kmap_high is really O(N) on the size of the pool.
Looking at the code for map_new_virtual, note that we start where we
left off before: last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
So we don't scan the whole array every time - we just walk through it
one step (in most cases, assuming most of the pool is in short-term use).
On a smaller pool, more of the pool is clogged with long-term usage,
so we have more things to "step over" to find an available mapping,
which actually makes it more expensive.
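
To illustrate, the relevant loop looks roughly like this (a simplified
sketch from memory of 2.4's mm/highmem.c, with the sleep-and-retry path
elided; not verbatim):

	static unsigned long map_new_virtual(struct page *page)
	{
		unsigned long vaddr;
		int count = LAST_PKMAP;

		/* resume the scan where the previous caller left off */
		for (;;) {
			last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
			if (!last_pkmap_nr) {
				/* wrapped: recycle expired entries, global TLB flush */
				flush_all_zero_pkmaps();
				count = LAST_PKMAP;
			}
			if (!pkmap_count[last_pkmap_nr])
				break;		/* found a free slot */
			if (--count)
				continue;	/* step over an in-use slot */

			/* pool exhausted: sleep on pkmap_map_wait and retry
			 * (dropped from this sketch) */
		}

		vaddr = PKMAP_ADDR(last_pkmap_nr);
		set_pte(&pkmap_page_table[last_pkmap_nr], mk_pte(page, kmap_prot));
		pkmap_count[last_pkmap_nr] = 1;
		page->virtual = (void *) vaddr;
		return vaddr;
	}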

Thus I'd conclude your original idea to increase the size of the kmap
pool was perfectly correct.

> example it may be acceptable if it would reduce the level of overhead
> to the same one as lru_cache_add in the anonymous memory page fault, which
> you also don't want on a NUMA-Q just for kernel compiles without the
> need to swap anything out) I can very easily drop the persistent kmap
> usage from my tree so you can try it that way too (without adding the
> kernel pagetables in kernel stuff from 2.5 and without dropping the
> quicklist cpu-affine cache like what happened in 2.5).

If you could give me a patch to do that, I'd be happy to try it out.
 
> BTW, before I drop the persistent kmaps from the pagetable handling you
> can also make a quick check by removing __GFP_HIGHMEM from the
> allocation in mm/memory.c:pte_alloc_one() and verifying the kmap_high
> overhead goes away during the kernel compiles (that basically disables
> the pte-highmem feature).
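
In other words, the quick check is just to stop allocating pagetable
pages from highmem. As a purely illustrative sketch (not the actual -aa
code - the real function and its return type differ), the change amounts
to something like:

	/* hypothetical illustration only; the real -aa pte_alloc_one() differs */
	static struct page * pte_alloc_one(struct mm_struct *mm, unsigned long address)
	{
		struct page *pte_page;

		/* pte-highmem:  alloc_pages(GFP_KERNEL | __GFP_HIGHMEM, 0)
		 * the check:    drop __GFP_HIGHMEM so pagetables stay in lowmem
		 *               and never need kmap()/kunmap() */
		pte_page = alloc_pages(GFP_KERNEL, 0);
		if (pte_page)
			clear_highpage(pte_page);	/* zero the new pagetable page */
		return pte_page;
	}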

I added this change on top of the pool shrinkage (i.e. we're still at 128)
resulting in:

3.4%  4.1%  1.4us(1377us)   31us(1462us)(0.19%)    692386 95.9%  4.1%    0%  kmap_lock_cacheline
2.9%  4.4%  2.4us(1377us)   39us(1333us)(0.13%)    346193 95.6%  4.4%    0%    kmap_high+0x34

Much better ;-) And compile times improve as well ... it's hard to say
exactly by how much, due to some other complications that I won't
delve into ....

M.



Thread overview: 15+ messages
2002-03-19  4:25 Scalability problem (kmap_lock) with -aa kernels Martin J. Bligh
2002-03-19  8:58 ` Rik van Riel
2002-03-20  1:40 ` Andrea Arcangeli
2002-03-20  6:15   ` Martin J. Bligh
2002-03-20 12:30     ` Andrea Arcangeli
2002-03-20 16:14 Martin J. Bligh
2002-03-20 16:39 ` Andrea Arcangeli
2002-03-20 17:41   ` Rik van Riel
2002-03-20 18:26     ` Andrea Arcangeli
2002-03-20 19:35       ` Rik van Riel
2002-03-20 18:16   ` Martin J. Bligh
2002-03-20 18:29     ` Martin J. Bligh
2002-03-20 18:40     ` Andrea Arcangeli
2002-03-20 18:15 ` Hugh Dickins
2002-03-20 18:56   ` Andrea Arcangeli
