* [RFC][PATCH] mm: percpu pages: up batch size to fix arithmetic?? error
@ 2013-09-11 22:08 Dave Hansen
  2013-09-11 23:08 ` Cody P Schafer
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2013-09-11 22:08 UTC (permalink / raw)
  To: Cody P Schafer; +Cc: linux-mm, linux-kernel, cl, Dave Hansen


I really don't know where the:

	batch /= 4;             /* We effectively *= 4 below */
	...
	batch = rounddown_pow_of_two(batch + batch/2) - 1;

came from.  The round down code at *MOST* does a *= 1.5, but
*averages* out to be just under 1.

On a system with 128GB in a zone, this means that we've got
(you can see in /proc/zoneinfo for yourself):

              high:  186 (744kB)
              batch: 31  (124kB)

That 124kB is almost precisely 1/4 of the "1/2 of a meg" that we
were shooting for.  We're under-sizing the batches by about 4x.
This patch kills the /=4.
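
To spell out the arithmetic for that 128GB zone (4kB pages),
zone_batchsize() currently ends up doing:

	batch = zone->managed_pages / 1024;		-> ~32768, way over the cap
	batch = (512 * 1024) / PAGE_SIZE;		-> 128
	batch /= 4;					-> 32
	batch = rounddown_pow_of_two(32 + 16) - 1;	-> 31  (the 124kB above)

Without the /=4, the same math gives rounddown_pow_of_two(128 + 64) - 1
= 127 pages, i.e. ~508kB, which is about what the 512kB cap was
presumably shooting for in the first place.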


---

 linux.git-davehans/mm/page_alloc.c |    1 -
 1 file changed, 1 deletion(-)

diff -puN mm/page_alloc.c~debug-pcp-sizes-1 mm/page_alloc.c
--- linux.git/mm/page_alloc.c~debug-pcp-sizes-1	2013-09-11 14:41:08.532445664 -0700
+++ linux.git-davehans/mm/page_alloc.c	2013-09-11 15:03:47.403912683 -0700
@@ -4103,7 +4103,6 @@ static int __meminit zone_batchsize(stru
 	batch = zone->managed_pages / 1024;
 	if (batch * PAGE_SIZE > 512 * 1024)
 		batch = (512 * 1024) / PAGE_SIZE;
-	batch /= 4;		/* We effectively *= 4 below */
 	if (batch < 1)
 		batch = 1;
 
_


* Re: [RFC][PATCH] mm: percpu pages: up batch size to fix arithmetic?? error
  2013-09-11 22:08 [RFC][PATCH] mm: percpu pages: up batch size to fix arithmetic?? error Dave Hansen
@ 2013-09-11 23:08 ` Cody P Schafer
  2013-09-11 23:21   ` Cody P Schafer
  2013-09-11 23:58   ` Dave Hansen
  0 siblings, 2 replies; 7+ messages in thread
From: Cody P Schafer @ 2013-09-11 23:08 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, cl

On 09/11/2013 03:08 PM, Dave Hansen wrote:
> I really don't know where the:
>
> 	batch /= 4;             /* We effectively *= 4 below */
> 	...
> 	batch = rounddown_pow_of_two(batch + batch/2) - 1;
>
> came from.  The round down code at *MOST* does a *= 1.5, but
> *averages* out to be just under 1.
>
> On a system with 128GB in a zone, this means that we've got
> (you can see in /proc/zoneinfo for yourself):
>
>                high:  186 (744kB)
>                batch: 31  (124kB)
>
> That 124kB is almost precisely 1/4 of the "1/2 of a meg" that we
> were shooting for.  We're under-sizing the batches by about 4x.
> This patch kills the /=4.
>
> ---
> diff -puN mm/page_alloc.c~debug-pcp-sizes-1 mm/page_alloc.c
> --- linux.git/mm/page_alloc.c~debug-pcp-sizes-1	2013-09-11 14:41:08.532445664 -0700
> +++ linux.git-davehans/mm/page_alloc.c	2013-09-11 15:03:47.403912683 -0700
> @@ -4103,7 +4103,6 @@ static int __meminit zone_batchsize(stru
>   	batch = zone->managed_pages / 1024;
>   	if (batch * PAGE_SIZE > 512 * 1024)
>   		batch = (512 * 1024) / PAGE_SIZE;
> -	batch /= 4;		/* We effectively *= 4 below */
>   	if (batch < 1)
>   		batch = 1;
>
> _
>

Looking back at the first git commit (way before my time), it appears
that the percpu pagesets initially had both a ->high and a ->low (the
latter has since been removed), set to batch*6 and batch*2 respectively.
I assume the idea was to keep the number of pages in the percpu pagesets
around batch*4, hence the comment.

So we have this variable called "batch", and the code is trying to store 
the _average_ number of pcp pages we want into it (not the batchsize), 
and then we divide our "average" goal by 4 to get a batchsize. All the 
comments refer to the size of the pcp pagesets, not to the pcp pageset 
batchsize.
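
To make that concrete, here is a rough sketch of the old policy in C
(my own struct and function names, not the actual early kernel code):

struct pcp_sketch {
	int count;	/* pages currently on the per-cpu list */
	int low;	/* 2 * batch: refill when we drop below this */
	int high;	/* 6 * batch: drain when we grow past this */
	int batch;	/* pages moved per bulk refill/drain */
};

static void old_pcp_alloc_one(struct pcp_sketch *p)
{
	if (p->count <= p->low)
		p->count += p->batch;	/* bulk refill from the zone */
	p->count--;			/* hand one page to the caller */
}

static void old_pcp_free_one(struct pcp_sketch *p)
{
	p->count++;			/* take one page back */
	if (p->count >= p->high)
		p->count -= p->batch;	/* bulk return to the zone */
}

With low = 2*batch and high = 6*batch, ->count wanders between roughly
2*batch and 6*batch, i.e. somewhere around 4*batch, which is presumably
what the "*= 4" in the comment was referring to.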

Looking further, in current code we don't refill the pcp pagesets unless 
they are completely empty (->low was removed a while ago), and then we 
only add ->batch pages.
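
The refill side, again as a rough sketch (simplified, not the real
mm/page_alloc.c paths), now looks more like:

static void cur_pcp_alloc_one(struct pcp_sketch *p)
{
	if (p->count == 0)
		p->count += p->batch;	/* refill only when completely empty */
	p->count--;
}

(The free side still drains a ->batch worth of pages once ->count goes
past ->high, as far as I can tell.)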

Has anyone looked at what type of average pcp sizing the current code 
results in?



* Re: [RFC][PATCH] mm: percpu pages: up batch size to fix arithmetic?? error
  2013-09-11 23:08 ` Cody P Schafer
@ 2013-09-11 23:21   ` Cody P Schafer
  2013-09-12  0:20     ` Dave Hansen
  2013-09-11 23:58   ` Dave Hansen
  1 sibling, 1 reply; 7+ messages in thread
From: Cody P Schafer @ 2013-09-11 23:21 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, cl

On 09/11/2013 04:08 PM, Cody P Schafer wrote:
> On 09/11/2013 03:08 PM, Dave Hansen wrote:
>> I really don't know where the:
>>
>>     batch /= 4;             /* We effectively *= 4 below */
>>     ...
>>     batch = rounddown_pow_of_two(batch + batch/2) - 1;
>>
>> came from.  The round down code at *MOST* does a *= 1.5, but
>> *averages* out to be just under 1.
>>
>> On a system with 128GB in a zone, this means that we've got
>> (you can see in /proc/zoneinfo for yourself):
>>
>>                high:  186 (744kB)
>>                batch: 31  (124kB)
>>
>> That 124kB is almost precisely 1/4 of the "1/2 of a meg" that we
>> were shooting for.  We're under-sizing the batches by about 4x.
>> This patch kills the /=4.
>>
>> ---
>> diff -puN mm/page_alloc.c~debug-pcp-sizes-1 mm/page_alloc.c
>> --- linux.git/mm/page_alloc.c~debug-pcp-sizes-1    2013-09-11
>> 14:41:08.532445664 -0700
>> +++ linux.git-davehans/mm/page_alloc.c    2013-09-11
>> 15:03:47.403912683 -0700
>> @@ -4103,7 +4103,6 @@ static int __meminit zone_batchsize(stru
>>       batch = zone->managed_pages / 1024;
>>       if (batch * PAGE_SIZE > 512 * 1024)
>>           batch = (512 * 1024) / PAGE_SIZE;
>> -    batch /= 4;        /* We effectively *= 4 below */
>>       if (batch < 1)
>>           batch = 1;
>>
>> _
>>
>
> Looking back at the first git commit (way before my time), it appears
> that the percpu pagesets initially had both a ->high and a ->low (the
> latter has since been removed), set to batch*6 and batch*2 respectively.
> I assume the idea was to keep the number of pages in the percpu pagesets
> around batch*4, hence the comment.
>
> So we have this variable called "batch", and the code is trying to store
> the _average_ number of pcp pages we want into it (not the batchsize),
> and then we divide our "average" goal by 4 to get a batchsize. All the
> comments refer to the size of the pcp pagesets, not to the pcp pageset
> batchsize.
>
> Looking further, in current code we don't refill the pcp pagesets unless
> they are completely empty (->low was removed a while ago), and then we
> only add ->batch pages.
>
> Has anyone looked at what type of average pcp sizing the current code
> results in?

Also, we may want to consider shrinking pcp->high down from 6*pcp->batch 
given that the original "6*" choice was based upon ->batch actually 
being 1/4th of the average pageset size, where now it appears closer to 
being the average.



* Re: [RFC][PATCH] mm: percpu pages: up batch size to fix arithmetic?? error
  2013-09-11 23:08 ` Cody P Schafer
  2013-09-11 23:21   ` Cody P Schafer
@ 2013-09-11 23:58   ` Dave Hansen
  1 sibling, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2013-09-11 23:58 UTC (permalink / raw)
  To: Cody P Schafer; +Cc: linux-mm, linux-kernel, cl

On 09/11/2013 04:08 PM, Cody P Schafer wrote:
> So we have this variable called "batch", and the code is trying to store
> the _average_ number of pcp pages we want into it (not the batchsize),
> and then we divide our "average" goal by 4 to get a batchsize. All the
> comments refer to the size of the pcp pagesets, not to the pcp pageset
> batchsize.

That's a good point, I guess.  I was wondering the same thing.

> Looking further, in current code we don't refill the pcp pagesets unless
> they are completely empty (->low was removed a while ago), and then we
> only add ->batch pages.
> 
> Has anyone looked at what type of average pcp sizing the current code
> results in?

It tends to be within a batch of either ->high (when we are freeing lots
of pages) or ->low (when alloc'ing lots).  I don't see a whole lot of
bouncing around in the middle.  For instance, there aren't a lot of gcc
or make instances during a kernel compile that fit into the ~0.75MB
->high limit.

Just a dumb little thing like this during a kernel compile on my 4-cpu
laptop:

 while true; do cat /proc/zoneinfo | egrep 'count:' | tail -4; done > pcp-counts.1.txt
 cat pcp-counts.1.txt | awk '{print $2}' | sort -n | uniq -c | sort -n

says that at least ~1/2 of the time we have <=10 pages.  That makes
sense since the compile spends all of its runtime (relatively slowly)
doing allocations.  It frees all its memory really quickly when it
exits, so the window to see the times when the pools are full is smaller
than when they are empty.

I'm struggling to think of a case where the small batch sizes make sense
these days.  Maybe if you're running a lot of little programs like ls or
awk?


* Re: [RFC][PATCH] mm: percpu pages: up batch size to fix arithmetic?? error
  2013-09-11 23:21   ` Cody P Schafer
@ 2013-09-12  0:20     ` Dave Hansen
  2013-09-12 14:16       ` Christoph Lameter
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2013-09-12  0:20 UTC (permalink / raw)
  To: Cody P Schafer; +Cc: linux-mm, linux-kernel, cl

BTW, in my little test, the median ->count was 10, and the mean was 45.

On 09/11/2013 04:21 PM, Cody P Schafer wrote:
> Also, we may want to consider shrinking pcp->high down from 6*pcp->batch
> given that the original "6*" choice was based upon ->batch actually
> being 1/4th of the average pageset size, where now it appears closer to
> being the average.

One other thing: we actually had a hot _and_ a cold pageset at that
point, and we now share one pageset for hot and cold pages.  After
looking at it for a bit today, I'm not sure how much the history
matters.  We probably need to take a fresh look at what we want.

Anybody disagree with this?

1. We want ->batch to be large enough that if all the CPUs in a zone
   are doing allocations constantly, there is very little contention on
   zone->lock.
2. If ->high gets too large, we'll end up keeping too much memory in
   the pcp and __alloc_pages_direct_reclaim() will end up calling the
   (expensive) drain_all_pages() too often.
3. We want ->high to approximate the size of the cache which is
   private to a given cpu.  But, that's complicated by the L3 caches
   and hyperthreading today.
4. ->high can be a _bit_ larger than the CPU cache without it being a
   real problem since not _all_ the pages being freed will be fully
   resident in the cache.  Some will be cold, some will only have a few
   of their cachelines resident.
5. A 0.75MB ->high seems a bit low for CPUs with 30MB of L3 cache on
   the socket (although 20 threads share that); rough numbers below.
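
Putting rough numbers on #5: 30MB of L3 shared by 20 threads is ~1.5MB
per thread, so the current 744kB ->high is only about half of one
thread's share of that cache.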

I'll take one of my big systems and run it with some various ->high
settings and see if it makes any difference.


* Re: [RFC][PATCH] mm: percpu pages: up batch size to fix arithmetic?? error
  2013-09-12  0:20     ` Dave Hansen
@ 2013-09-12 14:16       ` Christoph Lameter
  2013-09-12 15:21         ` Dave Hansen
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Lameter @ 2013-09-12 14:16 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Cody P Schafer, linux-mm, linux-kernel

On Wed, 11 Sep 2013, Dave Hansen wrote:

> 3. We want ->high to approximate the size of the cache which is
>    private to a given cpu.  But, that's complicated by the L3 caches
>    and hyperthreading today.

Well, let's keep it well below that.  There are other caches
(slab-related, for example) that are also in constant use.

> I'll take one of my big systems and run it with some various ->high
> settings and see if it makes any difference.

Do you actually see contention issues on the locks? I think we have a
tendency to batch too much in too many caches.






* Re: [RFC][PATCH] mm: percpu pages: up batch size to fix arithmetic?? error
  2013-09-12 14:16       ` Christoph Lameter
@ 2013-09-12 15:21         ` Dave Hansen
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2013-09-12 15:21 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Cody P Schafer, linux-mm, linux-kernel

On 09/12/2013 07:16 AM, Christoph Lameter wrote:
> On Wed, 11 Sep 2013, Dave Hansen wrote:
> 
>> 3. We want ->high to approximate the size of the cache which is
>>    private to a given cpu.  But, that's complicated by the L3 caches
>>    and hyperthreading today.
> 
> Well, let's keep it well below that.  There are other caches
> (slab-related, for example) that are also in constant use.

At the moment, we've got a one-size-fits-all approach.  If you have more
than 512MB of RAM in a zone, you get the high=186 (744kB) / batch=31 (124kB)
behavior.  On my laptop, I've got 3500kB of L2+L3 for 4 logical cpus, or
~875kB/cpu.  According to what you're saying, the high mark is probably
a _bit_ too high.  On a modern server CPU, the caches are about double
that (per cpu).

>> I'll take one of my big systems and run it with some various ->high
>> settings and see if it makes any difference.
> 
> Do you actually see contention issues on the locks? I think we have a
> tendency to batch too much in too many caches.

Nope.  This all came out of me wondering what that /=4 did.  It's pretty
clear that we've diverged a bit from what the original intent of the
code was.  We need to at _least_ fix the comments up.
