linux-kernel.vger.kernel.org archive mirror
* RE: [PATCH 0/3] NUMA boot hash allocation interleaving
@ 2004-12-15 17:25 Luck, Tony
  0 siblings, 0 replies; 31+ messages in thread
From: Luck, Tony @ 2004-12-15 17:25 UTC (permalink / raw)
  To: Martin J. Bligh, Andi Kleen, Brent Casavant
  Cc: linux-kernel, linux-mm, linux-ia64

>> Also at least on IA64 the large page size is usually 1-2GB 
>> and that would seem to be a little too large to me for
>> interleaving purposes. Also it may defeat the purpose you
>> implemented it for - not using too much memory from a single
>> node.
>
>Yes, that'd bork it. But I thought that they had a large sheaf of
>mapping sizes to choose from on ia64?

Yes, ia64 supports lots of pagesizes (the exact list for each cpu
model can be found in /proc/pal/cpu*/vm_info, but the architecture
requires that 4k, 8k, 16k, 64k, 256k, 1m, 4m, 16m, 64m, 256m be
supported by all implementations).  To make good use of them
for vmalloc() would require that we switch the kernel over to
using long format VHPT ... as well as all the architecture
independent changes that Andi listed.

It would be interesting to see some perfmon data on TLB miss rates
before and after this patch, but I'd personally be amazed if you
could find a macro-level benchmark that could reliably detect the
TLB-related performance effects of this change.

-Tony

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-21 16:23                         ` Brent Casavant
@ 2004-12-23  2:19                           ` Jose R. Santos
  0 siblings, 0 replies; 31+ messages in thread
From: Jose R. Santos @ 2004-12-23  2:19 UTC (permalink / raw)
  To: Brent Casavant
  Cc: Anton Blanchard, Jose R. Santos, Andi Kleen, Martin J. Bligh,
	linux-kernel, linux-mm, linux-ia64

Brent Casavant <bcasavan@sgi.com> [041221]:
> I didn't realize this was ppc64 testing.  What was the exact setup
> for the testing?  The patch as posted (and I hope clearly explained)
> only turns on the behavior by default when both CONFIG_NUMA and
> CONFIG_IA64 are active.  It could be activated on non-IA64 by setting
> hashdist=1 on the boot line, or by modifying the patch.

I wasn't aware of that little detail.  I re-tested with hashdist=1 and
this time it showed a slowdown of about 3%-4% on a 4-way Power5 system
(2 NUMA nodes) with 64GB.  I don't see a big problem if this is off
by default on non-IA64 systems, though.

> I would hate to find out that the testing didn't actually enable the
> new behavior.

Serves me right for not reading the entire thread. :)

-JRS

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-21 11:46                       ` Anton Blanchard
@ 2004-12-21 16:23                         ` Brent Casavant
  2004-12-23  2:19                           ` Jose R. Santos
  0 siblings, 1 reply; 31+ messages in thread
From: Brent Casavant @ 2004-12-21 16:23 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Jose R. Santos, Andi Kleen, Martin J. Bligh, linux-kernel,
	linux-mm, linux-ia64

On Tue, 21 Dec 2004, Anton Blanchard wrote:

> > The difference between the two runs was within the noise of the benchmark on
> > my small setup.  I won't be able to get a larger NUMA system until next year,
> > so I'll retest when that happens.  In the meantime, I don't see a reason
> > to stall this patch either, but that may change once I get numbers on a
> > larger system.
> 
> Thanks Jose!
> 
> Brent, looks like we are happy on the ppc64 front.

I didn't realize this was ppc64 testing.  What was the exact setup
for the testing?  The patch as posted (and I hope clearly explained)
only turns on the behavior by default when both CONFIG_NUMA and
CONFIG_IA64 are active.  It could be activated on non-IA64 by setting
hashdist=1 on the boot line, or by modifying the patch.

I would hate to find out that the testing didn't actually enable the
new behavior.

Thanks,
Brent

-- 
Brent Casavant                          If you had nothing to fear,
bcasavan@sgi.com                        how then could you be brave?
Silicon Graphics, Inc.                    -- Queen Dama, Source Wars

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-20 16:56                     ` Jose R. Santos
@ 2004-12-21 11:46                       ` Anton Blanchard
  2004-12-21 16:23                         ` Brent Casavant
  0 siblings, 1 reply; 31+ messages in thread
From: Anton Blanchard @ 2004-12-21 11:46 UTC (permalink / raw)
  To: Jose R. Santos
  Cc: Andi Kleen, Martin J. Bligh, Brent Casavant, linux-kernel,
	linux-mm, linux-ia64

 
> The difference between the two runs was within the noise of the benchmark on
> my small setup.  I won't be able to get a larger NUMA system until next year,
> so I'll retest when that happens.  In the meantime, I don't see a reason
> to stall this patch either, but that may change once I get numbers on a
> larger system.

Thanks Jose!

Brent, looks like we are happy on the ppc64 front.

Anton

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 0/3] NUMA boot hash allocation interleaving
@ 2004-12-20 23:36 Brent Casavant
  0 siblings, 0 replies; 31+ messages in thread
From: Brent Casavant @ 2004-12-20 23:36 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel

Resend: Submitting for inclusion, as this patch series drew no objections
(and even mild approval) from interested parties, and required no changes.
This particular message was slightly edited toward the end, to call out
additional performance testing which was done.

NUMA systems running current Linux kernels suffer from substantial
inequities in the amount of memory allocated from each NUMA node
during boot.  In particular, several large hashes are allocated
using alloc_bootmem, and as such are allocated contiguously from
a single node each.

This becomes a problem for certain workloads that are relatively common
on big-iron HPC NUMA systems.  In particular, a number of MPI and OpenMP
applications which require nearly all available processors in the system
and nearly all the memory on each node run into difficulties.  Due to the
uneven memory distribution onto a few nodes, any thread on those nodes will
require a portion of its memory be allocated from remote nodes.  Any
access to those memory locations will be slower than local accesses,
and thereby slows down the effective computation rate for the affected
CPUs/threads.  This problem is further amplified if the application is
tightly synchronized between threads (as is often the case), as the entire
job can run only at the speed of the slowest thread.

Additionally, since these hashes are usually accessed by all CPUs in the
system, the NUMA network link on the node which hosts the hash experiences
disproportionate traffic levels, thereby reducing the memory bandwidth
available to that node's CPUs, and further penalizing performance of the
threads executed thereupon.

As such, it is desired to find a way to distribute these large hash
allocations more evenly across NUMA nodes.  Fortunately current
kernels do perform allocation interleaving for vmalloc() during boot,
which provides a stepping stone to a solution.

This series of patches enables (but does not require) the kernel to
allocate several boot time hashes using vmalloc rather than alloc_bootmem,
thereby causing the hashes to be interleaved amongst NUMA nodes.  In
particular the dentry cache, inode cache, TCP ehash, and TCP bhash have been
changed to be allocated in this manner.  Due to the limited vmalloc space
on architectures such as i386, this behavior is turned on by default only
for IA64 NUMA systems (though there is no reason other interested
architectures could not enable it if desired).  Non-IA64 and non-NUMA
systems continue to use the existing alloc_bootmem() allocation mechanism.
A boot line parameter "hashdist" can be set to override the default
behavior.
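
To make that concrete, here is a much-simplified sketch of the selection
logic described above -- it is not the actual patch code, and the helper
name is made up; only the "hashdist" parameter and the vmalloc vs.
alloc_bootmem split come from the description:

	/*
	 * Sketch: pick between node-local bootmem and node-interleaved
	 * vmalloc for a boot-time hash table, gated by "hashdist".
	 */
	#include <linux/init.h>
	#include <linux/kernel.h>
	#include <linux/bootmem.h>
	#include <linux/vmalloc.h>

	int hashdist;	/* defaults to 1 only on CONFIG_NUMA && CONFIG_IA64 */

	static int __init set_hashdist(char *str)
	{
		hashdist = simple_strtoul(str, NULL, 0);  /* "hashdist=0"/"hashdist=1" */
		return 1;
	}
	__setup("hashdist=", set_hashdist);

	static void * __init alloc_boot_hash(unsigned long size)
	{
		if (hashdist)
			/*
			 * vmalloc() is interleaved across nodes during boot,
			 * so the table is spread out instead of consuming one
			 * node's memory; limited to one vmalloc area
			 * (256MB on ia64).
			 */
			return vmalloc(size);

		/* traditional behavior: one contiguous, node-local chunk */
		return alloc_bootmem(size);
	}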

The following two sets of example output show the uneven distribution
just after boot, using init=/bin/sh to eliminate as much non-kernel
allocation as possible.

Without the boot hash distribution patches:

 Nid  MemTotal   MemFree   MemUsed      (in kB)
   0   3870656   3697696    172960
   1   3882992   3866656     16336
   2   3883008   3866784     16224
   3   3882992   3866464     16528
   4   3883008   3866592     16416
   5   3883008   3866720     16288
   6   3882992   3342176    540816
   7   3883008   3865440     17568
   8   3882992   3866560     16432
   9   3883008   3866400     16608
  10   3882992   3866592     16400
  11   3883008   3866400     16608
  12   3882992   3866400     16592
  13   3883008   3866432     16576
  14   3883008   3866528     16480
  15   3864768   3848256     16512
 ToT  62097440  61152096    945344

Notice that nodes 0 and 6 have a substantially larger memory utilization
than all other nodes.

With the boot hash distribution patch:

 Nid  MemTotal   MemFree   MemUsed      (in kB)
   0   3870656   3789792     80864
   1   3882992   3843776     39216
   2   3883008   3843808     39200
   3   3882992   3843904     39088
   4   3883008   3827488     55520
   5   3883008   3843712     39296
   6   3882992   3843936     39056
   7   3883008   3844096     38912
   8   3882992   3843712     39280
   9   3883008   3844000     39008
  10   3882992   3843872     39120
  11   3883008   3843872     39136
  12   3882992   3843808     39184
  13   3883008   3843936     39072
  14   3883008   3843712     39296
  15   3864768   3825760     39008
 ToT  62097440  61413184    684256

While not perfectly even, we can see that there is a substantial
improvement in the spread of memory allocated by the kernel during
boot.  The remaining unevenness may be due in part to further boot
time allocations that could be addressed in a similar manner, but
some difference is due to the somewhat special nature of node 0
during boot.  However, the unevenness has fallen to a much more
acceptable level (at least to a level that SGI isn't concerned about).

The astute reader will also notice that in this example, with this patch
approximately 256 MB less memory was allocated during boot.  This is due
to the size limits of a single vmalloc.  More specifically, this is because
the automatically computed size of the TCP ehash exceeds the maximum
size which a single vmalloc can accommodate.  However, this is of little
practical concern as the vmalloc size limit simply reduces one ridiculously
large allocation (512MB) to a slightly less ridiculously large allocation
(256MB).  In practice machines with large memory configurations are using
the thash_entries setting to limit the size of the TCP ehash _much_ lower
than either of the automatically computed values.  Illustrative of the
exceedingly large nature of the automatically computed size, SGI
currently recommends that customers boot with thash_entries=2097152,
which works out to a 32MB allocation.  In any case, setting hashdist=0
will allow for allocations in excess of vmalloc limits, if so desired.
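
For illustration, these knobs simply go on the kernel boot line, e.g.:

    hashdist=0 thash_entries=2097152    (keep the bootmem path, cap the ehash)
    hashdist=1                          (force the interleaved vmalloc path on)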

Other than the vmalloc limit, great care was taken to ensure that the
size of TCP hash allocations was not altered by this patch.  Due to
slightly different computation techniques between the existing TCP code
and alloc_large_system_hash (which is now utilized), some of the magic
constants in the TCP hash allocation code were changed.  On all sizes
of system (128MB through 64GB) that I had access to, the patched code
preserves the previous hash size, as long as the vmalloc limit
(256MB on IA64) is not encountered.
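
As a rough sketch of the sizing behavior being preserved (the kernel's
actual constants and rounding differ; the function below is illustrative
only): the table scales with memory, is rounded to a power of two, and on
the vmalloc path is additionally capped by what a single vmalloc can map.

	/* Illustrative sketch only -- not the kernel's actual arithmetic. */
	static unsigned long hash_table_entries(unsigned long total_mem_bytes,
						unsigned long scale_bytes,
						unsigned long bucket_bytes,
						unsigned long vmalloc_max_bytes)
	{
		/* one entry per scale_bytes of system memory */
		unsigned long entries = total_mem_bytes / scale_bytes;

		/* round down to a power of two so a simple mask can index it */
		while (entries & (entries - 1))
			entries &= entries - 1;

		/* on the vmalloc path, halve until it fits one vmalloc area */
		while (vmalloc_max_bytes &&
		       entries * bucket_bytes > vmalloc_max_bytes)
			entries >>= 1;

		return entries;
	}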

There was concern that changing the TCP-related hashes to use vmalloc
space may adversely impact network performance.  To this end the netperf
set of benchmarks was run.  Some individual tests seemed to benefit
slightly, some seemed to be harmed slightly, but in all cases the average
difference with and without these patches was well within the variability
I would see from run to run.

The following are the overall netperf averages (30 ten-second runs each)
against an older kernel with these same patches. These tests were run
over loopback as GigE results were so inconsistent run to run both with
and without these patches that they provided no meaningful comparison that
I could discern.  I used the same kernel (IA64 generic) for each run,
simply varying the new "hashdist" boot parameter to turn on or off the new
allocation behavior.  In all cases the thash_entries value was manually
specified as discussed previously to eliminate any variability that
might result from that size difference.

HP ZX1, hashdist=0
==================
TCP_RR = 19389
TCP_MAERTS = 6561 
TCP_STREAM = 6590 
TCP_CC = 9483 
TCP_CRR = 8633 

HP ZX1, hashdist=1
==================
TCP_RR = 19411
TCP_MAERTS = 6559 
TCP_STREAM = 6584 
TCP_CC = 9454 
TCP_CRR = 8626 

SGI Altix, hashdist=0
=====================
TCP_RR = 16871
TCP_MAERTS = 3925 
TCP_STREAM = 4055 
TCP_CC = 8438 
TCP_CRR = 7750 

SGI Altix, hashdist=1
=====================
TCP_RR = 17040
TCP_MAERTS = 3913 
TCP_STREAM = 4044 
TCP_CC = 8367 
TCP_CRR = 7538 

I believe the TCP_CC and TCP_CRR are the tests most sensitive to this
particular change.  But again, I want to emphasize that even the
differences you see above are _well_ within the variability I saw
from run to run of any given test.

In addition, Jose Santos at IBM has run specSFS, which has been
particularly sensitive to TLB issues, against these patches and
saw no performance degradation (differences down in the noise).

-- 
Brent Casavant                          If you had nothing to fear,
bcasavan@sgi.com                        how then could you be brave?
Silicon Graphics, Inc.                    -- Queen Dama, Source Wars

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-16 14:18                   ` Jose R. Santos
@ 2004-12-20 16:56                     ` Jose R. Santos
  2004-12-21 11:46                       ` Anton Blanchard
  0 siblings, 1 reply; 31+ messages in thread
From: Jose R. Santos @ 2004-12-20 16:56 UTC (permalink / raw)
  To: Jose R. Santos
  Cc: Anton Blanchard, Andi Kleen, Martin J. Bligh, Brent Casavant,
	linux-kernel, linux-mm, linux-ia64

Jose R. Santos <jrsantos@austin.ibm.com> [041216]:
> I can do the SpecSFS runs but each run takes several hours to complete
> and I would need to do two runs (baseline and patched).  I may have it
> ready by today or tomorrow.

The difference between the two runs was within the noise of the benchmark on
my small setup.  I won't be able to get a larger NUMA system until next year,
so I'll retest when that happens.  In the meantime, I don't see a reason
to stall this patch either, but that may change once I get numbers on a
larger system.

-JRS

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
       [not found]                 ` <20041216051323.GI24000@krispykreme.ozlabs.ibm.com>
@ 2004-12-16 14:18                   ` Jose R. Santos
  2004-12-20 16:56                     ` Jose R. Santos
  0 siblings, 1 reply; 31+ messages in thread
From: Jose R. Santos @ 2004-12-16 14:18 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Andi Kleen, Martin J. Bligh, Brent Casavant, linux-kernel,
	linux-mm, linux-ia64, jrsantos

Anton Blanchard <anton@samba.org> [041215]:
>  
> > I asked Brent to run some benchmarks originally and I believe he has 
> > already run all that he could easily set up. If you want more testing
> > you'll need to test yourself I think. 
> 
> We will be testing it.

By "We" you mean "Me" right? :)

I can do the SpecSFS runs but each run takes several hours to complete
and I would need to do two runs (baseline and patched).  I may have it
ready by today or tomorrow.

-JRS

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-15 14:47             ` Anton Blanchard
  2004-12-15 23:37               ` Brent Casavant
@ 2004-12-16  5:02               ` Andi Kleen
       [not found]                 ` <20041216051323.GI24000@krispykreme.ozlabs.ibm.com>
  1 sibling, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2004-12-16  5:02 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Andi Kleen, Martin J. Bligh, Brent Casavant, linux-kernel,
	linux-mm, linux-ia64, jrsantos

> specSFS (an NFS server benchmark) has been very sensitive to TLB issues
> for us, it uses all the memory as pagecache and you end up with 10
> million+ dentries. Something similar that pounds on the dcache would be
> interesting.

I asked Brent to run some benchmarks originally and I believe he has 
already run all that he could easily set up. If you want more testing
you'll need to test yourself I think. 

At least I don't think this patch should be further stalled unless
someone actually comes up with proof that it degrades
performance.

-Andi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-15 14:47             ` Anton Blanchard
@ 2004-12-15 23:37               ` Brent Casavant
  2004-12-16  5:02               ` Andi Kleen
  1 sibling, 0 replies; 31+ messages in thread
From: Brent Casavant @ 2004-12-15 23:37 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Andi Kleen, Martin J. Bligh, linux-kernel, linux-mm, linux-ia64,
	jrsantos

On Thu, 16 Dec 2004, Anton Blanchard wrote:

> I'd like to see a benchmark that has a large footprint in the hash. A
> few-connection netperf run isn't going to stress the hash, is it?

Not as well as I'd like, I'll admit.  I really couldn't find any
standard benchmark that would push the TCP hashes hard.

> Also, what page size were the runs done with? On x86-64 and ppc64 the 4kB page
> size may make a difference to Brent's runs.

16K pages on IA64.  As the patch currently stands x86-64 and ppc64
would not be a concern, as we still use the old behavior by default
for those architectures.  Only IA64 NUMA kernel configurations will
have this on by default.  Additionally, this only affects NUMA machines,
and I'm not aware of any x86-64 machines of that nature (please
educate me if I'm mistaken).

> specSFS (an NFS server benchmark) has been very sensitive to TLB issues
> for us, it uses all the memory as pagecache and you end up with 10
> million+ dentries. Something similar that pounds on the dcache would be
> interesting.

I'll look into running that, but have my doubts as to whether I
can scare up appropriate quantities/types of hardware.

Brent

-- 
Brent Casavant                          If you had nothing to fear,
bcasavan@sgi.com                        how then could you be brave?
Silicon Graphics, Inc.                    -- Queen Dama, Source Wars

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-15  7:17             ` Andi Kleen
  2004-12-15 15:08               ` Martin J. Bligh
@ 2004-12-15 18:24               ` Brent Casavant
  1 sibling, 0 replies; 31+ messages in thread
From: Brent Casavant @ 2004-12-15 18:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Martin J. Bligh, linux-kernel, linux-mm, linux-ia64

On Wed, 15 Dec 2004, Andi Kleen wrote:

> On Tue, Dec 14, 2004 at 11:14:46PM -0800, Martin J. Bligh wrote:
> > Well hold on a sec. We don't need to use the hugepages pool for this,
> > do we? This is the same as using huge page mappings for the whole of
> > kernel space on ia32. As long as it's a kernel mapping, and 16MB aligned
> > and contig, we get it for free, surely?
> 
> The whole point of the patch is to not use the direct mapping, but
> use a different interleaved mapping on NUMA machines to spread
> the memory out over multiple nodes.

There is a middle ground, in theory.  At least on a NUMA machine you
can divide up the allocation roughly as requested_size/number_nodes.
Round the result up to the next available page size, and allocate
interleaved on the nodes until you've satisfied the requested size.
This minimizes the number of TLB entries required to interleave the
allocation.

However, as noted, the kernel barely handles two page sizes, much
less multiple page sizes.  If more flexible page-size handling
comes along someday this and many other sections of code could
stand to benefit from some rewriting.
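
For what it's worth, that scheme would look roughly like the sketch below;
round_up_to_supported_pagesize(), reserve_virtual_range() and
map_node_chunk() are hypothetical helpers -- exactly the infrastructure
noted above as missing.

	/* Rough sketch of the "middle ground": split the request across
	 * nodes, round each share up to a supported page size, and map
	 * the chunks round-robin. */
	void *interleave_large_alloc(size_t size, int nr_nodes)
	{
		size_t per_node = (size + nr_nodes - 1) / nr_nodes;
		size_t chunk = round_up_to_supported_pagesize(per_node);   /* hypothetical */
		void *base = reserve_virtual_range((size_t)nr_nodes * chunk); /* hypothetical */
		size_t mapped = 0;
		int node = 0;

		if (!base)
			return NULL;

		while (mapped < size) {
			/* back one chunk of the range with node-local memory,
			 * using the largest page size covering the whole chunk */
			if (map_node_chunk(base + mapped, chunk, node))    /* hypothetical */
				return NULL;	/* error handling elided */
			mapped += chunk;
			node = (node + 1) % nr_nodes;
		}
		return base;
	}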

> > > Using other page sizes would probably be tricky because the
> > > Linux VM can currently barely deal with two page sizes.
> > > I suspect handling more would need some VM infrastructure effort,
> > > at least in the affected port.
> > 
> > For the general case I'd agree. But this is a setup-time only tweak
> > of the static kernel mapping, isn't it?
> 
> It's probably not impossible, just lots of ugly special cases.
> e.g. how about supporting it for /proc/kcore etc? 

Just to bring a bit of closure regarding the patches I posted yesterday,
I'm reading the overall discussion as "The patches look good enough for
current kernels, and this would benefit from multiple page size support,
if we ever get it."  Fair read?

Brent

-- 
Brent Casavant                          If you had nothing to fear,
bcasavan@sgi.com                        how then could you be brave?
Silicon Graphics, Inc.                    -- Queen Dama, Source Wars

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-15  7:17             ` Andi Kleen
@ 2004-12-15 15:08               ` Martin J. Bligh
  2004-12-15 18:24               ` Brent Casavant
  1 sibling, 0 replies; 31+ messages in thread
From: Martin J. Bligh @ 2004-12-15 15:08 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Brent Casavant, linux-kernel, linux-mm, linux-ia64

--Andi Kleen <ak@suse.de> wrote (on Wednesday, December 15, 2004 08:17:34 +0100):

> On Tue, Dec 14, 2004 at 11:14:46PM -0800, Martin J. Bligh wrote:
>> Well hold on a sec. We don't need to use the hugepages pool for this,
>> do we? This is the same as using huge page mappings for the whole of
>> kernel space on ia32. As long as it's a kernel mapping, and 16MB aligned
>> and contig, we get it for free, surely?
> 
> The whole point of the patch is to not use the direct mapping, but
> use a different interleaved mapping on NUMA machines to spread
> the memory out over multiple nodes.

Right, I know it doesn't pre-exist - I was thinking of frigging it
by hand though, rather than using the hugepage pool infrastructure.

>> > Using other page sizes would probably be tricky because the
>> > Linux VM can currently barely deal with two page sizes.
>> > I suspect handling more would need some VM infrastructure effort,
>> > at least in the affected port.
>> 
>> For the general case I'd agree. But this is a setup-time only tweak
>> of the static kernel mapping, isn't it?
> 
> It's probably not impossible, just lots of ugly special cases.
> e.g. how about supporting it for /proc/kcore etc? 

Hmmm. Yes, I hadn't considered those.

M.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-15  4:58           ` Andi Kleen
@ 2004-12-15 14:47             ` Anton Blanchard
  2004-12-15 23:37               ` Brent Casavant
  2004-12-16  5:02               ` Andi Kleen
  0 siblings, 2 replies; 31+ messages in thread
From: Anton Blanchard @ 2004-12-15 14:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Martin J. Bligh, Brent Casavant, linux-kernel, linux-mm,
	linux-ia64, jrsantos

 
> Given that Brent did lots of benchmarks which didn't show any slowdowns
> I don't think this is really needed (at least as long as nobody
> demonstrates a real slowdown from the patch). And having such special
> cases is always ugly, better not have them when not needed.

I'd like to see a benchmark that has a large footprint in the hash. A
few-connection netperf run isn't going to stress the hash, is it?

Also, what page size were the runs done with? On x86-64 and ppc64 the 4kB page
size may make a difference to Brent's runs.

specSFS (an NFS server benchmark) has been very sensitive to TLB issues
for us, it uses all the memory as pagecache and you end up with 10
million+ dentries. Something similar that pounds on the dcache would be
interesting.

Anton

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-15  7:46             ` Andi Kleen
@ 2004-12-15  9:14               ` Andi Kleen
  0 siblings, 0 replies; 31+ messages in thread
From: Andi Kleen @ 2004-12-15  9:14 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric Dumazet, Brent Casavant, Martin J. Bligh, linux-kernel,
	linux-mm, linux-ia64

> > 2) What is the exact number of data TLB entries (for small pages and
> > huge ones) on Opterons?
> 
> Check the data sheets, but IIRC 64 large DTLBs and 1024+ 4K DTLBs.
> That is the L2 TLB; there is also an L1, but it is likely inclusive (?)

After checking the data sheets, it is actually 32 2MB DTLBs and 512 4K DTLBs
(L2), and the same for the ITLB.

-Andi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-15  7:41           ` Eric Dumazet
@ 2004-12-15  7:46             ` Andi Kleen
  2004-12-15  9:14               ` Andi Kleen
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2004-12-15  7:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, Brent Casavant, Martin J. Bligh, linux-kernel,
	linux-mm, linux-ia64

On Wed, Dec 15, 2004 at 08:41:25AM +0100, Eric Dumazet wrote:
> 
> My questions are :
> 
> 1) Do the route cache and tcp hashes use big pages (2MB) on 2.6.5/2.6.9
> x86_64 kernels?

Yes.

On i386 kernels you can use mem=nopentium to force 4K pages for
the direct mapping, but that was dropped on x86-64. 

> 2) What is the exact number of data TLB entries (for small pages and
> huge ones) on Opterons?

Check the data sheets, but IIRC 64 large DTLBs and 1024+ 4K DTLBs.
That is the L2 TLB; there is also an L1, but it is likely inclusive (?)

> 3) All network interrupts are handled by CPU0. Should we really use
> NUMA-interleaved memory for the hashes in this case?

First, it depends on whether you run irqbalanced or not and how many
interrupt sources you have.

Even when they are only handled on CPU0, it can still be a good idea
to interleave, to use the bandwidth of all memory controllers
in the system evenly.

-Andi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-15  4:08         ` Andi Kleen
  2004-12-15  7:14           ` Martin J. Bligh
@ 2004-12-15  7:41           ` Eric Dumazet
  2004-12-15  7:46             ` Andi Kleen
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2004-12-15  7:41 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Brent Casavant, Martin J. Bligh, linux-kernel, linux-mm, linux-ia64

Andi Kleen wrote:

> 
> I actually considered implementing it for x86-64 some time ago
> for the modules, but then I never bothered. On AMD systems
> I actually prefer to use small pages here. The reason is that
> Opteron has separate large-page and small-page TLBs and the small-page
> TLB is much bigger. When someone else uses huge TLB
> pages too (user space or kernel direct mapping) then it's actually
> a good idea to use small pages.

Interesting...

I actually use dual Opteron systems, with very large route cache hashes
and tcp hashes (rhash_entries=524288 thash_entries=524288), and
Hugetlb-aware user space programs.

x86info tells me (maybe wrongly)

Family: 15 Model: 5 Stepping: 8
CPU Model : Opteron
Instruction TLB: Fully associative. 32 entries.
Data TLB: Fully associative. 32 entries.

and /proc/cpuinfo tells me :
model name      : AMD Opteron(tm) Processor 248
TLB size        : 1088 4K pages


My questions are:

1) Do the route cache and tcp hashes use big pages (2MB) on 2.6.5/2.6.9
x86_64 kernels?
2) What is the exact number of data TLB entries (for small pages and
huge ones) on Opterons?
3) All network interrupts are handled by CPU0. Should we really use
NUMA-interleaved memory for the hashes in this case?

Thank you
Eric Dumazet

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-15  7:14           ` Martin J. Bligh
@ 2004-12-15  7:17             ` Andi Kleen
  2004-12-15 15:08               ` Martin J. Bligh
  2004-12-15 18:24               ` Brent Casavant
  0 siblings, 2 replies; 31+ messages in thread
From: Andi Kleen @ 2004-12-15  7:17 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andi Kleen, Brent Casavant, linux-kernel, linux-mm, linux-ia64

On Tue, Dec 14, 2004 at 11:14:46PM -0800, Martin J. Bligh wrote:
> Well hold on a sec. We don't need to use the hugepages pool for this,
> do we? This is the same as using huge page mappings for the whole of
> kernel space on ia32. As long as it's a kernel mapping, and 16MB aligned
> and contig, we get it for free, surely?

The whole point of the patch is to not use the direct mapping, but
use a different interleaved mapping on NUMA machines to spread
the memory out over multiple nodes.

> > Using other page sizes would probably be tricky because the
> > Linux VM can currently barely deal with two page sizes.
> > I suspect handling more would need some VM infrastructure effort,
> > at least in the affected port.
> 
> For the general case I'd agree. But this is a setup-time only tweak
> of the static kernel mapping, isn't it?

It's probably not impossible, just lots of ugly special cases.
e.g. how about supporting it for /proc/kcore etc? 

-Andi


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-15  4:08         ` Andi Kleen
@ 2004-12-15  7:14           ` Martin J. Bligh
  2004-12-15  7:17             ` Andi Kleen
  2004-12-15  7:41           ` Eric Dumazet
  1 sibling, 1 reply; 31+ messages in thread
From: Martin J. Bligh @ 2004-12-15  7:14 UTC (permalink / raw)
  To: Andi Kleen, Brent Casavant; +Cc: linux-kernel, linux-mm, linux-ia64

>> > > I originally was a bit worried about the TLB usage, but it doesn't
>> > > seem to be a too big issue (hopefully the benchmarks weren't too
>> > > micro though)
>> > 
>> > Well, as long as we stripe on large page boundaries, it should be fine,
>> > I'd think. On PPC64, it'll screw the SLB, but ... tough ;-) We can either
>> > turn it off, or only do it on things larger than the segment size, and
>> > just round-robin the rest, or allocate from node with most free.
>> 
>> Is there a reasonably easy-to-use existing infrastructure to do this?
> 
> No. It will be a lot of work actually, requiring new code for 
> each architecture and may even be impossible on some. 
> The current hugetlb code is not really suitable for this
> because it requires a preallocated pool and only works
> for user space.

Well hold on a sec. We don't need to use the hugepages pool for this,
do we? This is the same as using huge page mappings for the whole of
kernel space on ia32. As long as it's a kernel mapping, and 16MB aligned
and contig, we get it for free, surely?

> Also at least on IA64 the large page size is usually 1-2GB 
> and that would seem to be a little too large to me for
> interleaving purposes. Also it may defeat the purpose you
> implemented it for - not using too much memory from a single
> node.

Yes, that'd bork it. But I thought that they had a large sheaf of
mapping sizes to choose from on ia64?
 
> Using other page sizes would probably be tricky because the
> Linux VM can currently barely deal with two page sizes.
> I suspect handling more would need some VM infrastructure effort,
> at least in the affected port.

For the general case I'd agree. But this is a setup-time only tweak
of the static kernel mapping, isn't it?

I'm not saying it needs doing now. But it's an interesting future
enhancement.

M.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-14 22:00         ` Martin J. Bligh
@ 2004-12-15  4:58           ` Andi Kleen
  2004-12-15 14:47             ` Anton Blanchard
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2004-12-15  4:58 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Brent Casavant, Andi Kleen, linux-kernel, linux-mm, linux-ia64

> > And just to clarify, are you saying you want to see this before inclusion
> > in mainline kernels, or that it would be nice to have but not necessary?
> 
> I'd say it's a nice to have, rather than necessary, as long as it's not
> forced upon people. Maybe a config option that's on by default on ia64
> or something. Causing yourself TLB problems is much more acceptable than
> causing it for others ;-)

Given that Brent did lots of benchmarks which didn't show any slowdowns
I don't think this is really needed (at least as long as nobody
demonstrates a real slowdown from the patch). And having such special
cases is always ugly, better not have them when not needed.

-Andi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-14 23:24       ` Brent Casavant
  2004-12-14 22:00         ` Martin J. Bligh
@ 2004-12-15  4:08         ` Andi Kleen
  2004-12-15  7:14           ` Martin J. Bligh
  2004-12-15  7:41           ` Eric Dumazet
  1 sibling, 2 replies; 31+ messages in thread
From: Andi Kleen @ 2004-12-15  4:08 UTC (permalink / raw)
  To: Brent Casavant
  Cc: Martin J. Bligh, Andi Kleen, linux-kernel, linux-mm, linux-ia64

On Tue, Dec 14, 2004 at 05:24:02PM -0600, Brent Casavant wrote:
> On Tue, 14 Dec 2004, Martin J. Bligh wrote:
> 
> > --On Tuesday, December 14, 2004 20:13:48 +0100 Andi Kleen <ak@suse.de> wrote:
> > 
> > > I originally was a bit worried about the TLB usage, but it doesn't
> > > seem to be a too big issue (hopefully the benchmarks weren't too
> > > micro though)
> > 
> > Well, as long as we stripe on large page boundaries, it should be fine,
> > I'd think. On PPC64, it'll screw the SLB, but ... tough ;-) We can either
> > turn it off, or only do it on things larger than the segment size, and
> > just round-robin the rest, or allocate from node with most free.
> 
> Is there a reasonably easy-to-use existing infrastructure to do this?

No. It will be a lot of work actually, requiring new code for 
each architecture and may even be impossible on some. 
The current hugetlb code is not really suitable for this
because it requires a preallocated pool and only works
for user space.

I actually considered implementing it for x86-64 some time ago
for the modules, but then I never bothered. On AMD systems
I actually prefer to use small pages here. The reason is that
Opteron has a separated large and small pages TLB and the small
pages TLB is much bigger. When someone else uses huge TLB 
pages too (user space or kernel direct mapping) then it's actually
a good idea to use small pages.

Also it may be difficult in some cases to allocate
such large pages even at boot, and impossible to do it
later when a module loads.

Also at least on IA64 the large page size is usually 1-2GB 
and that would seem to be a little too large to me for
interleaving purposes. Also it may defeat the purpose you
implemented it for - not using too much memory from a single
node.

Using other page sizes would probably be tricky because the
Linux VM can currently barely deal with two page sizes.
I suspect handling more would need some VM infrastructure effort,
at least in the affected port.

> I didn't find anything in my examination of vmalloc itself, so I gave
> up on the idea.
> 
> And just to clarify, are you saying you want to see this before inclusion
> in mainline kernels, or that it would be nice to have but not necessary?

I wouldn't do anything in this area unless somebody shows a benchmark /
profiling results where TLB pressure makes a clear difference. And even
then it may not be worth the effort.

-Andi


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-14 18:32 Luck, Tony
@ 2004-12-15  0:28 ` Hiroyuki KAMEZAWA
  0 siblings, 0 replies; 31+ messages in thread
From: Hiroyuki KAMEZAWA @ 2004-12-15  0:28 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Brent Casavant, linux-kernel, linux-mm, linux-ia64, ak, Yasunori Goto

Luck, Tony wrote:
>>this behavior is turned on by default only for IA64 NUMA systems
> 
> 
>>A boot line parameter "hashdist" can be set to override the default
>>behavior.
> 
> 
> 
> Note to node hot-plug developers ... if this patch goes in you
> will also want to disable this behaviour, otherwise all nodes
> become non-removable (unless you can transparently relocate the
> physical memory backing all these tables).
(adding CC to LHMS)

I think GFP_HOTREMOVABLE, which Goto is proposing, will work well
when we want MEMORY_HOTPLUG.


Thanks.
--Kame <kamezawa.hiroyu@jp.fujitsu.com>

 >
 > -Tony



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-14 19:13   ` Andi Kleen
  2004-12-14 19:48     ` Brent Casavant
  2004-12-14 20:08     ` Martin J. Bligh
@ 2004-12-14 23:24     ` Nick Piggin
  2 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2004-12-14 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Martin J. Bligh, Brent Casavant, linux-kernel, linux-mm, linux-ia64

On Tue, 2004-12-14 at 20:13 +0100, Andi Kleen wrote:
> On Tue, Dec 14, 2004 at 10:59:50AM -0800, Martin J. Bligh wrote:
> > > NUMA systems running current Linux kernels suffer from substantial
> > > inequities in the amount of memory allocated from each NUMA node
> > > during boot.  In particular, several large hashes are allocated
> > > using alloc_bootmem, and as such are allocated contiguously from
> > > a single node each.
> > 
> > Yup, makes a lot of sense to me to stripe these, for the caches that
> 
> I originally was a bit worried about the TLB usage, but it doesn't
> seem to be a too big issue (hopefully the benchmarks weren't too
> micro though)
> 

I wonder if you could have an indirection table for the hash, which
may allow you to allocate the hash memory from discontinuous, per-node
chunks? Wouldn't the extra pointer chase be a similar cost to
incurring TLB misses when using the vmalloc scheme?

That _may_ help with relocating hashes for hotplug as well (although
I expect the hard part may be synchronising access).

Probably too ugly. Just an idea though.
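
To illustrate the idea (nothing like this was implemented; the names and
chunk size below are made up): the bucket array would be split into
node-local chunks, and each lookup would take one extra pointer chase
through a small top-level table, e.g.:

	#include <linux/list.h>
	#include <linux/numa.h>

	#define CHUNK_SHIFT	16			/* buckets per chunk (example) */
	#define CHUNK_MASK	((1UL << CHUNK_SHIFT) - 1)

	struct hash_chunks {
		/* one node-local chunk of buckets per node */
		struct hlist_head *chunk[MAX_NUMNODES];
	};

	static inline struct hlist_head *hash_bucket(struct hash_chunks *h,
						     unsigned long hashval)
	{
		/* assumes hashval was already masked to nr_chunks << CHUNK_SHIFT;
		 * high bits pick the chunk, low bits the bucket within it */
		return &h->chunk[hashval >> CHUNK_SHIFT][hashval & CHUNK_MASK];
	}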



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-14 20:08     ` Martin J. Bligh
@ 2004-12-14 23:24       ` Brent Casavant
  2004-12-14 22:00         ` Martin J. Bligh
  2004-12-15  4:08         ` Andi Kleen
  0 siblings, 2 replies; 31+ messages in thread
From: Brent Casavant @ 2004-12-14 23:24 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andi Kleen, linux-kernel, linux-mm, linux-ia64

On Tue, 14 Dec 2004, Martin J. Bligh wrote:

> --On Tuesday, December 14, 2004 20:13:48 +0100 Andi Kleen <ak@suse.de> wrote:
> 
> > I originally was a bit worried about the TLB usage, but it doesn't
> > seem to be a too big issue (hopefully the benchmarks weren't too
> > micro though)
> 
> Well, as long as we stripe on large page boundaries, it should be fine,
> I'd think. On PPC64, it'll screw the SLB, but ... tough ;-) We can either
> turn it off, or only do it on things larger than the segment size, and
> just round-robin the rest, or allocate from node with most free.

Is there a reasonably easy-to-use existing infrastructure to do this?
I didn't find anything in my examination of vmalloc itself, so I gave
up on the idea.

And just to clarify, are you saying you want to see this before inclusion
in mainline kernels, or that it would be nice to have but not necessary?

Thanks,
Brent

-- 
Brent Casavant                          If you had nothing to fear,
bcasavan@sgi.com                        how then could you be brave?
Silicon Graphics, Inc.                    -- Queen Dama, Source Wars

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-14 23:24       ` Brent Casavant
@ 2004-12-14 22:00         ` Martin J. Bligh
  2004-12-15  4:58           ` Andi Kleen
  2004-12-15  4:08         ` Andi Kleen
  1 sibling, 1 reply; 31+ messages in thread
From: Martin J. Bligh @ 2004-12-14 22:00 UTC (permalink / raw)
  To: Brent Casavant; +Cc: Andi Kleen, linux-kernel, linux-mm, linux-ia64

>> > I originally was a bit worried about the TLB usage, but it doesn't
>> > seem to be a too big issue (hopefully the benchmarks weren't too
>> > micro though)
>> 
>> Well, as long as we stripe on large page boundaries, it should be fine,
>> I'd think. On PPC64, it'll screw the SLB, but ... tough ;-) We can either
>> turn it off, or only do it on things larger than the segment size, and
>> just round-robin the rest, or allocate from node with most free.
> 
> Is there a reasonably easy-to-use existing infrastructure to do this?
> I didn't find anything in my examination of vmalloc itself, so I gave
> up on the idea.

Not that I know of. But (without looking at it) it wouldn't seem
desperately hard to implement - some argument or flag to vmalloc, or a
vmalloc_largepage(), or something.

> And just to clarify, are you saying you want to see this before inclusion
> in mainline kernels, or that it would be nice to have but not necessary?

I'd say it's a nice to have, rather than necessary, as long as it's not
forced upon people. Maybe a config option that's on by default on ia64
or something. Causing yourself TLB problems is much more acceptable than
causing it for others ;-)

M.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-14 19:30   ` Brent Casavant
@ 2004-12-14 20:10     ` Martin J. Bligh
  0 siblings, 0 replies; 31+ messages in thread
From: Martin J. Bligh @ 2004-12-14 20:10 UTC (permalink / raw)
  To: Brent Casavant; +Cc: linux-kernel, linux-mm, linux-ia64, ak

>> Yup, makes a lot of sense to me to stripe these, for the caches that
>> are global (ie inodes, dentries, etc).  Only question I'd have is 
>> didn't Manfred or someone (Andi?) do this before? Or did that never
>> get accepted? I know we talked about it a while back.
> 
> Are you thinking of the 2004-06-05 patch from Andi about using
> the NUMA policy API for boot time allocation?
> 
> If so, that patch was accepted, but affects neither allocations
> performed via alloc_bootmem nor __get_free_pages, which are
> currently used to allocate these hashes.  vmalloc, however, does
> behave as desired with Andi's patch.

Nope, was for the hashes, but I think maybe it was all vapourware.
 
> Which is why vmalloc was chosen to solve this problem.  There were
> other more complicated possible solutions (e.g. multi-level hash tables,
> with the bottommost/largest level being allocated across all nodes),
> however those would have been so intrusive as to be unpalatable.
> So the vmalloc solution seemed reasonable, as long as it is used
> only on architectures with plentiful vmalloc space.

Yup, seems like a reasonable approach.

M.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-14 19:13   ` Andi Kleen
  2004-12-14 19:48     ` Brent Casavant
@ 2004-12-14 20:08     ` Martin J. Bligh
  2004-12-14 23:24       ` Brent Casavant
  2004-12-14 23:24     ` Nick Piggin
  2 siblings, 1 reply; 31+ messages in thread
From: Martin J. Bligh @ 2004-12-14 20:08 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Brent Casavant, linux-kernel, linux-mm, linux-ia64

--On Tuesday, December 14, 2004 20:13:48 +0100 Andi Kleen <ak@suse.de> wrote:

> On Tue, Dec 14, 2004 at 10:59:50AM -0800, Martin J. Bligh wrote:
>> > NUMA systems running current Linux kernels suffer from substantial
>> > inequities in the amount of memory allocated from each NUMA node
>> > during boot.  In particular, several large hashes are allocated
>> > using alloc_bootmem, and as such are allocated contiguously from
>> > a single node each.
>> 
>> Yup, makes a lot of sense to me to stripe these, for the caches that
> 
> I originally was a bit worried about the TLB usage, but it doesn't
> seem to be a too big issue (hopefully the benchmarks weren't too
> micro though)

Well, as long as we stripe on large page boundaries, it should be fine,
I'd think. On PPC64, it'll screw the SLB, but ... tough ;-) We can either
turn it off, or only do it on things larger than the segment size, and
just round-robin the rest, or allocate from node with most free.
 
>> didn't Manfred or someone (Andi?) do this before? Or did that never
>> get accepted? I know we talked about it a while back.
> 
> I talked about it, but never implemented it. I am not aware of any
> other implementation of this before Brent's.

Cool, must have been my imagination ;-)

M.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-14 19:13   ` Andi Kleen
@ 2004-12-14 19:48     ` Brent Casavant
  2004-12-14 20:08     ` Martin J. Bligh
  2004-12-14 23:24     ` Nick Piggin
  2 siblings, 0 replies; 31+ messages in thread
From: Brent Casavant @ 2004-12-14 19:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Martin J. Bligh, linux-kernel, linux-mm, linux-ia64, Erik Jacobson

On Tue, 14 Dec 2004, Andi Kleen wrote:

> I originally was a bit worried about the TLB usage, but it doesn't
> seem to be a too big issue (hopefully the benchmarks weren't too
> micro though)

I had the same thought about TLB usage.  I would have liked a way
to map larger sections of memory with each TLB entry.  For example,
if we were going to allocate 128 pages on a 32-node system, it
would be nice to do 32 four-page allocations rather than 128 single-page
allocations.  But I didn't see any suitable infrastructure in
vmalloc or elsewhere to make that easily possible, so I didn't
pursue it.

As far as benchmarks -- I was happy just to find a suitable TCP
benchmark, though I share some of the same concern.  Other than
the netperf TCP_CC and TCP_CRR I couldn't find anything that seemed
like it might be a good test and could be set up with the resources
at hand (i.e. I don't have a large cluster to pound on a web server
benchmark).  That said, if someone does find an unresolvable problem
with the TCP portion (3/3) of the patch, I hope 1/3 and 2/3 are still
worthy of consideration.

> I talked about it, but never implemented it. I am not aware of any
> other implementation of this before Brent's.

To give credit where it's due, Erik Jacobson, also at SGI, proposed
pretty much the same idea on 2003-11-12 in "available memory imbalance
on large NUMA systems".  Andrew responded to that patch in a generally
favorable manner, though he asked whether we needed closer scrutiny
of whether hashes were being sized appropriately on large systems
(something that could still use further examination, BTW, particularly
for the TCP ehash).  I used Erik's patch to identify the particular
hashes I needed to tackle in this set.

Thanks,
Brent

-- 
Brent Casavant                          If you had nothing to fear,
bcasavan@sgi.com                        how then could you be brave?
Silicon Graphics, Inc.                    -- Queen Dama, Source Wars

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-14 18:59 ` Martin J. Bligh
  2004-12-14 19:13   ` Andi Kleen
@ 2004-12-14 19:30   ` Brent Casavant
  2004-12-14 20:10     ` Martin J. Bligh
  1 sibling, 1 reply; 31+ messages in thread
From: Brent Casavant @ 2004-12-14 19:30 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel, linux-mm, linux-ia64, ak

On Tue, 14 Dec 2004, Martin J. Bligh wrote:

> Yup, makes a lot of sense to me to stripe these, for the caches that
> are global (ie inodes, dentries, etc).  Only question I'd have is 
> didn't Manfred or someone (Andi?) do this before? Or did that never
> get accepted? I know we talked about it a while back.

Are you thinking of the 2004-06-05 patch from Andi about using
the NUMA policy API for boot time allocation?

If so, that patch was accepted, but affects neither allocations
performed via alloc_bootmem nor __get_free_pages, which are
currently used to allocate these hashes.  vmalloc, however, does
behave as desired with Andi's patch.

Which is why vmalloc was chosen to solve this problem.  There were
other more complicated possible solutions (e.g. multi-level hash tables,
with the bottommost/largest level being allocated across all nodes),
however those would have been so intrusive as to be unpalatable.
So the vmalloc solution seemed reasonable, as long as it is used
only on architectures with plentiful vmalloc space.

Thanks,
Brent

-- 
Brent Casavant                          If you had nothing to fear,
bcasavan@sgi.com                        how then could you be brave?
Silicon Graphics, Inc.                    -- Queen Dama, Source Wars

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-14 18:59 ` Martin J. Bligh
@ 2004-12-14 19:13   ` Andi Kleen
  2004-12-14 19:48     ` Brent Casavant
                       ` (2 more replies)
  2004-12-14 19:30   ` Brent Casavant
  1 sibling, 3 replies; 31+ messages in thread
From: Andi Kleen @ 2004-12-14 19:13 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Brent Casavant, linux-kernel, linux-mm, linux-ia64, ak

On Tue, Dec 14, 2004 at 10:59:50AM -0800, Martin J. Bligh wrote:
> > NUMA systems running current Linux kernels suffer from substantial
> > inequities in the amount of memory allocated from each NUMA node
> > during boot.  In particular, several large hashes are allocated
> > using alloc_bootmem, and as such are allocated contiguously from
> > a single node each.
> 
> Yup, makes a lot of sense to me to stripe these, for the caches that

I originally was a bit worried about the TLB usage, but it doesn't
seem to be a too big issue (hopefully the benchmarks weren't too
micro though)

> didn't Manfred or someone (Andi?) do this before? Or did that never
> get accepted? I know we talked about it a while back.

I talked about it, but never implemented it. I am not aware of any
other implementation of this before Brent's.

-Andi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] NUMA boot hash allocation interleaving
  2004-12-14 17:53 Brent Casavant
@ 2004-12-14 18:59 ` Martin J. Bligh
  2004-12-14 19:13   ` Andi Kleen
  2004-12-14 19:30   ` Brent Casavant
  0 siblings, 2 replies; 31+ messages in thread
From: Martin J. Bligh @ 2004-12-14 18:59 UTC (permalink / raw)
  To: Brent Casavant, linux-kernel, linux-mm; +Cc: linux-ia64, ak

> NUMA systems running current Linux kernels suffer from substantial
> inequities in the amount of memory allocated from each NUMA node
> during boot.  In particular, several large hashes are allocated
> using alloc_bootmem, and as such are allocated contiguously from
> a single node each.

Yup, makes a lot of sense to me to stripe these, for the caches that
are global (ie inodes, dentries, etc).  Only question I'd have is 
didn't Manfred or someone (Andi?) do this before? Or did that never
get accepted? I know we talked about it a while back.

M.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 0/3] NUMA boot hash allocation interleaving
@ 2004-12-14 18:32 Luck, Tony
  2004-12-15  0:28 ` Hiroyuki KAMEZAWA
  0 siblings, 1 reply; 31+ messages in thread
From: Luck, Tony @ 2004-12-14 18:32 UTC (permalink / raw)
  To: Brent Casavant, linux-kernel, linux-mm; +Cc: linux-ia64, ak

>this behavior is turned on by default only for IA64 NUMA systems

>A boot line parameter "hashdist" can be set to override the default
>behavior.


Note to node hot-plug developers ... if this patch goes in you
will also want to disable this behaviour, otherwise all nodes
become non-removable (unless you can transparently relocate the
physical memory backing all these tables).

-Tony

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 0/3] NUMA boot hash allocation interleaving
@ 2004-12-14 17:53 Brent Casavant
  2004-12-14 18:59 ` Martin J. Bligh
  0 siblings, 1 reply; 31+ messages in thread
From: Brent Casavant @ 2004-12-14 17:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm; +Cc: linux-ia64, ak

NUMA systems running current Linux kernels suffer from substantial
inequities in the amount of memory allocated from each NUMA node
during boot.  In particular, several large hashes are allocated
using alloc_bootmem, and as such are allocated contiguously from
a single node each.

This becomes a problem for certain workloads that are relatively common
on big-iron HPC NUMA systems.  In particular, a number of MPI and OpenMP
applications which require nearly all available processors in the system
and nearly all the memory on each node run into difficulties.  Due to the
uneven memory distribution onto a few nodes, any thread on those nodes will
require a portion of its memory be allocated from remote nodes.  Any
access to those memory locations will be slower than local accesses,
and thereby slows down the effective computation rate for the affected
CPUs/threads.  This problem is further amplified if the application is
tightly synchronized between threads (as is often the case), as the entire
job can run only at the speed of the slowest thread.

Additionally, since these hashes are usually accessed by all CPUs in the
system, the NUMA network link on the node which hosts the hash experiences
disproportionate traffic levels, thereby reducing the memory bandwidth
available to that node's CPUs, and further penalizing performance of the
threads executed thereupon.

As such, it is desired to find a way to distribute these large hash
allocations more evenly across NUMA nodes.  Fortunately current
kernels do perform allocation interleaving for vmalloc() during boot,
which provides a stepping stone to a solution.

This series of patches enables (but does not require) the kernel to
allocate several boot time hashes using vmalloc rather than alloc_bootmem,
thereby causing the hashes to be interleaved amongst NUMA nodes.  In
particular the dentry cache, inode cache, TCP ehash, and TCP bhash have been
changed to be allocated in this manner.  Due to the limited vmalloc space
on architectures such as i386, this behavior is turned on by default only
for IA64 NUMA systems (though there is no reason other interested
architectures could not enable it if desired).  Non-IA64 and non-NUMA
systems continue to use the existing alloc_bootmem() allocation mechanism.
A boot line parameter "hashdist" can be set to override the default
behavior.

The following two sets of example output show the uneven distribution
just after boot, using init=/bin/sh to eliminate as much non-kernel
allocation as possible.

Without the boot hash distribution patches:

 Nid  MemTotal   MemFree   MemUsed      (in kB)
   0   3870656   3697696    172960
   1   3882992   3866656     16336
   2   3883008   3866784     16224
   3   3882992   3866464     16528
   4   3883008   3866592     16416
   5   3883008   3866720     16288
   6   3882992   3342176    540816
   7   3883008   3865440     17568
   8   3882992   3866560     16432
   9   3883008   3866400     16608
  10   3882992   3866592     16400
  11   3883008   3866400     16608
  12   3882992   3866400     16592
  13   3883008   3866432     16576
  14   3883008   3866528     16480
  15   3864768   3848256     16512
 ToT  62097440  61152096    945344

Notice that nodes 0 and 6 have a substantially larger memory utilization
than all other nodes.

With the boot hash distribution patch:

 Nid  MemTotal   MemFree   MemUsed      (in kB)
   0   3870656   3789792     80864
   1   3882992   3843776     39216
   2   3883008   3843808     39200
   3   3882992   3843904     39088
   4   3883008   3827488     55520
   5   3883008   3843712     39296
   6   3882992   3843936     39056
   7   3883008   3844096     38912
   8   3882992   3843712     39280
   9   3883008   3844000     39008
  10   3882992   3843872     39120
  11   3883008   3843872     39136
  12   3882992   3843808     39184
  13   3883008   3843936     39072
  14   3883008   3843712     39296
  15   3864768   3825760     39008
 ToT  62097440  61413184    684256

While not perfectly even, we can see that there is a substantial
improvement in the spread of memory allocated by the kernel during
boot.  The remaining unevenness may be due in part to further boot
time allocations that could be addressed in a similar manner, but
some difference is due to the somewhat special nature of node 0
during boot.  However, the unevenness has fallen to a much more
acceptable level (at least to a level that SGI isn't concerned about).

The astute reader will also notice that in this example, with this patch
approximately 256 MB less memory was allocated during boot.  This is due
to the size limits of a single vmalloc.  More specifically, this is because
the automatically computed size of the TCP ehash exceeds the maximum
size which a single vmalloc can accommodate.  However, this is of little
practical concern as the vmalloc size limit simply reduces one ridiculously
large allocation (512MB) to a slightly less ridiculously large allocation
(256MB).  In practice machines with large memory configurations are using
the thash_entries setting to limit the size of the TCP ehash _much_ lower
than either of the automatically computed values.  Illustrative of the
exceedingly large nature of the automatically computed size, SGI
currently recommends that customers boot with thash_entries=2097152,
which works out to a 32MB allocation.  In any case, setting hashdist=0
will allow for allocations in excess of vmalloc limits, if so desired.

Other than the vmalloc limit, great care was taken to ensure that the
size of TCP hash allocations was not altered by this patch.  Due to
slightly different computation techniques between the existing TCP code
and alloc_large_system_hash (which is now utilized), some of the magic
constants in the TCP hash allocation code were changed.  On all sizes
of system (128MB through 64GB) that I had access to, the patched code
preserves the previous hash size, as long as the vmalloc limit
(256MB on IA64) is not encountered.

There was concern that changing the TCP-related hashes to use vmalloc
space may adversely impact network performance.  To this end the netperf
set of benchmarks was run.  Some individual tests seemed to benefit
slightly, some seemed to be harmed slightly, but in all cases the average
difference with and without these patches was well within the variability
I would see from run to run.

The following are the overall netperf averages (30 ten-second runs each)
against an older kernel with these same patches. These tests were run
over loopback as GigE results were so inconsistent run to run both with
and without these patches that they provided no meaningful comparison that
I could discern.  I used the same kernel (IA64 generic) for each run,
simply varying the new "hashdist" boot parameter to turn on or off the new
allocation behavior.  In all cases the thash_entries value was manually
specified as discussed previously to eliminate any variability that
might result from that size difference.

HP ZX1, hashdist=0
==================
TCP_RR = 19389
TCP_MAERTS = 6561 
TCP_STREAM = 6590 
TCP_CC = 9483 
TCP_CRR = 8633 

HP ZX1, hashdist=1
==================
TCP_RR = 19411
TCP_MAERTS = 6559 
TCP_STREAM = 6584 
TCP_CC = 9454 
TCP_CRR = 8626 

SGI Altix, hashdist=0
=====================
TCP_RR = 16871
TCP_MAERTS = 3925 
TCP_STREAM = 4055 
TCP_CC = 8438 
TCP_CRR = 7750 

SGI Altix, hashdist=1
=====================
TCP_RR = 17040
TCP_MAERTS = 3913 
TCP_STREAM = 4044 
TCP_CC = 8367 
TCP_CRR = 7538 

I believe the TCP_CC and TCP_CRR are the tests most sensitive to this
particular change.  But again, I want to emphasize that even the
differences you see above are _well_ within the variability I saw
from run to run of any given test.

-- 
Brent Casavant                          If you had nothing to fear,
bcasavan@sgi.com                        how then could you be brave?
Silicon Graphics, Inc.                    -- Queen Dama, Source Wars

^ permalink raw reply	[flat|nested] 31+ messages in thread

Thread overview: 31+ messages
2004-12-15 17:25 [PATCH 0/3] NUMA boot hash allocation interleaving Luck, Tony
  -- strict thread matches above, loose matches on Subject: below --
2004-12-20 23:36 Brent Casavant
2004-12-14 18:32 Luck, Tony
2004-12-15  0:28 ` Hiroyuki KAMEZAWA
2004-12-14 17:53 Brent Casavant
2004-12-14 18:59 ` Martin J. Bligh
2004-12-14 19:13   ` Andi Kleen
2004-12-14 19:48     ` Brent Casavant
2004-12-14 20:08     ` Martin J. Bligh
2004-12-14 23:24       ` Brent Casavant
2004-12-14 22:00         ` Martin J. Bligh
2004-12-15  4:58           ` Andi Kleen
2004-12-15 14:47             ` Anton Blanchard
2004-12-15 23:37               ` Brent Casavant
2004-12-16  5:02               ` Andi Kleen
     [not found]                 ` <20041216051323.GI24000@krispykreme.ozlabs.ibm.com>
2004-12-16 14:18                   ` Jose R. Santos
2004-12-20 16:56                     ` Jose R. Santos
2004-12-21 11:46                       ` Anton Blanchard
2004-12-21 16:23                         ` Brent Casavant
2004-12-23  2:19                           ` Jose R. Santos
2004-12-15  4:08         ` Andi Kleen
2004-12-15  7:14           ` Martin J. Bligh
2004-12-15  7:17             ` Andi Kleen
2004-12-15 15:08               ` Martin J. Bligh
2004-12-15 18:24               ` Brent Casavant
2004-12-15  7:41           ` Eric Dumazet
2004-12-15  7:46             ` Andi Kleen
2004-12-15  9:14               ` Andi Kleen
2004-12-14 23:24     ` Nick Piggin
2004-12-14 19:30   ` Brent Casavant
2004-12-14 20:10     ` Martin J. Bligh
