linux-kernel.vger.kernel.org archive mirror
* [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
@ 2008-05-30  3:56 Christoph Lameter
  2008-05-30  3:56 ` [patch 01/41] cpu_alloc: Increase percpu area size to 128k Christoph Lameter
                   ` (41 more replies)
  0 siblings, 42 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

In various places the kernel maintains arrays of pointers indexed by
processor number. These are used to locate objects that need to be used
when executing on a specific processor. Both the slab allocator
and the page allocator use these arrays and there the arrays are used in
performance critical code. The allocpercpu functionality is a simple
allocator to provide these arrays. However, there are certain drawbacks
to using such arrays:

1. The arrays become huge for large systems and may be very sparsely
   populated (if they are dimensioned for NR_CPUS) on an architecture
   like IA64 that allows up to 4k cpus if a kernel is then booted on a
   machine that only supports 8 processors. We could use nr_cpu_ids there
   but we would still have to allocate entries for all possible processors
   up to the number of processor ids. cpu_alloc can deal with sparse
   cpu_maps.

2. The arrays cause surrounding variables to no longer fit into a single
   cacheline. The layout of core data structures is typically optimized so
   that variables frequently used together are placed in the same cacheline.
   Arrays of pointers move these variables far apart and destroy this effect.

3. A processor frequently follows only one pointer for its own use. Thus
   the cacheline with that pointer has to be kept in memory. The neighboring
   pointers all belong to other processors and are rarely used. So a whole
   cacheline of 128 bytes may be consumed while only 8 bytes of information
   are in constant use. It would be better to be able to place more
   information in this cacheline.

4. The lookup of the per cpu object is expensive and requires multiple
   memory accesses to:

   A) smp_processor_id()
   B) pointer to the base of the per cpu pointer array
   C) pointer to the per cpu object in the pointer array
   D) the per cpu object itself (a rough sketch of this access path
      follows this list).

5. Each use of allocpercpu requires its own per cpu pointer array. On large
   systems these large arrays have to be allocated again and again.

6. Processor hotplug cannot effectively track the per cpu objects
   since the VM cannot find all memory that was allocated for
   a specific cpu. It is impossible to add or remove objects in
   a consistent way. Although the allocpercpu subsystem was extended
   to add that capability, it is not used since doing so would require
   adding cpu hotplug callbacks to each and every user of allocpercpu
   in the kernel.
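
To make drawback 4 concrete, the access path through such a pointer
array looks roughly like the following sketch (simplified, not the
literal allocpercpu implementation; stat and counter are made-up names):

	struct stat_struct **stat;	/* NR_CPUS pointers, one per processor */
	struct stat_struct *p;

	/* A) processor number, B) array base, C) pointer slot ... */
	p = stat[smp_processor_id()];
	/* D) ... and only then the per cpu object itself */
	p->counter++;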

The patchset here provides a cpu allocator that arranges data differently.
Objects are placed tightly in linear areas reserved for each processor.
The areas are of a fixed size so that address calculation can be used
instead of a lookup. This means that:

1. The VM knows where all the per cpu variables are and it could remove
   or add cpu areas as cpus come online or go offline.

2. There is only a single per cpu array that is used for the percpu area
   and all per cpu allocations.

3. The lookup of a per cpu object is easy and requires memory accesses
   only to (worst case: the architecture does not provide cpu ops):

   A) the per cpu offset from the per cpu offset table
      (if it is the current processor then there is usually some
      more efficient means of retrieving the offset)
   B) the per cpu pointer to the object
   C) the per cpu object itself.

4. Surrounding variables can be placed in the same cacheline.
   This allows e.g. SLUB to avoid caching objects in per cpu structures
   since the kmem_cache structure is finally available without the need
   to access a cache cold cacheline.

5. A single pointer can be used regardless of the number of processors
   in the system.

The cpu allocator manages a fixed size per cpu data area. The size
can be configured as needed.
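
For illustration, allocation and access with the interfaces introduced
in this series look roughly like this (a minimal sketch; stat is a
made-up example structure):

	struct stat_struct {
		unsigned long counter;
	} *stat;
	unsigned long total = 0;
	int cpu;

	/* one instance per possible processor, packed into the cpu area */
	stat = CPU_ALLOC(struct stat_struct, GFP_KERNEL | __GFP_ZERO);

	/* atomic increment of this processor's instance (see the cpu ops patch) */
	CPU_INC(stat->counter);

	/* walk all instances, f.e. to fold a sum */
	for_each_possible_cpu(cpu)
		total += CPU_PTR(stat, cpu)->counter;

	CPU_FREE(stat);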

The current usage of the cpu area can be seen in the field

	cpu_bytes

in /proc/vmstat

The patchset is against 2.6.26-rc4.

There are two arch implementations of cpu ops provided.

1. x86. Another version of the zero based x86 patches
   exists from Mike.

2. IA64. Limited implementation since IA64 has
   no fast RMW (read-modify-write) ops. But we can avoid the addition
   of my_cpu_offset in hot paths.

This is a rather complex patchset and I am not sure how to merge it.
Maybe it would be best to merge a piece at a time beginning with the
basic infrastructure in the first few patches?

-- 


* [patch 01/41] cpu_alloc: Increase percpu area size to 128k
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-06-02 17:58   ` Luck, Tony
  2008-05-30  3:56 ` [patch 02/41] cpu alloc: The allocator Christoph Lameter
                   ` (40 subsequent siblings)
  41 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: ia64_increase_percpu_size --]
[-- Type: text/plain, Size: 910 bytes --]

The per cpu allocator requires more per cpu space and we are already near
the limit on IA64. Increase the maximum size of the IA64 per cpu area from
64K to 128K.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/asm-ia64/page.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/include/asm-ia64/page.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/page.h	2008-05-29 12:10:42.216486476 -0700
+++ linux-2.6/include/asm-ia64/page.h	2008-05-29 12:11:06.218953049 -0700
@@ -42,7 +42,7 @@
 #define PAGE_MASK		(~(PAGE_SIZE - 1))
 #define PAGE_ALIGN(addr)	(((addr) + PAGE_SIZE - 1) & PAGE_MASK)
 
-#define PERCPU_PAGE_SHIFT	16	/* log2() of max. size of per-CPU area */
+#define PERCPU_PAGE_SHIFT	17	/* log2() of max. size of per-CPU area */
 #define PERCPU_PAGE_SIZE	(__IA64_UL_CONST(1) << PERCPU_PAGE_SHIFT)
 
 

-- 


* [patch 02/41] cpu alloc: The allocator
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
  2008-05-30  3:56 ` [patch 01/41] cpu_alloc: Increase percpu area size to 128k Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
                     ` (3 more replies)
  2008-05-30  3:56 ` [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator Christoph Lameter
                   ` (39 subsequent siblings)
  41 siblings, 4 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_base --]
[-- Type: text/plain, Size: 9598 bytes --]

The per cpu allocator allows dynamic allocation of memory on all
processors simultaneously. A bitmap is used to track used areas.
The allocator implements tight packing to reduce the cache footprint
and increase speed since cacheline contention is typically not a concern
for memory mainly used by a single cpu. Small objects will fill up gaps
left by larger allocations that required alignments.

The size of the cpu_alloc area can be changed via make menuconfig.
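
As an illustration of the packing behavior described above, a sequence
of raw calls could look like this (a sketch only; the sizes and
alignments are arbitrary examples):

	/* 40 bytes, 64-byte aligned: may leave a gap in the unit bitmap */
	void *a = cpu_alloc(40, GFP_KERNEL, 64);

	/* a later small allocation can fill such a gap */
	void *b = cpu_alloc(8, GFP_KERNEL | __GFP_ZERO, sizeof(int));

	cpu_free(b, 8);
	cpu_free(a, 40);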

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/percpu.h |   46 +++++++++++++
 include/linux/vmstat.h |    2 
 mm/Kconfig             |    6 +
 mm/Makefile            |    2 
 mm/cpu_alloc.c         |  167 +++++++++++++++++++++++++++++++++++++++++++++++++
 mm/vmstat.c            |    1 
 6 files changed, 222 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h	2008-05-29 19:41:21.000000000 -0700
+++ linux-2.6/include/linux/vmstat.h	2008-05-29 20:15:37.000000000 -0700
@@ -37,7 +37,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		FOR_ALL_ZONES(PGSCAN_KSWAPD),
 		FOR_ALL_ZONES(PGSCAN_DIRECT),
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
-		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		PAGEOUTRUN, ALLOCSTALL, PGROTATED, CPU_BYTES,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-05-29 19:41:21.000000000 -0700
+++ linux-2.6/mm/Kconfig	2008-05-29 20:13:39.000000000 -0700
@@ -205,3 +205,9 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config CPU_ALLOC_SIZE
+	int "Size of cpu alloc area"
+	default "30000"
+	help
+	  Sets the maximum amount of memory that can be allocated via cpu_alloc
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-05-29 19:41:21.000000000 -0700
+++ linux-2.6/mm/Makefile	2008-05-29 20:15:41.000000000 -0700
@@ -11,7 +11,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
 			   maccess.o page_alloc.o page-writeback.o pdflush.o \
 			   readahead.o swap.o truncate.o vmscan.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
-			   page_isolation.o $(mmu-y)
+			   page_isolation.o cpu_alloc.o $(mmu-y)
 
 obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
 obj-$(CONFIG_BOUNCE)	+= bounce.o
Index: linux-2.6/mm/cpu_alloc.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/cpu_alloc.c	2008-05-29 20:13:39.000000000 -0700
@@ -0,0 +1,167 @@
+/*
+ * Cpu allocator - Manage objects allocated for each processor
+ *
+ * (C) 2008 SGI, Christoph Lameter <clameter@sgi.com>
+ * 	Basic implementation with allocation and free from a dedicated per
+ * 	cpu area.
+ *
+ * The per cpu allocator allows dynamic allocation of memory on all
+ * processor simultaneously. A bitmap is used to track used areas.
+ * The allocator implements tight packing to reduce the cache footprint
+ * and increase speed since cacheline contention is typically not a concern
+ * for memory mainly used by a single cpu. Small objects will fill up gaps
+ * left by larger allocations that required alignments.
+ */
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/bitmap.h>
+#include <asm/sections.h>
+
+/*
+ * Basic allocation unit. A bit map is created to track the use of each
+ * UNIT_SIZE element in the cpu area.
+ */
+#define UNIT_TYPE int
+#define UNIT_SIZE sizeof(UNIT_TYPE)
+#define UNITS (CONFIG_CPU_ALLOC_SIZE / UNIT_SIZE)
+
+static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
+
+/*
+ * How many units are needed for an object of a given size
+ */
+static int size_to_units(unsigned long size)
+{
+	return DIV_ROUND_UP(size, UNIT_SIZE);
+}
+
+/*
+ * Lock to protect the bitmap and the meta data for the cpu allocator.
+ */
+static DEFINE_SPINLOCK(cpu_alloc_map_lock);
+static DECLARE_BITMAP(cpu_alloc_map, UNITS);
+static int first_free;		/* First known free unit */
+
+/*
+ * Mark an object as used in the cpu_alloc_map
+ *
+ * Must hold cpu_alloc_map_lock
+ */
+static void set_map(int start, int length)
+{
+	while (length-- > 0)
+		__set_bit(start++, cpu_alloc_map);
+}
+
+/*
+ * Mark an area as freed.
+ *
+ * Must hold cpu_alloc_map_lock
+ */
+static void clear_map(int start, int length)
+{
+	while (length-- > 0)
+		__clear_bit(start++, cpu_alloc_map);
+}
+
+/*
+ * Allocate an object of a certain size
+ *
+ * Returns a special pointer that can be used with CPU_PTR to find the
+ * address of the object for a certain cpu.
+ */
+void *cpu_alloc(unsigned long size, gfp_t gfpflags, unsigned long align)
+{
+	unsigned long start;
+	int units = size_to_units(size);
+	void *ptr;
+	int first;
+	unsigned long flags;
+
+	if (!size)
+		return ZERO_SIZE_PTR;
+
+	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
+
+	first = 1;
+	start = first_free;
+
+	for ( ; ; ) {
+
+		start = find_next_zero_bit(cpu_alloc_map, UNITS, start);
+		if (start >= UNITS)
+			goto out_of_memory;
+
+		if (first)
+			first_free = start;
+
+		/*
+		 * Check alignment and that there is enough space after
+		 * the starting unit.
+		 */
+		if (start % (align / UNIT_SIZE) == 0 &&
+			find_next_bit(cpu_alloc_map, UNITS, start + 1)
+							>= start + units)
+				break;
+		start++;
+		first = 0;
+	}
+
+	if (first)
+		first_free = start + units;
+
+	if (start + units > UNITS)
+		goto out_of_memory;
+
+	set_map(start, units);
+	__count_vm_events(CPU_BYTES, units * UNIT_SIZE);
+
+	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
+
+	ptr = per_cpu_var(area) + start;
+
+	if (gfpflags & __GFP_ZERO) {
+		int cpu;
+
+		for_each_possible_cpu(cpu)
+			memset(CPU_PTR(ptr, cpu), 0, size);
+	}
+
+	return ptr;
+
+out_of_memory:
+	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
+	return NULL;
+}
+EXPORT_SYMBOL(cpu_alloc);
+
+/*
+ * Free an object. The pointer must be a cpu pointer allocated
+ * via cpu_alloc.
+ */
+void cpu_free(void *start, unsigned long size)
+{
+	unsigned long units = size_to_units(size);
+	unsigned long index = (int *)start - per_cpu_var(area);
+	unsigned long flags;
+
+	if (!start || start == ZERO_SIZE_PTR)
+		return;
+
+	BUG_ON(index >= UNITS ||
+		!test_bit(index, cpu_alloc_map) ||
+		!test_bit(index + units - 1, cpu_alloc_map));
+
+	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
+
+	clear_map(index, units);
+	__count_vm_events(CPU_BYTES, -units * UNIT_SIZE);
+
+	if (index < first_free)
+		first_free = index;
+
+	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
+}
+EXPORT_SYMBOL(cpu_free);
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2008-05-29 19:41:21.000000000 -0700
+++ linux-2.6/mm/vmstat.c	2008-05-29 20:13:39.000000000 -0700
@@ -653,6 +653,7 @@ static const char * const vmstat_text[] 
 	"allocstall",
 
 	"pgrotated",
+	"cpu_bytes",
 #ifdef CONFIG_HUGETLB_PAGE
 	"htlb_buddy_alloc_success",
 	"htlb_buddy_alloc_fail",
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2008-05-29 19:41:21.000000000 -0700
+++ linux-2.6/include/linux/percpu.h	2008-05-29 20:29:12.000000000 -0700
@@ -135,4 +135,50 @@ static inline void percpu_free(void *__p
 #define free_percpu(ptr)	percpu_free((ptr))
 #define per_cpu_ptr(ptr, cpu)	percpu_ptr((ptr), (cpu))
 
+
+/*
+ * cpu allocator definitions
+ *
+ * The cpu allocator allows allocating an instance of an object for each
+ * processor and the use of a single pointer to access all instances
+ * of the object. cpu_alloc provides optimized means for accessing the
+ * instance of the object belonging to the currently executing processor
+ * as well as special atomic operations on fields of objects of the
+ * currently executing processor.
+ *
+ * Cpu objects are typically small. The allocator packs them tightly
+ * to increase the chance on each access that a per cpu object is already
+ * cached. Alignments may be specified but the intent is to align the data
+ * properly due to cpu alignment constraints and not to avoid cacheline
+ * contention. Any holes left by aligning objects are filled up with smaller
+ * objects that are allocated later.
+ *
+ * Cpu data can be allocated using CPU_ALLOC. The resulting pointer is
+ * pointing to the instance of the variable in the per cpu area provided
+ * by the loader. It is generally an error to use the pointer directly
+ * unless we are booting the system.
+ *
+ * __GFP_ZERO may be passed as a flag to zero the allocated memory.
+ */
+
+/* Return a pointer to the instance of a object for a particular processor */
+#define CPU_PTR(__p, __cpu)	SHIFT_PERCPU_PTR((__p), per_cpu_offset(__cpu))
+
+/*
+ * Return a pointer to the instance of the object belonging to the processor
+ * running the current code.
+ */
+#define THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), my_cpu_offset)
+#define __THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), __my_cpu_offset)
+
+#define CPU_ALLOC(type, flags)	((typeof(type) *)cpu_alloc(sizeof(type), \
+					(flags), __alignof__(type)))
+#define CPU_FREE(pointer)	cpu_free((pointer), sizeof(*(pointer)))
+
+/*
+ * Raw calls
+ */
+void *cpu_alloc(unsigned long size, gfp_t flags, unsigned long align);
+void cpu_free(void *cpu_pointer, unsigned long size);
+
 #endif /* __LINUX_PERCPU_H */

-- 


* [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
  2008-05-30  3:56 ` [patch 01/41] cpu_alloc: Increase percpu area size to 128k Christoph Lameter
  2008-05-30  3:56 ` [patch 02/41] cpu alloc: The allocator Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
  2008-05-30  6:08   ` Rusty Russell
  2008-05-30  3:56 ` [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations Christoph Lameter
                   ` (38 subsequent siblings)
  41 siblings, 2 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_replace_modules_per_cpu_allocator --]
[-- Type: text/plain, Size: 13022 bytes --]

Remove the builtin per cpu allocator from kernel/module.c and use cpu_alloc
instead.

The patch also removes PERCPU_ENOUGH_ROOM. The size of the cpu_alloc area is
now determined by CONFIG_CPU_ALLOC_SIZE. PERCPU_ENOUGH_ROOM reserved 8k for
modules by default, while CONFIG_CPU_ALLOC_SIZE defaults to 30k. Thus we have
more space to load modules.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/powerpc/kernel/setup_64.c |    5 -
 arch/sparc64/kernel/smp.c      |    2 
 arch/x86/kernel/setup.c        |   11 +-
 include/asm-ia64/percpu.h      |    2 
 include/linux/module.h         |    1 
 include/linux/percpu.h         |   11 --
 init/main.c                    |    9 --
 kernel/lockdep.c               |    2 
 kernel/module.c                |  178 +++--------------------------------------
 9 files changed, 28 insertions(+), 193 deletions(-)

Index: linux-2.6/kernel/module.c
===================================================================
--- linux-2.6.orig/kernel/module.c	2008-05-29 17:57:39.825214766 -0700
+++ linux-2.6/kernel/module.c	2008-05-29 18:00:50.496815514 -0700
@@ -314,121 +314,6 @@ static struct module *find_module(const 
 	return NULL;
 }
 
-#ifdef CONFIG_SMP
-/* Number of blocks used and allocated. */
-static unsigned int pcpu_num_used, pcpu_num_allocated;
-/* Size of each block.  -ve means used. */
-static int *pcpu_size;
-
-static int split_block(unsigned int i, unsigned short size)
-{
-	/* Reallocation required? */
-	if (pcpu_num_used + 1 > pcpu_num_allocated) {
-		int *new;
-
-		new = krealloc(pcpu_size, sizeof(new[0])*pcpu_num_allocated*2,
-			       GFP_KERNEL);
-		if (!new)
-			return 0;
-
-		pcpu_num_allocated *= 2;
-		pcpu_size = new;
-	}
-
-	/* Insert a new subblock */
-	memmove(&pcpu_size[i+1], &pcpu_size[i],
-		sizeof(pcpu_size[0]) * (pcpu_num_used - i));
-	pcpu_num_used++;
-
-	pcpu_size[i+1] -= size;
-	pcpu_size[i] = size;
-	return 1;
-}
-
-static inline unsigned int block_size(int val)
-{
-	if (val < 0)
-		return -val;
-	return val;
-}
-
-static void *percpu_modalloc(unsigned long size, unsigned long align,
-			     const char *name)
-{
-	unsigned long extra;
-	unsigned int i;
-	void *ptr;
-
-	if (align > PAGE_SIZE) {
-		printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
-		       name, align, PAGE_SIZE);
-		align = PAGE_SIZE;
-	}
-
-	ptr = __per_cpu_start;
-	for (i = 0; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
-		/* Extra for alignment requirement. */
-		extra = ALIGN((unsigned long)ptr, align) - (unsigned long)ptr;
-		BUG_ON(i == 0 && extra != 0);
-
-		if (pcpu_size[i] < 0 || pcpu_size[i] < extra + size)
-			continue;
-
-		/* Transfer extra to previous block. */
-		if (pcpu_size[i-1] < 0)
-			pcpu_size[i-1] -= extra;
-		else
-			pcpu_size[i-1] += extra;
-		pcpu_size[i] -= extra;
-		ptr += extra;
-
-		/* Split block if warranted */
-		if (pcpu_size[i] - size > sizeof(unsigned long))
-			if (!split_block(i, size))
-				return NULL;
-
-		/* Mark allocated */
-		pcpu_size[i] = -pcpu_size[i];
-		return ptr;
-	}
-
-	printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
-	       size);
-	return NULL;
-}
-
-static void percpu_modfree(void *freeme)
-{
-	unsigned int i;
-	void *ptr = __per_cpu_start + block_size(pcpu_size[0]);
-
-	/* First entry is core kernel percpu data. */
-	for (i = 1; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
-		if (ptr == freeme) {
-			pcpu_size[i] = -pcpu_size[i];
-			goto free;
-		}
-	}
-	BUG();
-
- free:
-	/* Merge with previous? */
-	if (pcpu_size[i-1] >= 0) {
-		pcpu_size[i-1] += pcpu_size[i];
-		pcpu_num_used--;
-		memmove(&pcpu_size[i], &pcpu_size[i+1],
-			(pcpu_num_used - i) * sizeof(pcpu_size[0]));
-		i--;
-	}
-	/* Merge with next? */
-	if (i+1 < pcpu_num_used && pcpu_size[i+1] >= 0) {
-		pcpu_size[i] += pcpu_size[i+1];
-		pcpu_num_used--;
-		memmove(&pcpu_size[i+1], &pcpu_size[i+2],
-			(pcpu_num_used - (i+1)) * sizeof(pcpu_size[0]));
-	}
-}
-
 static unsigned int find_pcpusec(Elf_Ehdr *hdr,
 				 Elf_Shdr *sechdrs,
 				 const char *secstrings)
@@ -444,48 +329,6 @@ static void percpu_modcopy(void *pcpudes
 		memcpy(pcpudest + per_cpu_offset(cpu), from, size);
 }
 
-static int percpu_modinit(void)
-{
-	pcpu_num_used = 2;
-	pcpu_num_allocated = 2;
-	pcpu_size = kmalloc(sizeof(pcpu_size[0]) * pcpu_num_allocated,
-			    GFP_KERNEL);
-	/* Static in-kernel percpu data (used). */
-	pcpu_size[0] = -(__per_cpu_end-__per_cpu_start);
-	/* Free room. */
-	pcpu_size[1] = PERCPU_ENOUGH_ROOM + pcpu_size[0];
-	if (pcpu_size[1] < 0) {
-		printk(KERN_ERR "No per-cpu room for modules.\n");
-		pcpu_num_used = 1;
-	}
-
-	return 0;
-}
-__initcall(percpu_modinit);
-#else /* ... !CONFIG_SMP */
-static inline void *percpu_modalloc(unsigned long size, unsigned long align,
-				    const char *name)
-{
-	return NULL;
-}
-static inline void percpu_modfree(void *pcpuptr)
-{
-	BUG();
-}
-static inline unsigned int find_pcpusec(Elf_Ehdr *hdr,
-					Elf_Shdr *sechdrs,
-					const char *secstrings)
-{
-	return 0;
-}
-static inline void percpu_modcopy(void *pcpudst, const void *src,
-				  unsigned long size)
-{
-	/* pcpusec should be 0, and size of that section should be 0. */
-	BUG_ON(size != 0);
-}
-#endif /* CONFIG_SMP */
-
 #define MODINFO_ATTR(field)	\
 static void setup_modinfo_##field(struct module *mod, const char *s)  \
 {                                                                     \
@@ -1403,7 +1246,7 @@ static void free_module(struct module *m
 	module_free(mod, mod->module_init);
 	kfree(mod->args);
 	if (mod->percpu)
-		percpu_modfree(mod->percpu);
+		cpu_free(mod->percpu, mod->percpu_size);
 
 	/* Free lock-classes: */
 	lockdep_free_key_range(mod->module_core, mod->core_size);
@@ -1772,6 +1615,7 @@ static struct module *load_module(void _
 	unsigned int markersstringsindex;
 	struct module *mod;
 	long err = 0;
+	unsigned long percpu_size = 0;
 	void *percpu = NULL, *ptr = NULL; /* Stops spurious gcc warning */
 	struct exception_table_entry *extable;
 	mm_segment_t old_fs;
@@ -1918,15 +1762,25 @@ static struct module *load_module(void _
 
 	if (pcpuindex) {
 		/* We have a special allocation for this section. */
-		percpu = percpu_modalloc(sechdrs[pcpuindex].sh_size,
-					 sechdrs[pcpuindex].sh_addralign,
-					 mod->name);
+		unsigned long align = sechdrs[pcpuindex].sh_addralign;
+		unsigned long size = sechdrs[pcpuindex].sh_size;
+
+		if (align > PAGE_SIZE) {
+			printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
+			mod->name, align, PAGE_SIZE);
+			align = PAGE_SIZE;
+		}
+		percpu = cpu_alloc(size, GFP_KERNEL|__GFP_ZERO, align);
+		if (!percpu)
+			printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
+										size);
 		if (!percpu) {
 			err = -ENOMEM;
 			goto free_mod;
 		}
 		sechdrs[pcpuindex].sh_flags &= ~(unsigned long)SHF_ALLOC;
 		mod->percpu = percpu;
+		mod->percpu_size = percpu_size;
 	}
 
 	/* Determine total sizes, and put offsets in sh_entsize.  For now
@@ -2175,7 +2029,7 @@ static struct module *load_module(void _
 	module_free(mod, mod->module_core);
  free_percpu:
 	if (percpu)
-		percpu_modfree(percpu);
+		cpu_free(percpu, percpu_size);
  free_mod:
 	kfree(args);
  free_hdr:
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2008-05-29 17:58:32.328714051 -0700
+++ linux-2.6/include/linux/percpu.h	2008-05-29 17:58:53.652714198 -0700
@@ -34,17 +34,6 @@
 #define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var)
 #define EXPORT_PER_CPU_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var)
 
-/* Enough to cover all DEFINE_PER_CPUs in kernel, including modules. */
-#ifndef PERCPU_ENOUGH_ROOM
-#ifdef CONFIG_MODULES
-#define PERCPU_MODULE_RESERVE	8192
-#else
-#define PERCPU_MODULE_RESERVE	0
-#endif
-
-#define PERCPU_ENOUGH_ROOM						\
-	(__per_cpu_end - __per_cpu_start + PERCPU_MODULE_RESERVE)
-#endif	/* PERCPU_ENOUGH_ROOM */
 
 /*
  * Must be an lvalue. Since @var must be a simple identifier,
Index: linux-2.6/include/linux/module.h
===================================================================
--- linux-2.6.orig/include/linux/module.h	2008-05-29 17:57:38.341214464 -0700
+++ linux-2.6/include/linux/module.h	2008-05-29 17:58:53.652714198 -0700
@@ -334,6 +334,7 @@ struct module
 
 	/* Per-cpu data. */
 	void *percpu;
+	int percpu_size;
 
 	/* The command line arguments (may be mangled).  People like
 	   keeping pointers to this stuff */
Index: linux-2.6/arch/powerpc/kernel/setup_64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/setup_64.c	2008-05-29 17:57:38.357214432 -0700
+++ linux-2.6/arch/powerpc/kernel/setup_64.c	2008-05-29 17:58:53.652714198 -0700
@@ -596,11 +596,6 @@ void __init setup_per_cpu_areas(void)
 
 	/* Copy section for each CPU (we discard the original) */
 	size = ALIGN(__per_cpu_end - __per_cpu_start, PAGE_SIZE);
-#ifdef CONFIG_MODULES
-	if (size < PERCPU_ENOUGH_ROOM)
-		size = PERCPU_ENOUGH_ROOM;
-#endif
-
 	for_each_possible_cpu(i) {
 		ptr = alloc_bootmem_pages_node(NODE_DATA(cpu_to_node(i)), size);
 		if (!ptr)
Index: linux-2.6/arch/sparc64/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/sparc64/kernel/smp.c	2008-05-29 17:57:38.364714166 -0700
+++ linux-2.6/arch/sparc64/kernel/smp.c	2008-05-29 17:58:53.652714198 -0700
@@ -1454,7 +1454,7 @@ void __init real_setup_per_cpu_areas(voi
 	char *ptr;
 
 	/* Copy section for each CPU (we discard the original) */
-	goal = PERCPU_ENOUGH_ROOM;
+	goal = __per_cpu_size;
 
 	__per_cpu_shift = PAGE_SHIFT;
 	for (size = PAGE_SIZE; size < goal; size <<= 1UL)
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c	2008-05-29 17:57:39.592714425 -0700
+++ linux-2.6/arch/x86/kernel/setup.c	2008-05-29 17:58:53.652714198 -0700
@@ -89,30 +89,29 @@ EXPORT_SYMBOL(__per_cpu_offset);
 void __init setup_per_cpu_areas(void)
 {
 	int i, highest_cpu = 0;
-	unsigned long size;
 
 #ifdef CONFIG_HOTPLUG_CPU
 	prefill_possible_map();
 #endif
 
 	/* Copy section for each CPU (we discard the original) */
-	size = PERCPU_ENOUGH_ROOM;
 	printk(KERN_INFO "PERCPU: Allocating %lu bytes of per cpu data\n",
-			  size);
+			  __per_cpu_size);
 
 	for_each_possible_cpu(i) {
 		char *ptr;
 #ifndef CONFIG_NEED_MULTIPLE_NODES
-		ptr = alloc_bootmem_pages(size);
+		ptr = alloc_bootmem_pages(__per_cpu_size);
 #else
 		int node = early_cpu_to_node(i);
 		if (!node_online(node) || !NODE_DATA(node)) {
-			ptr = alloc_bootmem_pages(size);
+			ptr = alloc_bootmem_pages(__per_cpu_size);
 			printk(KERN_INFO
 			       "cpu %d has no node or node-local memory\n", i);
 		}
 		else
-			ptr = alloc_bootmem_pages_node(NODE_DATA(node), size);
+			ptr = alloc_bootmem_pages_node(NODE_DATA(node),
+								__per_cpu_size);
 #endif
 		if (!ptr)
 			panic("Cannot allocate cpu data for CPU %d\n", i);
Index: linux-2.6/include/asm-ia64/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/percpu.h	2008-05-29 17:57:38.349214528 -0700
+++ linux-2.6/include/asm-ia64/percpu.h	2008-05-29 17:58:53.652714198 -0700
@@ -6,8 +6,6 @@
  *	David Mosberger-Tang <davidm@hpl.hp.com>
  */
 
-#define PERCPU_ENOUGH_ROOM PERCPU_PAGE_SIZE
-
 #ifdef __ASSEMBLY__
 # define THIS_CPU(var)	(per_cpu__##var)  /* use this to mark accesses to per-CPU variables... */
 #else /* !__ASSEMBLY__ */
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c	2008-05-29 17:57:38.380714353 -0700
+++ linux-2.6/init/main.c	2008-05-29 17:58:53.652714198 -0700
@@ -393,18 +393,17 @@ EXPORT_SYMBOL(__per_cpu_offset);
 
 static void __init setup_per_cpu_areas(void)
 {
-	unsigned long size, i;
+	unsigned long i;
 	char *ptr;
 	unsigned long nr_possible_cpus = num_possible_cpus();
 
 	/* Copy section for each CPU (we discard the original) */
-	size = ALIGN(PERCPU_ENOUGH_ROOM, PAGE_SIZE);
-	ptr = alloc_bootmem_pages(size * nr_possible_cpus);
+	ptr = alloc_bootmem_pages(__per_cpu_size * nr_possible_cpus);
 
 	for_each_possible_cpu(i) {
 		__per_cpu_offset[i] = ptr - __per_cpu_start;
-		memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
-		ptr += size;
+		memcpy(ptr, __per_cpu_start, __per_cpu_size);
+		ptr += __per_cpu_size;
 	}
 }
 #endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
Index: linux-2.6/kernel/lockdep.c
===================================================================
--- linux-2.6.orig/kernel/lockdep.c	2008-05-29 17:57:39.816713970 -0700
+++ linux-2.6/kernel/lockdep.c	2008-05-29 17:59:22.697422432 -0700
@@ -610,7 +610,7 @@ static int static_obj(void *obj)
 	 */
 	for_each_possible_cpu(i) {
 		start = (unsigned long) &__per_cpu_start + per_cpu_offset(i);
-		end   = (unsigned long) &__per_cpu_start + PERCPU_ENOUGH_ROOM
+		end   = (unsigned long) &__per_cpu_start + __per_cpu_size
 					+ per_cpu_offset(i);
 
 		if ((addr >= start) && (addr < end))

-- 


* [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (2 preceding siblings ...)
  2008-05-30  3:56 ` [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
  2008-05-30  3:56 ` [patch 05/41] cpu alloc: Percpu_counter conversion Christoph Lameter
                   ` (37 subsequent siblings)
  41 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_ops_base --]
[-- Type: text/plain, Size: 7877 bytes --]

Currently the per cpu subsystem is not able to use the atomic capabilities
that are provided by many of the available processors.

This patch adds new functionality that allows optimizing per cpu
variable handling. In particular it provides a simple way to exploit
atomic operations in order to avoid having to disable interrupts or
perform address calculations to access per cpu data.

For example, using our current methods we may do:

	unsigned long flags;
	struct stat_struct *p;

	local_irq_save(flags);
	/* Calculate address of per processor area */
	p = CPU_PTR(stat, smp_processor_id());
	p->counter++;
	local_irq_restore(flags);

This sequence can be replaced by a single atomic CPU operation:

	CPU_INC(stat->counter);

Most processors have instructions to perform the increment using
a single atomic instruction. Processors may have segment registers,
global registers or per cpu mappings of per cpu areas that can be used
to generate atomic instructions that combine the following in a single
operation:

1. Adding of an offset / register to a base address
2. Read modify write operation on the address calculated by
   the instruction.

If 1+2 are combined in one instruction then the instruction is atomic
vs. interrupts. This means that per cpu atomic operations do not need
to disable interrupts to increment counters etc.

The existing methods in use in the kernel cannot utilize the power of
these atomic instructions. local_t does not really address the issue
since the offset calculation is performed before the atomic operation.
The sequence is therefore not atomic. Disabling interrupts or preemption
is required in order to use local_t.

local_t is also very specific to the x86 processor. The solution here can
utilize other methods than just those provided by the x86 instruction set.
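
For contrast, using local_t today requires something like the following
(a sketch with a hypothetical per cpu counter):

	DEFINE_PER_CPU(local_t, hits);

	/* address calculation and atomic op are separate steps, so the
	   task must not be migrated in between */
	preempt_disable();
	local_inc(&__get_cpu_var(hits));
	preempt_enable();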



On x86 the above CPU_INC translates into a single instruction:

	inc %%gs:(&stat->counter)

This instruction is interrupt safe since it either completes entirely
or not at all. Both the adding of the offset and the read modify write
are combined in one instruction.

The determination of the correct per cpu area for the current processor
does not require access to smp_processor_id() (expensive...). The gs
register is used to provide a processor specific offset to the respective
per cpu area where the per cpu variable resides.

Note that the counter offset into the struct was added *before* the segment
selector was applied. This is necessary to avoid calculations. In the past
we first determined the address of the stats structure on the respective
processor and then added the field offset. However, the offset may as
well be added earlier. The adding of the per cpu offset (here through the
gs register) must be done by the instruction used for atomic per cpu
access.
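
A cpu ops capable arch would therefore implement the operation along
these lines (a simplified sketch of the idea only, not the actual x86
implementation later in this series; assumes an unsigned long field):

	/* gs based offset plus read modify write in one instruction */
	#define CPU_INC(var)	asm("incq %%gs:%0" : "+m" (var))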



If "stat" was declared via DECLARE_PER_CPU then this patchset is capable of
convincing the linker to provide the proper base address. In that case
no calculations are necessary.

Should the stat structure be reachable via a register then the address
calculation capabilities can be leveraged to avoid calculations.

On IA64 we can get the same combination of operations in a single instruction
by using the virtual address that always maps to the local per cpu area:

	fetchadd &stat->counter + (VCPU_BASE - __per_cpu_start)

The access is forced into the per cpu address reachable via the virtualized
address. IA64 allows the embedding of an offset into the instruction. So the
fetchadd can perform both the relocation of the pointer into the per cpu
area as well as the atomic read modify write cycle.



In order to be able to exploit the atomicity of these instructions we
introduce a series of new functions that take either:

1. A per cpu pointer as returned by cpu_alloc() or CPU_ALLOC().

2. A per cpu variable address as returned by per_cpu_var(<percpuvarname>).

CPU_READ()
CPU_WRITE()
CPU_INC()
CPU_DEC()
CPU_ADD()
CPU_SUB()
CPU_XCHG()
CPU_CMPXCHG()
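
The fallback implementation below also provides __CPU_* and _CPU_*
variants for contexts that already exclude preemption or interrupts.
A rough usage sketch (stats is a made-up structure obtained from
CPU_ALLOC):

	struct stats {
		unsigned long irq_events, events, fastpath;
	} *stats = CPU_ALLOC(struct stats, GFP_KERNEL | __GFP_ZERO);

	CPU_INC(stats->irq_events);	/* also modified from interrupt context */
	_CPU_INC(stats->events);	/* never modified from an interrupt */
	__CPU_INC(stats->fastpath);	/* preemption/interrupts already off */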

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/percpu.h |  135 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 135 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2008-05-28 22:31:43.000000000 -0700
+++ linux-2.6/include/linux/percpu.h	2008-05-28 23:38:17.000000000 -0700
@@ -179,4 +179,139 @@
 void *cpu_alloc(unsigned long size, gfp_t flags, unsigned long align);
 void cpu_free(void *cpu_pointer, unsigned long size);
 
+/*
+ * Fast atomic per cpu operations.
+ *
+ * The following operations can be overridden by arches to implement fast
+ * and efficient operations. The operations are atomic meaning that the
+ * determination of the processor, the calculation of the address and the
+ * operation on the data is an atomic operation.
+ *
+ * The parameter passed to the atomic per cpu operations is an lvalue not a
+ * pointer to the object.
+ */
+#ifndef CONFIG_HAVE_CPU_OPS
+
+/*
+ * Fallback in case the arch does not provide for atomic per cpu operations.
+ *
+ * The first group of macros is used when it is safe to update the per
+ * cpu variable because preemption is off (per cpu variables that are not
+ * updated from interrupt context) or because interrupts are already off.
+ */
+#define __CPU_READ(var)				\
+({						\
+	(*THIS_CPU(&(var)));			\
+})
+
+#define __CPU_WRITE(var, value)			\
+({						\
+	*THIS_CPU(&(var)) = (value);		\
+})
+
+#define __CPU_ADD(var, value)			\
+({						\
+	*THIS_CPU(&(var)) += (value);		\
+})
+
+#define __CPU_INC(var) __CPU_ADD((var), 1)
+#define __CPU_DEC(var) __CPU_ADD((var), -1)
+#define __CPU_SUB(var, value) __CPU_ADD((var), -(value))
+
+#define __CPU_CMPXCHG(var, old, new)		\
+({						\
+	typeof(obj) x;				\
+	typeof(obj) *p = THIS_CPU(&(obj));	\
+	x = *p;					\
+	if (x == (old))				\
+		*p = (new);			\
+	(x);					\
+})
+
+#define __CPU_XCHG(obj, new)			\
+({						\
+	typeof(obj) x;				\
+	typeof(obj) *p = THIS_CPU(&(obj));	\
+	x = *p;					\
+	*p = (new);				\
+	(x);					\
+})
+
+/*
+ * Second group used for per cpu variables that are not updated from an
+ * interrupt context. In that case we can simply disable preemption which
+ * may be free if the kernel is compiled without support for preemption.
+ */
+#define _CPU_READ __CPU_READ
+#define _CPU_WRITE __CPU_WRITE
+
+#define _CPU_ADD(var, value)			\
+({						\
+	preempt_disable();			\
+	__CPU_ADD((var), (value));		\
+	preempt_enable();			\
+})
+
+#define _CPU_INC(var) _CPU_ADD((var), 1)
+#define _CPU_DEC(var) _CPU_ADD((var), -1)
+#define _CPU_SUB(var, value) _CPU_ADD((var), -(value))
+
+#define _CPU_CMPXCHG(var, old, new)		\
+({						\
+	typeof(addr) x;				\
+	preempt_disable();			\
+	x = __CPU_CMPXCHG((var), (old), (new));	\
+	preempt_enable();			\
+	(x);					\
+})
+
+#define _CPU_XCHG(var, new)			\
+({						\
+	typeof(var) x;				\
+	preempt_disable();			\
+	x = __CPU_XCHG((var), (new));		\
+	preempt_enable();			\
+	(x);					\
+})
+
+/*
+ * Third group: Interrupt safe CPU functions
+ */
+#define CPU_READ __CPU_READ
+#define CPU_WRITE __CPU_WRITE
+
+#define CPU_ADD(var, value)			\
+({						\
+	unsigned long flags;			\
+	local_irq_save(flags);			\
+	__CPU_ADD((var), (value));		\
+	local_irq_restore(flags);		\
+})
+
+#define CPU_INC(var) CPU_ADD((var), 1)
+#define CPU_DEC(var) CPU_ADD((var), -1)
+#define CPU_SUB(var, value) CPU_ADD((var), -(value))
+
+#define CPU_CMPXCHG(var, old, new)		\
+({						\
+	unsigned long flags;			\
+	typeof(var) x;				\
+	local_irq_save(flags);			\
+	x = __CPU_CMPXCHG((var), (old), (new));	\
+	local_irq_restore(flags);		\
+	(x);					\
+})
+
+#define CPU_XCHG(var, new)			\
+({						\
+	unsigned long flags;			\
+	typeof(var) x;				\
+	local_irq_save(flags);			\
+	x = __CPU_XCHG((var), (new));		\
+	local_irq_restore(flags);		\
+	(x);					\
+})
+
+#endif /* CONFIG_HAVE_CPU_OPS */
+
 #endif /* __LINUX_PERCPU_H */

-- 


* [patch 05/41] cpu alloc: Percpu_counter conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (3 preceding siblings ...)
  2008-05-30  3:56 ` [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  6:47   ` Rusty Russell
  2008-05-30  3:56 ` [patch 06/41] cpu alloc: crash_notes conversion Christoph Lameter
                   ` (36 subsequent siblings)
  41 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_percpu_counter_conversion --]
[-- Type: text/plain, Size: 2210 bytes --]

Use cpu_alloc instead of allocpercpu.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 lib/percpu_counter.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2008-05-28 17:56:17.000000000 -0700
+++ linux-2.6/lib/percpu_counter.c	2008-05-28 18:30:25.000000000 -0700
@@ -20,7 +20,7 @@
 
 	spin_lock(&fbc->lock);
 	for_each_possible_cpu(cpu) {
-		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+		s32 *pcount = CPU_PTR(fbc->counters, cpu);
 		*pcount = 0;
 	}
 	fbc->count = amount;
@@ -31,20 +31,18 @@
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
 	s64 count;
-	s32 *pcount;
-	int cpu = get_cpu();
 
-	pcount = per_cpu_ptr(fbc->counters, cpu);
-	count = *pcount + amount;
+	preempt_disable();
+	count = __CPU_READ(*fbc->counters) + amount;
 	if (count >= batch || count <= -batch) {
 		spin_lock(&fbc->lock);
 		fbc->count += count;
-		*pcount = 0;
+		__CPU_WRITE(*fbc->counters, 0);
 		spin_unlock(&fbc->lock);
 	} else {
-		*pcount = count;
+		__CPU_WRITE(*fbc->counters, count);
 	}
-	put_cpu();
+	preempt_enable();
 }
 EXPORT_SYMBOL(__percpu_counter_add);
 
@@ -60,7 +58,7 @@
 	spin_lock(&fbc->lock);
 	ret = fbc->count;
 	for_each_online_cpu(cpu) {
-		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+		s32 *pcount = CPU_PTR(fbc->counters, cpu);
 		ret += *pcount;
 	}
 	spin_unlock(&fbc->lock);
@@ -74,7 +72,7 @@
 {
 	spin_lock_init(&fbc->lock);
 	fbc->count = amount;
-	fbc->counters = alloc_percpu(s32);
+	fbc->counters = CPU_ALLOC(s32, GFP_KERNEL|__GFP_ZERO);
 	if (!fbc->counters)
 		return -ENOMEM;
 #ifdef CONFIG_HOTPLUG_CPU
@@ -101,7 +99,7 @@
 	if (!fbc->counters)
 		return;
 
-	free_percpu(fbc->counters);
+	CPU_FREE(fbc->counters);
 	fbc->counters = NULL;
 #ifdef CONFIG_HOTPLUG_CPU
 	mutex_lock(&percpu_counters_lock);
@@ -128,7 +126,7 @@
 		unsigned long flags;
 
 		spin_lock_irqsave(&fbc->lock, flags);
-		pcount = per_cpu_ptr(fbc->counters, cpu);
+		pcount = CPU_PTR(fbc->counters, cpu);
 		fbc->count += *pcount;
 		*pcount = 0;
 		spin_unlock_irqrestore(&fbc->lock, flags);

-- 


* [patch 06/41] cpu alloc: crash_notes conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (4 preceding siblings ...)
  2008-05-30  3:56 ` [patch 05/41] cpu alloc: Percpu_counter conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 07/41] cpu alloc: Workqueue conversion Christoph Lameter
                   ` (35 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_crash_notes_conversion --]
[-- Type: text/plain, Size: 2229 bytes --]

Convert crash_notes access to use cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 arch/ia64/kernel/crash.c |    2 +-
 drivers/base/cpu.c       |    2 +-
 kernel/kexec.c           |    4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/ia64/kernel/crash.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/crash.c	2008-05-28 17:25:23.000000000 -0700
+++ linux-2.6/arch/ia64/kernel/crash.c	2008-05-28 18:27:04.000000000 -0700
@@ -72,7 +72,7 @@
 	dst[46] = (unsigned long)ia64_rse_skip_regs((unsigned long *)dst[46],
 			sof - sol);
 
-	buf = (u64 *) per_cpu_ptr(crash_notes, cpu);
+	buf = (u64 *) CPU_PTR(crash_notes, cpu);
 	if (!buf)
 		return;
 	buf = append_elf_note(buf, KEXEC_CORE_NOTE_NAME, NT_PRSTATUS, prstatus,
Index: linux-2.6/drivers/base/cpu.c
===================================================================
--- linux-2.6.orig/drivers/base/cpu.c	2008-05-28 17:25:23.000000000 -0700
+++ linux-2.6/drivers/base/cpu.c	2008-05-28 18:27:04.000000000 -0700
@@ -95,7 +95,7 @@
 	 * boot up and this data does not change there after. Hence this
 	 * operation should be safe. No locking required.
 	 */
-	addr = __pa(per_cpu_ptr(crash_notes, cpunum));
+	addr = __pa(CPU_PTR(crash_notes, cpunum));
 	rc = sprintf(buf, "%Lx\n", addr);
 	return rc;
 }
Index: linux-2.6/kernel/kexec.c
===================================================================
--- linux-2.6.orig/kernel/kexec.c	2008-05-28 17:25:23.000000000 -0700
+++ linux-2.6/kernel/kexec.c	2008-05-28 18:27:05.000000000 -0700
@@ -1121,7 +1121,7 @@
 	 * squirrelled away.  ELF notes happen to provide
 	 * all of that, so there is no need to invent something new.
 	 */
-	buf = (u32*)per_cpu_ptr(crash_notes, cpu);
+	buf = (u32 *)CPU_PTR(crash_notes, cpu);
 	if (!buf)
 		return;
 	memset(&prstatus, 0, sizeof(prstatus));
@@ -1135,7 +1135,7 @@
 static int __init crash_notes_memory_init(void)
 {
 	/* Allocate memory for saving cpu registers. */
-	crash_notes = alloc_percpu(note_buf_t);
+	crash_notes = CPU_ALLOC(note_buf_t, GFP_KERNEL|__GFP_ZERO);
 	if (!crash_notes) {
 		printk("Kexec: Memory allocation for saving cpu register"
 		" states failed\n");

-- 


* [patch 07/41] cpu alloc: Workqueue conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (5 preceding siblings ...)
  2008-05-30  3:56 ` [patch 06/41] cpu alloc: crash_notes conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 08/41] cpu alloc: ACPI cstate handling conversion Christoph Lameter
                   ` (34 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_workqueue_conversion --]
[-- Type: text/plain, Size: 4397 bytes --]

Convert the workqueue per cpu handling to cpu alloc.

The second parameter to wq_per_cpu() is always the current processor
id. So drop the parameter and use THIS_CPU in wq_per_cpu(), which we
rename to wq_this_cpu().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 kernel/workqueue.c |   27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

Index: linux-2.6/kernel/workqueue.c
===================================================================
--- linux-2.6.orig/kernel/workqueue.c	2008-05-28 22:02:19.000000000 -0700
+++ linux-2.6/kernel/workqueue.c	2008-05-28 22:52:29.000000000 -0700
@@ -95,11 +95,11 @@
 }
 
 static
-struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu)
+struct cpu_workqueue_struct *wq_this_cpu(struct workqueue_struct *wq)
 {
 	if (unlikely(is_single_threaded(wq)))
-		cpu = singlethread_cpu;
-	return per_cpu_ptr(wq->cpu_wq, cpu);
+		return CPU_PTR(wq->cpu_wq, singlethread_cpu);
+	return THIS_CPU(wq->cpu_wq);
 }
 
 /*
@@ -167,8 +167,9 @@
 
 	if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
 		BUG_ON(!list_empty(&work->entry));
-		__queue_work(wq_per_cpu(wq, get_cpu()), work);
-		put_cpu();
+		preempt_disable();
+		__queue_work(wq_this_cpu(wq), work);
+		preempt_enable();
 		ret = 1;
 	}
 	return ret;
@@ -181,7 +182,7 @@
 	struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work);
 	struct workqueue_struct *wq = cwq->wq;
 
-	__queue_work(wq_per_cpu(wq, smp_processor_id()), &dwork->work);
+	__queue_work(wq_this_cpu(wq), &dwork->work);
 }
 
 /**
@@ -225,7 +226,7 @@
 		timer_stats_timer_set_start_info(&dwork->timer);
 
 		/* This stores cwq for the moment, for the timer_fn */
-		set_wq_data(work, wq_per_cpu(wq, raw_smp_processor_id()));
+		set_wq_data(work, wq_this_cpu(wq));
 		timer->expires = jiffies + delay;
 		timer->data = (unsigned long)dwork;
 		timer->function = delayed_work_timer_fn;
@@ -398,7 +399,7 @@
 	lock_acquire(&wq->lockdep_map, 0, 0, 0, 2, _THIS_IP_);
 	lock_release(&wq->lockdep_map, 1, _THIS_IP_);
 	for_each_cpu_mask(cpu, *cpu_map)
-		flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
+		flush_cpu_workqueue(CPU_PTR(wq->cpu_wq, cpu));
 }
 EXPORT_SYMBOL_GPL(flush_workqueue);
 
@@ -478,7 +479,7 @@
 	cpu_map = wq_cpu_map(wq);
 
 	for_each_cpu_mask(cpu, *cpu_map)
-		wait_on_cpu_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+		wait_on_cpu_work(CPU_PTR(wq->cpu_wq, cpu), work);
 }
 
 static int __cancel_work_timer(struct work_struct *work,
@@ -598,21 +599,21 @@
 	int cpu;
 	struct work_struct *works;
 
-	works = alloc_percpu(struct work_struct);
+	works = CPU_ALLOC(struct work_struct, GFP_KERNEL);
 	if (!works)
 		return -ENOMEM;
 
 	get_online_cpus();
 	for_each_online_cpu(cpu) {
-		struct work_struct *work = per_cpu_ptr(works, cpu);
+		struct work_struct *work = CPU_PTR(works, cpu);
 
 		INIT_WORK(work, func);
 		set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
-		__queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), work);
+		__queue_work(CPU_PTR(keventd_wq->cpu_wq, cpu), work);
 	}
 	flush_workqueue(keventd_wq);
 	put_online_cpus();
-	free_percpu(works);
+	CPU_FREE(works);
 	return 0;
 }
 
@@ -661,7 +662,7 @@
 
 	BUG_ON(!keventd_wq);
 
-	cwq = per_cpu_ptr(keventd_wq->cpu_wq, cpu);
+	cwq = CPU_PTR(keventd_wq->cpu_wq, cpu);
 	if (current == cwq->thread)
 		ret = 1;
 
@@ -672,7 +673,7 @@
 static struct cpu_workqueue_struct *
 init_cpu_workqueue(struct workqueue_struct *wq, int cpu)
 {
-	struct cpu_workqueue_struct *cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+	struct cpu_workqueue_struct *cwq = CPU_PTR(wq->cpu_wq, cpu);
 
 	cwq->wq = wq;
 	spin_lock_init(&cwq->lock);
@@ -730,7 +731,8 @@
 	if (!wq)
 		return NULL;
 
-	wq->cpu_wq = alloc_percpu(struct cpu_workqueue_struct);
+	wq->cpu_wq = CPU_ALLOC(struct cpu_workqueue_struct,
+					GFP_KERNEL|__GFP_ZERO);
 	if (!wq->cpu_wq) {
 		kfree(wq);
 		return NULL;
@@ -814,10 +816,10 @@
 	spin_unlock(&workqueue_lock);
 
 	for_each_cpu_mask(cpu, *cpu_map)
-		cleanup_workqueue_thread(per_cpu_ptr(wq->cpu_wq, cpu));
+		cleanup_workqueue_thread(CPU_PTR(wq->cpu_wq, cpu));
 	put_online_cpus();
 
-	free_percpu(wq->cpu_wq);
+	CPU_FREE(wq->cpu_wq);
 	kfree(wq);
 }
 EXPORT_SYMBOL_GPL(destroy_workqueue);
@@ -838,7 +840,7 @@
 	}
 
 	list_for_each_entry(wq, &workqueues, list) {
-		cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+		cwq = CPU_PTR(wq->cpu_wq, cpu);
 
 		switch (action) {
 		case CPU_UP_PREPARE:

-- 


* [patch 08/41] cpu alloc: ACPI cstate handling conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (6 preceding siblings ...)
  2008-05-30  3:56 ` [patch 07/41] cpu alloc: Workqueue conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 09/41] cpu alloc: Genhd statistics conversion Christoph Lameter
                   ` (33 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_acpi_cstate_handling_conversion --]
[-- Type: text/plain, Size: 3311 bytes --]

Convert ACPI per cpu handling to cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 arch/x86/kernel/acpi/cstate.c              |    9 +++++----
 arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c |    7 ++++---
 drivers/acpi/processor_perflib.c           |    4 ++--
 3 files changed, 11 insertions(+), 9 deletions(-)

Index: linux-2.6/arch/x86/kernel/acpi/cstate.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/acpi/cstate.c	2008-04-29 14:55:48.000000000 -0700
+++ linux-2.6/arch/x86/kernel/acpi/cstate.c	2008-05-21 21:43:07.000000000 -0700
@@ -85,7 +85,7 @@
 	if (reg->bit_offset != NATIVE_CSTATE_BEYOND_HALT)
 		return -1;
 
-	percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
+	percpu_entry = CPU_PTR(cpu_cstate_entry, cpu);
 	percpu_entry->states[cx->index].eax = 0;
 	percpu_entry->states[cx->index].ecx = 0;
 
@@ -138,7 +138,7 @@
 	unsigned int cpu = smp_processor_id();
 	struct cstate_entry *percpu_entry;
 
-	percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
+	percpu_entry = CPU_PTR(cpu_cstate_entry, cpu);
 	mwait_idle_with_hints(percpu_entry->states[cx->index].eax,
 	                      percpu_entry->states[cx->index].ecx);
 }
@@ -150,13 +150,14 @@
 	if (c->x86_vendor != X86_VENDOR_INTEL)
 		return -1;
 
-	cpu_cstate_entry = alloc_percpu(struct cstate_entry);
+	cpu_cstate_entry = CPU_ALLOC(struct cstate_entry,
+					GFP_KERNEL|__GFP_ZERO);
 	return 0;
 }
 
 static void __exit ffh_cstate_exit(void)
 {
-	free_percpu(cpu_cstate_entry);
+	CPU_FREE(cpu_cstate_entry);
 	cpu_cstate_entry = NULL;
 }
 
Index: linux-2.6/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c	2008-04-29 14:55:48.000000000 -0700
+++ linux-2.6/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c	2008-05-21 21:44:02.000000000 -0700
@@ -524,7 +524,8 @@
 {
 	dprintk("acpi_cpufreq_early_init\n");
 
-	acpi_perf_data = alloc_percpu(struct acpi_processor_performance);
+	acpi_perf_data = CPU_ALLOC(struct acpi_processor_performance,
+						GFP_KERNEL|__GFP_ZERO);
 	if (!acpi_perf_data) {
 		dprintk("Memory allocation error for acpi_perf_data.\n");
 		return -ENOMEM;
@@ -580,7 +581,7 @@
 	if (!data)
 		return -ENOMEM;
 
-	data->acpi_data = percpu_ptr(acpi_perf_data, cpu);
+	data->acpi_data = CPU_PTR(acpi_perf_data, cpu);
 	per_cpu(drv_data, cpu) = data;
 
 	if (cpu_has(c, X86_FEATURE_CONSTANT_TSC))
@@ -794,7 +795,7 @@
 
 	cpufreq_unregister_driver(&acpi_cpufreq_driver);
 
-	free_percpu(acpi_perf_data);
+	CPU_FREE(acpi_perf_data);
 
 	return;
 }
Index: linux-2.6/drivers/acpi/processor_perflib.c
===================================================================
--- linux-2.6.orig/drivers/acpi/processor_perflib.c	2008-04-29 14:55:49.000000000 -0700
+++ linux-2.6/drivers/acpi/processor_perflib.c	2008-05-21 21:43:07.000000000 -0700
@@ -583,12 +583,12 @@
 			continue;
 		}
 
-		if (!performance || !percpu_ptr(performance, i)) {
+		if (!performance || !CPU_PTR(performance, i)) {
 			retval = -EINVAL;
 			continue;
 		}
 
-		pr->performance = percpu_ptr(performance, i);
+		pr->performance = CPU_PTR(performance, i);
 		cpu_set(i, pr->performance->shared_cpu_map);
 		if (acpi_processor_get_psd(pr)) {
 			retval = -EINVAL;

-- 


* [patch 09/41] cpu alloc: Genhd statistics conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (7 preceding siblings ...)
  2008-05-30  3:56 ` [patch 08/41] cpu alloc: ACPI cstate handling conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 10/41] cpu alloc: blktrace conversion Christoph Lameter
                   ` (32 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_genhd_statistics_conversion --]
[-- Type: text/plain, Size: 7166 bytes --]

Convert genhd statistics to cpu alloc. The patch also drops the UP special
casing of the statistics.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/genhd.h |  113 +++++++++++---------------------------------------
 1 file changed, 25 insertions(+), 88 deletions(-)

Index: linux-2.6/include/linux/genhd.h
===================================================================
--- linux-2.6.orig/include/linux/genhd.h	2008-05-26 09:35:31.626487665 -0700
+++ linux-2.6/include/linux/genhd.h	2008-05-26 11:09:00.740248203 -0700
@@ -95,11 +95,7 @@ struct hd_struct {
 #endif
 	unsigned long stamp;
 	int in_flight;
-#ifdef	CONFIG_SMP
 	struct disk_stats *dkstats;
-#else
-	struct disk_stats dkstats;
-#endif
 };
 
 #define GENHD_FL_REMOVABLE			1
@@ -135,11 +131,7 @@ struct gendisk {
 	atomic_t sync_io;		/* RAID */
 	unsigned long stamp;
 	int in_flight;
-#ifdef	CONFIG_SMP
 	struct disk_stats *dkstats;
-#else
-	struct disk_stats dkstats;
-#endif
 	struct work_struct async_notify;
 };
 
@@ -163,16 +155,15 @@ static inline struct hd_struct *get_part
 	return NULL;
 }
 
-#ifdef	CONFIG_SMP
 #define __disk_stat_add(gendiskp, field, addnd) 	\
-	(per_cpu_ptr(gendiskp->dkstats, smp_processor_id())->field += addnd)
+	__CPU_ADD(gendiskp->dkstats->field, addnd)
 
 #define disk_stat_read(gendiskp, field)					\
 ({									\
 	typeof(gendiskp->dkstats->field) res = 0;			\
 	int i;								\
 	for_each_possible_cpu(i)					\
-		res += per_cpu_ptr(gendiskp->dkstats, i)->field;	\
+		res += CPU_PTR(gendiskp->dkstats, i)->field;		\
 	res;								\
 })
 
@@ -180,12 +171,12 @@ static inline void disk_stat_set_all(str
 	int i;
 
 	for_each_possible_cpu(i)
-		memset(per_cpu_ptr(gendiskp->dkstats, i), value,
+		memset(CPU_PTR(gendiskp->dkstats, i), value,
 				sizeof(struct disk_stats));
-}		
+}
 
-#define __part_stat_add(part, field, addnd)				\
-	(per_cpu_ptr(part->dkstats, smp_processor_id())->field += addnd)
+#define __part_stat_add(part, field, addnd)			\
+	__CPU_ADD(part->dkstats->field, addnd)
 
 #define __all_stat_add(gendiskp, part, field, addnd, sector)	\
 ({								\
@@ -199,7 +190,7 @@ static inline void disk_stat_set_all(str
 	typeof(part->dkstats->field) res = 0;				\
 	int i;								\
 	for_each_possible_cpu(i)					\
-		res += per_cpu_ptr(part->dkstats, i)->field;		\
+		res += CPU_PTR(part->dkstats, i)->field;		\
 	res;								\
 })
 
@@ -208,56 +199,23 @@ static inline void part_stat_set_all(str
 	int i;
 
 	for_each_possible_cpu(i)
-		memset(per_cpu_ptr(part->dkstats, i), value,
+		memset(CPU_PTR(part->dkstats, i), value,
 				sizeof(struct disk_stats));
 }
-				
-#else /* !CONFIG_SMP */
-#define __disk_stat_add(gendiskp, field, addnd) \
-				(gendiskp->dkstats.field += addnd)
-#define disk_stat_read(gendiskp, field)	(gendiskp->dkstats.field)
-
-static inline void disk_stat_set_all(struct gendisk *gendiskp, int value)
-{
-	memset(&gendiskp->dkstats, value, sizeof (struct disk_stats));
-}
-
-#define __part_stat_add(part, field, addnd) \
-	(part->dkstats.field += addnd)
-
-#define __all_stat_add(gendiskp, part, field, addnd, sector)	\
-({								\
-	if (part)						\
-		part->dkstats.field += addnd;			\
-	__disk_stat_add(gendiskp, field, addnd);		\
-})
-
-#define part_stat_read(part, field)	(part->dkstats.field)
-
-static inline void part_stat_set_all(struct hd_struct *part, int value)
-{
-	memset(&part->dkstats, value, sizeof(struct disk_stats));
-}
-
-#endif /* CONFIG_SMP */
 
 #define disk_stat_add(gendiskp, field, addnd)			\
-	do {							\
-		preempt_disable();				\
-		__disk_stat_add(gendiskp, field, addnd);	\
-		preempt_enable();				\
-	} while (0)
+	_CPU_ADD(gendiskp->dkstats->field, addnd)
 
-#define __disk_stat_dec(gendiskp, field) __disk_stat_add(gendiskp, field, -1)
-#define disk_stat_dec(gendiskp, field) disk_stat_add(gendiskp, field, -1)
+#define __disk_stat_dec(gendiskp, field) __CPU_DEC(gendiskp->dkstats->field)
+#define disk_stat_dec(gendiskp, field) _CPU_DEC(gendiskp->dkstats->field)
 
-#define __disk_stat_inc(gendiskp, field) __disk_stat_add(gendiskp, field, 1)
-#define disk_stat_inc(gendiskp, field) disk_stat_add(gendiskp, field, 1)
+#define __disk_stat_inc(gendiskp, field) __CPU_INC(gendiskp->dkstats->field)
+#define disk_stat_inc(gendiskp, field) _CPU_INC(gendiskp->dkstats->field)
 
 #define __disk_stat_sub(gendiskp, field, subnd) \
-		__disk_stat_add(gendiskp, field, -subnd)
+		__CPU_SUB(gendiskp->dkstats->field, subnd)
 #define disk_stat_sub(gendiskp, field, subnd) \
-		disk_stat_add(gendiskp, field, -subnd)
+		_CPU_SUB(gendiskp->dkstats->field, subnd)
 
 #define part_stat_add(gendiskp, field, addnd)		\
 	do {						\
@@ -266,16 +224,16 @@ static inline void part_stat_set_all(str
 		preempt_enable();			\
 	} while (0)
 
-#define __part_stat_dec(gendiskp, field) __part_stat_add(gendiskp, field, -1)
-#define part_stat_dec(gendiskp, field) part_stat_add(gendiskp, field, -1)
+#define __part_stat_dec(gendiskp, field) __CPU_DEC(gendiskp->dkstats->field)
+#define part_stat_dec(gendiskp, field) _CPU_DEC(gendiskp->dkstats->field)
 
-#define __part_stat_inc(gendiskp, field) __part_stat_add(gendiskp, field, 1)
-#define part_stat_inc(gendiskp, field) part_stat_add(gendiskp, field, 1)
+#define __part_stat_inc(gendiskp, field) __CPU_INC(gendiskp->dkstats->field)
+#define part_stat_inc(gendiskp, field) _CPU_INC(gendiskp->dkstats->field)
 
 #define __part_stat_sub(gendiskp, field, subnd) \
-		__part_stat_add(gendiskp, field, -subnd)
+		__CPU_SUB(gendiskp->dkstats->field, subnd)
 #define part_stat_sub(gendiskp, field, subnd) \
-		part_stat_add(gendiskp, field, -subnd)
+		_CPU_SUB(gendiskp->dkstats->field, subnd)
 
 #define all_stat_add(gendiskp, part, field, addnd, sector)	\
 	do {							\
@@ -300,10 +258,9 @@ static inline void part_stat_set_all(str
 		all_stat_add(gendiskp, part, field, -subnd, sector)
 
 /* Inlines to alloc and free disk stats in struct gendisk */
-#ifdef  CONFIG_SMP
 static inline int init_disk_stats(struct gendisk *disk)
 {
-	disk->dkstats = alloc_percpu(struct disk_stats);
+	disk->dkstats = CPU_ALLOC(struct disk_stats, GFP_KERNEL | __GFP_ZERO);
 	if (!disk->dkstats)
 		return 0;
 	return 1;
@@ -311,12 +268,12 @@ static inline int init_disk_stats(struct
 
 static inline void free_disk_stats(struct gendisk *disk)
 {
-	free_percpu(disk->dkstats);
+	CPU_FREE(disk->dkstats);
 }
 
 static inline int init_part_stats(struct hd_struct *part)
 {
-	part->dkstats = alloc_percpu(struct disk_stats);
+	part->dkstats = CPU_ALLOC(struct disk_stats, GFP_KERNEL|__GFP_ZERO);
 	if (!part->dkstats)
 		return 0;
 	return 1;
@@ -324,28 +281,8 @@ static inline int init_part_stats(struct
 
 static inline void free_part_stats(struct hd_struct *part)
 {
-	free_percpu(part->dkstats);
-}
-
-#else	/* CONFIG_SMP */
-static inline int init_disk_stats(struct gendisk *disk)
-{
-	return 1;
-}
-
-static inline void free_disk_stats(struct gendisk *disk)
-{
-}
-
-static inline int init_part_stats(struct hd_struct *part)
-{
-	return 1;
-}
-
-static inline void free_part_stats(struct hd_struct *part)
-{
+	CPU_FREE(part->dkstats);
 }
-#endif	/* CONFIG_SMP */
 
 /* drivers/block/ll_rw_blk.c */
 extern void disk_round_stats(struct gendisk *disk);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 10/41] cpu alloc: blktrace conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (8 preceding siblings ...)
  2008-05-30  3:56 ` [patch 09/41] cpu alloc: Genhd statistics conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 11/41] cpu alloc: SRCU cpu alloc conversion Christoph Lameter
                   ` (31 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_blktrace_conversion --]
[-- Type: text/plain, Size: 2545 bytes --]

Convert blktrace percpu handling to cpu_alloc.
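
Besides the typed CPU_ALLOC() macro this conversion also uses the raw
cpu_alloc(size, gfp, align) call for the untyped message buffer and
THIS_CPU() to address the current processor's instance. A rough sketch of
that pattern, mirroring the hunks below with invented names and a made-up
buffer size:

	#define MY_MSG_MAX	128	/* stand-in for BLK_TN_MAX_MSG */

	static unsigned long *my_seq;	/* per cpu sequence counter */
	static char *my_msg;		/* per cpu message buffer */

	static int my_trace_setup(void)
	{
		my_seq = CPU_ALLOC(unsigned long, GFP_KERNEL | __GFP_ZERO);
		/* Untyped allocation: explicit size, gfp flags, alignment */
		my_msg = cpu_alloc(MY_MSG_MAX, GFP_KERNEL | __GFP_ZERO, 0);
		if (!my_seq || !my_msg) {
			CPU_FREE(my_seq);
			CPU_FREE(my_msg);
			return -ENOMEM;
		}
		return 0;
	}

	static void my_trace_note(const char *msg)
	{
		char *buf;

		preempt_disable();
		buf = THIS_CPU(my_msg);		/* this cpu's buffer */
		/* Same calling convention as the bt->sequence hunks below */
		__CPU_INC(my_seq);
		snprintf(buf, MY_MSG_MAX, "%lu: %s", __CPU_READ(my_seq), msg);
		preempt_enable();
	}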

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 block/blktrace.c |   24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

Index: linux-2.6/block/blktrace.c
===================================================================
--- linux-2.6.orig/block/blktrace.c	2008-05-28 18:45:08.580239875 -0700
+++ linux-2.6/block/blktrace.c	2008-05-29 00:02:19.570236238 -0700
@@ -82,7 +82,7 @@ void __trace_note_message(struct blk_tra
 	char *buf;
 
 	preempt_disable();
-	buf = per_cpu_ptr(bt->msg_data, smp_processor_id());
+	buf = THIS_CPU(bt->msg_data);
 	va_start(args, fmt);
 	n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args);
 	va_end(args);
@@ -138,9 +138,7 @@ void __blk_add_trace(struct blk_trace *b
 	struct task_struct *tsk = current;
 	struct blk_io_trace *t;
 	unsigned long flags;
-	unsigned long *sequence;
 	pid_t pid;
-	int cpu;
 
 	if (unlikely(bt->trace_state != Blktrace_running))
 		return;
@@ -170,18 +168,16 @@ void __blk_add_trace(struct blk_trace *b
 
 	t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len);
 	if (t) {
-		cpu = smp_processor_id();
-		sequence = per_cpu_ptr(bt->sequence, cpu);
-
+		__CPU_INC(bt->sequence);
 		t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
-		t->sequence = ++(*sequence);
+		t->sequence = __CPU_READ(bt->sequence);
 		t->time = ktime_to_ns(ktime_get());
 		t->sector = sector;
 		t->bytes = bytes;
 		t->action = what;
 		t->pid = pid;
 		t->device = bt->dev;
-		t->cpu = cpu;
+		t->cpu = smp_processor_id();
 		t->error = error;
 		t->pdu_len = pdu_len;
 
@@ -248,8 +244,8 @@ static void blk_trace_cleanup(struct blk
 	relay_close(bt->rchan);
 	debugfs_remove(bt->dropped_file);
 	blk_remove_tree(bt->dir);
-	free_percpu(bt->sequence);
-	free_percpu(bt->msg_data);
+	CPU_FREE(bt->sequence);
+	CPU_FREE(bt->msg_data);
 	kfree(bt);
 }
 
@@ -360,11 +356,11 @@ int do_blk_trace_setup(struct request_qu
 	if (!bt)
 		goto err;
 
-	bt->sequence = alloc_percpu(unsigned long);
+	bt->sequence = CPU_ALLOC(unsigned long, GFP_KERNEL | __GFP_ZERO);
 	if (!bt->sequence)
 		goto err;
 
-	bt->msg_data = __alloc_percpu(BLK_TN_MAX_MSG);
+	bt->msg_data = cpu_alloc(BLK_TN_MAX_MSG, GFP_KERNEL | __GFP_ZERO, 0);
 	if (!bt->msg_data)
 		goto err;
 
@@ -413,8 +409,8 @@ err:
 	if (bt) {
 		if (bt->dropped_file)
 			debugfs_remove(bt->dropped_file);
-		free_percpu(bt->sequence);
-		free_percpu(bt->msg_data);
+		CPU_FREE(bt->sequence);
+		CPU_FREE(bt->msg_data);
 		if (bt->rchan)
 			relay_close(bt->rchan);
 		kfree(bt);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 11/41] cpu alloc: SRCU cpu alloc conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (9 preceding siblings ...)
  2008-05-30  3:56 ` [patch 10/41] cpu alloc: blktrace conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 12/41] cpu alloc: XFS counter conversion Christoph Lameter
                   ` (30 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_srcu_conversion --]
[-- Type: text/plain, Size: 2459 bytes --]

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 kernel/rcutorture.c |    4 ++--
 kernel/srcu.c       |   20 ++++++++------------
 2 files changed, 10 insertions(+), 14 deletions(-)

Index: linux-2.6/kernel/rcutorture.c
===================================================================
--- linux-2.6.orig/kernel/rcutorture.c	2008-04-29 14:55:55.000000000 -0700
+++ linux-2.6/kernel/rcutorture.c	2008-05-21 21:46:19.000000000 -0700
@@ -442,8 +442,8 @@
 		       torture_type, TORTURE_FLAG, idx);
 	for_each_possible_cpu(cpu) {
 		cnt += sprintf(&page[cnt], " %d(%d,%d)", cpu,
-			       per_cpu_ptr(srcu_ctl.per_cpu_ref, cpu)->c[!idx],
-			       per_cpu_ptr(srcu_ctl.per_cpu_ref, cpu)->c[idx]);
+			       CPU_PTR(srcu_ctl.per_cpu_ref, cpu)->c[!idx],
+			       CPU_PTR(srcu_ctl.per_cpu_ref, cpu)->c[idx]);
 	}
 	cnt += sprintf(&page[cnt], "\n");
 	return cnt;
Index: linux-2.6/kernel/srcu.c
===================================================================
--- linux-2.6.orig/kernel/srcu.c	2008-02-16 20:28:44.000000000 -0800
+++ linux-2.6/kernel/srcu.c	2008-05-21 21:46:19.000000000 -0700
@@ -46,7 +46,8 @@
 {
 	sp->completed = 0;
 	mutex_init(&sp->mutex);
-	sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
+	sp->per_cpu_ref = CPU_ALLOC(struct srcu_struct_array,
+						GFP_KERNEL|__GFP_ZERO);
 	return (sp->per_cpu_ref ? 0 : -ENOMEM);
 }
 
@@ -62,7 +63,7 @@
 
 	sum = 0;
 	for_each_possible_cpu(cpu)
-		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
+		sum += CPU_PTR(sp->per_cpu_ref, cpu)->c[idx];
 	return sum;
 }
 
@@ -94,7 +95,7 @@
 	WARN_ON(sum);  /* Leakage unless caller handles error. */
 	if (sum != 0)
 		return;
-	free_percpu(sp->per_cpu_ref);
+	CPU_FREE(sp->per_cpu_ref);
 	sp->per_cpu_ref = NULL;
 }
 
@@ -110,12 +111,9 @@
 {
 	int idx;
 
-	preempt_disable();
 	idx = sp->completed & 0x1;
-	barrier();  /* ensure compiler looks -once- at sp->completed. */
-	per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
-	srcu_barrier();  /* ensure compiler won't misorder critical section. */
-	preempt_enable();
+	srcu_barrier();
+	_CPU_INC(sp->per_cpu_ref->c[idx]);
 	return idx;
 }
 
@@ -131,10 +129,8 @@
  */
 void srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
-	preempt_disable();
-	srcu_barrier();  /* ensure compiler won't misorder critical section. */
-	per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
-	preempt_enable();
+	srcu_barrier();
+	_CPU_DEC(sp->per_cpu_ref->c[idx]);
 }
 
 /**

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 12/41] cpu alloc: XFS counter conversion.
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (10 preceding siblings ...)
  2008-05-30  3:56 ` [patch 11/41] cpu alloc: SRCU cpu alloc conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 13/41] cpu alloc: NFS statistics Christoph Lameter
                   ` (29 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_xfs_counter_conversion --]
[-- Type: text/plain, Size: 2829 bytes --]

Also remove the now redundant zeroing after allocation: allocpercpu() already
returned zeroed objects, and the CPU_ALLOC() call below passes __GFP_ZERO to
get the same behaviour.
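
The before/after of that zeroing, condensed from the hunks in
xfs_icsb_init_counters() below:

	/* Before: allocate, then clear every online cpu's copy by hand */
	mp->m_sb_cnts = alloc_percpu(xfs_icsb_cnts_t);
	for_each_online_cpu(i) {
		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
		memset(cntp, 0, sizeof(xfs_icsb_cnts_t));
	}

	/* After: __GFP_ZERO hands back already cleared per cpu memory */
	mp->m_sb_cnts = CPU_ALLOC(xfs_icsb_cnts_t, GFP_KERNEL | __GFP_ZERO);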

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/xfs/xfs_mount.c |   24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

Index: linux-2.6/fs/xfs/xfs_mount.c
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_mount.c	2008-05-08 22:10:42.000000000 -0700
+++ linux-2.6/fs/xfs/xfs_mount.c	2008-05-21 21:46:27.000000000 -0700
@@ -2016,7 +2016,7 @@
 
 	mp = (xfs_mount_t *)container_of(nfb, xfs_mount_t, m_icsb_notifier);
 	cntp = (xfs_icsb_cnts_t *)
-			per_cpu_ptr(mp->m_sb_cnts, (unsigned long)hcpu);
+			CPU_PTR(mp->m_sb_cnts, (unsigned long)hcpu);
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
@@ -2065,10 +2065,7 @@
 xfs_icsb_init_counters(
 	xfs_mount_t	*mp)
 {
-	xfs_icsb_cnts_t *cntp;
-	int		i;
-
-	mp->m_sb_cnts = alloc_percpu(xfs_icsb_cnts_t);
+	mp->m_sb_cnts = CPU_ALLOC(xfs_icsb_cnts_t, GFP_KERNEL | __GFP_ZERO);
 	if (mp->m_sb_cnts == NULL)
 		return -ENOMEM;
 
@@ -2078,11 +2075,6 @@
 	register_hotcpu_notifier(&mp->m_icsb_notifier);
 #endif /* CONFIG_HOTPLUG_CPU */
 
-	for_each_online_cpu(i) {
-		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
-		memset(cntp, 0, sizeof(xfs_icsb_cnts_t));
-	}
-
 	mutex_init(&mp->m_icsb_mutex);
 
 	/*
@@ -2115,7 +2107,7 @@
 {
 	if (mp->m_sb_cnts) {
 		unregister_hotcpu_notifier(&mp->m_icsb_notifier);
-		free_percpu(mp->m_sb_cnts);
+		CPU_FREE(mp->m_sb_cnts);
 	}
 	mutex_destroy(&mp->m_icsb_mutex);
 }
@@ -2145,7 +2137,7 @@
 	int		i;
 
 	for_each_online_cpu(i) {
-		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
+		cntp = (xfs_icsb_cnts_t *)CPU_PTR(mp->m_sb_cnts, i);
 		xfs_icsb_lock_cntr(cntp);
 	}
 }
@@ -2158,7 +2150,7 @@
 	int		i;
 
 	for_each_online_cpu(i) {
-		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
+		cntp = (xfs_icsb_cnts_t *)CPU_PTR(mp->m_sb_cnts, i);
 		xfs_icsb_unlock_cntr(cntp);
 	}
 }
@@ -2178,7 +2170,7 @@
 		xfs_icsb_lock_all_counters(mp);
 
 	for_each_online_cpu(i) {
-		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
+		cntp = (xfs_icsb_cnts_t *)CPU_PTR(mp->m_sb_cnts, i);
 		cnt->icsb_icount += cntp->icsb_icount;
 		cnt->icsb_ifree += cntp->icsb_ifree;
 		cnt->icsb_fdblocks += cntp->icsb_fdblocks;
@@ -2254,7 +2246,7 @@
 
 	xfs_icsb_lock_all_counters(mp);
 	for_each_online_cpu(i) {
-		cntp = per_cpu_ptr(mp->m_sb_cnts, i);
+		cntp = CPU_PTR(mp->m_sb_cnts, i);
 		switch (field) {
 		case XFS_SBS_ICOUNT:
 			cntp->icsb_icount = count + resid;
@@ -2391,7 +2383,7 @@
 	might_sleep();
 again:
 	cpu = get_cpu();
-	icsbp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, cpu);
+	icsbp = (xfs_icsb_cnts_t *)CPU_PTR(mp->m_sb_cnts, cpu);
 
 	/*
 	 * if the counter is disabled, go to slow path

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 13/41] cpu alloc: NFS statistics
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (11 preceding siblings ...)
  2008-05-30  3:56 ` [patch 12/41] cpu alloc: XFS counter conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 14/41] cpu alloc: Neighbour statistics Christoph Lameter
                   ` (28 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_nfs_statistics_conversion --]
[-- Type: text/plain, Size: 2562 bytes --]

Convert NFS statistics to cpu alloc.
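
The byte counters are 64 bit, which is why the conversion keeps an explicit
preemption disabled path for 32 bit kernels. The same idea in isolation,
with an invented structure standing in for struct nfs_iostats (the argument
is the per cpu base pointer):

	struct my_iostats {
		unsigned long long bytes;	/* 64 bit even on 32 bit */
	};

	static void my_add_bytes(struct my_iostats *iostats, unsigned long addend)
	{
	#ifdef CONFIG_64BIT
		/* A single preempt safe per cpu add is enough */
		_CPU_ADD(iostats->bytes, addend);
	#else
		/* No atomic 64 bit per cpu add on 32 bit: pin the cpu */
		preempt_disable();
		THIS_CPU(iostats)->bytes += addend;
		preempt_enable_no_resched();
	#endif
	}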

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/nfs/iostat.h |   27 +++++++++++----------------
 fs/nfs/super.c  |    2 +-
 2 files changed, 12 insertions(+), 17 deletions(-)

Index: linux-2.6/fs/nfs/iostat.h
===================================================================
--- linux-2.6.orig/fs/nfs/iostat.h	2008-05-26 09:35:31.077736336 -0700
+++ linux-2.6/fs/nfs/iostat.h	2008-05-26 10:01:26.178776581 -0700
@@ -119,13 +119,7 @@ struct nfs_iostats {
 
 static inline void nfs_inc_server_stats(struct nfs_server *server, enum nfs_stat_eventcounters stat)
 {
-	struct nfs_iostats *iostats;
-	int cpu;
-
-	cpu = get_cpu();
-	iostats = per_cpu_ptr(server->io_stats, cpu);
-	iostats->events[stat] ++;
-	put_cpu_no_resched();
+	_CPU_INC(server->io_stats->events[stat]);
 }
 
 static inline void nfs_inc_stats(struct inode *inode, enum nfs_stat_eventcounters stat)
@@ -135,13 +129,14 @@ static inline void nfs_inc_stats(struct 
 
 static inline void nfs_add_server_stats(struct nfs_server *server, enum nfs_stat_bytecounters stat, unsigned long addend)
 {
-	struct nfs_iostats *iostats;
-	int cpu;
-
-	cpu = get_cpu();
-	iostats = per_cpu_ptr(server->io_stats, cpu);
-	iostats->bytes[stat] += addend;
-	put_cpu_no_resched();
+#ifdef CONFIG_64BIT
+	_CPU_ADD(server->io_stats->bytes[stat], addend);
+#else
+	/* 32 bit cannot perform an atomic 64 bit add, so disable preemption */
+	preempt_disable();
+	THIS_CPU(server->io_stats)->bytes[stat] += addend;
+	preempt_enable_no_resched();
+#endif
 }
 
 static inline void nfs_add_stats(struct inode *inode, enum nfs_stat_bytecounters stat, unsigned long addend)
@@ -151,13 +146,13 @@ static inline void nfs_add_stats(struct 
 
 static inline struct nfs_iostats *nfs_alloc_iostats(void)
 {
-	return alloc_percpu(struct nfs_iostats);
+	return CPU_ALLOC(struct nfs_iostats, GFP_KERNEL | __GFP_ZERO);
 }
 
 static inline void nfs_free_iostats(struct nfs_iostats *stats)
 {
 	if (stats != NULL)
-		free_percpu(stats);
+		CPU_FREE(stats);
 }
 
 #endif
Index: linux-2.6/fs/nfs/super.c
===================================================================
--- linux-2.6.orig/fs/nfs/super.c	2008-05-26 09:35:31.097738734 -0700
+++ linux-2.6/fs/nfs/super.c	2008-05-26 09:36:15.900248119 -0700
@@ -620,7 +620,7 @@ static int nfs_show_stats(struct seq_fil
 		struct nfs_iostats *stats;
 
 		preempt_disable();
-		stats = per_cpu_ptr(nfss->io_stats, cpu);
+		stats = CPU_PTR(nfss->io_stats, cpu);
 
 		for (i = 0; i < __NFSIOS_COUNTSMAX; i++)
 			totals.events[i] += stats->events[i];

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 14/41] cpu alloc: Neighbour statistics
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (12 preceding siblings ...)
  2008-05-30  3:56 ` [patch 13/41] cpu alloc: NFS statistics Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 15/41] cpu_alloc: Convert ip route statistics Christoph Lameter
                   ` (27 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_neighbor_statistics_conversion --]
[-- Type: text/plain, Size: 2301 bytes --]

Convert neighbor stats to cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/net/neighbour.h |    6 +-----
 net/core/neighbour.c    |   11 ++++++-----
 2 files changed, 7 insertions(+), 10 deletions(-)

Index: linux-2.6/include/net/neighbour.h
===================================================================
--- linux-2.6.orig/include/net/neighbour.h	2008-05-21 23:59:14.000000000 -0700
+++ linux-2.6/include/net/neighbour.h	2008-05-22 07:40:37.000000000 -0700
@@ -87,12 +87,7 @@
 	unsigned long forced_gc_runs;	/* number of forced GC runs */
 };
 
-#define NEIGH_CACHE_STAT_INC(tbl, field)				\
-	do {								\
-		preempt_disable();					\
-		(per_cpu_ptr((tbl)->stats, smp_processor_id())->field)++; \
-		preempt_enable();					\
-	} while (0)
+#define NEIGH_CACHE_STAT_INC(tbl, field) _CPU_INC((tbl)->stats->field)
 
 struct neighbour
 {
Index: linux-2.6/net/core/neighbour.c
===================================================================
--- linux-2.6.orig/net/core/neighbour.c	2008-05-21 23:59:14.000000000 -0700
+++ linux-2.6/net/core/neighbour.c	2008-05-22 00:00:06.000000000 -0700
@@ -1425,7 +1425,8 @@
 			kmem_cache_create(tbl->id, tbl->entry_size, 0,
 					  SLAB_HWCACHE_ALIGN|SLAB_PANIC,
 					  NULL);
-	tbl->stats = alloc_percpu(struct neigh_statistics);
+	tbl->stats = CPU_ALLOC(struct neigh_statistics,
+					GFP_KERNEL | __GFP_ZERO);
 	if (!tbl->stats)
 		panic("cannot create neighbour cache statistics");
 
@@ -1511,7 +1512,7 @@
 
 	remove_proc_entry(tbl->id, init_net.proc_net_stat);
 
-	free_percpu(tbl->stats);
+	CPU_FREE(tbl->stats);
 	tbl->stats = NULL;
 
 	kmem_cache_destroy(tbl->kmem_cachep);
@@ -1769,7 +1770,7 @@
 		for_each_possible_cpu(cpu) {
 			struct neigh_statistics	*st;
 
-			st = per_cpu_ptr(tbl->stats, cpu);
+			st = CPU_PTR(tbl->stats, cpu);
 			ndst.ndts_allocs		+= st->allocs;
 			ndst.ndts_destroys		+= st->destroys;
 			ndst.ndts_hash_grows		+= st->hash_grows;
@@ -2429,7 +2430,7 @@
 		if (!cpu_possible(cpu))
 			continue;
 		*pos = cpu+1;
-		return per_cpu_ptr(tbl->stats, cpu);
+		return CPU_PTR(tbl->stats, cpu);
 	}
 	return NULL;
 }
@@ -2444,7 +2445,7 @@
 		if (!cpu_possible(cpu))
 			continue;
 		*pos = cpu+1;
-		return per_cpu_ptr(tbl->stats, cpu);
+		return CPU_PTR(tbl->stats, cpu);
 	}
 	return NULL;
 }

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 15/41] cpu_alloc: Convert ip route statistics
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (13 preceding siblings ...)
  2008-05-30  3:56 ` [patch 14/41] cpu alloc: Neighbour statistics Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 16/41] cpu alloc: Tcp statistics conversion Christoph Lameter
                   ` (26 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_ip_rt_act_conversion --]
[-- Type: text/plain, Size: 1472 bytes --]

Convert IP route stats to cpu alloc.
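
This one allocates an array of 256 entries per cpu rather than a single
object, so it uses the raw cpu_alloc() call with an explicit size and
alignment instead of the CPU_ALLOC() type macro. The pattern in isolation,
with an invented element type:

	struct my_acct {
		u32 o_bytes;
		u32 o_packets;
	};

	static struct my_acct *my_acct;		/* 256 entries per cpu */

	static int my_acct_init(void)
	{
		my_acct = cpu_alloc(256 * sizeof(struct my_acct),
				GFP_KERNEL | __GFP_ZERO,
				__alignof__(struct my_acct));
		return my_acct ? 0 : -ENOMEM;
	}

	static void my_acct_charge(u32 idx, unsigned int len)
	{
		/* Index into this cpu's copy; BHs are assumed off here */
		struct my_acct *st = THIS_CPU(my_acct);

		st[idx & 0xFF].o_packets++;
		st[idx & 0xFF].o_bytes += len;
	}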

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6/net/ipv4/ip_input.c
===================================================================
--- linux-2.6.orig/net/ipv4/ip_input.c	2008-05-21 22:37:24.000000000 -0700
+++ linux-2.6/net/ipv4/ip_input.c	2008-05-21 22:38:39.000000000 -0700
@@ -345,7 +345,7 @@
 
 #ifdef CONFIG_NET_CLS_ROUTE
 	if (unlikely(skb->dst->tclassid)) {
-		struct ip_rt_acct *st = per_cpu_ptr(ip_rt_acct, smp_processor_id());
+		struct ip_rt_acct *st = THIS_CPU(ip_rt_acct);
 		u32 idx = skb->dst->tclassid;
 		st[idx&0xFF].o_packets++;
 		st[idx&0xFF].o_bytes+=skb->len;
Index: linux-2.6/net/ipv4/route.c
===================================================================
--- linux-2.6.orig/net/ipv4/route.c	2008-05-21 22:37:24.000000000 -0700
+++ linux-2.6/net/ipv4/route.c	2008-05-21 22:38:39.000000000 -0700
@@ -534,7 +534,7 @@
 			unsigned int j;
 			u32 *src;
 
-			src = ((u32 *) per_cpu_ptr(ip_rt_acct, i)) + offset;
+			src = ((u32 *) CPU_PTR(ip_rt_acct, i)) + offset;
 			for (j = 0; j < length/4; j++)
 				dst[j] += src[j];
 		}
@@ -3035,7 +3035,8 @@
 			     (jiffies ^ (jiffies >> 7))));
 
 #ifdef CONFIG_NET_CLS_ROUTE
-	ip_rt_acct = __alloc_percpu(256 * sizeof(struct ip_rt_acct));
+	ip_rt_acct = cpu_alloc(256 * sizeof(struct ip_rt_acct),
+		GFP_KERNEL|__GFP_ZERO, __alignof__(struct ip_rt_acct));
 	if (!ip_rt_acct)
 		panic("IP: failed to allocate ip_rt_acct\n");
 #endif

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 16/41] cpu alloc: Tcp statistics conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (14 preceding siblings ...)
  2008-05-30  3:56 ` [patch 15/41] cpu_alloc: Convert ip route statistics Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 17/41] cpu alloc: Convert scratches to cpu alloc Christoph Lameter
                   ` (25 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_tcp_statistics_conversion --]
[-- Type: text/plain, Size: 1462 bytes --]

Convert tcp statistics to cpu alloc.
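
The md5sig pool is a per cpu array of pointers, so CPU_PTR() is dereferenced
once more to reach the slot that holds each cpu's separately kmalloc'ed
object. A sketch of that pattern with invented names (zeroing the slots
keeps the unwind path safe if one of the kmallocs fails):

	struct my_obj {
		int dummy;
	};

	static struct my_obj **my_pool_alloc(void)
	{
		int cpu;
		struct my_obj **pool = CPU_ALLOC(struct my_obj *,
						GFP_KERNEL | __GFP_ZERO);

		if (!pool)
			return NULL;

		for_each_possible_cpu(cpu) {
			struct my_obj *p = kzalloc(sizeof(*p), GFP_KERNEL);

			if (!p)
				goto out_free;
			*CPU_PTR(pool, cpu) = p;	/* fill this cpu's slot */
		}
		return pool;

	out_free:
		for_each_possible_cpu(cpu)
			kfree(*CPU_PTR(pool, cpu));
		CPU_FREE(pool);
		return NULL;
	}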

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 net/ipv4/tcp.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/net/ipv4/tcp.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp.c	2008-04-29 14:55:55.000000000 -0700
+++ linux-2.6/net/ipv4/tcp.c	2008-05-21 21:46:54.000000000 -0700
@@ -2456,7 +2456,7 @@
 {
 	int cpu;
 	for_each_possible_cpu(cpu) {
-		struct tcp_md5sig_pool *p = *per_cpu_ptr(pool, cpu);
+		struct tcp_md5sig_pool *p = *CPU_PTR(pool, cpu);
 		if (p) {
 			if (p->md5_desc.tfm)
 				crypto_free_hash(p->md5_desc.tfm);
@@ -2464,7 +2464,7 @@
 			p = NULL;
 		}
 	}
-	free_percpu(pool);
+	CPU_FREE(pool);
 }
 
 void tcp_free_md5sig_pool(void)
@@ -2488,7 +2488,7 @@
 	int cpu;
 	struct tcp_md5sig_pool **pool;
 
-	pool = alloc_percpu(struct tcp_md5sig_pool *);
+	pool = CPU_ALLOC(struct tcp_md5sig_pool *, GFP_KERNEL);
 	if (!pool)
 		return NULL;
 
@@ -2499,7 +2499,7 @@
 		p = kzalloc(sizeof(*p), GFP_KERNEL);
 		if (!p)
 			goto out_free;
-		*per_cpu_ptr(pool, cpu) = p;
+		*CPU_PTR(pool, cpu) = p;
 
 		hash = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
 		if (!hash || IS_ERR(hash))
@@ -2564,7 +2564,7 @@
 	if (p)
 		tcp_md5sig_users++;
 	spin_unlock_bh(&tcp_md5sig_pool_lock);
-	return (p ? *per_cpu_ptr(p, cpu) : NULL);
+	return (p ? *CPU_PTR(p, cpu) : NULL);
 }
 
 EXPORT_SYMBOL(__tcp_get_md5sig_pool);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 17/41] cpu alloc: Convert scratches to cpu alloc
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (15 preceding siblings ...)
  2008-05-30  3:56 ` [patch 16/41] cpu alloc: Tcp statistics conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 18/41] cpu alloc: Dmaengine conversion Christoph Lameter
                   ` (24 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_scratches_convert --]
[-- Type: text/plain, Size: 5375 bytes --]

Convert scratch handling in the network stack to cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 net/ipv4/ipcomp.c  |   26 +++++++++++++-------------
 net/ipv6/ipcomp6.c |   26 +++++++++++++-------------
 2 files changed, 26 insertions(+), 26 deletions(-)

Index: linux-2.6/net/ipv4/ipcomp.c
===================================================================
--- linux-2.6.orig/net/ipv4/ipcomp.c	2008-04-29 14:55:55.000000000 -0700
+++ linux-2.6/net/ipv4/ipcomp.c	2008-05-21 21:48:14.000000000 -0700
@@ -47,8 +47,8 @@
 	int dlen = IPCOMP_SCRATCH_SIZE;
 	const u8 *start = skb->data;
 	const int cpu = get_cpu();
-	u8 *scratch = *per_cpu_ptr(ipcomp_scratches, cpu);
-	struct crypto_comp *tfm = *per_cpu_ptr(ipcd->tfms, cpu);
+	u8 *scratch = *CPU_PTR(ipcomp_scratches, cpu);
+	struct crypto_comp *tfm = *CPU_PTR(ipcd->tfms, cpu);
 	int err = crypto_comp_decompress(tfm, start, plen, scratch, &dlen);
 
 	if (err)
@@ -105,8 +105,8 @@
 	int dlen = IPCOMP_SCRATCH_SIZE;
 	u8 *start = skb->data;
 	const int cpu = get_cpu();
-	u8 *scratch = *per_cpu_ptr(ipcomp_scratches, cpu);
-	struct crypto_comp *tfm = *per_cpu_ptr(ipcd->tfms, cpu);
+	u8 *scratch = *CPU_PTR(ipcomp_scratches, cpu);
+	struct crypto_comp *tfm = *CPU_PTR(ipcd->tfms, cpu);
 	int err;
 
 	local_bh_disable();
@@ -254,9 +254,9 @@
 		return;
 
 	for_each_possible_cpu(i)
-		vfree(*per_cpu_ptr(scratches, i));
+		vfree(*CPU_PTR(scratches, i));
 
-	free_percpu(scratches);
+	CPU_FREE(scratches);
 }
 
 static void **ipcomp_alloc_scratches(void)
@@ -267,7 +267,7 @@
 	if (ipcomp_scratch_users++)
 		return ipcomp_scratches;
 
-	scratches = alloc_percpu(void *);
+	scratches = CPU_ALLOC(void *, GFP_KERNEL);
 	if (!scratches)
 		return NULL;
 
@@ -277,7 +277,7 @@
 		void *scratch = vmalloc(IPCOMP_SCRATCH_SIZE);
 		if (!scratch)
 			return NULL;
-		*per_cpu_ptr(scratches, i) = scratch;
+		*CPU_PTR(scratches, i) = scratch;
 	}
 
 	return scratches;
@@ -305,10 +305,10 @@
 		return;
 
 	for_each_possible_cpu(cpu) {
-		struct crypto_comp *tfm = *per_cpu_ptr(tfms, cpu);
+		struct crypto_comp *tfm = *CPU_PTR(tfms, cpu);
 		crypto_free_comp(tfm);
 	}
-	free_percpu(tfms);
+	CPU_FREE(tfms);
 }
 
 static struct crypto_comp **ipcomp_alloc_tfms(const char *alg_name)
@@ -324,7 +324,7 @@
 		struct crypto_comp *tfm;
 
 		tfms = pos->tfms;
-		tfm = *per_cpu_ptr(tfms, cpu);
+		tfm = *CPU_PTR(tfms, cpu);
 
 		if (!strcmp(crypto_comp_name(tfm), alg_name)) {
 			pos->users++;
@@ -340,7 +340,7 @@
 	INIT_LIST_HEAD(&pos->list);
 	list_add(&pos->list, &ipcomp_tfms_list);
 
-	pos->tfms = tfms = alloc_percpu(struct crypto_comp *);
+	pos->tfms = tfms = CPU_ALLOC(struct crypto_comp *, GFP_KERNEL);
 	if (!tfms)
 		goto error;
 
@@ -349,7 +349,7 @@
 							    CRYPTO_ALG_ASYNC);
 		if (IS_ERR(tfm))
 			goto error;
-		*per_cpu_ptr(tfms, cpu) = tfm;
+		*CPU_PTR(tfms, cpu) = tfm;
 	}
 
 	return tfms;
Index: linux-2.6/net/ipv6/ipcomp6.c
===================================================================
--- linux-2.6.orig/net/ipv6/ipcomp6.c	2008-04-29 14:55:55.000000000 -0700
+++ linux-2.6/net/ipv6/ipcomp6.c	2008-05-21 21:47:09.000000000 -0700
@@ -90,8 +90,8 @@
 	start = skb->data;
 
 	cpu = get_cpu();
-	scratch = *per_cpu_ptr(ipcomp6_scratches, cpu);
-	tfm = *per_cpu_ptr(ipcd->tfms, cpu);
+	scratch = *CPU_PTR(ipcomp6_scratches, cpu);
+	tfm = *CPU_PTR(ipcd->tfms, cpu);
 
 	err = crypto_comp_decompress(tfm, start, plen, scratch, &dlen);
 	if (err)
@@ -142,8 +142,8 @@
 	start = skb->data;
 
 	cpu = get_cpu();
-	scratch = *per_cpu_ptr(ipcomp6_scratches, cpu);
-	tfm = *per_cpu_ptr(ipcd->tfms, cpu);
+	scratch = *CPU_PTR(ipcomp6_scratches, cpu);
+	tfm = *CPU_PTR(ipcd->tfms, cpu);
 
 	local_bh_disable();
 	err = crypto_comp_compress(tfm, start, plen, scratch, &dlen);
@@ -264,12 +264,12 @@
 		return;
 
 	for_each_possible_cpu(i) {
-		void *scratch = *per_cpu_ptr(scratches, i);
+		void *scratch = *CPU_PTR(scratches, i);
 
 		vfree(scratch);
 	}
 
-	free_percpu(scratches);
+	CPU_FREE(scratches);
 }
 
 static void **ipcomp6_alloc_scratches(void)
@@ -280,7 +280,7 @@
 	if (ipcomp6_scratch_users++)
 		return ipcomp6_scratches;
 
-	scratches = alloc_percpu(void *);
+	scratches = CPU_ALLOC(void *, GFP_KERNEL);
 	if (!scratches)
 		return NULL;
 
@@ -290,7 +290,7 @@
 		void *scratch = vmalloc(IPCOMP_SCRATCH_SIZE);
 		if (!scratch)
 			return NULL;
-		*per_cpu_ptr(scratches, i) = scratch;
+		*CPU_PTR(scratches, i) = scratch;
 	}
 
 	return scratches;
@@ -318,10 +318,10 @@
 		return;
 
 	for_each_possible_cpu(cpu) {
-		struct crypto_comp *tfm = *per_cpu_ptr(tfms, cpu);
+		struct crypto_comp *tfm = *CPU_PTR(tfms, cpu);
 		crypto_free_comp(tfm);
 	}
-	free_percpu(tfms);
+	CPU_FREE(tfms);
 }
 
 static struct crypto_comp **ipcomp6_alloc_tfms(const char *alg_name)
@@ -337,7 +337,7 @@
 		struct crypto_comp *tfm;
 
 		tfms = pos->tfms;
-		tfm = *per_cpu_ptr(tfms, cpu);
+		tfm = *CPU_PTR(tfms, cpu);
 
 		if (!strcmp(crypto_comp_name(tfm), alg_name)) {
 			pos->users++;
@@ -353,7 +353,7 @@
 	INIT_LIST_HEAD(&pos->list);
 	list_add(&pos->list, &ipcomp6_tfms_list);
 
-	pos->tfms = tfms = alloc_percpu(struct crypto_comp *);
+	pos->tfms = tfms = CPU_ALLOC(struct crypto_comp *, GFP_KERNEL);
 	if (!tfms)
 		goto error;
 
@@ -362,7 +362,7 @@
 							    CRYPTO_ALG_ASYNC);
 		if (IS_ERR(tfm))
 			goto error;
-		*per_cpu_ptr(tfms, cpu) = tfm;
+		*CPU_PTR(tfms, cpu) = tfm;
 	}
 
 	return tfms;

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 18/41] cpu alloc: Dmaengine conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (16 preceding siblings ...)
  2008-05-30  3:56 ` [patch 17/41] cpu alloc: Convert scratches to cpu alloc Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 19/41] cpu alloc: Convert loopback statistics Christoph Lameter
                   ` (23 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_dmaengine_conversion --]
[-- Type: text/plain, Size: 4634 bytes --]

Convert DMA engine to use CPU_xx operations. This also removes the use of local_t
from the dmaengine.
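
With cpu ops the bigref counter no longer needs local_t: a plain int per cpu
plus _CPU_INC/_CPU_DEC gives the same cheap cpu local reference counting,
and CPU_PTR() serves for the slow path fold. The idea in isolation, with
invented names:

	struct my_chan_percpu {
		int refcount;			/* was local_t before this series */
	};

	struct my_chan {
		struct my_chan_percpu *local;	/* per cpu base pointer */
	};

	static void my_chan_get(struct my_chan *chan)
	{
		/* Fast path: preempt safe increment of this cpu's counter */
		_CPU_INC(chan->local->refcount);
	}

	static void my_chan_put(struct my_chan *chan)
	{
		_CPU_DEC(chan->local->refcount);
	}

	static int my_chan_refs(struct my_chan *chan)
	{
		int cpu, refs = 0;

		/* Slow path: fold the counters of all possible cpus */
		for_each_possible_cpu(cpu)
			refs += CPU_PTR(chan->local, cpu)->refcount;
		return refs;
	}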

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 drivers/dma/dmaengine.c   |   38 ++++++++++++++------------------------
 include/linux/dmaengine.h |   16 ++++++----------
 2 files changed, 20 insertions(+), 34 deletions(-)

Index: linux-2.6/drivers/dma/dmaengine.c
===================================================================
--- linux-2.6.orig/drivers/dma/dmaengine.c	2008-04-29 14:55:49.000000000 -0700
+++ linux-2.6/drivers/dma/dmaengine.c	2008-05-21 21:48:25.000000000 -0700
@@ -84,7 +84,7 @@
 	int i;
 
 	for_each_possible_cpu(i)
-		count += per_cpu_ptr(chan->local, i)->memcpy_count;
+		count += CPU_PTR(chan->local, i)->memcpy_count;
 
 	return sprintf(buf, "%lu\n", count);
 }
@@ -97,7 +97,7 @@
 	int i;
 
 	for_each_possible_cpu(i)
-		count += per_cpu_ptr(chan->local, i)->bytes_transferred;
+		count += CPU_PTR(chan->local, i)->bytes_transferred;
 
 	return sprintf(buf, "%lu\n", count);
 }
@@ -111,10 +111,8 @@
 		atomic_read(&chan->refcount.refcount) > 1)
 		in_use = 1;
 	else {
-		if (local_read(&(per_cpu_ptr(chan->local,
-			get_cpu())->refcount)) > 0)
+		if (_CPU_READ(chan->local->refcount) > 0)
 			in_use = 1;
-		put_cpu();
 	}
 
 	return sprintf(buf, "%d\n", in_use);
@@ -227,7 +225,7 @@
 	int bias = 0x7FFFFFFF;
 	int i;
 	for_each_possible_cpu(i)
-		bias -= local_read(&per_cpu_ptr(chan->local, i)->refcount);
+		bias -= CPU_PTR(chan->local, i)->refcount;
 	atomic_sub(bias, &chan->refcount.refcount);
 	kref_put(&chan->refcount, dma_chan_cleanup);
 }
@@ -372,7 +370,8 @@
 
 	/* represent channels in sysfs. Probably want devs too */
 	list_for_each_entry(chan, &device->channels, device_node) {
-		chan->local = alloc_percpu(typeof(*chan->local));
+		chan->local = CPU_ALLOC(typeof(*chan->local),
+					GFP_KERNEL | __GFP_ZERO);
 		if (chan->local == NULL)
 			continue;
 
@@ -385,7 +384,7 @@
 		rc = device_register(&chan->dev);
 		if (rc) {
 			chancnt--;
-			free_percpu(chan->local);
+			CPU_FREE(chan->local);
 			chan->local = NULL;
 			goto err_out;
 		}
@@ -413,7 +412,7 @@
 		kref_put(&device->refcount, dma_async_device_cleanup);
 		device_unregister(&chan->dev);
 		chancnt--;
-		free_percpu(chan->local);
+		CPU_FREE(chan->local);
 	}
 	return rc;
 }
@@ -490,11 +489,8 @@
 	tx->callback = NULL;
 	cookie = tx->tx_submit(tx);
 
-	cpu = get_cpu();
-	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
-	put_cpu();
-
+	__CPU_ADD(chan->local->bytes_transferred, len);
+	__CPU_INC(chan->local->memcpy_count);
 	return cookie;
 }
 EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
@@ -536,11 +532,8 @@
 	tx->callback = NULL;
 	cookie = tx->tx_submit(tx);
 
-	cpu = get_cpu();
-	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
-	put_cpu();
-
+	_CPU_ADD(chan->local->bytes_transferred, len);
+	_CPU_INC(chan->local->memcpy_count);
 	return cookie;
 }
 EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
@@ -585,11 +578,8 @@
 	tx->callback = NULL;
 	cookie = tx->tx_submit(tx);
 
-	cpu = get_cpu();
-	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
-	put_cpu();
-
+	_CPU_ADD(chan->local->bytes_transferred, len);
+	_CPU_INC(chan->local->memcpy_count);
 	return cookie;
 }
 EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
Index: linux-2.6/include/linux/dmaengine.h
===================================================================
--- linux-2.6.orig/include/linux/dmaengine.h	2008-04-29 14:55:54.000000000 -0700
+++ linux-2.6/include/linux/dmaengine.h	2008-05-21 21:48:25.000000000 -0700
@@ -116,13 +116,13 @@
 
 /**
  * struct dma_chan_percpu - the per-CPU part of struct dma_chan
- * @refcount: local_t used for open-coded "bigref" counting
+ * @refcount: int used for open-coded "bigref" counting
  * @memcpy_count: transaction counter
  * @bytes_transferred: byte counter
  */
 
 struct dma_chan_percpu {
-	local_t refcount;
+	int refcount;
 	/* stats */
 	unsigned long memcpy_count;
 	unsigned long bytes_transferred;
@@ -164,20 +164,16 @@
 {
 	if (unlikely(chan->slow_ref))
 		kref_get(&chan->refcount);
-	else {
-		local_inc(&(per_cpu_ptr(chan->local, get_cpu())->refcount));
-		put_cpu();
-	}
+	else
+		_CPU_INC(chan->local->refcount);
 }
 
 static inline void dma_chan_put(struct dma_chan *chan)
 {
 	if (unlikely(chan->slow_ref))
 		kref_put(&chan->refcount, dma_chan_cleanup);
-	else {
-		local_dec(&(per_cpu_ptr(chan->local, get_cpu())->refcount));
-		put_cpu();
-	}
+	else
+		_CPU_DEC(chan->local->refcount);
 }
 
 /*

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 19/41] cpu alloc: Convert loopback statistics
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (17 preceding siblings ...)
  2008-05-30  3:56 ` [patch 18/41] cpu alloc: Dmaengine conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 20/41] cpu alloc: Veth conversion Christoph Lameter
                   ` (22 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_loopback_statistics_conversion --]
[-- Type: text/plain, Size: 1672 bytes --]

Convert loopback statistics to cpu alloc.
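
The comment change below is the interesting part: with bottom halves off the
cpu cannot change, so the cheaper __CPU_xx forms are sufficient, while the
_CPU_xx forms used elsewhere in this series replace the old
preempt_disable()/preempt_enable() pairs. The distinction in isolation, with
an invented statistics structure:

	struct my_lstats {
		unsigned long packets;
		unsigned long bytes;
	};

	/* Transmit path: BHs are off, the cpu cannot change under us */
	static void my_xmit_account(struct my_lstats *lstats, unsigned int len)
	{
		__CPU_ADD(lstats->bytes, len);
		__CPU_INC(lstats->packets);
	}

	/* Arbitrary context: the _ forms cover the preemption window */
	static void my_account(struct my_lstats *lstats, unsigned int len)
	{
		_CPU_ADD(lstats->bytes, len);
		_CPU_INC(lstats->packets);
	}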

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 drivers/net/loopback.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

Index: linux-2.6/drivers/net/loopback.c
===================================================================
--- linux-2.6.orig/drivers/net/loopback.c	2008-05-28 22:02:18.000000000 -0700
+++ linux-2.6/drivers/net/loopback.c	2008-05-28 23:14:11.000000000 -0700
@@ -132,7 +132,7 @@
  */
 static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-	struct pcpu_lstats *pcpu_lstats, *lb_stats;
+	struct pcpu_lstats *pcpu_lstats;
 
 	skb_orphan(skb);
 
@@ -152,11 +152,10 @@
 #endif
 	dev->last_rx = jiffies;
 
-	/* it's OK to use per_cpu_ptr() because BHs are off */
+	/* it's OK to use __xxx cpu operations because BHs are off */
 	pcpu_lstats = netdev_priv(dev);
-	lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
-	lb_stats->bytes += skb->len;
-	lb_stats->packets++;
+	__CPU_ADD(pcpu_lstats->bytes, skb->len);
+	__CPU_INC(pcpu_lstats->packets);
 
 	netif_rx(skb);
 
@@ -175,7 +174,7 @@
 	for_each_possible_cpu(i) {
 		const struct pcpu_lstats *lb_stats;
 
-		lb_stats = per_cpu_ptr(pcpu_lstats, i);
+		lb_stats = CPU_PTR(pcpu_lstats, i);
 		bytes   += lb_stats->bytes;
 		packets += lb_stats->packets;
 	}
@@ -203,7 +202,7 @@
 {
 	struct pcpu_lstats *lstats;
 
-	lstats = alloc_percpu(struct pcpu_lstats);
+	lstats = CPU_ALLOC(struct pcpu_lstats, GFP_KERNEL | __GFP_ZERO);
 	if (!lstats)
 		return -ENOMEM;
 
@@ -215,7 +214,7 @@
 {
 	struct pcpu_lstats *lstats = netdev_priv(dev);
 
-	free_percpu(lstats);
+	CPU_FREE(lstats);
 	free_netdev(dev);
 }
 

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 20/41] cpu alloc: Veth conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (18 preceding siblings ...)
  2008-05-30  3:56 ` [patch 19/41] cpu alloc: Convert loopback statistics Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 21/41] cpu alloc: Chelsio statistics conversion Christoph Lameter
                   ` (21 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_veth_conversion --]
[-- Type: text/plain, Size: 1948 bytes --]

Convert veth statistics to cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 drivers/net/veth.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/drivers/net/veth.c
===================================================================
--- linux-2.6.orig/drivers/net/veth.c	2008-05-28 22:02:18.000000000 -0700
+++ linux-2.6/drivers/net/veth.c	2008-05-28 23:18:47.000000000 -0700
@@ -152,8 +152,7 @@
 {
 	struct net_device *rcv = NULL;
 	struct veth_priv *priv, *rcv_priv;
-	struct veth_net_stats *stats;
-	int length, cpu;
+	int length;
 
 	skb_orphan(skb);
 
@@ -161,9 +160,6 @@
 	rcv = priv->peer;
 	rcv_priv = netdev_priv(rcv);
 
-	cpu = smp_processor_id();
-	stats = per_cpu_ptr(priv->stats, cpu);
-
 	if (!(rcv->flags & IFF_UP))
 		goto outf;
 
@@ -180,19 +176,18 @@
 
 	length = skb->len;
 
-	stats->tx_bytes += length;
-	stats->tx_packets++;
+	__CPU_ADD(priv->stats->tx_bytes, length);
+	__CPU_INC(priv->stats->tx_packets);
 
-	stats = per_cpu_ptr(rcv_priv->stats, cpu);
-	stats->rx_bytes += length;
-	stats->rx_packets++;
+	__CPU_ADD(rcv_priv->stats->rx_bytes, length);
+	__CPU_INC(rcv_priv->stats->rx_packets);
 
 	netif_rx(skb);
 	return 0;
 
 outf:
 	kfree_skb(skb);
-	stats->tx_dropped++;
+	__CPU_INC(priv->stats->tx_dropped);
 	return 0;
 }
 
@@ -217,7 +212,7 @@
 	dev_stats->tx_dropped = 0;
 
 	for_each_online_cpu(cpu) {
-		stats = per_cpu_ptr(priv->stats, cpu);
+		stats = CPU_PTR(priv->stats, cpu);
 
 		dev_stats->rx_packets += stats->rx_packets;
 		dev_stats->tx_packets += stats->tx_packets;
@@ -249,7 +244,7 @@
 	struct veth_net_stats *stats;
 	struct veth_priv *priv;
 
-	stats = alloc_percpu(struct veth_net_stats);
+	stats = CPU_ALLOC(struct veth_net_stats, GFP_KERNEL | __GFP_ZERO);
 	if (stats == NULL)
 		return -ENOMEM;
 
@@ -263,7 +258,7 @@
 	struct veth_priv *priv;
 
 	priv = netdev_priv(dev);
-	free_percpu(priv->stats);
+	CPU_FREE(priv->stats);
 	free_netdev(dev);
 }
 

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 21/41] cpu alloc: Chelsio statistics conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (19 preceding siblings ...)
  2008-05-30  3:56 ` [patch 20/41] cpu alloc: Veth conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 22/41] cpu alloc: Convert network sockets inuse counter Christoph Lameter
                   ` (20 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_chelsio_statistics_conversion --]
[-- Type: text/plain, Size: 2793 bytes --]

Convert chelsio statistics to cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 drivers/net/chelsio/sge.c |   17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

Index: linux-2.6/drivers/net/chelsio/sge.c
===================================================================
--- linux-2.6.orig/drivers/net/chelsio/sge.c	2008-05-28 22:02:18.000000000 -0700
+++ linux-2.6/drivers/net/chelsio/sge.c	2008-05-28 23:22:28.000000000 -0700
@@ -809,7 +809,7 @@
 	int i;
 
 	for_each_port(sge->adapter, i)
-		free_percpu(sge->port_stats[i]);
+		CPU_FREE(sge->port_stats[i]);
 
 	kfree(sge->tx_sched);
 	free_tx_resources(sge);
@@ -988,7 +988,7 @@
 
 	memset(ss, 0, sizeof(*ss));
 	for_each_possible_cpu(cpu) {
-		struct sge_port_stats *st = per_cpu_ptr(sge->port_stats[port], cpu);
+		struct sge_port_stats *st = CPU_PTR(sge->port_stats[port], cpu);
 
 		ss->rx_cso_good += st->rx_cso_good;
 		ss->tx_cso += st->tx_cso;
@@ -1367,7 +1367,6 @@
 	struct sk_buff *skb;
 	const struct cpl_rx_pkt *p;
 	struct adapter *adapter = sge->adapter;
-	struct sge_port_stats *st;
 
 	skb = get_packet(adapter->pdev, fl, len - sge->rx_pkt_pad);
 	if (unlikely(!skb)) {
@@ -1382,20 +1381,18 @@
 	}
 	__skb_pull(skb, sizeof(*p));
 
-	st = per_cpu_ptr(sge->port_stats[p->iff], smp_processor_id());
-
 	skb->protocol = eth_type_trans(skb, adapter->port[p->iff].dev);
 	skb->dev->last_rx = jiffies;
 	if ((adapter->flags & RX_CSUM_ENABLED) && p->csum == 0xffff &&
 	    skb->protocol == htons(ETH_P_IP) &&
 	    (skb->data[9] == IPPROTO_TCP || skb->data[9] == IPPROTO_UDP)) {
-		++st->rx_cso_good;
+		__CPU_INC(sge->port_stats[p->iff]->rx_cso_good);
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 	} else
 		skb->ip_summed = CHECKSUM_NONE;
 
 	if (unlikely(adapter->vlan_grp && p->vlan_valid)) {
-		st->vlan_xtract++;
+		__CPU_INC(sge->port_stats[p->iff]->vlan_xtract);
 #ifdef CONFIG_CHELSIO_T1_NAPI
 			vlan_hwaccel_receive_skb(skb, adapter->vlan_grp,
 						 ntohs(p->vlan));
@@ -1848,8 +1845,7 @@
 {
 	struct adapter *adapter = dev->priv;
 	struct sge *sge = adapter->sge;
-	struct sge_port_stats *st = per_cpu_ptr(sge->port_stats[dev->if_port],
-						smp_processor_id());
+	struct sge_port_stats *st = THIS_CPU(sge->port_stats[dev->if_port]);
 	struct cpl_tx_pkt *cpl;
 	struct sk_buff *orig_skb = skb;
 	int ret;
@@ -2159,7 +2155,8 @@
 	sge->jumbo_fl = t1_is_T1B(adapter) ? 1 : 0;
 
 	for_each_port(adapter, i) {
-		sge->port_stats[i] = alloc_percpu(struct sge_port_stats);
+		sge->port_stats[i] = CPU_ALLOC(struct sge_port_stats,
+					GFP_KERNEL | __GFP_ZERO);
 		if (!sge->port_stats[i])
 			goto nomem_port;
 	}
@@ -2203,7 +2200,7 @@
 	return sge;
 nomem_port:
 	while (i >= 0) {
-		free_percpu(sge->port_stats[i]);
+		CPU_FREE(sge->port_stats[i]);
 		--i;
 	}
 	kfree(sge);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 22/41] cpu alloc: Convert network sockets inuse counter
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (20 preceding siblings ...)
  2008-05-30  3:56 ` [patch 21/41] cpu alloc: Chelsio statistics conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 23/41] cpu alloc: Use it for infiniband Christoph Lameter
                   ` (19 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_network_sockets_conversion --]
[-- Type: text/plain, Size: 1383 bytes --]

Convert handling of the inuse counters to cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 net/core/sock.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c	2008-05-14 19:40:34.000000000 -0700
+++ linux-2.6/net/core/sock.c	2008-05-21 22:00:56.000000000 -0700
@@ -1943,8 +1943,7 @@
 #ifdef CONFIG_NET_NS
 void sock_prot_inuse_add(struct net *net, struct proto *prot, int val)
 {
-	int cpu = smp_processor_id();
-	per_cpu_ptr(net->core.inuse, cpu)->val[prot->inuse_idx] += val;
+	__CPU_ADD(net->core.inuse->val[prot->inuse_idx], val);
 }
 EXPORT_SYMBOL_GPL(sock_prot_inuse_add);
 
@@ -1954,7 +1953,7 @@
 	int res = 0;
 
 	for_each_possible_cpu(cpu)
-		res += per_cpu_ptr(net->core.inuse, cpu)->val[idx];
+		res += CPU_PTR(net->core.inuse, cpu)->val[idx];
 
 	return res >= 0 ? res : 0;
 }
@@ -1962,13 +1961,13 @@
 
 static int sock_inuse_init_net(struct net *net)
 {
-	net->core.inuse = alloc_percpu(struct prot_inuse);
+	net->core.inuse = CPU_ALLOC(struct prot_inuse, GFP_KERNEL|__GFP_ZERO);
 	return net->core.inuse ? 0 : -ENOMEM;
 }
 
 static void sock_inuse_exit_net(struct net *net)
 {
-	free_percpu(net->core.inuse);
+	CPU_FREE(net->core.inuse);
 }
 
 static struct pernet_operations net_inuse_ops = {

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 23/41] cpu alloc: Use it for infiniband
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (21 preceding siblings ...)
  2008-05-30  3:56 ` [patch 22/41] cpu alloc: Convert network sockets inuse counter Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 24/41] cpu alloc: Use in the crypto subsystem Christoph Lameter
                   ` (18 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_infiniband_conversion --]
[-- Type: text/plain, Size: 3233 bytes --]

Use cpu alloc for infiniband.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 drivers/infiniband/hw/ehca/ehca_irq.c |   22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

Index: linux-2.6/drivers/infiniband/hw/ehca/ehca_irq.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ehca/ehca_irq.c	2008-05-14 19:40:32.000000000 -0700
+++ linux-2.6/drivers/infiniband/hw/ehca/ehca_irq.c	2008-05-21 22:01:32.000000000 -0700
@@ -680,7 +680,7 @@
 	cpu_id = find_next_online_cpu(pool);
 	BUG_ON(!cpu_online(cpu_id));
 
-	cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu_id);
+	cct = CPU_PTR(pool->cpu_comp_tasks, cpu_id);
 	BUG_ON(!cct);
 
 	spin_lock_irqsave(&cct->task_lock, flags);
@@ -688,7 +688,7 @@
 	spin_unlock_irqrestore(&cct->task_lock, flags);
 	if (cq_jobs > 0) {
 		cpu_id = find_next_online_cpu(pool);
-		cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu_id);
+		cct = CPU_PTR(pool->cpu_comp_tasks, cpu_id);
 		BUG_ON(!cct);
 	}
 
@@ -761,7 +761,7 @@
 {
 	struct ehca_cpu_comp_task *cct;
 
-	cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+	cct = CPU_PTR(pool->cpu_comp_tasks, cpu);
 	spin_lock_init(&cct->task_lock);
 	INIT_LIST_HEAD(&cct->cq_list);
 	init_waitqueue_head(&cct->wait_queue);
@@ -777,7 +777,7 @@
 	struct task_struct *task;
 	unsigned long flags_cct;
 
-	cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+	cct = CPU_PTR(pool->cpu_comp_tasks, cpu);
 
 	spin_lock_irqsave(&cct->task_lock, flags_cct);
 
@@ -793,7 +793,7 @@
 
 static void __cpuinit take_over_work(struct ehca_comp_pool *pool, int cpu)
 {
-	struct ehca_cpu_comp_task *cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+	struct ehca_cpu_comp_task *cct = CPU_PTR(pool->cpu_comp_tasks, cpu);
 	LIST_HEAD(list);
 	struct ehca_cq *cq;
 	unsigned long flags_cct;
@@ -806,8 +806,7 @@
 		cq = list_entry(cct->cq_list.next, struct ehca_cq, entry);
 
 		list_del(&cq->entry);
-		__queue_comp_task(cq, per_cpu_ptr(pool->cpu_comp_tasks,
-						  smp_processor_id()));
+		__queue_comp_task(cq, THIS_CPU(pool->cpu_comp_tasks));
 	}
 
 	spin_unlock_irqrestore(&cct->task_lock, flags_cct);
@@ -833,14 +832,14 @@
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
 		ehca_gen_dbg("CPU: %x (CPU_CANCELED)", cpu);
-		cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+		cct = CPU_PTR(pool->cpu_comp_tasks, cpu);
 		kthread_bind(cct->task, any_online_cpu(cpu_online_map));
 		destroy_comp_task(pool, cpu);
 		break;
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 		ehca_gen_dbg("CPU: %x (CPU_ONLINE)", cpu);
-		cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+		cct = CPU_PTR(pool->cpu_comp_tasks, cpu);
 		kthread_bind(cct->task, cpu);
 		wake_up_process(cct->task);
 		break;
@@ -883,7 +882,8 @@
 	spin_lock_init(&pool->last_cpu_lock);
 	pool->last_cpu = any_online_cpu(cpu_online_map);
 
-	pool->cpu_comp_tasks = alloc_percpu(struct ehca_cpu_comp_task);
+	pool->cpu_comp_tasks = CPU_ALLOC(struct ehca_cpu_comp_task,
+						GFP_KERNEL | __GFP_ZERO);
 	if (pool->cpu_comp_tasks == NULL) {
 		kfree(pool);
 		return -EINVAL;
@@ -917,6 +917,6 @@
 		if (cpu_online(i))
 			destroy_comp_task(pool, i);
 	}
-	free_percpu(pool->cpu_comp_tasks);
+	CPU_FREE(pool->cpu_comp_tasks);
 	kfree(pool);
 }

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 24/41] cpu alloc: Use in the crypto subsystem.
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (22 preceding siblings ...)
  2008-05-30  3:56 ` [patch 23/41] cpu alloc: Use it for infiniband Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 25/41] cpu alloc: scheduler: Convert cpuusage to cpu_alloc Christoph Lameter
                   ` (17 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_crypto_conversion --]
[-- Type: text/plain, Size: 2239 bytes --]

Use cpu alloc for the crypto subsystem.
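
The channel lookup only needs to read one per cpu pointer, so the
get_cpu()/put_cpu() pair around it collapses into a single _CPU_READ. In
isolation, with invented names (table is a per cpu base pointer returned by
CPU_ALLOC()):

	struct my_ref {
		void *chan;
	};

	struct my_table {
		struct my_ref *ref;	/* filled in per cpu elsewhere */
	};

	static void *my_current_chan(struct my_table *table)
	{
		/* One preempt safe read of this cpu's ref pointer */
		struct my_ref *ref = _CPU_READ(table->ref);

		return ref ? ref->chan : NULL;
	}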

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 crypto/async_tx/async_tx.c |   19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

Index: linux-2.6/crypto/async_tx/async_tx.c
===================================================================
--- linux-2.6.orig/crypto/async_tx/async_tx.c	2008-04-29 14:55:49.000000000 -0700
+++ linux-2.6/crypto/async_tx/async_tx.c	2008-05-21 22:01:42.000000000 -0700
@@ -221,10 +221,10 @@
 	for_each_dma_cap_mask(cap, dma_cap_mask_all)
 		for_each_possible_cpu(cpu) {
 			struct dma_chan_ref *ref =
-				per_cpu_ptr(channel_table[cap], cpu)->ref;
+				CPU_PTR(channel_table[cap], cpu)->ref;
 			if (ref) {
 				atomic_set(&ref->count, 0);
-				per_cpu_ptr(channel_table[cap], cpu)->ref =
+				CPU_PTR(channel_table[cap], cpu)->ref =
 									NULL;
 			}
 		}
@@ -237,7 +237,7 @@
 			else
 				new = get_chan_ref_by_cap(cap, -1);
 
-			per_cpu_ptr(channel_table[cap], cpu)->ref = new;
+			CPU_PTR(channel_table[cap], cpu)->ref = new;
 		}
 
 	spin_unlock_irqrestore(&async_tx_lock, flags);
@@ -341,7 +341,8 @@
 	clear_bit(DMA_INTERRUPT, dma_cap_mask_all.bits);
 
 	for_each_dma_cap_mask(cap, dma_cap_mask_all) {
-		channel_table[cap] = alloc_percpu(struct chan_ref_percpu);
+		channel_table[cap] = CPU_ALLOC(struct chan_ref_percpu,
+						GFP_KERNEL | __GFP_ZERO);
 		if (!channel_table[cap])
 			goto err;
 	}
@@ -357,7 +358,7 @@
 	printk(KERN_ERR "async_tx: initialization failure\n");
 
 	while (--cap >= 0)
-		free_percpu(channel_table[cap]);
+		CPU_FREE(channel_table[cap]);
 
 	return 1;
 }
@@ -370,7 +371,7 @@
 
 	for_each_dma_cap_mask(cap, dma_cap_mask_all)
 		if (channel_table[cap])
-			free_percpu(channel_table[cap]);
+			CPU_FREE(channel_table[cap]);
 
 	dma_async_client_unregister(&async_tx_dma);
 }
@@ -390,10 +391,8 @@
 		dma_has_cap(tx_type, depend_tx->chan->device->cap_mask))
 		return depend_tx->chan;
 	else if (likely(channel_table_initialized)) {
-		struct dma_chan_ref *ref;
-		int cpu = get_cpu();
-		ref = per_cpu_ptr(channel_table[tx_type], cpu)->ref;
-		put_cpu();
+		struct dma_chan_ref *ref =
+			_CPU_READ(channel_table[tx_type]->ref);
 		return ref ? ref->chan : NULL;
 	} else
 		return NULL;

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 25/41] cpu alloc: scheduler: Convert cpuusage to cpu_alloc.
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (23 preceding siblings ...)
  2008-05-30  3:56 ` [patch 24/41] cpu alloc: Use in the crypto subsystem Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 26/41] cpu alloc: Convert mib handling to cpu alloc Christoph Lameter
                   ` (16 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_scheduler_usage_conversion --]
[-- Type: text/plain, Size: 1529 bytes --]

Convert scheduler handling of cpuusage to cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/x86/kernel/setup.c |    2 +-
 kernel/sched.c          |   10 +++++-----
 mm/cpu_alloc.c          |   25 +++++++++++++++++++------
 3 files changed, 25 insertions(+), 12 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2008-05-26 20:38:49.000000000 -0700
+++ linux-2.6/kernel/sched.c	2008-05-26 20:38:53.000000000 -0700
@@ -9090,7 +9090,7 @@
 	if (!ca)
 		return ERR_PTR(-ENOMEM);
 
-	ca->cpuusage = alloc_percpu(u64);
+	ca->cpuusage = CPU_ALLOC(u64, GFP_KERNEL|__GFP_ZERO);
 	if (!ca->cpuusage) {
 		kfree(ca);
 		return ERR_PTR(-ENOMEM);
@@ -9105,7 +9105,7 @@
 {
 	struct cpuacct *ca = cgroup_ca(cgrp);
 
-	free_percpu(ca->cpuusage);
+	CPU_FREE(ca->cpuusage);
 	kfree(ca);
 }
 
@@ -9117,7 +9117,7 @@
 	int i;
 
 	for_each_possible_cpu(i) {
-		u64 *cpuusage = percpu_ptr(ca->cpuusage, i);
+		u64 *cpuusage = CPU_PTR(ca->cpuusage, i);
 
 		/*
 		 * Take rq->lock to make 64-bit addition safe on 32-bit
@@ -9144,7 +9144,7 @@
 	}
 
 	for_each_possible_cpu(i) {
-		u64 *cpuusage = percpu_ptr(ca->cpuusage, i);
+		u64 *cpuusage = CPU_PTR(ca->cpuusage, i);
 
 		spin_lock_irq(&cpu_rq(i)->lock);
 		*cpuusage = 0;
@@ -9181,7 +9181,7 @@
 
 	ca = task_ca(tsk);
 	if (ca) {
-		u64 *cpuusage = percpu_ptr(ca->cpuusage, task_cpu(tsk));
+		u64 *cpuusage = CPU_PTR(ca->cpuusage, task_cpu(tsk));
 
 		*cpuusage += cputime;
 	}

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 26/41] cpu alloc: Convert mib handling to cpu alloc
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (24 preceding siblings ...)
  2008-05-30  3:56 ` [patch 25/41] cpu alloc: scheduler: Convert cpuusage to cpu_alloc Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  6:47   ` Eric Dumazet
  2008-05-30  3:56 ` [patch 27/41] cpu alloc: Remove the allocpercpu functionality Christoph Lameter
                   ` (15 subsequent siblings)
  41 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_mib_handling_conversion --]
[-- Type: text/plain, Size: 9614 bytes --]

Use the cpu alloc functions for the mib handling in the net layer. The
snmp_mib_free() API gains a size parameter because cpu_free() needs to
know the size of the object being freed.
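
A caller-side sketch of the changed interface (condensed from the error
paths converted below):

	/* Before: allocpercpu kept its own bookkeeping, so no size was needed. */
	snmp_mib_free((void **)udp_statistics);

	/* After: the caller supplies the object size so cpu_free() can return
	 * exactly that much space to the cpu area.
	 */
	snmp_mib_free((void **)udp_statistics, sizeof(struct udp_mib));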

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/net/ip.h     |    2 +-
 include/net/snmp.h   |   32 ++++++++------------------------
 net/dccp/proto.c     |    2 +-
 net/ipv4/af_inet.c   |   31 +++++++++++++++++--------------
 net/ipv6/addrconf.c  |   11 ++++++-----
 net/ipv6/af_inet6.c  |   20 +++++++++++---------
 net/sctp/protocol.c  |    2 +-
 net/xfrm/xfrm_proc.c |    4 ++--
 8 files changed, 47 insertions(+), 57 deletions(-)

Index: linux-2.6/include/net/ip.h
===================================================================
--- linux-2.6.orig/include/net/ip.h	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/include/net/ip.h	2008-05-29 20:15:34.000000000 -0700
@@ -170,7 +170,7 @@ DECLARE_SNMP_STAT(struct linux_mib, net_
 
 extern unsigned long snmp_fold_field(void *mib[], int offt);
 extern int snmp_mib_init(void *ptr[2], size_t mibsize);
-extern void snmp_mib_free(void *ptr[2]);
+extern void snmp_mib_free(void *ptr[2], size_t mibsize);
 
 extern void inet_get_local_port_range(int *low, int *high);
 
Index: linux-2.6/include/net/snmp.h
===================================================================
--- linux-2.6.orig/include/net/snmp.h	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/include/net/snmp.h	2008-05-29 20:15:34.000000000 -0700
@@ -138,29 +138,13 @@ struct linux_xfrm_mib {
 #define SNMP_STAT_BHPTR(name)	(name[0])
 #define SNMP_STAT_USRPTR(name)	(name[1])
 
-#define SNMP_INC_STATS_BH(mib, field) 	\
-	(per_cpu_ptr(mib[0], raw_smp_processor_id())->mibs[field]++)
-#define SNMP_INC_STATS_USER(mib, field) \
-	do { \
-		per_cpu_ptr(mib[1], get_cpu())->mibs[field]++; \
-		put_cpu(); \
-	} while (0)
-#define SNMP_INC_STATS(mib, field) 	\
-	do { \
-		per_cpu_ptr(mib[!in_softirq()], get_cpu())->mibs[field]++; \
-		put_cpu(); \
-	} while (0)
-#define SNMP_DEC_STATS(mib, field) 	\
-	do { \
-		per_cpu_ptr(mib[!in_softirq()], get_cpu())->mibs[field]--; \
-		put_cpu(); \
-	} while (0)
-#define SNMP_ADD_STATS_BH(mib, field, addend) 	\
-	(per_cpu_ptr(mib[0], raw_smp_processor_id())->mibs[field] += addend)
-#define SNMP_ADD_STATS_USER(mib, field, addend) 	\
-	do { \
-		per_cpu_ptr(mib[1], get_cpu())->mibs[field] += addend; \
-		put_cpu(); \
-	} while (0)
+#define SNMP_INC_STATS_BH(mib, field) __CPU_INC(mib[0]->mibs[field])
+#define SNMP_INC_STATS_USER(mib, field) _CPU_INC(mib[1]->mibs[field])
+#define SNMP_INC_STATS(mib, field) _CPU_INC(mib[!in_softirq()]->mibs[field])
+#define SNMP_DEC_STATS(mib, field) _CPU_DEC(mib[!in_softirq()]->mibs[field])
+#define SNMP_ADD_STATS_BH(mib, field, addend) \
+				__CPU_ADD(mib[0]->mibs[field], addend)
+#define SNMP_ADD_STATS_USER(mib, field, addend) \
+				_CPU_ADD(mib[1]->mibs[field], addend)
 
 #endif
Index: linux-2.6/net/ipv4/af_inet.c
===================================================================
--- linux-2.6.orig/net/ipv4/af_inet.c	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/net/ipv4/af_inet.c	2008-05-29 20:15:34.000000000 -0700
@@ -1279,8 +1279,8 @@ unsigned long snmp_fold_field(void *mib[
 	int i;
 
 	for_each_possible_cpu(i) {
-		res += *(((unsigned long *) per_cpu_ptr(mib[0], i)) + offt);
-		res += *(((unsigned long *) per_cpu_ptr(mib[1], i)) + offt);
+		res += *(((unsigned long *) CPU_PTR(mib[0], i)) + offt);
+		res += *(((unsigned long *) CPU_PTR(mib[1], i)) + offt);
 	}
 	return res;
 }
@@ -1289,26 +1289,28 @@ EXPORT_SYMBOL_GPL(snmp_fold_field);
 int snmp_mib_init(void *ptr[2], size_t mibsize)
 {
 	BUG_ON(ptr == NULL);
-	ptr[0] = __alloc_percpu(mibsize);
+	ptr[0] = cpu_alloc(mibsize, GFP_KERNEL | __GFP_ZERO,
+					L1_CACHE_BYTES);
 	if (!ptr[0])
 		goto err0;
-	ptr[1] = __alloc_percpu(mibsize);
+	ptr[1] = cpu_alloc(mibsize, GFP_KERNEL | __GFP_ZERO,
+					L1_CACHE_BYTES);
 	if (!ptr[1])
 		goto err1;
 	return 0;
 err1:
-	free_percpu(ptr[0]);
+	cpu_free(ptr[0], mibsize);
 	ptr[0] = NULL;
 err0:
 	return -ENOMEM;
 }
 EXPORT_SYMBOL_GPL(snmp_mib_init);
 
-void snmp_mib_free(void *ptr[2])
+void snmp_mib_free(void *ptr[2], size_t mibsize)
 {
 	BUG_ON(ptr == NULL);
-	free_percpu(ptr[0]);
-	free_percpu(ptr[1]);
+	cpu_free(ptr[0], mibsize);
+	cpu_free(ptr[1], mibsize);
 	ptr[0] = ptr[1] = NULL;
 }
 EXPORT_SYMBOL_GPL(snmp_mib_free);
@@ -1370,17 +1372,18 @@ static int __init init_ipv4_mibs(void)
 	return 0;
 
 err_udplite_mib:
-	snmp_mib_free((void **)udp_statistics);
+	snmp_mib_free((void **)udp_statistics, sizeof(struct udp_mib));
 err_udp_mib:
-	snmp_mib_free((void **)tcp_statistics);
+	snmp_mib_free((void **)tcp_statistics, sizeof(struct tcp_mib));
 err_tcp_mib:
-	snmp_mib_free((void **)icmpmsg_statistics);
+	snmp_mib_free((void **)icmpmsg_statistics,
+					sizeof(struct icmpmsg_mib));
 err_icmpmsg_mib:
-	snmp_mib_free((void **)icmp_statistics);
+	snmp_mib_free((void **)icmp_statistics, sizeof(struct icmp_mib));
 err_icmp_mib:
-	snmp_mib_free((void **)ip_statistics);
+	snmp_mib_free((void **)ip_statistics, sizeof(struct ipstats_mib));
 err_ip_mib:
-	snmp_mib_free((void **)net_statistics);
+	snmp_mib_free((void **)net_statistics, sizeof(struct linux_mib));
 err_net_mib:
 	return -ENOMEM;
 }
Index: linux-2.6/net/ipv6/addrconf.c
===================================================================
--- linux-2.6.orig/net/ipv6/addrconf.c	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/net/ipv6/addrconf.c	2008-05-29 20:16:35.000000000 -0700
@@ -279,18 +279,19 @@ static int snmp6_alloc_dev(struct inet6_
 	return 0;
 
 err_icmpmsg:
-	snmp_mib_free((void **)idev->stats.icmpv6);
+	snmp_mib_free((void **)idev->stats.icmpv6, sizeof(struct icmpv6_mib));
 err_icmp:
-	snmp_mib_free((void **)idev->stats.ipv6);
+	snmp_mib_free((void **)idev->stats.ipv6, sizeof(struct ipstats_mib));
 err_ip:
 	return -ENOMEM;
 }
 
 static void snmp6_free_dev(struct inet6_dev *idev)
 {
-	snmp_mib_free((void **)idev->stats.icmpv6msg);
-	snmp_mib_free((void **)idev->stats.icmpv6);
-	snmp_mib_free((void **)idev->stats.ipv6);
+	snmp_mib_free((void **)idev->stats.icmpv6msg,
+						sizeof(struct icmpv6msg_mib));
+	snmp_mib_free((void **)idev->stats.icmpv6, sizeof(struct icmpv6_mib));
+	snmp_mib_free((void **)idev->stats.ipv6, sizeof(struct ipstats_mib));
 }
 
 /* Nobody refers to this device, we may destroy it. */
Index: linux-2.6/net/ipv6/af_inet6.c
===================================================================
--- linux-2.6.orig/net/ipv6/af_inet6.c	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/net/ipv6/af_inet6.c	2008-05-29 20:17:39.000000000 -0700
@@ -822,13 +822,14 @@ static int __init init_ipv6_mibs(void)
 	return 0;
 
 err_udplite_mib:
-	snmp_mib_free((void **)udp_stats_in6);
+	snmp_mib_free((void **)udp_stats_in6, sizeof(struct udp_mib));
 err_udp_mib:
-	snmp_mib_free((void **)icmpv6msg_statistics);
+	snmp_mib_free((void **)icmpv6msg_statistics,
+					sizeof(struct icmpv6msg_mib));
 err_icmpmsg_mib:
-	snmp_mib_free((void **)icmpv6_statistics);
+	snmp_mib_free((void **)icmpv6_statistics, sizeof(struct icmpv6_mib));
 err_icmp_mib:
-	snmp_mib_free((void **)ipv6_statistics);
+	snmp_mib_free((void **)ipv6_statistics, sizeof(struct ipstats_mib));
 err_ip_mib:
 	return -ENOMEM;
 
@@ -836,11 +837,12 @@ err_ip_mib:
 
 static void cleanup_ipv6_mibs(void)
 {
-	snmp_mib_free((void **)ipv6_statistics);
-	snmp_mib_free((void **)icmpv6_statistics);
-	snmp_mib_free((void **)icmpv6msg_statistics);
-	snmp_mib_free((void **)udp_stats_in6);
-	snmp_mib_free((void **)udplite_stats_in6);
+	snmp_mib_free((void **)ipv6_statistics, sizeof(struct ipstats_mib));
+	snmp_mib_free((void **)icmpv6_statistics, sizeof(struct icmpv6_mib));
+	snmp_mib_free((void **)icmpv6msg_statistics,
+						sizeof(struct icmpv6msg_mib));
+	snmp_mib_free((void **)udp_stats_in6, sizeof(struct udp_mib));
+	snmp_mib_free((void **)udplite_stats_in6, sizeof(struct udp_mib));
 }
 
 static int inet6_net_init(struct net *net)
Index: linux-2.6/net/dccp/proto.c
===================================================================
--- linux-2.6.orig/net/dccp/proto.c	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/net/dccp/proto.c	2008-05-29 20:18:05.000000000 -0700
@@ -1016,7 +1016,7 @@ static inline int dccp_mib_init(void)
 
 static inline void dccp_mib_exit(void)
 {
-	snmp_mib_free((void**)dccp_statistics);
+	snmp_mib_free((void **)dccp_statistics, sizeof(struct dccp_mib));
 }
 
 static int thash_entries;
Index: linux-2.6/net/sctp/protocol.c
===================================================================
--- linux-2.6.orig/net/sctp/protocol.c	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/net/sctp/protocol.c	2008-05-29 20:18:21.000000000 -0700
@@ -981,7 +981,7 @@ static inline int init_sctp_mibs(void)
 
 static inline void cleanup_sctp_mibs(void)
 {
-	snmp_mib_free((void**)sctp_statistics);
+	snmp_mib_free((void **)sctp_statistics, sizeof(struct sctp_mib));
 }
 
 static void sctp_v4_pf_init(void)
Index: linux-2.6/net/xfrm/xfrm_proc.c
===================================================================
--- linux-2.6.orig/net/xfrm/xfrm_proc.c	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/net/xfrm/xfrm_proc.c	2008-05-29 20:19:10.000000000 -0700
@@ -51,8 +51,8 @@ fold_field(void *mib[], int offt)
         int i;
 
         for_each_possible_cpu(i) {
-                res += *(((unsigned long *)per_cpu_ptr(mib[0], i)) + offt);
-                res += *(((unsigned long *)per_cpu_ptr(mib[1], i)) + offt);
+		res += *(((unsigned long *)CPU_PTR(mib[0], i)) + offt);
+		res += *(((unsigned long *)CPU_PTR(mib[1], i)) + offt);
         }
         return res;
 }

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 27/41] cpu alloc: Remove the allocpercpu functionality
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (25 preceding siblings ...)
  2008-05-30  3:56 ` [patch 26/41] cpu alloc: Convert mib handling to cpu alloc Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
  2008-05-30  3:56 ` [patch 28/41] Module handling: Use CPU_xx ops to dynamically allocate counters Christoph Lameter
                   ` (14 subsequent siblings)
  41 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_remove_allocpercpu --]
[-- Type: text/plain, Size: 8176 bytes --]

There is no user of allocpercpu left after all the earlier patches were
applied. Remove the allocpercpu code.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/percpu.h |   80 ------------------------------
 mm/Makefile            |    1 
 mm/allocpercpu.c       |  127 -------------------------------------------------
 3 files changed, 208 deletions(-)
 delete mode 100644 mm/allocpercpu.c

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2008-05-21 21:42:55.000000000 -0700
+++ linux-2.6/include/linux/percpu.h	2008-05-21 22:03:19.000000000 -0700
@@ -53,86 +53,6 @@
 	&__get_cpu_var(var); }))
 #define put_cpu_var(var) preempt_enable()
 
-#ifdef CONFIG_SMP
-
-struct percpu_data {
-	void *ptrs[1];
-};
-
-#define __percpu_disguise(pdata) (struct percpu_data *)~(unsigned long)(pdata)
-/* 
- * Use this to get to a cpu's version of the per-cpu object dynamically
- * allocated. Non-atomic access to the current CPU's version should
- * probably be combined with get_cpu()/put_cpu().
- */ 
-#define percpu_ptr(ptr, cpu)                              \
-({                                                        \
-        struct percpu_data *__p = __percpu_disguise(ptr); \
-        (__typeof__(ptr))__p->ptrs[(cpu)];	          \
-})
-
-extern void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu);
-extern void percpu_depopulate(void *__pdata, int cpu);
-extern int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
-				  cpumask_t *mask);
-extern void __percpu_depopulate_mask(void *__pdata, cpumask_t *mask);
-extern void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask);
-extern void percpu_free(void *__pdata);
-
-#else /* CONFIG_SMP */
-
-#define percpu_ptr(ptr, cpu) ({ (void)(cpu); (ptr); })
-
-static inline void percpu_depopulate(void *__pdata, int cpu)
-{
-}
-
-static inline void __percpu_depopulate_mask(void *__pdata, cpumask_t *mask)
-{
-}
-
-static inline void *percpu_populate(void *__pdata, size_t size, gfp_t gfp,
-				    int cpu)
-{
-	return percpu_ptr(__pdata, cpu);
-}
-
-static inline int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
-					 cpumask_t *mask)
-{
-	return 0;
-}
-
-static __always_inline void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
-{
-	return kzalloc(size, gfp);
-}
-
-static inline void percpu_free(void *__pdata)
-{
-	kfree(__pdata);
-}
-
-#endif /* CONFIG_SMP */
-
-#define percpu_populate_mask(__pdata, size, gfp, mask) \
-	__percpu_populate_mask((__pdata), (size), (gfp), &(mask))
-#define percpu_depopulate_mask(__pdata, mask) \
-	__percpu_depopulate_mask((__pdata), &(mask))
-#define percpu_alloc_mask(size, gfp, mask) \
-	__percpu_alloc_mask((size), (gfp), &(mask))
-
-#define percpu_alloc(size, gfp) percpu_alloc_mask((size), (gfp), cpu_online_map)
-
-/* (legacy) interface for use without CPU hotplug handling */
-
-#define __alloc_percpu(size)	percpu_alloc_mask((size), GFP_KERNEL, \
-						  cpu_possible_map)
-#define alloc_percpu(type)	(type *)__alloc_percpu(sizeof(type))
-#define free_percpu(ptr)	percpu_free((ptr))
-#define per_cpu_ptr(ptr, cpu)	percpu_ptr((ptr), (cpu))
-
-
 /*
  * cpu allocator definitions
  *
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-05-21 21:35:21.000000000 -0700
+++ linux-2.6/mm/Makefile	2008-05-21 22:02:05.000000000 -0700
@@ -30,7 +30,6 @@
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
-obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
 
Index: linux-2.6/mm/allocpercpu.c
===================================================================
--- linux-2.6.orig/mm/allocpercpu.c	2008-04-29 14:55:55.000000000 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,141 +0,0 @@
-/*
- * linux/mm/allocpercpu.c
- *
- * Separated from slab.c August 11, 2006 Christoph Lameter <clameter@sgi.com>
- */
-#include <linux/mm.h>
-#include <linux/module.h>
-
-#ifndef cache_line_size
-#define cache_line_size()	L1_CACHE_BYTES
-#endif
-
-/**
- * percpu_depopulate - depopulate per-cpu data for given cpu
- * @__pdata: per-cpu data to depopulate
- * @cpu: depopulate per-cpu data for this cpu
- *
- * Depopulating per-cpu data for a cpu going offline would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- */
-void percpu_depopulate(void *__pdata, int cpu)
-{
-	struct percpu_data *pdata = __percpu_disguise(__pdata);
-
-	kfree(pdata->ptrs[cpu]);
-	pdata->ptrs[cpu] = NULL;
-}
-EXPORT_SYMBOL_GPL(percpu_depopulate);
-
-/**
- * percpu_depopulate_mask - depopulate per-cpu data for some cpu's
- * @__pdata: per-cpu data to depopulate
- * @mask: depopulate per-cpu data for cpu's selected through mask bits
- */
-void __percpu_depopulate_mask(void *__pdata, cpumask_t *mask)
-{
-	int cpu;
-	for_each_cpu_mask(cpu, *mask)
-		percpu_depopulate(__pdata, cpu);
-}
-EXPORT_SYMBOL_GPL(__percpu_depopulate_mask);
-
-/**
- * percpu_populate - populate per-cpu data for given cpu
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @cpu: populate per-data for this cpu
- *
- * Populating per-cpu data for a cpu coming online would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- * Per-cpu object is populated with zeroed buffer.
- */
-void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu)
-{
-	struct percpu_data *pdata = __percpu_disguise(__pdata);
-	int node = cpu_to_node(cpu);
-
-	/*
-	 * We should make sure each CPU gets private memory.
-	 */
-	size = roundup(size, cache_line_size());
-
-	BUG_ON(pdata->ptrs[cpu]);
-	if (node_online(node))
-		pdata->ptrs[cpu] = kmalloc_node(size, gfp|__GFP_ZERO, node);
-	else
-		pdata->ptrs[cpu] = kzalloc(size, gfp);
-	return pdata->ptrs[cpu];
-}
-EXPORT_SYMBOL_GPL(percpu_populate);
-
-/**
- * percpu_populate_mask - populate per-cpu data for more cpu's
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @mask: populate per-cpu data for cpu's selected through mask bits
- *
- * Per-cpu objects are populated with zeroed buffers.
- */
-int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
-			   cpumask_t *mask)
-{
-	cpumask_t populated;
-	int cpu;
-
-	cpus_clear(populated);
-	for_each_cpu_mask(cpu, *mask)
-		if (unlikely(!percpu_populate(__pdata, size, gfp, cpu))) {
-			__percpu_depopulate_mask(__pdata, &populated);
-			return -ENOMEM;
-		} else
-			cpu_set(cpu, populated);
-	return 0;
-}
-EXPORT_SYMBOL_GPL(__percpu_populate_mask);
-
-/**
- * percpu_alloc_mask - initial setup of per-cpu data
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @mask: populate per-data for cpu's selected through mask bits
- *
- * Populating per-cpu data for all online cpu's would be a typical use case,
- * which is simplified by the percpu_alloc() wrapper.
- * Per-cpu objects are populated with zeroed buffers.
- */
-void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
-{
-	/*
-	 * We allocate whole cache lines to avoid false sharing
-	 */
-	size_t sz = roundup(nr_cpu_ids * sizeof(void *), cache_line_size());
-	void *pdata = kzalloc(sz, gfp);
-	void *__pdata = __percpu_disguise(pdata);
-
-	if (unlikely(!pdata))
-		return NULL;
-	if (likely(!__percpu_populate_mask(__pdata, size, gfp, mask)))
-		return __pdata;
-	kfree(pdata);
-	return NULL;
-}
-EXPORT_SYMBOL_GPL(__percpu_alloc_mask);
-
-/**
- * percpu_free - final cleanup of per-cpu data
- * @__pdata: object to clean up
- *
- * We simply clean up any per-cpu object left. No need for the client to
- * track and specify through a bis mask which per-cpu objects are to free.
- */
-void percpu_free(void *__pdata)
-{
-	if (unlikely(!__pdata))
-		return;
-	__percpu_depopulate_mask(__pdata, &cpu_possible_map);
-	kfree(__percpu_disguise(__pdata));
-}
-EXPORT_SYMBOL_GPL(percpu_free);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 28/41] Module handling: Use CPU_xx ops to dynamically allocate counters
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (26 preceding siblings ...)
  2008-05-30  3:56 ` [patch 27/41] cpu alloc: Remove the allocpercpu functionality Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 29/41] x86_64: Use CPU ops for nmi alert counter Christoph Lameter
                   ` (13 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_remove_local_t_in_kernel_module --]
[-- Type: text/plain, Size: 3133 bytes --]

Use cpu ops to deal with the per cpu data instead of a local_t. This
reduces memory requirements and cache footprint and decreases cycle counts.

Avoid a loop over all NR_CPUS entries here; use the possible map instead.
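
Condensed from the hunks below, the life cycle of the reference counter
after the conversion looks roughly like this (illustrative sketch only;
declarations and error handling omitted):

	mod->ref = CPU_ALLOC(struct module_ref, GFP_KERNEL | __GFP_ZERO);
	__CPU_WRITE(mod->ref->count, 1);	/* module_unload_init(): hold a ref during init */
	_CPU_INC(mod->ref->count);		/* __module_get(): preemption safe increment */
	for_each_online_cpu(cpu)		/* module_refcount(): fold the per cpu counts */
		total += CPU_PTR(mod->ref, cpu)->count;
	CPU_FREE(mod->ref);			/* on unload */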

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/module.h |   13 +++++--------
 kernel/module.c        |   17 +++++++----------
 2 files changed, 12 insertions(+), 18 deletions(-)

Index: linux-2.6/include/linux/module.h
===================================================================
--- linux-2.6.orig/include/linux/module.h	2008-05-21 22:41:03.000000000 -0700
+++ linux-2.6/include/linux/module.h	2008-05-21 23:19:39.000000000 -0700
@@ -219,8 +219,8 @@
 
 struct module_ref
 {
-	local_t count;
-} ____cacheline_aligned;
+	int count;
+};
 
 enum module_state
 {
@@ -307,7 +307,7 @@
 
 #ifdef CONFIG_MODULE_UNLOAD
 	/* Reference counts */
-	struct module_ref ref[NR_CPUS];
+	struct module_ref *ref;
 
 	/* What modules depend on me? */
 	struct list_head modules_which_use_me;
@@ -385,8 +385,7 @@
 {
 	if (module) {
 		BUG_ON(module_refcount(module) == 0);
-		local_inc(&module->ref[get_cpu()].count);
-		put_cpu();
+		_CPU_INC(module->ref->count);
 	}
 }
 
@@ -395,12 +394,12 @@
 	int ret = 1;
 
 	if (module) {
-		unsigned int cpu = get_cpu();
+		preempt_disable();
 		if (likely(module_is_live(module)))
-			local_inc(&module->ref[cpu].count);
+			__CPU_INC(module->ref->count);
 		else
 			ret = 0;
-		put_cpu();
+		preempt_enable();
 	}
 	return ret;
 }
Index: linux-2.6/kernel/module.c
===================================================================
--- linux-2.6.orig/kernel/module.c	2008-05-21 22:41:03.000000000 -0700
+++ linux-2.6/kernel/module.c	2008-05-21 23:17:20.000000000 -0700
@@ -366,13 +366,11 @@
 /* Init the unload section of the module. */
 static void module_unload_init(struct module *mod)
 {
-	unsigned int i;
-
 	INIT_LIST_HEAD(&mod->modules_which_use_me);
-	for (i = 0; i < NR_CPUS; i++)
-		local_set(&mod->ref[i].count, 0);
+	mod->ref = CPU_ALLOC(struct module_ref, GFP_KERNEL | __GFP_ZERO);
+
 	/* Hold reference count during initialization. */
-	local_set(&mod->ref[raw_smp_processor_id()].count, 1);
+	__CPU_WRITE(mod->ref->count, 1);
 	/* Backwards compatibility macros put refcount during init. */
 	mod->waiter = current;
 }
@@ -450,6 +448,7 @@
 				kfree(use);
 				sysfs_remove_link(i->holders_dir, mod->name);
 				/* There can be at most one match. */
+				CPU_FREE(i->ref);
 				break;
 			}
 		}
@@ -505,8 +504,8 @@
 {
 	unsigned int i, total = 0;
 
-	for (i = 0; i < NR_CPUS; i++)
-		total += local_read(&mod->ref[i].count);
+	for_each_online_cpu(i)
+		total += CPU_PTR(mod->ref, i)->count;
 	return total;
 }
 EXPORT_SYMBOL(module_refcount);
@@ -667,12 +666,12 @@
 void module_put(struct module *module)
 {
 	if (module) {
-		unsigned int cpu = get_cpu();
-		local_dec(&module->ref[cpu].count);
+		preempt_disable();
+		_CPU_DEC(module->ref->count);
 		/* Maybe they're waiting for us to drop reference? */
 		if (unlikely(!module_is_live(module)))
 			wake_up_process(module->waiter);
-		put_cpu();
+		preempt_enable();
 	}
 }
 EXPORT_SYMBOL(module_put);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 29/41] x86_64: Use CPU ops for nmi alert counter
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (27 preceding siblings ...)
  2008-05-30  3:56 ` [patch 28/41] Module handling: Use CPU_xx ops to dynamically allocate counters Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 30/41] Remove local_t support Christoph Lameter
                   ` (12 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_remove_local_t_in_nmi --]
[-- Type: text/plain, Size: 1383 bytes --]

These are critical fast paths. Reduce overhead by using a segment override
instead of an address calculation.
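
For illustration only (the exact instruction and symbol name are
assumptions, not part of the patch): on x86_64 the per cpu increment is
expected to become a single instruction with a gs segment prefix, whereas
the local_t version first had to compute the address of this cpu's
element:

	/* Old: __get_cpu_var() computes the address, then local_inc() updates it. */
	local_inc(&__get_cpu_var(alert_counter));

	/* New: one segment-prefixed instruction, roughly
	 *	incl %gs:per_cpu__alert_counter
	 */
	CPU_INC(per_cpu_var(alert_counter));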

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 arch/x86/kernel/nmi_64.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/x86/kernel/nmi_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/nmi_64.c	2008-04-29 14:55:48.000000000 -0700
+++ linux-2.6/arch/x86/kernel/nmi_64.c	2008-05-21 22:49:33.000000000 -0700
@@ -290,7 +290,7 @@
  */
 
 static DEFINE_PER_CPU(unsigned, last_irq_sum);
-static DEFINE_PER_CPU(local_t, alert_counter);
+static DEFINE_PER_CPU(int, alert_counter);
 static DEFINE_PER_CPU(int, nmi_touch);
 
 void touch_nmi_watchdog(void)
@@ -356,13 +356,13 @@
 		 * Ayiee, looks like this CPU is stuck ...
 		 * wait a few IRQs (5 seconds) before doing the oops ...
 		 */
-		local_inc(&__get_cpu_var(alert_counter));
-		if (local_read(&__get_cpu_var(alert_counter)) == 5*nmi_hz)
+		CPU_INC(per_cpu_var(alert_counter));
+		if (CPU_READ(per_cpu_var(alert_counter)) == 5*nmi_hz)
 			die_nmi("NMI Watchdog detected LOCKUP on CPU %d\n", regs,
 				panic_on_timeout);
 	} else {
 		__get_cpu_var(last_irq_sum) = sum;
-		local_set(&__get_cpu_var(alert_counter), 0);
+		CPU_WRITE(per_cpu_var(alert_counter), 0);
 	}
 
 	/* see if the nmi watchdog went off */

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 30/41] Remove local_t support
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (28 preceding siblings ...)
  2008-05-30  3:56 ` [patch 29/41] x86_64: Use CPU ops for nmi alert counter Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 31/41] VM statistics: Use CPU ops Christoph Lameter
                   ` (11 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_remove_local_t --]
[-- Type: text/plain, Size: 52632 bytes --]

There is no user of local_t remaining after the cpu ops patchset. local_t
always suffered from the problem that its operations could not perform the
relocation of the pointer to the target processor and the atomic update in
one step. Preemption and/or interrupts therefore had to be disabled around
each use, which made local_t awkward to use.
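
A minimal before/after sketch of the pattern that replaced local_t (the
counter name is illustrative; the real conversions are in the preceding
patches of this series):

	/* Before: a per cpu local_t; relocation and update are separate steps,
	 * so preemption must stay disabled across both.
	 */
	static DEFINE_PER_CPU(local_t, counter);

	local_inc(&get_cpu_var(counter));	/* get_cpu_var() pins the cpu */
	put_cpu_var(counter);

	/* After: a plain per cpu int; the cpu op does relocation and update
	 * in one operation.
	 */
	static DEFINE_PER_CPU(int, counter);

	CPU_INC(per_cpu_var(counter));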

Quirk:
- linux/module.h needs to include hardirq.h now since asm-generic/local.h did
  and some arches now depend on it (sparc64 for example).

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 Documentation/local_ops.txt   |  186 ---------------------
 arch/frv/kernel/local.h       |   59 ------
 include/asm-alpha/local.h     |  118 -------------
 include/asm-arm/local.h       |    1 
 include/asm-avr32/local.h     |    6 
 include/asm-blackfin/local.h  |    6 
 include/asm-cris/local.h      |    1 
 include/asm-frv/local.h       |    6 
 include/asm-generic/local.h   |   75 --------
 include/asm-h8300/local.h     |    6 
 include/asm-ia64/local.h      |    1 
 include/asm-m32r/local.h      |  366 ------------------------------------------
 include/asm-m68k/local.h      |    6 
 include/asm-m68knommu/local.h |    6 
 include/asm-mips/local.h      |  221 -------------------------
 include/asm-mn10300/local.h   |    1 
 include/asm-parisc/local.h    |    1 
 include/asm-powerpc/local.h   |  200 ----------------------
 include/asm-s390/local.h      |    1 
 include/asm-sh/local.h        |    7 
 include/asm-sparc/local.h     |    6 
 include/asm-sparc64/local.h   |    1 
 include/asm-um/local.h        |    6 
 include/asm-v850/local.h      |    6 
 include/asm-x86/local.h       |  235 --------------------------
 include/asm-xtensa/local.h    |   16 -
 include/linux/module.h        |    4 
 27 files changed, 3 insertions(+), 1545 deletions(-)

Index: linux-2.6/Documentation/local_ops.txt
===================================================================
--- linux-2.6.orig/Documentation/local_ops.txt	2008-05-29 10:57:34.640237763 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,186 +0,0 @@
-	     Semantics and Behavior of Local Atomic Operations
-
-			    Mathieu Desnoyers
-
-
-	This document explains the purpose of the local atomic operations, how
-to implement them for any given architecture and shows how they can be used
-properly. It also stresses on the precautions that must be taken when reading
-those local variables across CPUs when the order of memory writes matters.
-
-
-
-* Purpose of local atomic operations
-
-Local atomic operations are meant to provide fast and highly reentrant per CPU
-counters. They minimize the performance cost of standard atomic operations by
-removing the LOCK prefix and memory barriers normally required to synchronize
-across CPUs.
-
-Having fast per CPU atomic counters is interesting in many cases : it does not
-require disabling interrupts to protect from interrupt handlers and it permits
-coherent counters in NMI handlers. It is especially useful for tracing purposes
-and for various performance monitoring counters.
-
-Local atomic operations only guarantee variable modification atomicity wrt the
-CPU which owns the data. Therefore, care must taken to make sure that only one
-CPU writes to the local_t data. This is done by using per cpu data and making
-sure that we modify it from within a preemption safe context. It is however
-permitted to read local_t data from any CPU : it will then appear to be written
-out of order wrt other memory writes by the owner CPU.
-
-
-* Implementation for a given architecture
-
-It can be done by slightly modifying the standard atomic operations : only
-their UP variant must be kept. It typically means removing LOCK prefix (on
-i386 and x86_64) and any SMP sychronization barrier. If the architecture does
-not have a different behavior between SMP and UP, including asm-generic/local.h
-in your archtecture's local.h is sufficient.
-
-The local_t type is defined as an opaque signed long by embedding an
-atomic_long_t inside a structure. This is made so a cast from this type to a
-long fails. The definition looks like :
-
-typedef struct { atomic_long_t a; } local_t;
-
-
-* Rules to follow when using local atomic operations
-
-- Variables touched by local ops must be per cpu variables.
-- _Only_ the CPU owner of these variables must write to them.
-- This CPU can use local ops from any context (process, irq, softirq, nmi, ...)
-  to update its local_t variables.
-- Preemption (or interrupts) must be disabled when using local ops in
-  process context to   make sure the process won't be migrated to a
-  different CPU between getting the per-cpu variable and doing the
-  actual local op.
-- When using local ops in interrupt context, no special care must be
-  taken on a mainline kernel, since they will run on the local CPU with
-  preemption already disabled. I suggest, however, to explicitly
-  disable preemption anyway to make sure it will still work correctly on
-  -rt kernels.
-- Reading the local cpu variable will provide the current copy of the
-  variable.
-- Reads of these variables can be done from any CPU, because updates to
-  "long", aligned, variables are always atomic. Since no memory
-  synchronization is done by the writer CPU, an outdated copy of the
-  variable can be read when reading some _other_ cpu's variables.
-
-
-* How to use local atomic operations
-
-#include <linux/percpu.h>
-#include <asm/local.h>
-
-static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0);
-
-
-* Counting
-
-Counting is done on all the bits of a signed long.
-
-In preemptible context, use get_cpu_var() and put_cpu_var() around local atomic
-operations : it makes sure that preemption is disabled around write access to
-the per cpu variable. For instance :
-
-	local_inc(&get_cpu_var(counters));
-	put_cpu_var(counters);
-
-If you are already in a preemption-safe context, you can directly use
-__get_cpu_var() instead.
-
-	local_inc(&__get_cpu_var(counters));
-
-
-
-* Reading the counters
-
-Those local counters can be read from foreign CPUs to sum the count. Note that
-the data seen by local_read across CPUs must be considered to be out of order
-relatively to other memory writes happening on the CPU that owns the data.
-
-	long sum = 0;
-	for_each_online_cpu(cpu)
-		sum += local_read(&per_cpu(counters, cpu));
-
-If you want to use a remote local_read to synchronize access to a resource
-between CPUs, explicit smp_wmb() and smp_rmb() memory barriers must be used
-respectively on the writer and the reader CPUs. It would be the case if you use
-the local_t variable as a counter of bytes written in a buffer : there should
-be a smp_wmb() between the buffer write and the counter increment and also a
-smp_rmb() between the counter read and the buffer read.
-
-
-Here is a sample module which implements a basic per cpu counter using local.h.
-
---- BEGIN ---
-/* test-local.c
- *
- * Sample module for local.h usage.
- */
-
-
-#include <asm/local.h>
-#include <linux/module.h>
-#include <linux/timer.h>
-
-static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0);
-
-static struct timer_list test_timer;
-
-/* IPI called on each CPU. */
-static void test_each(void *info)
-{
-	/* Increment the counter from a non preemptible context */
-	printk("Increment on cpu %d\n", smp_processor_id());
-	local_inc(&__get_cpu_var(counters));
-
-	/* This is what incrementing the variable would look like within a
-	 * preemptible context (it disables preemption) :
-	 *
-	 * local_inc(&get_cpu_var(counters));
-	 * put_cpu_var(counters);
-	 */
-}
-
-static void do_test_timer(unsigned long data)
-{
-	int cpu;
-
-	/* Increment the counters */
-	on_each_cpu(test_each, NULL, 0, 1);
-	/* Read all the counters */
-	printk("Counters read from CPU %d\n", smp_processor_id());
-	for_each_online_cpu(cpu) {
-		printk("Read : CPU %d, count %ld\n", cpu,
-			local_read(&per_cpu(counters, cpu)));
-	}
-	del_timer(&test_timer);
-	test_timer.expires = jiffies + 1000;
-	add_timer(&test_timer);
-}
-
-static int __init test_init(void)
-{
-	/* initialize the timer that will increment the counter */
-	init_timer(&test_timer);
-	test_timer.function = do_test_timer;
-	test_timer.expires = jiffies + 1;
-	add_timer(&test_timer);
-
-	return 0;
-}
-
-static void __exit test_exit(void)
-{
-	del_timer_sync(&test_timer);
-}
-
-module_init(test_init);
-module_exit(test_exit);
-
-MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Mathieu Desnoyers");
-MODULE_DESCRIPTION("Local Atomic Ops");
---- END ---
Index: linux-2.6/include/asm-x86/local.h
===================================================================
--- linux-2.6.orig/include/asm-x86/local.h	2008-05-29 10:57:34.670237432 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,235 +0,0 @@
-#ifndef _ARCH_LOCAL_H
-#define _ARCH_LOCAL_H
-
-#include <linux/percpu.h>
-
-#include <asm/system.h>
-#include <asm/atomic.h>
-#include <asm/asm.h>
-
-typedef struct {
-	atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i)	{ ATOMIC_LONG_INIT(i) }
-
-#define local_read(l)	atomic_long_read(&(l)->a)
-#define local_set(l, i)	atomic_long_set(&(l)->a, (i))
-
-static inline void local_inc(local_t *l)
-{
-	asm volatile(_ASM_INC "%0"
-		     : "+m" (l->a.counter));
-}
-
-static inline void local_dec(local_t *l)
-{
-	asm volatile(_ASM_DEC "%0"
-		     : "+m" (l->a.counter));
-}
-
-static inline void local_add(long i, local_t *l)
-{
-	asm volatile(_ASM_ADD "%1,%0"
-		     : "+m" (l->a.counter)
-		     : "ir" (i));
-}
-
-static inline void local_sub(long i, local_t *l)
-{
-	asm volatile(_ASM_SUB "%1,%0"
-		     : "+m" (l->a.counter)
-		     : "ir" (i));
-}
-
-/**
- * local_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @l: pointer to type local_t
- *
- * Atomically subtracts @i from @l and returns
- * true if the result is zero, or false for all
- * other cases.
- */
-static inline int local_sub_and_test(long i, local_t *l)
-{
-	unsigned char c;
-
-	asm volatile(_ASM_SUB "%2,%0; sete %1"
-		     : "+m" (l->a.counter), "=qm" (c)
-		     : "ir" (i) : "memory");
-	return c;
-}
-
-/**
- * local_dec_and_test - decrement and test
- * @l: pointer to type local_t
- *
- * Atomically decrements @l by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
- */
-static inline int local_dec_and_test(local_t *l)
-{
-	unsigned char c;
-
-	asm volatile(_ASM_DEC "%0; sete %1"
-		     : "+m" (l->a.counter), "=qm" (c)
-		     : : "memory");
-	return c != 0;
-}
-
-/**
- * local_inc_and_test - increment and test
- * @l: pointer to type local_t
- *
- * Atomically increments @l by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-static inline int local_inc_and_test(local_t *l)
-{
-	unsigned char c;
-
-	asm volatile(_ASM_INC "%0; sete %1"
-		     : "+m" (l->a.counter), "=qm" (c)
-		     : : "memory");
-	return c != 0;
-}
-
-/**
- * local_add_negative - add and test if negative
- * @i: integer value to add
- * @l: pointer to type local_t
- *
- * Atomically adds @i to @l and returns true
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
-static inline int local_add_negative(long i, local_t *l)
-{
-	unsigned char c;
-
-	asm volatile(_ASM_ADD "%2,%0; sets %1"
-		     : "+m" (l->a.counter), "=qm" (c)
-		     : "ir" (i) : "memory");
-	return c;
-}
-
-/**
- * local_add_return - add and return
- * @i: integer value to add
- * @l: pointer to type local_t
- *
- * Atomically adds @i to @l and returns @i + @l
- */
-static inline long local_add_return(long i, local_t *l)
-{
-	long __i;
-#ifdef CONFIG_M386
-	unsigned long flags;
-	if (unlikely(boot_cpu_data.x86 <= 3))
-		goto no_xadd;
-#endif
-	/* Modern 486+ processor */
-	__i = i;
-	asm volatile(_ASM_XADD "%0, %1;"
-		     : "+r" (i), "+m" (l->a.counter)
-		     : : "memory");
-	return i + __i;
-
-#ifdef CONFIG_M386
-no_xadd: /* Legacy 386 processor */
-	local_irq_save(flags);
-	__i = local_read(l);
-	local_set(l, i + __i);
-	local_irq_restore(flags);
-	return i + __i;
-#endif
-}
-
-static inline long local_sub_return(long i, local_t *l)
-{
-	return local_add_return(-i, l);
-}
-
-#define local_inc_return(l)  (local_add_return(1, l))
-#define local_dec_return(l)  (local_sub_return(1, l))
-
-#define local_cmpxchg(l, o, n) \
-	(cmpxchg_local(&((l)->a.counter), (o), (n)))
-/* Always has a lock prefix */
-#define local_xchg(l, n) (xchg(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-#define local_add_unless(l, a, u)				\
-({								\
-	long c, old;						\
-	c = local_read((l));					\
-	for (;;) {						\
-		if (unlikely(c == (u)))				\
-			break;					\
-		old = local_cmpxchg((l), c, c + (a));		\
-		if (likely(old == c))				\
-			break;					\
-		c = old;					\
-	}							\
-	c != (u);						\
-})
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-/* On x86_32, these are no better than the atomic variants.
- * On x86-64 these are better than the atomic variants on SMP kernels
- * because they dont use a lock prefix.
- */
-#define __local_inc(l)		local_inc(l)
-#define __local_dec(l)		local_dec(l)
-#define __local_add(i, l)	local_add((i), (l))
-#define __local_sub(i, l)	local_sub((i), (l))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- *
- * X86_64: This could be done better if we moved the per cpu data directly
- * after GS.
- */
-
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)		\
-({					\
-	local_t res__;			\
-	preempt_disable(); 		\
-	res__ = (l);			\
-	preempt_enable();		\
-	res__;				\
-})
-#define cpu_local_wrap(l)		\
-({					\
-	preempt_disable();		\
-	(l);				\
-	preempt_enable();		\
-})					\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var((l))))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var((l)), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var((l))))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var((l))))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var((l))))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var((l))))
-
-#define __cpu_local_inc(l)	cpu_local_inc((l))
-#define __cpu_local_dec(l)	cpu_local_dec((l))
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
-#endif /* _ARCH_LOCAL_H */
Index: linux-2.6/arch/frv/kernel/local.h
===================================================================
--- linux-2.6.orig/arch/frv/kernel/local.h	2008-05-29 10:57:35.606486730 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,59 +0,0 @@
-/* local.h: local definitions
- *
- * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (dhowells@redhat.com)
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-
-#ifndef _FRV_LOCAL_H
-#define _FRV_LOCAL_H
-
-#include <asm/sections.h>
-
-#ifndef __ASSEMBLY__
-
-/* dma.c */
-extern unsigned long frv_dma_inprogress;
-
-extern void frv_dma_pause_all(void);
-extern void frv_dma_resume_all(void);
-
-/* sleep.S */
-extern asmlinkage void frv_cpu_suspend(unsigned long);
-extern asmlinkage void frv_cpu_core_sleep(void);
-
-/* setup.c */
-extern unsigned long __nongprelbss pdm_suspend_mode;
-extern void determine_clocks(int verbose);
-extern int __nongprelbss clock_p0_current;
-extern int __nongprelbss clock_cm_current;
-extern int __nongprelbss clock_cmode_current;
-
-#ifdef CONFIG_PM
-extern int __nongprelbss clock_cmodes_permitted;
-extern unsigned long __nongprelbss clock_bits_settable;
-#define CLOCK_BIT_CM		0x0000000f
-#define CLOCK_BIT_CM_H		0x00000001	/* CLKC.CM can be set to 0 */
-#define CLOCK_BIT_CM_M		0x00000002	/* CLKC.CM can be set to 1 */
-#define CLOCK_BIT_CM_L		0x00000004	/* CLKC.CM can be set to 2 */
-#define CLOCK_BIT_P0		0x00000010	/* CLKC.P0 can be changed */
-#define CLOCK_BIT_CMODE		0x00000020	/* CLKC.CMODE can be changed */
-
-extern void (*__power_switch_wake_setup)(void);
-extern int  (*__power_switch_wake_check)(void);
-extern void (*__power_switch_wake_cleanup)(void);
-#endif
-
-/* time.c */
-extern void time_divisor_init(void);
-
-/* cmode.S */
-extern asmlinkage void frv_change_cmode(int);
-
-
-#endif /* __ASSEMBLY__ */
-#endif /* _FRV_LOCAL_H */
Index: linux-2.6/include/asm-alpha/local.h
===================================================================
--- linux-2.6.orig/include/asm-alpha/local.h	2008-05-29 10:57:34.700237406 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,118 +0,0 @@
-#ifndef _ALPHA_LOCAL_H
-#define _ALPHA_LOCAL_H
-
-#include <linux/percpu.h>
-#include <asm/atomic.h>
-
-typedef struct
-{
-	atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i)	{ ATOMIC_LONG_INIT(i) }
-#define local_read(l)	atomic_long_read(&(l)->a)
-#define local_set(l,i)	atomic_long_set(&(l)->a, (i))
-#define local_inc(l)	atomic_long_inc(&(l)->a)
-#define local_dec(l)	atomic_long_dec(&(l)->a)
-#define local_add(i,l)	atomic_long_add((i),(&(l)->a))
-#define local_sub(i,l)	atomic_long_sub((i),(&(l)->a))
-
-static __inline__ long local_add_return(long i, local_t * l)
-{
-	long temp, result;
-	__asm__ __volatile__(
-	"1:	ldq_l %0,%1\n"
-	"	addq %0,%3,%2\n"
-	"	addq %0,%3,%0\n"
-	"	stq_c %0,%1\n"
-	"	beq %0,2f\n"
-	".subsection 2\n"
-	"2:	br 1b\n"
-	".previous"
-	:"=&r" (temp), "=m" (l->a.counter), "=&r" (result)
-	:"Ir" (i), "m" (l->a.counter) : "memory");
-	return result;
-}
-
-static __inline__ long local_sub_return(long i, local_t * l)
-{
-	long temp, result;
-	__asm__ __volatile__(
-	"1:	ldq_l %0,%1\n"
-	"	subq %0,%3,%2\n"
-	"	subq %0,%3,%0\n"
-	"	stq_c %0,%1\n"
-	"	beq %0,2f\n"
-	".subsection 2\n"
-	"2:	br 1b\n"
-	".previous"
-	:"=&r" (temp), "=m" (l->a.counter), "=&r" (result)
-	:"Ir" (i), "m" (l->a.counter) : "memory");
-	return result;
-}
-
-#define local_cmpxchg(l, o, n) \
-	(cmpxchg_local(&((l)->a.counter), (o), (n)))
-#define local_xchg(l, n) (xchg_local(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-#define local_add_unless(l, a, u)				\
-({								\
-	long c, old;						\
-	c = local_read(l);					\
-	for (;;) {						\
-		if (unlikely(c == (u)))				\
-			break;					\
-		old = local_cmpxchg((l), c, c + (a));	\
-		if (likely(old == c))				\
-			break;					\
-		c = old;					\
-	}							\
-	c != (u);						\
-})
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-#define local_add_negative(a, l) (local_add_return((a), (l)) < 0)
-
-#define local_dec_return(l) local_sub_return(1,(l))
-
-#define local_inc_return(l) local_add_return(1,(l))
-
-#define local_sub_and_test(i,l) (local_sub_return((i), (l)) == 0)
-
-#define local_inc_and_test(l) (local_add_return(1, (l)) == 0)
-
-#define local_dec_and_test(l) (local_sub_return(1, (l)) == 0)
-
-/* Verify if faster than atomic ops */
-#define __local_inc(l)		((l)->a.counter++)
-#define __local_dec(l)		((l)->a.counter++)
-#define __local_add(i,l)	((l)->a.counter+=(i))
-#define __local_sub(i,l)	((l)->a.counter-=(i))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- */
-#define cpu_local_read(l)	local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i)	local_set(&__get_cpu_var(l), (i))
-
-#define cpu_local_inc(l)	local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l)	local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l)	local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l)	local_sub((i), &__get_cpu_var(l))
-
-#define __cpu_local_inc(l)	__local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l)	__local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l)	__local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l)	__local_sub((i), &__get_cpu_var(l))
-
-#endif /* _ALPHA_LOCAL_H */
Index: linux-2.6/include/asm-arm/local.h
===================================================================
--- linux-2.6.orig/include/asm-arm/local.h	2008-05-29 10:57:34.836486364 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-avr32/local.h
===================================================================
--- linux-2.6.orig/include/asm-avr32/local.h	2008-05-29 10:57:34.846486525 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __ASM_AVR32_LOCAL_H
-#define __ASM_AVR32_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __ASM_AVR32_LOCAL_H */
Index: linux-2.6/include/asm-blackfin/local.h
===================================================================
--- linux-2.6.orig/include/asm-blackfin/local.h	2008-05-29 10:57:34.876486219 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __BLACKFIN_LOCAL_H
-#define __BLACKFIN_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif				/* __BLACKFIN_LOCAL_H */
Index: linux-2.6/include/asm-cris/local.h
===================================================================
--- linux-2.6.orig/include/asm-cris/local.h	2008-05-29 10:57:34.886488493 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-frv/local.h
===================================================================
--- linux-2.6.orig/include/asm-frv/local.h	2008-05-29 10:57:34.896486992 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _ASM_LOCAL_H
-#define _ASM_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* _ASM_LOCAL_H */
Index: linux-2.6/include/asm-generic/local.h
===================================================================
--- linux-2.6.orig/include/asm-generic/local.h	2008-05-29 10:57:34.906487888 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,75 +0,0 @@
-#ifndef _ASM_GENERIC_LOCAL_H
-#define _ASM_GENERIC_LOCAL_H
-
-#include <linux/percpu.h>
-#include <linux/hardirq.h>
-#include <asm/atomic.h>
-#include <asm/types.h>
-
-/*
- * A signed long type for operations which are atomic for a single CPU.
- * Usually used in combination with per-cpu variables.
- *
- * This is the default implementation, which uses atomic_long_t.  Which is
- * rather pointless.  The whole point behind local_t is that some processors
- * can perform atomic adds and subtracts in a manner which is atomic wrt IRQs
- * running on this CPU.  local_t allows exploitation of such capabilities.
- */
-
-/* Implement in terms of atomics. */
-
-/* Don't use typedef: don't want them to be mixed with atomic_t's. */
-typedef struct
-{
-	atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i)	{ ATOMIC_LONG_INIT(i) }
-
-#define local_read(l)	atomic_long_read(&(l)->a)
-#define local_set(l,i)	atomic_long_set((&(l)->a),(i))
-#define local_inc(l)	atomic_long_inc(&(l)->a)
-#define local_dec(l)	atomic_long_dec(&(l)->a)
-#define local_add(i,l)	atomic_long_add((i),(&(l)->a))
-#define local_sub(i,l)	atomic_long_sub((i),(&(l)->a))
-
-#define local_sub_and_test(i, l) atomic_long_sub_and_test((i), (&(l)->a))
-#define local_dec_and_test(l) atomic_long_dec_and_test(&(l)->a)
-#define local_inc_and_test(l) atomic_long_inc_and_test(&(l)->a)
-#define local_add_negative(i, l) atomic_long_add_negative((i), (&(l)->a))
-#define local_add_return(i, l) atomic_long_add_return((i), (&(l)->a))
-#define local_sub_return(i, l) atomic_long_sub_return((i), (&(l)->a))
-#define local_inc_return(l) atomic_long_inc_return(&(l)->a)
-
-#define local_cmpxchg(l, o, n) atomic_long_cmpxchg((&(l)->a), (o), (n))
-#define local_xchg(l, n) atomic_long_xchg((&(l)->a), (n))
-#define local_add_unless(l, a, u) atomic_long_add_unless((&(l)->a), (a), (u))
-#define local_inc_not_zero(l) atomic_long_inc_not_zero(&(l)->a)
-
-/* Non-atomic variants, ie. preemption disabled and won't be touched
- * in interrupt, etc.  Some archs can optimize this case well. */
-#define __local_inc(l)		local_set((l), local_read(l) + 1)
-#define __local_dec(l)		local_set((l), local_read(l) - 1)
-#define __local_add(i,l)	local_set((l), local_read(l) + (i))
-#define __local_sub(i,l)	local_set((l), local_read(l) - (i))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable (eg. mystruct.foo), not an address.
- */
-#define cpu_local_read(l)	local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i)	local_set(&__get_cpu_var(l), (i))
-#define cpu_local_inc(l)	local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l)	local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l)	local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l)	local_sub((i), &__get_cpu_var(l))
-
-/* Non-atomic increments, ie. preemption disabled and won't be touched
- * in interrupt, etc.  Some archs can optimize this case well.
- */
-#define __cpu_local_inc(l)	__local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l)	__local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l)	__local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l)	__local_sub((i), &__get_cpu_var(l))
-
-#endif /* _ASM_GENERIC_LOCAL_H */
Index: linux-2.6/include/asm-h8300/local.h
===================================================================
--- linux-2.6.orig/include/asm-h8300/local.h	2008-05-29 10:57:34.916488227 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _H8300_LOCAL_H_
-#define _H8300_LOCAL_H_
-
-#include <asm-generic/local.h>
-
-#endif
Index: linux-2.6/include/asm-ia64/local.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/local.h	2008-05-29 10:57:34.976486190 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-m32r/local.h
===================================================================
--- linux-2.6.orig/include/asm-m32r/local.h	2008-05-29 10:57:34.986488047 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,366 +0,0 @@
-#ifndef __M32R_LOCAL_H
-#define __M32R_LOCAL_H
-
-/*
- *  linux/include/asm-m32r/local.h
- *
- *  M32R version:
- *    Copyright (C) 2001, 2002  Hitoshi Yamamoto
- *    Copyright (C) 2004  Hirokazu Takata <takata at linux-m32r.org>
- *    Copyright (C) 2007  Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
- */
-
-#include <linux/percpu.h>
-#include <asm/assembler.h>
-#include <asm/system.h>
-#include <asm/local.h>
-
-/*
- * Atomic operations that C can't guarantee us.  Useful for
- * resource counting etc..
- */
-
-/*
- * Make sure gcc doesn't try to be clever and move things around
- * on us. We need to use _exactly_ the address the user gave us,
- * not some alias that contains the same information.
- */
-typedef struct { volatile int counter; } local_t;
-
-#define LOCAL_INIT(i)	{ (i) }
-
-/**
- * local_read - read local variable
- * @l: pointer of type local_t
- *
- * Atomically reads the value of @l.
- */
-#define local_read(l)	((l)->counter)
-
-/**
- * local_set - set local variable
- * @l: pointer of type local_t
- * @i: required value
- *
- * Atomically sets the value of @l to @i.
- */
-#define local_set(l, i)	(((l)->counter) = (i))
-
-/**
- * local_add_return - add long to local variable and return it
- * @i: long value to add
- * @l: pointer of type local_t
- *
- * Atomically adds @i to @l and return (@i + @l).
- */
-static inline long local_add_return(long i, local_t *l)
-{
-	unsigned long flags;
-	long result;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# local_add_return		\n\t"
-		DCACHE_CLEAR("%0", "r4", "%1")
-		"ld %0, @%1;			\n\t"
-		"add	%0, %2;			\n\t"
-		"st %0, @%1;			\n\t"
-		: "=&r" (result)
-		: "r" (&l->counter), "r" (i)
-		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r4"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
-	);
-	local_irq_restore(flags);
-
-	return result;
-}
-
-/**
- * local_sub_return - subtract long from local variable and return it
- * @i: long value to subtract
- * @l: pointer of type local_t
- *
- * Atomically subtracts @i from @l and return (@l - @i).
- */
-static inline long local_sub_return(long i, local_t *l)
-{
-	unsigned long flags;
-	long result;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# local_sub_return		\n\t"
-		DCACHE_CLEAR("%0", "r4", "%1")
-		"ld %0, @%1;			\n\t"
-		"sub	%0, %2;			\n\t"
-		"st %0, @%1;			\n\t"
-		: "=&r" (result)
-		: "r" (&l->counter), "r" (i)
-		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r4"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
-	);
-	local_irq_restore(flags);
-
-	return result;
-}
-
-/**
- * local_add - add long to local variable
- * @i: long value to add
- * @l: pointer of type local_t
- *
- * Atomically adds @i to @l.
- */
-#define local_add(i, l) ((void) local_add_return((i), (l)))
-
-/**
- * local_sub - subtract the local variable
- * @i: long value to subtract
- * @l: pointer of type local_t
- *
- * Atomically subtracts @i from @l.
- */
-#define local_sub(i, l) ((void) local_sub_return((i), (l)))
-
-/**
- * local_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @l: pointer of type local_t
- *
- * Atomically subtracts @i from @l and returns
- * true if the result is zero, or false for all
- * other cases.
- */
-#define local_sub_and_test(i, l) (local_sub_return((i), (l)) == 0)
-
-/**
- * local_inc_return - increment local variable and return it
- * @l: pointer of type local_t
- *
- * Atomically increments @l by 1 and returns the result.
- */
-static inline long local_inc_return(local_t *l)
-{
-	unsigned long flags;
-	long result;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# local_inc_return		\n\t"
-		DCACHE_CLEAR("%0", "r4", "%1")
-		"ld %0, @%1;			\n\t"
-		"addi	%0, #1;			\n\t"
-		"st %0, @%1;			\n\t"
-		: "=&r" (result)
-		: "r" (&l->counter)
-		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r4"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
-	);
-	local_irq_restore(flags);
-
-	return result;
-}
-
-/**
- * local_dec_return - decrement local variable and return it
- * @l: pointer of type local_t
- *
- * Atomically decrements @l by 1 and returns the result.
- */
-static inline long local_dec_return(local_t *l)
-{
-	unsigned long flags;
-	long result;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# local_dec_return		\n\t"
-		DCACHE_CLEAR("%0", "r4", "%1")
-		"ld %0, @%1;			\n\t"
-		"addi	%0, #-1;		\n\t"
-		"st %0, @%1;			\n\t"
-		: "=&r" (result)
-		: "r" (&l->counter)
-		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r4"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
-	);
-	local_irq_restore(flags);
-
-	return result;
-}
-
-/**
- * local_inc - increment local variable
- * @l: pointer of type local_t
- *
- * Atomically increments @l by 1.
- */
-#define local_inc(l) ((void)local_inc_return(l))
-
-/**
- * local_dec - decrement local variable
- * @l: pointer of type local_t
- *
- * Atomically decrements @l by 1.
- */
-#define local_dec(l) ((void)local_dec_return(l))
-
-/**
- * local_inc_and_test - increment and test
- * @l: pointer of type local_t
- *
- * Atomically increments @l by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-#define local_inc_and_test(l) (local_inc_return(l) == 0)
-
-/**
- * local_dec_and_test - decrement and test
- * @l: pointer of type local_t
- *
- * Atomically decrements @l by 1 and
- * returns true if the result is 0, or false for all
- * other cases.
- */
-#define local_dec_and_test(l) (local_dec_return(l) == 0)
-
-/**
- * local_add_negative - add and test if negative
- * @l: pointer of type local_t
- * @i: integer value to add
- *
- * Atomically adds @i to @l and returns true
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
-#define local_add_negative(i, l) (local_add_return((i), (l)) < 0)
-
-#define local_cmpxchg(l, o, n) (cmpxchg_local(&((l)->counter), (o), (n)))
-#define local_xchg(v, new) (xchg_local(&((l)->counter), new))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-static inline int local_add_unless(local_t *l, long a, long u)
-{
-	long c, old;
-	c = local_read(l);
-	for (;;) {
-		if (unlikely(c == (u)))
-			break;
-		old = local_cmpxchg((l), c, c + (a));
-		if (likely(old == c))
-			break;
-		c = old;
-	}
-	return c != (u);
-}
-
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-static inline void local_clear_mask(unsigned long  mask, local_t *addr)
-{
-	unsigned long flags;
-	unsigned long tmp;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# local_clear_mask		\n\t"
-		DCACHE_CLEAR("%0", "r5", "%1")
-		"ld %0, @%1;			\n\t"
-		"and	%0, %2;			\n\t"
-		"st %0, @%1;			\n\t"
-		: "=&r" (tmp)
-		: "r" (addr), "r" (~mask)
-		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r5"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
-	);
-	local_irq_restore(flags);
-}
-
-static inline void local_set_mask(unsigned long  mask, local_t *addr)
-{
-	unsigned long flags;
-	unsigned long tmp;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# local_set_mask		\n\t"
-		DCACHE_CLEAR("%0", "r5", "%1")
-		"ld %0, @%1;			\n\t"
-		"or	%0, %2;			\n\t"
-		"st %0, @%1;			\n\t"
-		: "=&r" (tmp)
-		: "r" (addr), "r" (mask)
-		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r5"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
-	);
-	local_irq_restore(flags);
-}
-
-/* Atomic operations are already serializing on m32r */
-#define smp_mb__before_local_dec()	barrier()
-#define smp_mb__after_local_dec()	barrier()
-#define smp_mb__before_local_inc()	barrier()
-#define smp_mb__after_local_inc()	barrier()
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- */
-
-#define __local_inc(l)		((l)->a.counter++)
-#define __local_dec(l)		((l)->a.counter++)
-#define __local_add(i, l)	((l)->a.counter += (i))
-#define __local_sub(i, l)	((l)->a.counter -= (i))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- */
-
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non local way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
-#endif /* __M32R_LOCAL_H */
Index: linux-2.6/include/asm-m68k/local.h
===================================================================
--- linux-2.6.orig/include/asm-m68k/local.h	2008-05-29 10:57:35.016486932 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _ASM_M68K_LOCAL_H
-#define _ASM_M68K_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* _ASM_M68K_LOCAL_H */
Index: linux-2.6/include/asm-m68knommu/local.h
===================================================================
--- linux-2.6.orig/include/asm-m68knommu/local.h	2008-05-29 10:57:35.036486582 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __M68KNOMMU_LOCAL_H
-#define __M68KNOMMU_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __M68KNOMMU_LOCAL_H */
Index: linux-2.6/include/asm-mips/local.h
===================================================================
--- linux-2.6.orig/include/asm-mips/local.h	2008-05-29 10:57:35.066486235 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,221 +0,0 @@
-#ifndef _ARCH_MIPS_LOCAL_H
-#define _ARCH_MIPS_LOCAL_H
-
-#include <linux/percpu.h>
-#include <linux/bitops.h>
-#include <asm/atomic.h>
-#include <asm/cmpxchg.h>
-#include <asm/war.h>
-
-typedef struct
-{
-	atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i)	{ ATOMIC_LONG_INIT(i) }
-
-#define local_read(l)	atomic_long_read(&(l)->a)
-#define local_set(l, i)	atomic_long_set(&(l)->a, (i))
-
-#define local_add(i, l)	atomic_long_add((i), (&(l)->a))
-#define local_sub(i, l)	atomic_long_sub((i), (&(l)->a))
-#define local_inc(l)	atomic_long_inc(&(l)->a)
-#define local_dec(l)	atomic_long_dec(&(l)->a)
-
-/*
- * Same as above, but return the result value
- */
-static __inline__ long local_add_return(long i, local_t * l)
-{
-	unsigned long result;
-
-	if (cpu_has_llsc && R10000_LLSC_WAR) {
-		unsigned long temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:"	__LL	"%1, %2		# local_add_return	\n"
-		"	addu	%0, %1, %3				\n"
-			__SC	"%0, %2					\n"
-		"	beqzl	%0, 1b					\n"
-		"	addu	%0, %1, %3				\n"
-		"	.set	mips0					\n"
-		: "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
-		: "Ir" (i), "m" (l->a.counter)
-		: "memory");
-	} else if (cpu_has_llsc) {
-		unsigned long temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:"	__LL	"%1, %2		# local_add_return	\n"
-		"	addu	%0, %1, %3				\n"
-			__SC	"%0, %2					\n"
-		"	beqz	%0, 1b					\n"
-		"	addu	%0, %1, %3				\n"
-		"	.set	mips0					\n"
-		: "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
-		: "Ir" (i), "m" (l->a.counter)
-		: "memory");
-	} else {
-		unsigned long flags;
-
-		local_irq_save(flags);
-		result = l->a.counter;
-		result += i;
-		l->a.counter = result;
-		local_irq_restore(flags);
-	}
-
-	return result;
-}
-
-static __inline__ long local_sub_return(long i, local_t * l)
-{
-	unsigned long result;
-
-	if (cpu_has_llsc && R10000_LLSC_WAR) {
-		unsigned long temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:"	__LL	"%1, %2		# local_sub_return	\n"
-		"	subu	%0, %1, %3				\n"
-			__SC	"%0, %2					\n"
-		"	beqzl	%0, 1b					\n"
-		"	subu	%0, %1, %3				\n"
-		"	.set	mips0					\n"
-		: "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
-		: "Ir" (i), "m" (l->a.counter)
-		: "memory");
-	} else if (cpu_has_llsc) {
-		unsigned long temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:"	__LL	"%1, %2		# local_sub_return	\n"
-		"	subu	%0, %1, %3				\n"
-			__SC	"%0, %2					\n"
-		"	beqz	%0, 1b					\n"
-		"	subu	%0, %1, %3				\n"
-		"	.set	mips0					\n"
-		: "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
-		: "Ir" (i), "m" (l->a.counter)
-		: "memory");
-	} else {
-		unsigned long flags;
-
-		local_irq_save(flags);
-		result = l->a.counter;
-		result -= i;
-		l->a.counter = result;
-		local_irq_restore(flags);
-	}
-
-	return result;
-}
-
-#define local_cmpxchg(l, o, n) \
-	((long)cmpxchg_local(&((l)->a.counter), (o), (n)))
-#define local_xchg(l, n) (xchg_local(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-#define local_add_unless(l, a, u)				\
-({								\
-	long c, old;						\
-	c = local_read(l);					\
-	while (c != (u) && (old = local_cmpxchg((l), c, c + (a))) != c) \
-		c = old;					\
-	c != (u);						\
-})
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-#define local_dec_return(l) local_sub_return(1, (l))
-#define local_inc_return(l) local_add_return(1, (l))
-
-/*
- * local_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @l: pointer of type local_t
- *
- * Atomically subtracts @i from @l and returns
- * true if the result is zero, or false for all
- * other cases.
- */
-#define local_sub_and_test(i, l) (local_sub_return((i), (l)) == 0)
-
-/*
- * local_inc_and_test - increment and test
- * @l: pointer of type local_t
- *
- * Atomically increments @l by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-#define local_inc_and_test(l) (local_inc_return(l) == 0)
-
-/*
- * local_dec_and_test - decrement by 1 and test
- * @l: pointer of type local_t
- *
- * Atomically decrements @l by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
- */
-#define local_dec_and_test(l) (local_sub_return(1, (l)) == 0)
-
-/*
- * local_add_negative - add and test if negative
- * @l: pointer of type local_t
- * @i: integer value to add
- *
- * Atomically adds @i to @l and returns true
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
-#define local_add_negative(i, l) (local_add_return(i, (l)) < 0)
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- */
-
-#define __local_inc(l)		((l)->a.counter++)
-#define __local_dec(l)		((l)->a.counter++)
-#define __local_add(i, l)	((l)->a.counter+=(i))
-#define __local_sub(i, l)	((l)->a.counter-=(i))
-
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
-#endif /* _ARCH_MIPS_LOCAL_H */
Index: linux-2.6/include/asm-parisc/local.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/local.h	2008-05-29 10:57:35.096488491 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-powerpc/local.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/local.h	2008-05-29 10:57:35.346487556 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,200 +0,0 @@
-#ifndef _ARCH_POWERPC_LOCAL_H
-#define _ARCH_POWERPC_LOCAL_H
-
-#include <linux/percpu.h>
-#include <asm/atomic.h>
-
-typedef struct
-{
-	atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i)	{ ATOMIC_LONG_INIT(i) }
-
-#define local_read(l)	atomic_long_read(&(l)->a)
-#define local_set(l,i)	atomic_long_set(&(l)->a, (i))
-
-#define local_add(i,l)	atomic_long_add((i),(&(l)->a))
-#define local_sub(i,l)	atomic_long_sub((i),(&(l)->a))
-#define local_inc(l)	atomic_long_inc(&(l)->a)
-#define local_dec(l)	atomic_long_dec(&(l)->a)
-
-static __inline__ long local_add_return(long a, local_t *l)
-{
-	long t;
-
-	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%2		# local_add_return\n\
-	add	%0,%1,%0\n"
-	PPC405_ERR77(0,%2)
-	PPC_STLCX	"%0,0,%2 \n\
-	bne-	1b"
-	: "=&r" (t)
-	: "r" (a), "r" (&(l->a.counter))
-	: "cc", "memory");
-
-	return t;
-}
-
-#define local_add_negative(a, l)	(local_add_return((a), (l)) < 0)
-
-static __inline__ long local_sub_return(long a, local_t *l)
-{
-	long t;
-
-	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%2		# local_sub_return\n\
-	subf	%0,%1,%0\n"
-	PPC405_ERR77(0,%2)
-	PPC_STLCX	"%0,0,%2 \n\
-	bne-	1b"
-	: "=&r" (t)
-	: "r" (a), "r" (&(l->a.counter))
-	: "cc", "memory");
-
-	return t;
-}
-
-static __inline__ long local_inc_return(local_t *l)
-{
-	long t;
-
-	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%1		# local_inc_return\n\
-	addic	%0,%0,1\n"
-	PPC405_ERR77(0,%1)
-	PPC_STLCX	"%0,0,%1 \n\
-	bne-	1b"
-	: "=&r" (t)
-	: "r" (&(l->a.counter))
-	: "cc", "memory");
-
-	return t;
-}
-
-/*
- * local_inc_and_test - increment and test
- * @l: pointer of type local_t
- *
- * Atomically increments @l by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-#define local_inc_and_test(l) (local_inc_return(l) == 0)
-
-static __inline__ long local_dec_return(local_t *l)
-{
-	long t;
-
-	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%1		# local_dec_return\n\
-	addic	%0,%0,-1\n"
-	PPC405_ERR77(0,%1)
-	PPC_STLCX	"%0,0,%1\n\
-	bne-	1b"
-	: "=&r" (t)
-	: "r" (&(l->a.counter))
-	: "cc", "memory");
-
-	return t;
-}
-
-#define local_cmpxchg(l, o, n) \
-	(cmpxchg_local(&((l)->a.counter), (o), (n)))
-#define local_xchg(l, n) (xchg_local(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-static __inline__ int local_add_unless(local_t *l, long a, long u)
-{
-	long t;
-
-	__asm__ __volatile__ (
-"1:"	PPC_LLARX	"%0,0,%1		# local_add_unless\n\
-	cmpw	0,%0,%3 \n\
-	beq-	2f \n\
-	add	%0,%2,%0 \n"
-	PPC405_ERR77(0,%2)
-	PPC_STLCX	"%0,0,%1 \n\
-	bne-	1b \n"
-"	subf	%0,%2,%0 \n\
-2:"
-	: "=&r" (t)
-	: "r" (&(l->a.counter)), "r" (a), "r" (u)
-	: "cc", "memory");
-
-	return t != u;
-}
-
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-#define local_sub_and_test(a, l)	(local_sub_return((a), (l)) == 0)
-#define local_dec_and_test(l)		(local_dec_return((l)) == 0)
-
-/*
- * Atomically test *l and decrement if it is greater than 0.
- * The function returns the old value of *l minus 1.
- */
-static __inline__ long local_dec_if_positive(local_t *l)
-{
-	long t;
-
-	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%1		# local_dec_if_positive\n\
-	cmpwi	%0,1\n\
-	addi	%0,%0,-1\n\
-	blt-	2f\n"
-	PPC405_ERR77(0,%1)
-	PPC_STLCX	"%0,0,%1\n\
-	bne-	1b"
-	"\n\
-2:"	: "=&b" (t)
-	: "r" (&(l->a.counter))
-	: "cc", "memory");
-
-	return t;
-}
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- */
-
-#define __local_inc(l)		((l)->a.counter++)
-#define __local_dec(l)		((l)->a.counter++)
-#define __local_add(i,l)	((l)->a.counter+=(i))
-#define __local_sub(i,l)	((l)->a.counter-=(i))
-
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
-#endif /* _ARCH_POWERPC_LOCAL_H */
Index: linux-2.6/include/asm-s390/local.h
===================================================================
--- linux-2.6.orig/include/asm-s390/local.h	2008-05-29 10:57:35.366488125 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-sh/local.h
===================================================================
--- linux-2.6.orig/include/asm-sh/local.h	2008-05-29 10:57:35.386488139 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,7 +0,0 @@
-#ifndef __ASM_SH_LOCAL_H
-#define __ASM_SH_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __ASM_SH_LOCAL_H */
-
Index: linux-2.6/include/asm-sparc/local.h
===================================================================
--- linux-2.6.orig/include/asm-sparc/local.h	2008-05-29 10:57:35.406486837 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _SPARC_LOCAL_H
-#define _SPARC_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif
Index: linux-2.6/include/asm-sparc64/local.h
===================================================================
--- linux-2.6.orig/include/asm-sparc64/local.h	2008-05-29 10:57:35.416487655 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-um/local.h
===================================================================
--- linux-2.6.orig/include/asm-um/local.h	2008-05-29 10:57:35.516486346 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __UM_LOCAL_H
-#define __UM_LOCAL_H
-
-#include "asm/arch/local.h"
-
-#endif
Index: linux-2.6/include/asm-v850/local.h
===================================================================
--- linux-2.6.orig/include/asm-v850/local.h	2008-05-29 10:57:35.536486897 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __V850_LOCAL_H__
-#define __V850_LOCAL_H__
-
-#include <asm-generic/local.h>
-
-#endif /* __V850_LOCAL_H__ */
Index: linux-2.6/include/asm-xtensa/local.h
===================================================================
--- linux-2.6.orig/include/asm-xtensa/local.h	2008-05-29 10:57:35.546488200 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,16 +0,0 @@
-/*
- * include/asm-xtensa/local.h
- *
- * This file is subject to the terms and conditions of the GNU General Public
- * License.  See the file "COPYING" in the main directory of this archive
- * for more details.
- *
- * Copyright (C) 2001 - 2005 Tensilica Inc.
- */
-
-#ifndef _XTENSA_LOCAL_H
-#define _XTENSA_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* _XTENSA_LOCAL_H */
Index: linux-2.6/include/linux/module.h
===================================================================
--- linux-2.6.orig/include/linux/module.h	2008-05-29 10:57:35.576486417 -0700
+++ linux-2.6/include/linux/module.h	2008-05-29 11:25:28.333434424 -0700
@@ -16,10 +16,12 @@
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
 #include <linux/marker.h>
-#include <asm/local.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
 
 #include <asm/module.h>
 
+
 /* Not Yet Implemented */
 #define MODULE_SUPPORTED_DEVICE(name)
 
Index: linux-2.6/include/asm-mn10300/local.h
===================================================================
--- linux-2.6.orig/include/asm-mn10300/local.h	2008-05-29 10:57:35.586487972 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 31/41] VM statistics: Use CPU ops
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (29 preceding siblings ...)
  2008-05-30  3:56 ` [patch 30/41] Remove local_t support Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 32/41] cpu alloc: Use in slub Christoph Lameter
                   ` (10 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_ops_vmstat --]
[-- Type: text/plain, Size: 1654 bytes --]

The use of CPU ops here avoids the offset calculations that we used to have
to do with per cpu operations. The result of this patch is that event counters
are updated with a single instruction, in the following way:

	incq   %gs:offset(%rip)

Without these patches this was:

	mov    %gs:0x8,%rdx
	mov    %eax,0x38(%rsp)
	mov    xxx(%rip),%eax
	mov    %eax,0x48(%rsp)
	mov    varoffset,%rax
	incq   0x110(%rax,%rdx,1)
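
As a rough illustration of why the new form needs no pointer chasing, here
is a user-space sketch of the two access patterns. All names below
(NR_CPUS_DEMO, struct vm_events_demo, cpu_offset(), count_old(),
count_new()) are invented for the example and only model the behaviour of
the cpu ops; this is not kernel code.

#include <stdio.h>
#include <stdlib.h>

#define NR_CPUS_DEMO 4

struct vm_events_demo { unsigned long event[8]; };

/* Old scheme: a per cpu pointer array that has to be chased. */
static struct vm_events_demo *events_ptr[NR_CPUS_DEMO];

/* New scheme: one linear area, each cpu at a fixed offset from the base. */
static struct vm_events_demo events_area[NR_CPUS_DEMO];

static unsigned long cpu_offset(int cpu)
{
	/* In the kernel this is the per cpu offset (%gs based on x86_64). */
	return cpu * sizeof(struct vm_events_demo);
}

static void count_old(int cpu, int item)
{
	/* Load the per cpu pointer, then dereference it: dependent accesses. */
	events_ptr[cpu]->event[item]++;
}

static void count_new(int cpu, int item)
{
	/* Pure address calculation: base + per cpu offset + field offset. */
	struct vm_events_demo *v = (struct vm_events_demo *)
			((char *)events_area + cpu_offset(cpu));
	v->event[item]++;
}

int main(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_DEMO; cpu++)
		events_ptr[cpu] = calloc(1, sizeof(struct vm_events_demo));

	count_old(1, 3);
	count_new(1, 3);
	printf("old: %lu new: %lu\n",
			events_ptr[1]->event[3], events_area[1].event[3]);
	return 0;
}

Both counters end up at 1; the difference is that count_new() performs only
an address calculation before the increment, which is what lets the kernel
version collapse into a single %gs based incq.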

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/vmstat.h |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h	2008-05-20 19:43:43.000000000 -0700
+++ linux-2.6/include/linux/vmstat.h	2008-05-20 21:40:32.000000000 -0700
@@ -63,24 +63,22 @@
 
 static inline void __count_vm_event(enum vm_event_item item)
 {
-	__get_cpu_var(vm_event_states).event[item]++;
+	__CPU_INC(per_cpu_var(vm_event_states).event[item]);
 }
 
 static inline void count_vm_event(enum vm_event_item item)
 {
-	get_cpu_var(vm_event_states).event[item]++;
-	put_cpu();
+	_CPU_INC(per_cpu_var(vm_event_states).event[item]);
 }
 
 static inline void __count_vm_events(enum vm_event_item item, long delta)
 {
-	__get_cpu_var(vm_event_states).event[item] += delta;
+	__CPU_ADD(per_cpu_var(vm_event_states).event[item], delta);
 }
 
 static inline void count_vm_events(enum vm_event_item item, long delta)
 {
-	get_cpu_var(vm_event_states).event[item] += delta;
-	put_cpu();
+	_CPU_ADD(per_cpu_var(vm_event_states).event[item], delta);
 }
 
 extern void all_vm_events(unsigned long *);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 32/41] cpu alloc: Use in slub
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (30 preceding siblings ...)
  2008-05-30  3:56 ` [patch 31/41] VM statistics: Use CPU ops Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 33/41] cpu alloc: Remove slub fields Christoph Lameter
                   ` (9 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_slub_conversion --]
[-- Type: text/plain, Size: 11012 bytes --]

Using cpu alloc removes the need for the per cpu arrays in the kmem_cache struct.
These could get quite big if we have to support systems with up to thousands of cpus.
The use of cpu_alloc means that:

1. The size of kmem_cache for SMP configurations shrinks since we only
   need one pointer instead of NR_CPUS pointers. The same pointer can be
   used by all processors. This reduces the cache footprint of the allocator.

2. We can dynamically size kmem_cache according to the nodes actually
   present in the system, meaning less memory overhead for configurations
   that potentially support up to 1k NUMA nodes.

3. We can remove the fiddling with allocating and releasing kmem_cache_cpu
   structures when bringing up and shutting down cpus. The cpu alloc logic
   does it all for us. This removes some portions of the cpu hotplug
   functionality.

4. Fastpath performance increases by 20%.
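
A minimal sketch of the data structure change, using simplified stand-in
types (kmem_cache_cpu_demo, NR_CPUS_DEMO and the *_before/*_after structs
are invented for illustration; only the fields relevant here are shown):

#include <stdio.h>

#define NR_CPUS_DEMO 4096		/* dimensioned like NR_CPUS */

struct kmem_cache_cpu_demo {
	void **freelist;
	int node;
};

/* Before: one kmem_cache_cpu pointer per possible processor embedded
 * in every kmem_cache. */
struct kmem_cache_before {
	unsigned long flags;
	int size;
	struct kmem_cache_cpu_demo *cpu_slab[NR_CPUS_DEMO];
};

/* After: a single pointer into a cpu_alloc'ed area shared by all
 * processors; each cpu reaches its own kmem_cache_cpu by adding its
 * per cpu offset (THIS_CPU()/CPU_PTR() in the patch). */
struct kmem_cache_after {
	struct kmem_cache_cpu_demo *cpu_slab;
	unsigned long flags;
	int size;
};

int main(void)
{
	printf("kmem_cache before: %zu bytes, after: %zu bytes\n",
			sizeof(struct kmem_cache_before),
			sizeof(struct kmem_cache_after));
	return 0;
}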

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 arch/x86/Kconfig         |    4 
 include/linux/slub_def.h |    6 -
 mm/slub.c                |  226 ++++++++++-------------------------------------
 3 files changed, 50 insertions(+), 186 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2008-05-27 23:56:24.000000000 -0700
+++ linux-2.6/include/linux/slub_def.h	2008-05-28 00:00:27.000000000 -0700
@@ -67,6 +67,7 @@
  * Slab cache management.
  */
 struct kmem_cache {
+	struct kmem_cache_cpu *cpu_slab;
 	/* Used for retriving partial slabs etc */
 	unsigned long flags;
 	int size;		/* The size of an object including meta data */
@@ -101,11 +102,6 @@
 	int remote_node_defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
-#ifdef CONFIG_SMP
-	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
-#else
-	struct kmem_cache_cpu cpu_slab;
-#endif
 };
 
 /*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-05-27 23:56:24.000000000 -0700
+++ linux-2.6/mm/slub.c	2008-05-28 00:00:27.000000000 -0700
@@ -258,15 +258,6 @@
 #endif
 }
 
-static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
-{
-#ifdef CONFIG_SMP
-	return s->cpu_slab[cpu];
-#else
-	return &s->cpu_slab;
-#endif
-}
-
 /* Verify that a pointer has an address that is valid within a slab page */
 static inline int check_valid_pointer(struct kmem_cache *s,
 				struct page *page, const void *object)
@@ -1120,7 +1111,7 @@
 		if (!page)
 			return NULL;
 
-		stat(get_cpu_slab(s, raw_smp_processor_id()), ORDER_FALLBACK);
+		stat(THIS_CPU(s->cpu_slab), ORDER_FALLBACK);
 	}
 	page->objects = oo_objects(oo);
 	mod_zone_page_state(page_zone(page),
@@ -1397,7 +1388,7 @@
 static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-	struct kmem_cache_cpu *c = get_cpu_slab(s, smp_processor_id());
+	struct kmem_cache_cpu *c = THIS_CPU(s->cpu_slab);
 
 	ClearSlabFrozen(page);
 	if (page->inuse) {
@@ -1428,7 +1419,7 @@
 			slab_unlock(page);
 		} else {
 			slab_unlock(page);
-			stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB);
+			stat(__THIS_CPU(s->cpu_slab), FREE_SLAB);
 			discard_slab(s, page);
 		}
 	}
@@ -1481,7 +1472,7 @@
  */
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
-	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_cpu *c = CPU_PTR(s->cpu_slab, cpu);
 
 	if (likely(c && c->page))
 		flush_slab(s, c);
@@ -1496,15 +1487,7 @@
 
 static void flush_all(struct kmem_cache *s)
 {
-#ifdef CONFIG_SMP
 	on_each_cpu(flush_cpu_slab, s, 1, 1);
-#else
-	unsigned long flags;
-
-	local_irq_save(flags);
-	flush_cpu_slab(s);
-	local_irq_restore(flags);
-#endif
 }
 
 /*
@@ -1520,6 +1503,15 @@
 	return 1;
 }
 
+static inline int cpu_node_match(struct kmem_cache_cpu *c, int node)
+{
+#ifdef CONFIG_NUMA
+	if (node != -1 && __CPU_READ(c->node) != node)
+		return 0;
+#endif
+	return 1;
+}
+
 /*
  * Slow path. The lockless freelist is empty or we need to perform
  * debugging duties.
@@ -1592,7 +1584,7 @@
 		local_irq_disable();
 
 	if (new) {
-		c = get_cpu_slab(s, smp_processor_id());
+		c = __THIS_CPU(s->cpu_slab);
 		stat(c, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
@@ -1630,20 +1622,20 @@
 	unsigned long flags;
 
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
-	if (unlikely(!c->freelist || !node_match(c, node)))
+	c = __THIS_CPU(s->cpu_slab);
+	object = c->freelist;
+	if (unlikely(!object || !node_match(c, node)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		object = c->freelist;
 		c->freelist = object[c->offset];
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
 
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
-		memset(object, 0, c->objsize);
+		memset(object, 0, s->objsize);
 
 	return object;
 }
@@ -1677,7 +1669,7 @@
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
 
-	c = get_cpu_slab(s, raw_smp_processor_id());
+	c = __THIS_CPU(s->cpu_slab);
 	stat(c, FREE_SLOWPATH);
 	slab_lock(page);
 
@@ -1748,7 +1740,7 @@
 	unsigned long flags;
 
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
+	c = __THIS_CPU(s->cpu_slab);
 	debug_check_no_locks_freed(object, c->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
@@ -1962,130 +1954,19 @@
 #endif
 }
 
-#ifdef CONFIG_SMP
-/*
- * Per cpu array for per cpu structures.
- *
- * The per cpu array places all kmem_cache_cpu structures from one processor
- * close together meaning that it becomes possible that multiple per cpu
- * structures are contained in one cacheline. This may be particularly
- * beneficial for the kmalloc caches.
- *
- * A desktop system typically has around 60-80 slabs. With 100 here we are
- * likely able to get per cpu structures for all caches from the array defined
- * here. We must be able to cover all kmalloc caches during bootstrap.
- *
- * If the per cpu array is exhausted then fall back to kmalloc
- * of individual cachelines. No sharing is possible then.
- */
-#define NR_KMEM_CACHE_CPU 100
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu,
-				kmem_cache_cpu)[NR_KMEM_CACHE_CPU];
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
-static cpumask_t kmem_cach_cpu_free_init_once = CPU_MASK_NONE;
-
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
-							int cpu, gfp_t flags)
-{
-	struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
-
-	if (c)
-		per_cpu(kmem_cache_cpu_free, cpu) =
-				(void *)c->freelist;
-	else {
-		/* Table overflow: So allocate ourselves */
-		c = kmalloc_node(
-			ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
-			flags, cpu_to_node(cpu));
-		if (!c)
-			return NULL;
-	}
-
-	init_kmem_cache_cpu(s, c);
-	return c;
-}
-
-static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
-{
-	if (c < per_cpu(kmem_cache_cpu, cpu) ||
-			c > per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
-		kfree(c);
-		return;
-	}
-	c->freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
-	per_cpu(kmem_cache_cpu_free, cpu) = c;
-}
-
-static void free_kmem_cache_cpus(struct kmem_cache *s)
-{
-	int cpu;
-
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c) {
-			s->cpu_slab[cpu] = NULL;
-			free_kmem_cache_cpu(c, cpu);
-		}
-	}
-}
-
 static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
 	int cpu;
 
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c)
-			continue;
-
-		c = alloc_kmem_cache_cpu(s, cpu, flags);
-		if (!c) {
-			free_kmem_cache_cpus(s);
-			return 0;
-		}
-		s->cpu_slab[cpu] = c;
-	}
-	return 1;
-}
-
-/*
- * Initialize the per cpu array.
- */
-static void init_alloc_cpu_cpu(int cpu)
-{
-	int i;
-
-	if (cpu_isset(cpu, kmem_cach_cpu_free_init_once))
-		return;
-
-	for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
-		free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);
+	s->cpu_slab = CPU_ALLOC(struct kmem_cache_cpu, flags);
 
-	cpu_set(cpu, kmem_cach_cpu_free_init_once);
-}
-
-static void __init init_alloc_cpu(void)
-{
-	int cpu;
+	if (!s->cpu_slab)
+		return 0;
 
 	for_each_online_cpu(cpu)
-		init_alloc_cpu_cpu(cpu);
-  }
-
-#else
-static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
-static inline void init_alloc_cpu(void) {}
-
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	init_kmem_cache_cpu(s, &s->cpu_slab);
+		init_kmem_cache_cpu(s, CPU_PTR(s->cpu_slab, cpu));
 	return 1;
 }
-#endif
 
 #ifdef CONFIG_NUMA
 /*
@@ -2446,9 +2327,8 @@
 	int node;
 
 	flush_all(s);
-
+	CPU_FREE(s->cpu_slab);
 	/* Attempt to free all objects */
-	free_kmem_cache_cpus(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
@@ -2966,8 +2846,6 @@
 	int i;
 	int caches = 0;
 
-	init_alloc_cpu();
-
 #ifdef CONFIG_NUMA
 	/*
 	 * Must first have the slab cache available for the allocations of the
@@ -3027,11 +2905,12 @@
 	for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++)
 		kmalloc_caches[i]. name =
 			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
-
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
-	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
-				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_size = offsetof(struct kmem_cache, node) +
+				nr_node_ids * sizeof(struct kmem_cache_node *);
 #else
 	kmem_size = sizeof(struct kmem_cache);
 #endif
@@ -3128,7 +3007,7 @@
 		 * per cpu structures
 		 */
 		for_each_online_cpu(cpu)
-			get_cpu_slab(s, cpu)->objsize = s->objsize;
+			CPU_PTR(s->cpu_slab, cpu)->objsize = s->objsize;
 
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
@@ -3176,11 +3055,9 @@
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		init_alloc_cpu_cpu(cpu);
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list)
-			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu,
-							GFP_KERNEL);
+			init_kmem_cache_cpu(s, CPU_PTR(s->cpu_slab, cpu));
 		up_read(&slub_lock);
 		break;
 
@@ -3190,13 +3067,9 @@
 	case CPU_DEAD_FROZEN:
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
 			local_irq_save(flags);
 			__flush_cpu_slab(s, cpu);
 			local_irq_restore(flags);
-			free_kmem_cache_cpu(c, cpu);
-			s->cpu_slab[cpu] = NULL;
 		}
 		up_read(&slub_lock);
 		break;
@@ -3687,7 +3560,7 @@
 		int cpu;
 
 		for_each_possible_cpu(cpu) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+			struct kmem_cache_cpu *c = CPU_PTR(s->cpu_slab, cpu);
 
 			if (!c || c->node < 0)
 				continue;
@@ -4092,7 +3965,7 @@
 		return -ENOMEM;
 
 	for_each_online_cpu(cpu) {
-		unsigned x = get_cpu_slab(s, cpu)->stat[si];
+		unsigned x = CPU_PTR(s->cpu_slab, cpu)->stat[si];
 
 		data[cpu] = x;
 		sum += x;

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 33/41] cpu alloc: Remove slub fields
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (31 preceding siblings ...)
  2008-05-30  3:56 ` [patch 32/41] cpu alloc: Use in slub Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 34/41] cpu alloc: Page allocator conversion Christoph Lameter
                   ` (8 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_remove_slub_fields --]
[-- Type: text/plain, Size: 4947 bytes --]

Remove the fields in kmem_cache_cpu that were used to cache data from
kmem_cache while the two structures were in different cachelines. The
cacheline that holds the per cpu array pointer now also holds these values,
so the size of kmem_cache_cpu can be cut almost in half.

The get_freepointer() and set_freepointer() functions that used to be
intended only for the slow path are now also useful for the hot path, since
accessing the field no longer requires touching an additional cacheline.
This results in consistent handling of object free pointers throughout
SLUB.
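
For reference, the free pointer handling that now serves both paths can be
modelled as in the following self-contained sketch. The kmem_cache_demo
type is a stand-in; the real get_freepointer()/set_freepointer() take a
struct kmem_cache and live in mm/slub.c.

#include <stdio.h>

/* Minimal model: the free pointer of an object is stored at a fixed
 * byte offset inside the object, recorded in the cache structure. */
struct kmem_cache_demo {
	unsigned long offset;		/* like kmem_cache->offset */
};

static void *get_freepointer(struct kmem_cache_demo *s, void *object)
{
	return *(void **)((char *)object + s->offset);
}

static void set_freepointer(struct kmem_cache_demo *s, void *object, void *fp)
{
	*(void **)((char *)object + s->offset) = fp;
}

int main(void)
{
	struct kmem_cache_demo s = { .offset = 0 };
	void *obj_a[4], *obj_b[4];	/* two fake objects, 32 bytes each */

	/* Build a two element freelist: obj_a -> obj_b -> NULL */
	set_freepointer(&s, obj_b, NULL);
	set_freepointer(&s, obj_a, obj_b);

	printf("next after obj_a is obj_b: %d\n",
			get_freepointer(&s, obj_a) == (void *)obj_b);
	return 0;
}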

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/slub_def.h |    3 --
 mm/slub.c                |   48 +++++++++++++++--------------------------------
 2 files changed, 16 insertions(+), 35 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2008-05-28 00:00:27.000000000 -0700
+++ linux-2.6/include/linux/slub_def.h	2008-05-28 00:00:33.000000000 -0700
@@ -36,8 +36,6 @@
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
-	unsigned int offset;	/* Freepointer offset (in word units) */
-	unsigned int objsize;	/* Size of an object (from kmem_cache) */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-05-28 00:00:27.000000000 -0700
+++ linux-2.6/mm/slub.c	2008-05-28 00:00:33.000000000 -0700
@@ -276,13 +276,6 @@
 	return 1;
 }
 
-/*
- * Slow version of get and set free pointer.
- *
- * This version requires touching the cache lines of kmem_cache which
- * we avoid to do in the fast alloc free paths. There we obtain the offset
- * from the page struct.
- */
 static inline void *get_freepointer(struct kmem_cache *s, void *object)
 {
 	return *(void **)(object + s->offset);
@@ -1447,10 +1440,10 @@
 
 		/* Retrieve object from cpu_freelist */
 		object = c->freelist;
-		c->freelist = c->freelist[c->offset];
+		c->freelist = get_freepointer(s, c->freelist);
 
 		/* And put onto the regular freelist */
-		object[c->offset] = page->freelist;
+		set_freepointer(s, object, page->freelist);
 		page->freelist = object;
 		page->inuse--;
 	}
@@ -1555,7 +1548,7 @@
 	if (unlikely(SlabDebug(c->page)))
 		goto debug;
 
-	c->freelist = object[c->offset];
+	c->freelist = get_freepointer(s, object);
 	c->page->inuse = c->page->objects;
 	c->page->freelist = NULL;
 	c->node = page_to_nid(c->page);
@@ -1599,7 +1592,7 @@
 		goto another_slab;
 
 	c->page->inuse++;
-	c->page->freelist = object[c->offset];
+	c->page->freelist = get_freepointer(s, object);
 	c->node = -1;
 	goto unlock_out;
 }
@@ -1629,7 +1622,7 @@
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		c->freelist = object[c->offset];
+		c->freelist = get_freepointer(s, object);
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
@@ -1663,7 +1656,7 @@
  * handling required then we can return immediately.
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-				void *x, void *addr, unsigned int offset)
+				void *x, void *addr)
 {
 	void *prior;
 	void **object = (void *)x;
@@ -1677,7 +1670,8 @@
 		goto debug;
 
 checks_ok:
-	prior = object[offset] = page->freelist;
+	prior = page->freelist;
+	set_freepointer(s, object, prior);
 	page->freelist = object;
 	page->inuse--;
 
@@ -1741,15 +1735,15 @@
 
 	local_irq_save(flags);
 	c = __THIS_CPU(s->cpu_slab);
-	debug_check_no_locks_freed(object, c->objsize);
+	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
 	if (likely(page == c->page && c->node >= 0)) {
-		object[c->offset] = c->freelist;
+		set_freepointer(s, object, c->freelist);
 		c->freelist = object;
 		stat(c, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr, c->offset);
+		__slab_free(s, page, x, addr);
 
 	local_irq_restore(flags);
 }
@@ -1936,8 +1930,6 @@
 	c->page = NULL;
 	c->freelist = NULL;
 	c->node = 0;
-	c->offset = s->offset / sizeof(void *);
-	c->objsize = s->objsize;
 #ifdef CONFIG_SLUB_STATS
 	memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned));
 #endif
@@ -2993,8 +2985,6 @@
 	down_write(&slub_lock);
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
-		int cpu;
-
 		s->refcount++;
 		/*
 		 * Adjust the object sizes so that we clear
@@ -3002,13 +2992,6 @@
 		 */
 		s->objsize = max(s->objsize, (int)size);
 
-		/*
-		 * And then we need to update the object size in the
-		 * per cpu structures
-		 */
-		for_each_online_cpu(cpu)
-			CPU_PTR(s->cpu_slab, cpu)->objsize = s->objsize;
-
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
 

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 34/41] cpu alloc: Page allocator conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (32 preceding siblings ...)
  2008-05-30  3:56 ` [patch 33/41] cpu alloc: Remove slub fields Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 35/41] Support for CPU ops Christoph Lameter
                   ` (7 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_page_allocator_conversion --]
[-- Type: text/plain, Size: 15188 bytes --]

Use the new cpu_alloc functionality to avoid per cpu arrays in struct zone.
This drastically reduces the size of struct zone for systems with a large
number of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Surprisingly this clears up much of the painful NUMA bringup. Bootstrap
becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs are
reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since hotplug already brings up a
percpu area which comes with a per cpu alloc area. So there is no need to
allocate or free individual pagesets anymore.
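
The resulting access pattern can be sketched in user space as follows.
THIS_CPU_DEMO()/CPU_PTR_DEMO() are stand-ins for the cpu_alloc accessors,
implemented here as plain array indexing; all *_demo names are invented for
the example.

#include <stdio.h>

#define NR_CPUS_DEMO 4

struct per_cpu_pages_demo { int count, high, batch; };
struct per_cpu_pageset_demo { struct per_cpu_pages_demo pcp; };

/* Stand-ins for the cpu_alloc accessors: one contiguous area with one
 * per_cpu_pageset per cpu, reached by a fixed stride from the base. */
static int current_cpu(void) { return 0; }	/* fake smp_processor_id() */
#define CPU_PTR_DEMO(base, cpu)	(&(base)[cpu])
#define THIS_CPU_DEMO(base)	CPU_PTR_DEMO(base, current_cpu())

static struct per_cpu_pageset_demo pageset_area[NR_CPUS_DEMO];

struct zone_demo {
	/* one pointer instead of NR_CPUS pointers/structs in struct zone */
	struct per_cpu_pageset_demo *pageset;
};

int main(void)
{
	struct zone_demo zone = { .pageset = pageset_area };
	struct per_cpu_pages_demo *pcp;

	/* free_hot_cold_page() style access on the local cpu */
	pcp = &THIS_CPU_DEMO(zone.pageset)->pcp;
	pcp->count++;

	/* show_free_areas() style access for a specific cpu */
	printf("cpu0 pcp count: %d\n",
			CPU_PTR_DEMO(zone.pageset, 0)->pcp.count);
	return 0;
}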

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 arch/x86/kernel/setup.c |    6 +
 include/linux/gfp.h     |    1 
 include/linux/mm.h      |    4 -
 include/linux/mmzone.h  |   12 ---
 init/main.c             |    5 +
 mm/page_alloc.c         |  172 +++++++++++++++++++-----------------------------
 mm/vmstat.c             |   14 ++-
 7 files changed, 93 insertions(+), 121 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2008-05-09 18:46:19.000000000 -0700
+++ linux-2.6/include/linux/mm.h	2008-05-29 19:13:54.000000000 -0700
@@ -1024,11 +1024,7 @@ extern void show_mem(void);
 extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 
-#ifdef CONFIG_NUMA
 extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif
 
 /* prio_tree.c */
 void vma_prio_tree_add(struct vm_area_struct *, struct vm_area_struct *old);
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2008-05-28 11:16:24.000000000 -0700
+++ linux-2.6/include/linux/mmzone.h	2008-05-29 19:13:54.000000000 -0700
@@ -123,13 +123,7 @@ struct per_cpu_pageset {
 	s8 stat_threshold;
 	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
 #endif
-} ____cacheline_aligned_in_smp;
-
-#ifdef CONFIG_NUMA
-#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
-#else
-#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
-#endif
+};
 
 #endif /* !__GENERATING_BOUNDS.H */
 
@@ -224,10 +218,8 @@ struct zone {
 	 */
 	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
-	struct per_cpu_pageset	*pageset[NR_CPUS];
-#else
-	struct per_cpu_pageset	pageset[NR_CPUS];
 #endif
+	struct per_cpu_pageset	*pageset;
 	/*
 	 * free areas of different sizes
 	 */
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2008-05-28 11:16:24.000000000 -0700
+++ linux-2.6/mm/page_alloc.c	2008-05-29 19:39:58.000000000 -0700
@@ -923,7 +923,7 @@ static void drain_pages(unsigned int cpu
 		if (!populated_zone(zone))
 			continue;
 
-		pset = zone_pcp(zone, cpu);
+		pset = CPU_PTR(zone->pageset, cpu);
 
 		pcp = &pset->pcp;
 		local_irq_save(flags);
@@ -1006,8 +1006,8 @@ static void free_hot_cold_page(struct pa
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);
 
-	pcp = &zone_pcp(zone, get_cpu())->pcp;
 	local_irq_save(flags);
+	pcp = &THIS_CPU(zone->pageset)->pcp;
 	__count_vm_event(PGFREE);
 	if (cold)
 		list_add_tail(&page->lru, &pcp->list);
@@ -1020,7 +1020,6 @@ static void free_hot_cold_page(struct pa
 		pcp->count -= pcp->batch;
 	}
 	local_irq_restore(flags);
-	put_cpu();
 }
 
 void free_hot_page(struct page *page)
@@ -1062,16 +1061,14 @@ static struct page *buffered_rmqueue(str
 	unsigned long flags;
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
-	int cpu;
 	int migratetype = allocflags_to_migratetype(gfp_flags);
 
 again:
-	cpu  = get_cpu();
 	if (likely(order == 0)) {
 		struct per_cpu_pages *pcp;
 
-		pcp = &zone_pcp(zone, cpu)->pcp;
 		local_irq_save(flags);
+		pcp = &THIS_CPU(zone->pageset)->pcp;
 		if (!pcp->count) {
 			pcp->count = rmqueue_bulk(zone, 0,
 					pcp->batch, &pcp->list, migratetype);
@@ -1110,7 +1107,6 @@ again:
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
-	put_cpu();
 
 	VM_BUG_ON(bad_range(zone, page));
 	if (prep_new_page(page, order, gfp_flags))
@@ -1119,7 +1115,6 @@ again:
 
 failed:
 	local_irq_restore(flags);
-	put_cpu();
 	return NULL;
 }
 
@@ -1836,7 +1831,7 @@ void show_free_areas(void)
 		for_each_online_cpu(cpu) {
 			struct per_cpu_pageset *pageset;
 
-			pageset = zone_pcp(zone, cpu);
+			pageset = CPU_PTR(zone->pageset, cpu);
 
 			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
 			       cpu, pageset->pcp.high,
@@ -2670,82 +2665,77 @@ static void setup_pagelist_highmark(stru
 		pcp->batch = PAGE_SHIFT * 8;
 }
 
-
-#ifdef CONFIG_NUMA
 /*
- * Boot pageset table. One per cpu which is going to be used for all
- * zones and all nodes. The parameters will be set in such a way
- * that an item put on a list will immediately be handed over to
- * the buddy list. This is safe since pageset manipulation is done
- * with interrupts disabled.
- *
- * Some NUMA counter updates may also be caught by the boot pagesets.
+ * The boot_pageset enables bootstrapping of the page allocator
+ * before pagesets can be allocated.
  *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
- *
- * zoneinfo_show() and maybe other functions do
- * not check if the processor is online before following the pageset pointer.
- * Other parts of the kernel may not check if the zone is available.
+ * The boot pageset is configured in such a way that there will be no pages
+ * permanently queued. A page is added to the list and then we reach the
+ * highwater mark and the queue is drained.
+ *
+ * All zone pageset pointers for zones not activated by process_zones() point
+ * to the boot_pageset. Only one processor may be using the pageset at a time
+ * though. So only a single processor may perform bootstrap.
+ */
+static struct per_cpu_pageset boot_pageset = {
+	{
+		.count = 0,
+		.high = 1,
+		.batch = 1,
+		.list = LIST_HEAD_INIT(boot_pageset.pcp.list)
+	}
+};
+
+/*
+ * Initialize a pageset pointer during early boot.
+ * We need to undo the effect that THIS_CPU() would have in order to
+ * have CPU_PTR() return a pointer to the boot pageset.
  */
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
+static void setup_zone_boot_pageset(struct zone *zone)
+{
+	zone->pageset = SHIFT_PERCPU_PTR(&boot_pageset, -my_cpu_offset);
+}
+
+void __cpuinit setup_boot_pagesets(void)
+{
+	struct zone *zone;
+
+	for_each_zone(zone)
+		if (populated_zone(zone))
+			setup_zone_boot_pageset(zone);
+}
 
 /*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
+ * Prepare the pagesets in struct zone.
  */
-static int __cpuinit process_zones(int cpu)
+static void __cpuinit process_zones(int cpu)
 {
-	struct zone *zone, *dzone;
+	struct zone *zone;
 	int node = cpu_to_node(cpu);
 
 	node_set_state(node, N_CPU);	/* this node has a cpu */
 
 	for_each_zone(zone) {
+		struct per_cpu_pageset *pcp;
 
 		if (!populated_zone(zone))
 			continue;
 
-		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
-					 GFP_KERNEL, node);
-		if (!zone_pcp(zone, cpu))
-			goto bad;
+		if (CPU_PTR(zone->pageset, cpu) == &boot_pageset)
+			zone->pageset = CPU_ALLOC(struct per_cpu_pageset,
+						GFP_KERNEL|__GFP_ZERO);
 
-		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
+		pcp = CPU_PTR(zone->pageset, cpu);
+		setup_pageset(pcp, zone_batchsize(zone));
 
 		if (percpu_pagelist_fraction)
-			setup_pagelist_highmark(zone_pcp(zone, cpu),
-			 	(zone->present_pages / percpu_pagelist_fraction));
-	}
-
-	return 0;
-bad:
-	for_each_zone(dzone) {
-		if (!populated_zone(dzone))
-			continue;
-		if (dzone == zone)
-			break;
-		kfree(zone_pcp(dzone, cpu));
-		zone_pcp(dzone, cpu) = NULL;
-	}
-	return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
-	struct zone *zone;
-
-	for_each_zone(zone) {
-		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
+			setup_pagelist_highmark(pcp, zone->present_pages /
+						percpu_pagelist_fraction);
 
-		/* Free per_cpu_pageset if it is slab allocated */
-		if (pset != &boot_pageset[cpu])
-			kfree(pset);
-		zone_pcp(zone, cpu) = NULL;
 	}
 }
 
+#ifdef CONFIG_SMP
 static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
 		unsigned long action,
 		void *hcpu)
@@ -2756,14 +2746,7 @@ static int __cpuinit pageset_cpuup_callb
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		if (process_zones(cpu))
-			ret = NOTIFY_BAD;
-		break;
-	case CPU_UP_CANCELED:
-	case CPU_UP_CANCELED_FROZEN:
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		free_zone_pagesets(cpu);
+		process_zones(cpu);
 		break;
 	default:
 		break;
@@ -2773,21 +2756,20 @@ static int __cpuinit pageset_cpuup_callb
 
 static struct notifier_block __cpuinitdata pageset_notifier =
 	{ &pageset_cpuup_callback, NULL, 0 };
+#endif
 
 void __init setup_per_cpu_pageset(void)
 {
-	int err;
-
-	/* Initialize per_cpu_pageset for cpu 0.
+	/*
+	 * Initialize per_cpu settings for the boot cpu.
 	 * A cpuup callback will do this for every cpu
-	 * as it comes online
+	 * as it comes online.
 	 */
-	err = process_zones(smp_processor_id());
-	BUG_ON(err);
+	process_zones(smp_processor_id());
+#ifdef CONFIG_SMP
 	register_cpu_notifier(&pageset_notifier);
-}
-
 #endif
+}
 
 static noinline __init_refok
 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
@@ -2832,25 +2814,6 @@ int zone_wait_table_init(struct zone *zo
 	return 0;
 }
 
-static __meminit void zone_pcp_init(struct zone *zone)
-{
-	int cpu;
-	unsigned long batch = zone_batchsize(zone);
-
-	for (cpu = 0; cpu < NR_CPUS; cpu++) {
-#ifdef CONFIG_NUMA
-		/* Early boot. Slab allocator not functional yet */
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-		setup_pageset(&boot_pageset[cpu],0);
-#else
-		setup_pageset(zone_pcp(zone,cpu), batch);
-#endif
-	}
-	if (zone->present_pages)
-		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
-			zone->name, zone->present_pages, batch);
-}
-
 __meminit int init_currently_empty_zone(struct zone *zone,
 					unsigned long zone_start_pfn,
 					unsigned long size,
@@ -3420,7 +3383,12 @@ static void __paginginit free_area_init_
 
 		zone->prev_priority = DEF_PRIORITY;
 
-		zone_pcp_init(zone);
+		setup_zone_boot_pageset(zone);
+		if (zone->present_pages)
+			printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
+				zone->name, zone->present_pages,
+				zone_batchsize(zone));
+
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
 		zone->nr_scan_active = 0;
@@ -4295,11 +4263,13 @@ int percpu_pagelist_fraction_sysctl_hand
 	ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
 	if (!write || (ret == -EINVAL))
 		return ret;
-	for_each_zone(zone) {
-		for_each_online_cpu(cpu) {
+	for_each_online_cpu(cpu) {
+		for_each_zone(zone) {
 			unsigned long  high;
+
 			high = zone->present_pages / percpu_pagelist_fraction;
-			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+			setup_pagelist_highmark(CPU_PTR(zone->pageset, cpu),
+									high);
 		}
 	}
 	return 0;
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2008-05-29 19:13:53.000000000 -0700
+++ linux-2.6/mm/vmstat.c	2008-05-29 19:13:54.000000000 -0700
@@ -142,7 +142,8 @@ static void refresh_zone_stat_thresholds
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
-			zone_pcp(zone, cpu)->stat_threshold = threshold;
+			CPU_PTR(zone->pageset, cpu)->stat_threshold
+							= threshold;
 	}
 }
 
@@ -152,7 +153,8 @@ static void refresh_zone_stat_thresholds
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = THIS_CPU(zone->pageset);
+
 	s8 *p = pcp->vm_stat_diff + item;
 	long x;
 
@@ -205,7 +207,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = THIS_CPU(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)++;
@@ -226,7 +228,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);
 
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = THIS_CPU(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)--;
@@ -306,7 +308,7 @@ void refresh_cpu_vm_stats(int cpu)
 		if (!populated_zone(zone))
 			continue;
 
-		p = zone_pcp(zone, cpu);
+		p = CPU_PTR(zone->pageset, cpu);
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 			if (p->vm_stat_diff[i]) {
@@ -698,7 +700,7 @@ static void zoneinfo_show_print(struct s
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
 
-		pageset = zone_pcp(zone, i);
+		pageset = CPU_PTR(zone->pageset, i);
 		seq_printf(m,
 			   "\n    cpu: %i"
 			   "\n              count: %i"
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c	2008-05-29 19:13:53.000000000 -0700
+++ linux-2.6/arch/x86/kernel/setup.c	2008-05-29 19:40:08.000000000 -0700
@@ -125,6 +125,12 @@ void __init setup_per_cpu_areas(void)
 		highest_cpu = i;
 	}
 
+	/*
+	 * The per_cpu offsets have changed and therefore the pageset
+	 * pointers need to be updated.
+	 */
+	setup_boot_pagesets();
+
 	nr_cpu_ids = highest_cpu + 1;
 	printk(KERN_DEBUG "NR_CPUS: %d, nr_cpu_ids: %d\n", NR_CPUS, nr_cpu_ids);
 
Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h	2008-04-29 12:13:29.000000000 -0700
+++ linux-2.6/include/linux/gfp.h	2008-05-29 19:13:54.000000000 -0700
@@ -233,5 +233,6 @@ void page_alloc_init(void);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(void);
 void drain_local_pages(void *dummy);
+void setup_boot_pagesets(void);
 
 #endif /* __LINUX_GFP_H */
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c	2008-05-29 19:13:53.000000000 -0700
+++ linux-2.6/init/main.c	2008-05-29 19:13:54.000000000 -0700
@@ -405,6 +405,11 @@ static void __init setup_per_cpu_areas(v
 		memcpy(ptr, __per_cpu_start, __per_cpu_size);
 		ptr += __per_cpu_size;
 	}
+	/*
+	 * __per_cpu_offset[] have changed. Need to update the
+	 * pointers to the boot page set.
+	 */
+	setup_boot_pagesets();
 }
 #endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
 

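Not part of the patch: a small standalone C sketch of the pointer trick used by
setup_zone_boot_pageset() above. SHIFT_PTR(), THIS_CPU() and my_cpu_offset are
simplified stand-ins for SHIFT_PERCPU_PTR(), THIS_CPU() and the boot cpu's real
per cpu offset; the point is only that pre-shifting by -my_cpu_offset makes the
later THIS_CPU() lookup land back on &boot_pageset.

/*
 * Standalone sketch, not kernel code.
 */
#include <stdio.h>

static int boot_pageset;		/* stands in for the real structure */
static long my_cpu_offset = 0x1000;	/* whatever the boot cpu's offset is */

#define SHIFT_PTR(p, off)	((void *)((char *)(p) + (off)))
#define THIS_CPU(p)		SHIFT_PTR((p), my_cpu_offset)

int main(void)
{
	/* what setup_zone_boot_pageset() stores in zone->pageset */
	void *pageset = SHIFT_PTR(&boot_pageset, -my_cpu_offset);

	/* THIS_CPU() adds the offset back: we land on &boot_pageset again */
	printf("%d\n", THIS_CPU(pageset) == (void *)&boot_pageset);
	return 0;
}
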
-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 35/41] Support for CPU ops
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (33 preceding siblings ...)
  2008-05-30  3:56 ` [patch 34/41] cpu alloc: Page allocator conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
  2008-05-30  3:56 ` [patch 36/41] Zero based percpu: Infrastructure to rebase the per cpu area to zero Christoph Lameter
                   ` (6 subsequent siblings)
  41 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, Tony.Luck, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: ia64_cpu_ops --]
[-- Type: text/plain, Size: 10388 bytes --]

IA64 has no efficient atomic operations. But we can get rid of the need to
add my_percpu_offset(): the address of a per cpu variable can be used directly
on IA64 since it is mapped to a per processor area.

This also allows us to kill off the __ia64_get_cpu_var() macro. It is nothing
but per_cpu_var().

Cc: Tony.Luck@intel.com
Signed-off-by: Christoph Lameter <clameter@sgi.com>
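
As an aside, a standalone sketch (not from this patch) of how the three groups
of ops below are meant to layer: __CPU_* assumes the caller already prevents
rescheduling, _CPU_* disables preemption itself, and CPU_* is additionally
interrupt safe. The preempt/irq macros here are userspace stand-ins for the
kernel primitives.

/*
 * Standalone illustration, not kernel code.
 */
#include <stdio.h>

#define preempt_disable()		do { } while (0)	/* stand-in */
#define preempt_enable()		do { } while (0)	/* stand-in */
#define local_irq_save(flags)		do { (flags) = 0; } while (0)
#define local_irq_restore(flags)	do { (void)(flags); } while (0)

#define __CPU_ADD(var, value)	((var) += (value))	/* caller serializes */

#define _CPU_ADD(var, value)		\
do {					\
	preempt_disable();		\
	__CPU_ADD((var), (value));	\
	preempt_enable();		\
} while (0)

#define CPU_ADD(var, value)		\
do {					\
	unsigned long flags;		\
	local_irq_save(flags);		\
	__CPU_ADD((var), (value));	\
	local_irq_restore(flags);	\
} while (0)

int main(void)
{
	long nr_events = 0;		/* stands in for a per cpu counter */

	__CPU_ADD(nr_events, 1);	/* caller already holds off preemption */
	_CPU_ADD(nr_events, 2);		/* preempt safe */
	CPU_ADD(nr_events, 3);		/* also interrupt safe */

	printf("nr_events = %ld\n", nr_events);	/* prints 6 */
	return 0;
}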

---
 arch/ia64/Kconfig              |    3 
 arch/ia64/kernel/perfmon.c     |    2 
 arch/ia64/kernel/setup.c       |    2 
 arch/ia64/kernel/smp.c         |    4 -
 arch/ia64/sn/kernel/setup.c    |    4 -
 include/asm-ia64/mmu_context.h |    6 -
 include/asm-ia64/percpu.h      |  133 ++++++++++++++++++++++++++++++++++++++---
 include/asm-ia64/processor.h   |    2 
 include/asm-ia64/sn/pda.h      |    2 
 9 files changed, 138 insertions(+), 20 deletions(-)

Index: linux-2.6/include/asm-ia64/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/percpu.h	2008-05-29 19:35:10.000000000 -0700
+++ linux-2.6/include/asm-ia64/percpu.h	2008-05-29 19:35:11.000000000 -0700
@@ -19,7 +19,7 @@
 # define PER_CPU_ATTRIBUTES	__attribute__((__model__ (__small__)))
 #endif
 
-#define __my_cpu_offset	__ia64_per_cpu_var(local_per_cpu_offset)
+#define __my_cpu_offset	CPU_READ(per_cpu_var(local_per_cpu_offset))
 
 extern void *per_cpu_init(void);
 
@@ -31,14 +31,6 @@ extern void *per_cpu_init(void);
 
 #endif	/* SMP */
 
-/*
- * Be extremely careful when taking the address of this variable!  Due to virtual
- * remapping, it is different from the canonical address returned by __get_cpu_var(var)!
- * On the positive side, using __ia64_per_cpu_var() instead of __get_cpu_var() is slightly
- * more efficient.
- */
-#define __ia64_per_cpu_var(var)	per_cpu__##var
-
 #include <asm-generic/percpu.h>
 
 /* Equal to __per_cpu_offset[smp_processor_id()], but faster to access: */
@@ -46,4 +38,127 @@ DECLARE_PER_CPU(unsigned long, local_per
 
 #endif /* !__ASSEMBLY__ */
 
+/*
+ * Per cpu ops.
+ *
+ * IA64 has no instructions that would allow light weight RMW operations.
+ *
+ * However, the canonical address of a per cpu variable is mapped via
+ * a processor specific TLB entry to the per cpu area of the respective
+ * processor. The THIS_CPU() macro is therefore not necessary here
+ * since the canonical address of the per cpu variable allows access
+ * to the instance of the per cpu variable for the current processor.
+ *
+ * Sadly we cannot simply define THIS_CPU() to return an address in
+ * the per processor mapping space since the address acquired by THIS_CPU()
+ * may be passed to another processor.
+ */
+#define __CPU_READ(var)				\
+({						\
+	(var);					\
+})
+
+#define __CPU_WRITE(var, value)			\
+({						\
+	(var) = (value);			\
+})
+
+#define __CPU_ADD(var, value)			\
+({						\
+	(var) += (value);			\
+})
+
+#define __CPU_INC(var) __CPU_ADD((var), 1)
+#define __CPU_DEC(var) __CPU_ADD((var), -1)
+#define __CPU_SUB(var, value) __CPU_ADD((var), -(value))
+
+#define __CPU_CMPXCHG(var, old, new)		\
+({						\
+	typeof(var) x;				\
+	typeof(var) *p = &(var);		\
+	x = *p;					\
+	if (x == (old))				\
+		*p = (new);			\
+	(x);					\
+})
+
+#define __CPU_XCHG(obj, new)			\
+({						\
+	typeof(obj) x;				\
+	typeof(obj) *p = &(obj);		\
+	x = *p;					\
+	*p = (new);				\
+	(x);					\
+})
+
+#define _CPU_READ __CPU_READ
+#define _CPU_WRITE __CPU_WRITE
+
+#define _CPU_ADD(var, value)			\
+({						\
+	preempt_disable();			\
+	__CPU_ADD((var), (value));		\
+	preempt_enable();			\
+})
+
+#define _CPU_INC(var) _CPU_ADD((var), 1)
+#define _CPU_DEC(var) _CPU_ADD((var), -1)
+#define _CPU_SUB(var, value) _CPU_ADD((var), -(value))
+
+#define _CPU_CMPXCHG(var, old, new)		\
+({						\
+	typeof(var) x;				\
+	preempt_disable();			\
+	x = __CPU_CMPXCHG((var), (old), (new));	\
+	preempt_enable();			\
+	(x);					\
+})
+
+#define _CPU_XCHG(var, new)			\
+({						\
+	typeof(var) x;				\
+	preempt_disable();			\
+	x = __CPU_XCHG((var), (new));		\
+	preempt_enable();			\
+	(x);					\
+})
+
+/*
+ * Third group: Interrupt safe CPU functions
+ */
+#define CPU_READ __CPU_READ
+#define CPU_WRITE __CPU_WRITE
+
+#define CPU_ADD(var, value)			\
+({						\
+	unsigned long flags;			\
+	local_irq_save(flags);			\
+	__CPU_ADD((var), (value));		\
+	local_irq_restore(flags);		\
+})
+
+#define CPU_INC(var) CPU_ADD((var), 1)
+#define CPU_DEC(var) CPU_ADD((var), -1)
+#define CPU_SUB(var, value) CPU_ADD((var), -(value))
+
+#define CPU_CMPXCHG(var, old, new)		\
+({						\
+	unsigned long flags;			\
+	typeof(var) x;				\
+	local_irq_save(flags);			\
+	x = __CPU_CMPXCHG((var), (old), (new));	\
+	local_irq_restore(flags);		\
+	(x);					\
+})
+
+#define CPU_XCHG(var, new)			\
+({						\
+	unsigned long flags;			\
+	typeof(var) x;				\
+	local_irq_save(flags);			\
+	x = __CPU_XCHG((var), (new));		\
+	local_irq_restore(flags);		\
+	(x);					\
+})
+
 #endif /* _ASM_IA64_PERCPU_H */
Index: linux-2.6/arch/ia64/kernel/perfmon.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/perfmon.c	2008-05-29 19:35:09.000000000 -0700
+++ linux-2.6/arch/ia64/kernel/perfmon.c	2008-05-29 19:35:13.000000000 -0700
@@ -576,7 +576,7 @@ static struct ctl_table_header *pfm_sysc
 
 static int pfm_context_unload(pfm_context_t *ctx, void *arg, int count, struct pt_regs *regs);
 
-#define pfm_get_cpu_var(v)		__ia64_per_cpu_var(v)
+#define pfm_get_cpu_var(v)		per_cpu_var(v)
 #define pfm_get_cpu_data(a,b)		per_cpu(a, b)
 
 static inline void
Index: linux-2.6/arch/ia64/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/setup.c	2008-05-29 19:35:09.000000000 -0700
+++ linux-2.6/arch/ia64/kernel/setup.c	2008-05-29 19:35:11.000000000 -0700
@@ -925,7 +925,7 @@ cpu_init (void)
 	 * depends on the data returned by identify_cpu().  We break the dependency by
 	 * accessing cpu_data() through the canonical per-CPU address.
 	 */
-	cpu_info = cpu_data + ((char *) &__ia64_per_cpu_var(cpu_info) - __per_cpu_start);
+	cpu_info = cpu_data + ((char *)&per_cpu_var(cpu_info) - __per_cpu_start);
 	identify_cpu(cpu_info);
 
 #ifdef CONFIG_MCKINLEY
Index: linux-2.6/arch/ia64/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/smp.c	2008-05-29 19:35:09.000000000 -0700
+++ linux-2.6/arch/ia64/kernel/smp.c	2008-05-29 19:35:11.000000000 -0700
@@ -150,7 +150,7 @@ irqreturn_t
 handle_IPI (int irq, void *dev_id)
 {
 	int this_cpu = get_cpu();
-	unsigned long *pending_ipis = &__ia64_per_cpu_var(ipi_operation);
+	unsigned long *pending_ipis = &per_cpu_var(ipi_operation);
 	unsigned long ops;
 
 	mb();	/* Order interrupt and bit testing. */
@@ -303,7 +303,7 @@ smp_local_flush_tlb(void)
 void
 smp_flush_tlb_cpumask(cpumask_t xcpumask)
 {
-	unsigned int *counts = __ia64_per_cpu_var(shadow_flush_counts);
+	unsigned int *counts = per_cpu_var(shadow_flush_counts);
 	cpumask_t cpumask = xcpumask;
 	int mycpu, cpu, flush_mycpu = 0;
 
Index: linux-2.6/arch/ia64/sn/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/ia64/sn/kernel/setup.c	2008-05-29 19:35:09.000000000 -0700
+++ linux-2.6/arch/ia64/sn/kernel/setup.c	2008-05-29 19:35:11.000000000 -0700
@@ -645,7 +645,7 @@ void __cpuinit sn_cpu_init(void)
 		/* copy cpu 0's sn_cnodeid_to_nasid table to this cpu's */
 		memcpy(sn_cnodeid_to_nasid,
 		       (&per_cpu(__sn_cnodeid_to_nasid, 0)),
-		       sizeof(__ia64_per_cpu_var(__sn_cnodeid_to_nasid)));
+		       sizeof(per_cpu_var(__sn_cnodeid_to_nasid)));
 	}
 
 	/*
@@ -706,7 +706,7 @@ void __init build_cnode_tables(void)
 
 	memset(physical_node_map, -1, sizeof(physical_node_map));
 	memset(sn_cnodeid_to_nasid, -1,
-			sizeof(__ia64_per_cpu_var(__sn_cnodeid_to_nasid)));
+			sizeof(per_cpu_var(__sn_cnodeid_to_nasid)));
 
 	/*
 	 * First populate the tables with C/M bricks. This ensures that
Index: linux-2.6/include/asm-ia64/mmu_context.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/mmu_context.h	2008-05-29 19:35:10.000000000 -0700
+++ linux-2.6/include/asm-ia64/mmu_context.h	2008-05-29 19:35:13.000000000 -0700
@@ -64,11 +64,11 @@ delayed_tlb_flush (void)
 	extern void local_flush_tlb_all (void);
 	unsigned long flags;
 
-	if (unlikely(__ia64_per_cpu_var(ia64_need_tlb_flush))) {
+	if (unlikely(CPU_READ(per_cpu_var(ia64_need_tlb_flush)))) {
 		spin_lock_irqsave(&ia64_ctx.lock, flags);
-		if (__ia64_per_cpu_var(ia64_need_tlb_flush)) {
+		if (CPU_READ(per_cpu_var(ia64_need_tlb_flush))) {
 			local_flush_tlb_all();
-			__ia64_per_cpu_var(ia64_need_tlb_flush) = 0;
+			CPU_WRITE(per_cpu_var(ia64_need_tlb_flush), 0);
 		}
 		spin_unlock_irqrestore(&ia64_ctx.lock, flags);
 	}
Index: linux-2.6/include/asm-ia64/processor.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/processor.h	2008-05-29 19:35:09.000000000 -0700
+++ linux-2.6/include/asm-ia64/processor.h	2008-05-29 19:35:11.000000000 -0700
@@ -237,7 +237,7 @@ DECLARE_PER_CPU(struct cpuinfo_ia64, cpu
  * Do not use the address of local_cpu_data, since it will be different from
  * cpu_data(smp_processor_id())!
  */
-#define local_cpu_data		(&__ia64_per_cpu_var(cpu_info))
+#define local_cpu_data		(&per_cpu_var(cpu_info))
 #define cpu_data(cpu)		(&per_cpu(cpu_info, cpu))
 
 extern void print_cpu_info (struct cpuinfo_ia64 *);
Index: linux-2.6/include/asm-ia64/sn/pda.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/sn/pda.h	2008-05-29 19:35:10.000000000 -0700
+++ linux-2.6/include/asm-ia64/sn/pda.h	2008-05-29 19:35:11.000000000 -0700
@@ -62,7 +62,7 @@ typedef struct pda_s {
  */
 DECLARE_PER_CPU(struct pda_s, pda_percpu);
 
-#define pda		(&__ia64_per_cpu_var(pda_percpu))
+#define pda		(&per_cpu_var(pda_percpu))
 
 #define pdacpu(cpu)	(&per_cpu(pda_percpu, cpu))
 
Index: linux-2.6/arch/ia64/Kconfig
===================================================================
--- linux-2.6.orig/arch/ia64/Kconfig	2008-05-29 19:35:09.000000000 -0700
+++ linux-2.6/arch/ia64/Kconfig	2008-05-29 19:35:11.000000000 -0700
@@ -92,6 +92,9 @@ config GENERIC_TIME_VSYSCALL
 config HAVE_SETUP_PER_CPU_AREA
 	def_bool y
 
+config HAVE_CPU_OPS
+	def_bool y
+
 config DMI
 	bool
 	default y

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 36/41] Zero based percpu: Infrastructure to rebase the per cpu area to zero
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (34 preceding siblings ...)
  2008-05-30  3:56 ` [patch 35/41] Support for CPU ops Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 37/41] x86_64: Fold pda into per cpu area Christoph Lameter
                   ` (5 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, Mike Travis, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell

[-- Attachment #1: zero_based_infrastructure --]
[-- Type: text/plain, Size: 6216 bytes --]

    * Support an option

	CONFIG_HAVE_ZERO_BASED_PER_CPU

      to make offsets for per cpu variables start at zero.

      If a percpu area starts at zero then:

	-  We do not need RELOC_HIDE anymore

	-  Provides for the future capability of architectures providing
	   a per cpu allocator that returns offsets instead of pointers.
	   The offsets would be independent of the processor so that
	   address calculations can be done in a processor independent way.
	   Per cpu instructions can then add the processor specific offset
	   at the last minute, possibly in an atomic instruction.

      The data the linker provides is different for zero based percpu segments:

	__per_cpu_load	-> The address at which the percpu area was loaded
	__per_cpu_size	-> The length of the per cpu area

    * Removes the &__per_cpu_x in lockdep. The __per_cpu_x are already
      pointers. There is no need to take the address.

    * Updates kernel/module.c to be able to deal with a percpu area that
      is loaded at __per_cpu_load but is accessed at __per_cpu_start.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
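
For illustration, a userspace model (not part of the patch) of what zero based
percpu buys: the "address" of a per cpu variable is just its offset into the
section, so SHIFT_PERCPU_PTR() becomes plain base-plus-offset arithmetic.
__per_cpu_load, __per_cpu_size and __per_cpu_offset[] below are stand-ins for
the real linker and kernel symbols.

/*
 * Userspace model, not kernel code.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS 4

static char __per_cpu_load[64];			/* initial section image */
#define __per_cpu_size sizeof(__per_cpu_load)
static unsigned long __per_cpu_offset[NR_CPUS];	/* one base per cpu */

/* zero based variant: just add the cpu's base to the (small) offset __p */
#define SHIFT_PERCPU_PTR(__p, __offset) \
	((void *)((char *)(__p) + (__offset)))

int main(void)
{
	size_t counter = 16;	/* offset of a per cpu variable in the section */
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		char *area = malloc(__per_cpu_size);

		memcpy(area, __per_cpu_load, __per_cpu_size);
		__per_cpu_offset[cpu] = (unsigned long)area;
	}

	/* per_cpu(counter, 2): add cpu 2's base to the variable's offset */
	*(int *)SHIFT_PERCPU_PTR(counter, __per_cpu_offset[2]) = 42;
	printf("%d\n", *(int *)(__per_cpu_offset[2] + counter));
	return 0;
}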
---
 include/asm-generic/percpu.h      |    9 ++++++++-
 include/asm-generic/sections.h    |   10 ++++++++++
 include/asm-generic/vmlinux.lds.h |   16 ++++++++++++++++
 include/linux/percpu.h            |    8 ++++++++
 kernel/lockdep.c                  |    4 ++--
 5 files changed, 44 insertions(+), 3 deletions(-)

Index: linux-2.6/include/asm-generic/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-generic/percpu.h	2008-05-29 17:57:39.788714222 -0700
+++ linux-2.6/include/asm-generic/percpu.h	2008-05-29 18:03:12.432714383 -0700
@@ -45,7 +45,12 @@ extern unsigned long __per_cpu_offset[NR
  * Only S390 provides its own means of moving the pointer.
  */
 #ifndef SHIFT_PERCPU_PTR
-#define SHIFT_PERCPU_PTR(__p, __offset)	RELOC_HIDE((__p), (__offset))
+# ifdef CONFIG_HAVE_ZERO_BASED_PER_CPU
+#  define SHIFT_PERCPU_PTR(__p, __offset) \
+	((__typeof(__p))(((void *)(__p)) + (__offset)))
+# else
+#  define SHIFT_PERCPU_PTR(__p, __offset)	RELOC_HIDE((__p), (__offset))
+# endif /* CONFIG_HAVE_ZERO_BASED_PER_CPU */
 #endif
 
 /*
@@ -70,6 +75,8 @@ extern void setup_per_cpu_areas(void);
 #define per_cpu(var, cpu)			(*((void)(cpu), &per_cpu_var(var)))
 #define __get_cpu_var(var)			per_cpu_var(var)
 #define __raw_get_cpu_var(var)			per_cpu_var(var)
+#define SHIFT_PERCPU_PTR(__p, __offset)		(__p)
+#define per_cpu_offset(x)			0L
 
 #endif	/* SMP */
 
Index: linux-2.6/include/asm-generic/sections.h
===================================================================
--- linux-2.6.orig/include/asm-generic/sections.h	2008-05-29 17:57:39.792714337 -0700
+++ linux-2.6/include/asm-generic/sections.h	2008-05-29 18:03:12.432714383 -0700
@@ -9,7 +9,17 @@ extern char __bss_start[], __bss_stop[];
 extern char __init_begin[], __init_end[];
 extern char _sinittext[], _einittext[];
 extern char _end[];
+#ifdef CONFIG_HAVE_ZERO_BASED_PER_CPU
+extern char __per_cpu_load[];
+extern char ____per_cpu_size[];
+#define __per_cpu_size ((unsigned long)&____per_cpu_size)
+#define __per_cpu_start ((char *)0)
+#define __per_cpu_end ((char *)__per_cpu_size)
+#else
 extern char __per_cpu_start[], __per_cpu_end[];
+#define __per_cpu_load __per_cpu_start
+#define __per_cpu_size (__per_cpu_end - __per_cpu_start)
+#endif
 extern char __kprobes_text_start[], __kprobes_text_end[];
 extern char __initdata_begin[], __initdata_end[];
 extern char __start_rodata[], __end_rodata[];
Index: linux-2.6/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6.orig/include/asm-generic/vmlinux.lds.h	2008-05-29 17:57:39.800714018 -0700
+++ linux-2.6/include/asm-generic/vmlinux.lds.h	2008-05-29 18:03:12.432714383 -0700
@@ -344,6 +344,21 @@
   	*(.initcall7.init)						\
   	*(.initcall7s.init)
 
+#ifdef CONFIG_HAVE_ZERO_BASED_PER_CPU
+#define PERCPU(align)							\
+	. = ALIGN(align);						\
+	percpu : { } :percpu						\
+	__per_cpu_load = .;						\
+	.data.percpu 0 : AT(__per_cpu_load - LOAD_OFFSET) {		\
+		*(.data.percpu.first)					\
+		*(.data.percpu.shared_aligned)				\
+		*(.data.percpu)						\
+		*(.data.percpu.page_aligned)				\
+		____per_cpu_size = .;					\
+	}								\
+	. = __per_cpu_load + ____per_cpu_size;				\
+	data : { } :data
+#else
 #define PERCPU(align)							\
 	. = ALIGN(align);						\
 	__per_cpu_start = .;						\
@@ -352,3 +367,4 @@
 		*(.data.percpu.shared_aligned)				\
 	}								\
 	__per_cpu_end = .;
+#endif
Index: linux-2.6/kernel/lockdep.c
===================================================================
--- linux-2.6.orig/kernel/lockdep.c	2008-05-29 17:59:22.697422432 -0700
+++ linux-2.6/kernel/lockdep.c	2008-05-29 18:03:33.013702733 -0700
@@ -609,8 +609,8 @@ static int static_obj(void *obj)
 	 * percpu var?
 	 */
 	for_each_possible_cpu(i) {
-		start = (unsigned long) &__per_cpu_start + per_cpu_offset(i);
-		end   = (unsigned long) &__per_cpu_start + __per_cpu_size
+		start = (unsigned long) __per_cpu_start + per_cpu_offset(i);
+		end   = (unsigned long) __per_cpu_start + __per_cpu_size
 					+ per_cpu_offset(i);
 
 		if ((addr >= start) && (addr < end))
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2008-05-29 18:01:22.260714623 -0700
+++ linux-2.6/include/linux/percpu.h	2008-05-29 18:03:12.436714003 -0700
@@ -23,12 +23,20 @@
 	__attribute__((__section__(SHARED_ALIGNED_SECTION)))		\
 	PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name		\
 	____cacheline_aligned_in_smp
+
+#define DEFINE_PER_CPU_FIRST(type, name)				\
+	__attribute__((__section__(".data.percpu.first")))		\
+	PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name
+
 #else
 #define DEFINE_PER_CPU(type, name)					\
 	PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name
 
 #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)		      \
 	DEFINE_PER_CPU(type, name)
+
+#define DEFINE_PER_CPU_FIRST(type, name)			      \
+	DEFINE_PER_CPU(type, name)
 #endif
 
 #define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var)

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 37/41] x86_64: Fold pda into per cpu area
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (35 preceding siblings ...)
  2008-05-30  3:56 ` [patch 36/41] Zero based percpu: Infrastructure to rebase the per cpu area to zero Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 38/41] x86: Extend percpu ops to 64 bit Christoph Lameter
                   ` (4 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, Mike Travis, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell

[-- Attachment #1: zero_based_fold --]
[-- Type: text/plain, Size: 5788 bytes --]

  * Declare the pda as a per cpu variable.
  * Make the x86_64 per cpu area start at zero.

  * Since %gs is pointing to the pda, it will then also point to the per cpu
    variables, which can then be accessed as:

	%gs:[&per_cpu_xxxx - __per_cpu_start]

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
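
A standalone sketch (not from the patch) of what %gs:[&per_cpu_xxxx -
__per_cpu_start] amounts to once the pda sits first in a zero based per cpu
area: a segment base plus the variable's offset within the section. gs_base,
struct percpu_area and my_cpu_var() are illustrative stand-ins, not kernel
symbols.

/*
 * Standalone sketch, not kernel code (uses the GNU typeof extension).
 */
#include <stdio.h>
#include <stddef.h>

struct x8664_pda {			/* reduced stand-in for the real pda */
	unsigned long data_offset;
	unsigned int __nmi_count;
};

struct percpu_area {			/* pda first, other per cpu data after */
	struct x8664_pda pda;
	long some_counter;
};

static struct percpu_area cpu0_area;
static char *gs_base = (char *)&cpu0_area;	/* what %gs would point at */

/* gs base + offset of the variable inside the per cpu section */
#define my_cpu_var(field)						\
	(*(typeof(((struct percpu_area *)0)->field) *)			\
		(gs_base + offsetof(struct percpu_area, field)))

int main(void)
{
	my_cpu_var(pda.__nmi_count) = 3;	/* pda access ... */
	my_cpu_var(some_counter) = 7;		/* ... and percpu access alike */

	printf("%u %ld\n", cpu0_area.pda.__nmi_count, cpu0_area.some_counter);
	return 0;
}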
---
 arch/x86/Kconfig                 |    3 +++
 arch/x86/kernel/setup.c          |   22 ++++++++++++++++++++--
 arch/x86/kernel/smpboot.c        |   16 ----------------
 arch/x86/kernel/vmlinux_64.lds.S |    1 +
 include/asm-x86/percpu.h         |   19 +++++++++----------
 5 files changed, 33 insertions(+), 28 deletions(-)

Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig	2008-05-29 17:57:39.588714025 -0700
+++ linux-2.6/arch/x86/Kconfig	2008-05-29 18:16:38.743452832 -0700
@@ -126,6 +126,9 @@ config HAVE_SETUP_PER_CPU_AREA
 config HAVE_CPUMASK_OF_CPU_MAP
 	def_bool X86_64_SMP
 
+config HAVE_ZERO_BASED_PER_CPU
+	def_bool X86_64 && SMP
+
 config ARCH_HIBERNATION_POSSIBLE
 	def_bool y
 	depends on !SMP || !X86_VOYAGER
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c	2008-05-29 18:02:56.229889675 -0700
+++ linux-2.6/arch/x86/kernel/setup.c	2008-05-29 18:17:35.835953108 -0700
@@ -26,6 +26,11 @@ EXPORT_PER_CPU_SYMBOL(x86_cpu_to_apicid)
 physid_mask_t phys_cpu_present_map;
 #endif
 
+#ifdef CONFIG_X86_64
+DEFINE_PER_CPU_FIRST(struct x8664_pda, pda);
+EXPORT_PER_CPU_SYMBOL(pda);
+#endif
+
 #if defined(CONFIG_HAVE_SETUP_PER_CPU_AREA) && defined(CONFIG_X86_SMP)
 /*
  * Copy data used in early init routines from the initial arrays to the
@@ -115,13 +120,20 @@ void __init setup_per_cpu_areas(void)
 #endif
 		if (!ptr)
 			panic("Cannot allocate cpu data for CPU %d\n", i);
+
+		memcpy(ptr, __per_cpu_load, __per_cpu_size);
+
 #ifdef CONFIG_X86_64
+		/*
+		 * So far an embryonic per cpu area was used containing only
+		 * the pda. Move the pda contents into the full per cpu area.
+		  */
 		cpu_pda(i)->data_offset = ptr - __per_cpu_start;
+		memcpy(ptr, cpu_pda(i), sizeof(struct x8664_pda));
+		cpu_pda(i) = (struct x8664_pda *)ptr;
 #else
 		__per_cpu_offset[i] = ptr - __per_cpu_start;
 #endif
-		memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
-
 		highest_cpu = i;
 	}
 
@@ -132,6 +144,12 @@ void __init setup_per_cpu_areas(void)
 	setup_boot_pagesets();
 
 	nr_cpu_ids = highest_cpu + 1;
+
+#ifdef CONFIG_X86_64
+	/* Fix up pda for boot processor */
+	pda_init(0);
+#endif
+
 	printk(KERN_DEBUG "NR_CPUS: %d, nr_cpu_ids: %d\n", NR_CPUS, nr_cpu_ids);
 
 	/* Setup percpu data maps */
Index: linux-2.6/arch/x86/kernel/vmlinux_64.lds.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/vmlinux_64.lds.S	2008-05-29 17:57:39.600964822 -0700
+++ linux-2.6/arch/x86/kernel/vmlinux_64.lds.S	2008-05-29 18:05:08.514214613 -0700
@@ -16,6 +16,7 @@ jiffies_64 = jiffies;
 _proxy_pda = 1;
 PHDRS {
 	text PT_LOAD FLAGS(5);	/* R_E */
+	percpu PT_LOAD FLAGS(4);	/* R__ */
 	data PT_LOAD FLAGS(7);	/* RWE */
 	user PT_LOAD FLAGS(7);	/* RWE */
 	data.init PT_LOAD FLAGS(7);	/* RWE */
Index: linux-2.6/include/asm-x86/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu.h	2008-05-29 17:57:39.616964037 -0700
+++ linux-2.6/include/asm-x86/percpu.h	2008-05-29 18:17:20.419452945 -0700
@@ -3,21 +3,16 @@
 
 #ifdef CONFIG_X86_64
 #include <linux/compiler.h>
-
-/* Same as asm-generic/percpu.h, except that we store the per cpu offset
-   in the PDA. Longer term the PDA and every per cpu variable
-   should be just put into a single section and referenced directly
-   from %gs */
-
-#ifdef CONFIG_SMP
 #include <asm/pda.h>
 
+#ifdef CONFIG_SMP
 #define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
 #define __my_cpu_offset read_pda(data_offset)
-
 #define per_cpu_offset(x) (__per_cpu_offset(x))
-
 #endif
+
+#define __percpu_seg "%%gs:"
+
 #include <asm-generic/percpu.h>
 
 DECLARE_PER_CPU(struct x8664_pda, pda);
@@ -81,6 +76,11 @@ DECLARE_PER_CPU(struct x8664_pda, pda);
 /* We can use this directly for local CPU (faster). */
 DECLARE_PER_CPU(unsigned long, this_cpu_off);
 
+#endif /* __ASSEMBLY__ */
+#endif /* !CONFIG_X86_64 */
+
+#ifndef __ASSEMBLY__
+
 /* For arch-specific code, we can use direct single-insn ops (they
  * don't give an lvalue though). */
 extern void __bad_percpu_size(void);
@@ -142,5 +142,4 @@ do {							\
 #define x86_sub_percpu(var, val) percpu_to_op("sub", per_cpu__##var, val)
 #define x86_or_percpu(var, val) percpu_to_op("or", per_cpu__##var, val)
 #endif /* !__ASSEMBLY__ */
-#endif /* !CONFIG_X86_64 */
 #endif /* _ASM_X86_PERCPU_H_ */
Index: linux-2.6/arch/x86/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smpboot.c	2008-05-29 17:57:39.608964052 -0700
+++ linux-2.6/arch/x86/kernel/smpboot.c	2008-05-29 18:17:18.539452880 -0700
@@ -855,22 +855,6 @@ static int __cpuinit do_boot_cpu(int api
 		printk(KERN_ERR "Failed to allocate GDT for CPU %d\n", cpu);
 		return -1;
 	}
-
-	/* Allocate node local memory for AP pdas */
-	if (cpu_pda(cpu) == &boot_cpu_pda[cpu]) {
-		struct x8664_pda *newpda, *pda;
-		int node = cpu_to_node(cpu);
-		pda = cpu_pda(cpu);
-		newpda = kmalloc_node(sizeof(struct x8664_pda), GFP_ATOMIC,
-				      node);
-		if (newpda) {
-			memcpy(newpda, pda, sizeof(struct x8664_pda));
-			cpu_pda(cpu) = newpda;
-		} else
-			printk(KERN_ERR
-		"Could not allocate node local PDA for CPU %d on node %d\n",
-				cpu, node);
-	}
 #endif
 
 	alternatives_smp_switch(1);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 38/41] x86: Extend percpu ops to 64 bit
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (36 preceding siblings ...)
  2008-05-30  3:56 ` [patch 37/41] x86_64: Fold pda into per cpu area Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 39/41] x86: Replace cpu_pda() using percpu logic and get rid of _cpu_pda() Christoph Lameter
                   ` (3 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: zero_based_percpu_64bit --]
[-- Type: text/plain, Size: 3964 bytes --]

x86 percpu ops will now work on 64 bit too. So add the missing 8 byte cases.
Also add a number of atomic ops that will be useful in the future:
x86_xchg_percpu() and x86_cmpxchg_percpu().

Add x86_inc_percpu and x86_dec_percpu. Increment by one can generate more
efficient instructions and inc/dec will be supported by cpu ops later.

Also use per_cpu_var() instead of per_cpu__##xxx.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
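
For reference, a userspace model (not part of the patch) of the cmpxchg
semantics percpu_cmpxchg_op() is expected to provide: return the previous
value, and only store the new value when the previous value matched "old".
model_cmpxchg() is a stand-in; the retry loop shows how x86_cmpxchg_percpu()
would typically be used.

/*
 * Userspace model, not kernel code.
 */
#include <stdio.h>

static long model_cmpxchg(long *var, long old, long new)
{
	long prev = *var;

	if (prev == old)
		*var = new;
	return prev;		/* caller retries when prev != old */
}

int main(void)
{
	long counter = 5;	/* stands in for a per cpu variable */
	long old, new;

	do {			/* typical lock free update loop */
		old = counter;
		new = old + 3;
	} while (model_cmpxchg(&counter, old, new) != old);

	printf("%ld\n", counter);	/* prints 8 */
	return 0;
}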

---
 include/asm-x86/percpu.h |   83 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 78 insertions(+), 5 deletions(-)

Index: linux-2.6/include/asm-x86/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu.h	2008-05-29 20:29:40.000000000 -0700
+++ linux-2.6/include/asm-x86/percpu.h	2008-05-29 20:32:03.000000000 -0700
@@ -108,6 +108,11 @@ do {							\
 		    : "+m" (var)			\
 		    : "ri" ((T__)val));			\
 		break;					\
+	case 8:						\
+		asm(op "q %1,"__percpu_seg"%0"		\
+		    : "+m" (var)			\
+		    : "ri" ((T__)val));			\
+		break;					\
 	default: __bad_percpu_size();			\
 	}						\
 } while (0)
@@ -131,15 +136,83 @@ do {							\
 		    : "=r" (ret__)			\
 		    : "m" (var));			\
 		break;					\
+	case 8:						\
+		asm(op "q "__percpu_seg"%1,%0"		\
+		    : "=r" (ret__)			\
+		    : "m" (var));			\
+		break;					\
 	default: __bad_percpu_size();			\
 	}						\
 	ret__;						\
 })
 
-#define x86_read_percpu(var) percpu_from_op("mov", per_cpu__##var)
-#define x86_write_percpu(var, val) percpu_to_op("mov", per_cpu__##var, val)
-#define x86_add_percpu(var, val) percpu_to_op("add", per_cpu__##var, val)
-#define x86_sub_percpu(var, val) percpu_to_op("sub", per_cpu__##var, val)
-#define x86_or_percpu(var, val) percpu_to_op("or", per_cpu__##var, val)
+#define percpu_addr_op(op, var)				\
+({							\
+	switch (sizeof(var)) {				\
+	case 1:						\
+		asm(op "b "__percpu_seg"%0"		\
+				: : "m"(var));		\
+		break;					\
+	case 2:						\
+		asm(op "w "__percpu_seg"%0"		\
+				: : "m"(var));		\
+		break;					\
+	case 4:						\
+		asm(op "l "__percpu_seg"%0"		\
+				: : "m"(var));		\
+		break;					\
+	case 8:						\
+		asm(op "q "__percpu_seg"%0"		\
+				: : "m"(var));		\
+		break;					\
+	default: __bad_percpu_size();			\
+	}						\
+})
+
+#define percpu_cmpxchg_op(var, old, new)				\
+({									\
+	typeof(var) prev;						\
+	switch (sizeof(var)) {						\
+	case 1:								\
+		asm("cmpxchgb %b1, "__percpu_seg"%2"			\
+				     : "=a"(prev)			\
+				     : "q"(new), "m"(var), "0"(old)	\
+				     : "memory");			\
+		break;							\
+	case 2:								\
+		asm("cmpxchgw %w1, "__percpu_seg"%2"			\
+				     : "=a"(prev)			\
+				     : "r"(new), "m"(var), "0"(old)	\
+				     : "memory");			\
+		break;							\
+	case 4:								\
+		asm("cmpxchgl %k1, "__percpu_seg"%2"			\
+				     : "=a"(prev)			\
+				     : "r"(new), "m"(var), "0"(old)	\
+				     : "memory");			\
+		break;							\
+	case 8:								\
+		asm("cmpxchgq %1, "__percpu_seg"%2"			\
+				     : "=a"(prev)			\
+				     : "r"(new), "m"(var), "0"(old)	\
+				     : "memory");			\
+		break;							\
+	default:							\
+		__bad_percpu_size();					\
+	}								\
+	prev;								\
+})
+
+#define x86_read_percpu(var) percpu_from_op("mov", per_cpu_var(var))
+#define x86_write_percpu(var, val) percpu_to_op("mov", per_cpu_var(var), val)
+#define x86_add_percpu(var, val) percpu_to_op("add", per_cpu_var(var), val)
+#define x86_sub_percpu(var, val) percpu_to_op("sub", per_cpu_var(var), val)
+#define x86_inc_percpu(var) percpu_addr_op("inc", per_cpu_var(var))
+#define x86_dec_percpu(var) percpu_addr_op("dec", per_cpu_var(var))
+#define x86_or_percpu(var, val) percpu_to_op("or", per_cpu_var(var), val)
+#define x86_xchg_percpu(var, val) percpu_to_op("xchg", per_cpu_var(var), val)
+#define x86_cmpxchg_percpu(var, old, new) \
+				percpu_cmpxchg_op(per_cpu_var(var), old, new)
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_PERCPU_H_ */

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 39/41] x86: Replace cpu_pda() using percpu logic and get rid of _cpu_pda()
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (37 preceding siblings ...)
  2008-05-30  3:56 ` [patch 38/41] x86: Extend percpu ops to 64 bit Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:57 ` [patch 40/41] x86: Replace xxx_pda() operations with x86_xx_percpu() Christoph Lameter
                   ` (2 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: zero_based_replace_pda_with_percpu_ops --]
[-- Type: text/plain, Size: 11316 bytes --]

cpu_pda() points to the pda, which is now at the beginning of the per cpu area.
This means that cpu_pda() and _cpu_pda[] are both pointing at the percpu area!
per_cpu() can be used instead of cpu_pda() when accessing pda fields.

Typically the offsets to the per cpu areas are stored in an array __per_cpu_offset[]
(generic per cpu support can then provide more functionality).
Use that array for x86_64 as well and get rid of the pda pointers.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
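
A standalone model (not from the patch) of why per_cpu(pda.field, cpu) reaches
the same storage the old cpu_pda(cpu)->field did, once each cpu's area starts
with its pda. The per_cpu() macro, __per_cpu_offset[] and struct x8664_pda here
are reduced stand-ins for the kernel versions.

/*
 * Userspace model, not kernel code (uses the GNU typeof extension).
 */
#include <stdio.h>

#define NR_CPUS 2

struct x8664_pda {
	unsigned int __nmi_count;
	unsigned int apic_timer_irqs;
};

static struct x8664_pda pda;			/* the "template" instance */
static struct x8664_pda cpu_areas[NR_CPUS];	/* each cpu's area starts with its pda */
static long __per_cpu_offset[NR_CPUS];

#define per_cpu(var, cpu) \
	(*(typeof(&(var)))((char *)&(var) + __per_cpu_offset[cpu]))

int main(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		__per_cpu_offset[cpu] =
			(char *)&cpu_areas[cpu] - (char *)&pda;

	per_cpu(pda.__nmi_count, 1) = 7;	/* was: cpu_pda(1)->__nmi_count = 7 */
	printf("%u\n", cpu_areas[1].__nmi_count);	/* prints 7 */
	return 0;
}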

---
 arch/x86/kernel/head64.c   |    7 ++++++-
 arch/x86/kernel/irq_64.c   |   16 ++++++++--------
 arch/x86/kernel/nmi_64.c   |    6 +++---
 arch/x86/kernel/setup.c    |   15 ++++-----------
 arch/x86/kernel/setup64.c  |    6 +-----
 arch/x86/kernel/smpboot.c  |    2 +-
 arch/x86/kernel/traps_64.c |    9 +++++----
 include/asm-x86/pda.h      |    4 ----
 include/asm-x86/percpu.h   |   32 +++++++-------------------------
 9 files changed, 35 insertions(+), 62 deletions(-)

Index: linux-2.6/arch/x86/kernel/head64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/head64.c	2008-05-29 20:29:55.000000000 -0700
+++ linux-2.6/arch/x86/kernel/head64.c	2008-05-29 20:48:18.000000000 -0700
@@ -119,8 +119,13 @@ static void __init reserve_setup_data(vo
 	}
 }
 
+static struct x8664_pda boot_cpu_pda[NR_CPUS] __cacheline_aligned;
+
 void __init x86_64_start_kernel(char * real_mode_data)
 {
+#ifndef CONFIG_SMP
+	unsigned long __per_cpu_offset[1];
+#endif
 	int i;
 
 	/*
@@ -157,7 +162,7 @@ void __init x86_64_start_kernel(char * r
 	early_printk("Kernel alive\n");
 
  	for (i = 0; i < NR_CPUS; i++)
- 		cpu_pda(i) = &boot_cpu_pda[i];
+ 		__per_cpu_offset[i] = (unsigned long)&boot_cpu_pda[i];
 
 	pda_init(0);
 	copy_bootdata(__va(real_mode_data));
Index: linux-2.6/arch/x86/kernel/irq_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/irq_64.c	2008-05-29 20:29:55.000000000 -0700
+++ linux-2.6/arch/x86/kernel/irq_64.c	2008-05-29 20:48:18.000000000 -0700
@@ -115,37 +115,37 @@ skip:
 	} else if (i == NR_IRQS) {
 		seq_printf(p, "NMI: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->__nmi_count);
+			seq_printf(p, "%10u ", per_cpu(pda.__nmi_count, j));
 		seq_printf(p, "  Non-maskable interrupts\n");
 		seq_printf(p, "LOC: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->apic_timer_irqs);
+			seq_printf(p, "%10u ", per_cpu(pda.apic_timer_irqs, j));
 		seq_printf(p, "  Local timer interrupts\n");
 #ifdef CONFIG_SMP
 		seq_printf(p, "RES: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_resched_count);
+			seq_printf(p, "%10u ", per_cpu(pda.irq_resched_count, j));
 		seq_printf(p, "  Rescheduling interrupts\n");
 		seq_printf(p, "CAL: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_call_count);
+			seq_printf(p, "%10u ", per_cpu(pda.irq_call_count, j));
 		seq_printf(p, "  function call interrupts\n");
 		seq_printf(p, "TLB: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_tlb_count);
+			seq_printf(p, "%10u ", per_cpu(pda.irq_tlb_count, j));
 		seq_printf(p, "  TLB shootdowns\n");
 #endif
 		seq_printf(p, "TRM: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_thermal_count);
+			seq_printf(p, "%10u ", per_cpu(pda.irq_thermal_count, j));
 		seq_printf(p, "  Thermal event interrupts\n");
 		seq_printf(p, "THR: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_threshold_count);
+			seq_printf(p, "%10u ", per_cpu(pda.irq_threshold_count, j));
 		seq_printf(p, "  Threshold APIC interrupts\n");
 		seq_printf(p, "SPU: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_spurious_count);
+			seq_printf(p, "%10u ", per_cpu(pda.irq_spurious_count, j));
 		seq_printf(p, "  Spurious interrupts\n");
 		seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count));
 	}
Index: linux-2.6/arch/x86/kernel/nmi_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/nmi_64.c	2008-05-29 20:29:55.000000000 -0700
+++ linux-2.6/arch/x86/kernel/nmi_64.c	2008-05-29 20:48:18.000000000 -0700
@@ -100,19 +100,19 @@ int __init check_nmi_watchdog(void)
 #endif
 
 	for (cpu = 0; cpu < NR_CPUS; cpu++)
-		prev_nmi_count[cpu] = cpu_pda(cpu)->__nmi_count;
+		prev_nmi_count[cpu] = per_cpu(pda.__nmi_count, cpu);
 	local_irq_enable();
 	mdelay((20*1000)/nmi_hz); // wait 20 ticks
 
 	for_each_online_cpu(cpu) {
 		if (!per_cpu(wd_enabled, cpu))
 			continue;
-		if (cpu_pda(cpu)->__nmi_count - prev_nmi_count[cpu] <= 5) {
+		if (per_cpu(pda.__nmi_count, cpu) - prev_nmi_count[cpu] <= 5) {
 			printk(KERN_WARNING "WARNING: CPU#%d: NMI "
 			       "appears to be stuck (%d->%d)!\n",
 				cpu,
 				prev_nmi_count[cpu],
-				cpu_pda(cpu)->__nmi_count);
+				per_cpu(pda.__nmi_count, cpu));
 			per_cpu(wd_enabled, cpu) = 0;
 			atomic_dec(&nmi_active);
 		}
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c	2008-05-29 20:29:55.000000000 -0700
+++ linux-2.6/arch/x86/kernel/setup.c	2008-05-29 20:48:18.000000000 -0700
@@ -77,14 +77,8 @@ static void __init setup_cpumask_of_cpu(
 static inline void setup_cpumask_of_cpu(void) { }
 #endif
 
-#ifdef CONFIG_X86_32
-/*
- * Great future not-so-futuristic plan: make i386 and x86_64 do it
- * the same way
- */
 unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
 EXPORT_SYMBOL(__per_cpu_offset);
-#endif
 
 /*
  * Great future plan:
@@ -128,12 +122,11 @@ void __init setup_per_cpu_areas(void)
 		 * So far an embryonic per cpu area was used containing only
 		 * the pda. Move the pda contents into the full per cpu area.
 		  */
-		cpu_pda(i)->data_offset = ptr - __per_cpu_start;
-		memcpy(ptr, cpu_pda(i), sizeof(struct x8664_pda));
-		cpu_pda(i) = (struct x8664_pda *)ptr;
-#else
-		__per_cpu_offset[i] = ptr - __per_cpu_start;
+		per_cpu(pda.data_offset, i) = ptr - __per_cpu_start;
+		memcpy(ptr, &per_cpu(pda, i), sizeof(struct x8664_pda));
 #endif
+		__per_cpu_offset[i] = ptr - __per_cpu_start;
+
 		highest_cpu = i;
 	}
 
Index: linux-2.6/arch/x86/kernel/setup64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup64.c	2008-05-29 20:29:55.000000000 -0700
+++ linux-2.6/arch/x86/kernel/setup64.c	2008-05-29 20:48:18.000000000 -0700
@@ -34,10 +34,6 @@ struct boot_params boot_params;
 
 cpumask_t cpu_initialized __cpuinitdata = CPU_MASK_NONE;
 
-struct x8664_pda *_cpu_pda[NR_CPUS] __read_mostly;
-EXPORT_SYMBOL(_cpu_pda);
-struct x8664_pda boot_cpu_pda[NR_CPUS] __cacheline_aligned;
-
 struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table };
 
 char boot_cpu_stack[IRQSTACKSIZE] __attribute__((section(".bss.page_aligned")));
@@ -89,7 +85,7 @@ __setup("noexec32=", nonx32_setup);
 
 void pda_init(int cpu)
 { 
-	struct x8664_pda *pda = cpu_pda(cpu);
+	struct x8664_pda *pda = &per_cpu(pda, cpu);
 
 	/* Setup up data that may be needed in __get_free_pages early */
 	asm volatile("movl %0,%%fs ; movl %0,%%gs" :: "r" (0)); 
Index: linux-2.6/arch/x86/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smpboot.c	2008-05-29 20:29:55.000000000 -0700
+++ linux-2.6/arch/x86/kernel/smpboot.c	2008-05-29 20:48:18.000000000 -0700
@@ -895,7 +895,7 @@ do_rest:
 	stack_start.sp = (void *) c_idle.idle->thread.sp;
 	irq_ctx_init(cpu);
 #else
-	cpu_pda(cpu)->pcurrent = c_idle.idle;
+	per_cpu(pda.pcurrent, cpu) = c_idle.idle;
 	init_rsp = c_idle.idle->thread.sp;
 	load_sp0(&per_cpu(init_tss, cpu), &c_idle.idle->thread);
 	initial_code = (unsigned long)start_secondary;
Index: linux-2.6/arch/x86/kernel/traps_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/traps_64.c	2008-05-29 20:29:56.000000000 -0700
+++ linux-2.6/arch/x86/kernel/traps_64.c	2008-05-29 20:48:18.000000000 -0700
@@ -263,7 +263,8 @@ void dump_trace(struct task_struct *tsk,
 		const struct stacktrace_ops *ops, void *data)
 {
 	const unsigned cpu = get_cpu();
-	unsigned long *irqstack_end = (unsigned long*)cpu_pda(cpu)->irqstackptr;
+	unsigned long *irqstack_end =
+		(unsigned long*)per_cpu(pda.irqstackptr, cpu);
 	unsigned used = 0;
 	struct thread_info *tinfo;
 
@@ -397,8 +398,8 @@ _show_stack(struct task_struct *tsk, str
 	unsigned long *stack;
 	int i;
 	const int cpu = smp_processor_id();
-	unsigned long *irqstack_end = (unsigned long *) (cpu_pda(cpu)->irqstackptr);
-	unsigned long *irqstack = (unsigned long *) (cpu_pda(cpu)->irqstackptr - IRQSTACKSIZE);
+	unsigned long *irqstack_end = (unsigned long *)per_cpu(pda.irqstackptr, cpu);
+	unsigned long *irqstack = (unsigned long *)(per_cpu(pda.irqstackptr, cpu) - IRQSTACKSIZE);
 
 	// debugging aid: "show_stack(NULL, NULL);" prints the
 	// back trace for this cpu.
@@ -462,7 +463,7 @@ void show_registers(struct pt_regs *regs
 	int i;
 	unsigned long sp;
 	const int cpu = smp_processor_id();
-	struct task_struct *cur = cpu_pda(cpu)->pcurrent;
+	struct task_struct *cur = per_cpu(pda.pcurrent, cpu);
 	u8 *ip;
 	unsigned int code_prologue = code_bytes * 43 / 64;
 	unsigned int code_len = code_bytes;
Index: linux-2.6/include/asm-x86/pda.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pda.h	2008-05-29 20:29:56.000000000 -0700
+++ linux-2.6/include/asm-x86/pda.h	2008-05-29 20:48:18.000000000 -0700
@@ -37,12 +37,8 @@ struct x8664_pda {
 	unsigned irq_spurious_count;
 } ____cacheline_aligned_in_smp;
 
-extern struct x8664_pda *_cpu_pda[];
-extern struct x8664_pda boot_cpu_pda[];
 extern void pda_init(int);
 
-#define cpu_pda(i) (_cpu_pda[i])
-
 /*
  * There is no fast way to get the base address of the PDA, all the accesses
  * have to mention %fs/%gs.  So it needs to be done this Torvaldian way.
Index: linux-2.6/include/asm-x86/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu.h	2008-05-29 20:32:03.000000000 -0700
+++ linux-2.6/include/asm-x86/percpu.h	2008-05-29 20:48:18.000000000 -0700
@@ -6,12 +6,12 @@
 #include <asm/pda.h>
 
 #ifdef CONFIG_SMP
-#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
-#define __my_cpu_offset read_pda(data_offset)
-#define per_cpu_offset(x) (__per_cpu_offset(x))
-#endif
-
+#define __my_cpu_offset x86_read_percpu(pda.data_offset)
 #define __percpu_seg "%%gs:"
+#else
+#define __percpu_seg ""
+
+#endif
 
 #include <asm-generic/percpu.h>
 
@@ -46,30 +46,12 @@ DECLARE_PER_CPU(struct x8664_pda, pda);
 
 #else /* ...!ASSEMBLY */
 
-/*
- * PER_CPU finds an address of a per-cpu variable.
- *
- * Args:
- *    var - variable name
- *    cpu - 32bit register containing the current CPU number
- *
- * The resulting address is stored in the "cpu" argument.
- *
- * Example:
- *    PER_CPU(cpu_gdt_descr, %ebx)
- */
 #ifdef CONFIG_SMP
-
 #define __my_cpu_offset x86_read_percpu(this_cpu_off)
-
-/* fs segment starts at (positive) offset == __per_cpu_offset[cpu] */
 #define __percpu_seg "%%fs:"
-
-#else  /* !SMP */
-
+#else
 #define __percpu_seg ""
-
-#endif	/* SMP */
+#endif
 
 #include <asm-generic/percpu.h>
 

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 40/41] x86: Replace xxx_pda() operations with x86_xx_percpu().
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (38 preceding siblings ...)
  2008-05-30  3:56 ` [patch 39/41] x86: Replace cpu_pda() using percpu logic and get rid of _cpu_pda() Christoph Lameter
@ 2008-05-30  3:57 ` Christoph Lameter
  2008-05-30  3:57 ` [patch 41/41] x86_64: Support for cpu ops Christoph Lameter
  2008-05-30  4:58 ` [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Andrew Morton
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:57 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: zero_based_replace_pda_operations --]
[-- Type: text/plain, Size: 15521 bytes --]

It is now possible to use percpu operations for pda access since the pda is
in the percpu area. Drop the pda operations.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
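
The conversion pattern in a nutshell, shown with lines taken from the hunks
below (the field names are just the ones those hunks happen to touch):

	/* before */
	add_pda(irq_resched_count, 1);
	prev->usersp = read_pda(oldrsp);

	/* after */
	x86_inc_percpu(pda.irq_resched_count);
	prev->usersp = x86_read_percpu(pda.oldrsp);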

---
 arch/x86/kernel/apic_64.c                 |    4 -
 arch/x86/kernel/cpu/mcheck/mce_amd_64.c   |    2 
 arch/x86/kernel/cpu/mcheck/mce_intel_64.c |    2 
 arch/x86/kernel/nmi_64.c                  |    5 +
 arch/x86/kernel/process_64.c              |   16 ++---
 arch/x86/kernel/smp.c                     |    4 -
 arch/x86/kernel/time_64.c                 |    2 
 arch/x86/kernel/tlb_64.c                  |   12 ++--
 arch/x86/kernel/x8664_ksyms_64.c          |    2 
 include/asm-x86/current_64.h              |    4 -
 include/asm-x86/hardirq_64.h              |    6 +-
 include/asm-x86/mmu_context_64.h          |   12 ++--
 include/asm-x86/pda.h                     |   86 ------------------------------
 include/asm-x86/smp.h                     |    2 
 include/asm-x86/thread_info_64.h          |    2 
 15 files changed, 37 insertions(+), 124 deletions(-)

Index: linux-2.6/arch/x86/kernel/apic_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic_64.c	2008-05-24 00:14:00.936487393 -0700
+++ linux-2.6/arch/x86/kernel/apic_64.c	2008-05-24 00:15:07.877737074 -0700
@@ -481,7 +481,7 @@ static void local_apic_timer_interrupt(v
 	/*
 	 * the NMI deadlock-detector uses this.
 	 */
-	add_pda(apic_timer_irqs, 1);
+	x86_inc_percpu(pda.apic_timer_irqs);
 
 	evt->event_handler(evt);
 }
@@ -986,7 +986,7 @@ asmlinkage void smp_spurious_interrupt(v
 	if (v & (1 << (SPURIOUS_APIC_VECTOR & 0x1f)))
 		ack_APIC_irq();
 
-	add_pda(irq_spurious_count, 1);
+	x86_inc_percpu(pda.irq_spurious_count);
 	irq_exit();
 }
 
Index: linux-2.6/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/mcheck/mce_amd_64.c	2008-05-24 00:14:00.946486866 -0700
+++ linux-2.6/arch/x86/kernel/cpu/mcheck/mce_amd_64.c	2008-05-24 00:15:07.877737074 -0700
@@ -237,7 +237,7 @@ asmlinkage void mce_threshold_interrupt(
 		}
 	}
 out:
-	add_pda(irq_threshold_count, 1);
+	x86_inc_percpu(pda.irq_threshold_count);
 	irq_exit();
 }
 
Index: linux-2.6/arch/x86/kernel/cpu/mcheck/mce_intel_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/mcheck/mce_intel_64.c	2008-05-24 00:14:00.956487483 -0700
+++ linux-2.6/arch/x86/kernel/cpu/mcheck/mce_intel_64.c	2008-05-24 00:15:07.907736191 -0700
@@ -26,7 +26,7 @@ asmlinkage void smp_thermal_interrupt(vo
 	if (therm_throt_process(msr_val & 1))
 		mce_log_therm_throt_event(smp_processor_id(), msr_val);
 
-	add_pda(irq_thermal_count, 1);
+	x86_inc_percpu(pda.irq_thermal_count);
 	irq_exit();
 }
 
Index: linux-2.6/arch/x86/kernel/nmi_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/nmi_64.c	2008-05-24 00:14:00.966487850 -0700
+++ linux-2.6/arch/x86/kernel/nmi_64.c	2008-05-24 00:15:07.937736041 -0700
@@ -328,7 +328,8 @@ nmi_watchdog_tick(struct pt_regs *regs, 
 		touched = 1;
 	}
 
-	sum = read_pda(apic_timer_irqs) + read_pda(irq0_irqs);
+	sum = x86_read_percpu(pda.apic_timer_irqs) +
+					x86_read_percpu(pda.irq0_irqs);
 	if (__get_cpu_var(nmi_touch)) {
 		__get_cpu_var(nmi_touch) = 0;
 		touched = 1;
@@ -389,7 +390,7 @@ asmlinkage notrace __kprobes void
 do_nmi(struct pt_regs *regs, long error_code)
 {
 	nmi_enter();
-	add_pda(__nmi_count,1);
+	x86_inc_percpu(pda.__nmi_count);
 	if (!ignore_nmis)
 		default_do_nmi(regs);
 	nmi_exit();
Index: linux-2.6/arch/x86/kernel/process_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/process_64.c	2008-05-24 00:14:00.976488140 -0700
+++ linux-2.6/arch/x86/kernel/process_64.c	2008-05-24 00:15:22.687263321 -0700
@@ -74,13 +74,13 @@ void idle_notifier_register(struct notif
 
 void enter_idle(void)
 {
-	write_pda(isidle, 1);
+	x86_write_percpu(pda.isidle, 1);
 	atomic_notifier_call_chain(&idle_notifier, IDLE_START, NULL);
 }
 
 static void __exit_idle(void)
 {
-	if (test_and_clear_bit_pda(0, isidle) == 0)
+	if (test_and_clear_bit(0, &per_cpu(pda.isidle, smp_processor_id())) == 0)
 		return;
 	atomic_notifier_call_chain(&idle_notifier, IDLE_END, NULL);
 }
@@ -411,7 +411,7 @@ start_thread(struct pt_regs *regs, unsig
 	load_gs_index(0);
 	regs->ip		= new_ip;
 	regs->sp		= new_sp;
-	write_pda(oldrsp, new_sp);
+	x86_write_percpu(pda.oldrsp, new_sp);
 	regs->cs		= __USER_CS;
 	regs->ss		= __USER_DS;
 	regs->flags		= 0x200;
@@ -633,14 +633,14 @@ __switch_to(struct task_struct *prev_p, 
 	/* 
 	 * Switch the PDA and FPU contexts.
 	 */
-	prev->usersp = read_pda(oldrsp);
-	write_pda(oldrsp, next->usersp);
-	write_pda(pcurrent, next_p); 
+	prev->usersp = x86_read_percpu(pda.oldrsp);
+	x86_write_percpu(pda.oldrsp, next->usersp);
+	x86_write_percpu(pda.pcurrent, next_p);
 
-	write_pda(kernelstack,
+	x86_write_percpu(pda.kernelstack,
 	(unsigned long)task_stack_page(next_p) + THREAD_SIZE - PDA_STACKOFFSET);
 #ifdef CONFIG_CC_STACKPROTECTOR
-	write_pda(stack_canary, next_p->stack_canary);
+	x86_write_percpu(pda.stack_canary, next_p->stack_canary);
 	/*
 	 * Build time only check to make sure the stack_canary is at
 	 * offset 40 in the pda; this is a gcc ABI requirement
Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c	2008-05-24 00:14:00.986488228 -0700
+++ linux-2.6/arch/x86/kernel/smp.c	2008-05-24 00:15:07.977736114 -0700
@@ -295,7 +295,7 @@ void smp_reschedule_interrupt(struct pt_
 #ifdef CONFIG_X86_32
 	__get_cpu_var(irq_stat).irq_resched_count++;
 #else
-	add_pda(irq_resched_count, 1);
+	x86_inc_percpu(pda.irq_resched_count);
 #endif
 }
 
@@ -320,7 +320,7 @@ void smp_call_function_interrupt(struct 
 #ifdef CONFIG_X86_32
 	__get_cpu_var(irq_stat).irq_call_count++;
 #else
-	add_pda(irq_call_count, 1);
+	x86_inc_percpu(pda.irq_call_count);
 #endif
 	irq_exit();
 
Index: linux-2.6/arch/x86/kernel/time_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/time_64.c	2008-05-24 00:14:00.996488252 -0700
+++ linux-2.6/arch/x86/kernel/time_64.c	2008-05-24 00:15:07.997736055 -0700
@@ -46,7 +46,7 @@ EXPORT_SYMBOL(profile_pc);
 
 static irqreturn_t timer_event_interrupt(int irq, void *dev_id)
 {
-	add_pda(irq0_irqs, 1);
+	x86_inc_percpu(pda.irq0_irqs);
 
 	global_clock_event->event_handler(global_clock_event);
 
Index: linux-2.6/arch/x86/kernel/tlb_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/tlb_64.c	2008-05-24 00:14:01.006487889 -0700
+++ linux-2.6/arch/x86/kernel/tlb_64.c	2008-05-24 00:15:08.037736066 -0700
@@ -60,9 +60,9 @@ static DEFINE_PER_CPU(union smp_flush_st
  */
 void leave_mm(int cpu)
 {
-	if (read_pda(mmu_state) == TLBSTATE_OK)
+	if (x86_read_percpu(pda.mmu_state) == TLBSTATE_OK)
 		BUG();
-	cpu_clear(cpu, read_pda(active_mm)->cpu_vm_mask);
+	cpu_clear(cpu, x86_read_percpu(pda.active_mm)->cpu_vm_mask);
 	load_cr3(swapper_pg_dir);
 }
 EXPORT_SYMBOL_GPL(leave_mm);
@@ -140,8 +140,8 @@ asmlinkage void smp_invalidate_interrupt
 		 * BUG();
 		 */
 
-	if (f->flush_mm == read_pda(active_mm)) {
-		if (read_pda(mmu_state) == TLBSTATE_OK) {
+	if (f->flush_mm == x86_read_percpu(pda.active_mm)) {
+		if (x86_read_percpu(pda.mmu_state) == TLBSTATE_OK) {
 			if (f->flush_va == TLB_FLUSH_ALL)
 				local_flush_tlb();
 			else
@@ -152,7 +152,7 @@ asmlinkage void smp_invalidate_interrupt
 out:
 	ack_APIC_irq();
 	cpu_clear(cpu, f->flush_cpumask);
-	add_pda(irq_tlb_count, 1);
+	x86_inc_percpu(pda.irq_tlb_count);
 }
 
 void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
@@ -264,7 +264,7 @@ static void do_flush_tlb_all(void *info)
 	unsigned long cpu = smp_processor_id();
 
 	__flush_tlb_all();
-	if (read_pda(mmu_state) == TLBSTATE_LAZY)
+	if (x86_read_percpu(pda.mmu_state) == TLBSTATE_LAZY)
 		leave_mm(cpu);
 }
 
Index: linux-2.6/include/asm-x86/pda.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pda.h	2008-05-24 00:14:01.016487470 -0700
+++ linux-2.6/include/asm-x86/pda.h	2008-05-24 00:15:22.598997235 -0700
@@ -39,92 +39,6 @@ struct x8664_pda {
 
 extern void pda_init(int);
 
-/*
- * There is no fast way to get the base address of the PDA, all the accesses
- * have to mention %fs/%gs.  So it needs to be done this Torvaldian way.
- */
-extern void __bad_pda_field(void) __attribute__((noreturn));
-
-/*
- * proxy_pda doesn't actually exist, but tell gcc it is accessed for
- * all PDA accesses so it gets read/write dependencies right.
- */
-extern struct x8664_pda _proxy_pda;
-
-#define pda_offset(field) offsetof(struct x8664_pda, field)
-
-#define pda_to_op(op, field, val)					\
-do {									\
-	typedef typeof(_proxy_pda.field) T__;				\
-	if (0) { T__ tmp__; tmp__ = (val); }	/* type checking */	\
-	switch (sizeof(_proxy_pda.field)) {				\
-	case 2:								\
-		asm(op "w %1,%%gs:%c2" :				\
-		    "+m" (_proxy_pda.field) :				\
-		    "ri" ((T__)val),					\
-		    "i"(pda_offset(field)));				\
-		break;							\
-	case 4:								\
-		asm(op "l %1,%%gs:%c2" :				\
-		    "+m" (_proxy_pda.field) :				\
-		    "ri" ((T__)val),					\
-		    "i" (pda_offset(field)));				\
-		break;							\
-	case 8:								\
-		asm(op "q %1,%%gs:%c2":					\
-		    "+m" (_proxy_pda.field) :				\
-		    "ri" ((T__)val),					\
-		    "i"(pda_offset(field)));				\
-		break;							\
-	default:							\
-		__bad_pda_field();					\
-	}								\
-} while (0)
-
-#define pda_from_op(op, field)			\
-({						\
-	typeof(_proxy_pda.field) ret__;		\
-	switch (sizeof(_proxy_pda.field)) {	\
-	case 2:					\
-		asm(op "w %%gs:%c1,%0" :	\
-		    "=r" (ret__) :		\
-		    "i" (pda_offset(field)),	\
-		    "m" (_proxy_pda.field));	\
-		break;				\
-	case 4:					\
-		asm(op "l %%gs:%c1,%0":		\
-		    "=r" (ret__):		\
-		    "i" (pda_offset(field)),	\
-		    "m" (_proxy_pda.field));	\
-		break;				\
-	case 8:					\
-		asm(op "q %%gs:%c1,%0":		\
-		    "=r" (ret__) :		\
-		    "i" (pda_offset(field)),	\
-		    "m" (_proxy_pda.field));	\
-		break;				\
-	default:				\
-		__bad_pda_field();		\
-	}					\
-	ret__;					\
-})
-
-#define read_pda(field)		pda_from_op("mov", field)
-#define write_pda(field, val)	pda_to_op("mov", field, val)
-#define add_pda(field, val)	pda_to_op("add", field, val)
-#define sub_pda(field, val)	pda_to_op("sub", field, val)
-#define or_pda(field, val)	pda_to_op("or", field, val)
-
-/* This is not atomic against other CPUs -- CPU preemption needs to be off */
-#define test_and_clear_bit_pda(bit, field)				\
-({									\
-	int old__;							\
-	asm volatile("btr %2,%%gs:%c3\n\tsbbl %0,%0"			\
-		     : "=r" (old__), "+m" (_proxy_pda.field)		\
-		     : "dIr" (bit), "i" (pda_offset(field)) : "memory");\
-	old__;								\
-})
-
 #endif
 
 #define PDA_STACKOFFSET (5*8)
Index: linux-2.6/arch/x86/kernel/x8664_ksyms_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/x8664_ksyms_64.c	2008-05-24 00:13:59.416488421 -0700
+++ linux-2.6/arch/x86/kernel/x8664_ksyms_64.c	2008-05-24 00:15:22.668997592 -0700
@@ -52,8 +52,6 @@ EXPORT_SYMBOL(empty_zero_page);
 EXPORT_SYMBOL(init_level4_pgt);
 EXPORT_SYMBOL(load_gs_index);
 
-EXPORT_SYMBOL(_proxy_pda);
-
 #ifdef CONFIG_PARAVIRT
 /* Virtualized guests may want to use it */
 EXPORT_SYMBOL_GPL(cpu_gdt_descr);
Index: linux-2.6/include/asm-x86/current_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/current_64.h	2008-05-24 00:13:59.356487241 -0700
+++ linux-2.6/include/asm-x86/current_64.h	2008-05-24 00:15:22.528988082 -0700
@@ -5,11 +5,11 @@
 struct task_struct;
 
 #include <asm/pda.h>
+#include <asm/percpu.h>
 
 static inline struct task_struct *get_current(void)
 {
-	struct task_struct *t = read_pda(pcurrent);
-	return t;
+	return x86_read_percpu(pda.pcurrent);
 }
 
 #define current get_current()
Index: linux-2.6/include/asm-x86/hardirq_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/hardirq_64.h	2008-05-24 00:13:59.366487645 -0700
+++ linux-2.6/include/asm-x86/hardirq_64.h	2008-05-24 00:15:22.528988082 -0700
@@ -11,12 +11,12 @@
 
 #define __ARCH_IRQ_STAT 1
 
-#define local_softirq_pending() read_pda(__softirq_pending)
+#define local_softirq_pending() x86_read_percpu(pda.__softirq_pending)
 
 #define __ARCH_SET_SOFTIRQ_PENDING 1
 
-#define set_softirq_pending(x) write_pda(__softirq_pending, (x))
-#define or_softirq_pending(x)  or_pda(__softirq_pending, (x))
+#define set_softirq_pending(x) x86_write_percpu(pda.__softirq_pending, (x))
+#define or_softirq_pending(x)  x86_or_percpu(pda.__softirq_pending, (x))
 
 extern void ack_bad_irq(unsigned int irq);
 
Index: linux-2.6/include/asm-x86/mmu_context_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/mmu_context_64.h	2008-05-24 00:13:59.376487037 -0700
+++ linux-2.6/include/asm-x86/mmu_context_64.h	2008-05-24 00:15:22.558986281 -0700
@@ -20,8 +20,8 @@ void destroy_context(struct mm_struct *m
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
 #ifdef CONFIG_SMP
-	if (read_pda(mmu_state) == TLBSTATE_OK)
-		write_pda(mmu_state, TLBSTATE_LAZY);
+	if (x86_read_percpu(pda.mmu_state) == TLBSTATE_OK)
+		x86_write_percpu(pda.mmu_state, TLBSTATE_LAZY);
 #endif
 }
 
@@ -33,8 +33,8 @@ static inline void switch_mm(struct mm_s
 		/* stop flush ipis for the previous mm */
 		cpu_clear(cpu, prev->cpu_vm_mask);
 #ifdef CONFIG_SMP
-		write_pda(mmu_state, TLBSTATE_OK);
-		write_pda(active_mm, next);
+		x86_write_percpu(pda.mmu_state, TLBSTATE_OK);
+		x86_write_percpu(pda.active_mm, next);
 #endif
 		cpu_set(cpu, next->cpu_vm_mask);
 		load_cr3(next->pgd);
@@ -44,8 +44,8 @@ static inline void switch_mm(struct mm_s
 	}
 #ifdef CONFIG_SMP
 	else {
-		write_pda(mmu_state, TLBSTATE_OK);
-		if (read_pda(active_mm) != next)
+		x86_write_percpu(pda.mmu_state, TLBSTATE_OK);
+		if (x86_read_percpu(pda.active_mm) != next)
 			BUG();
 		if (!cpu_test_and_set(cpu, next->cpu_vm_mask)) {
 			/* We were in lazy tlb mode and leave_mm disabled
Index: linux-2.6/include/asm-x86/smp.h
===================================================================
--- linux-2.6.orig/include/asm-x86/smp.h	2008-05-24 00:13:59.386487983 -0700
+++ linux-2.6/include/asm-x86/smp.h	2008-05-24 00:15:22.628996584 -0700
@@ -143,7 +143,7 @@ DECLARE_PER_CPU(int, cpu_number);
 extern int safe_smp_processor_id(void);
 
 #elif defined(CONFIG_X86_64_SMP)
-#define raw_smp_processor_id()	read_pda(cpunumber)
+#define raw_smp_processor_id()	x86_read_percpu(pda.cpunumber)
 
 #define stack_smp_processor_id()					\
 ({								\
Index: linux-2.6/include/asm-x86/thread_info_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/thread_info_64.h	2008-05-24 00:13:59.406488336 -0700
+++ linux-2.6/include/asm-x86/thread_info_64.h	2008-05-24 00:15:22.648997403 -0700
@@ -63,7 +63,7 @@ struct thread_info {
 static inline struct thread_info *current_thread_info(void)
 {
 	struct thread_info *ti;
-	ti = (void *)(read_pda(kernelstack) + PDA_STACKOFFSET - THREAD_SIZE);
+	ti = (void *)(x86_read_percpu(pda.kernelstack) + PDA_STACKOFFSET - THREAD_SIZE);
 	return ti;
 }
 

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 41/41] x86_64: Support for cpu ops
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (39 preceding siblings ...)
  2008-05-30  3:57 ` [patch 40/41] x86: Replace xxx_pda() operations with x86_xx_percpu() Christoph Lameter
@ 2008-05-30  3:57 ` Christoph Lameter
  2008-05-30  4:58 ` [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Andrew Morton
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:57 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_ops_x86 --]
[-- Type: text/plain, Size: 2393 bytes --]

Support fast cpu ops in x86_64 by providing a series of functions that
generate the proper instructions.

Define CONFIG_HAVE_CPU_OPS so that core code
can exploit the availability of fast per cpu operations.
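
A rough usage sketch (illustrative only; the "stats" structure and its field
are made up here, and CPU_ALLOC/CPU_INC come from the earlier patches in this
series):

	struct stats {
		long events;
	};
	struct stats *stats;

	stats = CPU_ALLOC(struct stats, GFP_KERNEL | __GFP_ZERO);
	...
	CPU_INC(stats->events);	/* a single "inc %gs:..." on x86_64 */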

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/x86/Kconfig         |    4 ++++
 include/asm-x86/percpu.h |   31 +++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig	2008-05-29 18:05:08.514214613 -0700
+++ linux-2.6/arch/x86/Kconfig	2008-05-29 18:09:55.889464792 -0700
@@ -168,6 +168,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HAVE_CPU_OPS
+	bool
+	default y
+
 config X86_SMP
 	bool
 	depends on SMP && ((X86_32 && !X86_VOYAGER) || X86_64)
Index: linux-2.6/include/asm-x86/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu.h	2008-05-29 18:08:28.513214585 -0700
+++ linux-2.6/include/asm-x86/percpu.h	2008-05-29 18:09:55.889464792 -0700
@@ -196,5 +196,36 @@ do {							\
 #define x86_cmpxchg_percpu(var, old, new) \
 				percpu_cmpxchg_op(per_cpu_var(var), old, new)
 
+#define CPU_READ(obj)		percpu_from_op("mov", obj)
+#define CPU_WRITE(obj, val)	percpu_to_op("mov", obj, val)
+#define CPU_ADD(obj, val)	percpu_to_op("add", obj, val)
+#define CPU_SUB(obj, val)	percpu_to_op("sub", obj, val)
+#define CPU_INC(obj)		percpu_addr_op("inc", obj)
+#define CPU_DEC(obj)		percpu_addr_op("dec", obj)
+#define CPU_XCHG(obj, val)	percpu_to_op("xchg", obj, val)
+#define CPU_CMPXCHG(obj, old, new) percpu_cmpxchg_op(obj, old, new)
+
+/*
+ * All cpu operations are interrupt safe and do not need to disable
+ * preempt. So the other variants all reduce to the same instruction.
+ */
+#define _CPU_READ CPU_READ
+#define _CPU_WRITE CPU_WRITE
+#define _CPU_ADD CPU_ADD
+#define _CPU_SUB CPU_SUB
+#define _CPU_INC CPU_INC
+#define _CPU_DEC CPU_DEC
+#define _CPU_XCHG CPU_XCHG
+#define _CPU_CMPXCHG CPU_CMPXCHG
+
+#define __CPU_READ CPU_READ
+#define __CPU_WRITE CPU_WRITE
+#define __CPU_ADD CPU_ADD
+#define __CPU_SUB CPU_SUB
+#define __CPU_INC CPU_INC
+#define __CPU_DEC CPU_DEC
+#define __CPU_XCHG CPU_XCHG
+#define __CPU_CMPXCHG CPU_CMPXCHG
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_PERCPU_H_ */

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (40 preceding siblings ...)
  2008-05-30  3:57 ` [patch 41/41] x86_64: Support for cpu ops Christoph Lameter
@ 2008-05-30  4:58 ` Andrew Morton
  2008-05-30  5:03   ` Christoph Lameter
  2008-06-04 15:07   ` Mike Travis
  41 siblings, 2 replies; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  4:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 20:56:20 -0700 Christoph Lameter <clameter@sgi.com> wrote:

> In various places the kernel maintains arrays of pointers indexed by
> processor numbers. These are used to locate objects that need to be used
> when executing on a specirfic processor. Both the slab allocator
> and the page allocator use these arrays and there the arrays are used in
> performance critical code. The allocpercpu functionality is a simple
> allocator to provide these arrays.

All seems reasonable to me.  The obvious question is "how do we size
the arena".  We either waste memory or, much worse, run out.

And running out is a real possibility, I think.  Most people will only
mount a handful of XFS filesystems.  But some customer will come along
who wants to mount 5,000, and distributors will need to cater for that,
but how can they?

I wonder if we can arrange for the default to be overridden via a
kernel boot option?
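
Something along these lines, perhaps (an untested sketch; the parameter name
is made up, and it assumes the area would be carved out at boot time rather
than being a fixed-size static per cpu array as in this patch):

	static unsigned long cpu_alloc_size = CONFIG_CPU_ALLOC_SIZE;

	static int __init cpu_alloc_size_setup(char *str)
	{
		cpu_alloc_size = memparse(str, &str);
		return 1;
	}
	__setup("cpu_alloc_size=", cpu_alloc_size_setup);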


Another obvious question is "how much of a problem will we have with
internal fragmentation"?  This might be a drop-dead showstopper.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  3:56 ` [patch 02/41] cpu alloc: The allocator Christoph Lameter
@ 2008-05-30  4:58   ` Andrew Morton
  2008-05-30  5:10     ` Christoph Lameter
  2008-06-04 14:48     ` Mike Travis
  2008-05-30  5:04   ` Eric Dumazet
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  4:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 20:56:22 -0700 Christoph Lameter <clameter@sgi.com> wrote:

> The per cpu allocator allows dynamic allocation of memory on all
> processors simultaneously. A bitmap is used to track used areas.
> The allocator implements tight packing to reduce the cache footprint
> and increase speed since cacheline contention is typically not a concern
> for memory mainly used by a single cpu. Small objects will fill up gaps
> left by larger allocations that required alignments.
> 
> The size of the cpu_alloc area can be changed via make menuconfig.
> 
> ...
>
> +config CPU_ALLOC_SIZE
> +	int "Size of cpu alloc area"
> +	default "30000"

strange choice of a default?  I guess it makes it clear that there's no
particular advantage in making it a power-of-two or anything like that.

> +	help
> +	  Sets the maximum amount of memory that can be allocated via cpu_alloc
> Index: linux-2.6/mm/Makefile
> ===================================================================
> --- linux-2.6.orig/mm/Makefile	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/Makefile	2008-05-29 20:15:41.000000000 -0700
> @@ -11,7 +11,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
>  			   maccess.o page_alloc.o page-writeback.o pdflush.o \
>  			   readahead.o swap.o truncate.o vmscan.o \
>  			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
> -			   page_isolation.o $(mmu-y)
> +			   page_isolation.o cpu_alloc.o $(mmu-y)
>  
>  obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
>  obj-$(CONFIG_BOUNCE)	+= bounce.o
> Index: linux-2.6/mm/cpu_alloc.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/mm/cpu_alloc.c	2008-05-29 20:13:39.000000000 -0700
> @@ -0,0 +1,167 @@
> +/*
> + * Cpu allocator - Manage objects allocated for each processor
> + *
> + * (C) 2008 SGI, Christoph Lameter <clameter@sgi.com>
> + * 	Basic implementation with allocation and free from a dedicated per
> + * 	cpu area.
> + *
> + * The per cpu allocator allows dynamic allocation of memory on all
> + * processors simultaneously. A bitmap is used to track used areas.
> + * The allocator implements tight packing to reduce the cache footprint
> + * and increase speed since cacheline contention is typically not a concern
> + * for memory mainly used by a single cpu. Small objects will fill up gaps
> + * left by larger allocations that required alignments.
> + */
> +#include <linux/mm.h>
> +#include <linux/mmzone.h>
> +#include <linux/module.h>
> +#include <linux/percpu.h>
> +#include <linux/bitmap.h>
> +#include <asm/sections.h>
> +
> +/*
> + * Basic allocation unit. A bit map is created to track the use of each
> + * UNIT_SIZE element in the cpu area.
> + */
> +#define UNIT_TYPE int
> +#define UNIT_SIZE sizeof(UNIT_TYPE)
> +#define UNITS (CONFIG_CPU_ALLOC_SIZE / UNIT_SIZE)
> +
> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
> +
> +/*
> + * How many units are needed for an object of a given size
> + */
> +static int size_to_units(unsigned long size)
> +{
> +	return DIV_ROUND_UP(size, UNIT_SIZE);
> +}

Perhaps it should return UNIT_TYPE? (ugh).

I guess there's no need to ever change that type, so no?

> +/*
> + * Lock to protect the bitmap and the meta data for the cpu allocator.
> + */
> +static DEFINE_SPINLOCK(cpu_alloc_map_lock);
> +static DECLARE_BITMAP(cpu_alloc_map, UNITS);
> +static int first_free;		/* First known free unit */

Would be nicer to move these above size_to_units(), IMO.

> +/*
> + * Mark an object as used in the cpu_alloc_map
> + *
> + * Must hold cpu_alloc_map_lock
> + */
> +static void set_map(int start, int length)
> +{
> +	while (length-- > 0)
> +		__set_bit(start++, cpu_alloc_map);
> +}

bitmap_fill()?

> +/*
> + * Mark an area as freed.
> + *
> + * Must hold cpu_alloc_map_lock
> + */
> +static void clear_map(int start, int length)
> +{
> +	while (length-- > 0)
> +		__clear_bit(start++, cpu_alloc_map);
> +}

bitmap_zero()?

> +/*
> + * Allocate an object of a certain size
> + *
> + * Returns a special pointer that can be used with CPU_PTR to find the
> + * address of the object for a certain cpu.
> + */

Should be kerneldoc, I guess.
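
I.e. something like (a sketch only):

	/**
	 * cpu_alloc - allocate an instance of an object for every processor
	 * @size:	size of the object
	 * @gfpflags:	allocation flags (__GFP_ZERO zeroes all instances)
	 * @align:	alignment requirement for the object
	 *
	 * Returns a cpu pointer that can be passed to CPU_PTR() to obtain the
	 * address of the instance for a particular cpu, or NULL if the cpu
	 * area is exhausted.
	 */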

> +void *cpu_alloc(unsigned long size, gfp_t gfpflags, unsigned long align)
> +{
> +	unsigned long start;
> +	int units = size_to_units(size);
> +	void *ptr;
> +	int first;
> +	unsigned long flags;
> +
> +	if (!size)
> +		return ZERO_SIZE_PTR;

OK, so we reuse ZERO_SIZE_PTR from kmalloc.

> +	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
> +
> +	first = 1;
> +	start = first_free;
> +
> +	for ( ; ; ) {
> +
> +		start = find_next_zero_bit(cpu_alloc_map, UNITS, start);
> +		if (start >= UNITS)
> +			goto out_of_memory;
> +
> +		if (first)
> +			first_free = start;
> +
> +		/*
> +		 * Check alignment and that there is enough space after
> +		 * the starting unit.
> +		 */
> +		if (start % (align / UNIT_SIZE) == 0 &&
> +			find_next_bit(cpu_alloc_map, UNITS, start + 1)
> +							>= start + units)
> +				break;
> +		start++;
> +		first = 0;
> +	}

This is kinda bitmap_find_free_region(), only bitmap_find_free_region()
isn't quite strong enough.

Generally I think it would have been better if you had added new
primitives to the bitmap library (or enhanced existing ones) and used
them here, rather than implementing private functionality.

> +	if (first)
> +		first_free = start + units;
> +
> +	if (start + units > UNITS)
> +		goto out_of_memory;
> +
> +	set_map(start, units);
> +	__count_vm_events(CPU_BYTES, units * UNIT_SIZE);
> +
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +
> +	ptr = per_cpu_var(area) + start;
> +
> +	if (gfpflags & __GFP_ZERO) {
> +		int cpu;
> +
> +		for_each_possible_cpu(cpu)
> +			memset(CPU_PTR(ptr, cpu), 0, size);
> +	}
> +
> +	return ptr;
> +
> +out_of_memory:
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +	return NULL;
> +}
> +EXPORT_SYMBOL(cpu_alloc);
> +
> +/*
> + * Free an object. The pointer must be a cpu pointer allocated
> + * via cpu_alloc.
> + */
> +void cpu_free(void *start, unsigned long size)
> +{
> +	unsigned long units = size_to_units(size);
> +	unsigned long index = (int *)start - per_cpu_var(area);
> +	unsigned long flags;
> +
> +	if (!start || start == ZERO_SIZE_PTR)
> +		return;
> +
> +	BUG_ON(index >= UNITS ||
> +		!test_bit(index, cpu_alloc_map) ||
> +		!test_bit(index + units - 1, cpu_alloc_map));

If this assertion triggers for someone, you'll wish like hell that it
had been implemented as three separate BUG_ONs.
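
I.e. something like:

	BUG_ON(index >= UNITS);
	BUG_ON(!test_bit(index, cpu_alloc_map));
	BUG_ON(!test_bit(index + units - 1, cpu_alloc_map));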

> +	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
> +
> +	clear_map(index, units);
> +	__count_vm_events(CPU_BYTES, -units * UNIT_SIZE);
> +
> +	if (index < first_free)
> +		first_free = index;
> +
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +}
> +EXPORT_SYMBOL(cpu_free);
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/vmstat.c	2008-05-29 20:13:39.000000000 -0700
> @@ -653,6 +653,7 @@ static const char * const vmstat_text[] 
>  	"allocstall",
>  
>  	"pgrotated",
> +	"cpu_bytes",
>  #ifdef CONFIG_HUGETLB_PAGE
>  	"htlb_buddy_alloc_success",
>  	"htlb_buddy_alloc_fail",
> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/include/linux/percpu.h	2008-05-29 20:29:12.000000000 -0700
> @@ -135,4 +135,50 @@ static inline void percpu_free(void *__p
>  #define free_percpu(ptr)	percpu_free((ptr))
>  #define per_cpu_ptr(ptr, cpu)	percpu_ptr((ptr), (cpu))
>  
> +
> +/*
> + * cpu allocator definitions
> + *
> + * The cpu allocator allows allocating an instance of an object for each
> + * processor and the use of a single pointer to access all instances
> + * of the object. cpu_alloc provides optimized means for accessing the
> + * instance of the object belonging to the currently executing processor
> + * as well as special atomic operations on fields of objects of the
> + * currently executing processor.
> + *
> + * Cpu objects are typically small. The allocator packs them tightly
> + * to increase the chance on each access that a per cpu object is already
> + * cached. Alignments may be specified but the intent is to align the data
> + * properly due to cpu alignment constraints and not to avoid cacheline
> + * contention. Any holes left by aligning objects are filled up with smaller
> + * objects that are allocated later.
> + *
> + * Cpu data can be allocated using CPU_ALLOC. The resulting pointer is
> + * pointing to the instance of the variable in the per cpu area provided
> + * by the loader. It is generally an error to use the pointer directly
> + * unless we are booting the system.
> + *
> + * __GFP_ZERO may be passed as a flag to zero the allocated memory.
> + */
> +
> +/* Return a pointer to the instance of a object for a particular processor */
> +#define CPU_PTR(__p, __cpu)	SHIFT_PERCPU_PTR((__p), per_cpu_offset(__cpu))

eek, a major interface function which is ALL IN CAPS!

can we do this in lower-case?  In a C function?

> +/*
> + * Return a pointer to the instance of the object belonging to the processor
> + * running the current code.
> + */
> +#define THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), my_cpu_offset)
> +#define __THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), __my_cpu_offset)
> +
> +#define CPU_ALLOC(type, flags)	((typeof(type) *)cpu_alloc(sizeof(type), \
> +					(flags), __alignof__(type)))
> +#define CPU_FREE(pointer)	cpu_free((pointer), sizeof(*(pointer)))

Dittoes.

> +/*
> + * Raw calls
> + */
> +void *cpu_alloc(unsigned long size, gfp_t flags, unsigned long align);
> +void cpu_free(void *cpu_pointer, unsigned long size);
> +
>  #endif /* __LINUX_PERCPU_H */


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator
  2008-05-30  3:56 ` [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator Christoph Lameter
@ 2008-05-30  4:58   ` Andrew Morton
  2008-05-30  5:14     ` Christoph Lameter
  2008-05-30  6:08   ` Rusty Russell
  1 sibling, 1 reply; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  4:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 20:56:23 -0700 Christoph Lameter <clameter@sgi.com> wrote:

> Remove the builtin per cpu allocator from modules.c and use cpu_alloc instead.
> 
> The patch also removes PERCPU_ENOUGH_ROOM. The size of the cpu_alloc area is
> determined by CONFIG_CPU_AREA_SIZE. PERCPU_ENOUGH_ROOMs default was 8k.
> CONFIG_CPU_AREA_SIZE defaults to 30k. Thus we have more space to load modules.
> 
>
> ...
>
> +		unsigned long align = sechdrs[pcpuindex].sh_addralign;
> +		unsigned long size = sechdrs[pcpuindex].sh_size;
> +
> +		if (align > PAGE_SIZE) {
> +			printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
> +			mod->name, align, PAGE_SIZE);

Indenting broke.

Alas, PAGE_SIZE has, iirc, unsigned type on some architectures and
unsigned long on others.  I suspect you'll need to cast it to be able
to print it.

> +			align = PAGE_SIZE;
> +		}
> +		percpu = cpu_alloc(size, GFP_KERNEL|__GFP_ZERO, align);
> +		if (!percpu)
> +			printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",

80-col bustage.

A printk like this should, I think, identify what part of the kernel it
came from.

But really, I don't think any printk should be present here. 
cpu_alloc() itself should dump the warning and the backtrace when it
runs out.  Because a cpu_alloc() failure is a major catastrophe.  It
probably means a reconfigure-and-reboot cycle.

Right now it means a
reconfigure-kernel-rebuild-kernel-reinstall-kernel-then-reboot cycle. 
Or a call-vendor-complain-pay-money-and-wait cycle.  But I hope we can
fix that with the boot parameter thing?



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  3:56 ` [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations Christoph Lameter
@ 2008-05-30  4:58   ` Andrew Morton
  2008-05-30  5:17     ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  4:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 20:56:24 -0700 Christoph Lameter <clameter@sgi.com> wrote:

> Currently the per cpu subsystem is not able to use the atomic capabilities
> that are provided by many of the available processors.
> 
> This patch adds new functionality that allows the optimizing of per cpu
> variable handling. In particular it provides a simple way to exploit
> atomic operations in order to avoid having to disable interrupts or
> performing address calculation to access per cpu data.
> 
> F.e. Using our current methods we may do
> 
> 	unsigned long flags;
> 	struct stat_struct *p;
> 
> 	local_irq_save(flags);
> 	/* Calculate address of per processor area */
> 	p = CPU_PTR(stat, smp_processor_id());
> 	p->counter++;
> 	local_irq_restore(flags);

eh?  That's what local_t is for?

> The segment can be replaced by a single atomic CPU operation:
> 
> 	CPU_INC(stat->counter);

hm, I guess this _has_ to be implemented as a macro.  ho hum.  But
please: "cpu_inc"?

> Most processors have instructions to perform the increment using a
> a single atomic instruction. Processors may have segment registers,
> global registers or per cpu mappings of per cpu areas that can be used
> to generate atomic instructions that combine the following in a single
> operation:
> 
> 1. Adding of an offset / register to a base address
> 2. Read modify write operation on the address calculated by
>    the instruction.
> 
> If 1+2 are combined in an instruction then the instruction is atomic
> vs interrupts. This means that percpu atomic operations do not need
> to disable interrupts to increments counters etc.
> 
> The existing methods in use in the kernel cannot utilize the power of
> these atomic instructions. local_t is not really addressing the issue
> since the offset calculation is performed before the atomic operation. The
> operation is therefore not atomic. Disabling interrupts or preemption is
> required in order to use local_t.

Your terminology is totally confusing here.

To me, an "atomic operation" is one which is atomic wrt other CPUs:
atomic_t, for example.

Here we're talking about atomic-wrt-this-cpu-only, yes?

If so, we should invent a new term for that different concept and stick
to it like glue.  How about "self-atomic"?  Or "locally-atomic" in
deference to the existing local_t?

> local_t is also very specific to the x86 processor.

And alpha, m32r, mips and powerpc, methinks.  Probably others, but
people just haven't got around to it.

> The solution here can
> utilize other methods than just those provided by the x86 instruction set.
> 
> 
> 
> On x86 the above CPU_INC translated into a single instruction:
> 
> 	inc %%gs:(&stat->counter)
> 
> This instruction is interrupt safe since it can either be completed
> or not. Both adding of the offset and the read modify write are combined
> in one instruction.
> 
> The determination of the correct per cpu area for the current processor
> does not require access to smp_processor_id() (expensive...). The gs
> register is used to provide a processor specific offset to the respective
> per cpu area where the per cpu variable resides.
> 
> Note that the counter offset into the struct was added *before* the segment
> selector was added. This is necessary to avoid calculations.  In the past
> we first determine the address of the stats structure on the respective
> processor and then added the field offset. However, the offset may as
> well be added earlier. The adding of the per cpu offset (here through the
> gs register) must be done by the instruction used for atomic per cpu
> access.
> 
> 
> 
> If "stat" was declared via DECLARE_PER_CPU then this patchset is capable of
> convincing the linker to provide the proper base address. In that case
> no calculations are necessary.
> 
> Should the stat structure be reachable via a register then the address
> calculation capabilities can be leveraged to avoid calculations.
> 
> On IA64 we can get the same combination of operations in a single instruction
> by using the virtual address that always maps to the local per cpu area:
> 
> 	fetchadd &stat->counter + (VCPU_BASE - __per_cpu_start)
> 
> The access is forced into the per cpu address reachable via the virtualized
> address. IA64 allows the embedding of an offset into the instruction. So the
> fetchadd can perform both the relocation of the pointer into the per cpu
> area as well as the atomic read modify write cycle.
> 
> 
> 
> In order to be able to exploit the atomicity of these instructions we
> introduce a series of new functions that take either:
> 
> 1. A per cpu pointer as returned by cpu_alloc() or CPU_ALLOC().
> 
> 2. A per cpu variable address as returned by per_cpu_var(<percpuvarname>).
> 
> CPU_READ()
> CPU_WRITE()
> CPU_INC
> CPU_DEC
> CPU_ADD
> CPU_SUB
> CPU_XCHG
> CPU_CMPXCHG
> 

I think I'll need to come back another time to understand all that ;)

Thanks for writing it up carefully.

> 
> ---
>  include/linux/percpu.h |  135 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 135 insertions(+)
> 
> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h	2008-05-28 22:31:43.000000000 -0700
> +++ linux-2.6/include/linux/percpu.h	2008-05-28 23:38:17.000000000 -0700

I wonder if all this stuff should be in a new header file.

We could get lazy and include that header from percpu.h if needed.

> @@ -179,4 +179,139 @@
>  void *cpu_alloc(unsigned long size, gfp_t flags, unsigned long align);
>  void cpu_free(void *cpu_pointer, unsigned long size);
>  
> +/*
> + * Fast atomic per cpu operations.
> + *
> + * The following operations can be overridden by arches to implement fast
> + * and efficient operations. The operations are atomic meaning that the
> + * determination of the processor, the calculation of the address and the
> + * operation on the data is an atomic operation.
> + *
> + * The parameter passed to the atomic per cpu operations is an lvalue not a
> + * pointer to the object.
> + */
> +#ifndef CONFIG_HAVE_CPU_OPS

If you move this functionality into a new cpu_alloc.h then the below
code goes into include/asm-generic/cpu_alloc.h and most architectures'
include/asm/cpu_alloc.h will include asm-generic/cpu_alloc.h.

include/linux/percpu.h can still include linux/cpu_alloc.h (which
includes asm/cpu_alloc.h) if needed.  But it would be better to just
teach the .c files to include <linux/cpu_alloc.h>
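
Roughly this layering (a sketch of the suggestion; file names assumed):

	/* include/asm-generic/cpu_alloc.h: the fallback __CPU_*/_CPU_*/CPU_* macros below */

	/* include/asm-<arch>/cpu_alloc.h, for most architectures: */
	#include <asm-generic/cpu_alloc.h>

	/* include/linux/cpu_alloc.h: */
	#include <asm/cpu_alloc.h>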

> +/*
> + * Fallback in case the arch does not provide for atomic per cpu operations.
> + *
> + * The first group of macros is used when it is safe to update the per
> + * cpu variable because preemption is off (per cpu variables that are not
> + * updated from interrupt context) or because interrupts are already off.
> + */
> +#define __CPU_READ(var)				\
> +({						\
> +	(*THIS_CPU(&(var)));			\
> +})
> +
> +#define __CPU_WRITE(var, value)			\
> +({						\
> +	*THIS_CPU(&(var)) = (value);		\
> +})
> +
> +#define __CPU_ADD(var, value)			\
> +({						\
> +	*THIS_CPU(&(var)) += (value);		\
> +})
> +
> +#define __CPU_INC(var) __CPU_ADD((var), 1)
> +#define __CPU_DEC(var) __CPU_ADD((var), -1)
> +#define __CPU_SUB(var, value) __CPU_ADD((var), -(value))
> +
> +#define __CPU_CMPXCHG(var, old, new)		\
> +({						\
> +	typeof(var) x;				\
> +	typeof(var) *p = THIS_CPU(&(var));	\
> +	x = *p;					\
> +	if (x == (old))				\
> +		*p = (new);			\
> +	(x);					\
> +})
> +
> +#define __CPU_XCHG(obj, new)			\
> +({						\
> +	typeof(obj) x;				\
> +	typeof(obj) *p = THIS_CPU(&(obj));	\
> +	x = *p;					\
> +	*p = (new);				\
> +	(x);					\
> +})
> +
> +/*
> + * Second group used for per cpu variables that are not updated from an
> + * interrupt context. In that case we can simply disable preemption which
> + * may be free if the kernel is compiled without support for preemption.
> + */
> +#define _CPU_READ __CPU_READ
> +#define _CPU_WRITE __CPU_WRITE
> +
> +#define _CPU_ADD(var, value)			\
> +({						\
> +	preempt_disable();			\
> +	__CPU_ADD((var), (value));		\
> +	preempt_enable();			\
> +})
> +
> +#define _CPU_INC(var) _CPU_ADD((var), 1)
> +#define _CPU_DEC(var) _CPU_ADD((var), -1)
> +#define _CPU_SUB(var, value) _CPU_ADD((var), -(value))
> +
> +#define _CPU_CMPXCHG(var, old, new)		\
> +({						\
> +	typeof(var) x;				\
> +	preempt_disable();			\
> +	x = __CPU_CMPXCHG((var), (old), (new));	\
> +	preempt_enable();			\
> +	(x);					\
> +})
> +
> +#define _CPU_XCHG(var, new)			\
> +({						\
> +	typeof(var) x;				\
> +	preempt_disable();			\
> +	x = __CPU_XCHG((var), (new));		\
> +	preempt_enable();			\
> +	(x);					\
> +})
> +
> +/*
> + * Third group: Interrupt safe CPU functions
> + */
> +#define CPU_READ __CPU_READ
> +#define CPU_WRITE __CPU_WRITE
> +
> +#define CPU_ADD(var, value)			\
> +({						\
> +	unsigned long flags;			\
> +	local_irq_save(flags);			\
> +	__CPU_ADD((var), (value));		\
> +	local_irq_restore(flags);		\
> +})
> +
> +#define CPU_INC(var) CPU_ADD((var), 1)
> +#define CPU_DEC(var) CPU_ADD((var), -1)
> +#define CPU_SUB(var, value) CPU_ADD((var), -(value))
> +
> +#define CPU_CMPXCHG(var, old, new)		\
> +({						\
> +	unsigned long flags;			\
> +	typeof(var) x;				\
> +	local_irq_save(flags);			\
> +	x = __CPU_CMPXCHG((var), (old), (new));	\
> +	local_irq_restore(flags);		\
> +	(x);					\
> +})
> +
> +#define CPU_XCHG(var, new)			\
> +({						\
> +	unsigned long flags;			\
> +	typeof(var) x;				\
> +	local_irq_save(flags);			\
> +	x = __CPU_XCHG((var), (new));		\
> +	local_irq_restore(flags);		\
> +	(x);					\
> +})
> +
> +#endif /* CONFIG_HAVE_CPU_OPS */
> +
>  #endif /* __LINUX_PERCPU_H */


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 27/41] cpu alloc: Remove the allocpercpu functionality
  2008-05-30  3:56 ` [patch 27/41] cpu alloc: Remove the allocpercpu functionality Christoph Lameter
@ 2008-05-30  4:58   ` Andrew Morton
  0 siblings, 0 replies; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  4:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 20:56:47 -0700 Christoph Lameter <clameter@sgi.com> wrote:

> There is no user of allocpercpu left after all the earlier patches were
> applied. Remove the allocpercpu code.

Wow.

y:/usr/src/25> grep alloc_percpu patches/*.patch
y:/usr/src/25> 

we might just be able to get away with doing this.

But it might make life easier to defer the removal of the old stuff for
a while.  This can be worked out later.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 35/41] Support for CPU ops
  2008-05-30  3:56 ` [patch 35/41] Support for CPU ops Christoph Lameter
@ 2008-05-30  4:58   ` Andrew Morton
  2008-05-30  5:18     ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  4:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, Tony.Luck, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 20:56:55 -0700 Christoph Lameter <clameter@sgi.com> wrote:

> Subject: [patch 35/41] Support for CPU ops

Should be called "ia64: support for CPU ops", please.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  4:58 ` [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Andrew Morton
@ 2008-05-30  5:03   ` Christoph Lameter
  2008-05-30  5:21     ` Andrew Morton
  2008-06-04 15:07   ` Mike Travis
  1 sibling, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  5:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> All seems reasonable to me.  The obvious question is "how do we size
> the arena".  We either waste memory or, much worse, run out.

The per cpu memory use by subsystems is typically quite small. We already 
have an 8k limitation for percpu space for modules. And that does not seem 
to be a problem.

> And running out is a real possibility, I think.  Most people will only
> mount a handful of XFS filesystems.  But some customer will come along
> who wants to mount 5,000, and distributors will need to cater for that,
> but how can they?

Typically these are fairly small: 8 bytes * 5000 is only about 40k.

> I wonder if we can arrange for the default to be overridden via a
> kernel boot option?

We could do that yes.
 
> Another obvious question is "how much of a problem will we have with
> internal fragmentation"?  This might be a drop-dead showstopper.

But then per cpu data is not frequently allocated and freed.

Going away from allocpercpu saves a lot of memory. We could make this 
128k or so to be safe?



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  3:56 ` [patch 02/41] cpu alloc: The allocator Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
@ 2008-05-30  5:04   ` Eric Dumazet
  2008-05-30  5:20     ` Christoph Lameter
  2008-05-30  5:46   ` Rusty Russell
  2008-05-31 20:58   ` Pavel Machek
  3 siblings, 1 reply; 139+ messages in thread
From: Eric Dumazet @ 2008-05-30  5:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Peter Zijlstra,
	Rusty Russell, Mike Travis

Christoph Lameter wrote:
> The per cpu allocator allows dynamic allocation of memory on all
> processors simultaneously. A bitmap is used to track used areas.
> The allocator implements tight packing to reduce the cache footprint
> and increase speed since cacheline contention is typically not a concern
> for memory mainly used by a single cpu. Small objects will fill up gaps
> left by larger allocations that required alignments.
>
> The size of the cpu_alloc area can be changed via make menuconfig.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
>  include/linux/percpu.h |   46 +++++++++++++
>  include/linux/vmstat.h |    2 
>  mm/Kconfig             |    6 +
>  mm/Makefile            |    2 
>  mm/cpu_alloc.c         |  167 +++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/vmstat.c            |    1 
>  6 files changed, 222 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/include/linux/vmstat.h
> ===================================================================
> --- linux-2.6.orig/include/linux/vmstat.h	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/include/linux/vmstat.h	2008-05-29 20:15:37.000000000 -0700
> @@ -37,7 +37,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
>  		FOR_ALL_ZONES(PGSCAN_KSWAPD),
>  		FOR_ALL_ZONES(PGSCAN_DIRECT),
>  		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> -		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> +		PAGEOUTRUN, ALLOCSTALL, PGROTATED, CPU_BYTES,
>  #ifdef CONFIG_HUGETLB_PAGE
>  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
>  #endif
> Index: linux-2.6/mm/Kconfig
> ===================================================================
> --- linux-2.6.orig/mm/Kconfig	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/Kconfig	2008-05-29 20:13:39.000000000 -0700
> @@ -205,3 +205,9 @@ config NR_QUICK
>  config VIRT_TO_BUS
>  	def_bool y
>  	depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config CPU_ALLOC_SIZE
> +	int "Size of cpu alloc area"
> +	default "30000"
> +	help
> +	  Sets the maximum amount of memory that can be allocated via cpu_alloc
> Index: linux-2.6/mm/Makefile
> ===================================================================
> --- linux-2.6.orig/mm/Makefile	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/Makefile	2008-05-29 20:15:41.000000000 -0700
> @@ -11,7 +11,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
>  			   maccess.o page_alloc.o page-writeback.o pdflush.o \
>  			   readahead.o swap.o truncate.o vmscan.o \
>  			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
> -			   page_isolation.o $(mmu-y)
> +			   page_isolation.o cpu_alloc.o $(mmu-y)
>  
>  obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
>  obj-$(CONFIG_BOUNCE)	+= bounce.o
> Index: linux-2.6/mm/cpu_alloc.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/mm/cpu_alloc.c	2008-05-29 20:13:39.000000000 -0700
> @@ -0,0 +1,167 @@
> +/*
> + * Cpu allocator - Manage objects allocated for each processor
> + *
> + * (C) 2008 SGI, Christoph Lameter <clameter@sgi.com>
> + * 	Basic implementation with allocation and free from a dedicated per
> + * 	cpu area.
> + *
> + * The per cpu allocator allows dynamic allocation of memory on all
> + * processors simultaneously. A bitmap is used to track used areas.
> + * The allocator implements tight packing to reduce the cache footprint
> + * and increase speed since cacheline contention is typically not a concern
> + * for memory mainly used by a single cpu. Small objects will fill up gaps
> + * left by larger allocations that required alignments.
> + */
> +#include <linux/mm.h>
> +#include <linux/mmzone.h>
> +#include <linux/module.h>
> +#include <linux/percpu.h>
> +#include <linux/bitmap.h>
> +#include <asm/sections.h>
> +
> +/*
> + * Basic allocation unit. A bit map is created to track the use of each
> + * UNIT_SIZE element in the cpu area.
> + */
> +#define UNIT_TYPE int
> +#define UNIT_SIZE sizeof(UNIT_TYPE)
> +#define UNITS (CONFIG_CPU_ALLOC_SIZE / UNIT_SIZE)
> +
> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
>   
area[] is not guaranteed to be aligned on anything but 4 bytes.

If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get
a non-aligned result.

Either you should add an __attribute__((__aligned__(PAGE_SIZE))), or take
into account the real address of area[] in cpu_alloc() to avoid wasting up
to PAGE_SIZE bytes per cpu.
> +
> +/*
> + * How many units are needed for an object of a given size
> + */
> +static int size_to_units(unsigned long size)
> +{
> +	return DIV_ROUND_UP(size, UNIT_SIZE);
> +}
> +
> +/*
> + * Lock to protect the bitmap and the meta data for the cpu allocator.
> + */
> +static DEFINE_SPINLOCK(cpu_alloc_map_lock);
> +static DECLARE_BITMAP(cpu_alloc_map, UNITS);
> +static int first_free;		/* First known free unit */
> +
> +/*
> + * Mark an object as used in the cpu_alloc_map
> + *
> + * Must hold cpu_alloc_map_lock
> + */
> +static void set_map(int start, int length)
> +{
> +	while (length-- > 0)
> +		__set_bit(start++, cpu_alloc_map);
> +}
> +
> +/*
> + * Mark an area as freed.
> + *
> + * Must hold cpu_alloc_map_lock
> + */
> +static void clear_map(int start, int length)
> +{
> +	while (length-- > 0)
> +		__clear_bit(start++, cpu_alloc_map);
> +}
> +
> +/*
> + * Allocate an object of a certain size
> + *
> + * Returns a special pointer that can be used with CPU_PTR to find the
> + * address of the object for a certain cpu.
> + */
> +void *cpu_alloc(unsigned long size, gfp_t gfpflags, unsigned long align)
> +{
> +	unsigned long start;
> +	int units = size_to_units(size);
> +	void *ptr;
> +	int first;
> +	unsigned long flags;
> +
> +	if (!size)
> +		return ZERO_SIZE_PTR;
> +
> +	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
> +
> +	first = 1;
> +	start = first_free;
> +
> +	for ( ; ; ) {
> +
> +		start = find_next_zero_bit(cpu_alloc_map, UNITS, start);
> +		if (start >= UNITS)
> +			goto out_of_memory;
> +
> +		if (first)
> +			first_free = start;
> +
> +		/*
> +		 * Check alignment and that there is enough space after
> +		 * the starting unit.
> +		 */
> +		if (start % (align / UNIT_SIZE) == 0 &&
> +			find_next_bit(cpu_alloc_map, UNITS, start + 1)
> +							>= start + units)
> +				break;
> +		start++;
> +		first = 0;
> +	}
> +
> +	if (first)
> +		first_free = start + units;
> +
> +	if (start + units > UNITS)
> +		goto out_of_memory;
> +
> +	set_map(start, units);
> +	__count_vm_events(CPU_BYTES, units * UNIT_SIZE);
> +
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +
> +	ptr = per_cpu_var(area) + start;
> +
> +	if (gfpflags & __GFP_ZERO) {
> +		int cpu;
> +
> +		for_each_possible_cpu(cpu)
> +			memset(CPU_PTR(ptr, cpu), 0, size);
> +	}
> +
> +	return ptr;
> +
> +out_of_memory:
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +	return NULL;
> +}
> +EXPORT_SYMBOL(cpu_alloc);
> +
> +/*
> + * Free an object. The pointer must be a cpu pointer allocated
> + * via cpu_alloc.
> + */
> +void cpu_free(void *start, unsigned long size)
> +{
> +	unsigned long units = size_to_units(size);
> +	unsigned long index = (int *)start - per_cpu_var(area);
> +	unsigned long flags;
> +
> +	if (!start || start == ZERO_SIZE_PTR)
> +		return;
> +
> +	BUG_ON(index >= UNITS ||
> +		!test_bit(index, cpu_alloc_map) ||
> +		!test_bit(index + units - 1, cpu_alloc_map));
> +
> +	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
> +
> +	clear_map(index, units);
> +	__count_vm_events(CPU_BYTES, -units * UNIT_SIZE);
> +
> +	if (index < first_free)
> +		first_free = index;
> +
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +}
> +EXPORT_SYMBOL(cpu_free);
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/vmstat.c	2008-05-29 20:13:39.000000000 -0700
> @@ -653,6 +653,7 @@ static const char * const vmstat_text[] 
>  	"allocstall",
>  
>  	"pgrotated",
> +	"cpu_bytes",
>  #ifdef CONFIG_HUGETLB_PAGE
>  	"htlb_buddy_alloc_success",
>  	"htlb_buddy_alloc_fail",
> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/include/linux/percpu.h	2008-05-29 20:29:12.000000000 -0700
> @@ -135,4 +135,50 @@ static inline void percpu_free(void *__p
>  #define free_percpu(ptr)	percpu_free((ptr))
>  #define per_cpu_ptr(ptr, cpu)	percpu_ptr((ptr), (cpu))
>  
> +
> +/*
> + * cpu allocator definitions
> + *
> + * The cpu allocator allows allocating an instance of an object for each
> + * processor and the use of a single pointer to access all instances
> + * of the object. cpu_alloc provides optimized means for accessing the
> + * instance of the object belonging to the currently executing processor
> + * as well as special atomic operations on fields of objects of the
> + * currently executing processor.
> + *
> + * Cpu objects are typically small. The allocator packs them tightly
> + * to increase the chance on each access that a per cpu object is already
> + * cached. Alignments may be specified but the intent is to align the data
> + * properly due to cpu alignment constraints and not to avoid cacheline
> + * contention. Any holes left by aligning objects are filled up with smaller
> + * objects that are allocated later.
> + *
> + * Cpu data can be allocated using CPU_ALLOC. The resulting pointer is
> + * pointing to the instance of the variable in the per cpu area provided
> + * by the loader. It is generally an error to use the pointer directly
> + * unless we are booting the system.
> + *
> + * __GFP_ZERO may be passed as a flag to zero the allocated memory.
> + */
> +
> +/* Return a pointer to the instance of a object for a particular processor */
> +#define CPU_PTR(__p, __cpu)	SHIFT_PERCPU_PTR((__p), per_cpu_offset(__cpu))
> +
> +/*
> + * Return a pointer to the instance of the object belonging to the processor
> + * running the current code.
> + */
> +#define THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), my_cpu_offset)
> +#define __THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), __my_cpu_offset)
> +
> +#define CPU_ALLOC(type, flags)	((typeof(type) *)cpu_alloc(sizeof(type), \
> +					(flags), __alignof__(type)))
> +#define CPU_FREE(pointer)	cpu_free((pointer), sizeof(*(pointer)))
> +
> +/*
> + * Raw calls
> + */
> +void *cpu_alloc(unsigned long size, gfp_t flags, unsigned long align);
> +void cpu_free(void *cpu_pointer, unsigned long size);
> +
>  #endif /* __LINUX_PERCPU_H */
>
>   





^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  4:58   ` Andrew Morton
@ 2008-05-30  5:10     ` Christoph Lameter
  2008-05-30  5:31       ` Andrew Morton
  2008-05-30  5:56       ` KAMEZAWA Hiroyuki
  2008-06-04 14:48     ` Mike Travis
  1 sibling, 2 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  5:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> > +config CPU_ALLOC_SIZE
> > +	int "Size of cpu alloc area"
> > +	default "30000"
> 
> strange choice of a default?  I guess it makes it clear that there's no
> particular advantage in making it a power-of-two or anything like that.

The cpu alloc has a cpu_bytes field in vmstat that shows how much memory 
is being used. 30000 seemed to be reasonable after staring at these 
numbers for awhile.

> > +static int size_to_units(unsigned long size)
> > +{
> > +	return DIV_ROUND_UP(size, UNIT_SIZE);
> > +}
> 
> Perhaps it should return UNIT_TYPE? (ugh).

No. The UNIT_TYPE is the basic allocation unit. This returns the number of 
allocation units.
 
> I guess there's no need to ever change that type, so no?

We could go to finer or coarser grained someday? Maybe if the area becomes 
1M of size of so we could go to 8 bytes?

> > +static DEFINE_SPINLOCK(cpu_alloc_map_lock);
> > +static DECLARE_BITMAP(cpu_alloc_map, UNITS);
> > +static int first_free;		/* First known free unit */
> 
> Would be nicer to move these above size_to_units(), IMO.

size_to_units() is fairly basic for most of the logic. These are variables
that manage the allocator state.

> > +static void set_map(int start, int length)
> > +{
> > +	while (length-- > 0)
> > +		__set_bit(start++, cpu_alloc_map);
> > +}
> 
> bitmap_fill()?

Good idea.

> > + */
> > +static void clear_map(int start, int length)
> > +{
> > +	while (length-- > 0)
> > +		__clear_bit(start++, cpu_alloc_map);
> > +}
> 
> bitmap_zero()?

Ditto.

> > +	if (!size)
> > +		return ZERO_SIZE_PTR;
> 
> OK, so we reuse ZERO_SIZE_PTR from kmalloc.

Well yes slab convention...

> > +		start++;
> > +		first = 0;
> > +	}
> 
> This is kinda bitmap_find_free_region(), only bitmap_find_free_region()
> isn't quite strong enough.
> 
> Generally I think it would have been better if you had added new
> primitives to the bitmap library (or enhanced existing ones) and used
> them here, rather than implementing private functionality.

The scope of the patchset is already fairly large. The search here is 
different and not performance critical. Not sure if this is useful for 
other purposes.

> > +
> > +	BUG_ON(index >= UNITS ||
> > +		!test_bit(index, cpu_alloc_map) ||
> > +		!test_bit(index + units - 1, cpu_alloc_map));
> 
> If this assertion triggers for someone, you'll wish like hell that it
> had been implemented as three separate BUG_ONs.

Ok. But in all cases we have an invalid index.

> > + */
> > +
> > +/* Return a pointer to the instance of a object for a particular processor */
> > +#define CPU_PTR(__p, __cpu)	SHIFT_PERCPU_PTR((__p), per_cpu_offset(__cpu))
> 
> eek, a major interface function which is ALL IN CAPS!
> 
> can we do this in lower-case?  In a C function?

No. This is a macro and therefore uppercase (there is macro magic going on
that people need to be aware of). AFAICR you wanted it this way last year. A C
function is not possible because of the type checking.
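
E.g. (illustrative):

	struct stats *stats = CPU_ALLOC(struct stats, GFP_KERNEL);
	struct stats *p = CPU_PTR(stats, cpu);	/* still a struct stats * */

A C function would have to take and return void *, losing the type
information that the macros preserve.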


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator
  2008-05-30  4:58   ` Andrew Morton
@ 2008-05-30  5:14     ` Christoph Lameter
  2008-05-30  5:34       ` Andrew Morton
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  5:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> > +			printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
> > +			mod->name, align, PAGE_SIZE);
> 
> Indenting broke.

Hmmm. Okay.
 
> Alas, PAGE_SIZE has, iirc, unsigned type on some architectures and
> unsigned long on others.  I suspect you'll need to cast it to be able
> to print it.

This is code that was moved.

> > +		percpu = cpu_alloc(size, GFP_KERNEL|__GFP_ZERO, align);
> > +		if (!percpu)
> > +			printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
> 
> 80-col bustage,.
> 
> A printk like this should, I think, identify what part of the kernel it
> came from.

Again moved code. Should I really do string separations for code 
that is moved?

> But really, I don't think any printk should be present here. 
> cpu_alloc() itself should dump the warning and the backtrace when it
> runs out.  Because a cpu_alloc() failure is a major catastrophe.  It
> probably means a reconfigure-and-reboot cycle.

The code has been able to deal with an allocpercpu failure in the 
past. Why would it have trouble with a cpu_alloc failure here?


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  4:58   ` Andrew Morton
@ 2008-05-30  5:17     ` Christoph Lameter
  2008-05-30  5:38       ` Andrew Morton
  2008-05-30  6:32       ` Rusty Russell
  0 siblings, 2 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  5:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> > 	local_irq_save(flags);
> > 	/* Calculate address of per processor area */
> > 	p = CPU_PTR(stat, smp_processor_id());
> > 	p->counter++;
> > 	local_irq_restore(flags);
> 
> eh?  That's what local_t is for?

No, that is exactly what local_t cannot do.

> > The segment can be replaced by a single atomic CPU operation:
> > 
> > 	CPU_INC(stat->counter);
> 
> hm, I guess this _has_ to be implemented as a macro.  ho hum.  But
> please: "cpu_inc"?

A lowercase macro?

> > The existing methods in use in the kernel cannot utilize the power of
> > these atomic instructions. local_t is not really addressing the issue
> > since the offset calculation is performed before the atomic operation. The
> > operation is therefore not atomic. Disabling interrupts or preemption is
> > required in order to use local_t.
> 
> Your terminology is totally confusing here.
> 
> To me, an "atomic operation" is one which is atomic wrt other CPUs:
> atomic_t, for example.
> 
> Here we're talking about atomic-wrt-this-cpu-only, yes?

Right.
 
> > local_t is also very specific to the x86 processor.
> 
> And alpha, m32r, mips and powerpc, methinks.  Probably others, but
> people just haven't got around to it.

No, local_t does not do the relocation of the address to the correct percpu
area. It requires disabling of interrupts etc. It is not atomic (wrt
interrupts) because of that.
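
I.e. with local_t one still ends up with something like (a sketch; the
counter here is a hypothetical DEFINE_PER_CPU(local_t, ...) variable):

	preempt_disable();
	local_inc(&__get_cpu_var(my_local_counter));
	preempt_enable();

The address calculation in __get_cpu_var() is a separate step from the
increment, so preemption (or interrupts) must stay off across it. CPU_INC
folds the relocation and the read-modify-write into one instruction.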
 
> I think I'll need to come back another time to understand all that ;)
> 
> Thanks for writing it up carefully.

Well this stuff is so large in scope that I have difficulties keeping 
everything straight.

> I wonder if all this stuff should be in a new header file.
> 
> We could get lazy and include that header from percpu.h if needed.

But then it's related to percpu operations and relies extensively on the 
various percpu.h files in asm-generic, asm-arch and include/linux.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 35/41] Support for CPU ops
  2008-05-30  4:58   ` Andrew Morton
@ 2008-05-30  5:18     ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  5:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, Tony.Luck, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> On Thu, 29 May 2008 20:56:55 -0700 Christoph Lameter <clameter@sgi.com> wrote:
> 
> > Subject: [patch 35/41] Support for CPU ops
> 
> Should be called "ia64: support for CPU ops", please.

Argh. Gazillions of details on gazillions of arches over all sorts of 
kernel subsystems.
 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:04   ` Eric Dumazet
@ 2008-05-30  5:20     ` Christoph Lameter
  2008-05-30  5:52       ` Rusty Russell
                         ` (2 more replies)
  0 siblings, 3 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  5:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: akpm, linux-arch, linux-kernel, David Miller, Peter Zijlstra,
	Rusty Russell, Mike Travis

On Fri, 30 May 2008, Eric Dumazet wrote:

> > +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
> >   
> area[] is not guaranteed to be aligned on anything but 4 bytes.
> 
> If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get a
> non-aligned result.
> 
> Either you should add an __attribute__((__aligned__(PAGE_SIZE))),
> or take into account the real address of area[] in cpu_alloc() to avoid waste
> of up to PAGE_SIZE bytes
> per cpu.

I think cacheline aligning should be sufficient. People should not 
allocate large page aligned objects here.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  5:03   ` Christoph Lameter
@ 2008-05-30  5:21     ` Andrew Morton
  2008-05-30  5:27       ` Christoph Lameter
  2008-05-30  6:01       ` Eric Dumazet
  0 siblings, 2 replies; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  5:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 22:03:14 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 29 May 2008, Andrew Morton wrote:
> 
> > All seems reasonable to me.  The obvious question is "how do we size
> > the arena".  We either waste memory or, much worse, run out.
> 
> The per cpu memory use by subsystems is typically quite small. We already 
> have an 8k limitation for percpu space for modules. And that does not seem 
> to be a problem.

eh?  That's DEFINE_PER_CPU memory, not alloc_percpu() memory?

> > And running out is a real possibility, I think.  Most people will only
> > mount a handful of XFS filesystems.  But some customer will come along
> > who wants to mount 5,000, and distributors will need to cater for that,
> > but how can they?
> 
> Typically these are fairly small: 8 bytes * 5000 is only 40k.

It was just an example.  There will be others.

	tcp_v4_md5_do_add
	->tcp_alloc_md5sig_pool
	  ->__tcp_alloc_md5sig_pool

does an alloc_percpu for each md5-capable TCP connection.  I think - it
doesn't matter really, because something _could_.  And if something
_does_, we're screwed.

> > I wonder if we can arrange for the default to be overridden via a
> > kernel boot option?
> 
> We could do that yes.

Phew.

> > Another obvious question is "how much of a problem will we have with
> > internal fragmentation"?  This might be a drop-dead showstopper.
> 
> But then per cpu data is not frequently allocated and freed.

I think it is, in the TCP case.  And that's the only one I looked at.

Plus who knows what lies ahead of us?

> Going away from allocpercpu saves a lot of memory. We could make this 
> 128k or so to be safe?

("alloc_percpu" - please be careful about getting this stuff right)

I don't think there is presently any upper limit on alloc_percpu()?  It
uses kmalloc() and kmalloc_node()?

Even if there is some limit, is it an unfixable one?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  5:21     ` Andrew Morton
@ 2008-05-30  5:27       ` Christoph Lameter
  2008-05-30  5:49         ` Andrew Morton
  2008-05-30 14:38         ` Mike Travis
  2008-05-30  6:01       ` Eric Dumazet
  1 sibling, 2 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  5:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> > The per cpu memory use by subsystems is typically quite small. We already 
> > have an 8k limitation for percpu space for modules. And that does not seem 
> > to be a problem.
> 
> eh?  That's DEFINE_PER_CPU memory, not alloc_percpu() memory?

No. The module subsystem has its own alloc_percpu subsystem that the 
cpu_alloc replaces.

> > We could do that yes.
> 
> Phew.

But it's going to be even more complicated and I have a hard time managing 
the complexity here. Could someone take pieces off my hands?

> > But then per cpu data is not frequently allocated and freed.
> 
> I think it is, in the TCP case.  And that's the only one I looked at.

Which tcp case?

> Plus who knows what lies ahead of us?

Well invariably we will end up with cpu area defragmentation.... Sigh.

> I don't think there is presently any upper limit on alloc_percpu()?  It
> uses kmalloc() and kmalloc_node()?
> 
> Even if there is some limit, is it an unfixable one?

No there is no limit. It just wastes lots of space (pointer arrays, 
alignment etc) that we could use to configure sufficiently large per cpu 
areas.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:10     ` Christoph Lameter
@ 2008-05-30  5:31       ` Andrew Morton
  2008-06-02  9:29         ` Paul Jackson
  2008-05-30  5:56       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  5:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 22:10:25 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> > > +		start++;
> > > +		first = 0;
> > > +	}
> > 
> > This is kinda bitmap_find_free_region(), only bitmap_find_free_region()
> > isn't quite strong enough.
> > 
> > Generally I think it would have been better if you had added new
> > primitives to the bitmap library (or enhanced existing ones) and used
> > them here, rather than implementing private functionality.
> 
> The scope of the patchset is already fairly large.

It would be a relatively small incremental effort ;)

> The search here is 
> different and not performance critical. Not sure if this is useful for 
> other purposes.

I think that strengthening bitmap_find_free_region() would end up
giving us a better kernel than open-coding something similar here.
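
For illustration, a minimal sketch of the kind of generalised helper this
could become (name and signature are hypothetical, not an existing bitmap
API here): find a run of nr clear bits whose start index is a multiple of
align_units.

static unsigned long find_zero_area_sketch(const unsigned long *map,
					   unsigned long size,
					   unsigned long start,
					   unsigned int nr,
					   unsigned long align_units)
{
	unsigned long index, end, i;

	/* align_units is assumed to be a power of two >= 1 */
	for (;;) {
		index = ALIGN(find_next_zero_bit(map, size, start), align_units);
		end = index + nr;
		if (end > size)
			return size;		/* no suitable run found */
		i = find_next_bit(map, end, index);
		if (i >= end)
			return index;		/* [index, end) is all clear */
		start = i + 1;			/* retry past the set bit */
	}
}

cpu_alloc() could then simply set_map() the returned range under its lock.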

> > > + */
> > > +
> > > +/* Return a pointer to the instance of a object for a particular processor */
> > > +#define CPU_PTR(__p, __cpu)	SHIFT_PERCPU_PTR((__p), per_cpu_offset(__cpu))
> > 
> > eek, a major interface function which is ALL IN CAPS!
> > 
> > can we do this in lower-case?  In a C function?
> 
No. This is a macro and therefore uppercase (there is macro magic going on 
that people need to be aware of). AFAICR you wanted it this way last year. A C 
function is not possible because of the type checking.

urgh.  This is a C-convention versus kernel-convention thing.  The C
convention exists for very good reasons.  But it sure does suck.

What do others think?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator
  2008-05-30  5:14     ` Christoph Lameter
@ 2008-05-30  5:34       ` Andrew Morton
  0 siblings, 0 replies; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  5:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 22:14:17 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 29 May 2008, Andrew Morton wrote:
> 
> > > +			printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
> > > +			mod->name, align, PAGE_SIZE);
> > 
> > Indenting broke.
> 
> Hmmm. Okay.
>  
> > Alas, PAGE_SIZE has, iirc, unsigned type on some architectures and
> > unsigned long on others.  I suspect you'll need to cast it to be able
> > to print it.
> 
> This is code that was moved.
> 
> > > +		percpu = cpu_alloc(size, GFP_KERNEL|__GFP_ZERO, align);
> > > +		if (!percpu)
> > > +			printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
> > 
> > 80-col bustage.
> > 
> > A printk like this should, I think, identify what part of the kernel it
> > came from.
> 
> Again moved code. Should I really do string separations for code 
> that is moved?

That's not a string separation - it is a functional improvement.

Sure, why not fix these little things while we're there?

> > But really, I don't think any printk should be present here. 
> > cpu_alloc() itself should dump the warning and the backtrace when it
> > runs out.  Because a cpu_alloc() failure is a major catastrophe.  It
> > probably means a reconfigure-and-reboot cycle.
> 
> The code has been able to deal with an allocpercpu failure in the 
> past. Why would it have trouble with a cpu_alloc failure here?

Because an alloc_percpu failure is a page allocator failure.  This is a
well-known situation which we know basically never happens, or at least
happens under well-known circumstances.

Whereas a cpu_alloc() failure is a dead box.  We cannot fix it via
running page reclaim.  We cannot fix it via oom-killing someone.  We
are dead.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  5:17     ` Christoph Lameter
@ 2008-05-30  5:38       ` Andrew Morton
  2008-05-30  6:12         ` Christoph Lameter
  2008-05-30  7:05         ` Rusty Russell
  2008-05-30  6:32       ` Rusty Russell
  1 sibling, 2 replies; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  5:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 22:17:55 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> > > local_t is also very specific to the x86 processor.
> > 
> > And alpha, m32r, mips and powerpc, methinks.  Probably others, but
> > people just haven't got around to it.
> 
> No local_t does not do the relocation of the address to the correct percpu 
> area. It requies disabling of interrupts etc.

No it doesn't.  Look:

static inline void local_inc(local_t *l)
{
	asm volatile(_ASM_INC "%0"
		     : "+m" (l->a.counter));
}

> Its not atomic (wrt 
> interrupts) because of that.
>

Yes it is.

> > I think I'll need to come back another time to understand all that ;)
> > 
> > Thanks for writing it up carefully.
> 
> Well this stuff is so large in scope that I have difficulties keeping 
> everything straight.
> 
> > I wonder if all this stuff should be in a new header file.
> > 
> > We could get lazy and include that header from percpu.h if needed.
> 
> But then its related to percpu operations and relies extensively on the 
> various percpu.h files in asm-generic and asm-arch and include/linux

Well that should be fixed.  We should never have mixed the
alloc_percpu() and DEFINE_PER_CPU things in the same header.  They're
different.

otoh as you propose removing the old alloc_percpu() I guess the end
result is no worse than what we presently have.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  3:56 ` [patch 02/41] cpu alloc: The allocator Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
  2008-05-30  5:04   ` Eric Dumazet
@ 2008-05-30  5:46   ` Rusty Russell
  2008-06-04 15:04     ` Mike Travis
  2008-05-31 20:58   ` Pavel Machek
  3 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-05-30  5:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Mike Travis

On Friday 30 May 2008 13:56:22 Christoph Lameter wrote:
> The per cpu allocator allows dynamic allocation of memory on all
> processors simultaneously. A bitmap is used to track used areas.
> The allocator implements tight packing to reduce the cache footprint
> and increase speed since cacheline contention is typically not a concern
> for memory mainly used by a single cpu. Small objects will fill up gaps
> left by larger allocations that required alignments.

Allocator seems nice and simple, similar to existing one in module.c (which 
predates cool bitmap operators).

Being able to do per-cpu allocations in an interrupt handler seems like 
encouraging a Bad Idea though: I'd be tempted to avoid the flags word, always 
zero, and use a mutex instead of a spinlock.

Cheers,
Rusty.


>
> The size of the cpu_alloc area can be changed via make menuconfig.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
>  include/linux/percpu.h |   46 +++++++++++++
>  include/linux/vmstat.h |    2
>  mm/Kconfig             |    6 +
>  mm/Makefile            |    2
>  mm/cpu_alloc.c         |  167 +++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/vmstat.c            |    1
>  6 files changed, 222 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/include/linux/vmstat.h
> ===================================================================
> --- linux-2.6.orig/include/linux/vmstat.h	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/include/linux/vmstat.h	2008-05-29 20:15:37.000000000 -0700
> @@ -37,7 +37,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
>  		FOR_ALL_ZONES(PGSCAN_KSWAPD),
>  		FOR_ALL_ZONES(PGSCAN_DIRECT),
>  		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> -		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> +		PAGEOUTRUN, ALLOCSTALL, PGROTATED, CPU_BYTES,
>  #ifdef CONFIG_HUGETLB_PAGE
>  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
>  #endif
> Index: linux-2.6/mm/Kconfig
> ===================================================================
> --- linux-2.6.orig/mm/Kconfig	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/Kconfig	2008-05-29 20:13:39.000000000 -0700
> @@ -205,3 +205,9 @@ config NR_QUICK
>  config VIRT_TO_BUS
>  	def_bool y
>  	depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config CPU_ALLOC_SIZE
> +	int "Size of cpu alloc area"
> +	default "30000"
> +	help
> +	  Sets the maximum amount of memory that can be allocated via cpu_alloc
> Index: linux-2.6/mm/Makefile
> ===================================================================
> --- linux-2.6.orig/mm/Makefile	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/Makefile	2008-05-29 20:15:41.000000000 -0700
> @@ -11,7 +11,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
>  			   maccess.o page_alloc.o page-writeback.o pdflush.o \
>  			   readahead.o swap.o truncate.o vmscan.o \
>  			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
> -			   page_isolation.o $(mmu-y)
> +			   page_isolation.o cpu_alloc.o $(mmu-y)
>
>  obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
>  obj-$(CONFIG_BOUNCE)	+= bounce.o
> Index: linux-2.6/mm/cpu_alloc.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/mm/cpu_alloc.c	2008-05-29 20:13:39.000000000 -0700
> @@ -0,0 +1,167 @@
> +/*
> + * Cpu allocator - Manage objects allocated for each processor
> + *
> + * (C) 2008 SGI, Christoph Lameter <clameter@sgi.com>
> + * 	Basic implementation with allocation and free from a dedicated per
> + * 	cpu area.
> + *
> + * The per cpu allocator allows dynamic allocation of memory on all
> + * processor simultaneously. A bitmap is used to track used areas.
> + * The allocator implements tight packing to reduce the cache footprint
> + * and increase speed since cacheline contention is typically not a concern
> + * for memory mainly used by a single cpu. Small objects will fill up gaps
> + * left by larger allocations that required alignments.
> + */
> +#include <linux/mm.h>
> +#include <linux/mmzone.h>
> +#include <linux/module.h>
> +#include <linux/percpu.h>
> +#include <linux/bitmap.h>
> +#include <asm/sections.h>
> +
> +/*
> + * Basic allocation unit. A bit map is created to track the use of each
> + * UNIT_SIZE element in the cpu area.
> + */
> +#define UNIT_TYPE int
> +#define UNIT_SIZE sizeof(UNIT_TYPE)
> +#define UNITS (CONFIG_CPU_ALLOC_SIZE / UNIT_SIZE)
> +
> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
> +
> +/*
> + * How many units are needed for an object of a given size
> + */
> +static int size_to_units(unsigned long size)
> +{
> +	return DIV_ROUND_UP(size, UNIT_SIZE);
> +}
> +
> +/*
> + * Lock to protect the bitmap and the meta data for the cpu allocator.
> + */
> +static DEFINE_SPINLOCK(cpu_alloc_map_lock);
> +static DECLARE_BITMAP(cpu_alloc_map, UNITS);
> +static int first_free;		/* First known free unit */
> +
> +/*
> + * Mark an object as used in the cpu_alloc_map
> + *
> + * Must hold cpu_alloc_map_lock
> + */
> +static void set_map(int start, int length)
> +{
> +	while (length-- > 0)
> +		__set_bit(start++, cpu_alloc_map);
> +}
> +
> +/*
> + * Mark an area as freed.
> + *
> + * Must hold cpu_alloc_map_lock
> + */
> +static void clear_map(int start, int length)
> +{
> +	while (length-- > 0)
> +		__clear_bit(start++, cpu_alloc_map);
> +}
> +
> +/*
> + * Allocate an object of a certain size
> + *
> + * Returns a special pointer that can be used with CPU_PTR to find the
> + * address of the object for a certain cpu.
> + */
> +void *cpu_alloc(unsigned long size, gfp_t gfpflags, unsigned long align)
> +{
> +	unsigned long start;
> +	int units = size_to_units(size);
> +	void *ptr;
> +	int first;
> +	unsigned long flags;
> +
> +	if (!size)
> +		return ZERO_SIZE_PTR;
> +
> +	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
> +
> +	first = 1;
> +	start = first_free;
> +
> +	for ( ; ; ) {
> +
> +		start = find_next_zero_bit(cpu_alloc_map, UNITS, start);
> +		if (start >= UNITS)
> +			goto out_of_memory;
> +
> +		if (first)
> +			first_free = start;
> +
> +		/*
> +		 * Check alignment and that there is enough space after
> +		 * the starting unit.
> +		 */
> +		if (start % (align / UNIT_SIZE) == 0 &&
> +			find_next_bit(cpu_alloc_map, UNITS, start + 1)
> +							>= start + units)
> +				break;
> +		start++;
> +		first = 0;
> +	}
> +
> +	if (first)
> +		first_free = start + units;
> +
> +	if (start + units > UNITS)
> +		goto out_of_memory;
> +
> +	set_map(start, units);
> +	__count_vm_events(CPU_BYTES, units * UNIT_SIZE);
> +
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +
> +	ptr = per_cpu_var(area) + start;
> +
> +	if (gfpflags & __GFP_ZERO) {
> +		int cpu;
> +
> +		for_each_possible_cpu(cpu)
> +			memset(CPU_PTR(ptr, cpu), 0, size);
> +	}
> +
> +	return ptr;
> +
> +out_of_memory:
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +	return NULL;
> +}
> +EXPORT_SYMBOL(cpu_alloc);
> +
> +/*
> + * Free an object. The pointer must be a cpu pointer allocated
> + * via cpu_alloc.
> + */
> +void cpu_free(void *start, unsigned long size)
> +{
> +	unsigned long units = size_to_units(size);
> +	unsigned long index = (int *)start - per_cpu_var(area);
> +	unsigned long flags;
> +
> +	if (!start || start == ZERO_SIZE_PTR)
> +		return;
> +
> +	BUG_ON(index >= UNITS ||
> +		!test_bit(index, cpu_alloc_map) ||
> +		!test_bit(index + units - 1, cpu_alloc_map));
> +
> +	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
> +
> +	clear_map(index, units);
> +	__count_vm_events(CPU_BYTES, -units * UNIT_SIZE);
> +
> +	if (index < first_free)
> +		first_free = index;
> +
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +}
> +EXPORT_SYMBOL(cpu_free);
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/vmstat.c	2008-05-29 20:13:39.000000000 -0700
> @@ -653,6 +653,7 @@ static const char * const vmstat_text[]
>  	"allocstall",
>
>  	"pgrotated",
> +	"cpu_bytes",
>  #ifdef CONFIG_HUGETLB_PAGE
>  	"htlb_buddy_alloc_success",
>  	"htlb_buddy_alloc_fail",
> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/include/linux/percpu.h	2008-05-29 20:29:12.000000000 -0700
> @@ -135,4 +135,50 @@ static inline void percpu_free(void *__p
>  #define free_percpu(ptr)	percpu_free((ptr))
>  #define per_cpu_ptr(ptr, cpu)	percpu_ptr((ptr), (cpu))
>
> +
> +/*
> + * cpu allocator definitions
> + *
> + * The cpu allocator allows allocating an instance of an object for each
> + * processor and the use of a single pointer to access all instances
> + * of the object. cpu_alloc provides optimized means for accessing the
> + * instance of the object belonging to the currently executing processor
> + * as well as special atomic operations on fields of objects of the
> + * currently executing processor.
> + *
> + * Cpu objects are typically small. The allocator packs them tightly
> + * to increase the chance on each access that a per cpu object is already
> + * cached. Alignments may be specified but the intent is to align the data
> + * properly due to cpu alignment constraints and not to avoid cacheline
> + * contention. Any holes left by aligning objects are filled up with smaller
> + * objects that are allocated later.
> + *
> + * Cpu data can be allocated using CPU_ALLOC. The resulting pointer is
> + * pointing to the instance of the variable in the per cpu area provided
> + * by the loader. It is generally an error to use the pointer directly
> + * unless we are booting the system.
> + *
> + * __GFP_ZERO may be passed as a flag to zero the allocated memory.
> + */
> +
> +/* Return a pointer to the instance of a object for a particular processor */
> +#define CPU_PTR(__p, __cpu)	SHIFT_PERCPU_PTR((__p), per_cpu_offset(__cpu))
> +
> +/*
> + * Return a pointer to the instance of the object belonging to the processor
> + * running the current code.
> + */
> +#define THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), my_cpu_offset)
> +#define __THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), __my_cpu_offset)
> +
> +#define CPU_ALLOC(type, flags)	((typeof(type) *)cpu_alloc(sizeof(type), \
> +					(flags), __alignof__(type)))
> +#define CPU_FREE(pointer)	cpu_free((pointer), sizeof(*(pointer)))
> +
> +/*
> + * Raw calls
> + */
> +void *cpu_alloc(unsigned long size, gfp_t flags, unsigned long align);
> +void cpu_free(void *cpu_pointer, unsigned long size);
> +
>  #endif /* __LINUX_PERCPU_H */



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  5:27       ` Christoph Lameter
@ 2008-05-30  5:49         ` Andrew Morton
  2008-05-30  6:16           ` Christoph Lameter
  2008-05-30 14:38         ` Mike Travis
  1 sibling, 1 reply; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  5:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 22:27:53 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 29 May 2008, Andrew Morton wrote:
> 
> > > The per cpu memory use by subsystems is typically quite small. We already 
> > > have an 8k limitation for percpu space for modules. And that does not seem 
> > > to be a problem.
> > 
> > eh?  That's DEFINE_PER_CPU memory, not alloc_percpu() memory?
> 
> No. The module subsystem has its own alloc_percpu subsystem that the 
> cpu_alloc replaces.

That is to support DEFINE_PER_CPU, not alloc_percpu().

> > > We could do that yes.
> > 
> > Phew.
> 
> But it's going to be even more complicated and I have a hard time managing 
> the complexity here. Could someone take pieces off my hands?

It could be done later on.

> > > But then per cpu data is not frequently allocated and freed.
> > 
> > I think it is, in the TCP case.  And that's the only one I looked at.
> 
> Which tcp case?

The one you just deleted from my reply :(

> > Plus who knows what lies ahead of us?
> 
> Well invariably we will end up with cpu area defragmentation.... Sigh.
> 
> > I don't think there is presently any upper limit on alloc_percpu()?  It
> > uses kmalloc() and kmalloc_node()?
> > 
> > Even if there is some limit, is it an unfixable one?
> 
> No there is no limit. It just wastes lots of space (pointer arrays, 
> alignment etc) that we could use to configure sufficiently large per cpu 
> areas.

Christoph, please.  An allocator which is of fixed size and which is
vulnerable to internal fragmentation is a huge problem!  The kernel is
subject to wildly varying workloads both between different users and in
the hands of a single user.

If we were to merge all this code and then run into the problems which
I fear then we are tremendously screwed.  We must examine this
exhaustively, in the most paranoid fashion.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:20     ` Christoph Lameter
@ 2008-05-30  5:52       ` Rusty Russell
  2008-06-04 15:30         ` Mike Travis
  2008-05-30  5:54       ` Eric Dumazet
  2008-06-04 14:58       ` Mike Travis
  2 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-05-30  5:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, akpm, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Mike Travis

On Friday 30 May 2008 15:20:45 Christoph Lameter wrote:
> On Fri, 30 May 2008, Eric Dumazet wrote:
> > > +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
> >
> > area[] is not guaranteed to be aligned on anything but 4 bytes.
> >
> > If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get
> > a non-aligned result.
> >
> > Either you should add an __attribute__((__aligned__(PAGE_SIZE))),
> > or take into account the real address of area[] in cpu_alloc() to avoid
> > waste of up to PAGE_SIZE bytes
> > per cpu.
>
> I think cacheline aligning should be sufficient. People should not
> allocate large page aligned objects here.

I vaguely recall there were issues with this in the module code.  They might 
be gone now, but failing to meet alignment constraints without a big warning 
would suck.

But modifying your code to consider the actual alignment is actually pretty 
trivial, AFAICT.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:20     ` Christoph Lameter
  2008-05-30  5:52       ` Rusty Russell
@ 2008-05-30  5:54       ` Eric Dumazet
  2008-06-04 14:58       ` Mike Travis
  2 siblings, 0 replies; 139+ messages in thread
From: Eric Dumazet @ 2008-05-30  5:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Peter Zijlstra,
	Rusty Russell, Mike Travis

Christoph Lameter a écrit :
> On Fri, 30 May 2008, Eric Dumazet wrote:
>
>   
>>> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
>>>   
>>>       
>> area[] is not guaranteed to be aligned on anything but 4 bytes.
>>
>> If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get a
>> non-aligned result.
>>
>> Either you should add an __attribute__((__aligned__(PAGE_SIZE))),
>> or take into account the real address of area[] in cpu_alloc() to avoid waste
>> of up to PAGE_SIZE bytes
>> per cpu.
>>     
>
> I think cacheline aligning should be sufficient. People should not 
> allocate large page aligned objects here.
>
>
>   
Hmm, maybe, but then we'd break modules that might request up to PAGE_SIZE
alignment for their percpu section, if I read your 3rd patch correctly.

Taking into account the ((unsigned long)area & (PAGE_SIZE-1)) offset in
cpu_alloc() should give up to PAGE_SIZE alignment for free.
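
For illustration, a minimal sketch of that adjustment (the helper name is
invented here, and it assumes the requested alignments are powers of two,
as the current callers use):

/* would replace the "start % (align / UNIT_SIZE) == 0" test in cpu_alloc() */
static int start_is_aligned(unsigned long start, unsigned long align)
{
	/* real address this unit index maps to in the base copy of area[] */
	unsigned long addr = (unsigned long)per_cpu_var(area) + start * UNIT_SIZE;

	return (addr & (align - 1)) == 0;
}

With that, a PAGE_SIZE alignment request is honoured without forcing area[]
itself to be page aligned, as long as each cpu's copy of the area sits at a
suitably aligned per cpu offset.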






^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:10     ` Christoph Lameter
  2008-05-30  5:31       ` Andrew Morton
@ 2008-05-30  5:56       ` KAMEZAWA Hiroyuki
  2008-05-30  6:16         ` Christoph Lameter
  1 sibling, 1 reply; 139+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-05-30  5:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 22:10:25 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 29 May 2008, Andrew Morton wrote:
> 
> > > +config CPU_ALLOC_SIZE
> > > +	int "Size of cpu alloc area"
> > > +	default "30000"
> > 
> > strange choice of a default?  I guess it makes it clear that there's no
> > particular advantage in making it a power-of-two or anything like that.
> 
> The cpu alloc has a cpu_bytes field in vmstat that shows how much memory 
> is being used. 30000 seemed to be reasonable after staring at these 
> numbers for awhile.
> 

Is 30000 suitable for both 32-bit and 64-bit arches?

Thanks,
-kame


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  5:21     ` Andrew Morton
  2008-05-30  5:27       ` Christoph Lameter
@ 2008-05-30  6:01       ` Eric Dumazet
  2008-05-30  6:16         ` Andrew Morton
  1 sibling, 1 reply; 139+ messages in thread
From: Eric Dumazet @ 2008-05-30  6:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell, Mike Travis

Andrew Morton a écrit :
>
> It was just an example.  There will be others.
>
> 	tcp_v4_md5_do_add
> 	->tcp_alloc_md5sig_pool
> 	  ->__tcp_alloc_md5sig_pool
>
> does an alloc_percpu for each md5-capable TCP connection.  I think - it
> doesn't matter really, because something _could_.  And if something
> _does_, we're screwed.
>   
Last time I took a look at this stuff, this was a percpu allocation for 
all connections, not for each TCP session.
(It should be static, instead of dynamic.)

Really, percpu allocations are currently not frequent at all.

vmalloc()/vfree() are way more frequent and still use a list.






^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator
  2008-05-30  3:56 ` [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
@ 2008-05-30  6:08   ` Rusty Russell
  2008-05-30  6:21     ` Christoph Lameter
  1 sibling, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-05-30  6:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Mike Travis

On Friday 30 May 2008 13:56:23 Christoph Lameter wrote:
> --- linux-2.6.orig/kernel/module.c	2008-05-29 17:57:39.825214766 -0700
> +++ linux-2.6/kernel/module.c	2008-05-29 18:00:50.496815514 -0700
> @@ -314,121 +314,6 @@ static struct module *find_module(const
>  	return NULL;
>  }

115 lines removed... This is my favourite part of the series so far :)

>  	if (mod->percpu)
> -		percpu_modfree(mod->percpu);
> +		cpu_free(mod->percpu, mod->percpu_size);

Hmm, does cpu_free(NULL, 0) do something?  Seems like it shouldn't, for 
symmetry with free().

> +		if (align > PAGE_SIZE) {
> +			printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
> +			mod->name, align, PAGE_SIZE);
> +			align = PAGE_SIZE;
> +		}
> +		percpu = cpu_alloc(size, GFP_KERNEL|__GFP_ZERO, align);
> +		if (!percpu)
> +			printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
> +										size);
>  		if (!percpu) {
>  			err = -ENOMEM;
>  			goto free_mod;

OK, we've *never* had a report of the per-cpu alignment message, so I'd be 
happy to pass that through to cpu_alloc() and have it fail.  Also, the if 
(!percpu) cases should be combined.
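
Combined, that hunk might look roughly like this (a sketch of the suggestion
only, not an actual follow-up patch; it also adds the module name to the
printk as suggested earlier in the thread):

		percpu = cpu_alloc(size, GFP_KERNEL|__GFP_ZERO, align);
		if (!percpu) {
			printk(KERN_WARNING
			       "%s: could not allocate %lu bytes percpu data\n",
			       mod->name, size);
			err = -ENOMEM;
			goto free_mod;
		}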

>   free_percpu:
>  	if (percpu)
> -		percpu_modfree(percpu);
> +		cpu_free(percpu, percpu_size);

As above.

> +	goal = __per_cpu_size;

Where did __per_cpu_size come from?  I missed it in the earlier patches...

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  5:38       ` Andrew Morton
@ 2008-05-30  6:12         ` Christoph Lameter
  2008-05-30  7:08           ` Rusty Russell
  2008-05-30  7:05         ` Rusty Russell
  1 sibling, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  6:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> > area. It requies disabling of interrupts etc.
> 
> No it doesn't.  Look:
> 
> static inline void local_inc(local_t *l)
> {
> 	asm volatile(_ASM_INC "%0"
> 		     : "+m" (l->a.counter));
> }
> 
> > Its not atomic (wrt 
> > interrupts) because of that.
> >
> 
> Yes it is.

No, it's not! In order to increment a per cpu value you need to calculate 
the per cpu pointer address in the current per cpu segment. local_t 
cannot do that in an atomic (wrt interrupt/preempt) fashion. cpu 
ops can use a segment prefix and thus the instructions can calculate the 
per cpu address and perform the atomic inc without disabling preempt or 
interrupts.

Otherwise local_t is only useful when you disable interrupts or preemption. 
But then you could also use a regular increment.

> > But then its related to percpu operations and relies extensively on the 
> > various percpu.h files in asm-generic and asm-arch and include/linux
> 
> Well that should be fixed.  We should never have mixed the
> alloc_percpu() and DEFINE_PER_CPU things inthe same header.  They're
> different.

With cpu_alloc they are the same. They allocate from the same per cpu 
area.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  5:49         ` Andrew Morton
@ 2008-05-30  6:16           ` Christoph Lameter
  2008-05-30  6:51             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  6:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> > No. The module subsystem has its own alloc_percpu subsystem that the 
> > cpu_alloc replaces.
> 
> That is to support DEFINE_PER_CPU, not alloc_percpu().

Right but it needs to have its own section of the percpu space from which 
it allocates the percpu segments for the modules. So it effectively 
implements an allocator.

> If we were to merge all this code and then run into the problems which
> I fear then we are tremendously screwed.  We must examine this
> exhaustively, in the most paranoid fashion.

Well V2 virtually mapped the cpu alloc area which allowed extending it 
arbitrarily. But that made things very complicated.

The number of per cpu resources needed is mostly fixed. The number of 
zones, nodes, slab caches, network interfaces etc etc does not change much 
during typical operations.
 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  6:01       ` Eric Dumazet
@ 2008-05-30  6:16         ` Andrew Morton
  2008-05-30  6:22           ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  6:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Fri, 30 May 2008 08:01:02 +0200 Eric Dumazet <dada1@cosmosbay.com> wrote:

> Really, percpu allocations are currently not frequent at all.
> 
> vmalloc()/vfreee() are way more frequent and still use a list.

Sure it's hard to conceive how anyone could go and do a per-cpu
allocation on a fastpath.

But this has nothing to do with the frequency!  The problems surround
the _amount_ of allocated memory and the allocation/freeing patterns.

Here's another example.  And it's only an example!  Generalise!

ext3 maintains three percpu_counters per mount.  Each percpu_counter
does one percpu_alloc.  People can mount an arbitrary number of ext3
filesystems!


Another: there are two percpu_counters (and hence two percpu_alloc()s)
per backing_dev_info.  One backing_dev_info per disk and people have
been known to have thousands (iirc ~10,000) disks online.

And those examples were plucked only from today's kernel.  Who knows
what other problems will be in 2.6.45?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:56       ` KAMEZAWA Hiroyuki
@ 2008-05-30  6:16         ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  6:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell, Mike Travis

On Fri, 30 May 2008, KAMEZAWA Hiroyuki wrote:

> 30000 is suitable for both of 32bits/64bits arch ?

It was developed on 64 bit. As usual, 32 bit was an afterthought.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator
  2008-05-30  6:08   ` Rusty Russell
@ 2008-05-30  6:21     ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  6:21 UTC (permalink / raw)
  To: Rusty Russell
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Mike Travis

On Fri, 30 May 2008, Rusty Russell wrote:

> Hmm, does cpu_free(NULL, 0) do something?  Seems like it shouldn't, for 
> symmetry with free().

No it just returns.

> > +		percpu = cpu_alloc(size, GFP_KERNEL|__GFP_ZERO, align);
> > +		if (!percpu)
> > +			printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
> > +										size);
> >  		if (!percpu) {
> >  			err = -ENOMEM;
> >  			goto free_mod;
> 
> OK, we've *never* had a report of the per-cpu alignment message, so I'd be 
> happy to pass that through to cpu_alloc() and have it fail.  Also, the if 
> (!percpu) cases should be combined.

Ack.

> >   free_percpu:
> >  	if (percpu)
> > -		percpu_modfree(percpu);
> > +		cpu_free(percpu, percpu_size);
> 
> As above.

The if can be dropped.

> 
> > +	goal = __per_cpu_size;
> 
> Where did __per_cpu_size come from?  I missed it in the earlier patches...

It's __per_cpu_end - __per_cpu_start.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  6:16         ` Andrew Morton
@ 2008-05-30  6:22           ` Christoph Lameter
  2008-05-30  6:37             ` Andrew Morton
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Dumazet, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> ext3 maintains three percpu_counters per mount.  Each percpu_counter
> does one percpu_alloc.  People can mount an arbitrary number of ext3
> filesystems!

But it's 4 bytes per alloc, right?

> Another: there are two percpu_counters (and hence two percpu_alloc()s)
> per backing_dev_info.  One backing_dev_info per disk and people have
> been known to have thousands (iirc ~10,000) disks online.

8 bytes per backing device. 80000 bytes for 10000 disks.

> And those examples were plucked only from today's kernel.  Who knows
> what other problems will be in 2.6.45?

We can always increase the sizes.
 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  5:17     ` Christoph Lameter
  2008-05-30  5:38       ` Andrew Morton
@ 2008-05-30  6:32       ` Rusty Russell
  1 sibling, 0 replies; 139+ messages in thread
From: Rusty Russell @ 2008-05-30  6:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Friday 30 May 2008 15:17:55 Christoph Lameter wrote:
> On Thu, 29 May 2008, Andrew Morton wrote:
> > > 	local_irq_save(flags);
> > > 	/* Calculate address of per processor area */
> > > 	p = CPU_PTR(stat, smp_processor_id());
> > > 	p->counter++;
> > > 	local_irq_restore(flags);
> >
> > eh?  That's what local_t is for?
>
> No that is what local_t exactly cannot do.

Yes, but this is local_t for dynamically allocated per-cpu vars.  You've lost 
potential symmetry and invented a whole new nomenclature :(

local_ptr_inc() etc would be far preferable IMHO.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  6:22           ` Christoph Lameter
@ 2008-05-30  6:37             ` Andrew Morton
  2008-05-30 11:32               ` Matthew Wilcox
  0 siblings, 1 reply; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  6:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 23:22:31 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 29 May 2008, Andrew Morton wrote:
> 
> > ext3 maintains three percpu_counters per mount.  Each percpu_counter
> > does one percpu_alloc.  People can mount an arbitrary number of ext3
> > filesystems!
> 
> But its 4 bytes per alloc right?

It could be 4000.  The present alloc_percpu() would support that.

And struct nfs_iostats is 264 bytes and nfs does an alloc_percpu() of
one of those per server and mounting thousands of servers per client
is, I believe, a real-world operation.

Plus, for the umpteenth time: saying that this code will probably work
acceptably for most people in 2.6.26 is not sufficient!

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 05/41] cpu alloc: Percpu_counter conversion
  2008-05-30  3:56 ` [patch 05/41] cpu alloc: Percpu_counter conversion Christoph Lameter
@ 2008-05-30  6:47   ` Rusty Russell
  2008-05-30 17:54     ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-05-30  6:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Mike Travis

On Friday 30 May 2008 13:56:25 Christoph Lameter wrote:
> Use cpu_alloc instead of allocpercpu.

These patches seem like useless churn.

Plus, the new code is uglier than the old code :(

Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 26/41] cpu alloc: Convert mib handling to cpu alloc
  2008-05-30  3:56 ` [patch 26/41] cpu alloc: Convert mib handling to cpu alloc Christoph Lameter
@ 2008-05-30  6:47   ` Eric Dumazet
  2008-05-30 18:01     ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Eric Dumazet @ 2008-05-30  6:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Peter Zijlstra,
	Rusty Russell, Mike Travis

Christoph Lameter a écrit :
> Use the cpu alloc functions for the mib handling functions in the net
> layer. The API for snmp_mib_free() is changed to add a size parameter
> since cpu_free() requires a size parameter.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
>  include/net/ip.h     |    2 +-
>  include/net/snmp.h   |   32 ++++++++------------------------
>  net/dccp/proto.c     |    2 +-
>  net/ipv4/af_inet.c   |   31 +++++++++++++++++--------------
>  net/ipv6/addrconf.c  |   11 ++++++-----
>  net/ipv6/af_inet6.c  |   20 +++++++++++---------
>  net/sctp/protocol.c  |    2 +-
>  net/xfrm/xfrm_proc.c |    4 ++--
>  8 files changed, 47 insertions(+), 57 deletions(-)
>   
We can also avoid the use of two arrays when CONFIG_HAVE_CPU_OPS is set,
since _CPU_INC() and __CPU_INC() are both interrupt safe.
This would reduce the size of the mibs by 50% and the complexity (no need to sum).
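
For illustration, a rough sketch of that direction (the macro shapes and the
mibs[] field are assumptions about the snmp code, not quoted from it):

/*
 * Sketch only: with interrupt safe per cpu increments a single per cpu
 * mib array is enough, so the BH and USER variants collapse into one
 * and folding no longer has to sum two arrays.
 */
#define SNMP_INC_STATS_SKETCH(mib, field)	_CPU_INC((mib)->mibs[field])
#define SNMP_INC_STATS_BH_SKETCH(mib, field)	__CPU_INC((mib)->mibs[field])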





^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  6:16           ` Christoph Lameter
@ 2008-05-30  6:51             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 139+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-05-30  6:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 23:16:11 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 29 May 2008, Andrew Morton wrote:
> 
> > > No. The module subsystem has its own alloc_percpu subsystem that the 
> > > cpu_alloc replaces.
> > 
> > That is to support DEFINE_PER_CPU, not alloc_percpu().
> 
> Right but it needs to have its own section of the percpu space from which 
> it allocates the percpu segments for the modules. So it effectively 
> implements an allocator.
> 

Could you add text to explain "This interface is for careful use of a
pre-allocated, limited area (see Documentation/xxxx). Please use this only
when you need very fast access to per-cpu objects and you can estimate the amount
which you will finally need. If unsure, please use the generic allocator."

for the moment?

At first look, I thought of using this in the memory resource controller, but it seems
I shouldn't do so because thousands of cgroups can be used in theory...

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  5:38       ` Andrew Morton
  2008-05-30  6:12         ` Christoph Lameter
@ 2008-05-30  7:05         ` Rusty Russell
  1 sibling, 0 replies; 139+ messages in thread
From: Rusty Russell @ 2008-05-30  7:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Friday 30 May 2008 15:38:25 Andrew Morton wrote:
> On Thu, 29 May 2008 22:17:55 -0700 (PDT) Christoph Lameter 
<clameter@sgi.com> wrote:
> > But then its related to percpu operations and relies extensively on the
> > various percpu.h files in asm-generic and asm-arch and include/linux
>
> Well that should be fixed.  We should never have mixed the
> alloc_percpu() and DEFINE_PER_CPU things in the same header.  They're
> different.
>
> otoh as you propose removing the old alloc_percpu() I guess the end
> result is no worse than what we presently have.

No, the worst thing is that this is a great deal of churn which doesn't 
actually fix the "running out of per-cpu memory" problem.

It can, and should, be fixed, before changing dynamic percpu alloc to use the 
same percpu pool.

Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  6:12         ` Christoph Lameter
@ 2008-05-30  7:08           ` Rusty Russell
  2008-05-30 18:00             ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-05-30  7:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Friday 30 May 2008 16:12:59 Christoph Lameter wrote:
> On Thu, 29 May 2008, Andrew Morton wrote:
> > > area. It requies disabling of interrupts etc.
> >
> > No it doesn't.  Look:
> >
> > static inline void local_inc(local_t *l)
> > {
> > 	asm volatile(_ASM_INC "%0"
> >
> > 		     : "+m" (l->a.counter));
> >
> > }
> >
> > > Its not atomic (wrt
> > > interrupts) because of that.
> >
> > Yes it is.
>
> No its not! In order to increment a per cpu value you need to calculate
> the per cpu pointer address in the current per cpu segment.

Christoph, you just missed it, that's all.  Look at cpu_local_read et al in 
include/asm-i386/local.h (ie. before the x86 mergers chose the lowest common 
denominator one).

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  6:37             ` Andrew Morton
@ 2008-05-30 11:32               ` Matthew Wilcox
  0 siblings, 0 replies; 139+ messages in thread
From: Matthew Wilcox @ 2008-05-30 11:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Eric Dumazet, linux-arch, linux-kernel,
	David Miller, Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, May 29, 2008 at 11:37:58PM -0700, Andrew Morton wrote:
> And struct nfs_iostats is 264 bytes and nfs does an alloc_percpu() of
> one of those per server and mounting thousands of servers per client
> is, I believe, a real-world operation.

Another example, not as extreme, there's an alloc_percpu(struct
disk_stats) [80 bytes on 64-bit machines] for every disk and every
partition in the machine.  The TPC system has 3000 disks, each with 14
partitions on it.  That's 15 * 80 * 3000 = 3,600,000 bytes.

Even if you're only putting a pointer to each allocation in the percpu
area, that's still 360,000 bytes, 12x as much as you think is sufficient
for the entire system.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  5:27       ` Christoph Lameter
  2008-05-30  5:49         ` Andrew Morton
@ 2008-05-30 14:38         ` Mike Travis
  2008-05-30 17:50           ` Christoph Lameter
  1 sibling, 1 reply; 139+ messages in thread
From: Mike Travis @ 2008-05-30 14:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell

Christoph Lameter wrote:
> On Thu, 29 May 2008, Andrew Morton wrote:
...
>> Plus who knows what lies ahead of us?
> 
> Well invariably we will end up with cpu area defragmentation.... Sigh.
> 
>> I don't think there is presently any upper limit on alloc_percpu()?  It
>> uses kmalloc() and kmalloc_node()?
>>
>> Even if there is some limit, is it an unfixable one?
> 
> No there is no limit. It just wastes lots of space (pointer arrays, 
> alignment etc) that we could use to configure sufficiently large per cpu 
> areas.

Is there any reason why the per_cpu area couldn't be made extensible?  Maybe
a simple linked list of available areas?  (And use a config variable and/or
boot param for initial size and increment size?)  [Ignoring the problem of
reclaiming the space...]
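
For illustration only, the bookkeeping for such an extensible space might
look roughly like this (every name below is invented for the sketch, and
reclaim is ignored as noted above):

struct cpu_area_chunk {
	struct list_head list;		/* chain of per cpu space chunks */
	unsigned long	 base;		/* offset of this chunk in cpu space */
	unsigned long	 units;		/* number of allocation units */
	unsigned long	*map;		/* allocation bitmap for this chunk */
};

static LIST_HEAD(cpu_area_chunks);	/* boot-time chunk plus later additions */

cpu_alloc() would walk the list and fall back to grabbing a new chunk when
all existing ones are full.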

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30 14:38         ` Mike Travis
@ 2008-05-30 17:50           ` Christoph Lameter
  2008-05-30 18:00             ` Matthew Wilcox
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30 17:50 UTC (permalink / raw)
  To: Mike Travis
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell

On Fri, 30 May 2008, Mike Travis wrote:

> > No there is no limit. It just wastes lots of space (pointer arrays, 
> > alignment etc) that we could use to configure sufficiently large per cpu 
> > areas.
> 
> Is there any reason why the per_cpu area couldn't be made extensible?  Maybe
> a simple linked list of available areas?  (And use a config variable and/or
> boot param for initial size and increment size?)  [Ignoring the problem of
> reclaiming the space...]

cpu alloc v2 had an extendable per cpu space. You have the patches. We 
could put this on top of this patchset if necessary. But then it's not so 
nice and simple anymore. Maybe we can restrict the use of cpu alloc 
instead to users with objects < cache_line_size() or so?


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 05/41] cpu alloc: Percpu_counter conversion
  2008-05-30  6:47   ` Rusty Russell
@ 2008-05-30 17:54     ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30 17:54 UTC (permalink / raw)
  To: Rusty Russell
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Mike Travis

On Fri, 30 May 2008, Rusty Russell wrote:

> On Friday 30 May 2008 13:56:25 Christoph Lameter wrote:
> > Use cpu_alloc instead of allocpercpu.
> 
> These patches seem like useless churn.
> 
> Plus, the new code is uglier than the old code :(

It drastically reduces the memory size: a 4 byte allocation f.e. requires 
SLAB to allocate a 32 byte chunk, so this reduces memory requirements by
32/4 = 8 times.

Plus the per cpu counters allocated in order are likely placed in the same 
cacheline (whereas the slab allocators avoid placing multiple objects in 
the same cacheline). Reduces cache footprint.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30 17:50           ` Christoph Lameter
@ 2008-05-30 18:00             ` Matthew Wilcox
  2008-05-30 18:12               ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Matthew Wilcox @ 2008-05-30 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra, Rusty Russell

On Fri, May 30, 2008 at 10:50:04AM -0700, Christoph Lameter wrote:
> cpu alloc v2 had an extendable per cpu space. You have the patches. We 
> could put this on top of this patchset if necessary. But then it not so 
> nice and simple anymore. Maybe we can rstrict the use of cpu alloc 
> instead to users with objects < cache_line_size() or so?

Restricting the use of cpu_alloc based on size of object is no good when
you're trying to allocate 45,000 objects.  Extending the per CPU space
is the only option.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  7:08           ` Rusty Russell
@ 2008-05-30 18:00             ` Christoph Lameter
  2008-06-02  2:00               ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30 18:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Fri, 30 May 2008, Rusty Russell wrote:

> > No its not! In order to increment a per cpu value you need to calculate
> > the per cpu pointer address in the current per cpu segment.
> 
> Christoph, you just missed it, that's all.  Look at cpu_local_read et al in 
> include/asm-i386/local.h (ie. before the x86 mergers chose the lowest common 
> denominator one).

There is no doubt that local_t does perform an atomic (wrt interrupts) inc, 
for example. But it's not usable, because you need to determine the address 
of the local_t belonging to the current processor first. As soon as you 
have loaded a processor specific address you can no longer be preempted, 
because preemption may change the processor and then the wrong address may be 
incremented (and then we have a race again since now we are incrementing 
counters belonging to other processors). So local_t at minimum requires 
disabling preempt.
 
Believe me, I have tried to use local_t repeatedly, for vm statistics etc.
It always fails on that issue. See, for example, the patch that converts
vmstat to cpu alloc and compare it with my initial local_t based
implementation from two years ago, which bombed out because I assumed that
local_t would work right.

cpu ops does both

1. The determination of the address of the object belonging to the local 
processor.

and

2. The RMW

in one instruction. That avoids having to disable preemption or interrupts,
and it shortens the instruction sequence significantly.
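
Roughly, as a sketch (THIS_CPU() is the helper from this patchset;
CPU_INC_GENERIC is just a label for the fallback shown here):

	/*
	 * Generic fallback: two separate steps, so preemption must be off
	 * in between, or a reschedule would make us increment another
	 * cpu's copy.
	 */
	#define CPU_INC_GENERIC(ptr)			\
	do {						\
		preempt_disable();			\
		(*THIS_CPU(ptr))++;			\
		preempt_enable();			\
	} while (0)

	/*
	 * x86 with cpu ops: the segment override folds the address
	 * selection into the RMW itself, roughly "incq %gs:(%reg)", so
	 * neither preemption nor interrupts need to be disabled.
	 */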


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 26/41] cpu alloc: Convert mib handling to cpu alloc
  2008-05-30  6:47   ` Eric Dumazet
@ 2008-05-30 18:01     ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30 18:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: akpm, linux-arch, linux-kernel, David Miller, Peter Zijlstra,
	Rusty Russell, Mike Travis

On Fri, 30 May 2008, Eric Dumazet wrote:

> We also can avoid the use of two arrays when CONFIG_HAVE_CPU_OPS
> since _CPU_INC() and __CPU_INC() are both interrupt safe.
> This would reduce size of mibs by 50% and complexity (no need to sum)

Right.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30 18:00             ` Matthew Wilcox
@ 2008-05-30 18:12               ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30 18:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra, Rusty Russell

On Fri, 30 May 2008, Matthew Wilcox wrote:

> On Fri, May 30, 2008 at 10:50:04AM -0700, Christoph Lameter wrote:
> > cpu alloc v2 had an extendable per cpu space. You have the patches. We 
> > could put this on top of this patchset if necessary. But then it not so 
> > nice and simple anymore. Maybe we can rstrict the use of cpu alloc 
> > instead to users with objects < cache_line_size() or so?
> 
> Restricting the use of cpu_alloc based on size of object is no good when
> you're trying to allocate 45,000 objects.  Extending the per CPU space
> is the only option.

OK, I guess we need to bring the virtually mapped per cpu patches forward.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  3:56 ` [patch 02/41] cpu alloc: The allocator Christoph Lameter
                     ` (2 preceding siblings ...)
  2008-05-30  5:46   ` Rusty Russell
@ 2008-05-31 20:58   ` Pavel Machek
  3 siblings, 0 replies; 139+ messages in thread
From: Pavel Machek @ 2008-05-31 20:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

Hi!

> Index: linux-2.6/mm/Kconfig
> ===================================================================
> --- linux-2.6.orig/mm/Kconfig	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/Kconfig	2008-05-29 20:13:39.000000000 -0700
> @@ -205,3 +205,9 @@ config NR_QUICK
>  config VIRT_TO_BUS
>  	def_bool y
>  	depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config CPU_ALLOC_SIZE
> +	int "Size of cpu alloc area"
> +	default "30000"
> +	help
> +	  Sets the maximum amount of memory that can be allocated via cpu_alloc

Missing '.' at the end of the line, but more importantly:

How do you expect a user to answer this?!

I mean, can they put 0 there and expect the kernel to work? If not, you
should disallow that value. And you should really explain how much
memory is needed... and perhaps mention that the size is in bytes?
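
For illustration, help text along these lines would answer most of that
(the range and numbers below are only placeholders, not a recommendation):

	config CPU_ALLOC_SIZE
		int "Size of cpu alloc area (bytes)"
		range 8192 1048576
		default "30000"
		help
		  Maximum amount of per cpu memory, in bytes, that cpu_alloc()
		  can hand out for each processor.  The kernel's own per cpu
		  counters already need several kilobytes; setting this too
		  low will make allocations fail at runtime.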

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30 18:00             ` Christoph Lameter
@ 2008-06-02  2:00               ` Rusty Russell
  2008-06-04 18:18                 ` Mike Travis
  2008-06-10 17:42                 ` Christoph Lameter
  0 siblings, 2 replies; 139+ messages in thread
From: Rusty Russell @ 2008-06-02  2:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Saturday 31 May 2008 04:00:40 Christoph Lameter wrote:
> On Fri, 30 May 2008, Rusty Russell wrote:
> > > No its not! In order to increment a per cpu value you need to calculate
> > > the per cpu pointer address in the current per cpu segment.
> >
> > Christoph, you just missed it, that's all.  Look at cpu_local_read et al
> > in include/asm-i386/local.h (ie. before the x86 mergers chose the lowest
> > common denominator one).
>
> There is no doubt that local_t does perform an atomic vs. interrupt inc
> for example. But its not usable. Because you need to determine the address
> of the local_t belonging to the current processor first.

Christoph!

STOP typing, and START reading.

cpu_local_inc() does all this: it takes the name of a local_t var, and is 
expected to increment this cpu's version of that.  You ripped this out and 
called it CPU_INC().

Do not make me explain it a third time.

> As soon as you 
> have loaded a processor specific address you can no longer be preempted
> because that may change the processor and then the wrong address may be
> increment (and then we have a race again since now we are incrementing
> counters belonging to other processors). So local_t at mininum requires
> disabling preempt.

Think for a moment.  What are the chances that I didn't understand this when I 
wrote the multiple implementations of local_t?

You are wasting my time explaining the obvious, and wasting your own.

> Believe me I have tried to use local_t repeatedly for vm statistics etc.
> It always fails on that issue.

Frankly, I am finding it increasingly easy to believe that you failed.  But 
you are blaming the wrong thing.

There are three implementations of local_t which are obvious.  The best is for 
architectures which can locate and increment a per-cpu var in one instruction 
(eg. x86).  Otherwise, using atomic_t/atomic64_t for local_t provides a 
general solution.  The other general solution would involve 
local_irq_disable()/increment/local_irq_enable().
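
In sketch form (generic versions only; the names here are illustrative,
and the x86 case is a single instruction in asm):

	/* (a) atomic based: safe anywhere, no preemption games. */
	typedef struct { atomic_long_t a; } local_t_sketch;
	#define local_inc_atomic(l)	atomic_long_inc(&(l)->a)

	/* (b) irq based: plain increment bracketed by irq save/restore. */
	static inline void local_inc_irqoff(local_t_sketch *l)
	{
		unsigned long flags;

		local_irq_save(flags);
		l->a.counter++;
		local_irq_restore(flags);
	}

	/* (c) x86: "incl %fs:offset" (or %gs:) locates this cpu's copy and
	 *     increments it in a single instruction. */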

My (fading) hope is that this idiocy is an aberration,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:31       ` Andrew Morton
@ 2008-06-02  9:29         ` Paul Jackson
  0 siblings, 0 replies; 139+ messages in thread
From: Paul Jackson @ 2008-06-02  9:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: clameter, linux-arch, linux-kernel, davem, dada1, peterz, rusty, travis

Andrew wrote:
> > > > +#define CPU_PTR(__p, __cpu)	SHIFT_PERCPU_PTR((__p), per_cpu_offset(__cpu))
> > > 
> > > eek, a major interface function which is ALL IN CAPS!
> > > 
> > > can we do this in lower-case?  In a C function?
> > 
> > No. This is a macro and therefore uppercase (there is macro magic going on 
> > that ppl need to be aware of). AFAICR you wanted it this way last year. C 
> > function not possible because of the type checking.
> 
> urgh.  This is a C-convention versus kernel-convention thing.  The C
> convention exists for very good reasons.  But it sure does suck.
> 
> What do others think?

A few key symbols get to be special ... short but distinctive names
that become (in)famous.  The classic was "u", for the per-user
structure, aka the "user area", in old Unix kernels.  In people's
names, a few one word or first names such as "Ike", "Madonna", "Ali",
"Tiger", "Cher", "Mao", "OJ", "Plato", "Linus", ... have become
distinctive and well known to many people.

How about "_pcpu", instead of CPU_PTR?  "_pcpu" is a short, unique
(not currently in use) symbol that, tersely, says what we want to say.

Yes - it violates multiple conventions.  "The Boss" (Bruce Springsteen)
gets to do that.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [patch 01/41] cpu_alloc: Increase percpu area size to 128k
  2008-05-30  3:56 ` [patch 01/41] cpu_alloc: Increase percpu area size to 128k Christoph Lameter
@ 2008-06-02 17:58   ` Luck, Tony
  2008-06-02 23:48     ` Rusty Russell
  2008-06-10 17:22     ` Christoph Lameter
  0 siblings, 2 replies; 139+ messages in thread
From: Luck, Tony @ 2008-06-02 17:58 UTC (permalink / raw)
  To: Christoph Lameter, akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

> The per cpu allocator requires more per cpu space and we are already near
> the limit on IA64. Increase the maximum size of the IA64 per cpu area from
> 64K to 128K.

> -#define PERCPU_PAGE_SHIFT	16	/* log2() of max. size of per-CPU area */
> +#define PERCPU_PAGE_SHIFT	17	/* log2() of max. size of per-CPU area */

Don't you need some more changes to the alt_dtlb_miss handler in
ivt.S for this to work?  128K is not a supported pagesize on any
processor model.

-Tony

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 01/41] cpu_alloc: Increase percpu area size to 128k
  2008-06-02 17:58   ` Luck, Tony
@ 2008-06-02 23:48     ` Rusty Russell
  2008-06-10 17:22     ` Christoph Lameter
  1 sibling, 0 replies; 139+ messages in thread
From: Rusty Russell @ 2008-06-02 23:48 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Christoph Lameter, akpm, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Tuesday 03 June 2008 03:58:17 Luck, Tony wrote:
> > The per cpu allocator requires more per cpu space and we are already near
> > the limit on IA64. Increase the maximum size of the IA64 per cpu area
> > from 64K to 128K.
> >
> > -#define PERCPU_PAGE_SHIFT	16	/* log2() of max. size of per-CPU area */
> > +#define PERCPU_PAGE_SHIFT	17	/* log2() of max. size of per-CPU area */
>
> Don't you need some more changes to the alt_dtlb_miss handler in
> ivt.S for this to work?  128K is not a supported pagesize on any
> processor model.

Yes, this was one of the issues with IA64 and extending the per-cpu area.  
It's probable that the IA64 TLB nailing trick might have to give way for 
dynamic per-cpu...

Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  4:58   ` Andrew Morton
  2008-05-30  5:10     ` Christoph Lameter
@ 2008-06-04 14:48     ` Mike Travis
  1 sibling, 0 replies; 139+ messages in thread
From: Mike Travis @ 2008-06-04 14:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell


Andrew Morton wrote:
...
>> +/*
>> + * Mark an object as used in the cpu_alloc_map
>> + *
>> + * Must hold cpu_alloc_map_lock
>> + */
>> +static void set_map(int start, int length)
>> +{
>> +	while (length-- > 0)
>> +		__set_bit(start++, cpu_alloc_map);
>> +}
> 
> bitmap_fill()?
> 
>> +/*
>> + * Mark an area as freed.
>> + *
>> + * Must hold cpu_alloc_map_lock
>> + */
>> +static void clear_map(int start, int length)
>> +{
>> +	while (length-- > 0)
>> +		__clear_bit(start++, cpu_alloc_map);
>> +}
> 
> bitmap_zero()?
...
>> +void *cpu_alloc(unsigned long size, gfp_t gfpflags, unsigned long align)
>> +{
>> +	unsigned long start;
>> +	int units = size_to_units(size);
>> +	void *ptr;
>> +	int first;
>> +	unsigned long flags;
>> +
>> +	if (!size)
>> +		return ZERO_SIZE_PTR;
> 
> OK, so we reuse ZERO_SIZE_PTR from kmalloc.
> 
>> +	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
>> +
>> +	first = 1;
>> +	start = first_free;
>> +
>> +	for ( ; ; ) {
>> +
>> +		start = find_next_zero_bit(cpu_alloc_map, UNITS, start);
>> +		if (start >= UNITS)
>> +			goto out_of_memory;
>> +
>> +		if (first)
>> +			first_free = start;
>> +
>> +		/*
>> +		 * Check alignment and that there is enough space after
>> +		 * the starting unit.
>> +		 */
>> +		if (start % (align / UNIT_SIZE) == 0 &&
>> +			find_next_bit(cpu_alloc_map, UNITS, start + 1)
>> +							>= start + units)
>> +				break;
>> +		start++;
>> +		first = 0;
>> +	}
> 
> This is kinda bitmap_find_free_region(), only bitmap_find_free_region()
> isn't quite strong enough.
> 
> Generally I think it would have been better if you had added new
> primitives to the bitmap library (or enhanced existing ones) and used
> them here, rather than implementing private functionality.

Hi Andrew,

I've sort of inherited this now...

So are you suggesting that we add new bitmap primitives such as:

	bitmap_fill_offset(bitmap, start, nbits)   /* start at bitmap[start] */
	bitmap_zero_offset(bitmap, start, nbits)
	bitmap_find_free_area(bitmap, nbits, size, align)  /* size not order */
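
For the third one, something along these lines, essentially the search
loop from cpu_alloc() above with the alignment expressed in map units
(sketch only; the return-'bits'-on-failure convention is assumed, and the
caller still has to mark the bits used):

	static unsigned long bitmap_find_free_area(const unsigned long *map,
						   unsigned long bits,
						   unsigned long len,
						   unsigned long align)
	{
		unsigned long start = 0;

		for (;;) {
			/* first candidate: next zero bit at or after start */
			start = find_next_zero_bit(map, bits, start);
			if (start >= bits)
				return bits;	/* no suitable run found */
			/* aligned and followed by at least len free bits? */
			if (start % align == 0 &&
			    find_next_bit(map, bits, start + 1) >= start + len)
				return start;
			start++;
		}
	}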

Thanks,
Mike 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:20     ` Christoph Lameter
  2008-05-30  5:52       ` Rusty Russell
  2008-05-30  5:54       ` Eric Dumazet
@ 2008-06-04 14:58       ` Mike Travis
  2008-06-04 15:11         ` Eric Dumazet
  2008-06-10 17:33         ` Christoph Lameter
  2 siblings, 2 replies; 139+ messages in thread
From: Mike Travis @ 2008-06-04 14:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, akpm, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell

Christoph Lameter wrote:
> On Fri, 30 May 2008, Eric Dumazet wrote:
> 
>>> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
>>>   
>> area[] is not guaranteed to be aligned on anything but 4 bytes.
>>
>> If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get an non
>> aligned result.
>>
>> Either you should add an __attribute__((__aligned__(PAGE_SIZE))),
>> or take into account the real address of area[] in cpu_alloc() to avoid waste
>> of up to PAGE_SIZE bytes
>> per cpu.
> 
> I think cacheline aligning should be sufficient. People should not 
> allocate large page aligned objects here.

I'm a bit confused.  Why is DEFINE_PER_CPU_SHARED_ALIGNED() conditioned on
ifdef MODULE?

        #ifdef MODULE
        #define SHARED_ALIGNED_SECTION ".data.percpu"
        #else
        #define SHARED_ALIGNED_SECTION ".data.percpu.shared_aligned"
        #endif

        #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)                       \
                __attribute__((__section__(SHARED_ALIGNED_SECTION)))            \
                PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name             \
                ____cacheline_aligned_in_smp

Thanks,
Mike

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:46   ` Rusty Russell
@ 2008-06-04 15:04     ` Mike Travis
  2008-06-10 17:34       ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Mike Travis @ 2008-06-04 15:04 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Lameter, akpm, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra

Rusty Russell wrote:
> On Friday 30 May 2008 13:56:22 Christoph Lameter wrote:
>> The per cpu allocator allows dynamic allocation of memory on all
>> processors simultaneously. A bitmap is used to track used areas.
>> The allocator implements tight packing to reduce the cache footprint
>> and increase speed since cacheline contention is typically not a concern
>> for memory mainly used by a single cpu. Small objects will fill up gaps
>> left by larger allocations that required alignments.
> 
> Allocator seems nice and simple, similar to existing one in module.c (which 
> predates cool bitmap operators).
> 
> Being able to do per-cpu allocations in an interrupt handler seems like 
> encouraging a Bad Idea though: I'd be tempted to avoid the flags word, always 
> zero, and use a mutex instead of a spinlock.
> 
> Cheers,
> Rusty.

I haven't seen any further discussion on these aspects... is there a consensus
to remove the flags from CPU_ALLOC() and use a mutex?

Thanks,
Mike

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  4:58 ` [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Andrew Morton
  2008-05-30  5:03   ` Christoph Lameter
@ 2008-06-04 15:07   ` Mike Travis
  2008-06-06  5:33     ` Eric Dumazet
  1 sibling, 1 reply; 139+ messages in thread
From: Mike Travis @ 2008-06-04 15:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell

Andrew Morton wrote:
> On Thu, 29 May 2008 20:56:20 -0700 Christoph Lameter <clameter@sgi.com> wrote:
> 
>> In various places the kernel maintains arrays of pointers indexed by
>> processor numbers. These are used to locate objects that need to be used
>> when executing on a specirfic processor. Both the slab allocator
>> and the page allocator use these arrays and there the arrays are used in
>> performance critical code. The allocpercpu functionality is a simple
>> allocator to provide these arrays.
> 
> All seems reasonable to me.  The obvious question is "how do we size
> the arena".  We either waste memory or, much worse, run out.
> 
> And running out is a real possibility, I think.  Most people will only
> mount a handful of XFS filesystems.  But some customer will come along
> who wants to mount 5,000, and distributors will need to cater for that,
> but how can they?
> 
> I wonder if we can arrange for the default to be overridden via a
> kernel boot option?
> 
> 
> Another obvious question is "how much of a problem will we have with
> internal fragmentation"?  This might be a drop-dead showstopper.

One problem with a variable sized cpu_alloc area is this comment in bitmap.h:

 * Note that nbits should be always a compile time evaluable constant.
 * Otherwise many inlines will generate horrible code.

I'm guessing that since this will be low use and not performance critical,
we can ignore the "horrible code"?  ;-)

Thanks,
Mike


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-06-04 14:58       ` Mike Travis
@ 2008-06-04 15:11         ` Eric Dumazet
  2008-06-06  0:32           ` Rusty Russell
  2008-06-10 17:33         ` Christoph Lameter
  1 sibling, 1 reply; 139+ messages in thread
From: Eric Dumazet @ 2008-06-04 15:11 UTC (permalink / raw)
  To: Mike Travis
  Cc: Christoph Lameter, akpm, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell

Mike Travis wrote:
> Christoph Lameter wrote:
>> On Fri, 30 May 2008, Eric Dumazet wrote:
>>
>>>> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
>>>>   
>>> area[] is not guaranteed to be aligned on anything but 4 bytes.
>>>
>>> If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get an non
>>> aligned result.
>>>
>>> Either you should add an __attribute__((__aligned__(PAGE_SIZE))),
>>> or take into account the real address of area[] in cpu_alloc() to avoid waste
>>> of up to PAGE_SIZE bytes
>>> per cpu.
>> I think cacheline aligning should be sufficient. People should not 
>> allocate large page aligned objects here.
> 
> I'm a bit confused.  Why is DEFINE_PER_CPU_SHARED_ALIGNED() conditioned on
> ifdef MODULE?
> 
>         #ifdef MODULE
>         #define SHARED_ALIGNED_SECTION ".data.percpu"
>         #else
>         #define SHARED_ALIGNED_SECTION ".data.percpu.shared_aligned"
>         #endif
> 
>         #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)                       \
>                 __attribute__((__section__(SHARED_ALIGNED_SECTION)))            \
>                 PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name             \
>                 ____cacheline_aligned_in_smp
> 
> Thanks,
> Mike
> 
> 

Because we had crashes when loading the oprofile module, when a previous
version of oprofile used a DEFINE_PER_CPU_SHARED_ALIGNED variable.

The module loader only takes into account the special section ".data.percpu"
and ignores ".data.percpu.shared_aligned".

I therefore submitted two patches:

1) commit 8b8b498836942c0c855333d357d121c0adeefbd9
oprofile: don't request cache line alignment for cpu_buffer

Alignment was previously requested because cpu_buffer was an [NR_CPUS]
array, to avoid cache line sharing between CPUS.

After commit 608dfddd845da5ab6accef70154c8910529699f7 (oprofile: change
cpu_buffer from array to per_cpu variable ), we dont need to force an
alignement anymore since cpu_buffer sits in per_cpu zone.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Mike Travis <travis@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


2) and commit 	44c81433e8b05dbc85985d939046f10f95901184
per_cpu: fix DEFINE_PER_CPU_SHARED_ALIGNED for modules

Current module loader lookups ".data.percpu" ELF section to perform
per_cpu relocation.  But DEFINE_PER_CPU_SHARED_ALIGNED() uses another
section (".data.percpu.shared_aligned"), currently only handled in
vmlinux.lds, not by module loader.

To correct this problem, instead of adding logic into module loader, or
using at build time a module.lds file for all arches to group
".data.percpu.shared_aligned" into ".data.percpu", just use ".data.percpu"
for modules.

Alignment requirements are correctly handled by ld and module loader.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>




^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:52       ` Rusty Russell
@ 2008-06-04 15:30         ` Mike Travis
  2008-06-05 23:48           ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Mike Travis @ 2008-06-04 15:30 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Lameter, Eric Dumazet, akpm, linux-arch, linux-kernel,
	David Miller, Peter Zijlstra

Rusty Russell wrote:
> On Friday 30 May 2008 15:20:45 Christoph Lameter wrote:
>> On Fri, 30 May 2008, Eric Dumazet wrote:
>>>> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
>>> area[] is not guaranteed to be aligned on anything but 4 bytes.
>>>
>>> If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get
>>> an non aligned result.
>>>
>>> Either you should add an __attribute__((__aligned__(PAGE_SIZE))),
>>> or take into account the real address of area[] in cpu_alloc() to avoid
>>> waste of up to PAGE_SIZE bytes
>>> per cpu.
>> I think cacheline aligning should be sufficient. People should not
>> allocate large page aligned objects here.
> 
> I vaguely recall there were issues with this in the module code.  They might 
> be gone now, but failing to meet alignment contraints without a big warning 
> would suck.
> 
> But modifying your code to consider the actual alignment is actually pretty 
> trivial, AFAICT.
> 
> Cheers,
> Rusty.

So paraphrasing my earlier email, we should add:

	bitmap_find_free_area(bitmap, nbits, size, align, alignbase)

so that > cacheline alignment is possible?

My thinking is that if we do go to a truly dynamically sized cpu_alloc area,
then allocating PAGE_SIZE units may be both practical and worthwhile...?

Thanks,
Mike

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-02  2:00               ` Rusty Russell
@ 2008-06-04 18:18                 ` Mike Travis
  2008-06-05 23:59                   ` Rusty Russell
  2008-06-09 23:09                   ` Christoph Lameter
  2008-06-10 17:42                 ` Christoph Lameter
  1 sibling, 2 replies; 139+ messages in thread
From: Mike Travis @ 2008-06-04 18:18 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Lameter, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra


> cpu_local_inc() does all this: it takes the name of a local_t var, and is 
> expected to increment this cpu's version of that.  You ripped this out and 
> called it CPU_INC().

Hi,

I'm attempting to test both approaches to compare the generated object code,
in order to understand the issues involved here.  Here's my code:

        void test_cpu_inc(int *s)
        {
                __CPU_INC(s);
        }

        void test_local_inc(local_t *t)
        {
                __local_inc(THIS_CPU(t));
        }

        void test_cpu_local_inc(local_t *t)
        {
                __cpu_local_inc(t);
        }

But I don't know how I can use cpu_local_inc because the pointer to the object
is not &__get_cpu_var(l):

	#define __cpu_local_inc(l)      cpu_local_inc((l))
	#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var((l))))

At the minimum, we would need a new local_t op to get the correct CPU_ALLOC'd
pointer value for the increment.  These new local_t ops for CPU_ALLOC'd
variables could be implemented with the CPU_XXX primitives, or with just a
base val_to_ptr primitive to replace __get_cpu_var().

I did notice this in local.h:

	 * X86_64: This could be done better if we moved the per cpu data directly
	 * after GS.

... which it now is, so true per_cpu variables could be optimized better as well.

Also, the above cpu_local_wrap(...) adds:

	#define cpu_local_wrap(l)               \
	({                                      \
	        preempt_disable();              \
	        (l);                            \
	        preempt_enable();               \
	})                                      \

... and there isn't a non-preemption version that I can find.

Here is the generated object code:

0000000000000000 <test_cpu_inc>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 83 ec 08             sub    $0x8,%rsp
   8:   48 89 7d f8             mov    %rdi,0xfffffffffffffff8(%rbp)
   c:   65 48 ff 45 f8          incq   %gs:0xfffffffffffffff8(%rbp)
  11:   c9                      leaveq
  12:   c3                      retq

0000000000000013 <test_local_inc>:
  13:   55                      push   %rbp
  14:   65 48 8b 05 00 00 00    mov    %gs:0(%rip),%rax        # 1c <test_local_inc+0x9>
  1b:   00
  1c:   48 89 e5                mov    %rsp,%rbp
  1f:   48 ff 04 07             incq   (%rdi,%rax,1)
  23:   c9                      leaveq
  24:   c3                      retq


With a new local_t op, test_local_inc could probably be optimized to the
same instructions as test_cpu_inc.

One other distinction is that CPU_INC increments an arbitrarily sized
variable while local_inc requires a local_t variable, so local_inc may not
be usable in all cases.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-06-04 15:30         ` Mike Travis
@ 2008-06-05 23:48           ` Rusty Russell
  0 siblings, 0 replies; 139+ messages in thread
From: Rusty Russell @ 2008-06-05 23:48 UTC (permalink / raw)
  To: Mike Travis
  Cc: Christoph Lameter, Eric Dumazet, akpm, linux-arch, linux-kernel,
	David Miller, Peter Zijlstra

On Thursday 05 June 2008 01:30:23 Mike Travis wrote:
> Rusty Russell wrote:
> > On Friday 30 May 2008 15:20:45 Christoph Lameter wrote:
> >> On Fri, 30 May 2008, Eric Dumazet wrote:
> >>>> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
> >>>
> >>> area[] is not guaranteed to be aligned on anything but 4 bytes.
> >>>
> >>> If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get
> >>> an non aligned result.
> >>>
> >>> Either you should add an __attribute__((__aligned__(PAGE_SIZE))),
> >>> or take into account the real address of area[] in cpu_alloc() to avoid
> >>> waste of up to PAGE_SIZE bytes
> >>> per cpu.
> >>
> >> I think cacheline aligning should be sufficient. People should not
> >> allocate large page aligned objects here.
> >
> > I vaguely recall there were issues with this in the module code.  They
> > might be gone now, but failing to meet alignment contraints without a big
> > warning would suck.
> >
> > But modifying your code to consider the actual alignment is actually
> > pretty trivial, AFAICT.
> >
> > Cheers,
> > Rusty.
>
> So paraphrasing my earlier email, we should add:
>
> 	bitmap_find_free_area(bitmap, nbits, size, align, alignbase)
>
> so that > cacheline alignment is possible?
>
> My thinking is that if we do go to true dynamically sized cpu_alloc area
> then allocating PAGE_SIZE units may be both practical and worthwhile...?
>
> Thanks,
> Mike

Well, my thinking is that unless we do true dynamic per-cpu, this entire patch 
series is a non-starter :(

Once we have that, we can reopen this.  Then we'll discuss why we're writing a 
new allocator rather than using the existing one :)

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-04 18:18                 ` Mike Travis
@ 2008-06-05 23:59                   ` Rusty Russell
  2008-06-09 19:00                     ` Christoph Lameter
  2008-06-09 23:09                   ` Christoph Lameter
  1 sibling, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-05 23:59 UTC (permalink / raw)
  To: Mike Travis
  Cc: Christoph Lameter, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Thursday 05 June 2008 04:18:19 Mike Travis wrote:
> > cpu_local_inc() does all this: it takes the name of a local_t var, and is
> > expected to increment this cpu's version of that.  You ripped this out
> > and called it CPU_INC().
>
> Hi,
>
> I'm attempting to test both approaches to compare the object generated in
> order to understand the issues involved here.  Here's my code:
>
>         void test_cpu_inc(int *s)
>         {
>                 __CPU_INC(s);
>         }
>
>         void test_local_inc(local_t *t)
>         {
>                 __local_inc(THIS_CPU(t));
>         }
>
>         void test_cpu_local_inc(local_t *t)
>         {
>                 __cpu_local_inc(t);
>         }
>
> But I don't know how I can use cpu_local_inc because the pointer to the
> object is not &__get_cpu_var(l):

Yes.  Because the only true per-cpu vars are the static ones, cpu_local_inc() 
only works on identifiers, not arbitrary pointers.  Once this is fixed, we 
should be enhancing the infrastructure to allow that (AFAICT it's not too 
hard, but we should add an __percpu marker for sparse).

> At the minimum, we would need a new local_t op to get the correct
> CPU_ALLOC'd pointer value for the increment.  These new local_t ops for
> CPU_ALLOC'd variables could use CPU_XXX primitives to implement them, or
> just a base val_to_ptr primitive to replace __get_cpu_var().

I think the latter: __get_cpu_ptr() perhaps?
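
Something like this, as a sketch built from helpers already mentioned in
this thread (SHIFT_PERCPU_PTR and my_cpu_offset; the name is the one
suggested above):

	#define __get_cpu_ptr(ptr)	SHIFT_PERCPU_PTR((ptr), my_cpu_offset)

cpu_local_inc() on a cpu_alloc'd pointer would then become
local_inc(__get_cpu_ptr(l)) under the usual preemption protection.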

> I did notice this in local.h:
>
> 	 * X86_64: This could be done better if we moved the per cpu data directly
> 	 * after GS.
>
> ... which it now is, so true per_cpu variables could be optimized better as
> well.

Indeed.

>
> Also, the above cpu_local_wrap(...) adds:
>
> 	#define cpu_local_wrap(l)               \
> 	({                                      \
> 	        preempt_disable();              \
> 	        (l);                            \
> 	        preempt_enable();               \
> 	})                                      \
>
> ... and there isn't a non-preemption version that I can find.

Yes, this should be fixed.  I thought i386 had optimized versions pre-merge,
but I was wrong (%gs for per-cpu came later, and no one cleaned up these
naive versions).  Did you want me to write them?

I actually think that using local_t == atomic_t is better than 
preempt_disable/enable for most archs which can't do atomic deref-and-inc.

> One other distinction is CPU_INC increments an arbitrary sized variable
> while local_inc requires a local_t variable.  This may not make it usable
> in all cases.

You might be right, but note that local_t is 64 bit on 64-bit platforms.  And 
speculation of possible use cases isn't a good reason to rip out working 
infrastructure :)

Cheers,
Rusty.


>
> Thanks,
> Mike



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-06-04 15:11         ` Eric Dumazet
@ 2008-06-06  0:32           ` Rusty Russell
  0 siblings, 0 replies; 139+ messages in thread
From: Rusty Russell @ 2008-06-06  0:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mike Travis, Christoph Lameter, akpm, linux-arch, linux-kernel,
	David Miller, Peter Zijlstra

On Thursday 05 June 2008 01:11:00 Eric Dumazet wrote:
> Mike Travis a écrit :
> > I'm a bit confused.  Why is DEFINE_PER_CPU_SHARED_ALIGNED() conditioned
> > on ifdef MODULE?
> Because we had crashes when loading oprofile module, when a previous
> version of oprofile used to use DEFINE_PER_CPU_SHARED_ALIGNED variable
>
> module loader only takes into account the special section ".data.percpu"
> and ignores ".data.percpu.shared_aligned"
>
> I therefore submitted two patches :

Put one way, putting page-aligned per-cpu data in a separate section is a 
space-saving hack: one which is not really required for modules because of 
the low frequency of such variables.  Put another way, not respecting 
the .data.percpu.shared_aligned section in modules is a bug.

But a comment would probably be nice!
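
Something along these lines, right next to the #ifdef (wording is only a
suggestion):

	#ifdef MODULE
	/*
	 * The module loader only relocates ".data.percpu" and knows nothing
	 * about ".data.percpu.shared_aligned", so modules keep their
	 * cacheline-aligned per cpu variables in the plain per cpu section
	 * and rely on the alignment attribute alone.
	 */
	#define SHARED_ALIGNED_SECTION ".data.percpu"
	#else
	#define SHARED_ALIGNED_SECTION ".data.percpu.shared_aligned"
	#endif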

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-06-04 15:07   ` Mike Travis
@ 2008-06-06  5:33     ` Eric Dumazet
  2008-06-06 13:08       ` Mike Travis
                         ` (2 more replies)
  0 siblings, 3 replies; 139+ messages in thread
From: Eric Dumazet @ 2008-06-06  5:33 UTC (permalink / raw)
  To: Mike Travis, Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell

Mike Travis wrote:
> Andrew Morton wrote:
>> On Thu, 29 May 2008 20:56:20 -0700 Christoph Lameter <clameter@sgi.com> wrote:
>>
>>> In various places the kernel maintains arrays of pointers indexed by
>>> processor numbers. These are used to locate objects that need to be used
>>> when executing on a specirfic processor. Both the slab allocator
>>> and the page allocator use these arrays and there the arrays are used in
>>> performance critical code. The allocpercpu functionality is a simple
>>> allocator to provide these arrays.
>> All seems reasonable to me.  The obvious question is "how do we size
>> the arena".  We either waste memory or, much worse, run out.
>>
>> And running out is a real possibility, I think.  Most people will only
>> mount a handful of XFS filesystems.  But some customer will come along
>> who wants to mount 5,000, and distributors will need to cater for that,
>> but how can they?
>>
>> I wonder if we can arrange for the default to be overridden via a
>> kernel boot option?
>>
>>
>> Another obvious question is "how much of a problem will we have with
>> internal fragmentation"?  This might be a drop-dead showstopper.

Christoph & Mike,

Please forgive me if I beat a dead horse, but this percpu stuff
should find its way.

I wonder why you absolutely want to have only one chunk holding
all percpu variables, static(vmlinux) & static(modules)
& dynamically allocated.

It's *not* possible to put an arbitrary limit on this global zone.
You'll always find somebody to break that limit. This is the point
we must solve before coding anything.

Have you considered using a list of fixed size chunks, each chunk
having its own bitmap?

We only want fixed offsets between CPU locations. For a given variable,
we MUST find the addresses for all CPUs using the same offset table.
(Then we can optimize things on x86, using the %gs/%fs register instead
of a table lookup.)

We could choose the chunk size at compile time, depending on various
parameters (32/64 bit arches, or hugepage sizes on NUMA),
and a minimum value (ABI guarantee).

On x86_64 && NUMA we could use 2 Mbyte chunks, while
on x86_32 or non NUMA we should probably use 64 Kbyte chunks.

At boot time, we set up the first chunk (chunk 0), copy
.data.percpu into this chunk for each possible cpu, and
build the bitmap for future dynamic/module percpu allocations.
So we still have the restriction that sizeofsection(.data.percpu)
should fit in chunk 0. Not a problem in practice.

Then, if we need to expand the percpu zone for a heavy duty machine
and chunk 0 is already filled, we can add as many 2M / 64K
chunks as we need.

This would limit a single dynamic percpu allocation to 64 kbytes,
so huge users should probably still use a different allocator
(like the oprofile alloc_cpu_buffers() function).
But at least we no longer limit the total size of the percpu area.

I understand you want percpu data offset to 0, but that is for
static percpu data (the pda being included in it, to share %gs).

For dynamically allocated percpu variables (including module
".data.percpu"), nothing forces you to have low offsets
relative to the %gs/%fs register. Access to these variables
will be register indirect anyway (e.g. %gs:(%rax)).


1) NUMA case

For a 64 bit NUMA arch, use a chunk size of 2 Mbytes.

Allocate 2Mb for each possible processor (on its preferred memory
node), and compute values to set up the offset_of_cpu[NR_CPUS] array.

Chunk 0
CPU 0 : virtual address XXXXXX        
CPU 1 : virtual address XXXXXX + offset_of_cpu[1]
...
CPU n : virtual address XXXXXX + offset_of_cpu[n]
+ a shared bitmap


For the next chunks, we could use the vmalloc() zone to find
nr_possible_cpus virtual address ranges where we can map
a 2Mb page per possible cpu, as long as we respect the relative
delta between the cpu blocks that was computed when
chunk 0 was set up.

Chunk 1..n
CPU 0 : virtual address YYYYYYYYYYYYYY   
CPU 1 : virtual address YYYYYYYYYYYYYY + offset_of_cpu[1]
...
CPU n : virtual address YYYYYYYYYYYYYY + offset_of_cpu[n]
+ a shared bitmap (32Kbytes if 8 bytes granularity in allocator)

For a variable located in chunk 0, its 'address' relative to current
cpu %gs will be some number between [0 and 2^20-1]

For a variable located in chunk 1, its 'address' relative to current
cpu %gs will be some number between
[YYYYYYYYYYYYYY - XXXXXX  and YYYYYYYYYYYYYY - XXXXXX + 2^20 - 1],
not necessarily [2^20 to 2^21 - 1]


Chunk 0 would use normal memory (no vmap TLB cost); only the next ones need vmalloc().

So the extra TLB cost would only be taken for very special NUMA setups
(only if using a lot of percpu allocations)

Also, using a 2Mb page granularity probably wastes about 2Mb per cpu, but
this is nothing for NUMA machines :)

2) SMP && !NUMA

On non NUMA machines, we don't need vmalloc games, since we can allocate
the chunk space using contiguous memory (size = nr_possible_cpus * 64 Kbytes).

offset_of_cpu[N] = N * CHUNK_SIZE

(On a 4 CPU x86_32 machine, allocate a 256 Kbyte block then divide it into
64 kb blocks.)
If this order-6 allocation fails, then fall back to vmalloc(), but most
percpu allocations happen at boot time, when memory is not yet fragmented...


3) UP case: fall back to standard allocators. No need for bitmaps.


NUMA special casing can be implemented later of course...
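
A minimal sketch of the lookup this layout implies (structure and names
here are only for illustration: every chunk records CPU 0's base address,
and all chunks reuse the offset table computed when chunk 0 was set up):

	struct pcpu_chunk {
		void		*cpu0_base;	/* CPU 0's copy of this chunk */
		unsigned long	*map;		/* shared allocation bitmap   */
	};

	static unsigned long offset_of_cpu[NR_CPUS];

	static inline void *chunk_ptr(struct pcpu_chunk *chunk,
				      unsigned long off, int cpu)
	{
		/* same 'off' for every cpu; only the per-cpu delta differs */
		return chunk->cpu0_base + off + offset_of_cpu[cpu];
	}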

Thanks for reading



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-06-06  5:33     ` Eric Dumazet
@ 2008-06-06 13:08       ` Mike Travis
  2008-06-08  6:00       ` Rusty Russell
  2008-06-09 18:44       ` Christoph Lameter
  2 siblings, 0 replies; 139+ messages in thread
From: Mike Travis @ 2008-06-06 13:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Peter Zijlstra, Rusty Russell, Ingo Molnar,
	Jack Steiner

Eric Dumazet wrote:
> Mike Travis a écrit :
>> Andrew Morton wrote:
>>> On Thu, 29 May 2008 20:56:20 -0700 Christoph Lameter <clameter@sgi.com> wrote:
>>>
>>>> In various places the kernel maintains arrays of pointers indexed by
>>>> processor numbers. These are used to locate objects that need to be used
>>>> when executing on a specirfic processor. Both the slab allocator
>>>> and the page allocator use these arrays and there the arrays are used in
>>>> performance critical code. The allocpercpu functionality is a simple
>>>> allocator to provide these arrays.
>>> All seems reasonable to me.  The obvious question is "how do we size
>>> the arena".  We either waste memory or, much worse, run out.
>>>
>>> And running out is a real possibility, I think.  Most people will only
>>> mount a handful of XFS filesystems.  But some customer will come along
>>> who wants to mount 5,000, and distributors will need to cater for that,
>>> but how can they?
>>>
>>> I wonder if we can arrange for the default to be overridden via a
>>> kernel boot option?
>>>
>>>
>>> Another obvious question is "how much of a problem will we have with
>>> internal fragmentation"?  This might be a drop-dead showstopper.
> 
> Christoph & Mike,
> 
> Please forgive me if I beat a dead horse, but this percpu stuff
> should find its way.
> 
> I wonder why you absolutely want to have only one chunk holding
> all percpu variables, static(vmlinux) & static(modules)
> & dynamically allocated.
> 
> Its *not* possible to put an arbitrary limit to this global zone.
> You'll allways find somebody to break this limit. This is the point
> we must solve, before coding anything.
> 
> Have you considered using a list of fixed size chunks, each chunk
> having its own bitmap ?
> 
> We only want fix offsets between CPU locations. For a given variable,
> we MUST find addresses for all CPUS looking at the same offset table.
> (Then we can optimize things on x86, using %gs/%fs register, instead
> of a table lookup)
> 
> We could chose chunk size at compile time, depending on various
> parameters (32/64 bit arches, or hugepage sizes on NUMA),
> and a minimum value (ABI guarantee)
> 
> On x86_64 && NUMA we could use 2 Mbytes chunks, while
> on x86_32 or non NUMA we should probably use 64 Kbytes.
> 
> At boot time, we setup the first chunk (chunk 0) and copy 
> .data.percpu on this chunk, for each possible cpu, and we
> build the bitmap for future dynamic/module percpu allocations.
> So we still have the restriction that sizeofsection(.data.percpu)
> should fit in the chunk 0. Not a problem in practice.
> 
> Then if we need to expand percpu zone for heavy duty machine,
> and chunk 0 is already filled, we can add as many 2 M/ 64K 
> chunks we need.
> 
> This would limit the dynamic percpu allocation to 64 kbytes for
> a given variable, so huge users should probably still use a
> different allocator (like oprofile alloc_cpu_buffers() function)
> But at least we dont anymore limit the total size of percpu area.
> 
> I understand you want to offset percpu data to 0, but for
> static percpu data. (pda being included in, to share %gs)
> 
> For dynamically allocated percpu variables (including modules
> ".data.percpu"), nothing forces you to have low offsets,
> relative to %gs/%fs register. Access to these variables
> will be register indirect based anyway (eg %gs:(%rax) )
> 
> 
> 1) NUMA case
> 
> For a 64 bit NUMA arch, chunk size of 2Mbytes
> 
> Allocates 2Mb for each possible processor (on its preferred memory
> node), and compute values to setup offset_of_cpu[NR_CPUS] array.
> 
> Chunk 0
> CPU 0 : virtual address XXXXXX        
> CPU 1 : virtual address XXXXXX + offset_of_cpu[1]
> ...
> CPU n : virtual address XXXXXX + offset_of_cpu[n]
> + a shared bitmap
> 
> 
> For next chunks, we could use vmalloc() zone to find 
> nr_possible_cpus virtual addresses ranges where you can map
> a 2Mb page per possible cpu, as long as we respect the relative
> delta between each cpu block, that was computed when
> chunk 0 was setup.
> 
> Chunk 1..n
> CPU 0 : virtual address YYYYYYYYYYYYYY   
> CPU 1 : virtual address YYYYYYYYYYYYYY + offset_of_cpu[1]
> ...
> CPU n : virtual address YYYYYYYYYYYYYY + offset_of_cpu[n]
> + a shared bitmap (32Kbytes if 8 bytes granularity in allocator)
> 
> For a variable located in chunk 0, its 'address' relative to current
> cpu %gs will be some number between [0 and 2^20-1]
> 
> For a variable located in chunk 1, its 'address' relative to current
> cpu %gs will be some number between
> [YYYYYYYYYYYYYY - XXXXXX  and YYYYYYYYYYYYYY - XXXXXX + 2^20 - 1],
> not necessarly [2^20 to 2^21 - 1]
> 
> 
> Chunk 0 would use normal memory (no vmap TLB cost), only next ones need vmalloc().
> 
> So the extra TLB cost would only be taken for very special NUMA setups
> (only if using a lot of percpu allocations)
> 
> Also, using a 2Mb page granularity probably wastes about 2Mb per cpu, but
> this is nothing for NUMA machines :)
> 
> 2) SMP && !NUMA
> 
> On non NUMA machines, we dont need vmalloc games, since we can allocate
> chunk space using contiguous memory, (size = nr_possible_cpus*64Kbytes)
> 
> offset_of_cpu[N] = N * CHUNK_SIZE
> 
> (On a 4 CPU x86_32 machine, allocate a 256 Kbyte bloc then divide it in
> 64 kb blocs)
> If this order-6 allocation fails, then fallback to vmalloc(), but most
> percpu allocations happens at boot time, when memory is not yet fragmented...
> 
> 
> 3) UP case : fallback to standard allocators. No need for bitmaps.
> 
> 
> NUMA special casing can be implemented later of course...
> 
> Thanks for reading
> 
> 

Wow!  Thanks for the detail!  It's extremely useful (to me at least)
to see it spelled out.

Since Christoph is still on vacation I'll try to summarize where we are
at the moment.  (Besides being stuck on a boot up problem with the
%gs based percpu variables, that is. ;-)

Yes, the problem is we need to use virtual addresses to expand the
percpu areas since each cpu needs the same fixed offset to the newly
allocated variables.  This was in the prior (v2) version of cpu_alloc
so I'm looking at pulling that forward.  And I also figured that the
size of the expansion allocations should be based on the system size
to minimize the effect on small systems (seems to be my life the
past 6 months... ;-)

I'm also looking at integrating more into the already present
infrastructure (thanks Rusty!) so there are less "diffs" (and less
new testing needed.)  And of course, there's the complexities
of submitting patches to many architectures simultaneously.

Hopefully, I'll have something for review soon.

Thanks again,
Mike


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-06-06  5:33     ` Eric Dumazet
  2008-06-06 13:08       ` Mike Travis
@ 2008-06-08  6:00       ` Rusty Russell
  2008-06-09 18:44       ` Christoph Lameter
  2 siblings, 0 replies; 139+ messages in thread
From: Rusty Russell @ 2008-06-08  6:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mike Travis, Christoph Lameter, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Peter Zijlstra

On Friday 06 June 2008 15:33:22 Eric Dumazet wrote:
> 1) NUMA case
>
> For a 64 bit NUMA arch, chunk size of 2Mbytes
>
> Allocates 2Mb for each possible processor (on its preferred memory
> node), and compute values to setup offset_of_cpu[NR_CPUS] array.
>
> Chunk 0
> CPU 0 : virtual address XXXXXX
> CPU 1 : virtual address XXXXXX + offset_of_cpu[1]
> ...
> CPU n : virtual address XXXXXX + offset_of_cpu[n]
> + a shared bitmap
>
>
> For next chunks, we could use vmalloc() zone to find
> nr_possible_cpus virtual addresses ranges where you can map
> a 2Mb page per possible cpu, as long as we respect the relative
> delta between each cpu block, that was computed when
> chunk 0 was setup.
>
> Chunk 1..n
> CPU 0 : virtual address YYYYYYYYYYYYYY
> CPU 1 : virtual address YYYYYYYYYYYYYY + offset_of_cpu[1]
> ...
> CPU n : virtual address YYYYYYYYYYYYYY + offset_of_cpu[n]
> + a shared bitmap (32Kbytes if 8 bytes granularity in allocator)
>
> For a variable located in chunk 0, its 'address' relative to current
> cpu %gs will be some number between [0 and 2^20-1]
>
> For a variable located in chunk 1, its 'address' relative to current
> cpu %gs will be some number between
> [YYYYYYYYYYYYYY - XXXXXX  and YYYYYYYYYYYYYY - XXXXXX + 2^20 - 1],
> not necessarly [2^20 to 2^21 - 1]
>
>
> Chunk 0 would use normal memory (no vmap TLB cost), only next ones need
> vmalloc().
>
> So the extra TLB cost would only be taken for very special NUMA setups
> (only if using a lot of percpu allocations)
>
> Also, using a 2Mb page granularity probably wastes about 2Mb per cpu, but
> this is nothing for NUMA machines :)

If you're prepared to have mappings for chunk 0, you can simply make it 
virtually linear and creating a new chunk is simple.  If not, you need to 
reserve the virtual address space(s) for future mappings.  Otherwise you're 
unlikely to get the same layout for allocations.

This is not a show-stopper: we've lived with limited vmalloc room since 
forever.  It just has to be sufficient.

Otherwise, your analysis is correct, if a little verbose :)

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-06-06  5:33     ` Eric Dumazet
  2008-06-06 13:08       ` Mike Travis
  2008-06-08  6:00       ` Rusty Russell
@ 2008-06-09 18:44       ` Christoph Lameter
  2008-06-09 19:11         ` Andi Kleen
  2 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-09 18:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Peter Zijlstra, Rusty Russell

On Fri, 6 Jun 2008, Eric Dumazet wrote:

> Please forgive me if I beat a dead horse, but this percpu stuff
> should find its way.

Definitely, and it's very complex, so any more eyes on this are appreciated.

> I wonder why you absolutely want to have only one chunk holding
> all percpu variables, static(vmlinux) & static(modules)
> & dynamically allocated.
> 
> Its *not* possible to put an arbitrary limit to this global zone.
> You'll allways find somebody to break this limit. This is the point
> we must solve, before coding anything.

The problem is that offsets relative to %gs or %fs are limited by the
small memory model that is chosen. We cannot have an offset larger than
2GB. So we must have a linear address range and cannot use separate chunks
of memory. If we do not use the segment register then we cannot do atomic
(wrt interrupt) cpu ops.

> Have you considered using a list of fixed size chunks, each chunk
> having its own bitmap ?

Mike has done so and then I had to tell him what I just told you.

> On x86_64 && NUMA we could use 2 Mbytes chunks, while
> on x86_32 or non NUMA we should probably use 64 Kbytes.

Right that is what cpu_alloc v2 did. It created a virtual mapping and 
populated it on demand with 2MB PMD entries.

> I understand you want to offset percpu data to 0, but for
> static percpu data. (pda being included in, to share %gs)
> 
> For dynamically allocated percpu variables (including modules
> ".data.percpu"), nothing forces you to have low offsets,
> relative to %gs/%fs register. Access to these variables
> will be register indirect based anyway (eg %gs:(%rax) )

The relative-to-0 stuff comes in at the x86_64 level because we want to
unify pda and percpu accesses. pda accesses have been relative to 0, and in
particular the stack canary in glibc directly accesses the pda at a
certain offset. So we must be zero based in order to preserve
compatibility with glibc.

> Chunk 0 would use normal memory (no vmap TLB cost), only next ones need vmalloc().

Normal memory uses 2MB TLB entries. There is therefore no overhead in mapping
the percpu areas with 2MB TLB entries. So we do not need to be that complicated.

What v2 did was allocate an area of n * MAX_VIRT_PER_CPU_SIZE in vmalloc
space and then dynamically populate 2MB segments as needed. The MAX
size was 128MB or so.

We could either do the same on i386 or use 4kb mappings (then we can 
directly use the vmalloc functionality). But then there would be 
additional TLB overhead.

We have similar 2MB virtual mapping tricks for the virtual memmap. 
Basically we can copy the functions and customize them for the virtual per 
cpu areas (Mike is hopefully listening and reading the V2 patch ....)


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-05 23:59                   ` Rusty Russell
@ 2008-06-09 19:00                     ` Christoph Lameter
  2008-06-09 23:27                       ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-09 19:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Fri, 6 Jun 2008, Rusty Russell wrote:

> > Also, the above cpu_local_wrap(...) adds:
> >
> > 	#define cpu_local_wrap(l)               \
> > 	({                                      \
> > 	        preempt_disable();              \
> > 	        (l);                            \
> > 	        preempt_enable();               \
> > 	})                                      \
> >
> > ... and there isn't a non-preemption version that I can find.
> 
> Yes, this should be fixed.  I thought i386 had optimized versions pre-merge, 
> but I was wrong (%gs for per-cpu came later, and noone cleaned up these naive 
> versions).  Did you want me to write them?

How can that be fixed? You have no atomic instruction that calculates the 
per cpu address in one go. And as long as that is the case you need to 
disable preempt. Otherwise you may increment the per cpu variable of 
another processor because the process was rescheduled after the address 
was calculated but before the increment was done.

> > One other distinction is CPU_INC increments an arbitrary sized variable
> > while local_inc requires a local_t variable.  This may not make it usable
> > in all cases.
> 
> You might be right, but note that local_t is 64 bit on 64-bit platforms.  And 
> speculation of possible use cases isn't a good reason to rip out working 
> infrastructure :)

It's fundamentally broken because of the preemption issue. This is
also why local_t is rarely used.



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-06-09 18:44       ` Christoph Lameter
@ 2008-06-09 19:11         ` Andi Kleen
  2008-06-09 20:15           ` Eric Dumazet
  0 siblings, 1 reply; 139+ messages in thread
From: Andi Kleen @ 2008-06-09 19:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, Mike Travis, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Peter Zijlstra, Rusty Russell

Christoph Lameter <clameter@sgi.com> writes:

> The problem is that offsets relative to %gs or %fs are limited by the 
> small memory model that is chosen.

Actually they are not. If you really want you can do 
movabs $64bit,%reg ; op ...,%gs:(%reg) 
It's just not very efficient compared to small (or rather kernel) model
and also older binutils didn't support large model.

-Andi

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-06-09 19:11         ` Andi Kleen
@ 2008-06-09 20:15           ` Eric Dumazet
  0 siblings, 0 replies; 139+ messages in thread
From: Eric Dumazet @ 2008-06-09 20:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, Mike Travis, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Peter Zijlstra, Rusty Russell

Andi Kleen wrote:
> Christoph Lameter <clameter@sgi.com> writes:
> 
>> The problem is that offsets relative to %gs or %fs are limited by the 
>> small memory model that is chosen.
> 
> Actually they are not. If you really want you can do 
> movabs $64bit,%reg ; op ...,%gs:(%reg) 
> It's just not very efficient compared to small (or rather kernel) model
> and also older binutils didn't support large model.
> 

I am not sure Christoph was referring to actual instructions.

I was suggesting using for static percpu (vmlinux or modules) :

vmlinux : (offset31 computed by linker at vmlinux link edit time)
incl  %gs:offset31

modules : (offset31 computed at module load time by module loader)
incl %gs:offset31

(If we make sure all this stuff is allocated in first chunk)

And for dynamic percpu :

movq   field(%rdi),%rax
incl    %gs:(%rax)   /* full 64bits 'offsets' */

I understood (but might be wrong again) that %gs itself could not be used with an offset > 2GB, because
of the way the %gs segment is set up. So in the 'dynamic percpu' case, %rax should not exceed 2^31.
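
At the C level the dynamic case corresponds roughly to the sketch below
(names are illustrative, not the patchset's API; a real object would get
its per cpu offset from the cpu allocator):

struct my_obj {
	unsigned long pcpu_offset;	/* the 'field' in the asm above */
};

static inline void my_obj_inc_stat(struct my_obj *obj)
{
	/* movq field(%rdi),%rax ; incl %gs:(%rax) */
	asm volatile("incl %%gs:(%0)"
		     : : "r" (obj->pcpu_offset)
		     : "memory");
}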






^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-04 18:18                 ` Mike Travis
  2008-06-05 23:59                   ` Rusty Russell
@ 2008-06-09 23:09                   ` Christoph Lameter
  1 sibling, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-06-09 23:09 UTC (permalink / raw)
  To: Mike Travis
  Cc: Rusty Russell, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Wed, 4 Jun 2008, Mike Travis wrote:

> 0000000000000013 <test_local_inc>:
>   13:   55                      push   %rbp
>   14:   65 48 8b 05 00 00 00    mov    %gs:0(%rip),%rax        # 1c <test_local_inc+0x9>
>   1b:   00
>   1c:   48 89 e5                mov    %rsp,%rbp
>   1f:   48 ff 04 07             incq   (%rdi,%rax,1)
>   23:   c9                      leaveq
>   24:   c3                      retq

Note also that the address calculation occurs before the incq. That is why 
disabling preemption is required; otherwise the processor may change 
between the determination of the per cpu area address and the increment.

The local_t operations could be modified to avoid the preemption issues 
with the zero based patches applied. Then there would still be the 
inflexibility of not being able to increment an arbitrary variable.

I think it is also bad to treat a per cpu variable like an atomic. It's not 
truly atomic, nor are strictly atomic accesses used. It is fine to use 
regular operations on the per cpu variable provided one has either 
disabled preemption or interrupts. The per cpu atomic-wrt-interrupt ops 
are only useful when preemption and/or interrupts are off.
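
To illustrate the point about regular operations (a sketch; my_stat is a
hypothetical per cpu variable):

#include <linux/percpu.h>

DEFINE_PER_CPU(unsigned long, my_stat);		/* hypothetical */

/*
 * In a section that already runs with preemption (or interrupts) disabled,
 * a plain access to the per cpu variable is fine and needs no special
 * macro.
 */
static void update_stat(void)		/* caller has preemption disabled */
{
	__get_cpu_var(my_stat)++;
}

/*
 * Only in preemptible context is something more needed: either an explicit
 * preempt_disable()/preempt_enable() pair or a single-instruction operation
 * like the patchset's CPU_INC().
 */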


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-09 19:00                     ` Christoph Lameter
@ 2008-06-09 23:27                       ` Rusty Russell
  2008-06-09 23:54                         ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-09 23:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Tuesday 10 June 2008 05:00:36 Christoph Lameter wrote:
> On Fri, 6 Jun 2008, Rusty Russell wrote:
> > > Also, the above cpu_local_wrap(...) adds:
> > >
> > > 	#define cpu_local_wrap(l)               \
> > > 	({                                      \
> > > 	        preempt_disable();              \
> > > 	        (l);                            \
> > > 	        preempt_enable();               \
> > > 	})                                      \
> > >
> > > ... and there isn't a non-preemption version that I can find.
> >
> > Yes, this should be fixed.  I thought i386 had optimized versions
> > pre-merge, but I was wrong (%gs for per-cpu came later, and noone cleaned
> > up these naive versions).  Did you want me to write them?
>
> How can that be fixed? You have no atomic instruction that calculates the
> per cpu address in one go.

Huh?  "incl %fs:varname" does exactly this.

> And as long as that is the case you need to 
> disable preempt. Otherwise you may increment the per cpu variable of
> another processor because the process was rescheduled after the address
> was calculated but before the increment was done.

But of course, that is not a problem.  You make local_t an atomic_t, and then 
it doesn't matter which CPU you incremented.
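
For reference, the generic implementation already does essentially this
(a trimmed sketch in the style of asm-generic/local.h of that era):

#include <asm/atomic.h>

typedef struct {
	atomic_long_t a;
} local_t;

#define local_inc(l)	atomic_long_inc(&(l)->a)

/*
 * With this representation it does not matter which CPU the increment
 * lands on: the operation is atomic, so a counter summed over all CPUs
 * stays correct even if the task migrated between address calculation
 * and increment.
 */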

By definition if the caller cared, they would have had preemption disabled.

Hope that clarifies,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-09 23:27                       ` Rusty Russell
@ 2008-06-09 23:54                         ` Christoph Lameter
  2008-06-10  2:56                           ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-09 23:54 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Tue, 10 Jun 2008, Rusty Russell wrote:

> > > Yes, this should be fixed.  I thought i386 had optimized versions
> > > pre-merge, but I was wrong (%gs for per-cpu came later, and noone cleaned
> > > up these naive versions).  Did you want me to write them?
> >
> > How can that be fixed? You have no atomic instruction that calculates the
> > per cpu address in one go.
> 
> Huh?  "incl %fs:varname" does exactly this.

Right, that is what the cpu alloc patches do. So you could implement 
cpu_local_inc on top of some of the cpu alloc patches.

> > And as long as that is the case you need to 
> > disable preempt. Otherwise you may increment the per cpu variable of
> > another processor because the process was rescheduled after the address
> > was calculated but before the increment was done.
> 
> But of course, that is not a problem.  You make local_t an atomic_t, and then 
> it doesn't matter which CPU you incremented.

But then the whole point of local_t is gone. Why not use atomic_t in the 
first place?
 
> By definition if the caller cared, they would have had premption disabled.

There are numerous instances where the caller does not care about 
preemption. It's just important that a per cpu counter is incremented in 
the least intrusive way. See e.g. the VM event counters.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-09 23:54                         ` Christoph Lameter
@ 2008-06-10  2:56                           ` Rusty Russell
  2008-06-10  3:18                             ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-10  2:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Tuesday 10 June 2008 09:54:09 Christoph Lameter wrote:
> On Tue, 10 Jun 2008, Rusty Russell wrote:
> > > > Yes, this should be fixed.  I thought i386 had optimized versions
> > > > pre-merge, but I was wrong (%gs for per-cpu came later, and noone
> > > > cleaned up these naive versions).  Did you want me to write them?
> > >
> > > How can that be fixed? You have no atomic instruction that calculates
> > > the per cpu address in one go.
> >
> > Huh?  "incl %fs:varname" does exactly this.
>
> Right that is what the cpu alloc patches do. So you could implement
> cpu_local_inc on top of some of the cpu alloc patches.

Or you could just implement it today as a standalone patch.

> > > And as long as that is the case you need to
> > > disable preempt. Otherwise you may increment the per cpu variable of
> > > another processor because the process was rescheduled after the address
> > > was calculated but before the increment was done.
> >
> > But of course, that is not a problem.  You make local_t an atomic_t, and
> > then it doesn't matter which CPU you incremented.
>
> But then the whole point of local_t is gone. Why not use atomic_t in the
> first place?

Because some archs can do better.

> > By definition if the caller cared, they would have had premption
> > disabled.
>
> There are numerous instances where the caller does not care about
> preemption. Its just important that one per cpu counter is increment in
> the least intrusive way. See f.e. the VM event counters.

Yes, and that's exactly the point.  The VM event counters are exactly a case 
where you should have used cpu_local_inc.

Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-10  2:56                           ` Rusty Russell
@ 2008-06-10  3:18                             ` Christoph Lameter
  2008-06-11  0:03                               ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-10  3:18 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Tue, 10 Jun 2008, Rusty Russell wrote:

> > Right that is what the cpu alloc patches do. So you could implement
> > cpu_local_inc on top of some of the cpu alloc patches.
> 
> Or you could just implement it today as a standalone patch.

You need at least the zero basing to enable the use of the segment 
register on x86_64.

> > But then the whole point of local_t is gone. Why not use atomic_t in the
> > first place?
> 
> Because some archs can do better.

The argument does not make any sense. First you want to use atomic_t, then 
you do not?

> > > By definition if the caller cared, they would have had premption
> > > disabled.
> >
> > There are numerous instances where the caller does not care about
> > preemption. Its just important that one per cpu counter is increment in
> > the least intrusive way. See f.e. the VM event counters.
> 
> Yes, and that's exactly the point.  The VM event counters are exactly a case 
> where you should have used cpu_local_inc.

I tried it and it did not give any benefit; at first it even failed due to 
bugs because local_t did not disable preemption... This led to Andi fixing 
local_t.

But with the preempt disabling I could not discern what the benefit 
would be.

CPU_INC does not require disabling of preempt and the cpu alloc patches 
shorten the code sequence to increment a VM counter significantly.

Here is the header from the patch. How would cpu_local_inc be able to do 
better unless you adopt this patchset and add a shim layer?

Subject: VM statistics: Use CPU ops

The use of CPU ops here avoids the offset calculations that we used to 
have to do with per cpu operations. The result of this patch is that event 
counters are coded with a single instruction the following way:

        incq   %gs:offset(%rip)

Without these patches this was:

        mov    %gs:0x8,%rdx
        mov    %eax,0x38(%rsp)
        mov    xxx(%rip),%eax
        mov    %eax,0x48(%rsp)
        mov    varoffset,%rax
        incq   0x110(%rax,%rdx,1)
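
At the C level the conversion is roughly the following (a hedged sketch
paraphrasing mm/vmstat.c and the patch; the exact spelling of the CPU_INC
invocation in the patchset may differ):

#include <linux/percpu.h>
#include <linux/vmstat.h>	/* enum vm_event_item, vm_event_states */

/* Before: per cpu offset lookup plus explicit preemption handling */
static inline void count_vm_event_old(enum vm_event_item item)
{
	get_cpu_var(vm_event_states).event[item]++;
	put_cpu_var(vm_event_states);
}

/* After: one segment-relative increment, no preempt_disable() needed */
static inline void count_vm_event_new(enum vm_event_item item)
{
	CPU_INC(vm_event_states.event[item]);	/* incq %gs:offset(%rip) */
}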



^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [patch 01/41] cpu_alloc: Increase percpu area size to 128k
  2008-06-02 17:58   ` Luck, Tony
  2008-06-02 23:48     ` Rusty Russell
@ 2008-06-10 17:22     ` Christoph Lameter
  2008-06-10 19:54       ` Luck, Tony
  1 sibling, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-10 17:22 UTC (permalink / raw)
  To: Luck, Tony
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Mon, 2 Jun 2008, Luck, Tony wrote:

> > The per cpu allocator requires more per cpu space and we are already near
> > the limit on IA64. Increase the maximum size of the IA64 per cpu area from
> > 64K to 128K.
> 
> > -#define PERCPU_PAGE_SHIFT	16	/* log2() of max. size of per-CPU area */
> > +#define PERCPU_PAGE_SHIFT	17	/* log2() of max. size of per-CPU area */
> 
> Don't you need some more changes to the alt_dtlb_miss handler in
> ivt.S for this to work?  128K is not a supported pagesize on any
> processor model.

Ok so this needs to be 18?


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-06-04 14:58       ` Mike Travis
  2008-06-04 15:11         ` Eric Dumazet
@ 2008-06-10 17:33         ` Christoph Lameter
  2008-06-10 18:05           ` Eric Dumazet
  1 sibling, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-10 17:33 UTC (permalink / raw)
  To: Mike Travis
  Cc: Eric Dumazet, akpm, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell

On Wed, 4 Jun 2008, Mike Travis wrote:

> I'm a bit confused.  Why is DEFINE_PER_CPU_SHARED_ALIGNED() conditioned on
> ifdef MODULE?
> 
>         #ifdef MODULE
>         #define SHARED_ALIGNED_SECTION ".data.percpu"
>         #else
>         #define SHARED_ALIGNED_SECTION ".data.percpu.shared_aligned"
>         #endif
> 
>         #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)                       \
>                 __attribute__((__section__(SHARED_ALIGNED_SECTION)))            \
>                 PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name             \
>                 ____cacheline_aligned_in_smp

Looks wrong to me. There can be shared objects even without modules.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-06-04 15:04     ` Mike Travis
@ 2008-06-10 17:34       ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-06-10 17:34 UTC (permalink / raw)
  To: Mike Travis
  Cc: Rusty Russell, akpm, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra

On Wed, 4 Jun 2008, Mike Travis wrote:

> I haven't seen any further discussion on these aspects... is there a consensus
> to remove the flags from CPU_ALLOC() and use a mutex?

We want to have extensible per cpu areas. This means you need an 
allocation context. So we need to keep the flags. A mutex is not a bad idea.
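
To make the point about the allocation context concrete, a hypothetical
sketch (cpu_alloc_mutex and find_or_extend_percpu_area are invented names
for illustration; the real allocator in patch 02/41 is structured
differently):

#include <linux/gfp.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(cpu_alloc_mutex);		/* the suggested mutex */

void *cpu_alloc_sketch(unsigned long size, gfp_t gfpflags, unsigned long align)
{
	void *ret;

	mutex_lock(&cpu_alloc_mutex);
	/*
	 * Growing the per cpu area may require allocating and mapping new
	 * backing pages, which is why the caller's gfp flags have to be
	 * passed down rather than dropped.
	 */
	ret = find_or_extend_percpu_area(size, align, gfpflags); /* hypothetical */
	mutex_unlock(&cpu_alloc_mutex);

	return ret;
}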



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-02  2:00               ` Rusty Russell
  2008-06-04 18:18                 ` Mike Travis
@ 2008-06-10 17:42                 ` Christoph Lameter
  2008-06-11 11:10                   ` Rusty Russell
  1 sibling, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-10 17:42 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Mon, 2 Jun 2008, Rusty Russell wrote:

> > Believe me I have tried to use local_t repeatedly for vm statistics etc.
> > It always fails on that issue.
> 
> Frankly, I am finding it increasingly easy to believe that you failed.  But 
> you are blaming the wrong thing.
> 
> There are three implementations of local_t which are obvious.  The best is for 
> architectures which can locate and increment a per-cpu var in one instruction 
> (eg. x86).  Otherwise, using atomic_t/atomic64_t for local_t provides a 
> general solution.  The other general solution would involve 
> local_irq_disable()/increment/local_irq_enable().
> 
> My (fading) hope is that this idiocy is an abberation,

1. The x86 implementation does not exist because the segment register has 
   so far not been available on x86_64, so that solution was not possible.
   You need the zero basing; then you can use per_xxx_add in cpu_inc
   (see the sketch after this list).

2. The general solution created overhead that is often not needed. If we
   had done the vm event counters with local_t then we would have had
   atomic overhead for each increment on e.g. IA64. That was not
   acceptable. cpu_alloc never falls back to atomic operations.

3. local_t is based on the atomic logic. But percpu handling is 
   fundamentally different in that accesses without the special macros
   are okay provided you are in a non-preemptible or irq context!
   A local_t declaration makes such accesses impossible.

4. The modeling of local_t on atomic_t limits it to 32bit! There is no
   way to use this with pointers or 64 bit entities. Adding that would 
   duplicate the API for each type added.
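
A minimal sketch of what the zero basing in point 1 enables on x86_64
(illustrative macros, not the patchset's actual cpu ops):

/*
 * With a zero-based per cpu area, %gs points at the current CPU's area and
 * the variable's link-time address doubles as its %gs-relative offset, so
 * the read-modify-write is a single instruction and needs no
 * preempt_disable().
 */
#define my_cpu_inc(var)						\
	asm volatile("incq %%gs:%0" : "+m" (var))

#define my_cpu_add(var, val)					\
	asm volatile("addq %1, %%gs:%0"				\
		     : "+m" (var)				\
		     : "er" ((unsigned long)(val)))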

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-06-10 17:33         ` Christoph Lameter
@ 2008-06-10 18:05           ` Eric Dumazet
  2008-06-10 18:28             ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Eric Dumazet @ 2008-06-10 18:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mike Travis, akpm, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell

Christoph Lameter a écrit :
> On Wed, 4 Jun 2008, Mike Travis wrote:
> 
>> I'm a bit confused.  Why is DEFINE_PER_CPU_SHARED_ALIGNED() conditioned on
>> ifdef MODULE?
>>
>>         #ifdef MODULE
>>         #define SHARED_ALIGNED_SECTION ".data.percpu"
>>         #else
>>         #define SHARED_ALIGNED_SECTION ".data.percpu.shared_aligned"
>>         #endif
>>
>>         #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)                       \
>>                 __attribute__((__section__(SHARED_ALIGNED_SECTION)))            \
>>                 PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name             \
>>                 ____cacheline_aligned_in_smp
> 
> Looks wrong to me. There can be shared objects even without modules.
> 
> 

Well, MODULE is not CONFIG_MODULES :)

If compiling an object that is going to be statically linked into the kernel, 
MODULE is not defined, so we have shared objects.

When compiling a module, we cannot *yet* use the .data.percpu.shared_aligned 
section, since the module loader won't handle that section.

The alternative is to change module linking for all arches to merge the 
.data.percpu{*} subsections correctly, or to tell the module loader to take 
all .data.percpu sections into account.

AFAIK no module uses DEFINE_PER_CPU_SHARED_ALIGNED() yet...




^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-06-10 18:05           ` Eric Dumazet
@ 2008-06-10 18:28             ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-06-10 18:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mike Travis, akpm, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell

On Tue, 10 Jun 2008, Eric Dumazet wrote:

> Well, MODULE is not CONFIG_MODULES :)
> 
> If compiling an object that is going to be statically linked to kernel, MODULE
> is not defined, so we have shared objects.
> 
> When compiling a module, we cannot *yet* use .data.percpu.shared_aligned
> section, since module loader wont handle this section.
> 
> Alternative is to change modules linking for all arches to merge
> .data.percpu{*} subsections correctly, or tell module loader to take into
> account all .data.percpu sections.
> 
> AFAIK no module uses DEFINE_PER_CPU_SHARED_ALIGNED() yet...

Ahhh. Makes sense. Add a comment to explain this?
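
Something like the following would presumably do (a sketch against the
macro quoted above):

/*
 * Modules are linked with only a plain .data.percpu section; the module
 * loader does not (yet) know about the .data.percpu.shared_aligned
 * subsection, so fall back to the plain section when building a module.
 * Once module linking/loading handles the subsection, this #ifdef can go.
 */
#ifdef MODULE
#define SHARED_ALIGNED_SECTION ".data.percpu"
#else
#define SHARED_ALIGNED_SECTION ".data.percpu.shared_aligned"
#endif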


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [patch 01/41] cpu_alloc: Increase percpu area size to 128k
  2008-06-10 17:22     ` Christoph Lameter
@ 2008-06-10 19:54       ` Luck, Tony
  0 siblings, 0 replies; 139+ messages in thread
From: Luck, Tony @ 2008-06-10 19:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

> > > -#define PERCPU_PAGE_SHIFT	16	/* log2() of max. size of per-CPU area */
> > > +#define PERCPU_PAGE_SHIFT	17	/* log2() of max. size of per-CPU area */
> > 
> > Don't you need some more changes to the alt_dtlb_miss handler in
> > ivt.S for this to work?  128K is not a supported pagesize on any
> > processor model.
>
> Ok so this needs to be 18?

Yes. 18 will work (256K is an architected page size, guaranteed to be supported
by all processor models ... SDM 2:52 table 4-4).
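
So the fixed-up define would presumably read (sketch):

/* 256K is an architected IA64 page size, so the alt_dtlb_miss handler can
 * still map the per cpu area with a single translation. */
#define PERCPU_PAGE_SHIFT	18	/* log2() of max. size of per-CPU area */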

-Tony


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-10  3:18                             ` Christoph Lameter
@ 2008-06-11  0:03                               ` Rusty Russell
  2008-06-11  0:15                                 ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-11  0:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Tuesday 10 June 2008 13:18:25 Christoph Lameter wrote:
> On Tue, 10 Jun 2008, Rusty Russell wrote:
> > > Right that is what the cpu alloc patches do. So you could implement
> > > cpu_local_inc on top of some of the cpu alloc patches.
> >
> > Or you could just implement it today as a standalone patch.
>
> You need at least the zero basing to enable the use of the segment
> register on x86_64.

Indeed.  Works for i386 as is, but 64 bit will need that patch.

> > > But then the whole point of local_t is gone. Why not use atomic_t in
> > > the first place?
> >
> > Because some archs can do better.
>
> The argument does not make any sense. First you want to use atomic_t then
> not?

You're being obtuse.  See previous mail about the three possible 
implementations of local_t, and the comment in asm-generic/local.h.

The paths forward are clear:
1) Improve x86 local_t (mostly orthogonal to the others, but useful).
2) Implement extensible per-cpu areas.
3) Generalize per-cpu accessors.
4) Extend or replace the module.c per-cpu allocator to alloc from the other
   areas.
5) Convert alloc_percpu et al. to use the new code.

Hope that clarifies,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-11  0:03                               ` Rusty Russell
@ 2008-06-11  0:15                                 ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-06-11  0:15 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Wed, 11 Jun 2008, Rusty Russell wrote:

> You're being obtuse.  See previous mail about the three possible 
> implementations of local_t, and the comment in asm-generic/local.h.

OK. I hope I responded in the other email in a more intelligent 
fashion.

> The paths forward are clear:
> 1) Improve x86 local_t (mostly orthogonal to the others, but useful).

Not sure about that. It's rarely used, and the more general cpu alloc stuff 
can be used in lots of places, as is evident from the rest of the patchset. 
But let's leave it if it's important for some reason.

> 2) Implement extensible per-cpu areas.
> 3) Generalize per-cpu accessors.
> 4) Extend or replace the module.c per-cpu allocator to alloc from the other
>    areas.
> 5) Convert alloc_percpu et al. to use the new code.

Yes thanks. We are mostly on the same wavelength.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-10 17:42                 ` Christoph Lameter
@ 2008-06-11 11:10                   ` Rusty Russell
  2008-06-11 23:39                     ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-11 11:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Wednesday 11 June 2008 03:42:15 Christoph Lameter wrote:
> 1. The x86 implementation does not exist because the segment register has
>    so far not been available on x86_64. So you could not do the solution.
>    You need the zero basing. Then you can use per_xxx_add in cpu_inc.

Yes: for 64 bit x86, getting rid of the PDA or zero-basing is required.

> 2. The general solution created overhead that is often not needed. If we
>    would have done vm event counters with local_t then we would have
>    atomic overhead for each increment on f.e. IA64. That was not
>    acceptable. cpu_alloc never falls back to atomic operations.

You can implement it either way.  I've said that three times now.  The current 
generic one uses atomics, but preempt disable/enable is possible.

> 3. local_t is based on the atomic logic. But percpu handling is
>    fundamentally different in that accesses without the special macros
>    are okay provided you are in a non preemptible or irq context!
>    A local_t declaration makes such accesses impossible.

Again, untrue.  The interface is already there.  So feel free to implement 
__cpu_local_inc et al in terms of preempt enable and disable so it doesn't 
need to use atomics.  

> 4. The modeling of local_t on atomic_t limits it to 32bit!

Again wrong.  And adding an exclamation mark doesn't make it true.

Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-11 11:10                   ` Rusty Russell
@ 2008-06-11 23:39                     ` Christoph Lameter
  2008-06-12  0:58                       ` Nick Piggin
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-11 23:39 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Wed, 11 Jun 2008, Rusty Russell wrote:

> > 4. The modeling of local_t on atomic_t limits it to 32bit!
> 
> Again wrong.  And adding an exclamation mark doesn't make it true.

Ewww ... It's atomic_long_t, ahh. OK, then there is no 32 bit support. What about pointers?


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-11 23:39                     ` Christoph Lameter
@ 2008-06-12  0:58                       ` Nick Piggin
  2008-06-12  2:44                         ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Nick Piggin @ 2008-06-12  0:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rusty Russell, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra, Mike Travis

On Thursday 12 June 2008 09:39, Christoph Lameter wrote:
> On Wed, 11 Jun 2008, Rusty Russell wrote:
> > > 4. The modeling of local_t on atomic_t limits it to 32bit!
> >
> > Again wrong.  And adding an exclamation mark doesn't make it true.
>
> Ewww ... Its atomic_long_t ahh. Ok then there no 32 bit support. What about
> pointers?

sizeof(long) == sizeof(void *) in Linux, right?

If you were to support just a single data type, long would probably
be the most useful. Still, it might be more consistent to support
int and long, same as atomic.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-12  0:58                       ` Nick Piggin
@ 2008-06-12  2:44                         ` Rusty Russell
  2008-06-12  3:40                           ` Nick Piggin
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-12  2:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra, Mike Travis,
	Martin Peschke

On Thursday 12 June 2008 10:58:01 Nick Piggin wrote:
> On Thursday 12 June 2008 09:39, Christoph Lameter wrote:
> > On Wed, 11 Jun 2008, Rusty Russell wrote:
> > > > 4. The modeling of local_t on atomic_t limits it to 32bit!
> > >
> > > Again wrong.  And adding an exclamation mark doesn't make it true.
> >
> > Ewww ... Its atomic_long_t ahh. Ok then there no 32 bit support. What
> > about pointers?
>
> sizeof(long) == sizeof(void *) in Linux, right?
>
> If you were to support just a single data type, long would probably
> be the most useful. Still, it might be more consistent to support
> int and long, same as atomic.

Sure, but in practice these tend to be simple counters: that could well change 
when dynamic percpu allocs become first class citizens, but let's not put the 
cart before the horse...

Per-cpu seems to be particularly prone to over-engineering: see commit 
7ff6f08295d90ab20d25200ef485ebb45b1b8d71 from almost two years ago.  Grepping 
here reveals that this infrastructure is still not used.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-12  2:44                         ` Rusty Russell
@ 2008-06-12  3:40                           ` Nick Piggin
  2008-06-12  9:37                             ` Martin Peschke
  0 siblings, 1 reply; 139+ messages in thread
From: Nick Piggin @ 2008-06-12  3:40 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Lameter, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra, Mike Travis,
	Martin Peschke

On Thursday 12 June 2008 12:44, Rusty Russell wrote:
> On Thursday 12 June 2008 10:58:01 Nick Piggin wrote:
> > On Thursday 12 June 2008 09:39, Christoph Lameter wrote:
> > > On Wed, 11 Jun 2008, Rusty Russell wrote:
> > > > > 4. The modeling of local_t on atomic_t limits it to 32bit!
> > > >
> > > > Again wrong.  And adding an exclamation mark doesn't make it true.
> > >
> > > Ewww ... Its atomic_long_t ahh. Ok then there no 32 bit support. What
> > > about pointers?
> >
> > sizeof(long) == sizeof(void *) in Linux, right?
> >
> > If you were to support just a single data type, long would probably
> > be the most useful. Still, it might be more consistent to support
> > int and long, same as atomic.
>
> Sure, but in practice these tend to be simple counters: that could well
> change when dynamic percpu allocs become first class citizens, but let's
> not put the cart before the horse...

Right, I was just responding to Christoph's puzzling question.


> Per-cpu seems to be particularly prone to over-engineering: see commit
> 7ff6f08295d90ab20d25200ef485ebb45b1b8d71 from almost two years ago. 
> Grepping here reveals that this infrastructure is still not used.

Hmm. Something like that needs the question asked "who uses this?"
before it is merged I guess. If it were a trivial patch maybe not,
but something like this that sits untested for so long is almost
broken by definition ;)

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-12  3:40                           ` Nick Piggin
@ 2008-06-12  9:37                             ` Martin Peschke
  2008-06-12 11:21                               ` Nick Piggin
  0 siblings, 1 reply; 139+ messages in thread
From: Martin Peschke @ 2008-06-12  9:37 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rusty Russell, Christoph Lameter, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Thu, 2008-06-12 at 13:40 +1000, Nick Piggin wrote:
> On Thursday 12 June 2008 12:44, Rusty Russell wrote:
> > On Thursday 12 June 2008 10:58:01 Nick Piggin wrote:
> > > On Thursday 12 June 2008 09:39, Christoph Lameter wrote:
> > > > On Wed, 11 Jun 2008, Rusty Russell wrote:
> > > > > > 4. The modeling of local_t on atomic_t limits it to 32bit!
> > > > >
> > > > > Again wrong.  And adding an exclamation mark doesn't make it true.
> > > >
> > > > Ewww ... Its atomic_long_t ahh. Ok then there no 32 bit support. What
> > > > about pointers?
> > >
> > > sizeof(long) == sizeof(void *) in Linux, right?
> > >
> > > If you were to support just a single data type, long would probably
> > > be the most useful. Still, it might be more consistent to support
> > > int and long, same as atomic.
> >
> > Sure, but in practice these tend to be simple counters: that could well
> > change when dynamic percpu allocs become first class citizens, but let's
> > not put the cart before the horse...
> 
> Right, I was just responding to Christoph's puzzling question.
> 
> 
> > Per-cpu seems to be particularly prone to over-engineering: see commit
> > 7ff6f08295d90ab20d25200ef485ebb45b1b8d71 from almost two years ago. 
> > Grepping here reveals that this infrastructure is still not used.
> 
> Hmm. Something like that needs the question asked "who uses this?"
> before it is merged I guess. If it were a trivial patch maybe not,
> but something like this that sits untested for so long is almost
> broken by definition ;)

Some code of mine which didn't make it beyond -mm used this small
per-cpu extension. So the commit you refer to was tested.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-12  9:37                             ` Martin Peschke
@ 2008-06-12 11:21                               ` Nick Piggin
  2008-06-12 17:19                                 ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Nick Piggin @ 2008-06-12 11:21 UTC (permalink / raw)
  To: Martin Peschke
  Cc: Rusty Russell, Christoph Lameter, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Thursday 12 June 2008 19:37, Martin Peschke wrote:
> On Thu, 2008-06-12 at 13:40 +1000, Nick Piggin wrote:
> > On Thursday 12 June 2008 12:44, Rusty Russell wrote:

> > > Per-cpu seems to be particularly prone to over-engineering: see commit
> > > 7ff6f08295d90ab20d25200ef485ebb45b1b8d71 from almost two years ago.
> > > Grepping here reveals that this infrastructure is still not used.
> >
> > Hmm. Something like that needs the question asked "who uses this?"
> > before it is merged I guess. If it were a trivial patch maybe not,
> > but something like this that sits untested for so long is almost
> > broken by definition ;)
>
> Some code of mine which didn't make it beyond -mm used this small
> per-cpu extension. So the commit you refer to was tested.

Right, but it can easily rot after initial testing if it isn't
continually used.

Maybe this isn't the best example because maybe it still works
fine. But in general, unused, non-trivial code isn't good just
to leave around "just in case" IMO.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-12 11:21                               ` Nick Piggin
@ 2008-06-12 17:19                                 ` Christoph Lameter
  2008-06-13  0:38                                   ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-12 17:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Martin Peschke, Rusty Russell, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

Populating the per cpu areas on demand is a good thing, especially for 
configurations with a large number of processors. If we really go to 
support 4k processors by default then we need to allocate the smallest 
amount of per cpu structures necessary. Maybe ACPI or so can tell us how 
many processors are possible and we allocate only those. But it would be 
best if the percpu structures were only allocated for actually active 
processors.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-12 17:19                                 ` Christoph Lameter
@ 2008-06-13  0:38                                   ` Rusty Russell
  2008-06-13  2:27                                     ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-13  0:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Martin Peschke, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Friday 13 June 2008 03:19:51 Christoph Lameter wrote:
> Populating the per cpu areas on demand is a good thing especially for
> configurations with a large number of processors. If we really go to 
> support 4k processor by default then we need to allocate the smallest
> amount of per cpu structures necessary.  Maybe ACPI or so can tell us how 
> many processors are possible and we only allocate those. But it would be
> best if the percpu structures are only allocated for actually active
> processors.

cpu_possible_map should definitely be minimal, but your point is well made: 
dynamic percpu could actually cut memory allocation.  If we go for a hybrid 
scheme where static percpu is always allocated from the initial chunk, 
however, we still need the current pessimistic overallocation.

Mike's a clever guy, I'm sure he'll think of something :)
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-13  0:38                                   ` Rusty Russell
@ 2008-06-13  2:27                                     ` Christoph Lameter
  2008-06-15 10:33                                       ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-13  2:27 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Nick Piggin, Martin Peschke, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Fri, 13 Jun 2008, Rusty Russell wrote:

> cpu_possible_map should definitely be minimal, but your point is well made: 
> dynamic percpu could actually cut memory allocation.  If we go for a hybrid 
> scheme where static percpu is always allocated from the initial chunk, 
> however, we still need the current pessimistic overallocation.

The initial chunk would mean that the percpu areas all come from the same 
NUMA node. We really need to allocate from the node that is nearest to a 
processor (not all processors have processor local memory!).

It would be good to standardize the way that percpu areas are allocated. 
We have various ways of allocating them now in various arches. 
init/main.c:setup_per_cpu_areas() needs to be generalized (a rough sketch 
follows the list below):

1. Allocate the per cpu areas in a NUMA aware fashion.

2. Have a function for instantiating a single per cpu area that 
   can be used during cpu hotplug.

3. Some hooks for arches to override particular behavior as needed.
   E.g. IA64 allocates percpu structures in a special way and x86_64
   needs to do some tricks for the pda, etc.
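
A rough sketch of the direction points 1 and 3 suggest
(arch_alloc_percpu_area and setup_per_cpu_areas_sketch are illustrative
names, not existing code):

#include <linux/bootmem.h>
#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/string.h>
#include <linux/topology.h>
#include <asm/sections.h>

/* Default NUMA-aware backing allocation; arches would override this hook
 * (e.g. via a __weak definition) to apply their special requirements. */
void * __init arch_alloc_percpu_area(int cpu, unsigned long size)
{
	return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size,
				    PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
}

void __init setup_per_cpu_areas_sketch(void)
{
	unsigned long size = PERCPU_ENOUGH_ROOM;
	int cpu;

	for_each_possible_cpu(cpu) {
		void *area = arch_alloc_percpu_area(cpu, size);

		/* copy the static per cpu data into this CPU's area */
		memcpy(area, __per_cpu_start, __per_cpu_end - __per_cpu_start);
		__per_cpu_offset[cpu] = (unsigned long)area -
					(unsigned long)__per_cpu_start;
	}
}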

> Mike's a clever guy, I'm sure he'll think of something :)

Hopefully. Otherwise he will ask me =-).


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-13  2:27                                     ` Christoph Lameter
@ 2008-06-15 10:33                                       ` Rusty Russell
  2008-06-16 14:52                                         ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-15 10:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Martin Peschke, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Friday 13 June 2008 12:27:07 Christoph Lameter wrote:
> On Fri, 13 Jun 2008, Rusty Russell wrote:
> > cpu_possible_map should definitely be minimal, but your point is well
> > made: dynamic percpu could actually cut memory allocation.  If we go for
> > a hybrid scheme where static percpu is always allocated from the initial
> > chunk, however, we still need the current pessimistic overallocation.
>
> The initial chunk would mean that the percpu areas all come from the same
> NUMA node. We really need to allocate from the node that is nearest to a
> processor (not all processors have processor local memory!).

Yes, this is where it gets nasty.  We shouldn't even allocate the initial 
chunk in a non-NUMA aware way (I'm using the term chunk loosely, it's a chunk 
per cpu of course).

> It would be good to standardize the way that percpu areas are allocated.
> We have various ways of allocation now in various arches.
> init/main.c:setup_per_cpu_ares() needs to be generalized:
>
> 1. Allocate the per cpu areas in a NUMA aware fashions.

Definitely.  We also need to reserve virtual address space to create more 
areas with congruent mappings; that's the fun part.

Maybe a simpler non-NUMA variant too, but it's trivial if we want it.

> 2. Have a function for instantiating a single per cpu area that
>    can be used during cpu hotplug.

Unfortunately this breaks the current percpu semantics: that if you iterate 
over all possible cpus you can access percpu vars.  This means you don't need 
to have hotplug CPU notifiers for simple percpu counters.  We could do this 
with helpers, but AFAICT it's orthogonal to the other plans.
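
Concretely, the semantics being relied on (a sketch with a hypothetical
counter):

#include <linux/cpumask.h>
#include <linux/percpu.h>

DEFINE_PER_CPU(unsigned long, my_events);	/* hypothetical */

/* Works without any hotplug notifier precisely because every possible
 * CPU's slot exists, whether or not that CPU ever comes online. */
static unsigned long total_my_events(void)
{
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu(my_events, cpu);

	return sum;
}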

> 3. Some hooks for arches to override particular behavior as needed.
>    F.e. IA64 allocates percpu structures in a special way. x86_64
>    needs to do some tricks for the pda etc etc.

IA64 is going to need some work, since dynamic percpu addresses won't be able 
to use their pinned TLB trick to get the local version.

> > Mike's a clever guy, I'm sure he'll think of something :)
>
> Hopefully. Otherwise he will ask me =-).

And as always, lkml will offer feedback; useful and otherwise :)

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-15 10:33                                       ` Rusty Russell
@ 2008-06-16 14:52                                         ` Christoph Lameter
  2008-06-17  0:24                                           ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-16 14:52 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Nick Piggin, Martin Peschke, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Sun, 15 Jun 2008, Rusty Russell wrote:

> > 3. Some hooks for arches to override particular behavior as needed.
> >    F.e. IA64 allocates percpu structures in a special way. x86_64
> >    needs to do some tricks for the pda etc etc.
> 
> IA64 is going to need some work, since dynamic percpu addresses won't be able 
> to use their pinned TLB trick to get the local version.

The ia64 hook could simply return the address of the percpu area that 
was reserved when the per-node memory layout was generated (which happens 
very early during node bootstrap).



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-16 14:52                                         ` Christoph Lameter
@ 2008-06-17  0:24                                           ` Rusty Russell
  2008-06-17  2:29                                             ` Christoph Lameter
  2008-06-17 14:21                                             ` Mike Travis
  0 siblings, 2 replies; 139+ messages in thread
From: Rusty Russell @ 2008-06-17  0:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Martin Peschke, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Tuesday 17 June 2008 00:52:08 Christoph Lameter wrote:
> On Sun, 15 Jun 2008, Rusty Russell wrote:
> > > 3. Some hooks for arches to override particular behavior as needed.
> > >    F.e. IA64 allocates percpu structures in a special way. x86_64
> > >    needs to do some tricks for the pda etc etc.
> >
> > IA64 is going to need some work, since dynamic percpu addresses won't be
> > able to use their pinned TLB trick to get the local version.
>
> The ia64 hook could simply return the address of percpu area that
> was reserved when the per node memory layout was generated (which happens
> very early during node bootstrap).

Apologies, this time I read the code.  I thought IA64 used the pinned TLB area 
to access per-cpu vars under some circumstances, but they only do that via an 
arch-specific macro.

So creating new congruent mappings to expand the percpu area(s) is our main 
concern now?

Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-17  0:24                                           ` Rusty Russell
@ 2008-06-17  2:29                                             ` Christoph Lameter
  2008-06-17 14:21                                             ` Mike Travis
  1 sibling, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-06-17  2:29 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Nick Piggin, Martin Peschke, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Tue, 17 Jun 2008, Rusty Russell wrote:

> > The ia64 hook could simply return the address of percpu area that
> > was reserved when the per node memory layout was generated (which happens
> > very early during node bootstrap).
> 
> Apologies, this time I read the code.  I thought IA64 used the pinned TLB area 
> to access per-cpu vars under some circumstances, but they only do that via an 
> arch-specific macro.
> 
> So creating new congruent mappings to expand the percpu area(s) is our main 
> concern now?

The concern here was just consolidating the setup code for the per cpu 
areas.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-17  0:24                                           ` Rusty Russell
  2008-06-17  2:29                                             ` Christoph Lameter
@ 2008-06-17 14:21                                             ` Mike Travis
  1 sibling, 0 replies; 139+ messages in thread
From: Mike Travis @ 2008-06-17 14:21 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Lameter, Nick Piggin, Martin Peschke, Andrew Morton,
	linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra

Rusty Russell wrote:
> On Tuesday 17 June 2008 00:52:08 Christoph Lameter wrote:
>> On Sun, 15 Jun 2008, Rusty Russell wrote:
>>>> 3. Some hooks for arches to override particular behavior as needed.
>>>>    F.e. IA64 allocates percpu structures in a special way. x86_64
>>>>    needs to do some tricks for the pda etc etc.
>>> IA64 is going to need some work, since dynamic percpu addresses won't be
>>> able to use their pinned TLB trick to get the local version.
>> The ia64 hook could simply return the address of percpu area that
>> was reserved when the per node memory layout was generated (which happens
>> very early during node bootstrap).
> 
> Apologies, this time I read the code.  I thought IA64 used the pinned TLB area 
> to access per-cpu vars under some circumstances, but they only do that via an 
> arch-specific macro.
> 
> So creating new congruent mappings to expand the percpu area(s) is our main 
> concern now?
> 
> Rusty.

Not exactly.  Getting the system not to panic early in the boot (before
x86_64_start_kernel()) is the primary problem right now.  This happens
in the tip tree with the change to use zero-based percpu offsets.  It
gets much farther in the linux-next tree.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 139+ messages in thread

end of thread, other threads:[~2008-06-17 14:21 UTC | newest]

Thread overview: 139+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
2008-05-30  3:56 ` [patch 01/41] cpu_alloc: Increase percpu area size to 128k Christoph Lameter
2008-06-02 17:58   ` Luck, Tony
2008-06-02 23:48     ` Rusty Russell
2008-06-10 17:22     ` Christoph Lameter
2008-06-10 19:54       ` Luck, Tony
2008-05-30  3:56 ` [patch 02/41] cpu alloc: The allocator Christoph Lameter
2008-05-30  4:58   ` Andrew Morton
2008-05-30  5:10     ` Christoph Lameter
2008-05-30  5:31       ` Andrew Morton
2008-06-02  9:29         ` Paul Jackson
2008-05-30  5:56       ` KAMEZAWA Hiroyuki
2008-05-30  6:16         ` Christoph Lameter
2008-06-04 14:48     ` Mike Travis
2008-05-30  5:04   ` Eric Dumazet
2008-05-30  5:20     ` Christoph Lameter
2008-05-30  5:52       ` Rusty Russell
2008-06-04 15:30         ` Mike Travis
2008-06-05 23:48           ` Rusty Russell
2008-05-30  5:54       ` Eric Dumazet
2008-06-04 14:58       ` Mike Travis
2008-06-04 15:11         ` Eric Dumazet
2008-06-06  0:32           ` Rusty Russell
2008-06-10 17:33         ` Christoph Lameter
2008-06-10 18:05           ` Eric Dumazet
2008-06-10 18:28             ` Christoph Lameter
2008-05-30  5:46   ` Rusty Russell
2008-06-04 15:04     ` Mike Travis
2008-06-10 17:34       ` Christoph Lameter
2008-05-31 20:58   ` Pavel Machek
2008-05-30  3:56 ` [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator Christoph Lameter
2008-05-30  4:58   ` Andrew Morton
2008-05-30  5:14     ` Christoph Lameter
2008-05-30  5:34       ` Andrew Morton
2008-05-30  6:08   ` Rusty Russell
2008-05-30  6:21     ` Christoph Lameter
2008-05-30  3:56 ` [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations Christoph Lameter
2008-05-30  4:58   ` Andrew Morton
2008-05-30  5:17     ` Christoph Lameter
2008-05-30  5:38       ` Andrew Morton
2008-05-30  6:12         ` Christoph Lameter
2008-05-30  7:08           ` Rusty Russell
2008-05-30 18:00             ` Christoph Lameter
2008-06-02  2:00               ` Rusty Russell
2008-06-04 18:18                 ` Mike Travis
2008-06-05 23:59                   ` Rusty Russell
2008-06-09 19:00                     ` Christoph Lameter
2008-06-09 23:27                       ` Rusty Russell
2008-06-09 23:54                         ` Christoph Lameter
2008-06-10  2:56                           ` Rusty Russell
2008-06-10  3:18                             ` Christoph Lameter
2008-06-11  0:03                               ` Rusty Russell
2008-06-11  0:15                                 ` Christoph Lameter
2008-06-09 23:09                   ` Christoph Lameter
2008-06-10 17:42                 ` Christoph Lameter
2008-06-11 11:10                   ` Rusty Russell
2008-06-11 23:39                     ` Christoph Lameter
2008-06-12  0:58                       ` Nick Piggin
2008-06-12  2:44                         ` Rusty Russell
2008-06-12  3:40                           ` Nick Piggin
2008-06-12  9:37                             ` Martin Peschke
2008-06-12 11:21                               ` Nick Piggin
2008-06-12 17:19                                 ` Christoph Lameter
2008-06-13  0:38                                   ` Rusty Russell
2008-06-13  2:27                                     ` Christoph Lameter
2008-06-15 10:33                                       ` Rusty Russell
2008-06-16 14:52                                         ` Christoph Lameter
2008-06-17  0:24                                           ` Rusty Russell
2008-06-17  2:29                                             ` Christoph Lameter
2008-06-17 14:21                                             ` Mike Travis
2008-05-30  7:05         ` Rusty Russell
2008-05-30  6:32       ` Rusty Russell
2008-05-30  3:56 ` [patch 05/41] cpu alloc: Percpu_counter conversion Christoph Lameter
2008-05-30  6:47   ` Rusty Russell
2008-05-30 17:54     ` Christoph Lameter
2008-05-30  3:56 ` [patch 06/41] cpu alloc: crash_notes conversion Christoph Lameter
2008-05-30  3:56 ` [patch 07/41] cpu alloc: Workqueue conversion Christoph Lameter
2008-05-30  3:56 ` [patch 08/41] cpu alloc: ACPI cstate handling conversion Christoph Lameter
2008-05-30  3:56 ` [patch 09/41] cpu alloc: Genhd statistics conversion Christoph Lameter
2008-05-30  3:56 ` [patch 10/41] cpu alloc: blktrace conversion Christoph Lameter
2008-05-30  3:56 ` [patch 11/41] cpu alloc: SRCU cpu alloc conversion Christoph Lameter
2008-05-30  3:56 ` [patch 12/41] cpu alloc: XFS counter conversion Christoph Lameter
2008-05-30  3:56 ` [patch 13/41] cpu alloc: NFS statistics Christoph Lameter
2008-05-30  3:56 ` [patch 14/41] cpu alloc: Neigbour statistics Christoph Lameter
2008-05-30  3:56 ` [patch 15/41] cpu_alloc: Convert ip route statistics Christoph Lameter
2008-05-30  3:56 ` [patch 16/41] cpu alloc: Tcp statistics conversion Christoph Lameter
2008-05-30  3:56 ` [patch 17/41] cpu alloc: Convert scratches to cpu alloc Christoph Lameter
2008-05-30  3:56 ` [patch 18/41] cpu alloc: Dmaengine conversion Christoph Lameter
2008-05-30  3:56 ` [patch 19/41] cpu alloc: Convert loopback statistics Christoph Lameter
2008-05-30  3:56 ` [patch 20/41] cpu alloc: Veth conversion Christoph Lameter
2008-05-30  3:56 ` [patch 21/41] cpu alloc: Chelsio statistics conversion Christoph Lameter
2008-05-30  3:56 ` [patch 22/41] cpu alloc: Convert network sockets inuse counter Christoph Lameter
2008-05-30  3:56 ` [patch 23/41] cpu alloc: Use it for infiniband Christoph Lameter
2008-05-30  3:56 ` [patch 24/41] cpu alloc: Use in the crypto subsystem Christoph Lameter
2008-05-30  3:56 ` [patch 25/41] cpu alloc: scheduler: Convert cpuusage to cpu_alloc Christoph Lameter
2008-05-30  3:56 ` [patch 26/41] cpu alloc: Convert mib handling to cpu alloc Christoph Lameter
2008-05-30  6:47   ` Eric Dumazet
2008-05-30 18:01     ` Christoph Lameter
2008-05-30  3:56 ` [patch 27/41] cpu alloc: Remove the allocpercpu functionality Christoph Lameter
2008-05-30  4:58   ` Andrew Morton
2008-05-30  3:56 ` [patch 28/41] Module handling: Use CPU_xx ops to dynamically allocate counters Christoph Lameter
2008-05-30  3:56 ` [patch 29/41] x86_64: Use CPU ops for nmi alert counter Christoph Lameter
2008-05-30  3:56 ` [patch 30/41] Remove local_t support Christoph Lameter
2008-05-30  3:56 ` [patch 31/41] VM statistics: Use CPU ops Christoph Lameter
2008-05-30  3:56 ` [patch 32/41] cpu alloc: Use in slub Christoph Lameter
2008-05-30  3:56 ` [patch 33/41] cpu alloc: Remove slub fields Christoph Lameter
2008-05-30  3:56 ` [patch 34/41] cpu alloc: Page allocator conversion Christoph Lameter
2008-05-30  3:56 ` [patch 35/41] Support for CPU ops Christoph Lameter
2008-05-30  4:58   ` Andrew Morton
2008-05-30  5:18     ` Christoph Lameter
2008-05-30  3:56 ` [patch 36/41] Zero based percpu: Infrastructure to rebase the per cpu area to zero Christoph Lameter
2008-05-30  3:56 ` [patch 37/41] x86_64: Fold pda into per cpu area Christoph Lameter
2008-05-30  3:56 ` [patch 38/41] x86: Extend percpu ops to 64 bit Christoph Lameter
2008-05-30  3:56 ` [patch 39/41] x86: Replace cpu_pda() using percpu logic and get rid of _cpu_pda() Christoph Lameter
2008-05-30  3:57 ` [patch 40/41] x86: Replace xxx_pda() operations with x86_xx_percpu() Christoph Lameter
2008-05-30  3:57 ` [patch 41/41] x86_64: Support for cpu ops Christoph Lameter
2008-05-30  4:58 ` [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Andrew Morton
2008-05-30  5:03   ` Christoph Lameter
2008-05-30  5:21     ` Andrew Morton
2008-05-30  5:27       ` Christoph Lameter
2008-05-30  5:49         ` Andrew Morton
2008-05-30  6:16           ` Christoph Lameter
2008-05-30  6:51             ` KAMEZAWA Hiroyuki
2008-05-30 14:38         ` Mike Travis
2008-05-30 17:50           ` Christoph Lameter
2008-05-30 18:00             ` Matthew Wilcox
2008-05-30 18:12               ` Christoph Lameter
2008-05-30  6:01       ` Eric Dumazet
2008-05-30  6:16         ` Andrew Morton
2008-05-30  6:22           ` Christoph Lameter
2008-05-30  6:37             ` Andrew Morton
2008-05-30 11:32               ` Matthew Wilcox
2008-06-04 15:07   ` Mike Travis
2008-06-06  5:33     ` Eric Dumazet
2008-06-06 13:08       ` Mike Travis
2008-06-08  6:00       ` Rusty Russell
2008-06-09 18:44       ` Christoph Lameter
2008-06-09 19:11         ` Andi Kleen
2008-06-09 20:15           ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).