linux-kernel.vger.kernel.org archive mirror
* Re: percpu-2.5.63-bk5-1 (properly generated)
@ 2003-03-02 18:24 Martin J. Bligh
  2003-03-02 20:24 ` William Lee Irwin III
  0 siblings, 1 reply; 16+ messages in thread
From: Martin J. Bligh @ 2003-03-02 18:24 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel

>> Tested, boots, and runs on NUMA-Q. Trims 6s of 41s off kernel compiles.

Odd. I get nothing like that difference.

Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
                              Elapsed        User      System         CPU
              2.5.63-mjb2       44.43      557.16       95.31     1467.83
      2.5.63-mjb2-pernode       44.21      556.92       95.16     1474.33

Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
                              Elapsed        User      System         CPU
              2.5.63-mjb2       45.39      560.26      117.25     1492.33
      2.5.63-mjb2-pernode       44.78      560.24      112.20     1501.17

No difference for make -j32, definite improvement in the systime for -j256.

Measurement error? Different test (make -j 65536? ;-))
The full output of the time command (systime, user, elapsed, etc)
and profiles might be useful in working out why we have this disparity.
(diffprofile at end)

It also seems to slow SDET down a fraction:

SDET 1  (see disclaimer)
                           Throughput    Std. Dev
              2.5.63-mjb2       100.0%         0.9%
      2.5.63-mjb2-pernode        92.8%         2.9%

SDET 2  (see disclaimer)
                           Throughput    Std. Dev
              2.5.63-mjb2       100.0%         6.5%
      2.5.63-mjb2-pernode        97.7%         3.2%

SDET 4  (see disclaimer)
                           Throughput    Std. Dev
              2.5.63-mjb2       100.0%         2.3%
      2.5.63-mjb2-pernode        97.9%         3.5%

SDET 8  (see disclaimer)
                           Throughput    Std. Dev
              2.5.63-mjb2       100.0%         3.3%
      2.5.63-mjb2-pernode       101.0%         2.6%

SDET 16  (see disclaimer)
                           Throughput    Std. Dev
              2.5.63-mjb2       100.0%         2.7%
      2.5.63-mjb2-pernode       103.8%         3.1%

SDET 32  (see disclaimer)
                           Throughput    Std. Dev
              2.5.63-mjb2       100.0%         0.6%
      2.5.63-mjb2-pernode        96.0%         1.4%

SDET 64  (see disclaimer)
                           Throughput    Std. Dev
              2.5.63-mjb2       100.0%         1.2%
      2.5.63-mjb2-pernode        96.8%         1.2%

Any chance you could split the patch up a bit, maybe roughly along the
lines of the following groupings:

> (6)  per_cpu()/__get_cpu_var() needs to parenthesize the cpu arg
> (2)  make irq_stat[] per_cpu, x86-only
> (3)  make mmu_gathers[] per_cpu, with concomitant divorce from asm-generic
> (16) make task_cache per_cpu
> (17) make runqueues[] per_cpu
> (19) make reap_timers[] per_cpu


> (1) reuse the arch/i386/discontigmem.c per-node mem_map[] virtual remap
> 	to remap node-local memory backing per_cpu and per_node areas
> (8)  added MAX_NODE_CPUS for pessimistic sizing of virtual remapping arenas
> (13) declare MAX_NODE_CPUS in include/asm-i386/srat.h
> (4)  delay discontig zone_sizes_init() to dodge bootstrap order issues
> (5)  add .data.pernode section handling in vmlinux.lds.S
> (7)  introduced asm-generic/pernode.h to do similar things as percpu.h
> (10) #undef asm-i386/per{cpu,node}.h's __GENERIC_PER_{CPU,NODE}
> (11) declare setup_per_cpu_areas() in asm-i386/percpu.h
> (12) make an asm-i386/pernode.h stub header like include/asm-generic/pernode.h
> (15) call setup_per_node_areas() in init/main.c, with analogous hooks


> (18) make node_nr_running[] per_node
> (14) make zone_table[] per_node


> (9)  fix return type error in NUMA-Q get_zholes_size()

diffprofile for kernbench -j256
(+ worse with patch, - better)


        60   125.0% page_address
        30     3.9% d_lookup
        28    18.4% path_lookup
        15    10.5% do_schedule
        12    63.2% __pagevec_lru_add_active
        11    47.8% bad_range
        10    15.9% kmap_atomic
...
       -11   -18.3% file_ra_state_init
       -11    -3.9% zap_pte_range
       -12    -5.5% page_add_rmap
       -14    -6.8% file_move
       -15   -17.9% strnlen_user
       -20    -4.3% vm_enough_memory
       -38   -13.9% __fput
       -39    -8.4% get_empty_filp
       -43   -14.5% atomic_dec_and_lock
      -389    -8.3% default_idle
      -552   -26.1% .text.lock.file_table
      -931    -5.6% total

diffprofile for SDET 64:
(+ worse with patch, - better)


      1294     2.7% total
      1057     3.1% default_idle
       225    10.5% __down
        86    13.1% __wake_up
        48    66.7% page_address
        41     4.8% do_schedule
        37    77.1% kmap_atomic
        25    12.8% __copy_to_user_ll
        20     4.5% page_remove_rmap
        16   320.0% bad_range
        15    13.0% path_lookup
        14    18.4% copy_mm
        12     4.0% release_pages
        11    33.3% read_block_bitmap
        11     2.3% copy_page_range
        10    25.6% path_release
        10     3.2% page_add_rmap
         9    16.7% free_hot_cold_page
         9   450.0% page_waitqueue
         7    36.8% ext2_get_group_desc
         6    12.0% proc_pid_stat
         6     5.0% vfs_read
         6    26.1% kmem_cache_alloc
         6    35.3% alloc_inode
         6    46.2% mark_page_accessed
         5    22.7% ext2_update_inode
         5    45.5% __blk_queue_bounce
         5    55.6% exec_mmap
...
        -5   -23.8% real_lookup
        -5   -25.0% exit_mmap
        -5   -25.0% do_page_cache_readahead
        -5    -3.6% do_anonymous_page
        -5    -7.9% follow_mount
        -5   -18.5% generic_delete_inode
        -5   -45.5% kunmap_high
        -6   -17.6% __copy_user_intel
        -7   -23.3% dentry_open
        -8   -40.0% default_wake_function
        -8   -10.1% filemap_nopage
        -9   -12.2% __find_get_block_slow
        -9    -8.7% ext2_free_blocks
       -11    -5.9% atomic_dec_and_lock
       -13    -9.0% __fput
       -15   -75.0% mark_buffer_dirty
       -19    -5.3% find_get_page
       -20   -23.5% file_move
       -22   -20.8% __find_get_block
       -37   -14.1% get_empty_filp
      -175   -39.1% .text.lock.file_table





* Re: percpu-2.5.63-bk5-1 (properly generated)
  2003-03-02 18:24 percpu-2.5.63-bk5-1 (properly generated) Martin J. Bligh
@ 2003-03-02 20:24 ` William Lee Irwin III
  2003-03-02 20:46   ` Martin J. Bligh
  0 siblings, 1 reply; 16+ messages in thread
From: William Lee Irwin III @ 2003-03-02 20:24 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel

At some point in the past, I wrote:
>>> Tested, boots, and runs on NUMA-Q. Trims 6s of 41s off kernel compiles.

On Sun, Mar 02, 2003 at 10:24:37AM -0800, Martin J. Bligh wrote:
> Odd. I get nothing like that difference.
> Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
>                               Elapsed        User      System         CPU
>               2.5.63-mjb2       44.43      557.16       95.31     1467.83
>       2.5.63-mjb2-pernode       44.21      556.92       95.16     1474.33
> Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
>                               Elapsed        User      System         CPU
>               2.5.63-mjb2       45.39      560.26      117.25     1492.33
>       2.5.63-mjb2-pernode       44.78      560.24      112.20     1501.17
> No difference for make -j32, definite improvement in the systime for -j256.

Maybe your machine's running slow?
AFAIK the machines we're using are identical, and mine sees:

make -j bzImage > /dev/null  317.70s user 148.43s system 1295% cpu 35.984 total
(yes, this is 5 off of 41s, apparently 1s measurement variations are typical)

make -j36 bzImage > /dev/null  302.33s user 115.02s system 1284% cpu 32.492 total
make -j38 bzImage > /dev/null  302.52s user 117.06s system 1300% cpu 32.258 total
make -j40 bzImage > /dev/null  303.53s user 117.42s system 1305% cpu 32.251 total
make -j44 bzImage > /dev/null  304.02s user 122.14s system 1299% cpu 32.792 total

Check MTRR's etc.?


-- wli


* Re: percpu-2.5.63-bk5-1 (properly generated)
  2003-03-02 20:24 ` William Lee Irwin III
@ 2003-03-02 20:46   ` Martin J. Bligh
  2003-03-02 21:06     ` William Lee Irwin III
  0 siblings, 1 reply; 16+ messages in thread
From: Martin J. Bligh @ 2003-03-02 20:46 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel

> On Sun, Mar 02, 2003 at 10:24:37AM -0800, Martin J. Bligh wrote:
>> Odd. I get nothing like that difference.
>> Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
>>                               Elapsed        User      System         CPU
>>               2.5.63-mjb2       44.43      557.16       95.31     1467.83
>>       2.5.63-mjb2-pernode       44.21      556.92       95.16     1474.33
>> Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
>>                               Elapsed        User      System         CPU
>>               2.5.63-mjb2       45.39      560.26      117.25     1492.33
>>       2.5.63-mjb2-pernode       44.78      560.24      112.20     1501.17
>> No difference for make -j32, definite improvement in the systime for -j256.
> 
> Maybe your machine's running slow?
> AFAIK the machines we're using are identical, and mine sees:

No, I just started using a config file for kernbench that has every option
under the sun turned on ;-) Makes a longer test, and stabilises results.
ftp://ftp.kernel.org/pub/linux/kernel/people/people/mbligh/config/kernbench2.config
(2.4.17). It's the difference between before and after runs that's going to
be interesting anyway.

> make -j bzImage > /dev/null  317.70s user 148.43s system 1295% cpu 35.984 total
> (yes, this is 5 off of 41s, apparently 1s measurement variations are typical)

make -j is going to spawn as many tasks as possible, creating a massive
forkbomb ... that might be behind the differences - your patch might make
more of a difference for huge amounts of context switching / cache thrash
(not necessarily a bad thing, I just want to find the cause).
 
> make -j36 bzImage > /dev/null  302.33s user 115.02s system 1284% cpu 32.492 total
> make -j38 bzImage > /dev/null  302.52s user 117.06s system 1300% cpu 32.258 total
> make -j40 bzImage > /dev/null  303.53s user 117.42s system 1305% cpu 32.251 total
> make -j44 bzImage > /dev/null  304.02s user 122.14s system 1299% cpu 32.792 total

How does that compare with and without your patch though?

Would be useful if you can grab a before and after profile, and see exactly
what it is that's getting thrashed that you're fixing (may just be everything).

M.



* Re: percpu-2.5.63-bk5-1 (properly generated)
  2003-03-02 20:46   ` Martin J. Bligh
@ 2003-03-02 21:06     ` William Lee Irwin III
  2003-03-02 21:58       ` Martin J. Bligh
  0 siblings, 1 reply; 16+ messages in thread
From: William Lee Irwin III @ 2003-03-02 21:06 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel

At some point in the past, I wrote:
>> make -j44 bzImage > /dev/null  304.02s user 122.14s system 1299% cpu 32.792 total
[...]

On Sun, Mar 02, 2003 at 12:46:00PM -0800, Martin J. Bligh wrote:
> How does that compare with and without your patch though?

There's a relatively large (12s/44s == 27%) difference between absolute
timings on our machines, which suggests a large disturbance in the force.
2.5.63-bk5 virgin appears to get timings in the low 40's.


On Sun, Mar 02, 2003 at 12:46:00PM -0800, Martin J. Bligh wrote:
> Would be useful if you can grab a before and after profile, and see exactly
> what it is that's getting thrashed that you're fixing (may just be everything).

From the profile posted it's the division in page_zone().


-- wli


diff -urpN pernode-2.5.63-bk5-1/include/linux/mm.h pernode-2.5.63-bk5-2/include/linux/mm.h
--- pernode-2.5.63-bk5-1/include/linux/mm.h	2003-03-02 02:55:14.000000000 -0800
+++ pernode-2.5.63-bk5-2/include/linux/mm.h	2003-03-02 12:55:20.000000000 -0800
@@ -316,6 +316,7 @@ static inline void put_page(struct page 
  * sets it, so none of the operations on it need to be atomic.
  */
 #define NODE_SHIFT 4
+#define ZONE_MASK  ((1UL << NODE_SHIFT) - 1)
 #define ZONE_SHIFT (BITS_PER_LONG - 8)
 
 struct zone;
@@ -324,7 +325,7 @@ DECLARE_PER_NODE(struct zone *[MAX_NR_ZO
 static inline struct zone *page_zone(struct page *page)
 {
 	unsigned long zone = page->flags >> ZONE_SHIFT;
-	return per_node(zone_table, zone/MAX_NR_ZONES)[zone % MAX_NR_ZONES];
+	return per_node(zone_table, zone >> NODE_SHIFT)[zone & ZONE_MASK];
 }
 
 static inline void set_page_zone(struct page *page, unsigned long zone_num)
diff -urpN pernode-2.5.63-bk5-1/mm/page_alloc.c pernode-2.5.63-bk5-2/mm/page_alloc.c
--- pernode-2.5.63-bk5-1/mm/page_alloc.c	2003-03-02 02:55:14.000000000 -0800
+++ pernode-2.5.63-bk5-2/mm/page_alloc.c	2003-03-02 12:15:00.000000000 -0800
@@ -1262,7 +1262,7 @@ static void __init free_area_init_core(s
 		 */
 		for (i = 0; i < size; i++) {
 			struct page *page = lmem_map + local_offset + i;
-			set_page_zone(page, nid * MAX_NR_ZONES + j);
+			set_page_zone(page, (nid << NODE_SHIFT) + j);
 			set_page_count(page, 0);
 			SetPageReserved(page);
 			INIT_LIST_HEAD(&page->list);


* Re: percpu-2.5.63-bk5-1 (properly generated)
  2003-03-02 21:06     ` William Lee Irwin III
@ 2003-03-02 21:58       ` Martin J. Bligh
  2003-03-02 22:10         ` William Lee Irwin III
  0 siblings, 1 reply; 16+ messages in thread
From: Martin J. Bligh @ 2003-03-02 21:58 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel

> There's a relatively large (12s/44s == 27%) difference between absolute
> timings on our machines, which suggests a large disturbance in the force.
> 2.5.63-bk5 virgin appears to get timings in the low 40's.

Did you actually read the previous email? 
Same config file? Same tree? same compiler (gcc 2.95.4?)

>> Would be useful if you can grab a before and after profile, and see exactly
>> what it is that's getting thrashed that you're fixing (may just be everything).
> 
> From the profile posted it's the division in page_zone().

I think we're talking about different things:

1. Need to isolate what's causing the 6s improvement you're seeing.
Can you generate profiles & time output for before and after the patch,
and describe the test you're running (presumably make -j).

2. SDET degradation. I'll try the additional patch you sent out on that.

M.




* Re: percpu-2.5.63-bk5-1 (properly generated)
  2003-03-02 21:58       ` Martin J. Bligh
@ 2003-03-02 22:10         ` William Lee Irwin III
  2003-03-02 23:13           ` Martin J. Bligh
  0 siblings, 1 reply; 16+ messages in thread
From: William Lee Irwin III @ 2003-03-02 22:10 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel

At some point in the past, I wrote:
>> There's a relatively large (12s/44s == 27%) difference between absolute
>> timings on our machines, which suggests a large disturbance in the force.
>> 2.5.63-bk5 virgin appears to get timings in the low 40's.

On Sun, Mar 02, 2003 at 01:58:58PM -0800, Martin J. Bligh wrote:
> Did you actually read the previous email? 
> Same config file? Same tree? same compiler (gcc 2.95.4?)

gcc 2.95.4; 2.5.63-bk5 with and without the patch, no patchkits applied first, .config below


At some point in the past, I wrote:
>> From the profile posted it's the division in page_zone().

On Sun, Mar 02, 2003 at 01:58:58PM -0800, Martin J. Bligh wrote:
> I think we're talking about different things:
> 1. Need to isolate what's causing the 6s improvement you're seeing.
> Can you generate profiles & time output for before and after the patch,
> and describe the test you're running (presumably make -j).
> 2. SDET degradation. I'll try the additional patch you sent out on that.

It's not hard to figure out.
>         60   125.0% page_address
>         12    63.2% __pagevec_lru_add_active
>         11    47.8% bad_range
>         10    15.9% kmap_atomic

All users of page_zone(). The question you're (hopefully) about to
answer is whether it was the division or something else like codesize
or the newly introduced indirection.

If that is still seeing page_zone() suckage, I'll rip zone_table[] out
of it entirely.


-- wli

#
# Automatically generated make config: don't edit
#
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_SWAP=y
CONFIG_UID16=y
CONFIG_GENERIC_ISA_DMA=y

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y

#
# General setup
#
CONFIG_SYSVIPC=y
# CONFIG_BSD_PROCESS_ACCT is not set
CONFIG_SYSCTL=y
CONFIG_LOG_BUF_SHIFT=17

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_OBSOLETE_MODPARM=y
# CONFIG_MODVERSIONS is not set
CONFIG_KMOD=y

#
# Processor type and features
#
# CONFIG_X86_PC is not set
# CONFIG_X86_VOYAGER is not set
CONFIG_X86_NUMAQ=y
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
CONFIG_MPENTIUMIII=y
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MELAN is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_L1_CACHE_SHIFT=5
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_PREFETCH=y
CONFIG_HUGETLB_PAGE=y
CONFIG_SMP=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_NR_CPUS=32
CONFIG_NUMA=y
CONFIG_DISCONTIGMEM=y
CONFIG_HAVE_ARCH_BOOTMEM_NODE=y
# CONFIG_X86_MCE is not set
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
# CONFIG_MICROCODE is not set
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set
# CONFIG_EDD is not set
# CONFIG_NOHIGHMEM is not set
# CONFIG_HIGHMEM4G is not set
CONFIG_HIGHMEM64G=y
CONFIG_HIGHMEM=y
CONFIG_X86_PAE=y
CONFIG_HIGHPTE=y
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
CONFIG_HAVE_DEC_LOCK=y

#
# Power management options (ACPI, APM)
#
# CONFIG_PM is not set

#
# ACPI Support
#
# CONFIG_ACPI is not set

#
# CPU Frequency scaling
#
# CONFIG_CPU_FREQ is not set

#
# Bus options (PCI, PCMCIA, EISA, MCA, ISA)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
CONFIG_PCI_GODIRECT=y
# CONFIG_PCI_GOANY is not set
CONFIG_PCI_DIRECT=y
# CONFIG_SCx200 is not set
CONFIG_PCI_LEGACY_PROC=y
CONFIG_PCI_NAMES=y
CONFIG_ISA=y
# CONFIG_EISA is not set
# CONFIG_MCA is not set
# CONFIG_HOTPLUG is not set

#
# Executable file formats
#
CONFIG_KCORE_ELF=y
# CONFIG_KCORE_AOUT is not set
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=y

#
# Memory Technology Devices (MTD)
#
# CONFIG_MTD is not set

#
# Parallel port support
#
# CONFIG_PARPORT is not set

#
# Plug and Play support
#
# CONFIG_PNP is not set

#
# Block devices
#
CONFIG_BLK_DEV_FD=y
# CONFIG_BLK_DEV_XD is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_RAM is not set
CONFIG_LBD=y

#
# ATA/ATAPI/MFM/RLL device support
#
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
# CONFIG_BLK_DEV_SR is not set
# CONFIG_CHR_DEV_SG is not set

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
# CONFIG_SCSI_REPORT_LUNS is not set
CONFIG_SCSI_CONSTANTS=y
# CONFIG_SCSI_LOGGING is not set

#
# SCSI low-level drivers
#
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_7000FASST is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AHA152X is not set
# CONFIG_SCSI_AHA1542 is not set
# CONFIG_SCSI_AACRAID is not set
CONFIG_SCSI_AIC7XXX=m
CONFIG_AIC7XXX_CMDS_PER_DEVICE=253
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
# CONFIG_AIC7XXX_PROBE_EISA_VL is not set
# CONFIG_AIC7XXX_BUILD_FIRMWARE is not set
CONFIG_AIC7XXX_DEBUG_ENABLE=y
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_IN2000 is not set
# CONFIG_SCSI_AM53C974 is not set
# CONFIG_SCSI_MEGARAID is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_CPQFCTS is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_DTC3280 is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_EATA_PIO is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_GENERIC_NCR5380 is not set
# CONFIG_SCSI_GENERIC_NCR5380_MMIO is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_NCR53C406A is not set
# CONFIG_SCSI_NCR53C7xx is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_NCR53C8XX is not set
# CONFIG_SCSI_SYM53C8XX is not set
# CONFIG_SCSI_PAS16 is not set
# CONFIG_SCSI_PCI2000 is not set
# CONFIG_SCSI_PCI2220I is not set
# CONFIG_SCSI_PSI240I is not set
# CONFIG_SCSI_QLOGIC_FAS is not set
CONFIG_SCSI_QLOGIC_ISP=y
# CONFIG_SCSI_QLOGIC_FC is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_SEAGATE is not set
# CONFIG_SCSI_SYM53C416 is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_T128 is not set
# CONFIG_SCSI_U14_34F is not set
# CONFIG_SCSI_ULTRASTOR is not set
# CONFIG_SCSI_NSP32 is not set
# CONFIG_SCSI_DEBUG is not set

#
# Old CD-ROM drivers (not SCSI, not IDE)
#
# CONFIG_CD_NO_IDESCSI is not set

#
# Multi-device support (RAID and LVM)
#
# CONFIG_MD is not set

#
# Fusion MPT device support
#
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support (EXPERIMENTAL)
#
# CONFIG_IEEE1394 is not set

#
# I2O device support
#
# CONFIG_I2O is not set

#
# Networking support
#
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_MMAP is not set
# CONFIG_NETLINK_DEV is not set
# CONFIG_NETFILTER is not set
# CONFIG_FILTER is not set
CONFIG_UNIX=y
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
# CONFIG_IP_ADVANCED_ROUTER is not set
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_ARPD is not set
# CONFIG_INET_ECN is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_XFRM_USER is not set
# CONFIG_IPV6 is not set

#
# SCTP Configuration (EXPERIMENTAL)
#
CONFIG_IPV6_SCTP__=y
# CONFIG_IP_SCTP is not set
# CONFIG_ATM is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_LLC is not set
# CONFIG_DECNET is not set
# CONFIG_BRIDGE is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_NET_DIVERT is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_NET_FASTROUTE is not set
# CONFIG_NET_HW_FLOWCONTROL is not set

#
# QoS and/or fair queueing
#
# CONFIG_NET_SCHED is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
CONFIG_NETDEVICES=y

#
# ARCnet devices
#
# CONFIG_ARCNET is not set
CONFIG_DUMMY=m
# CONFIG_BONDING is not set
# CONFIG_EQUALIZER is not set
# CONFIG_TUN is not set
# CONFIG_ETHERTAP is not set

#
# Ethernet (10 or 100Mbit)
#
CONFIG_NET_ETHERNET=y
# CONFIG_MII is not set
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_NET_VENDOR_3COM is not set
# CONFIG_LANCE is not set
# CONFIG_NET_VENDOR_SMC is not set
# CONFIG_NET_VENDOR_RACAL is not set

#
# Tulip family network device support
#
CONFIG_NET_TULIP=y
CONFIG_DE2104X=y
CONFIG_TULIP=y
# CONFIG_TULIP_MWI is not set
CONFIG_TULIP_MMIO=y
# CONFIG_DE4X5 is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_AT1700 is not set
# CONFIG_DEPCA is not set
# CONFIG_HP100 is not set
# CONFIG_NET_ISA is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
CONFIG_ADAPTEC_STARFIRE=y
# CONFIG_ADAPTEC_STARFIRE_NAPI is not set
# CONFIG_AC3200 is not set
# CONFIG_APRICOT is not set
# CONFIG_B44 is not set
# CONFIG_CS89x0 is not set
# CONFIG_DGRS is not set
# CONFIG_EEPRO100 is not set
# CONFIG_E100 is not set
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set
# CONFIG_NET_POCKET is not set

#
# Ethernet (1000 Mbit)
#
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SK98LIN is not set
# CONFIG_TIGON3 is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set

#
# Wireless LAN (non-hamradio)
#
# CONFIG_NET_RADIO is not set

#
# Token Ring devices (depends on LLC=y)
#
# CONFIG_NET_FC is not set
# CONFIG_RCPCI is not set
# CONFIG_SHAPER is not set

#
# Wan interfaces
#
# CONFIG_WAN is not set

#
# Amateur Radio support
#
# CONFIG_HAMRADIO is not set

#
# IrDA (infrared) support
#
# CONFIG_IRDA is not set

#
# ISDN subsystem
#
# CONFIG_ISDN_BOOL is not set

#
# Telephony Support
#
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y

#
# Userland interfaces
#
# CONFIG_INPUT_MOUSEDEV is not set
# CONFIG_INPUT_JOYDEV is not set
# CONFIG_INPUT_TSDEV is not set
# CONFIG_INPUT_EVDEV is not set
# CONFIG_INPUT_EVBUG is not set

#
# Input I/O drivers
#
# CONFIG_GAMEPORT is not set
CONFIG_SOUND_GAMEPORT=y
# CONFIG_SERIO is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_INPUT_MOUSE=y
# CONFIG_MOUSE_INPORT is not set
# CONFIG_MOUSE_LOGIBM is not set
# CONFIG_MOUSE_PC110PAD is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=y
# CONFIG_INPUT_UINPUT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_SERIAL_NONSTANDARD is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
# CONFIG_SERIAL_8250_EXTENDED is not set

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_UNIX98_PTYS=y
CONFIG_UNIX98_PTY_COUNT=256

#
# I2C support
#
# CONFIG_I2C is not set

#
# I2C Hardware Sensors Mainboard support
#

#
# I2C Hardware Sensors Chip support
#

#
# Mice
#
# CONFIG_BUSMOUSE is not set
# CONFIG_QIC02_TAPE is not set

#
# IPMI
#
# CONFIG_IPMI_HANDLER is not set

#
# Watchdog Cards
#
# CONFIG_WATCHDOG is not set
# CONFIG_INTEL_RNG is not set
# CONFIG_AMD_RNG is not set
# CONFIG_NVRAM is not set
# CONFIG_RTC is not set
CONFIG_GEN_RTC=y
# CONFIG_GEN_RTC_X is not set
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_SONYPI is not set

#
# Ftape, the floppy tape device driver
#
# CONFIG_FTAPE is not set
# CONFIG_AGP is not set
# CONFIG_DRM is not set
# CONFIG_MWAVE is not set
# CONFIG_RAW_DRIVER is not set
# CONFIG_HANGCHECK_TIMER is not set

#
# Multimedia devices
#
# CONFIG_VIDEO_DEV is not set

#
# File systems
#
# CONFIG_QUOTA is not set
# CONFIG_AUTOFS_FS is not set
CONFIG_AUTOFS4_FS=y
# CONFIG_REISERFS_FS is not set
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EXT3_FS is not set
# CONFIG_JBD is not set
# CONFIG_FAT_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
CONFIG_TMPFS=y
CONFIG_RAMFS=y
CONFIG_HUGETLBFS=y
CONFIG_ISO9660_FS=y
# CONFIG_JOLIET is not set
# CONFIG_ZISOFS is not set
CONFIG_JFS_FS=y
# CONFIG_JFS_POSIX_ACL is not set
# CONFIG_JFS_DEBUG is not set
# CONFIG_JFS_STATISTICS is not set
CONFIG_MINIX_FS=y
# CONFIG_VXFS_FS is not set
# CONFIG_NTFS_FS is not set
# CONFIG_HPFS_FS is not set
CONFIG_PROC_FS=y
# CONFIG_DEVFS_FS is not set
CONFIG_DEVPTS_FS=y
# CONFIG_QNX4FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_EXT2_FS=y
# CONFIG_EXT2_FS_XATTR is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UDF_FS is not set
# CONFIG_UFS_FS is not set
# CONFIG_XFS_FS is not set

#
# Network File Systems
#
# CONFIG_CODA_FS is not set
# CONFIG_INTERMEZZO_FS is not set
CONFIG_NFS_FS=y
# CONFIG_NFS_V3 is not set
# CONFIG_NFS_V4 is not set
CONFIG_NFSD=y
# CONFIG_NFSD_V3 is not set
# CONFIG_NFSD_TCP is not set
CONFIG_SUNRPC=y
# CONFIG_SUNRPC_GSS is not set
CONFIG_LOCKD=y
CONFIG_EXPORTFS=y
# CONFIG_CIFS is not set
# CONFIG_SMB_FS is not set
# CONFIG_NCP_FS is not set
# CONFIG_AFS_FS is not set

#
# Partition Types
#
# CONFIG_PARTITION_ADVANCED is not set
CONFIG_MSDOS_PARTITION=y
CONFIG_NLS=y

#
# Native Language Support
#
CONFIG_NLS_DEFAULT="iso8859-1"
# CONFIG_NLS_CODEPAGE_437 is not set
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ISO8859_1 is not set
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_UTF8 is not set

#
# Graphics support
#
# CONFIG_FB is not set
# CONFIG_VIDEO_SELECT is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_MDA_CONSOLE is not set
CONFIG_DUMMY_CONSOLE=y

#
# Sound
#
# CONFIG_SOUND is not set

#
# USB support
#
# CONFIG_USB is not set

#
# Bluetooth support
#
# CONFIG_BT is not set

#
# Profiling support
#
# CONFIG_PROFILING is not set

#
# Kernel hacking
#
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_STACKOVERFLOW=y
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_IOVIRT is not set
CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_HIGHMEM=y
CONFIG_KALLSYMS=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_FRAME_POINTER=y
CONFIG_X86_EXTRA_IRQS=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y

#
# Security options
#
# CONFIG_SECURITY is not set

#
# Cryptographic options
#
# CONFIG_CRYPTO is not set

#
# Library routines
#
CONFIG_CRC32=y
CONFIG_X86_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y


* Re: percpu-2.5.63-bk5-1 (properly generated)
  2003-03-02 22:10         ` William Lee Irwin III
@ 2003-03-02 23:13           ` Martin J. Bligh
  2003-03-02 23:42             ` William Lee Irwin III
  0 siblings, 1 reply; 16+ messages in thread
From: Martin J. Bligh @ 2003-03-02 23:13 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel

>> Did you actually read the previous email? 
>> Same config file? Same tree? same compiler (gcc 2.95.4?)
> 
> gcc2.95.4; 2.5.63-bk5 w/& w/o, no patchkits prior, .config below

Wildly different config being compile tested => difference in speed.
 
>> I think we're talking about different things:
>> 1. Need to isolate what's causing the 6s improvement you're seeing.
>> Can you generate profiles & time output for before and after the patch,
>> and describe the test you're running (presumably make -j).
>> 2. SDET degradation. I'll try the additional patch you sent out on that.
> 
> It's not hard to figure out.

Part 2 may not be ... part 1 is ;-)

>>         60   125.0% page_address
>>         12    63.2% __pagevec_lru_add_active
>>         11    47.8% bad_range
>>         10    15.9% kmap_atomic
> 
> All users of page_zone(). The question you're (hopefully) about to
> answer is whether it was the division or something else like codesize
> or the newly introduced indirection.
> 
> If that is still seeing page_zone() suckage, I'll rip zone_table[] out
> of it entirely.

Still degraded: diffprofile:

       781     1.6% total
       346     1.0% default_idle
       217    10.1% __down
        79    12.0% __wake_up
        51    70.8% page_address
        32    66.7% kmap_atomic
        24     5.3% page_remove_rmap
        16    19.3% clear_page_tables
        14     4.6% release_pages
        13    33.3% path_release
        13     6.7% __copy_to_user_ll
        13   260.0% bad_range
        11     1.3% do_schedule
        10    15.6% pte_alloc_one

M.



* Re: percpu-2.5.63-bk5-1 (properly generated)
  2003-03-02 23:13           ` Martin J. Bligh
@ 2003-03-02 23:42             ` William Lee Irwin III
  2003-03-03  0:07               ` Martin J. Bligh
  0 siblings, 1 reply; 16+ messages in thread
From: William Lee Irwin III @ 2003-03-02 23:42 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel

At some point in the past, I wrote:
>> All users of page_zone(). The question you're (hopefully) about to
>> answer is whether it was the division or something else like codesize
>> or the newly introduced indirection.
>> If that is still seeing page_zone() suckage, I'll rip zone_table[] out
>> of it entirely.

On Sun, Mar 02, 2003 at 03:13:22PM -0800, Martin J. Bligh wrote:
> Still degraded: diffprofile:
>        781     1.6% total
>        346     1.0% default_idle
>        217    10.1% __down
>         79    12.0% __wake_up
>         51    70.8% page_address
>         32    66.7% kmap_atomic
>         24     5.3% page_remove_rmap
>         16    19.3% clear_page_tables
>         14     4.6% release_pages
>         13    33.3% path_release
>         13     6.7% __copy_to_user_ll
>         13   260.0% bad_range
>         11     1.3% do_schedule
>         10    15.6% pte_alloc_one

The largest issue is probably idle time, which appears to have gone up
enormously in absolute terms. I'll split the pieces out and see what
happens. From this it looks like the indirection is a slowdown, but the
cost in absolute terms is insignificant, as there aren't enough samples.

There's no clear reason __down() should have become more expensive,
nor __wake_up(). I'd really like an instruction-level profile. AFAICT
node_nr_running is 100% harmless instruction-wise, unless the copy
propagated a nonzero value (which would be a bug), and per_cpu
runqueues are largely unknown, but would be accountable to schedule(),
which is not particularly offensive wrt. additional cpu time.

Some kind of dump of internal scheduler statistics to verify they've
been faithfully preserved would help also. Instruction-level cpu and
cache profiling would also be helpful. There may very well be an odd
cache coloring conflict at work here. If it's too big to take on, I
might need some kind of help or a pointer to a package so I don't have
to crap all over userspace for the benchmark. I may also need a .config
in order to reproduce the usual bullcrap like (#@%$ing) link order.


-- wli


* Re: percpu-2.5.63-bk5-1 (properly generated)
  2003-03-02 23:42             ` William Lee Irwin III
@ 2003-03-03  0:07               ` Martin J. Bligh
  2003-03-03  1:43                 ` William Lee Irwin III
  0 siblings, 1 reply; 16+ messages in thread
From: Martin J. Bligh @ 2003-03-03  0:07 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel

>> Still degraded: diffprofile:
>>        781     1.6% total
>>        346     1.0% default_idle
>>        217    10.1% __down
>>         79    12.0% __wake_up
>>         51    70.8% page_address
>>         32    66.7% kmap_atomic
>>         24     5.3% page_remove_rmap
>>         16    19.3% clear_page_tables
>>         14     4.6% release_pages
>>         13    33.3% path_release
>>         13     6.7% __copy_to_user_ll
>>         13   260.0% bad_range
>>         11     1.3% do_schedule
>>         10    15.6% pte_alloc_one
> 
> The largest issue is probably idle time, which appears to have gone up
> enormously in absolute terms. I'll split the pieces out and see what
> happens. From this it looks like the indirection is a slowdown, but the
> cost in absolute terms is insignificant, as there aren't enough samples.
> 
> There's no clear reason __down() should have become more expensive,
> nor __wake_up(). I'd really like an instruction-level profile. AFAICT
> node_nr_running is 100% harmless instruction-wise, unless the copy
> propagated a nonzero value (which would be a bug), and per_cpu
> runqueues are largely unknown, but would be accountable to schedule(),
> which is not particularly offensive wrt. additional cpu time.
> 
> Some kind of dump of internal scheduler statistics to verify they've
> been faithfully preserved would help also. Instruction-level cpu and
> cache profiling would also be helpful. There may very well be an odd
> cache coloring conflict at work here. If it's too big to take on, I
> might need some kind of help or a pointer to a package so I don't have
> to crap all over userspace for the benchmark. I may also need a .config
> in order to reproduce the usual bullcrap like (#@%$ing) link order.

I think you'd be better off profiling the improvement you saw, and working
out where that comes from. 

Failing that, if you can split it into 3 or 4 patches along the lines I
suggested earlier, I'll try benching each bit separately for you.

M.



* Re: percpu-2.5.63-bk5-1 (properly generated)
  2003-03-03  0:07               ` Martin J. Bligh
@ 2003-03-03  1:43                 ` William Lee Irwin III
  2003-03-03 17:40                   ` Martin J. Bligh
  0 siblings, 1 reply; 16+ messages in thread
From: William Lee Irwin III @ 2003-03-03  1:43 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel

On Sun, Mar 02, 2003 at 04:07:01PM -0800, Martin J. Bligh wrote:
> Failing that, if you can split it into 3 or 4 patches along the lines I
> suggested earlier, I'll try benching each bit separately for you.

Last ditch effort. No per_node stuff at all, and no new per_cpu users.


-- wli


diff -urpN linux-2.5.63-bk5/arch/i386/mm/discontig.c pernode-2.5.63-bk5-3/arch/i386/mm/discontig.c
--- linux-2.5.63-bk5/arch/i386/mm/discontig.c	2003-03-02 01:05:07.000000000 -0800
+++ pernode-2.5.63-bk5-3/arch/i386/mm/discontig.c	2003-03-02 16:11:07.000000000 -0800
@@ -48,8 +48,6 @@ extern unsigned long max_low_pfn;
 extern unsigned long totalram_pages;
 extern unsigned long totalhigh_pages;
 
-#define LARGE_PAGE_BYTES (PTRS_PER_PTE * PAGE_SIZE)
-
 unsigned long node_remap_start_pfn[MAX_NUMNODES];
 unsigned long node_remap_size[MAX_NUMNODES];
 unsigned long node_remap_offset[MAX_NUMNODES];
@@ -67,6 +65,44 @@ static void __init find_max_pfn_node(int
 		node_end_pfn[nid] = max_pfn;
 }
 
+extern char __per_cpu_start[], __per_cpu_end[];
+unsigned long __per_cpu_offset[NR_CPUS];
+
+#define PER_CPU_PAGES	PFN_UP((unsigned long)(__per_cpu_end-__per_cpu_start))
+#define MEM_MAP_SIZE(n)	PFN_UP((node_end_pfn[n]-node_start_pfn[n]+1)*sizeof(struct page))
+
+static void __init allocate_per_cpu_pages(int cpu)
+{
+	int cpu_in_node, node = cpu_to_node(cpu);
+	unsigned long vaddr, nodemask = node_to_cpumask(node);
+
+	if (!PER_CPU_PAGES || node >= numnodes)
+		return;
+
+	if (!node) {
+		vaddr  = (unsigned long)alloc_bootmem(PER_CPU_PAGES*PAGE_SIZE);
+		__per_cpu_offset[cpu] = vaddr - (unsigned long)__per_cpu_start;
+	} else {
+		vaddr = (unsigned long)node_remap_start_vaddr[node];
+		cpu_in_node = hweight32(nodemask & ((1UL << cpu) - 1));
+		__per_cpu_offset[cpu] = vaddr + PAGE_SIZE*MEM_MAP_SIZE(node)
+					+ PAGE_SIZE*PFN_UP(sizeof(pg_data_t))
+					+ PAGE_SIZE*PER_CPU_PAGES*cpu_in_node
+					- (unsigned long)__per_cpu_start;
+	}
+	memcpy(RELOC_HIDE((char *)__per_cpu_start, __per_cpu_offset[cpu]),
+			__per_cpu_start,
+			PER_CPU_PAGES*PAGE_SIZE);
+}
+
+void __init setup_per_cpu_areas(void)
+{
+	int cpu;
+	for (cpu = 0; cpu < NR_CPUS; ++cpu)
+		allocate_per_cpu_pages(cpu);
+}
+
+
 /* 
  * Allocate memory for the pg_data_t via a crude pre-bootmem method
  * We ought to relocate these onto their own node later on during boot.
@@ -144,13 +180,11 @@ static unsigned long calculate_numa_rema
 	unsigned long size, reserve_pages = 0;
 
 	for (nid = 1; nid < numnodes; nid++) {
-		/* calculate the size of the mem_map needed in bytes */
-		size = (node_end_pfn[nid] - node_start_pfn[nid] + 1) 
-			* sizeof(struct page) + sizeof(pg_data_t);
-		/* convert size to large (pmd size) pages, rounding up */
-		size = (size + LARGE_PAGE_BYTES - 1) / LARGE_PAGE_BYTES;
-		/* now the roundup is correct, convert to PAGE_SIZE pages */
-		size = size * PTRS_PER_PTE;
+		/* calculate the size of the mem_map needed in pages */
+		size = MEM_MAP_SIZE(nid) + PFN_UP(sizeof(pg_data_t))
+			+ PER_CPU_PAGES*MAX_NODE_CPUS;
+		/* round up to nearest pmd boundary */
+		size = (size + PTRS_PER_PTE - 1) & ~(PTRS_PER_PTE - 1);
 		printk("Reserving %ld pages of KVA for lmem_map of node %d\n",
 				size, nid);
 		node_remap_size[nid] = size;
diff -urpN linux-2.5.63-bk5/include/asm-generic/percpu.h pernode-2.5.63-bk5-3/include/asm-generic/percpu.h
--- linux-2.5.63-bk5/include/asm-generic/percpu.h	2003-02-24 11:05:13.000000000 -0800
+++ pernode-2.5.63-bk5-3/include/asm-generic/percpu.h	2003-03-02 02:55:14.000000000 -0800
@@ -25,8 +25,8 @@ extern unsigned long __per_cpu_offset[NR
     __typeof__(type) name##__per_cpu
 #endif
 
-#define per_cpu(var, cpu)			((void)cpu, var##__per_cpu)
-#define __get_cpu_var(var)			var##__per_cpu
+#define per_cpu(var, cpu)		( (void)(cpu), var##__per_cpu )
+#define __get_cpu_var(var)		var##__per_cpu
 
 #endif	/* SMP */
 
diff -urpN linux-2.5.63-bk5/include/asm-i386/numaq.h pernode-2.5.63-bk5-3/include/asm-i386/numaq.h
--- linux-2.5.63-bk5/include/asm-i386/numaq.h	2003-03-02 01:05:09.000000000 -0800
+++ pernode-2.5.63-bk5-3/include/asm-i386/numaq.h	2003-03-02 02:55:14.000000000 -0800
@@ -39,8 +39,9 @@
 extern int physnode_map[];
 #define pfn_to_nid(pfn)	({ physnode_map[(pfn) / PAGES_PER_ELEMENT]; })
 #define pfn_to_pgdat(pfn) NODE_DATA(pfn_to_nid(pfn))
-#define PHYSADDR_TO_NID(pa) pfn_to_nid(pa >> PAGE_SHIFT)
-#define MAX_NUMNODES		8
+#define PHYSADDR_TO_NID(pa) pfn_to_nid((pa) >> PAGE_SHIFT)
+#define MAX_NUMNODES		16
+#define MAX_NODE_CPUS		4
 extern void get_memcfg_numaq(void);
 #define get_memcfg_numa() get_memcfg_numaq()
 
@@ -169,9 +170,9 @@ struct sys_cfg_data {
         struct	eachquadmem eq[MAX_NUMNODES];	/* indexed by quad id */
 };
 
-static inline unsigned long get_zholes_size(int nid)
+static inline unsigned long *get_zholes_size(int nid)
 {
-	return 0;
+	return NULL;
 }
 #endif /* CONFIG_X86_NUMAQ */
 #endif /* NUMAQ_H */
diff -urpN linux-2.5.63-bk5/include/asm-i386/percpu.h pernode-2.5.63-bk5-3/include/asm-i386/percpu.h
--- linux-2.5.63-bk5/include/asm-i386/percpu.h	2003-02-24 11:05:44.000000000 -0800
+++ pernode-2.5.63-bk5-3/include/asm-i386/percpu.h	2003-03-02 02:55:14.000000000 -0800
@@ -3,4 +3,9 @@
 
 #include <asm-generic/percpu.h>
 
+#ifdef CONFIG_NUMA
+#undef	__GENERIC_PER_CPU
+void setup_per_cpu_areas(void);
+#endif
+
 #endif /* __ARCH_I386_PERCPU__ */
diff -urpN linux-2.5.63-bk5/include/asm-i386/srat.h pernode-2.5.63-bk5-3/include/asm-i386/srat.h
--- linux-2.5.63-bk5/include/asm-i386/srat.h	2003-03-02 01:05:09.000000000 -0800
+++ pernode-2.5.63-bk5-3/include/asm-i386/srat.h	2003-03-02 02:55:14.000000000 -0800
@@ -37,8 +37,9 @@
 extern int pfnnode_map[];
 #define pfn_to_nid(pfn) ({ pfnnode_map[PFN_TO_ELEMENT(pfn)]; })
 #define pfn_to_pgdat(pfn) NODE_DATA(pfn_to_nid(pfn))
-#define PHYSADDR_TO_NID(pa) pfn_to_nid(pa >> PAGE_SHIFT)
+#define PHYSADDR_TO_NID(pa) pfn_to_nid((pa) >> PAGE_SHIFT)
 #define MAX_NUMNODES		8
+#define MAX_NODE_CPUS		4
 extern void get_memcfg_from_srat(void);
 extern unsigned long *get_zholes_size(int);
 #define get_memcfg_numa() get_memcfg_from_srat()


* Re: percpu-2.5.63-bk5-1 (properly generated)
  2003-03-03  1:43                 ` William Lee Irwin III
@ 2003-03-03 17:40                   ` Martin J. Bligh
  2003-03-03 22:51                     ` William Lee Irwin III
  0 siblings, 1 reply; 16+ messages in thread
From: Martin J. Bligh @ 2003-03-03 17:40 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel


> On Sun, Mar 02, 2003 at 04:07:01PM -0800, Martin J. Bligh wrote:
>> Failing that, if you can split it into 3 or 4 patches along the lines I
>> suggested earlier, I'll try benching each bit separately for you.
> 
> Last ditch effort. No per_node stuff at all, and no new per_cpu users.
 

OK, that seems to get rid of the SDET degradation, but I rigged up the
same test you were doing (make -j) and see only marginal improvement
from the full patch (pernode2) ... not the 6s you were seeing.

Kernbench: (make -j N vmlinux, where N = 2 x num_cpus)
                              Elapsed      System        User         CPU
              2.5.63-mjb2       44.09       94.38      557.26     1477.00
     2.5.63-mjb2-pernode2       44.54       96.38      557.30     1466.75
     2.5.63-mjb2-pernode3       44.01       95.22      556.69     1481.25

Kernbench: (make -j N vmlinux, where N = 16 x num_cpus)
                              Elapsed      System        User         CPU
              2.5.63-mjb2       45.53      118.06      560.48     1489.50
     2.5.63-mjb2-pernode2       45.25      116.68      561.28     1497.50
     2.5.63-mjb2-pernode3       45.30      116.91      559.82     1492.00

Kernbench: (make -j vmlinux, maximal tasks)
                              Elapsed      System        User         CPU
              2.5.63-mjb2       45.17      117.80      560.62     1500.50
     2.5.63-mjb2-pernode2       44.91      115.95      560.98     1505.25
     2.5.63-mjb2-pernode3       45.47      118.07      560.25     1491.75


-pernode2 was your full patch with the fix you sent, -pernode3 was the 
smaller patch you sent last. Can you try to reproduce the improvement
you were seeing, and grab a before and after profile? I don't seem to be 
able to replicate it.

M.



* Re: percpu-2.5.63-bk5-1 (properly generated)
  2003-03-03 17:40                   ` Martin J. Bligh
@ 2003-03-03 22:51                     ` William Lee Irwin III
  2003-03-03 23:30                       ` Martin J. Bligh
  0 siblings, 1 reply; 16+ messages in thread
From: William Lee Irwin III @ 2003-03-03 22:51 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel

On Mon, Mar 03, 2003 at 09:40:01AM -0800, Martin J. Bligh wrote:
> OK, that seems to get rid of the SDET degradation, but I rigged up the
> same test you were doing (make -j) and see only marginal improvement
> from the full patch (pernode2) ... not the 6s you were seeing.
> -pernode2 was your full patch with the fix you sent, -pernode3 was the 
> smaller patch you sent last. Can you try to reproduce the improvement
> you were seeing, and grab a before and after profile? I don't seem to be 
> able to replicate it.

Then there must have been something important in the new per_cpu users.


-- wli


* Re: percpu-2.5.63-bk5-1 (properly generated)
  2003-03-03 22:51                     ` William Lee Irwin III
@ 2003-03-03 23:30                       ` Martin J. Bligh
  2003-03-04  0:14                         ` William Lee Irwin III
  0 siblings, 1 reply; 16+ messages in thread
From: Martin J. Bligh @ 2003-03-03 23:30 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel

>> OK, that seems to get rid of the SDET degradation, but I rigged up the
>> same test you were doing (make -j) and see only marginal improvement
>> from the full patch (pernode2) ... not the 6s you were seeing.
>> -pernode2 was your full patch with the fix you sent, -pernode3 was the 
>> smaller patch you sent last. Can you try to reproduce the improvement
>> you were seeing, and grab a before and after profile? I don't seem to be 
>> able to replicate it.
> 
> Then there must have been something important in the new per_cpu users.

-pernode2 had all your changes ... but I still don't see anything like
the order of magnitude of benefit you were seeing.

M.



* Re: percpu-2.5.63-bk5-1 (properly generated)
  2003-03-03 23:30                       ` Martin J. Bligh
@ 2003-03-04  0:14                         ` William Lee Irwin III
  0 siblings, 0 replies; 16+ messages in thread
From: William Lee Irwin III @ 2003-03-04  0:14 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel

At some point in the past, I wrote:
>> Then there must have been something important in the new per_cpu users.

On Mon, Mar 03, 2003 at 03:30:18PM -0800, Martin J. Bligh wrote:
> -pernode2 had all your changes ... but I still don't see anything like
> the order of magnitude of benefit you were seeing.

Well, something in the mix of new per_cpu and/or per_node users caused
a regression on "that unmentionable benchmark". There's something
different about 2.5.x and 2.4.x kernel compiles that makes the numbers
incomparable. And since the total sum of the benefit of the new
per_cpu/per_node users is negligible along with the total benefit of
the entire thing, there must be something different going on.

Maybe the effect is just tiny, maybe there isn't enough locality of
reference for this to ever do anything, or maybe 2.4.x and 2.5.x
kernel compiles are really that different.

I haven't really got the patience for that kind of an investigation.
It wasn't a slam dunk so I'd rather not bother with it anymore now.


-- wli


* Re: percpu-2.5.63-bk5-1 (properly generated)
  2003-03-02 11:07 William Lee Irwin III
@ 2003-03-02 13:15 ` William Lee Irwin III
  0 siblings, 0 replies; 16+ messages in thread
From: William Lee Irwin III @ 2003-03-02 13:15 UTC (permalink / raw)
  To: linux-kernel

On Sun, Mar 02, 2003 at 03:07:47AM -0800, William Lee Irwin III wrote:
> This patch does 3 different things:
> (1) shoves per-cpu areas into node-local memory
> (2) creates a new per-node thing analogous to per-cpu
> (3) uses (1) and (2) to shove several frequently-accessed things into
>         node-local memory
> Tested, boots, and runs on NUMA-Q. Trims 6s of 41s off kernel compiles.
> Compiletested for walmart x86 SMP/UP, and could use runtime testing.
> A few non-x86 arches probably need fixups for per_cpu irq_stat[].
> Also available at:
> ftp://ftp.kernel.org/pub/linux/kernel/people/wli/percpu/

Okay, I got requests for a more detailed changelog, so here it is:
This patch does 19 different things when put under the microscope
and/or in an excessively finegrained subdivision of simple concepts.

(1) reuse the arch/i386/mm/discontig.c per-node mem_map[] virtual remap
	to remap node-local memory backing per_cpu and per_node areas
(2)  make irq_stat[] per_cpu, x86-only
(3)  make mmu_gathers[] per_cpu, with concomitant divorce from asm-generic
(4)  delay discontig zone_sizes_init() to dodge bootstrap order issues
(5)  add .data.pernode section handling in vmlinux.lds.S
(6)  per_cpu()/__get_cpu_var() needs to parenthesize the cpu arg
(7)  introduced asm-generic/pernode.h to do similar things as percpu.h
(8)  added MAX_NODE_CPUS for pessimistic sizing of virtual remapping arenas
(9)  fix return type error in NUMA-Q get_zholes_size()
(10) #undef asm-i386/per{cpu,node}.h's __GENERIC_PER_{CPU,NODE}
(11) declare setup_per_cpu_areas() in asm-i386/percpu.h
(12) make an asm-i386/pernode.h stub header like include/asm-generic/pernode.h
(13) declare MAX_NODE_CPUS in include/asm-i386/srat.h
(14) make zone_table[] per_node
(15) call setup_per_node_areas() in init/main.c, with analogous hooks
(16) make task_cache per_cpu
(17) make runqueues[] per_cpu
(18) make node_nr_running[] per_node
(19) make reap_timers[] per_cpu

-- wli


* percpu-2.5.63-bk5-1 (properly generated)
@ 2003-03-02 11:07 William Lee Irwin III
  2003-03-02 13:15 ` William Lee Irwin III
  0 siblings, 1 reply; 16+ messages in thread
From: William Lee Irwin III @ 2003-03-02 11:07 UTC (permalink / raw)
  To: linux-kernel

This patch does 3 different things:
(1) shoves per-cpu areas into node-local memory
(2) creates a new per-node thing analogous to per-cpu
(3) uses (1) and (2) to shove several frequently-accessed things into
        node-local memory

Tested, boots, and runs on NUMA-Q. Trims 6s of 41s off kernel compiles.
Compiletested for walmart x86 SMP/UP, and could use runtime testing.
A few non-x86 arches probably need fixups for per_cpu irq_stat[].

Also available at:
ftp://ftp.kernel.org/pub/linux/kernel/people/wli/percpu/

-- wli


 arch/i386/kernel/apic.c       |    2 
 arch/i386/kernel/io_apic.c    |    2 
 arch/i386/kernel/irq.c        |    2 
 arch/i386/kernel/nmi.c        |    4 -
 arch/i386/kernel/process.c    |    2 
 arch/i386/mm/discontig.c      |   83 ++++++++++++++++++++++++---
 arch/i386/mm/init.c           |    4 -
 arch/i386/vmlinux.lds.S       |    4 +
 include/asm-generic/percpu.h  |    4 -
 include/asm-generic/pernode.h |   39 ++++++++++++
 include/asm-i386/numaq.h      |    9 +-
 include/asm-i386/percpu.h     |    5 +
 include/asm-i386/pernode.h    |   11 +++
 include/asm-i386/srat.h       |    3 
 include/asm-i386/tlb.h        |  128 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/irq_cpustat.h   |   10 +--
 include/linux/mm.h            |    6 +
 init/main.c                   |   30 +++++++++
 kernel/fork.c                 |   10 +--
 kernel/ksyms.c                |    2 
 kernel/sched.c                |   18 ++---
 kernel/softirq.c              |    2 
 mm/page_alloc.c               |    6 -
 mm/slab.c                     |    6 -
 24 files changed, 338 insertions(+), 54 deletions(-)


diff -urpN linux-2.5.63-bk5/arch/i386/kernel/apic.c pernode-2.5.63-bk5-1/arch/i386/kernel/apic.c
--- linux-2.5.63-bk5/arch/i386/kernel/apic.c	2003-03-02 01:05:07.000000000 -0800
+++ pernode-2.5.63-bk5-1/arch/i386/kernel/apic.c	2003-03-02 02:55:14.000000000 -0800
@@ -1060,7 +1060,7 @@ void smp_apic_timer_interrupt(struct pt_
 	/*
 	 * the NMI deadlock-detector uses this.
 	 */
-	irq_stat[cpu].apic_timer_irqs++;
+	per_cpu(irq_stat, cpu).apic_timer_irqs++;
 
 	/*
 	 * NOTE! We'd better ACK the irq immediately,
diff -urpN linux-2.5.63-bk5/arch/i386/kernel/io_apic.c pernode-2.5.63-bk5-1/arch/i386/kernel/io_apic.c
--- linux-2.5.63-bk5/arch/i386/kernel/io_apic.c	2003-03-02 01:05:07.000000000 -0800
+++ pernode-2.5.63-bk5-1/arch/i386/kernel/io_apic.c	2003-03-02 02:55:14.000000000 -0800
@@ -237,7 +237,7 @@ struct irq_cpu_info {
 #define IRQ_DELTA(cpu,irq) 	(irq_cpu_data[cpu].irq_delta[irq])
 
 #define IDLE_ENOUGH(cpu,now) \
-		(idle_cpu(cpu) && ((now) - irq_stat[(cpu)].idle_timestamp > 1))
+		(idle_cpu(cpu) && ((now) - per_cpu(irq_stat, cpu).idle_timestamp > 1))
 
 #define IRQ_ALLOWED(cpu,allowed_mask) \
 		((1 << cpu) & (allowed_mask))
diff -urpN linux-2.5.63-bk5/arch/i386/kernel/irq.c pernode-2.5.63-bk5-1/arch/i386/kernel/irq.c
--- linux-2.5.63-bk5/arch/i386/kernel/irq.c	2003-03-02 01:05:07.000000000 -0800
+++ pernode-2.5.63-bk5-1/arch/i386/kernel/irq.c	2003-03-02 02:55:14.000000000 -0800
@@ -171,7 +171,7 @@ int show_interrupts(struct seq_file *p, 
 	seq_printf(p, "LOC: ");
 	for (j = 0; j < NR_CPUS; j++)
 		if (cpu_online(j))
-			p += seq_printf(p, "%10u ", irq_stat[j].apic_timer_irqs);
+			p += seq_printf(p, "%10u ", per_cpu(irq_stat, j).apic_timer_irqs);
 	seq_putc(p, '\n');
 #endif
 	seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count));
diff -urpN linux-2.5.63-bk5/arch/i386/kernel/nmi.c pernode-2.5.63-bk5-1/arch/i386/kernel/nmi.c
--- linux-2.5.63-bk5/arch/i386/kernel/nmi.c	2003-03-02 01:05:07.000000000 -0800
+++ pernode-2.5.63-bk5-1/arch/i386/kernel/nmi.c	2003-03-02 02:55:14.000000000 -0800
@@ -76,7 +76,7 @@ int __init check_nmi_watchdog (void)
 	printk(KERN_INFO "testing NMI watchdog ... ");
 
 	for (cpu = 0; cpu < NR_CPUS; cpu++)
-		prev_nmi_count[cpu] = irq_stat[cpu].__nmi_count;
+		prev_nmi_count[cpu] = per_cpu(irq_stat, cpu).__nmi_count;
 	local_irq_enable();
 	mdelay((10*1000)/nmi_hz); // wait 10 ticks
 
@@ -358,7 +358,7 @@ void nmi_watchdog_tick (struct pt_regs *
 	 */
 	int sum, cpu = smp_processor_id();
 
-	sum = irq_stat[cpu].apic_timer_irqs;
+	sum = per_cpu(irq_stat, cpu).apic_timer_irqs;
 
 	if (last_irq_sums[cpu] == sum) {
 		/*
diff -urpN linux-2.5.63-bk5/arch/i386/kernel/process.c pernode-2.5.63-bk5-1/arch/i386/kernel/process.c
--- linux-2.5.63-bk5/arch/i386/kernel/process.c	2003-02-24 11:05:04.000000000 -0800
+++ pernode-2.5.63-bk5-1/arch/i386/kernel/process.c	2003-03-02 02:55:14.000000000 -0800
@@ -141,7 +141,7 @@ void cpu_idle (void)
 		void (*idle)(void) = pm_idle;
 		if (!idle)
 			idle = default_idle;
-		irq_stat[smp_processor_id()].idle_timestamp = jiffies;
+		per_cpu(irq_stat, smp_processor_id()).idle_timestamp = jiffies;
 		while (!need_resched())
 			idle();
 		schedule();
diff -urpN linux-2.5.63-bk5/arch/i386/mm/discontig.c pernode-2.5.63-bk5-1/arch/i386/mm/discontig.c
--- linux-2.5.63-bk5/arch/i386/mm/discontig.c	2003-03-02 01:05:07.000000000 -0800
+++ pernode-2.5.63-bk5-1/arch/i386/mm/discontig.c	2003-03-02 02:55:14.000000000 -0800
@@ -48,8 +48,6 @@ extern unsigned long max_low_pfn;
 extern unsigned long totalram_pages;
 extern unsigned long totalhigh_pages;
 
-#define LARGE_PAGE_BYTES (PTRS_PER_PTE * PAGE_SIZE)
-
 unsigned long node_remap_start_pfn[MAX_NUMNODES];
 unsigned long node_remap_size[MAX_NUMNODES];
 unsigned long node_remap_offset[MAX_NUMNODES];
@@ -67,6 +65,74 @@ static void __init find_max_pfn_node(int
 		node_end_pfn[nid] = max_pfn;
 }
 
+extern char __per_cpu_start[], __per_cpu_end[];
+extern char __per_node_start[], __per_node_end[];
+unsigned long __per_cpu_offset[NR_CPUS], __per_node_offset[MAX_NR_NODES];
+
+#define PER_CPU_PAGES	PFN_UP((unsigned long)(__per_cpu_end-__per_cpu_start))
+#define PER_NODE_PAGES	PFN_UP((unsigned long)(__per_node_end-__per_node_start))
+#define MEM_MAP_SIZE(n)	PFN_UP((node_end_pfn[n]-node_start_pfn[n]+1)*sizeof(struct page))
+
+static void __init allocate_per_cpu_pages(int cpu)
+{
+	int cpu_in_node, node = cpu_to_node(cpu);
+	unsigned long vaddr, nodemask = node_to_cpumask(node);
+
+	if (!PER_CPU_PAGES || node >= numnodes)
+		return;
+
+	if (!node) {
+		vaddr  = (unsigned long)alloc_bootmem(PER_CPU_PAGES*PAGE_SIZE);
+		__per_cpu_offset[cpu] = vaddr - (unsigned long)__per_cpu_start;
+	} else {
+		vaddr = (unsigned long)node_remap_start_vaddr[node];
+		cpu_in_node = hweight32(nodemask & ((1UL << cpu) - 1));
+		__per_cpu_offset[cpu] = vaddr + PAGE_SIZE*MEM_MAP_SIZE(node)
+					+ PAGE_SIZE*PFN_UP(sizeof(pg_data_t))
+					+ PAGE_SIZE*PER_NODE_PAGES
+					+ PAGE_SIZE*PER_CPU_PAGES*cpu_in_node
+					- (unsigned long)__per_cpu_start;
+	}
+	memcpy(RELOC_HIDE((char *)__per_cpu_start, __per_cpu_offset[cpu]),
+			__per_cpu_start,
+			PER_CPU_PAGES*PAGE_SIZE);
+}
+
+static void __init allocate_per_node_pages(int node)
+{
+	unsigned long vaddr;
+
+	if (!node) {
+		vaddr = (unsigned long)alloc_bootmem(PER_NODE_PAGES*PAGE_SIZE);
+		__per_node_offset[node] = vaddr - (unsigned long)__per_node_start;
+	} else {
+		vaddr = (unsigned long)node_remap_start_vaddr[node];
+		__per_node_offset[node] = vaddr + PAGE_SIZE*MEM_MAP_SIZE(node)
+					+ PAGE_SIZE*PFN_UP(sizeof(pg_data_t))
+					- (unsigned long)__per_node_start;
+	}
+	memcpy(RELOC_HIDE((char *)__per_node_start, __per_node_offset[node]),
+			__per_node_start,
+			PER_NODE_PAGES*PAGE_SIZE);
+}
+
+void __init setup_per_cpu_areas(void)
+{
+	int cpu;
+	for (cpu = 0; cpu < NR_CPUS; ++cpu)
+		allocate_per_cpu_pages(cpu);
+}
+
+void __init setup_per_node_areas(void)
+{
+	int node;
+	void zone_sizes_init(void);
+
+	for (node = 0; node < numnodes; ++node)
+		allocate_per_node_pages(node);
+	zone_sizes_init();
+}
+
 /* 
  * Allocate memory for the pg_data_t via a crude pre-bootmem method
  * We ought to relocate these onto their own node later on during boot.
@@ -144,13 +210,12 @@ static unsigned long calculate_numa_rema
 	unsigned long size, reserve_pages = 0;
 
 	for (nid = 1; nid < numnodes; nid++) {
-		/* calculate the size of the mem_map needed in bytes */
-		size = (node_end_pfn[nid] - node_start_pfn[nid] + 1) 
-			* sizeof(struct page) + sizeof(pg_data_t);
-		/* convert size to large (pmd size) pages, rounding up */
-		size = (size + LARGE_PAGE_BYTES - 1) / LARGE_PAGE_BYTES;
-		/* now the roundup is correct, convert to PAGE_SIZE pages */
-		size = size * PTRS_PER_PTE;
+		/* calculate the size of the mem_map needed in pages */
+		size = MEM_MAP_SIZE(nid) + PFN_UP(sizeof(pg_data_t))
+			+ PER_NODE_PAGES
+			+ PER_CPU_PAGES*MAX_NODE_CPUS;
+		/* round up to nearest pmd boundary */
+		size = (size + PTRS_PER_PTE - 1) & ~(PTRS_PER_PTE - 1);
 		printk("Reserving %ld pages of KVA for lmem_map of node %d\n",
 				size, nid);
 		node_remap_size[nid] = size;
diff -urpN linux-2.5.63-bk5/arch/i386/mm/init.c pernode-2.5.63-bk5-1/arch/i386/mm/init.c
--- linux-2.5.63-bk5/arch/i386/mm/init.c	2003-02-24 11:05:39.000000000 -0800
+++ pernode-2.5.63-bk5-1/arch/i386/mm/init.c	2003-03-02 02:55:14.000000000 -0800
@@ -41,7 +41,7 @@
 #include <asm/tlbflush.h>
 #include <asm/sections.h>
 
-struct mmu_gather mmu_gathers[NR_CPUS];
+DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 unsigned long highstart_pfn, highend_pfn;
 
 /*
@@ -372,7 +372,9 @@ void __init paging_init(void)
 	__flush_tlb_all();
 
 	kmap_init();
+#ifndef CONFIG_DISCONTIGMEM
 	zone_sizes_init();
+#endif
 }
 
 /*
diff -urpN linux-2.5.63-bk5/arch/i386/vmlinux.lds.S pernode-2.5.63-bk5-1/arch/i386/vmlinux.lds.S
--- linux-2.5.63-bk5/arch/i386/vmlinux.lds.S	2003-02-24 11:05:11.000000000 -0800
+++ pernode-2.5.63-bk5-1/arch/i386/vmlinux.lds.S	2003-03-02 02:55:14.000000000 -0800
@@ -83,6 +83,10 @@ SECTIONS
   .data.percpu  : { *(.data.percpu) }
   __per_cpu_end = .;
   . = ALIGN(4096);
+  __per_node_start = .;
+  .data.pernode  : { *(.data.pernode) }
+  __per_node_end = .;
+  . = ALIGN(4096);
   __init_end = .;
   /* freed after init ends here */
 	
diff -urpN linux-2.5.63-bk5/include/asm-generic/percpu.h pernode-2.5.63-bk5-1/include/asm-generic/percpu.h
--- linux-2.5.63-bk5/include/asm-generic/percpu.h	2003-02-24 11:05:13.000000000 -0800
+++ pernode-2.5.63-bk5-1/include/asm-generic/percpu.h	2003-03-02 02:55:14.000000000 -0800
@@ -25,8 +25,8 @@ extern unsigned long __per_cpu_offset[NR
     __typeof__(type) name##__per_cpu
 #endif
 
-#define per_cpu(var, cpu)			((void)cpu, var##__per_cpu)
-#define __get_cpu_var(var)			var##__per_cpu
+#define per_cpu(var, cpu)		( (void)(cpu), var##__per_cpu )
+#define __get_cpu_var(var)		var##__per_cpu
 
 #endif	/* SMP */
 
diff -urpN linux-2.5.63-bk5/include/asm-generic/pernode.h pernode-2.5.63-bk5-1/include/asm-generic/pernode.h
--- linux-2.5.63-bk5/include/asm-generic/pernode.h	1969-12-31 16:00:00.000000000 -0800
+++ pernode-2.5.63-bk5-1/include/asm-generic/pernode.h	2003-03-02 02:55:14.000000000 -0800
@@ -0,0 +1,39 @@
+#ifndef _ASM_GENERIC_PERNODE_H_
+#define _ASM_GENERIC_PERNODE_H_
+#include <linux/config.h>
+#include <linux/compiler.h>
+
+#define __GENERIC_PER_NODE
+#ifdef CONFIG_DISCONTIGMEM
+
+extern unsigned long __per_node_offset[MAX_NR_NODES];
+
+/* Separate out the type, so (int[3], foo) works. */
+#ifndef MODULE
+#define DEFINE_PER_NODE(type, name) \
+    __attribute__((__section__(".data.pernode"))) __typeof__(type) name##__per_node
+#endif
+
+/* var is in discarded region: offset to particular copy we want */
+#define per_node(var, node) (*RELOC_HIDE(&var##__per_node, __per_node_offset[node]))
+#define __get_node_var(var) per_node(var, numa_node_id())
+
+#else /* !CONFIG_DISCONTIGMEM */
+
+/* Can't define per-node variables in modules.  Sorry -- wli */
+#ifndef MODULE
+#define DEFINE_PER_NODE(type, name) \
+    __typeof__(type) name##__per_node
+#endif
+
+#define per_node(var, node)		( (void)(node), var##__per_node )
+#define __get_node_var(var)		var##__per_node
+
+#endif	/* CONFIG_DISCONTIGMEM */
+
+#define DECLARE_PER_NODE(type, name) extern __typeof__(type) name##__per_node
+
+#define EXPORT_PER_NODE_SYMBOL(var) EXPORT_SYMBOL(var##__per_node)
+#define EXPORT_PER_NODE_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(var##__per_node)
+
+#endif /* _ASM_GENERIC_PERNODE_H_ */
diff -urpN linux-2.5.63-bk5/include/asm-i386/numaq.h pernode-2.5.63-bk5-1/include/asm-i386/numaq.h
--- linux-2.5.63-bk5/include/asm-i386/numaq.h	2003-03-02 01:05:09.000000000 -0800
+++ pernode-2.5.63-bk5-1/include/asm-i386/numaq.h	2003-03-02 02:55:14.000000000 -0800
@@ -39,8 +39,9 @@
 extern int physnode_map[];
 #define pfn_to_nid(pfn)	({ physnode_map[(pfn) / PAGES_PER_ELEMENT]; })
 #define pfn_to_pgdat(pfn) NODE_DATA(pfn_to_nid(pfn))
-#define PHYSADDR_TO_NID(pa) pfn_to_nid(pa >> PAGE_SHIFT)
-#define MAX_NUMNODES		8
+#define PHYSADDR_TO_NID(pa) pfn_to_nid((pa) >> PAGE_SHIFT)
+#define MAX_NUMNODES		16
+#define MAX_NODE_CPUS		4
 extern void get_memcfg_numaq(void);
 #define get_memcfg_numa() get_memcfg_numaq()
 
@@ -169,9 +170,9 @@ struct sys_cfg_data {
         struct	eachquadmem eq[MAX_NUMNODES];	/* indexed by quad id */
 };
 
-static inline unsigned long get_zholes_size(int nid)
+static inline unsigned long *get_zholes_size(int nid)
 {
-	return 0;
+	return NULL;
 }
 #endif /* CONFIG_X86_NUMAQ */
 #endif /* NUMAQ_H */
diff -urpN linux-2.5.63-bk5/include/asm-i386/percpu.h pernode-2.5.63-bk5-1/include/asm-i386/percpu.h
--- linux-2.5.63-bk5/include/asm-i386/percpu.h	2003-02-24 11:05:44.000000000 -0800
+++ pernode-2.5.63-bk5-1/include/asm-i386/percpu.h	2003-03-02 02:55:14.000000000 -0800
@@ -3,4 +3,9 @@
 
 #include <asm-generic/percpu.h>
 
+#ifdef CONFIG_NUMA
+#undef	__GENERIC_PER_CPU
+void setup_per_cpu_areas(void);
+#endif
+
 #endif /* __ARCH_I386_PERCPU__ */
diff -urpN linux-2.5.63-bk5/include/asm-i386/pernode.h pernode-2.5.63-bk5-1/include/asm-i386/pernode.h
--- linux-2.5.63-bk5/include/asm-i386/pernode.h	1969-12-31 16:00:00.000000000 -0800
+++ pernode-2.5.63-bk5-1/include/asm-i386/pernode.h	2003-03-02 02:55:14.000000000 -0800
@@ -0,0 +1,11 @@
+#ifndef __ARCH_I386_PERNODE__
+#define __ARCH_I386_PERNODE__
+
+#include <asm-generic/pernode.h>
+
+#ifdef CONFIG_DISCONTIGMEM
+#undef	__GENERIC_PER_NODE
+void setup_per_node_areas(void);
+#endif
+
+#endif /* __ARCH_I386_PERNODE__ */
diff -urpN linux-2.5.63-bk5/include/asm-i386/srat.h pernode-2.5.63-bk5-1/include/asm-i386/srat.h
--- linux-2.5.63-bk5/include/asm-i386/srat.h	2003-03-02 01:05:09.000000000 -0800
+++ pernode-2.5.63-bk5-1/include/asm-i386/srat.h	2003-03-02 02:55:14.000000000 -0800
@@ -37,8 +37,9 @@
 extern int pfnnode_map[];
 #define pfn_to_nid(pfn) ({ pfnnode_map[PFN_TO_ELEMENT(pfn)]; })
 #define pfn_to_pgdat(pfn) NODE_DATA(pfn_to_nid(pfn))
-#define PHYSADDR_TO_NID(pa) pfn_to_nid(pa >> PAGE_SHIFT)
+#define PHYSADDR_TO_NID(pa) pfn_to_nid((pa) >> PAGE_SHIFT)
 #define MAX_NUMNODES		8
+#define MAX_NODE_CPUS		4
 extern void get_memcfg_from_srat(void);
 extern unsigned long *get_zholes_size(int);
 #define get_memcfg_numa() get_memcfg_from_srat()
diff -urpN linux-2.5.63-bk5/include/asm-i386/tlb.h pernode-2.5.63-bk5-1/include/asm-i386/tlb.h
--- linux-2.5.63-bk5/include/asm-i386/tlb.h	2003-02-24 11:05:14.000000000 -0800
+++ pernode-2.5.63-bk5-1/include/asm-i386/tlb.h	2003-03-02 02:55:14.000000000 -0800
@@ -1,6 +1,10 @@
 #ifndef _I386_TLB_H
 #define _I386_TLB_H
 
+#include <linux/config.h>
+#include <asm/tlbflush.h>
+#include <asm/percpu.h>
+
 /*
  * x86 doesn't need any special per-pte or
  * per-vma handling..
@@ -15,6 +19,128 @@
  */
 #define tlb_flush(tlb) flush_tlb_mm((tlb)->mm)
 
-#include <asm-generic/tlb.h>
+/*
+ * For UP we don't need to worry about TLB flush
+ * and page free order so much..
+ */
+#ifdef CONFIG_SMP
+  #define FREE_PTE_NR	506
+  #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U)
+#else
+  #define FREE_PTE_NR	1
+  #define tlb_fast_mode(tlb) 1
+#endif
+
+/* struct mmu_gather is an opaque type used by the mm code for passing around
+ * any data needed by arch specific code for tlb_remove_page.  This structure
+ * can be per-CPU or per-MM as the page table lock is held for the duration of
+ * TLB shootdown.
+ */
+struct mmu_gather {
+	struct mm_struct	*mm;
+	unsigned int		nr;	/* set to ~0U means fast mode */
+	unsigned int		need_flush;/* Really unmapped some ptes? */
+	unsigned int		fullmm; /* non-zero means full mm flush */
+	unsigned long		freed;
+	struct page *		pages[FREE_PTE_NR];
+};
+
+/* Users of the generic TLB shootdown code must declare this storage space. */
+DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
+
+/* tlb_gather_mmu
+ *	Return a pointer to an initialized struct mmu_gather.
+ */
+static inline struct mmu_gather *
+tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+{
+	struct mmu_gather *tlb = &per_cpu(mmu_gathers, smp_processor_id());
+
+	tlb->mm = mm;
+
+	/* Use fast mode if only one CPU is online */
+	tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
+
+	tlb->fullmm = full_mm_flush;
+	tlb->freed = 0;
+
+	return tlb;
+}
+
+static inline void
+tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
+{
+	if (!tlb->need_flush)
+		return;
+	tlb->need_flush = 0;
+	tlb_flush(tlb);
+	if (!tlb_fast_mode(tlb)) {
+		free_pages_and_swap_cache(tlb->pages, tlb->nr);
+		tlb->nr = 0;
+	}
+}
+
+/* tlb_finish_mmu
+ *	Called at the end of the shootdown operation to free up any resources
+ *	that were required.  The page table lock is still held at this point.
+ */
+static inline void
+tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
+{
+	int freed = tlb->freed;
+	struct mm_struct *mm = tlb->mm;
+	int rss = mm->rss;
+
+	if (rss < freed)
+		freed = rss;
+	mm->rss = rss - freed;
+	tlb_flush_mmu(tlb, start, end);
+
+	/* keep the page table cache within bounds */
+	check_pgt_cache();
+}
+
+
+/* void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+ *	Must perform the equivalent to __free_pte(pte_get_and_clear(ptep)), while
+ *	handling the additional races in SMP caused by other CPUs caching valid
+ *	mappings in their TLBs.
+ */
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+	tlb->need_flush = 1;
+	if (tlb_fast_mode(tlb)) {
+		free_page_and_swap_cache(page);
+		return;
+	}
+	tlb->pages[tlb->nr++] = page;
+	if (tlb->nr >= FREE_PTE_NR)
+		tlb_flush_mmu(tlb, 0, 0);
+}
+
+/**
+ * tlb_remove_tlb_entry - remember a pte unmapping for later tlb invalidation.
+ *
+ * Record the fact that pte's were really unmapped in ->need_flush, so we can
+ * later optimise away the tlb invalidate.   This helps when userspace is
+ * unmapping already-unmapped pages, which happens quite a lot.
+ */
+#define tlb_remove_tlb_entry(tlb, ptep, address)		\
+	do {							\
+		tlb->need_flush = 1;				\
+		__tlb_remove_tlb_entry(tlb, ptep, address);	\
+	} while (0)
+
+#define pte_free_tlb(tlb, ptep)					\
+	do {							\
+		tlb->need_flush = 1;				\
+		__pte_free_tlb(tlb, ptep);			\
+	} while (0)
+
+#define pmd_free_tlb(tlb, pmdp)					\
+	do {							\
+		tlb->need_flush = 1;				\
+		__pmd_free_tlb(tlb, pmdp);			\
+	} while (0)
 
 #endif
diff -urpN linux-2.5.63-bk5/include/linux/irq_cpustat.h pernode-2.5.63-bk5-1/include/linux/irq_cpustat.h
--- linux-2.5.63-bk5/include/linux/irq_cpustat.h	2003-02-24 11:05:44.000000000 -0800
+++ pernode-2.5.63-bk5-1/include/linux/irq_cpustat.h	2003-03-02 02:55:14.000000000 -0800
@@ -17,14 +17,12 @@
  * definitions instead of differing sets for each arch.
  */
 
-extern irq_cpustat_t irq_stat[];			/* defined in asm/hardirq.h */
+/* defined in kernel/softirq.c */
+DECLARE_PER_CPU(irq_cpustat_t, irq_stat);
 
 #ifndef __ARCH_IRQ_STAT /* Some architectures can do this more efficiently */ 
-#ifdef CONFIG_SMP
-#define __IRQ_STAT(cpu, member)	(irq_stat[cpu].member)
-#else
-#define __IRQ_STAT(cpu, member)	((void)(cpu), irq_stat[0].member)
-#endif	
+
+#define __IRQ_STAT(cpu, member)	(per_cpu(irq_stat, cpu).member)
 #endif
 
   /* arch independent irq_stat fields */
diff -urpN linux-2.5.63-bk5/include/linux/mm.h pernode-2.5.63-bk5-1/include/linux/mm.h
--- linux-2.5.63-bk5/include/linux/mm.h	2003-03-02 01:05:09.000000000 -0800
+++ pernode-2.5.63-bk5-1/include/linux/mm.h	2003-03-02 02:55:14.000000000 -0800
@@ -26,6 +26,7 @@ extern int page_cluster;
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/atomic.h>
+#include <asm/pernode.h>
 
 /*
  * Linux kernel virtual memory manager primitives.
@@ -318,11 +319,12 @@ static inline void put_page(struct page 
 #define ZONE_SHIFT (BITS_PER_LONG - 8)
 
 struct zone;
-extern struct zone *zone_table[];
+DECLARE_PER_NODE(struct zone *[MAX_NR_ZONES], zone_table);
 
 static inline struct zone *page_zone(struct page *page)
 {
-	return zone_table[page->flags >> ZONE_SHIFT];
+	unsigned long zone = page->flags >> ZONE_SHIFT;
+	return per_node(zone_table, zone/MAX_NR_ZONES)[zone % MAX_NR_ZONES];
 }
 
 static inline void set_page_zone(struct page *page, unsigned long zone_num)
diff -urpN linux-2.5.63-bk5/init/main.c pernode-2.5.63-bk5-1/init/main.c
--- linux-2.5.63-bk5/init/main.c	2003-02-24 11:05:11.000000000 -0800
+++ pernode-2.5.63-bk5-1/init/main.c	2003-03-02 02:55:14.000000000 -0800
@@ -29,6 +29,7 @@
 #include <linux/tty.h>
 #include <linux/gfp.h>
 #include <linux/percpu.h>
+#include <asm/pernode.h>
 #include <linux/kernel_stat.h>
 #include <linux/security.h>
 #include <linux/workqueue.h>
@@ -277,6 +278,10 @@ __setup("init=", init_setup);
 extern void setup_arch(char **);
 extern void cpu_idle(void);
 
+#ifndef CONFIG_NUMA
+static inline void setup_per_node_areas(void) { }
+#endif
+
 #ifndef CONFIG_SMP
 
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -317,6 +322,30 @@ static void __init setup_per_cpu_areas(v
 }
 #endif /* !__GENERIC_PER_CPU */
 
+#if defined(__GENERIC_PER_NODE) && defined(CONFIG_NUMA)
+unsigned long __per_node_offset[MAX_NR_NODES];
+
+static void __init setup_per_node_areas(void)
+{
+	unsigned long size, i;
+	char *ptr;
+	/* Created by linker magic */
+	extern char __per_node_start[], __per_node_end[];
+
+	/* Copy section for each node (we discard the original) */
+	size = ALIGN(__per_node_end - __per_node_start, SMP_CACHE_BYTES);
+	if (!size)
+		return;
+
+	ptr = alloc_bootmem(size * MAX_NR_NODES);
+
+	for (i = 0; i < MAX_NR_NODES; i++, ptr += size) {
+		__per_node_offset[i] = ptr - __per_node_start;
+		memcpy(ptr, __per_node_start, size);
+	}
+}
+#endif /* __GENERIC_PER_NODE && CONFIG_NUMA */
+
 /* Called by boot processor to activate the rest. */
 static void __init smp_init(void)
 {
@@ -376,6 +405,7 @@ asmlinkage void __init start_kernel(void
 	printk(linux_banner);
 	setup_arch(&command_line);
 	setup_per_cpu_areas();
+	setup_per_node_areas();
 
 	/*
 	 * Mark the boot cpu "online" so that it can call console drivers in
diff -urpN linux-2.5.63-bk5/kernel/fork.c pernode-2.5.63-bk5-1/kernel/fork.c
--- linux-2.5.63-bk5/kernel/fork.c	2003-03-02 01:05:09.000000000 -0800
+++ pernode-2.5.63-bk5-1/kernel/fork.c	2003-03-02 02:55:14.000000000 -0800
@@ -58,7 +58,7 @@ rwlock_t tasklist_lock __cacheline_align
  * the very last portion of sys_exit() is executed with
  * preemption turned off.
  */
-static task_t *task_cache[NR_CPUS] __cacheline_aligned;
+static DEFINE_PER_CPU(task_t *, task_cache);
 
 int nr_processes(void)
 {
@@ -86,12 +86,12 @@ static void free_task_struct(struct task
 	} else {
 		int cpu = get_cpu();
 
-		tsk = task_cache[cpu];
+		tsk = per_cpu(task_cache, cpu);
 		if (tsk) {
 			free_thread_info(tsk->thread_info);
 			kmem_cache_free(task_struct_cachep,tsk);
 		}
-		task_cache[cpu] = current;
+		per_cpu(task_cache, cpu) = current;
 		put_cpu();
 	}
 }
@@ -214,8 +214,8 @@ static struct task_struct *dup_task_stru
 	struct thread_info *ti;
 	int cpu = get_cpu();
 
-	tsk = task_cache[cpu];
-	task_cache[cpu] = NULL;
+	tsk = per_cpu(task_cache, cpu);
+	per_cpu(task_cache, cpu) = NULL;
 	put_cpu();
 	if (!tsk) {
 		ti = alloc_thread_info();
diff -urpN linux-2.5.63-bk5/kernel/ksyms.c pernode-2.5.63-bk5-1/kernel/ksyms.c
--- linux-2.5.63-bk5/kernel/ksyms.c	2003-02-24 11:05:05.000000000 -0800
+++ pernode-2.5.63-bk5-1/kernel/ksyms.c	2003-03-02 02:55:14.000000000 -0800
@@ -405,7 +405,7 @@ EXPORT_SYMBOL(add_timer);
 EXPORT_SYMBOL(del_timer);
 EXPORT_SYMBOL(request_irq);
 EXPORT_SYMBOL(free_irq);
-EXPORT_SYMBOL(irq_stat);
+EXPORT_PER_CPU_SYMBOL(irq_stat);
 
 /* waitqueue handling */
 EXPORT_SYMBOL(add_wait_queue);
diff -urpN linux-2.5.63-bk5/kernel/sched.c pernode-2.5.63-bk5-1/kernel/sched.c
--- linux-2.5.63-bk5/kernel/sched.c	2003-02-24 11:05:40.000000000 -0800
+++ pernode-2.5.63-bk5-1/kernel/sched.c	2003-03-02 02:55:14.000000000 -0800
@@ -32,6 +32,7 @@
 #include <linux/delay.h>
 #include <linux/timer.h>
 #include <linux/rcupdate.h>
+#include <asm/pernode.h>
 
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
@@ -166,9 +167,9 @@ struct runqueue {
 	atomic_t nr_iowait;
 } ____cacheline_aligned;
 
-static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
+static DEFINE_PER_CPU(struct runqueue, runqueues) = {{ 0 }};
 
-#define cpu_rq(cpu)		(runqueues + (cpu))
+#define cpu_rq(cpu)		(&per_cpu(runqueues, cpu))
 #define this_rq()		cpu_rq(smp_processor_id())
 #define task_rq(p)		cpu_rq(task_cpu(p))
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
@@ -189,12 +190,11 @@ static struct runqueue runqueues[NR_CPUS
  * Keep track of running tasks.
  */
 
-static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
-	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+static DEFINE_PER_NODE(atomic_t, node_nr_running) = ATOMIC_INIT(0);
 
 static inline void nr_running_init(struct runqueue *rq)
 {
-	rq->node_nr_running = &node_nr_running[0];
+	rq->node_nr_running = &per_node(node_nr_running, 0);
 }
 
 static inline void nr_running_inc(runqueue_t *rq)
@@ -214,7 +214,7 @@ __init void node_nr_running_init(void)
 	int i;
 
 	for (i = 0; i < NR_CPUS; i++)
-		cpu_rq(i)->node_nr_running = &node_nr_running[cpu_to_node(i)];
+		cpu_rq(i)->node_nr_running = &per_node(node_nr_running, cpu_to_node(i));
 }
 
 #else /* !CONFIG_NUMA */
@@ -748,7 +748,7 @@ static int sched_best_cpu(struct task_st
 
 	minload = 10000000;
 	for (i = 0; i < numnodes; i++) {
-		load = atomic_read(&node_nr_running[i]);
+		load = atomic_read(&per_node(node_nr_running, i));
 		if (load < minload) {
 			minload = load;
 			node = i;
@@ -790,13 +790,13 @@ static int find_busiest_node(int this_no
 	int i, node = -1, load, this_load, maxload;
 	
 	this_load = maxload = (this_rq()->prev_node_load[this_node] >> 1)
-		+ atomic_read(&node_nr_running[this_node]);
+		+ atomic_read(&per_node(node_nr_running, this_node));
 	this_rq()->prev_node_load[this_node] = this_load;
 	for (i = 0; i < numnodes; i++) {
 		if (i == this_node)
 			continue;
 		load = (this_rq()->prev_node_load[i] >> 1)
-			+ atomic_read(&node_nr_running[i]);
+			+ atomic_read(&per_node(node_nr_running, i));
 		this_rq()->prev_node_load[i] = load;
 		if (load > maxload && (100*load > NODE_THRESHOLD*this_load)) {
 			maxload = load;
diff -urpN linux-2.5.63-bk5/kernel/softirq.c pernode-2.5.63-bk5-1/kernel/softirq.c
--- linux-2.5.63-bk5/kernel/softirq.c	2003-02-24 11:05:12.000000000 -0800
+++ pernode-2.5.63-bk5-1/kernel/softirq.c	2003-03-02 02:55:14.000000000 -0800
@@ -32,7 +32,7 @@
    - Tasklets: serialized wrt itself.
  */
 
-irq_cpustat_t irq_stat[NR_CPUS] ____cacheline_aligned;
+DEFINE_PER_CPU(irq_cpustat_t, irq_stat);
 
 static struct softirq_action softirq_vec[32] __cacheline_aligned_in_smp;
 
diff -urpN linux-2.5.63-bk5/mm/page_alloc.c pernode-2.5.63-bk5-1/mm/page_alloc.c
--- linux-2.5.63-bk5/mm/page_alloc.c	2003-02-24 11:05:06.000000000 -0800
+++ pernode-2.5.63-bk5-1/mm/page_alloc.c	2003-03-02 02:55:14.000000000 -0800
@@ -44,8 +44,8 @@ int sysctl_lower_zone_protection = 0;
  * Used by page_zone() to look up the address of the struct zone whose
  * id is encoded in the upper bits of page->flags
  */
-struct zone *zone_table[MAX_NR_ZONES*MAX_NR_NODES];
-EXPORT_SYMBOL(zone_table);
+DEFINE_PER_NODE(struct zone *[MAX_NR_ZONES], zone_table);
+EXPORT_PER_NODE_SYMBOL(zone_table);
 
 static char *zone_names[MAX_NR_ZONES] = { "DMA", "Normal", "HighMem" };
 static int zone_balance_ratio[MAX_NR_ZONES] __initdata = { 128, 128, 128, };
@@ -1170,7 +1170,7 @@ static void __init free_area_init_core(s
 		unsigned long size, realsize;
 		unsigned long batch;
 
-		zone_table[nid * MAX_NR_ZONES + j] = zone;
+		per_node(zone_table, nid)[j] = zone;
 		realsize = size = zones_size[j];
 		if (zholes_size)
 			realsize -= zholes_size[j];
diff -urpN linux-2.5.63-bk5/mm/slab.c pernode-2.5.63-bk5-1/mm/slab.c
--- linux-2.5.63-bk5/mm/slab.c	2003-03-02 01:05:09.000000000 -0800
+++ pernode-2.5.63-bk5-1/mm/slab.c	2003-03-02 02:55:14.000000000 -0800
@@ -462,7 +462,7 @@ enum {
 	FULL
 } g_cpucache_up;
 
-static struct timer_list reap_timers[NR_CPUS];
+static DEFINE_PER_CPU(struct timer_list, reap_timers);
 
 static void reap_timer_fnc(unsigned long data);
 
@@ -516,7 +516,7 @@ static void __slab_error(const char *fun
  */
 static void start_cpu_timer(int cpu)
 {
-	struct timer_list *rt = &reap_timers[cpu];
+	struct timer_list *rt = &per_cpu(reap_timers, cpu);
 
 	if (rt->function == NULL) {
 		init_timer(rt);
@@ -2180,7 +2180,7 @@ next:
 static void reap_timer_fnc(unsigned long data)
 {
 	int cpu = smp_processor_id();
-	struct timer_list *rt = &reap_timers[cpu];
+	struct timer_list *rt = &per_cpu(reap_timers, cpu);
 
 	cache_reap();
 	mod_timer(rt, jiffies + REAPTIMEOUT_CPUC + cpu);
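
For anyone reading the patch without a kernel tree handy, here is a rough
userspace sketch of the offset scheme that setup_per_node_areas() and
per_node() rely on: keep one template copy, clone it once per node, and
record each clone's distance from the template. The names template_area
and per_node_addr(), and the MAX_NR_NODES value of 4, are made up for
illustration; malloc() stands in for alloc_bootmem(), and offsets are held
as uintptr_t instead of going through RELOC_HIDE. It is a sketch of the
idea under those assumptions, not the kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NR_NODES 4	/* hypothetical value, for illustration only */

/* Stand-in for the .data.pernode section: a single template copy. */
struct pernode_area {
	int nr_running;
	long scratch;
};
static struct pernode_area template_area = { 0, 42 };

/* Distance from the template to each node's private copy. */
static uintptr_t per_node_offset[MAX_NR_NODES];

/* Clone the template once per node and record the offsets. */
static int setup_per_node_areas(void)
{
	size_t size = sizeof(template_area);
	char *ptr = malloc(size * MAX_NR_NODES);	/* alloc_bootmem() in the patch */
	int i;

	if (!ptr)
		return -1;
	for (i = 0; i < MAX_NR_NODES; i++, ptr += size) {
		per_node_offset[i] = (uintptr_t)ptr - (uintptr_t)&template_area;
		memcpy(ptr, &template_area, size);
	}
	return 0;
}

/* Analogue of per_node(): template address plus the node's offset. */
static void *per_node_addr(void *tmpl, int node)
{
	assert(node >= 0 && node < MAX_NR_NODES);
	return (void *)((uintptr_t)tmpl + per_node_offset[node]);
}
```

After setup_per_node_areas() returns 0, a write through
per_node_addr(&template_area.nr_running, 2) touches only node 2's copy;
the template and the other nodes' copies keep their initial values.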

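The page_zone() change keeps the same number in the top bits of
page->flags (nid * MAX_NR_ZONES + zone index) but now splits it with a
divide and a modulo to pick the node first and the zone second. A small
standalone sketch of that arithmetic follows. set_zone_bits(),
flags_to_nid() and flags_to_zone_idx() are hypothetical helper names, and
MAX_NR_ZONES is assumed to be 3 to match the three zone_names entries in
mm/page_alloc.c; ZONE_SHIFT follows the BITS_PER_LONG - 8 definition from
the patch context.

```c
#include <assert.h>
#include <limits.h>

#define MAX_NR_ZONES	3	/* assumed: DMA, Normal, HighMem */
#define ZONE_SHIFT	(sizeof(unsigned long) * CHAR_BIT - 8)

/* Store nid * MAX_NR_ZONES + zone_idx in the top 8 bits of flags. */
static unsigned long set_zone_bits(unsigned long flags,
				   unsigned int nid, unsigned int zone_idx)
{
	unsigned long zone_num = (unsigned long)nid * MAX_NR_ZONES + zone_idx;

	assert(zone_idx < MAX_NR_ZONES);
	assert(zone_num < 256);			/* must fit in 8 bits */
	flags &= ~(~0UL << ZONE_SHIFT);		/* clear the old zone bits */
	return flags | (zone_num << ZONE_SHIFT);
}

/* Node part of the lookup: which per-node zone_table to use. */
static unsigned int flags_to_nid(unsigned long flags)
{
	return (flags >> ZONE_SHIFT) / MAX_NR_ZONES;
}

/* Zone part of the lookup: index within that node's zone_table. */
static unsigned int flags_to_zone_idx(unsigned long flags)
{
	return (flags >> ZONE_SHIFT) % MAX_NR_ZONES;
}
```

With these assumptions, node 2 zone 1 is stored as zone number 7, and
decoding gives back 7 / 3 = 2 and 7 % 3 = 1. The low bits of flags are left
untouched.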

end of thread, other threads:[~2003-03-04  0:04 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-03-02 18:24 percpu-2.5.63-bk5-1 (properly generated) Martin J. Bligh
2003-03-02 20:24 ` William Lee Irwin III
2003-03-02 20:46   ` Martin J. Bligh
2003-03-02 21:06     ` William Lee Irwin III
2003-03-02 21:58       ` Martin J. Bligh
2003-03-02 22:10         ` William Lee Irwin III
2003-03-02 23:13           ` Martin J. Bligh
2003-03-02 23:42             ` William Lee Irwin III
2003-03-03  0:07               ` Martin J. Bligh
2003-03-03  1:43                 ` William Lee Irwin III
2003-03-03 17:40                   ` Martin J. Bligh
2003-03-03 22:51                     ` William Lee Irwin III
2003-03-03 23:30                       ` Martin J. Bligh
2003-03-04  0:14                         ` William Lee Irwin III
  -- strict thread matches above, loose matches on Subject: below --
2003-03-02 11:07 William Lee Irwin III
2003-03-02 13:15 ` William Lee Irwin III
