linux-kernel.vger.kernel.org archive mirror
* [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
@ 2003-07-08 22:45 Ingo Molnar
  2003-07-09  1:29 ` William Lee Irwin III
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Ingo Molnar @ 2003-07-08 22:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm


i'm pleased to announce the first public release of the "4GB/4GB VM split"
patch, for the 2.5.74 Linux kernel:

   http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8

The 4G/4G split feature is primarily intended for large-RAM x86 systems,
which want to (or have to) get more kernel/user VM, at the expense of
per-syscall TLB-flush overhead.

on x86, the total amount of virtual memory - as we all know - is limited
to 4GB. Of this total 4GB VM, userspace uses 3GB (0x00000000-0xbfffffff),
the kernel uses 1GB (0xc0000000-0xffffffff). This VM scheme is called
the 3/1 split. This split works perfectly fine up to 1 GB of RAM - and
it works adequately well even after that, due to 'highmem', which moves
various larger caches (and objects) into the high memory area.

But as the amount of RAM increases, the 3/1 split becomes a real
bottleneck. Despite highmem being utilized by a number of large-size
caches, one of the most crucial data structures, the mem_map[], is
allocated out of the 1 GB kernel VM. With 32 GB of RAM the remaining 0.5
GB lowmem area is quite limited and only represents 1.5% of all RAM.
Various common workloads exhaust the lowmem area and create artificial
bottlenecks. With 64 GB RAM, the mem_map[] alone takes up nearly 1 GB of
RAM, making the kernel unable to boot. Relocating the mem_map[] to highmem
is very impractical, due to the deep integration of this central data
structure into the whole kernel - the VM, lowlevel arch code, drivers,
filesystems, etc.

with the 4G/4G patch, the kernel can be compiled in 4G/4G mode, in which
case there's a full, separate 4GB VM for the kernel, and there are
separate full (and per-process) 4GB VMs for user-space.

A typical /proc/PID/maps file of a process running on a 4G/4G kernel shows
a full 4GB address-space:

 00e80000-00faf000 r-xp 00000000 03:01 175909     /lib/tls/libc-2.3.2.so
 00faf000-00fb2000 rw-p 0012f000 03:01 175909     /lib/tls/libc-2.3.2.so
 [...]
 feffe000-ff000000 rwxp fffff000 00:00 0

the stack ends at 0xff000000 (4GB minus 16MB). The kernel has a 4GB lowmem
area, of which 3.1 GB is still usable even with 64 GB of RAM:

 MemTotal:     66052020 kB
 MemFree:      65958260 kB
 HighTotal:    62914556 kB
 HighFree:     62853140 kB
 LowTotal:      3137464 kB
 LowFree:       3105120 kB

the amount of lowmem is still more than 3 times the amount of lowmem
available to a 4GB system. It's more than 6 times the amount of lowmem a
32 GB system gets with the 3/1 split.

Performance impact of the 4G/4G feature:

There's a runtime cost with the 4G/4G patch: to implement separate address
spaces for the kernel and userspace VM, the entry/exit code has to switch
between the kernel pagetables and the user pagetables. This causes TLB
flushes, which are quite expensive, not so much in terms of TLB misses
(which are quite fast on Intel CPUs if they come from caches), but in
terms of the direct TLB flushing cost (%cr3 manipulation) done on
system-entry.

RAM limits:

in theory, the 4G/4G patch could provide a mem_map[] for 200 GB (!) of
physical RAM on x86, while still having 1 GB of lowmem left. So it gives
quite some headroom. While the right solution for lots of RAM is to use a
proper 64-bit system, there's a lot of existing x86 hardware, and x86
servers will still be sold in the next couple of years, so we ought to
support them as well as we can.

The patch is orthogonal to wli's pgcl patch - both patches try to achieve
the same goal, via different methods. I can very well imagine workloads
where we want the combination of the two patches.

Implementational details:

the patch implements/touches a number of new lowlevel x86 infrastructures:

 - it moves the GDT, IDT, TSS, LDT, vsyscall page and kernel stack up into
   a high virtual memory window (trampoline) at the top 16 MB of the
   4GB address space. This 16 MB window is the only area that is shared 
   between user-space and kernel-space pagetables.

 - it splits out atomic kmaps from highmem dependencies.

 - it makes LDT(s) atomic-kmap-ed.

 - (and lots of other smaller details, like increasing the size of the
   initial mappings and fixing the PAE code to map the full 4GB of kernel
   VM.)

Whenever we do a syscall (or any other trap) from user-mode, the
high-address trampoline code starts to run, with a high-address esp0. This
code switches over to the kernel pagetable, then it switches the 'virtual
kernel stack' to the regular (real) kernel stack. On syscall-exit it does
it the other way around.

there are a few generic kernel changes as well:

 - it adds 'indirect uaccess' primitives and implements all the
   get_user/put_user/copy_to_user/... functions without relying on direct
   access to user-space. This feature has already uncovered a number of
   bugs in the lowlevel x86 code - there was still code that accessed
   user-space memory directly.

 - it splits up PAGE_OFFSET into PAGE_OFFSET_USER and PAGE_OFFSET (kernel)

 - fixes a couple of assumptions about PAGE_OFFSET being PMD_SIZE aligned.

but the generic-kernel impact of the patch is quite low.

the patch optimizes kernel<->kernel context switches so that they do not
flush the TLB; also, IRQ entry only causes a TLB flush if a userspace
pagetable is loaded.

the typical cost of 4G/4G on typical x86 servers is +3 usecs of syscall
latency (this is in addition to the ~1 usec null syscall latency).
Depending on the workload, this can cause a measurable wall-clock
overhead of anywhere from 0% to 30% for typical application workloads (DB
workloads, networking workloads, etc.). Isolated microbenchmarks can show
an even bigger slowdown, due to the syscall latency increase.

i'd guess that the 4G/4G patch is not worth the overhead for systems with
less than 16 GB of RAM (although exceptions might exist, for particularly
lowmem-intensive/sensitive workloads). 32 GB RAM systems run into lowmem
limitations quite frequently so the 4G/4G patch is quite recommended
there, and for 64 GB and larger systems it's a must i think.

Status, future plans:

The patch is a work-in-progress snapshot - it still has a few TODOs and
FIXMEs, but it compiles & works fine for me. Be careful with it
nevertheless - it's an experimental patch which does very intrusive
changes to the lowlevel x86 code.

There are a couple of performance enhancements ontop of this patch that
i'll integrate into this patch in the next couple of days, but i first
wanted to release the base patch.

In any case, enjoy the patch - and as usual, comments and suggestions are
more than welcome,

	Ingo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-08 22:45 [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support Ingo Molnar
@ 2003-07-09  1:29 ` William Lee Irwin III
  2003-07-09  5:13 ` Martin J. Bligh
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: William Lee Irwin III @ 2003-07-09  1:29 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, linux-mm

On Wed, Jul 09, 2003 at 12:45:52AM +0200, Ingo Molnar wrote:
> The patch is orthogonal to wli's pgcl patch - both patches try to achieve
> the same, with different methods. I can very well imagine workloads where
> we want to have the combination of the two patches.

Well, your patch does have the advantage of not being a "break all
drivers" affair.

Also, even though pgcl scales "perfectly" wrt. highmem (nm the code
being a train wreck), the raw capacity increase is needed. There are
enough other reasons to go through with ABI-preserving page clustering
that they're not really in competition with each other.

Looks good to me. I'll spin it up tonight.


-- wli


* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-08 22:45 [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support Ingo Molnar
  2003-07-09  1:29 ` William Lee Irwin III
@ 2003-07-09  5:13 ` Martin J. Bligh
  2003-07-09  5:19   ` William Lee Irwin III
  2003-07-09  6:42   ` Ingo Molnar
  2003-07-09  5:16 ` Dave Hansen
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 23+ messages in thread
From: Martin J. Bligh @ 2003-07-09  5:13 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: linux-mm

> i'm pleased to announce the first public release of the "4GB/4GB VM split"
> patch, for the 2.5.74 Linux kernel:
> 
>    http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8

I presume this was for -bk something, as it applies cleanly to -bk6 but
not to virgin 2.5.74.

However, it crashes before console_init on NUMA ;-( I'll shove early printk
in there later.

M.



* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-08 22:45 [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support Ingo Molnar
  2003-07-09  1:29 ` William Lee Irwin III
  2003-07-09  5:13 ` Martin J. Bligh
@ 2003-07-09  5:16 ` Dave Hansen
  2003-07-09  7:08 ` Geert Uytterhoeven
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Dave Hansen @ 2003-07-09  5:16 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linux Kernel Mailing List, linux-mm

Looks very interesting.  A few concerns, though, some stylistic.  Although
I know, if I had done something half as complex, it would look much
worse :) If you're still planning on doing cleanups I can wait, but
otherwise, I can send patches.

Have you looked at the impact on high interrupt load workloads?  I saw
you mention the per-syscall TLB overhead, but you only mentioned the
interrupt overhead in passing.  Doesn't this make it increasingly
important to coalesce interrupts, especially when you're running with
lots of user time?  Are there any particular workloads you've tested this
on?  I can try to get a couple of large webserver benchmark runs in on
it, if you like.

It's a lot harder now to drop back to 4k stacks, because of the
hard-coded 2 page kmap sequences.  But those patches are out-of-tree, so
they're of relatively little consequence.  

It might be nice to have some more abstraction of the size of the
trampoline window.  There's stuff like this:
        pgd[PTRS_PER_PGD-2] = swapper_pg_dir[PTRS_PER_PGD-2];
        pgd[PTRS_PER_PGD-1] = swapper_pg_dir[PTRS_PER_PGD-1];
Being clever, I think some of these can be the same as the generic code.
The sepmd and banana_split patches in -mjb demonstrate some relatively 
nice ways to do this.

There seems to be quite a bit of duplication of code in the new
__kmap_atomic* functions.  __kmap_atomic_vaddr() could replace all of the
duplicated
        idx = type + KM_TYPE_NR*smp_processor_id();
        vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
lines.  Also, it might be nice to combine __kmap_atomic{,_noflush}()

Are you hoping to get this integrated for 2.6, or will it be more of an 
add-on for 2.6 distro releases?
-- 
Dave Hansen
haveblue@us.ibm.com



* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-09  5:13 ` Martin J. Bligh
@ 2003-07-09  5:19   ` William Lee Irwin III
  2003-07-09  5:43     ` William Lee Irwin III
  2003-07-09  6:42   ` Ingo Molnar
  1 sibling, 1 reply; 23+ messages in thread
From: William Lee Irwin III @ 2003-07-09  5:19 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Ingo Molnar, linux-kernel, linux-mm

At some point in the past, mingo wrote:
>> i'm pleased to announce the first public release of the "4GB/4GB VM split"
>> patch, for the 2.5.74 Linux kernel:
>>    http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8

On Tue, Jul 08, 2003 at 10:13:12PM -0700, Martin J. Bligh wrote:
> I presume this was for -bk something as it applies clean to -bk6, but not
> virgin. 
> However, it crashes before console_init on NUMA ;-( I'll shove early printk
> in there later.

Don't worry, I'm debugging it.


-- wli


* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-09  5:19   ` William Lee Irwin III
@ 2003-07-09  5:43     ` William Lee Irwin III
  2003-07-12 23:58       ` Davide Libenzi
  0 siblings, 1 reply; 23+ messages in thread
From: William Lee Irwin III @ 2003-07-09  5:43 UTC (permalink / raw)
  To: Martin J. Bligh, Ingo Molnar, linux-kernel, linux-mm

On Tue, Jul 08, 2003 at 10:13:12PM -0700, Martin J. Bligh wrote:
>> I presume this was for -bk something as it applies clean to -bk6, but not
>> virgin. 
>> However, it crashes before console_init on NUMA ;-( I'll shove early printk
>> in there later.

On Tue, Jul 08, 2003 at 10:19:41PM -0700, William Lee Irwin III wrote:
> Don't worry, I'm debugging it.

Rather predictably, the NUMA KVA remapping shat itself:


Script started on Tue Jul  8 22:28:53 2003
$ screen -x
Recovering nvi editor sessions... done.
Setting up X server socket directory /tmp/.X11-unix...done.
INIT: Entering runlevel: 2
Starting system log daemon: syslogd.
Starting kernel log daemon: klogd.
Starting internet superserver: inetd.
Starting printer spooler: lpd.
Starting network benchmark server: netserver.
Not starting NFS kernel daemon: No exports.
Starting OpenBSD Secure Shell server: sshd.
Starting the system activity data collector: sadc.
Starting NFS common utilities: statd lockd.
Starting periodic command scheduler: cron.

Debian GNU/Linux testing/unstable megeira ttyS0

megeira login: root
Password:
Last login: Tue Jul  8 21:56:18 2003 on ttyS0
Linux megeira 2.5.74 #1 SMP Mon Jul 7 22:15:57 PDT 2003 i686 GNU/Linux
megeira:~# mount /mnt/g
megeira:~# !ec
echo 1 > /proc/sys/vm/overcommit_memory ; echo 1 > /proc/sys/vm/swappiness ; echo 360000 > /proc/sys/vm/dirty_expire_centisecs ; echo 360000 > /proc/sys/vm/dirty_writeback_centisecs ; echo 99 > /proc/sys/vm/dirty_background_ratio ; echo 1 > /proc/profile
megeira:~# shutdown -h now
^[[?5hBroadcast message from root (ttyS0) (Tue Jul  8 22:29:37 2003):

The system is going down for system halt NOW!
INIT: INIT: Sending processes the TERM signal
megeira:~# INIT:Stopping periodic command scheduler: cron.
Stopping internet superserver: inetd.
Stopping printer spooler: lpd.
Stopping network benchmark server: netserver.
Stopping OpenBSD Secure Shell server: sshd.
Saving the System Clock time to the Hardware Clock...
Hardware Clock updated to Tue Jul  8 22:30:04 PDT 2003.
Stopping NFS common utilities: lockd statd.
Stopping NFS kernel daemon: mountd nfsd.
Unexporting directories for NFS kernel daemon...done.
Stopping kernel log daemon: klogd.
Stopping system log daemon: syslogd.
Stopping portmap daemon: portmap.
Sending all processes the TERM signal... done.
Sending all processes the KILL signal... done.
Saving random seed... done.
Unmounting remote filesystems... done.
Deconfiguring network interfaces... done.
Deactivating swap... done.
Unmounting local filesystems... mount: proc already mounted
done.
Shutting down devices
Power down.
Press any key to continue.
Press any key to continue.
Press any key to continue.
Press any key to continue.
Press any key to continue.
Press any key to continue.
Press any key to continue.
    GRUB  version 0.92  (639K lower / 3668992K upper memory)

 +-------------------------------------------------------------------------+
 | Boot Safe Kernel                                                        |
 | Boot check Kernel                                                       |
 | Boot latest kernel                                                      |
 | boot latest kernel from elm3b96                                         |
 | 2.5.44                                                                  |
 | 2.5.44-mm4                                                              |
 | 2.5.44-mm4-erich                                                        |
 | 2.5.44-mm4-michael                                                      |
 | 2.5.47-stock                                                            |
 | 2.5.47-sched                                                            |
 | 2.5.50-sched                                                            |
 | 2.5.50-stock                                                            |
 +-------------------------------------------------------------------------+
      Use the ^ and v keys to select which entry is highlighted.
      Press enter to boot the selected OS, 'e' to edit the
      commands before booting, or 'c' for a command-line.

    GRUB  version 0.92  (639K lower / 3668992K upper memory)

 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename.  ESC at any time exits. ]

grub> root (hd0,1)
 Filesystem type is ext2fs, partition type 0x83

grub> kernel /home/wli/vmlinuz-ingo root=/dev/sda2 console=ttyS0,38400n8 profile=1
   [Linux-bzImage, setup=0xa00, size=0x1407e4]

grub> boot
Linux version 2.5.74-mm2 (wli@megeira) (gcc version 3.3 (Debian)) #1 SMP Tue Jul 8 22:28:26 PDT 2003
Video mode to be used for restore is ffff
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 0000000000100000 - 00000000e0000000 (usable)
 BIOS-e820: 00000000fec00000 - 00000000fec09000 (reserved)
 BIOS-e820: 00000000ffe80000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000800000000 (usable)
user-defined physical RAM map:
 user: 0000000000000000 - 000000000009fc00 (usable)
 user: 0000000000100000 - 00000000e0000000 (usable)
 user: 00000000fec00000 - 00000000fec09000 (reserved)
 user: 00000000ffe80000 - 0000000100000000 (reserved)
 user: 0000000100000000 - 0000000800000000 (usable)
Reserving 23040 pages of KVA for lmem_map of node 1
Shrinking node 1 from 4194304 pages to 4171264 pages
Reserving 23040 pages of KVA for lmem_map of node 2
Shrinking node 2 from 6291456 pages to 6268416 pages
Reserving 23040 pages of KVA for lmem_map of node 3
Shrinking node 3 from 8388608 pages to 8365568 pages
Reserving total of 69120 pages for numa KVA remap
28832MB HIGHMEM available.
3666MB LOWMEM available.
min_low_pfn = 1045, max_low_pfn = 938496, highstart_pfn = 1007616
Low memory ends at vaddr e7200000
node 0 will remap to vaddr f8000000 - f8000000
node 1 will remap to vaddr f2600000 - f8000000
node 2 will remap to vaddr ecc00000 - f2600000
node 3 will remap to vaddr e7200000 - ecc00000
High memory starts at vaddr f8000000
found SMP MP-table at 000f6040
hm, page 000f6000 reserved twice.
hm, page 000f7000 reserved twice.
Unknown interrupt
Unknown interrupt
Unknown interrupt
[... "Unknown interrupt" repeated hundreds of times ...]
[detached]
$

Script done on Tue Jul  8 22:36:58 2003


* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-09  5:13 ` Martin J. Bligh
  2003-07-09  5:19   ` William Lee Irwin III
@ 2003-07-09  6:42   ` Ingo Molnar
  1 sibling, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2003-07-09  6:42 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel, linux-mm


On Tue, 8 Jul 2003, Martin J. Bligh wrote:

> > i'm pleased to announce the first public release of the "4GB/4GB VM split"
> > patch, for the 2.5.74 Linux kernel:
> > 
> >    http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8
> 
> I presume this was for -bk something as it applies clean to -bk6, but
> not virgin.

indeed - it's for BK-curr.

> However, it crashes before console_init on NUMA ;-( I'll shove early
> printk in there later.

wli found the bug meanwhile - i'll do a new patch later today.

	Ingo



* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-08 22:45 [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support Ingo Molnar
                   ` (2 preceding siblings ...)
  2003-07-09  5:16 ` Dave Hansen
@ 2003-07-09  7:08 ` Geert Uytterhoeven
  2003-07-10  1:36 ` Martin J. Bligh
  2003-07-13 22:05 ` Petr Vandrovec
  5 siblings, 0 replies; 23+ messages in thread
From: Geert Uytterhoeven @ 2003-07-09  7:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linux Kernel Development, linux-mm

On Wed, 9 Jul 2003, Ingo Molnar wrote:
> i'm pleased to announce the first public release of the "4GB/4GB VM split"
> patch, for the 2.5.74 Linux kernel:
> 
>    http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8
> 
> The 4G/4G split feature is primarily intended for large-RAM x86 systems,
> which want to (or have to) get more kernel/user VM, at the expense of
> per-syscall TLB-flush overhead.

Great! Another enterprise feature stolen from SCO? :-)

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds



* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-08 22:45 [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support Ingo Molnar
                   ` (3 preceding siblings ...)
  2003-07-09  7:08 ` Geert Uytterhoeven
@ 2003-07-10  1:36 ` Martin J. Bligh
  2003-07-10 13:36   ` Martin J. Bligh
  2003-07-13 22:05 ` Petr Vandrovec
  5 siblings, 1 reply; 23+ messages in thread
From: Martin J. Bligh @ 2003-07-10  1:36 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: linux-mm

> i'm pleased to announce the first public release of the "4GB/4GB VM split"
> patch, for the 2.5.74 Linux kernel:
> 
>    http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8
> 
> The 4G/4G split feature is primarily intended for large-RAM x86 systems,
> which want to (or have to) get more kernel/user VM, at the expense of
> per-syscall TLB-flush overhead.

wli pointed out that the only problem with the NUMA boxen was that you
left out "remap_numa_kva();" from pagetable_init - sticking it back at the
end works fine.

Preliminary benchmark results:

2.5.74-bk6-44 is with the patch applied
2.5.74-bk6-44-on is with the patch applied and config option turned on.

Kernbench: (make -j vmlinux, maximal tasks)
                              Elapsed      System        User         CPU
                   2.5.74       46.11      115.86      571.77     1491.50
            2.5.74-bk6-44       45.92      115.71      570.35     1494.75
         2.5.74-bk6-44-on       48.11      134.51      583.88     1491.75

SDET 128  (see disclaimer)
                           Throughput    Std. Dev
                   2.5.74       100.0%         0.1%
            2.5.74-bk6-44       100.3%         0.7%
         2.5.74-bk6-44-on        92.1%         0.2%

Which isn't too bad at all, considering ... highpte does this to it:

Kernbench: (make -j vmlinux, maximal tasks)
                              Elapsed      System        User         CPU
               2.5.73-mm3       45.38      114.91      565.81     1497.75
       2.5.73-mm3-highpte       46.54      130.41      566.84     1498.00

SDET 16  (see disclaimer)
                           Throughput    Std. Dev
               2.5.73-mm3       100.0%         0.3%
       2.5.73-mm3-highpte        94.8%         0.6%

(I don't have highpte results for higher SDET right now - I'll run 
'em later).

diffprofile for kernbench (- is better with 4/4 on, + worse)

     15066     9.2% total
     10883     0.0% rw_vm
      3686   170.3% do_page_fault
      1652     3.4% default_idle
      1380     0.0% str_vm
      1256     0.0% follow_page
      1012     7.2% do_anonymous_page
       669   119.7% kmap_atomic
       611    78.1% handle_mm_fault
       563     0.0% get_user_pages
       418    41.4% clear_page_tables
       338    66.3% page_address
       304    16.9% buffered_rmqueue
       263   222.9% kunmap_atomic
       161     2.0% __d_lookup
       152    21.8% sys_brk
       151    26.3% find_vma
       138    24.3% pgd_alloc
       135     9.3% schedule
       133     0.6% page_remove_rmap
       128     0.0% put_user_size
       123     3.4% find_get_page
       121     8.4% free_hot_cold_page
       106     3.3% zap_pte_range
        99    11.1% filemap_nopage
        97     1.5% page_add_rmap
        84     7.5% file_move
        79     6.6% release_pages
        65     0.0% get_user_size
        59    15.7% file_kill
        52     0.0% find_extend_vma
...
       -50   -47.2% kmap_high
       -63   -10.8% fd_install
       -76  -100.0% bad_get_user
       -86   -11.6% pte_alloc_one
      -109  -100.0% direct_strncpy_from_user
      -151  -100.0% __copy_user_intel
      -878  -100.0% direct_strnlen_user
     -3505  -100.0% __copy_from_user_ll
     -5368  -100.0% __copy_to_user_ll

and for SDET:

     63719     8.1% total
     39097     9.8% default_idle
     12494     0.0% rw_vm
      4820   192.6% do_page_fault
      3587    36.4% clear_page_tables
      3341     0.0% follow_page
      1744     0.0% str_vm
      1297   138.4% kmap_atomic
      1026    43.8% pgd_alloc
      1010     0.0% get_user_pages
       932    27.6% do_anonymous_page
       877   100.2% handle_mm_fault
       828    14.2% path_lookup
       605    42.9% page_address
       552    13.3% do_wp_page
       496   216.6% kunmap_atomic
       455     4.1% __d_lookup
       441     2.5% zap_pte_range
       415    12.8% do_no_page
       408    36.7% __block_prepare_write
       349     2.5% copy_page_range
       331    12.3% filemap_nopage
       308     0.0% put_user_size
       305    43.9% find_vma
       266    35.7% update_atime
       212     2.3% find_get_page
       209     8.4% proc_pid_stat
       196     9.1% schedule
       188     7.7% buffered_rmqueue
       186     5.2% pte_alloc_one
       166    13.7% __find_get_block
       162    15.1% __mark_inode_dirty
       159     9.1% current_kernel_time
       155    18.1% grab_block
       149     1.5% release_pages
       124     2.6% follow_mount
       118     7.6% ext2_new_inode
       117     5.6% path_release
       113    28.2% __free_pages
       113     0.0% get_user_size
       107    12.1% dnotify_parent
       105    20.8% __alloc_pages
       102    18.4% generic_file_aio_write_nolock
       102     4.7% file_move
...
      -101    -6.5% __set_page_dirty_buffers
      -102   -30.7% kunmap_high
      -104   -13.4% .text.lock.base
      -108    -3.9% copy_process
      -114   -13.4% unmap_vmas
      -121    -5.0% link_path_walk
      -127   -10.5% __read_lock_failed
      -128   -24.3% set_page_address
      -180  -100.0% bad_get_user
      -237   -11.6% .text.lock.namei
      -243  -100.0% direct_strncpy_from_user
      -262    -0.3% page_remove_rmap
      -310    -5.6% kmem_cache_free
      -332    -4.4% atomic_dec_and_lock
      -365   -35.3% kmap_high
      -458   -15.7% .text.lock.dcache
      -583   -22.8% .text.lock.filemap
      -609   -13.4% .text.lock.dec_and_lock
      -649   -54.9% .text.lock.highmem
      -848  -100.0% direct_strnlen_user
      -877  -100.0% __copy_user_intel
      -958  -100.0% __copy_from_user_ll
     -1098    -2.7% page_add_rmap
     -6746  -100.0% __copy_to_user_ll


I'll play around some more with it later. Presumably things like
disk / network intensive workloads that generate a lot of interrupts
will be bad ... but NAPI would help?

What I *really* like is that without the config option on, there's
no degradation ;-)

M.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-10  1:36 ` Martin J. Bligh
@ 2003-07-10 13:36   ` Martin J. Bligh
  0 siblings, 0 replies; 23+ messages in thread
From: Martin J. Bligh @ 2003-07-10 13:36 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: linux-mm

Results now with highpte

2.5.74-bk6-44 is with the patch applied
2.5.74-bk6-44-on is with the patch applied and 4/4 config option.
2.5.74-bk6-44-hi is with the patch applied and with highpte instead.

The overhead of 4/4 isn't much higher, and it's much more generally useful.

Kernbench: (make -j vmlinux, maximal tasks)
                              Elapsed      System        User         CPU
                   2.5.74       46.11      115.86      571.77     1491.50
            2.5.74-bk6-44       45.92      115.71      570.35     1494.75
         2.5.74-bk6-44-on       48.11      134.51      583.88     1491.75
         2.5.74-bk6-44-hi       47.06      131.13      570.79     1491.50

SDET 128  (see disclaimer)
                           Throughput    Std. Dev
                   2.5.74       100.0%         0.1%
            2.5.74-bk6-44       100.3%         0.7%
         2.5.74-bk6-44-on        92.1%         0.2%
         2.5.74-bk6-44-hi        94.5%         0.1%



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-09  5:43     ` William Lee Irwin III
@ 2003-07-12 23:58       ` Davide Libenzi
  2003-07-13  0:11         ` William Lee Irwin III
  0 siblings, 1 reply; 23+ messages in thread
From: Davide Libenzi @ 2003-07-12 23:58 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Linux Kernel Mailing List

On Tue, 8 Jul 2003, William Lee Irwin III wrote:
   ^^^^^^^^^^^^^^^

Is it just me who is receiving dups from lkml, or is it a common disease?



- Davide


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-12 23:58       ` Davide Libenzi
@ 2003-07-13  0:11         ` William Lee Irwin III
  2003-07-13  8:13           ` jw schultz
  0 siblings, 1 reply; 23+ messages in thread
From: William Lee Irwin III @ 2003-07-13  0:11 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux Kernel Mailing List

On Tue, 8 Jul 2003, William Lee Irwin III wrote:
>    ^^^^^^^^^^^^^^^

On Sat, Jul 12, 2003 at 04:58:55PM -0700, Davide Libenzi wrote:
> Is it just me who is receiving dups from lkml, or is it a common disease?

The story is all in the headers:

Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S269995AbTGLXao (ORCPT <rfc822;wli@holomorphy.com>);
        Sat, 12 Jul 2003 19:30:44 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org id S270018AbTGLXao
        (ORCPT <rfc822;linux-kernel-outgoing>);
        Sat, 12 Jul 2003 19:30:44 -0400
Received: from pip15.ptt.js.cn ([61.155.13.245]:9629 "HELO jlonline.com")
        by vger.kernel.org with SMTP id S269995AbTGLXa2 (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Sat, 12 Jul 2003 19:30:28 -0400
Received: from jlonline.com([10.100.0.6]) by js.cn(AIMC 2.9.5.2)
        with SMTP id jm43f10fd18; Sun, 13 Jul 2003 07:33:56 +0800
Received: from kanga.kvack.org([216.138.200.138]) by js.cn(AIMC 2.9.5.2)
        with SMTP id jm343f0c0ba8; Wed, 09 Jul 2003 13:35:56 +0800
Received: (root@kanga.kvack.org) by kvack.org id <S26870AbTGIFmJ>;
        Wed, 9 Jul 2003 01:42:09 -0400
Received: from holomorphy.com ([66.224.33.161]:56232 "EHLO holomorphy")
        by kvack.org with ESMTP id <S26867AbTGIFlw> convert rfc822-to-8bit;
        Wed, 9 Jul 2003 01:41:52 -0400
Received: from wli by holomorphy with local (Exim 3.36 #1 (Debian))
        id 19a7jQ-0004Pg-00; Tue, 08 Jul 2003 22:43:08 -0700

It's clearly well upstream from me.

-- wli

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-13  0:11         ` William Lee Irwin III
@ 2003-07-13  8:13           ` jw schultz
  0 siblings, 0 replies; 23+ messages in thread
From: jw schultz @ 2003-07-13  8:13 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: William Lee Irwin III, Davide Libenzi

On Sat, Jul 12, 2003 at 05:11:23PM -0700, William Lee Irwin III wrote:
> On Tue, 8 Jul 2003, William Lee Irwin III wrote:
> >    ^^^^^^^^^^^^^^^
> 
> On Sat, Jul 12, 2003 at 04:58:55PM -0700, Davide Libenzi wrote:
> > Is it just me who is receiving dups from lkml, or is it a common disease?
> 
> The story is all in the headers:
> 
> Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
>         id S269995AbTGLXao (ORCPT <rfc822;wli@holomorphy.com>);
>         Sat, 12 Jul 2003 19:30:44 -0400
> Received: (majordomo@vger.kernel.org) by vger.kernel.org id S270018AbTGLXao
>         (ORCPT <rfc822;linux-kernel-outgoing>);
>         Sat, 12 Jul 2003 19:30:44 -0400
> Received: from pip15.ptt.js.cn ([61.155.13.245]:9629 "HELO jlonline.com")
>         by vger.kernel.org with SMTP id S269995AbTGLXa2 (ORCPT
>         <rfc822;linux-kernel@vger.kernel.org>);
>         Sat, 12 Jul 2003 19:30:28 -0400
> Received: from jlonline.com([10.100.0.6]) by js.cn(AIMC 2.9.5.2)
>         with SMTP id jm43f10fd18; Sun, 13 Jul 2003 07:33:56 +0800
> Received: from kanga.kvack.org([216.138.200.138]) by js.cn(AIMC 2.9.5.2)
>         with SMTP id jm343f0c0ba8; Wed, 09 Jul 2003 13:35:56 +0800
> Received: (root@kanga.kvack.org) by kvack.org id <S26870AbTGIFmJ>;
>         Wed, 9 Jul 2003 01:42:09 -0400
> Received: from holomorphy.com ([66.224.33.161]:56232 "EHLO holomorphy")
>         by kvack.org with ESMTP id <S26867AbTGIFlw> convert rfc822-to-8bit;
>         Wed, 9 Jul 2003 01:41:52 -0400
> Received: from wli by holomorphy with local (Exim 3.36 #1 (Debian))
>         id 19a7jQ-0004Pg-00; Tue, 08 Jul 2003 22:43:08 -0700
> 
> It's clearly well upstream from me.

More to the point:

| To: "Martin J. Bligh" <mbligh@aracnet.com>,
|         Ingo Molnar <mingo@elte.hu>,
| linux-kernel@vger.kernel.org,
|         linux-mm@kvack.org

and finally:

| --
| To unsubscribe, send a message with 'unsubscribe linux-mm' in
| the body to majordomo@kvack.org.  For more info on Linux MM,
| see: http://www.linux-mm.org/ .
| Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
| -
| To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
| the body of a message to majordomo@vger.kernel.org
| More majordomo info at
| http://vger.kernel.org/majordomo-info.html
| Please read the FAQ at  http://www.tux.org/lkml/

linux-mm is forwarding to linux-kernel without adequate
checking to see if it is already going there.

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-08 22:45 [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support Ingo Molnar
                   ` (4 preceding siblings ...)
  2003-07-10  1:36 ` Martin J. Bligh
@ 2003-07-13 22:05 ` Petr Vandrovec
  5 siblings, 0 replies; 23+ messages in thread
From: Petr Vandrovec @ 2003-07-13 22:05 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, linux-mm

On Wed, Jul 09, 2003 at 12:45:52AM +0200, Ingo Molnar wrote:
> 
> i'm pleased to announce the first public release of the "4GB/4GB VM split"
> patch, for the 2.5.74 Linux kernel:
> 
>    http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8

FYI, VMware's vmmon/vmnet I maintain for 2.5.x kernels at 
http://platan.vc.cvut.cz/ftp/pub/vmware (currently 
.../vmware-any-any-update37.tar.gz) were updated to work correctly
with 4G/4G kernel configuration.
						Best regards,
							Petr Vandrovec
							vandrove@vc.cvut.cz

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-09 10:58 "Kirill Korotaev" 
  2003-07-09 15:14 ` Ingo Molnar
@ 2003-07-14 20:24 ` Ingo Molnar
  1 sibling, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2003-07-14 20:24 UTC (permalink / raw)
  To: "Kirill Korotaev" ; +Cc: linux-kernel


On Wed, 9 Jul 2003, "Kirill Korotaev" wrote:

> 11. machine_real_restart()

> BTW, for me this path didn't reboot at all until I fixed it. Check
> your kernel with the option "reboot=b"

is this problem 4G/4G specific, or is there a generic kernel bug here?

	Ingo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-10 11:35     ` Russell King
@ 2003-07-10 12:33       ` Kirill Korotaev
  0 siblings, 0 replies; 23+ messages in thread
From: Kirill Korotaev @ 2003-07-10 12:33 UTC (permalink / raw)
  To: Russell King; +Cc: linux-kernel

Hi!

> I haven't read the patches, but this caught my attention.
>
> Wasn't the use of cr3 there to ensure that we used the right page tables
> when fixing up page faults occuring in the middle of a context switch for
> interrupt handlers in kernel modules?
When the 4GB split is used, cr3 always(!) points to the swapper pgdir while
kernel code is executing, so it is not an issue.
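Since kernel code in this scheme always runs on the swapper pgdir, a vmalloc-area fault in do_page_fault can be repaired by copying the master pgd entry into the faulting mm's pgd. Below is a minimal userspace toy model of that propagation step; the names, index arithmetic, and entry format are illustrative assumptions, not code from either patch:

```c
#include <assert.h>

/* Toy model of lazy vmalloc-area pgd propagation: the master entries
 * live in the swapper's pgd, and a fault in the vmalloc range is fixed
 * up by copying the missing entry into the faulting mm's pgd. */
#define PTRS_PER_PGD 1024

typedef unsigned long pgd_t;

static pgd_t swapper_pgd[PTRS_PER_PGD]; /* stands in for init_mm's pgd */
static pgd_t task_pgd[PTRS_PER_PGD];    /* stands in for current->mm->pgd */

/* returns 0 on success, -1 if the master entry is absent too */
static int fixup_vmalloc_fault(pgd_t *mm_pgd, unsigned long address)
{
    /* 2-level-style index: each pgd entry covers 4 MB (1 << 22) */
    unsigned long index = (address >> 22) & (PTRS_PER_PGD - 1);

    if (!swapper_pgd[index])
        return -1;              /* genuinely bad access */
    mm_pgd[index] = swapper_pgd[index];
    return 0;
}
```

Only the pgd step is modeled; a real handler would also walk and populate the pmd/pte levels.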

Kirill


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-10 10:50   ` Kirill Korotaev
  2003-07-10 10:59     ` Ingo Molnar
@ 2003-07-10 11:35     ` Russell King
  2003-07-10 12:33       ` Kirill Korotaev
  1 sibling, 1 reply; 23+ messages in thread
From: Russell King @ 2003-07-10 11:35 UTC (permalink / raw)
  To: Kirill Korotaev; +Cc: Ingo Molnar, linux-kernel

On Thu, Jul 10, 2003 at 02:50:42PM +0400, Kirill Korotaev wrote:
> > > - I changed do_page_fault to setup vmalloced pages to current->mm->pgd
> > >  instead of cr3 context.

I haven't read the patches, but this caught my attention.

Wasn't the use of cr3 there to ensure that we used the right page tables
when fixing up page faults occurring in the middle of a context switch for
interrupt handlers in kernel modules?

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-10 10:50   ` Kirill Korotaev
@ 2003-07-10 10:59     ` Ingo Molnar
  2003-07-10 11:35     ` Russell King
  1 sibling, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2003-07-10 10:59 UTC (permalink / raw)
  To: Kirill Korotaev; +Cc: linux-kernel


On Thu, 10 Jul 2003, Kirill Korotaev wrote:

> > yes, it is also visible to user-space, otherwise user-space LDT
> > descriptors would not work. The top 16 MB of virtual memory also hosts the
> > 'virtual LDT'.
> > the top 16 MB is split up into per-CPU areas, each CPU has its own. Check
> > out atomic_kmap.h for details.

> But you need to remap it every time a switch occurs. [...]

only if the task uses an LDT - which with NPTL is very rare these days.  
But _if_ something relies on LDTs heavily then it can be done limit-less.
(we used to have nasty LDT limits due to vmalloc() limitations).

> 8. NMI is special. The only difference is that, unlike usual interrupts,
> NMI can interrupt the return to user-space as well. This is solved easily
> by saving the current cr3/esp on entry and restoring them if required on
> leave ("required" == esp >= 0xffxxxxxx).

ok, "esp >= 0xff000000" is a good check, and it's a constant check so it
can be done very early in the entry code. The double-cr3-load is not an
issue, this is a rare race situation. I'll rework this area to be faster,
but first i wanted to concentrate on robustness. Thanks for your comments,
they are really useful!

	Ingo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-09 15:14 ` Ingo Molnar
@ 2003-07-10 10:50   ` Kirill Korotaev
  2003-07-10 10:59     ` Ingo Molnar
  2003-07-10 11:35     ` Russell King
  0 siblings, 2 replies; 23+ messages in thread
From: Kirill Korotaev @ 2003-07-10 10:50 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

Hi!

> > b. As far as I see you are mapping LDT in kernel-space at default addr
> > using fixmap/kmap, but is LDT mapped in user-space? [...]
> yes, it is also visible to user-space, otherwise user-space LDT
> descriptors would not work. The top 16 MB of virtual memory also hosts the
> 'virtual LDT'.
> the top 16 MB is split up into per-CPU areas, each CPU has its own. Check
> out atomic_kmap.h for details.
But you need to remap it every time a switch occurs. Your task switch is
much slower than a normal one. For this reason I don't map a per-CPU LDT,
only per-CPU task structs (2 ptes), which takes less than 100 CPU clocks.
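The "2 ptes" figure matches the usual x86 layout of the era, where the task struct and kernel stack share one two-page (8 KB) slab; a trivial check, with the sizes assumed here rather than taken from the patch:

```c
#include <assert.h>

/* Why remapping one task struct costs only 2 ptes: the task_struct and
 * kernel stack share a single 8 KB, two-page allocation, so remapping
 * it at a fixed high address touches exactly two page-table entries.
 * Sizes are assumptions matching the mail's "2 ptes". */
#define PAGE_SIZE_BYTES  4096UL
#define TASK_UNION_BYTES 8192UL   /* task_struct + kernel stack */

static unsigned long ptes_for_task(void)
{
    return TASK_UNION_BYTES / PAGE_SIZE_BYTES;
}
```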

> > Now I do as follows:
> > - I map default_ldt at some fixed addr in all processes, including
> > swapper (one GLOBAL page)
> > - I don't change LDT allocation/loading...
> > - But LDT is allocated via vmalloc, so I changed vmalloc to return
> > addresses higher than TASK_SIZE (i.e. > 3GB in my case)
> > - I changed do_page_fault to setup vmalloced pages to current->mm->pgd
> >  instead of cr3 context.
> ugh. I think the LDT mapping scheme in my patch is cleaner and more
> robust.
Not sure. Your patch changes a lot of code which requires a lot of testing. 
Mine changes only 2 lines (do_page_fault (cr3 changes to current->mm->pgd) 
and VMALLOC_START)!

> on x86 it's not possible to switch into PAE mode and to switch to a new
> pagetable, atomically. So when PAE mode is enabled, it has to use the same
> pagetable as the 2-level pagetables - if for nothing else but a split
> moment. The PAE root pagetable is 4x 8 byte entries (32 bytes) at the %cr3
> address. The 2-level root pagetable is 4096 bytes at %cr3. The first 32
> bytes thus overlap - these 32 bytes cover 32 MB of RAM. So to keep the
> switchover as simple as possible, we have to leave out the first 32 MB.
> (this whole problem could be avoided by writing some trampoline code which
> uses temporary pagetables.)
Ugggghhh... You are right. The smallest offset is thus 32MB :(

> > 9. entry.S
> > - %cr3 vs %esp check. I've found in Intel docs that "movl %cr3, reg"
> > takes a long time (I didn't check it btw myself), so as for me I check
> > esp here, instead of cr3. Your RESTORE_ALL is too long, global vars and
> > markers can be avoided here.
> sure - but the TLB flush dominates the overhead anyway. I intend to cut
> down on the overhead in this path, but there's just so much to be won.
My tests showed me that the TLB flush is an expensive instruction, but every
additional instruction has a quite noticeable influence (~0.2%-0.6%) as well.
At least I saw the influence of every instruction when measuring gettimeofday
speed.

> - Why do you reload %esp every time? Its reload can be avoided, as can
> the reload of cr3, if called from the kernel (the problem with NMI is
> solvable)
> have you solved the NMI problem? Does it work under load?
Yes. I solved it. I do not reload esp and cr3 every time, only when I'm sure
reloading is required. And I do not mask interrupts for it via call gates,
nor do I introduce any magic marks or vars.
It works fine under load; I performed quite a large number of long-running
tests to compare performance and check stability.
The idea is the following:
1. cr3+esp reloading is not atomic, so we need to guarantee that anything
interrupting this sequence detects it somehow and does the reloading again.
2. all task structs are mapped at high addresses 0xffxxxxxx
3. We have two flags indicating that reloading has begun: cr3 and esp.
Whether we have finished reloading or not depends on the reloading order.
4. Given the above, I decided to reload cr3 first and then esp.
5. There are 3 situations:
a. we reloaded neither cr3 nor esp
b. we reloaded cr3, but not esp
c. we reloaded both.
Situations a and b require esp reloading. Situation c doesn't require any
action.
6. We can detect both situations a and b by checking that esp is not higher
than 0xffxxxxxx, and reload it if required. When such situations are detected
I reload cr3 as well. So with some small probability I can reload cr3 twice.
Not dangerous. If you don't like double cr3-reloads, you can check cr3
instead of esp, but my tests showed that checking esp gives a bit better
performance and is easier.
7. I do cli before SWITCH+RESTORE_ALL, preventing interrupts in this
sequence; eflags are restored by iret anyway. I.e. interrupts cannot
interrupt the switch to user-space, only the switch to kernel-space.
8. NMI is special. The only difference is that, unlike usual interrupts, NMI
can interrupt the return to user-space as well. This is solved easily by
saving the current cr3/esp on entry and restoring them if required on leave
("required" == esp >= 0xffxxxxxx).
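The detection rule in steps 5-6 boils down to a single range check on esp. A hedged userspace sketch of that predicate — the 0xff000000 cutoff and the function name are assumptions based on the "esp >= 0xffxxxxxx" wording above:

```c
#include <assert.h>

/* Detection rule from steps 5-6: because cr3 is reloaded before esp,
 * an esp that already points into the high (0xffxxxxxx) per-CPU
 * task-struct mapping proves both reloads completed; otherwise the
 * interrupted sequence must redo cr3 and esp. */
#define HIGH_MAPPING_BASE 0xff000000UL

static int switch_needs_redo(unsigned long esp)
{
    /* cases (a) and (b): esp not yet reloaded -> redo cr3 + esp;
     * case (c): esp already high -> nothing to do */
    return esp < HIGH_MAPPING_BASE;
}
```

As the mail notes, the check may occasionally redo a cr3 load that was already done (case b), which is harmless.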

> just boot a 64 GB box with mem=nopentium.
I think disabling PSE is not a good idea for such amount of memory.
But for 4Kb pages you are right....

Kirill


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
       [not found]   ` <20030709164852.523093a3.ak@suse.de>
@ 2003-07-09 15:17     ` Kirill Korotaev
  0 siblings, 0 replies; 23+ messages in thread
From: Kirill Korotaev @ 2003-07-09 15:17 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

> I added a printf for the actual stack

> thread_self=4000
> my stack 0xbffff4df
> attr_init stack=00000000, stack_size=001FF000

> That's the output for pthread_attr_init. The manpage says it should fill in
> the default values and a 0 base is not that unreasonable for it.
0 is ok, the problem is in get_stackaddr.

> stack rlimit=1ff000
> thread=4000, stack=40035480, stack_size=40035480
> For the main() thread it's wrong.

exactly! For the main thread pthread_get_stackaddr always returns bullshit
:(

But at least java 1.3 has a workaround inside and handles this magic value
separately, so it doesn't crash with a 3/1GB split (this value depends on
TASK_SIZE, or more precisely on the current stack value aligned to some big
boundary (AFAIR, 1GB)).

Anyway it's definitely a bug. I have a fix in a preloading .so library for
glibc which overrides the pthread_getstack_addr symbol. If you wish I can
send it to you.

> my stack 0xbf7ffaab
> attr_init stack=00000000, stack_size=001FF000
> stack rlimit=1ff000
> thread=4002, stack=BF800000, stack_size=001FF000
>
> For the others everything is correct.
true.

[skip]

Kirill


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
  2003-07-09 10:58 "Kirill Korotaev" 
@ 2003-07-09 15:14 ` Ingo Molnar
  2003-07-10 10:50   ` Kirill Korotaev
  2003-07-14 20:24 ` Ingo Molnar
  1 sibling, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2003-07-09 15:14 UTC (permalink / raw)
  To: "Kirill Korotaev" ; +Cc: linux-kernel


On Wed, 9 Jul 2003, "Kirill Korotaev" wrote:

> b. As far as I see you are mapping LDT in kernel-space at default addr
> using fixmap/kmap, but is LDT mapped in user-space? [...]

yes, it is also visible to user-space, otherwise user-space LDT
descriptors would not work. The top 16 MB of virtual memory also hosts the
'virtual LDT'.

> [...] How LDTs from different processes on different CPUs are
> non-overlapped?

the top 16 MB is split up into per-CPU areas, each CPU has its own. Check 
out atomic_kmap.h for details.
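The per-CPU carve-up of the top 16 MB might look like the following sketch. The base address, CPU count, and equal-split layout here are made-up illustration values; the real layout lives in atomic_kmap.h in the patch:

```c
#include <assert.h>

/* Illustrative split of a 16 MB top-of-address-space region into
 * equal per-CPU areas, as described above.  All constants are
 * assumptions for illustration only. */
#define TOP_REGION_BASE  0xff000000UL
#define TOP_REGION_SIZE  (16UL << 20)    /* 16 MB */
#define NR_CPUS          8

static unsigned long percpu_area_base(int cpu)
{
    return TOP_REGION_BASE +
           (unsigned long)cpu * (TOP_REGION_SIZE / NR_CPUS);
}
```

With these numbers each CPU owns a 2 MB window, so the per-CPU virtual-LDT and atomic-kmap slots of different CPUs can never overlap.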

> Now I do as follows:
> - I map default_ldt at some fixed addr in all processes, including swapper
> (one GLOBAL page)
> - I don't change LDT allocation/loading...
> - But LDT is allocated via vmalloc, so I changed vmalloc to return addresses
> higher than TASK_SIZE (i.e. > 3GB in my case)
> - I changed do_page_fault to setup vmalloced pages to current->mm->pgd
>  instead of cr3 context.

ugh. I think the LDT mapping scheme in my patch is cleaner and more
robust.

> 6. PAGE_OFFSET
> 
> + * Note: on PAE the kernel must never go below 32 MB, we use the
> + * first 8 entries of the 2-level boot pgd for PAE magic.
>
> Could you please help me to understand where this magic is?
> I now use a 64MB offset, but I failed to understand why it refused to boot with
> 16MB offset (AFAIR even w/o PAE).

on x86 it's not possible to switch into PAE mode and to switch to a new
pagetable, atomically. So when PAE mode is enabled, it has to use the same
pagetable as the 2-level pagetables - if for nothing else but a split
moment. The PAE root pagetable is 4x 8 byte entries (32 bytes) at the %cr3
address. The 2-level root pagetable is 4096 bytes at %cr3. The first 32
bytes thus overlap - these 32 bytes cover 32 MB of RAM. So to keep the
switchover as simple as possible, we have to leave out the first 32 MB.  
(this whole problem could be avoided by writing some trampoline code which
uses temporary pagetables.)
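The 32 MB figure follows directly from the entry sizes; a quick arithmetic check (my own back-of-the-envelope, not code from the patch):

```c
#include <assert.h>

/* Why the first 32 MB must be left out: the PAE root table (4 entries
 * x 8 bytes = 32 bytes) sits at %cr3, overlapping the first 32 bytes
 * of the 4096-byte 2-level pgd.  Each 2-level pgd entry (4 bytes)
 * covers 4 MB, so the overlapped region covers 8 * 4 MB = 32 MB. */
static unsigned long overlap_coverage_mb(void)
{
    unsigned long pae_root_bytes = 4 * 8; /* 4 PDPT entries, 8 B each */
    unsigned long twolevel_entry = 4;     /* bytes per 2-level entry  */
    unsigned long mb_per_entry   = 4;     /* each entry maps 4 MB     */

    return (pae_root_bytes / twolevel_entry) * mb_per_entry;
}
```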


> 7. X_TRAMP_ADDR, TODO: do boot-time code fixups not these runtime fixups.)
> 
> I did it another way: I introduced a new section which is mapped at high
> addresses in all pgds, and fit all the entry code from
> interrupts/exceptions/syscalls there. No relocations/fixups/trampolines
> are required with such approach.

yeah - i'll do something like this too. (Initially i wanted to have a
per-CPU trampoline, but it's not necessary anymore.)

> 9. entry.S
> 
> - %cr3 vs %esp check. I've found in Intel docs that "movl %cr3, reg"
> takes a long time (I didn't check it btw myself), so as for me I check
> esp here, instead of cr3. Your RESTORE_ALL is too long, global vars and
> markers can be avoided here.

sure - but the TLB flush dominates the overhead anyway. I intend to cut
down on the overhead in this path, but there's just so much to be won.

> - Why have you cut lcall7/lcall27? Due to call gate doesn't cli
> interrupts? Bad!! really bad :)

i have cut it because nothing i care about uses it. Feel free to add it
back.

> - Better to remove macro call_SYMBOL_NAME_ABS and many other hacks due
> to code relocation. Use vmlinux.lds to specify code offset.

yeah, agreed.

> - Why do you reload %esp every time? Its reload can be avoided, as can
> the reload of cr3, if called from the kernel (the problem with NMI is
> solvable)

have you solved the NMI problem? Does it work under load?

> 10. Bug in init_entry_mappings()?
> 
> +BUG_ON(sizeof(struct desc_struct)*NR_CPUS*GDT_ENTRIES > 2*PAGE_SIZE);
> AFAIK more than 1 entry per CPU is used (at least in 2.4.x).

what do you mean?

> Why do you think that warm-reboot is impossible?

it's not impossible, i just didn't fix it yet.

> 12. 8MB/16MB startup mapping.
> 
> As far as I understand 16MB startup mapping is not required here, am I
> wrong? Memory is mapped via 4MB pages so only a few pgd/pmd pages (1+4)
> are required. What else could consume memory so much before we mapped it
> all?

just boot a 64 GB box with mem=nopentium.
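A back-of-the-envelope sketch of why mem=nopentium hurts early boot on a big box: without 4 MB pages, each 4 KB PAE pte page maps only 2 MB, and the bootmem bitmap alone grows with RAM size. All numbers below are my own illustrative arithmetic, not figures from the patch:

```c
#include <assert.h>

/* Cost of dropping 4 MB pages under PAE: one 4 KB pte page
 * (512 entries x 8 bytes) maps only 2 MB, so mapping the ~896 MB
 * lowmem window already needs hundreds of pagetable pages before the
 * full mappings exist -- versus none with large pages. */
static unsigned long pte_pages_for_lowmem_pae(unsigned long lowmem_mb)
{
    unsigned long mb_per_pte_page = 2;  /* 512 PAE ptes x 4 KB each */
    return lowmem_mb / mb_per_pte_page;
}

/* The early bootmem bitmap uses one bit per 4 KB page of RAM. */
static unsigned long bootmem_bitmap_kb(unsigned long ram_gb)
{
    unsigned long pages = ram_gb * 1024 * 1024 / 4; /* 4 KB pages */
    return pages / 8 / 1024;                        /* bits -> KB  */
}
```

For 64 GB that bitmap alone is 2 MB, and together with the extra pte pages and other setup-time allocations an 8 MB startup mapping gets tight, hence the 16 MB mapping.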

> 14. debug regs
> 
> do you catch watchpoints on kernel code in do_debug()?
> Hardware breakpoints use linear addresses and can be set up by the user
>  to point at kernel code... in this case %dr7 should be cleared and
>  restored on returning to user-space...

indeed, i'll fix this.

> 15. performance
> 
>     Have you measured performance with your patch?
> I found that a PAE-enabled kernel executes sys_gettimeofday (I've chosen it
> for measuring in my tests) ~2 times slower than a non-PAE kernel, and 5.2
> times slower than the std kernel.

yes, i've written about the basic syscall-latency observations.  
(gettimeofday()  latency is very close to null-syscall latency.)

>     But in real-life tests (jbb2000, web-server stress tests) I found
> that the 4GB-split kernel's performance is <1-2% worse than without it.
> Looks very good, I think?

yeah, i've seen similar general real-life impact of 4G/4G, which is good -
but i wanted to point out the worst-case as well.

	Ingo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
       [not found] <E19aCeB-000ICs-00.kksx-mail-ru@f23.mail.ru.suse.lists.linux.kernel>
@ 2003-07-09 12:44 ` Andi Kleen
       [not found] ` <200307091851.33438.dev@sw.ru>
  1 sibling, 0 replies; 23+ messages in thread
From: Andi Kleen @ 2003-07-09 12:44 UTC (permalink / raw)
  To: Kirill Korotaev; +Cc: mingo, linux-kernel

"Kirill Korotaev"  <kksx@mail.ru> writes:

> I didn't change TASK_SIZE in my patch, since there is a bug in libpthread
> which causes SIGSEGV when java is run on a non-standard kernel split :(((
> Please test it with your kernel if you haven't yet. You can find a jbb2000
> test on the internet.

That's fixed in the Sun JVM 1.4.2, or in Blackdown 1.4.1. I had the same
problem on AMD64. Currently it has a special "3GB" personality to 
deal with the older JVMs. All the personality does is to move the top
of the stack; an application could still place mmaps behind it.

-Andi

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
@ 2003-07-09 10:58 "Kirill Korotaev" 
  2003-07-09 15:14 ` Ingo Molnar
  2003-07-14 20:24 ` Ingo Molnar
  0 siblings, 2 replies; 23+ messages in thread
From: "Kirill Korotaev"  @ 2003-07-09 10:58 UTC (permalink / raw)
  To: "Ingo Molnar" ; +Cc: linux-kernel

Hi!

> yeah - i wrote the 4G/4G patch a couple of weeks ago. I'll send it to lkml
> soon, feel free to comment on it. How does your patch currently look like?

My patch has also been ready for a couple of weeks :))) , but I haven't got
my company's permission to publish it yet :((((( And mine is for 2.4.x
kernels... But it can be quite easily adapted for 2.5.x and even 2.2.x :)))

I have quite a lot of questions/comments on your patch...
Some of them may be irrelevant since I haven't dealt with the 2.5.x kernels,
but I don't think there are many changes in this part of the code.

1. TASK_SIZE, PAGE_OFFSET_USER

I didn't change TASK_SIZE in my patch, since there is a bug in libpthread
which causes SIGSEGV when java is run on a non-standard kernel split :(((
Please test it with your kernel if you haven't yet. You can find the
jbb2000 test on the internet.

2. LDT

a. Why have you changed this line? Have you tested your patch with apps that
use the LDT? I would recommend installing libc 2.5 with TLS support in
libpthreads, inserting a printk in write_ldt() to see whether the LDT is
really used, and running a pthread-aware app, e.g. java.

-		if (unlikely(prev->context.ldt != next->context.ldt))
+		if (unlikely(prev->context.size + next->context.size))

b. As far as I can see, you are mapping the LDT in kernel space at a default
address using fixmap/kmap, but is the LDT mapped in user space? How do the
LDTs of different processes on different CPUs avoid overlapping?

I tried two solutions for the LDT:
- I mapped the process's LDT at a fixed address both in user space and
kernel space and remapped it every time a task switch occurred, but even
though these addresses were different for different CPUs it didn't work on
my machine (no matter UP/SMP). It worked fine with bochs (an x86 emulator)
but refused to work in real life; I think there is some CPU caching which
can't be controlled easily.
- I tried to reload the LDT and %fs,%gs on returning to user space, but that
didn't help either. And it was quite a messy solution, since fs/gs is
saved/restored in several places in the kernel, and I didn't want to simply
save it on kernel entry and restore it on kernel exit. So I gave up on this
one too...

Now I do as follows:
- I map default_ldt at a fixed address in all processes, including swapper
(one GLOBAL page)
- I don't change LDT allocation/loading...
- But the LDT is allocated via vmalloc, so I changed vmalloc to return
addresses higher than TASK_SIZE (i.e. > 3GB in my case)
- I changed do_page_fault to set up vmalloc'ed page mappings in
current->mm->pgd instead of the cr3 context.

3. csum_partial_xxx

It looks almost the same, but I plan to optimize it to use get_user_pages()
and thus avoid a double memory pass (copying and then checksumming =>
checksumming with inline copying).

4. TASK_SIZE

The name PAGE_OFFSET_USER sounds strange: user space has no offset (it is 0).
+#define TASK_SIZE	(PAGE_OFFSET_USER)

5. LDT_SIZE

+#define MAX_LDT_PAGES 16
This can be derived from PAGE_ALIGN(LDT_ENTRIES*LDT_ENTRY_SIZE) instead of
being hard-coded.

6. PAGE_OFFSET

+ * Note: on PAE the kernel must never go below 32 MB, we use the
+ * first 8 entries of the 2-level boot pgd for PAE magic.
Could you please help me understand where this magic is?
I now use a 64MB offset, but I failed to understand why it refused to boot
with a 16MB offset (AFAIR even without PAE).

7. X_TRAMP_ADDR, TODO: do boot-time code fixups not these runtime fixups.)

I did it another way:
I introduced a new section which is mapped at high addresses in all pgds and
put all the interrupt/exception/syscall entry code there. No
relocations/fixups/trampolines are required with this approach.

8. thread_info.h, /* offsets into the thread_info struct for assembly code
access */

I added an offset.c file which is preprocessed first and generates offset.h
with the offsets of all required struct fields (for me: tsk->xxx,
tsk->thread->xxx).

9. entry.S

- %cr3 vs %esp check.
I've found in the Intel docs that "movl %cr3, reg" takes a long time (I
didn't check it myself, btw), so I check %esp here instead of %cr3. Your
RESTORE_ALL is too long; the global vars and markers can be avoided here.
- Why have you cut lcall7/lcall27? Because the call gate doesn't cli
interrupts? Bad!! Really bad :)
- Better to remove the call_SYMBOL_NAME_ABS macro and many other hacks
caused by code relocation. Use vmlinux.lds to specify the code offset.
- Why do you reload %esp every time? Its reload can be avoided, as can the
reload of %cr3, when called from the kernel (the problem with NMI is
solvable).

10. Bug in init_entry_mappings()?

+BUG_ON(sizeof(struct desc_struct)*NR_CPUS*GDT_ENTRIES > 2*PAGE_SIZE);
AFAIK more than 1 entry per CPU is used (at least in 2.4.x).

11. machine_real_restart()

+       /*
+        * NOTE: this is a wrong 4G/4G PAE assumption. But it will triple
+        * fault the CPU (ie. reboot it) in a guaranteed way so we dont
+        * lose anything but the ability to warm-reboot. (which doesnt
+        * work on those big boxes using 4G/4G PAE anyway.)
+        */
Why do you think that a warm reboot is impossible?
BTW, for me this path didn't reboot at all until I fixed it. Check your
kernel with the option "reboot=b".

12. 8MB/16MB startup mapping.

As far as I understand, the 16MB startup mapping is not required here, am I
wrong? Memory is mapped via 4MB pages, so only a few pgd/pmd pages (1+4) are
required. What else could consume so much memory before it is all mapped?

13. Code style

Don't use magic constants like 8191, 8192 (THREAD_SIZE), 4096 (PAGE_SIZE) or
anything like that. It looks weird.

14. Debug regs

Do you catch watchpoints on kernel addresses in do_debug()?
Hardware breakpoints use linear addresses and can be set up by the user on
kernel code... in that case %dr7 should be cleared and restored on returning
to user space...

15. Performance

    Have you measured performance with your patch?
I found that the PAE-enabled kernel executes sys_gettimeofday (which I chose
for measuring in my tests) ~2 times slower than the non-PAE kernel, and 5.2
times slower than the standard kernel.
So for a simple loop with sys_gettimeofday():
PAE 4GB:	3.0 sec
4GB:		1.43 sec
original:	0.57 sec

    But in real-life tests (jbb2000, web-server stress tests) I found that
the 4GB-split kernel's performance is <1-2% worse than without it. Looks
very good, I think?

WBR, Kirill



end of thread, other threads:[~2003-07-14 20:23 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-07-08 22:45 [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support Ingo Molnar
2003-07-09  1:29 ` William Lee Irwin III
2003-07-09  5:13 ` Martin J. Bligh
2003-07-09  5:19   ` William Lee Irwin III
2003-07-09  5:43     ` William Lee Irwin III
2003-07-12 23:58       ` Davide Libenzi
2003-07-13  0:11         ` William Lee Irwin III
2003-07-13  8:13           ` jw schultz
2003-07-09  6:42   ` Ingo Molnar
2003-07-09  5:16 ` Dave Hansen
2003-07-09  7:08 ` Geert Uytterhoeven
2003-07-10  1:36 ` Martin J. Bligh
2003-07-10 13:36   ` Martin J. Bligh
2003-07-13 22:05 ` Petr Vandrovec
2003-07-09 10:58 "Kirill Korotaev" 
2003-07-09 15:14 ` Ingo Molnar
2003-07-10 10:50   ` Kirill Korotaev
2003-07-10 10:59     ` Ingo Molnar
2003-07-10 11:35     ` Russell King
2003-07-10 12:33       ` Kirill Korotaev
2003-07-14 20:24 ` Ingo Molnar
     [not found] <E19aCeB-000ICs-00.kksx-mail-ru@f23.mail.ru.suse.lists.linux.kernel>
2003-07-09 12:44 ` Andi Kleen
     [not found] ` <200307091851.33438.dev@sw.ru>
     [not found]   ` <20030709164852.523093a3.ak@suse.de>
2003-07-09 15:17     ` Kirill Korotaev
