* Re: [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
@ 2003-07-09 10:58 "Kirill Korotaev" 
  2003-07-09 15:14 ` Ingo Molnar
  2003-07-14 20:24 ` Ingo Molnar
  0 siblings, 2 replies; 23+ messages in thread
From: "Kirill Korotaev"  @ 2003-07-09 10:58 UTC (permalink / raw)
  To: "Ingo Molnar" ; +Cc: linux-kernel

Hi!

> yeah - i wrote the 4G/4G patch a couple of weeks ago. I'll send it to lkml
> soon, feel free to comment on it. How does your patch currently look like?

My patch has also been ready for a couple of weeks :))), but I haven't got
my company's permission to publish it yet :((((( And mine is for 2.4.x
kernels... but it can be quite easily adapted to 2.5.x and even to 2.2.x :)))

I have quite a lot of questions/comments on your patch...
Some of them may be irrelevant since I haven't dealt with 2.5.x kernels,
but I don't think this part of the code has changed much.

1. TASK_SIZE, PAGE_OFFSET_USER

I didn't change TASK_SIZE in my patch, since there is a bug in libpthread
which causes SIGSEGV when java is run on a kernel with a non-standard
split :((( Please test it with your kernel if you haven't yet. You can
find the jbb2000 test on the internet.

2. LDT

a. Why have you changed this line? Have you tested your patch with apps
that use the LDT? I would recommend installing libc 2.5 with TLS support
in libpthreads, inserting a printk in write_ldt() to see whether the LDT
is really used, and running a pthread-aware app, e.g. java.

-		if (unlikely(prev->context.ldt != next->context.ldt))
+		if (unlikely(prev->context.size + next->context.size))

b. As far as I can see, you map the LDT into kernel-space at a default
address using fixmap/kmap, but is the LDT mapped in user-space? How do
LDTs from different processes on different CPUs avoid overlapping?

I tried two solutions with the LDT:
- I mapped the process's LDT at a fixed address both in user-space and
kernel-space and remapped it every time a task switch occurred, but even
though these addresses were different for different CPUs it didn't work
on my machine (no matter UP/SMP). It worked fine with bochs (an x86
emulator) but refused to work in real life; I think there is some CPU
caching which can't be controlled easily.
- I tried to reload the LDT and %fs,%gs on the return to user-space, but
that didn't help either. And it was quite a messy solution, since fs/gs
are saved/restored in several places in the kernel, and I didn't want to
simply save them on kernel entry and restore them on kernel exit. So I
gave up on this one too...

Now I do as follows:
- I map default_ldt at a fixed address in all processes, including
swapper (one GLOBAL page).
- I don't change LDT allocation/loading...
- But the LDT is allocated via vmalloc, so I changed vmalloc to return
addresses higher than TASK_SIZE (i.e. > 3GB in my case).
- I changed do_page_fault to set up vmalloc'ed pages in current->mm->pgd
instead of the pagetable behind %cr3, as sketched below.
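
A minimal sketch of that do_page_fault fixup, assuming 2.4-style paging
macros (the helper name is mine, not the actual patch code):

#include <linux/sched.h>
#include <asm/pgtable.h>

/*
 * On a fault against a vmalloc'ed address, propagate the kernel
 * (init_mm) pgd entry into the current process's pgd instead of
 * touching the pagetable behind %cr3.
 */
static inline int vmalloc_fault_fixup(unsigned long address)
{
	pgd_t *pgd_k = pgd_offset_k(address);		/* init_mm entry */
	pgd_t *pgd = pgd_offset(current->mm, address);

	if (pgd_none(*pgd_k))
		return 0;	/* not a vmalloc fault, handle normally */
	set_pgd(pgd, *pgd_k);	/* copy the kernel mapping */
	return 1;		/* handled, just retry the access */
}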

3. csum_partial_xxx

It looks almost the same, but I plan to optimize it to use
get_user_pages() and thereby avoid the double memory access (copying and
then checksumming => checksumming with inline copying).

4.  TASK_SIZE

The name PAGE_OFFSET_USER sounds strange: user-space has no offset (it is 0).
+#define TASK_SIZE	(PAGE_OFFSET_USER)

5. LDT_SIZE

+#define MAX_LDT_PAGES 16
This could be derived from existing constants instead of being
hard-coded; see the sketch below.
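
Spelled out (my suggestion, not patch code; note the division by
PAGE_SIZE, since MAX_LDT_PAGES counts pages - with LDT_ENTRIES = 8192
and LDT_ENTRY_SIZE = 8 this gives 64 KB, i.e. 16 pages):

/* derive the page count from the architectural LDT limits */
#define MAX_LDT_PAGES \
	(PAGE_ALIGN(LDT_ENTRIES * LDT_ENTRY_SIZE) / PAGE_SIZE)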

6. PAGE_OFFSET

+ * Note: on PAE the kernel must never go below 32 MB, we use the
+ * first 8 entries of the 2-level boot pgd for PAE magic.
Could you please help me understand where this magic is?
I now use a 64MB offset, but I failed to understand why it refused to
boot with a 16MB offset (AFAIR even without PAE).

7. X_TRAMP_ADDR (TODO: do boot-time code fixups, not these runtime fixups.)

I did it another way:
I introduced a new section which is mapped at high addresses in all pgds,
and put all the entry code for interrupts/exceptions/syscalls there. No
relocations/fixups/trampolines are required with such an approach; see
the sketch below.
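
For illustration, roughly how entry code can be tagged for such a
section (the section and macro names here are mine; the matching
placement at a high virtual address would be done in vmlinux.lds):

/* tag entry-path code so the linker script can place it in a high
 * window that every pgd maps */
#define __entrytext __attribute__((__section__(".entry.text")))

void __entrytext entry_stub_example(void)
{
	/* entry code reachable no matter which pagetable is loaded */
}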

8. thread_info.h, /* offsets into the thread_info struct for assembly code
access */

I added an offset.c file which is preprocessed first and which generates
offset.h with the offsets of all required struct fields (for me these
are tsk->xxx and tsk->thread->xxx); a sketch follows.
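
A minimal sketch of that trick, assuming the usual mechanism: the file
is compiled to assembly, and the build greps the markers out into
offset.h (the symbol names here are illustrative):

/* offset.c - compiled with -S; the asm() markers survive into the
 * generated assembly, from which the build extracts offset.h */
#include <linux/sched.h>
#include <linux/stddef.h>

#define DEFINE(sym, val) \
	asm volatile("\n->" #sym " %0" : : "i" (val))

void foo(void)
{
	DEFINE(TASK_FLAGS, offsetof(struct task_struct, flags));
	DEFINE(THREAD_ESP0, offsetof(struct task_struct, thread.esp0));
}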

9. entry.S

- %cr3 vs %esp check.
I've found in the Intel docs that "movl %cr3, reg" takes a long time
(though I haven't verified this myself), so I check %esp here instead of
%cr3. Your RESTORE_ALL is too long; the global variables and markers can
be avoided here.
- Why have you cut lcall7/lcall27? Because a call gate doesn't disable
interrupts? Bad!! Really bad :)
- Better to remove the call_SYMBOL_NAME_ABS macro and the many other
hacks caused by code relocation. Use vmlinux.lds to specify the code
offset.
- Why do you reload %esp every time? Its reload can be avoided, as can
the reload of %cr3 when entered from the kernel (the problem with NMI is
solvable).

10. Bug in init_entry_mappings()?

+BUG_ON(sizeof(struct desc_struct)*NR_CPUS*GDT_ENTRIES > 2*PAGE_SIZE);
AFAIK more than 1 entry per CPU is used (at least in 2.4.x).

11. machine_real_restart()

+       /*
+        * NOTE: this is a wrong 4G/4G PAE assumption. But it will triple
+        * fault the CPU (ie. reboot it) in a guaranteed way so we dont
+        * lose anything but the ability to warm-reboot. (which doesnt
+        * work on those big boxes using 4G/4G PAE anyway.)
+        */
Why do you think that warm reboot is impossible?
BTW, for me this path didn't reboot at all until I fixed it. Check your
kernel with the option "reboot=b".

12. 8MB/16MB startup mapping.

As far as I understand, the 16MB startup mapping is not required here -
am I wrong? Memory is mapped via 4MB pages, so only a few pgd/pmd pages
(1+4) are required. What else could consume so much memory before it is
all mapped?

13. Code style

Don't use magic constants like 8191, 8192 (THREAD_SIZE), 4096 (PAGE_SIZE)
or the like. It looks weird.

14. Debug regs

Do you catch watchpoints on kernel addresses in do_debug()?
Hardware breakpoints use linear addresses and can be set up by the user
to point into kernel code... in that case %dr7 should be cleared and
restored on the return to user-space... A sketch follows.
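
A minimal sketch of the idea (raw inline asm, helper names mine):
do_debug() would stash %dr7 in the thread struct and clear it, and the
return-to-user path would restore it.

/* read/clear %dr7 so a user-armed hardware breakpoint on a kernel
 * address cannot keep re-triggering while we run in the kernel */
static inline unsigned long read_dr7(void)
{
	unsigned long dr7;
	asm volatile("movl %%dr7, %0" : "=r" (dr7));
	return dr7;
}

static inline void write_dr7(unsigned long dr7)
{
	asm volatile("movl %0, %%dr7" : : "r" (dr7));
}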

15. Performance

    Have you measured performance with your patch?
I found that a PAE-enabled kernel executes sys_gettimeofday (I chose it
for measuring in my tests) ~2 times slower than a non-PAE kernel, and
5.2 times slower than the standard kernel.
So for a simple loop with sys_gettimeofday():
PAE 4GB:	3.0 sec
4GB:		1.43 sec
original:	0.57 sec
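
(For reference, the loop was of roughly this shape - my reconstruction,
and the iteration count is illustrative:)

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
	struct timeval start, end, tmp;
	long i, usec;

	gettimeofday(&start, NULL);
	for (i = 0; i < 1000000; i++)
		gettimeofday(&tmp, NULL);	/* the syscall under test */
	gettimeofday(&end, NULL);

	usec = (end.tv_sec - start.tv_sec) * 1000000L
	     + (end.tv_usec - start.tv_usec);
	printf("1M gettimeofday() calls: %ld usec\n", usec);
	return 0;
}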

    But in real-life tests (jbb2000, web-server stress tests) I found
that the 4G-split kernel's performance is less than 1-2% worse than
without it. Looks very good, I think?

WBR, Kirill


* [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support
@ 2003-07-08 22:45 Ingo Molnar
  2003-07-09  1:29 ` William Lee Irwin III
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Ingo Molnar @ 2003-07-08 22:45 UTC
  To: linux-kernel; +Cc: linux-mm


i'm pleased to announce the first public release of the "4GB/4GB VM split"
patch, for the 2.5.74 Linux kernel:

   http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8

The 4G/4G split feature is primarily intended for large-RAM x86 systems,
which want to (or have to) get more kernel/user VM, at the expense of
per-syscall TLB-flush overhead.

on x86, the total amount of virtual memory - as we all know - is limited
to 4GB. Of this total 4GB VM, userspace uses 3GB (0x00000000-0xbfffffff)
and the kernel uses 1GB (0xc0000000-0xffffffff). This VM scheme is called
the 3/1 split. This split works perfectly fine up until 1 GB of RAM - and
it works adequately well even after that, due to 'highmem', which moves
various larger caches (and objects) into the high memory area.

But as the amount of RAM increases, the 3/1 split becomes a real
bottleneck. Despite highmem being utilized by a number of large-size
caches, one of the most crucial data structures, the mem_map[], is
allocated out of the 1 GB kernel VM. With 32 GB of RAM the remaining 0.5
GB lowmem area is quite limited and only represents 1.5% of all RAM.
Various common workloads exhaust the lowmem area and create artificial
bottlenecks. With 64 GB RAM, the mem_map[] alone takes up nearly 1 GB of
RAM (64 GB of 4 KB pages means 16M struct page entries, at roughly 64
bytes each), making the kernel unable to boot. Relocating the mem_map[]
to highmem is very impractical, due to the deep integration of this
central data structure into the whole kernel - the VM, lowlevel arch
code, drivers, filesystems, etc.

with the 4G/4G patch, the kernel can be compiled in 4G/4G mode, in which
case there's a full, separate 4GB VM for the kernel, and there are
separate full (and per-process) 4GB VMs for user-space.

A typical /proc/PID/maps file of a process running on a 4G/4G kernel shows
a full 4GB address-space:

 00e80000-00faf000 r-xp 00000000 03:01 175909     /lib/tls/libc-2.3.2.so
 00faf000-00fb2000 rw-p 0012f000 03:01 175909     /lib/tls/libc-2.3.2.so
 [...]
 feffe000-ff000000 rwxp fffff000 00:00 0

the stack ends at 0xff000000 (4GB minus 16MB). The kernel has a 4GB lowmem
area, of which 3.1 GB is still usable even with 64 GB of RAM:

 MemTotal:     66052020 kB
 MemFree:      65958260 kB
 HighTotal:    62914556 kB
 HighFree:     62853140 kB
 LowTotal:      3137464 kB
 LowFree:       3105120 kB

the amount of lowmem is still more than 3 times the amount of lowmem
available to a 4GB system. It's more than 6 times the amount of lowmem a
32 GB system gets with the 3/1 split.

Performance impact of the 4G/4G feature:

There's a runtime cost with the 4G/4G patch: to implement separate address
spaces for the kernel and userspace VM, the entry/exit code has to switch
between the kernel pagetables and the user pagetables. This causes TLB
flushes, which are quite expensive, not so much in terms of TLB misses
(which are quite fast on Intel CPUs if they come from caches), but in
terms of the direct TLB flushing cost (%cr3 manipulation) done on
system-entry.

RAM limits:

in theory, the 4G/4G patch could provide a mem_map[] for 200 GB (!) of
physical RAM on x86, while still having 1 GB of lowmem left. So it gives
quite some headroom. While the right solution for lots of RAM is to use a
proper 64-bit system, there's a lot of existing x86 hardware, and x86
servers will still be sold for the next couple of years, so we ought to
support them maximally.

The patch is orthogonal to wli's pgcl patch - both patches try to achieve
the same goal with different methods. I can very well imagine workloads
where we want to have the combination of the two patches.

Implementational details:

the patch implements/touches a number of new lowlevel x86 infrastructures:

 - it moves the GDT, IDT, TSS, LDT, vsyscall page and kernel stack up into
   a high virtual memory window (trampoline) at the top 16 MB of the
   4GB address space. This 16 MB window is the only area that is shared 
   between user-space and kernel-space pagetables.

 - it splits out atomic kmaps from highmem dependencies.

 - it makes LDT(s) atomic-kmap-ed.

 - (and lots of other smaller details, like increasing the size of the
   initial mappings and fixing the PAE code to map the full 4GB of kernel
   VM.)

Whenever we do a syscall (or any other trap) from user-mode, the
high-address trampoline code starts to run, with a high-address esp0. This
code switches over to the kernel pagetable, then it switches the 'virtual
kernel stack' to the regular (real) kernel stack. On syscall-exit it does
it the other way around.

there are a few generic kernel changes as well:

 - it implements 'indirect uaccess' primitives, i.e. all the
   get_user/put_user/copy_to_user/... functions implemented without
   relying on direct access to user-space (a sketch follows after this
   list). This feature has already uncovered a number of bugs in the
   lowlevel x86 code - there was still code that accessed user-space
   memory directly.

 - it splits up PAGE_OFFSET into PAGE_OFFSET_USER and PAGE_OFFSET (kernel)

 - fixes a couple of assumptions about PAGE_OFFSET being PMD_SIZE aligned.

but the generic-kernel impact of the patch is quite low.
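
To illustrate the indirect-uaccess idea, here is a sketch (mine, not the
patch's actual code) that pins the backing page through the process
pagetables and copies via an atomic kmap; it ignores page-crossing and
most error paths for brevity:

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
#include <linux/string.h>

/* copy len bytes (within one page) out to user-space without ever
 * dereferencing the user address directly */
static int indirect_copy_to_user(unsigned long uaddr,
				 const void *src, size_t len)
{
	struct mm_struct *mm = current->mm;
	struct page *page;
	char *kaddr;
	int ret;

	down_read(&mm->mmap_sem);
	ret = get_user_pages(current, mm, uaddr & PAGE_MASK,
			     1, 1 /* write */, 0, &page, NULL);
	up_read(&mm->mmap_sem);
	if (ret != 1)
		return -EFAULT;

	kaddr = kmap_atomic(page, KM_USER0);
	memcpy(kaddr + (uaddr & ~PAGE_MASK), src, len);
	kunmap_atomic(kaddr, KM_USER0);

	set_page_dirty(page);
	page_cache_release(page);
	return 0;
}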

the patch optimizes kernel<->kernel context switches and does not flush
the TLB there; also, IRQ entry only causes a TLB flush if a userspace
pagetable is loaded.

the typical cost of 4G/4G on typical x86 servers is +3 usecs of syscall
latency (this is in addition to the ~1 usec null syscall latency).
Depending on the workload this can cause a typical measurable wall-clock
overhead from 0% to 30%, for typical application workloads (DB workload,
networking workload, etc.). Isolated microbenchmarks can show a bigger
slowdown as well - due to the syscall latency increase.

i'd guess that the 4G/4G patch is not worth the overhead for systems with
less than 16 GB of RAM (although exceptions might exist, for particularly
lowmem-intensive/sensitive workloads). 32 GB RAM systems run into lowmem
limitations quite frequently so the 4G/4G patch is quite recommended
there, and for 64 GB and larger systems it's a must i think.

Status, future plans:

The patch is a work-in-progress snapshot - it still has a few TODOs and
FIXMEs, but it compiles & works fine for me. Be careful with it
nevertheless - it's an experimental patch which does very intrusive
changes to the lowlevel x86 code.

There are a couple of performance enhancements on top of this patch that
i'll integrate in the next couple of days, but i first wanted to release
the base patch.

In any case, enjoy the patch - and as usual, comments and suggestions are
more than welcome,

	Ingo


