linux-kernel.vger.kernel.org archive mirror
* 2.4.23aa2 (bugfixes and important VM improvements for the high end)
@ 2004-02-27  1:33 Andrea Arcangeli
  2004-02-27  4:38 ` Rik van Riel
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-27  1:33 UTC (permalink / raw)
  To: linux-kernel

This includes some relevant fixes for 2.4; if I missed something
important please let me know. I'm not following 2.4 mainline anymore
since it's a lot of work and I'm trying to ship a 2.6-aa soon.

The most interesting part of this update is a VM improvement for high
end machines: it makes it possible to swap several gigs efficiently
while doing I/O on very high end >=32G machines. This applies to all
archs (it's unrelated to x86 or the zone normal). See below for
details.

URL:

	http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.23aa2.gz

Diff between 2.4.23aa1 and 2.4.23aa2:

Only in 2.4.23aa2: 00_blkdev-eof-1

	Allow reading the last block (but not writing it).

Only in 2.4.23aa1: 00_csum-trail-1

	Obsoleted (can't trigger anymore).

Only in 2.4.23aa2: 00_elf-interp-check-arch-1

	Make sure interpreter is of the right architecture.

Only in 2.4.23aa1: 00_extraversion-33
Only in 2.4.23aa2: 00_extraversion-34

	Rediff.

Only in 2.4.23aa2: 00_mremap-1

	Various mremap fixes.

Only in 2.4.23aa2: 00_ncpfs-1

	Limit dentry name length (from Andi).

Only in 2.4.23aa2: 00_rawio-crash-1

	Handle partial get_user_pages correctly.

Only in 2.4.23aa2: 05_vm_28-shmem-big-iron-1

	Make it possible to efficiently swap huge amounts of shm. The stock
	2.4 VM algorithms aren't capable of dealing with huge amounts of shm
	on huge machines. This has been a showstopper in production on 32G
	boxes that swap regularly (the shm in this case wasn't pure fs cache
	like in Oracle, so it really had to be swapped out).

	There are three basic problems that interact in non-obvious ways, all
	three fixed in this patch.

	1) This one is well known and in fact it's already fixed in 2.6
	mainline; too bad the way it's fixed in 2.6 mainline makes 2.6
	unusable in a workload like the one running on these 32G high end
	machines, and 2.6 will now have to be changed because of that.
	Anyway, returning to 2.4: the swap_out loop with pagetable walking
	doesn't scale for huge shared memory. This is obvious. With half a
	million pages mapped some hundred times in the address spaces, when
	we have to swap shm, before we can writepage a single dirty shm page
	with page_count(page) == 1, we first have to walk and destroy the
	whole address space. That doesn't unmap just one single page, it
	unmaps everything else too. All the address spaces in the machine
	are destroyed in order to make single shared shm pages freeable, and
	that generates a constant flood of expensive minor page faults. The
	way to fix it without any downside (except purely theoretical ones
	exposed by Andrew last year, and you can maliciously waste the same
	cpu using truncate) is a mix between objrmap (note: objrmap has
	nothing to do with the rmap as in 2.6) and the pagetable walking
	code. In short, during the pagetable walking I check whether a page
	is freeable and, if it's not yet, I run objrmap on it to make it
	immediately freeable if the trylocking permits. This allows swap_out
	to make progress and to unmap one shared page at a time, instead of
	unmapping all of them from all address spaces before the first one
	becomes freeable/swappable. Some lib functions in the patch are
	taken from the objrmap patch for 2.6 in the mbligh tree implemented
	by IBM (thanks to Martin and IBM for keeping that patch up to date
	for 2.6; it is a must-have starting point for the 2.6 VM too). The
	original idea of using objrmap for the vm unmapping procedure is
	from David Miller (objrmap itself has always existed in every linux
	kernel out there, to provide mmap coherency through vmtruncate, and
	now it is also used from the 2.4-aa swap_out pagetable walking). To
	give an idea, the top profiled function while swapping 4G of shm on
	the 32G box is now try_to_unmap_shared_vma.
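
	To picture the idea, here is a rough sketch (not the actual patch
	code, just an illustration assuming the usual 2.4 structures:
	try_to_unmap_shared_vma() is the function named above but its
	signature here is a guess, and try_to_make_freeable() is a made-up
	name for the hook called from the swap_out pagetable walk):

	static int try_to_make_freeable(struct page *page)
	{
		struct address_space *mapping = page->mapping;
		struct vm_area_struct *vma;

		if (page_count(page) == 1)
			return 1;	/* only the cache holds it: freeable */
		if (!mapping || TryLockPage(page))
			return 0;	/* best effort: skip on lock contention */

		/*
		 * objrmap pass: visit every vma sharing this mapping and zap
		 * the pte that maps this page, instead of waiting for
		 * swap_out to slowly reach those mms in its own walk.
		 */
		for (vma = mapping->i_mmap_shared; vma; vma = vma->vm_next_share)
			try_to_unmap_shared_vma(vma, page);

		UnlockPage(page);
		return page_count(page) == 1;
	}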

	2) The writepage callback of the shared memory converts a shm page
	into a swapcache page, but it doesn't start the I/O on the swapcache
	immediately. This is nonsense and it's easy to fix by simply calling
	the swapcache writepage within shm_writepage before returning (then
	of course we must not unlock the page before returning anymore; the
	I/O completion will unlock it).

	Not starting the I/O means we'll start it only after another million
	pages, and then after starting the I/O we'll notice the finally
	freeable page only after another million-page pass.
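
	A sketch of the fix (simplified, not the literal patch;
	move_shm_page_to_swapcache() is a made-up name standing in for the
	real bookkeeping that moves the page from the shm inode into the
	swapcache):

	static int shmem_writepage(struct page *page)
	{
		swp_entry_t entry;

		entry = get_swap_page();
		if (!entry.val) {
			/* no swap left: keep the page dirty and give up */
			UnlockPage(page);
			return 1;
		}

		/* move the page from the shm inode into the swapcache */
		move_shm_page_to_swapcache(page, entry);

		/*
		 * The fix: start the swap I/O right now instead of leaving
		 * another dirty swapcache page in the lru.  The page stays
		 * locked; the I/O completion will unlock it.
		 */
		rw_swap_page(WRITE, page);
		return 0;
	}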

	A super swap storm was happening because of these issues; the
	interaction between point 1 and point 2 was especially detrimental
	(with 1 and 2 I mean the phase from unmapping the page to starting
	the I/O; the last phase, from starting the I/O to effectively
	noticing a freeable clean swapcache page, is addressed in point 3
	below).

	In short, if you had to swap 4k of shm, the kernel would immediately
	move more than 4G (G is not a typo ;) of shm into swapcache, and
	then it would start the I/O to swap all those 4G out, because it was
	all dirty and freeable cache (from shmfs) queued contiguously in the
	lru. So after the first 4k of shm was swapped out, every further
	allocation would generate a swapout for a very, very long time.
	Reading a file would involve swapping out as much data as was read,
	not to mention reading into empty vmas, which would generate twice
	as much swapping as the I/O itself. So machines that were swapping
	slightly (around 4G on a 32G box) were entering a swap storm mode
	with huge stalls lasting several dozen minutes (you can imagine what
	happens if every memory allocation executes I/O before returning).

	Here is a trace of that scenario as soon as the first 35M of address
	space become freeable and are moved into swapcache. This is the
	point where basically all shm is already freeable but dirty; as you
	can see, the machine hangs at 100% system load (8 cpus doing nothing
	but throwing away address space in the background swap_out and
	calling shm_writepage until all shm is converted to swapcache).

	procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
	r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
	33  1  35156 507544  80232 13523880    0  116     0   772  223 53603 24 76  0  0
	29  3  78944 505116  72184 13342024    0  124     0  2092  153 52579 27 73  0  0
	29  0 114916 505612  70576 13174984    0    0     4 10556  415 26109 31 69  0  0
	29  0 147924 506796  70588 13150352    0    0    12  7360  717  3944 32 68  0  0
	25  1 108764 507192  68968 12942260    0    0     8 10989 1111  4886 19 81  0  0
	27  0 266504 505104  64964 12717276    0    0     3  3122  223  2598 56 44  0  0
	24  0 418204 505184  29956 12550032    0    4     1  3426  224  5242 56 44  0  0
	16  0 613844 505084  29916 12361204    0    0     4    80  135  4522 23 77  0  0
	20  2 781416 505048  29916 12197904    0    0     0     0  111 10961 13 87  0  0
	22 0 1521712 505020  29880 11482256    0    0    40   839 1208  4182 13 87  0  0
	24 0 1629888 505108  29852 11374864    0    0     0    24  377   537 13 87  0  0
	27 0 1743756 505052  29852 11261732    0    0     0     0  364   598 11 89  0  0
	25 0 1870012 505136  29900 11135420    0    0     4   254  253   491  7 93  0  0
	24 0 2024436 505160  29900 10981496    0    0     0    24  125   484 10 90  0  0
	25 0 2287968 505280  29840 10718172    0    0     0     0  116  2603 11 89  0  0
	23 0 2436032 505316  29840 10570856    0    0     0     0  122   418  3 97  0  0
	25 0 2536336 505380  29840 10470516    0    0     0    24  115   389  2 98  0  0
	26 0 2691544 505564  29792 10316112    0    0     0     0  125   443  8 92  0  0
	27 0 2847872 505508  29836 10159752    0    0     0   226  138   482  7 93  0  0
	24 0 2985480 505380  29836 10022836    0    0     0   524  146   491  5 95  0  0
	24 0 3123600 505048  29792 9885668    0    0     0   149  112   397  2 98  0  0
	27 0 3274800 505116  29792 9734396    0    0     0     0  112   377  5 95  0  0
	28 0 3415516 505156  29792 9593624    0    0     0    24  109  1822  8 92  0  0
	26 0 3551020 505272  29700 9458072    0    0     0     0  113   453  7 93  0  0
	26 0 3682576 505052  29744 9326488    0    0     0   462  140   461  5 95  0  0
	26 0 3807984 505016  29744 9200960    0    0     0    24  125   458  3 97  0  0
	27 0 3924152 505020  29700 9085040    0    0     0     0  121   514  3 97  0  0
	28 0 4109196 505072  29700 8900812    0    0     0     0  120   417  6 94  0  0
	23 2 4451156 505284  29596 8558780    0    0     4    24  122   491  9 91  0  0
	23 0 4648772 505044  29600 8361080    0    0     4     0  124   544  5 95  0  0
	26 2 4821844 505068  29572 8187904    0    0     0   314  132   492  3 97  0  0

	After all address spaces have been destroyed multiple times and all
	shm has been converted into swapcache, the kernel can finally swap
	out the first 4kbytes of shm:

	6 23 4560356 506160  22188 8440104    0 64888   16 67280 3006 534109 6 88  5  0

	The rest is infinite swapping with the machine totally hung, since
	those 4.8G of swapcache are now freeable but dirty, and the kernel
	will not notice the freeable and clean swapcache generated by this
	64M swapout, since it's queued at the opposite side of the lru and
	we'll find another million pages to swap (i.e. writepage) before
	noticing it. So the kernel will effectively start generating the
	first 4k of free memory only after those half million pages have
	been swapped out.

	So the fixes for points 1 and 2 make the kernel unmap only 64M of
	shm and swap it out immediately, but we can still run into huge
	problems from having half a million pages of shm queued in a row in
	the lru.

	3) After fixing 1 and 2, a file copy was still not running (well, it
	was running, but some 10 or 100 times slower than it should have,
	preventing any backup activity from working etc.) once the machine
	was 4G into swap. But at least the machine was swapping just fine,
	so the database and the application itself were doing pretty well
	(no swap storms anymore after fixing 1 and 2); it's the other apps
	allocating further cache that didn't work yet.

	So I tried reading a huge multi-gig file. I left the cp running for
	a dozen minutes at a terribly slow rate (despite the data being on
	the SAN), and I noticed this cp was pushing the database another few
	gigs further into swap, precisely as much as the size of the file.
	During the read the swapouts greatly exceeded the reads from disk
	(so vs. bi in vmstat). In short the cache allocated on the huge file
	was behaving like a memory leak. But it was not a memory leak: it
	was perfectly fine clean and freeable cache reachable from the lru;
	too bad we still had a hundred thousand shm pages being unmapped (by
	the background objrmap passes) to walk, shm->writepage and
	swapcache->writepage, before noticing the immediately freeable gigs
	of clean file cache.

	In short the system was doing fine, but due to the ordering of the
	lru it preferred to swap gigs and gigs of shm, pushing the running
	db into swap, instead of shrinking the several unused (and
	absolutely not aged) gigs of clean cache. This wasn't good for
	efficient swapping either, because it took a long time before the
	vm could notice the clean swapcache after it had started the I/O on
	it.

	After that it was pretty clear that we have to prioritize and prefer
	discarding memory that is zero-cost to collect, rather than doing
	extremely expensive things to release memory instead. This change
	isn't absolutely fair, but the current behaviour of the vm is an
	order of magnitude worse on the high end. So the fix I implemented
	is to run an inactive_list/vm_cache_scan_ratio pass over the clean,
	immediately freeable cache in the inactive list before going into
	the ->writepage business, and voila, copies were running back at
	80M/sec as if no 4G were being swapped out at the same time. Since
	swap, data and binaries are all in different places (data on the
	SAN, swap on a local IDE, binaries on another local IDE), swapping
	while copying didn't hurt too much either.
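
	To give an idea of the code, a rough sketch (a sketch only, not the
	2.4.23aa2 code: reclaim_clean_page() is a made-up stand-in for the
	usual removal path, the pagemap_lru_lock/pagecache locking is
	omitted for brevity, and vm_cache_scan_ratio is the sysctl
	mentioned above):

	extern int vm_cache_scan_ratio;	/* the new sysctl */

	static int shrink_clean_cache_first(void)
	{
		int max_scan = nr_inactive_pages / vm_cache_scan_ratio;
		int freed = 0;
		struct list_head *entry = inactive_list.prev;

		while (max_scan-- && entry != &inactive_list) {
			struct page *page = list_entry(entry, struct page, lru);
			entry = entry->prev;

			/* zero-cost reclaim only: anything dirty, mapped or
			 * with buffers is left for the later writepage pass */
			if (PageDirty(page) || page->buffers ||
			    page_count(page) != 1)
				continue;
			if (TryLockPage(page))
				continue;

			reclaim_clean_page(page);
			freed++;
		}
		return freed;
	}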

	A better fix would be to have an anchor in the lru (it could be a
	per-lru page_t with PG_anchor set) and to prevent the clean-cache
	search from altering the point where we keep swapping with
	writepage, but it shouldn't matter that much, and with 2.4 being
	obsolete it isn't very worthwhile to make it even better.

	On the low end (<8G) these effects that hang machines for hours and
	make any I/O impossible weren't visible, because the lru is more
	mixed and there are never too many shm pages in a row. The less
	ram, the less visible this effect is. However, even on small
	machines swapping shm should now be more efficient. The downside is
	that pure clean cache is now penalized against dirty cache, but I
	believe it's worth paying that price to survive and handle the very
	high end workloads.

	I didn't find it very attractive to make these changes in 2.4 at
	this time, but 2.6 has no way to run these workloads, and no 2.6
	kernel is certified for some products yet, etc. This code is now
	running in production and people seem happy. I think this is the
	first linux ever with a chance of properly handling this high end
	workload on high end hardware, so if you had problems give this a
	spin. 2.6 will obviously lock up immediately in these workloads due
	to the integration of rmap in 2.6 (already verified, just to be sure
	my math was correct), and 2.4 mainline will run oom as well (but no
	lockup in 2.4 mainline, since it handles zone normal failures
	gracefully unlike 2.6) since it lacks pte-highmem.

	If this slows down the VM on the low end (i.e. a 32M machine;
	probably the smallest box where I tested it is my laptop with 256M
	;) try increasing the vm_cache_scan_ratio sysctl, and please notify
	me of any regression. Thank you!

Only in 2.4.23aa1: 21_pte-highmem-mremap-smp-scale-1
Only in 2.4.23aa2: 21_pte-highmem-mremap-smp-scale-2

	Fix race condition if src is cleared under us while
	we allocate the pagetable (from 2.6 CVS, from Andrew).

Only in 2.4.23aa2: 30_20-nfs-directio-lfs-1

	Read right page with directio.

Only in 2.4.23aa2: 9999900_overflow-buffers-1

	Paranoid check (dirty buffers should always be <= NORMAL_ZONE; the
	other counters for clean and locked are never read, so this is a
	noop, but it's safer).

Only in 2.4.23aa2: 9999900_scsi-deadlock-fix-1

	Avoid locking inversion.


* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27  1:33 2.4.23aa2 (bugfixes and important VM improvements for the high end) Andrea Arcangeli
@ 2004-02-27  4:38 ` Rik van Riel
  2004-02-27 17:32   ` Andrea Arcangeli
  0 siblings, 1 reply; 100+ messages in thread
From: Rik van Riel @ 2004-02-27  4:38 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

On Fri, 27 Feb 2004, Andrea Arcangeli wrote:

> 	becomes freeable/swappable. Some lib function in the patch is taken
> 	from the objrmap patch for 2.6 in the mbligh tree implemented by IBM
> 	(thanks to Martin and IBM for maintaining that patch uptodate for 2.6,
> 	that is a must-have starting point for the 2.6 VM too).  The original
> 	idea of using objrmap for the vm unmapping procedure is from David
> 	Miller (objrmap itself has always existed in every linux kernel out

Good to hear that you're finally convinced that some form
of reverse mapping is needed.

I agree with you that object based rmap may well be better
for 2.6, if you want to look into that I wouldn't mind at
all.  Especially if we can keep akpm's and Nick's nice VM
balancing intact ...

> 	The rest is infinite swapping and machine total hung, since those 4.8G
> 	of swapcache are now freeable and dirty, and the kernel will not
> 	notice the freeable and clean swapcache generated by this 64M
> 	swapout, since it's being queued at the opposite side of the lru

An obvious solution for this is the O(1) VM stuff that Arjan
wrote and I integrated into rmap 15.  Should be worth looking
into this for the 2.6 kernel ...

Basically it keeps the just-written pages near the end of the
LRU, so they're easily found and freed, before the kernel even
starts thinking about submitting the other gigabytes of dirty
data for writeout.

> 	efficient swapping because it was taking a long time before the
> 	vm could notice the clean swapcache, after it started the I/O on
> 	it.

... Arjan's O(1) VM stuff ;)

> 	It was pretty clear after that, that we've to prioritize and to
> 	prefer discarding memory that is zerocost to collect, than to do
> 	extremely expensive things to release free memory instead.

I'm not convinced.  If we need to free up 10MB of memory, we just
shouldn't do much more than 10MB of IO.  Doing just that should be
cheap enough, after all.

The problem is when you do two orders of magnitude more writes than
the amount of memory you need to free.  Trying to do zero IO probably
isn't quite needed ...

> 	vm is an order of magnitude worse in the high end. So the fix I
> 	implemented is to run a inactive_list/vm_cache_scan_ratio pass
> 	on the clean immediatly freeable cache in the inactive list

Should work ok for a while, until you completely run out of
clean pages and then you might run into a wall ... unless you
implement smarter cleaning & freeing like Arjan's stuff does.

Then again, your stuff will also find pages the moment they're
cleaned, just at the cost of a (little?) bit more CPU time.
Shouldn't be too critical, unless you've got more than maybe
a hundred GB of memory, which should be a year off.

> 	A better fix would be to have an anchor in the lru (can be a per-lru
> 	page_t with a PG_anchor set) and to avoid the clean-cache search to
> 	alter the point where we keep swapping with writepage, but it
> 	shouldn't matter that much and 2.4 being obsolete isn't very
> 	worthwhile to make it even better.

Hey, that's Arjan's stuff ;)   Want to help get that into 2.6 ? ;)

cheers,

Rik
-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27  4:38 ` Rik van Riel
@ 2004-02-27 17:32   ` Andrea Arcangeli
  2004-02-27 19:08     ` Rik van Riel
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-27 17:32 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel

On Thu, Feb 26, 2004 at 11:38:20PM -0500, Rik van Riel wrote:
> On Fri, 27 Feb 2004, Andrea Arcangeli wrote:
> 
> > 	becomes freeable/swappable. Some lib function in the patch is taken
> > 	from the objrmap patch for 2.6 in the mbligh tree implemented by IBM
> > 	(thanks to Martin and IBM for maintaining that patch uptodate for 2.6,
> > 	that is a must-have starting point for the 2.6 VM too).  The original
> > 	idea of using objrmap for the vm unmapping procedure is from David
> > 	Miller (objrmap itself has always existed in every linux kernel out
> 
> Good to hear that you're finally convinced that some form
> of reverse mapping is needed.

I've been convinced of that since April 2003 (and it wasn't an April
1st joke when I posted this ;)

http://groups.google.it/groups?q=g:thl1757664681d&dq=&hl=it&lr=&ie=UTF-8&selm=20030405025008%2463d0%40gated-at.bofh.it&rnum=19

quote "Indeed. objrmap is the only way to avoid the big rmap waste.
Infact I'm not even convinced about the hybrid approch, rmap should be
avoided even for the anon pages. And the swap cpu doesn't matter, as far
as we can reach pagteables in linear time that's fine, doesn't matter
how many fixed cycles it takes. Only the complexity factor matters, and
objrmap takes care of it just fine."

It's not like I woke up yesterday with the idea of changing it; it's
just that nobody listened to my argument for almost a year, so now
we're stuck with a 2.6 kernel that has no way to run on the high end
(i.e. >200 tasks with 2.7G of shm mapped each on a 32G box; with 2.4-aa
I can reach ~6k tasks with 2.7G mapped each).

Then Andrew pointed out that there are complexity issues that objrmap
can't handle, but I'm not concerned about the complexity issue of
objrmap since no real app will run into it, and it's mostly a red
herring since you can already trigger the same complexity issues with
truncate.

The thing I've been against for years, and that I'm still against, is
"rmap", not objrmap. "rmap" is what prevents 2.6 from being able to
handle high end workloads. Not even 4:4 can hide the rmap overhead:
even ignoring the slowdown introduced by 4:4, you still lock up at only
4 times more mapped address space, and 4 times more address space than
what 2.6 can do now is still a tiny fraction of what my current 2.4-aa
can map with objrmap.

rmap has nothing to do with objrmap. You know objrmap is available in
every linux kernel out there (2.0 probably had objrmap too); this is
why it's zero cost for linux to use objrmap: we just start using it for
the paging mechanism for the first time, instead of building a new,
redundant, extremely zone-normal-costly infrastructure (i.e. rmap). I
also don't buy the 64bit argument: the waste is there, it's just not a
showstopper blocker on 64bit, but the fact that it's a blocker for
32bit archs is a good thing, since it forces us to optimize the 64bit
archs too.

As for remap_file_pages I see two ways:

1) implicit mlock and allow it only to root (i.e. require the lock
   capability) or under your sysctl that enables mlock for everybody;
   this is the simple lazy way, and I think this is what you're doing
   in 2.4 too
2) use a pagetable walk on every vma marked VM_NONLINEAR queued into the
   address space; we need that pagetable walk anyway to make truncate
   perfect

To avoid altering the API, probably the first remap_file_pages should
set VM_NONLINEAR on the vma (maybe it's already doing that, I didn't
check).

So I think 2 is the best; sure, one can argue it will waste cpu, but
this is just a correctness thing, it doesn't need to be fast at all,
and I prefer to optimize for the fast path.
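
Roughly, option 2 would look like this (just a sketch of the proposal,
not existing code: i_mmap_nonlinear is an assumed list collecting the
VM_NONLINEAR vmas of a mapping, it doesn't exist in mainline today, and
the i_shared locking is omitted):

static void zap_nonlinear_vmas(struct address_space *mapping)
{
	struct vm_area_struct *vma;

	/*
	 * Offsets inside a VM_NONLINEAR vma are scrambled, so the linear
	 * offset math of objrmap can't find the ptes.  Fall back to
	 * walking the whole vma's pagetables: slow, but only the rare
	 * remap_file_pages() users pay for it, and truncate correctness
	 * needs this walk anyway.
	 */
	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared)
		zap_page_range(vma, vma->vm_start,
			       vma->vm_end - vma->vm_start);
}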

> I agree with you that object based rmap may well be better
> for 2.6, if you want to look into that I wouldn't mind at
> all.  Especially if we can keep akpm's and Nick's nice VM
> balancing intact ...

Glad we both like it.

So my current plan is to do objrmap for all file mappings first (this
is a showstopper blocker issue, or 2.6 will simply lock up immediately
on the high end; already verified, just to be sure I wasn't missing
something in the code), then convert remap_file_pages to do the
pagetable walk instead of relying on rmap, and then I can go further
and add a dummy inode for anonymous mappings too during COW, like DaveM
did originally. Only then can I remove rmap entirely. This last step is
somewhat lower prio.

> > 	The rest is infinite swapping and machine total hung, since those 4.8G
> > 	of swapcache are now freeable and dirty, and the kernel will not
> > 	notice the freeable and clean swapcache generated by this 64M
> > 	swapout, since it's being queued at the opposite side of the lru
> 
> An obvious solution for this is the O(1) VM stuff that Arjan
> wrote and I integrated into rmap 15.  Should be worth looking
> into this for the 2.6 kernel ...
> 
> Basically it keeps the just-written pages near the end of the
> LRU, so they're easily found and freed, before the kernel even
> starts thinking about submitting the other gigabytes of dirty
> data for writeout.
> 
> > 	efficient swapping because it was taking a long time before the
> > 	vm could notice the clean swapcache, after it started the I/O on
> > 	it.
> 
> ... Arjan's O(1) VM stuff ;)
> 
> > 	It was pretty clear after that, that we've to prioritize and to
> > 	prefer discarding memory that is zerocost to collect, than to do
> > 	extremely expensive things to release free memory instead.
> 
> I'm not convinced.  If we need to free up 10MB of memory, we just

If you're not convinced, I assume Arjan's O(1) VM stuff is not doing
that, which means Arjan's O(1) VM stuff has little to do with point 3.

Point 1 is objrmap; you have rmap instead.

Point 2 is "start I/O from shm_writepage" and you definitely want that
too. O(1) VM can work around the lack of "start I/O from
shm_writepage", but it's the wrong workaround for that: I/O must be
started immediately from shm_writepage, there's no point in delaying
it. When writepage is called the I/O must be started, and waiting for
another small pass of the o1 vm is wasted cpu. As I wrote, this was a
trivial two-liner that you can easily merge, and it's orthogonal to all
the other issues. I see the o1 vm may be hiding this stupidity of
shm_writepage for you, but you would get a benefit from the proper fix.

The only similarity between my stuff and Arjan's O(1) VM stuff is in
point 3, but only for the "searching of the clean swapcache" part.
Since you're not convinced about the above, it means you're missing the
"clean cache is a memleak without the latest -aa stuff" point.

My stuff takes care of both issues at the same time; clearly Arjan's
O(1) VM stuff could be stacked on top of mine if I wanted, to reach the
swapped-out, now-clean swapcache more quickly. This is something I
don't want to do, because I think such an O(1) VM is not O(1) at all:
in fact it may end up wasting a lot more cpu depending on the
allocation frequency and the disk speed, since you have no guarantee
that when you run into the page again you will find it unlocked, so it
may still be under I/O, and it may actually waste cpu instead of saving
it.

Overall Arjan's O(1) VM stuff can't help in the workload I was dealing
with, since that workload is all about freeing clean cache first and
it's not doing that. Freeing clean swapcache is important too, but it's
not as important as avoiding I/O when we can.

> shouldn't do much more than 10MB of IO.  Doing just that should be
> cheap enough, after all.
> 
> The problem is when you do two orders of magnitude more writes than
> the amount of memory you need to free.  Trying to do zero IO probably
> isn't quite needed ...

On small machines the current 2.4 stock algorithm works just fine too;
it's only when the lru has a million pages queued that, without my new
vm algorithm, you'll do a million swapouts before freeing the
memleak^Wcache.

All I care about is avoiding the I/O if I can, and that's the only
thing my patch is doing. This is about life or death of the machine,
not a fast/slow issue ;).

> > 	vm is an order of magnitude worse in the high end. So the fix I
> > 	implemented is to run a inactive_list/vm_cache_scan_ratio pass
> > 	on the clean immediatly freeable cache in the inactive list
> 
> Should work ok for a while, until you completely run out of
> clean pages and then you might run into a wall ... unless you
> implement smarter cleaning & freeing like Arjan's stuff does.

I understand the smart cleaning & freeing; as said above it's not
obvious because of the cpu waste it risks generating, and with "put it
near the end of the lru" one has to define "near", which is a mess. I'm
not saying the o1 vm will necessarily waste cpu, I'm just saying it's
non-obvious and it's in no way similar or equivalent to my code.

> Then again, your stuff will also find pages the moment they're
> cleaned, just at the cost of a (little?) bit more CPU time.

Exactly, that's an important effect of my patch, and it's the only
thing the o1 vm takes care of. I don't think it's enough, since the
gigs of cache would still behave like a memleak without my code.

> Shouldn't be too critical, unless you've got more than maybe
> a hundred GB of memory, which should be a year off.

I think these effects start to be visible over 8G; the worst thing is
that you can have 4G of swapcache in a row, while on smaller systems
the lru tends to be more intermixed.

> > 	A better fix would be to have an anchor in the lru (can be a per-lru
> > 	page_t with a PG_anchor set) and to avoid the clean-cache search to
> > 	alter the point where we keep swapping with writepage, but it
> > 	shouldn't matter that much and 2.4 being obsolete isn't very
> > 	worthwhile to make it even better.
> 
> Hey, that's Arjan's stuff ;)   Want to help get that into 2.6 ? ;)

I think you mean he's also using an anchor in the lru in the same way I
proposed here, but I doubt he's using it nearly the way I would; there
seems to be a fundamental difference between the two algorithms, with
mine partly covering the work done by his, and not the other way
around.


* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 17:32   ` Andrea Arcangeli
@ 2004-02-27 19:08     ` Rik van Riel
  2004-02-27 20:29       ` Andrew Morton
                         ` (2 more replies)
  0 siblings, 3 replies; 100+ messages in thread
From: Rik van Riel @ 2004-02-27 19:08 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, Andrew Morton

First, let me start with one simple request.  Whatever you do,
please send changes upstream in small, manageable chunks so we
can merge your improvements without destabilising the kernel.

We should avoid the kind of disaster we had around 2.4.10...

On Fri, 27 Feb 2004, Andrea Arcangeli wrote:

> Then Andrew pointed out that there are complexity issues that objrmap
> can't handle but I'm not concerned about the complexity issue of objrmap
> since no real app will run into it

We've heard the "no real app runs into it" argument before,
about various other subjects.  I remember using it myself,
too, and every single time I used the "no real apps run into
it" argument I turned out to be wrong in the end.

> So my current plan is to do objrmap for all file mappings first

If this can be integrated cleanly without too many bad corner
cases, sure ...

> then convert remap_file_pages to do the pagetable walk instead of
> relying on rmap,

I'm not convinced, though that could also be because I'm not
sure exactly what you're planning.  I'll start arguing for
or against your changes here once I know exactly what they'll 
look like ;)

> then I can go further and add a dummy inode for anonymous mappings too
> during COW like DaveM did originally. Only then I can remove rmap
> enterely. This last step is somewhat lower prio.

Moving to a full objrmap from the current pte-rmap could well
be a good thing from a code cleanliness perspective.

I'm not particularly attached to rmap.c and won't be opposed
to a replacement, provided that the replacement is also more
or less modular with the VM so plugging in an even more
improved version in the future will be easy ;)

> in small machines the current 2.4 stock algo works just fine too, it's
> only when the lru has the million pages queued that without my new vm
> algo you'll do million swapouts before freeing the memleak^Wcache.

Same for Arjan's O(1) VM.  For machines in the single and low
double digit number of gigabytes of memory either would work
similarly well ...

> > Then again, your stuff will also find pages the moment they're
> > cleaned, just at the cost of a (little?) bit more CPU time.
> 
> exactly, that's an important effect of my patch and that's the only
> thing that o1 vm is taking care of, I don't think it's enough since the
> gigs of cache would still be like a memleak without my code.

... however, if you have a hundred gigabyte of memory, or
even more, then you cannot afford to search the inactive
list for clean pages on swapout. It will end up using too
much CPU time.

The FreeBSD people found this out the hard way, even on
smaller systems...

> > Shouldn't be too critical, unless you've got more than maybe
> > a hundred GB of memory, which should be a year off.
> 
> I think these effects starts to be visible over 8G, the worst thing is
> that you can have 4G in a row of swapcache, in smaller systems the
> lru tends to be more intermixed.

I've even seen the problem on small systems, where I used a
"smart" algorithm that freed the clean pages first and only
cleaned the dirty pages later.

On my 128 MB desktop system everything was smooth, until
the point where the cache was gone and the system suddenly
faced an inactive list entirely filled with dirty pages.

Because of this, we should do some (limited) pre-cleaning
of inactive pages. The key word here is "limited" ;)

> I think you mean he's using an anchor in the lru too in the same way I
> proposed here, but I doubt he's using it nearly as I would, there seems
> to be a fundamental difference in the two algorithms, with mine partly
> covering the work done by his, and not the other way around.

An anchor in the lru list is definitely needed. Some
companies want to run Linux on systems with 256 GB or
more memory.  In those systems the amount of CPU time
used to search the inactive list will become a problem,
unless we use a smartly placed anchor.

Note that I wouldn't want to use the current O(1) VM
code on such a system, because the placement of the
anchor isn't quite smart enough ...

Let's try combining your ideas and Arjan's ideas into
something that fixes all these problems.

kind regards,

Rik
-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 19:08     ` Rik van Riel
@ 2004-02-27 20:29       ` Andrew Morton
  2004-02-27 20:49         ` Rik van Riel
                           ` (3 more replies)
  2004-02-27 20:31       ` Andrea Arcangeli
  2004-02-29  6:34       ` Mike Fedyk
  2 siblings, 4 replies; 100+ messages in thread
From: Andrew Morton @ 2004-02-27 20:29 UTC (permalink / raw)
  To: Rik van Riel; +Cc: andrea, linux-kernel

Rik van Riel <riel@redhat.com> wrote:
>
> First, let me start with one simple request.  Whatever you do,
> please send changes upstream in small, manageable chunks so we
> can merge your improvements without destabilising the kernel.
> 
> We should avoid the kind of disaster we had around 2.4.10...

We need to understand that right now, 2.6.x is 2.7-pre.  Once 2.7 forks off
we are more at liberty to merge nasty highmem hacks which will die when 2.6
is end-of-lined.

I plan to merge the 4g split immediately after 2.7 forks.  I wouldn't be
averse to objrmap for file-backed mappings either - I agree that the search
problems which were demonstrated are unlikely to bite in real life.

But first someone would need to demonstrate that pte_chains+4g/4g are
for some reason unacceptable for some real-world setup.

Apart from the search problem, my main gripe with objrmap is that it
creates different handling for file-backed and anonymous memory.  And the
code which extends it to anonymous memory is complex and large.  One ends
up needing to seriously ask oneself what is being gained from it all.

> 
> We've heard the "no real app runs into it" argument before,
> about various other subjects.  I remember using it myself,
> too, and every single time I used the "no real apps run into
> it" argument I turned out to be wrong in the end.
> 

heh.

> I'm not particularly attached to rmap.c and won't be opposed
> to a replacement, provided that the replacement is also more
> or less modular with the VM so plugging in an even more
> improved version in the future wil be easy ;)

Sure, let's see what it looks like.  Even if it is nearly two years late.

Oh, and can we please have testcases?  It's all very well to assert "it
sucks doing X and I fixed it" but it's a lot more useful if one can
distribute testcases as well so others can evaluate the fix and can explore
alternative solutions.

Andrea, this shmem problem is a case in point, please.

> > in small machines the current 2.4 stock algo works just fine too, it's
> > only when the lru has the million pages queued that without my new vm
> > algo you'll do million swapouts before freeing the memleak^Wcache.
> 
> Same for Arjan's O(1) VM.  For machines in the single and low
> double digit number of gigabytes of memory either would work
> similarly well ...

Case in point.  We went round the O(1) page reclaim loop a year ago and I
was never able to obtain a testcase which demonstrated the problem on 2.4,
let alone on 2.6.

I had previously found some workloads in which the 2.4 VM collapsed for
similar reasons and those were fixed with the rotate_reclaimable_page()
logic.  Without testcases we will not be able to verify that anything else
needs doing.
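
For reference, the rotate_reclaimable_page() logic boils down to
something like this (a simplified sketch, not the exact mainline code):

/* called when writeback I/O on @page completes */
static void rotate_to_reclaimable(struct page *page)
{
	struct zone *zone = page_zone(page);
	unsigned long flags;

	spin_lock_irqsave(&zone->lru_lock, flags);
	if (PageLRU(page) && !PageActive(page)) {
		/*
		 * The page is clean now: move it to the tail of the
		 * inactive list so the next reclaim pass finds it at once
		 * instead of after rescanning the whole lru.
		 */
		list_del(&page->lru);
		list_add_tail(&page->lru, &zone->inactive_list);
	}
	spin_unlock_irqrestore(&zone->lru_lock, flags);
}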

> ... however, if you have a hundred gigabyte of memory, or
> even more, then you cannot afford to search the inactive
> list for clean pages on swapout. It will end up using too
> much CPU time.
> 
> The FreeBSD people found this out the hard way, even on
> smaller systems...

Did they have a testcase?

> On my 128 MB desktop system everything was smooth, until
> the point where the cache was gone and the system suddenly
> faced an inactive list entirely filled with dirty pages.
> 
> Because of this, we should do some (limited) pre-cleaning
> of inactive pages. The key word here is "limited" ;)

Current 2.6 will write out nr_inactive>>DEF_PRIORITY pages, will then
throttle behind I/O and then will start reclaiming clean pages from the
tail of the LRU which were moved there at interrupt time.

> An anchor in the lru list is definately needed.

Maybe not.  Testcase, please ;)

> Lets try combining your ideas and Arjan's ideas into
> something that fixes all these problems.

Did I mention testcases?



* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 19:08     ` Rik van Riel
  2004-02-27 20:29       ` Andrew Morton
@ 2004-02-27 20:31       ` Andrea Arcangeli
  2004-02-29  6:34       ` Mike Fedyk
  2 siblings, 0 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-27 20:31 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, Andrew Morton

On Fri, Feb 27, 2004 at 02:08:28PM -0500, Rik van Riel wrote:
> First, let me start with one simple request.  Whatever you do,
> please send changes upstream in small, manageable chunks so we
> can merge your improvements without destabilising the kernel.
> 
> We should avoid the kind of disaster we had around 2.4.10...

2.4.10 was a great success, the opposite of a disaster.

> On Fri, 27 Feb 2004, Andrea Arcangeli wrote:
> 
> > Then Andrew pointed out that there are complexity issues that objrmap
> > can't handle but I'm not concerned about the complexity issue of objrmap
> > since no real app will run into it
> 
> We've heard the "no real app runs into it" argument before,
> about various other subjects.  I remember using it myself,
> too, and every single time I used the "no real apps run into
> it" argument I turned out to be wrong in the end.

If this is an issue, then truncate() may be running into it already. So
if you're worried about the vm usage, you should worry about the
truncate usage first.

> > then I can go further and add a dummy inode for anonymous mappings too
> > during COW like DaveM did originally. Only then I can remove rmap
> > enterely. This last step is somewhat lower prio.
> 
> Moving to a full objrmap from the current pte-rmap could well
> be a good thing from a code cleanliness perspective.
> 
> I'm not particularly attached to rmap.c and won't be opposed
> to a replacement, provided that the replacement is also more
> or less modular with the VM so plugging in an even more
> improved version in the future wil be easy ;)

objrmap itself should be self-contained; the patch from Martin's tree
is quite small too.

> > in small machines the current 2.4 stock algo works just fine too, it's
> > only when the lru has the million pages queued that without my new vm
> > algo you'll do million swapouts before freeing the memleak^Wcache.
> 
> Same for Arjan's O(1) VM.  For machines in the single and low
> double digit number of gigabytes of memory either would work
> similarly well ...

I don't think they would work equally well. Btw, the vmstat I posted
was from a 16G machine, where the copy was stuck because of the
inability to find 2G of clean cache.

> > > Then again, your stuff will also find pages the moment they're
> > > cleaned, just at the cost of a (little?) bit more CPU time.
> > 
> > exactly, that's an important effect of my patch and that's the only
> > thing that o1 vm is taking care of, I don't think it's enough since the
> > gigs of cache would still be like a memleak without my code.
> 
> ... however, if you have a hundred gigabyte of memory, or
> even more, then you cannot afford to search the inactive
> list for clean pages on swapout. It will end up using too
> much CPU time.

I'm using various techniques so that it doesn't scan a million pages in
one go, and obviously I must start swapping before the very last bit of
clean cache is recycled. What I outlined is the concept, that is to
"prioritize clean cache"; "prioritize" doesn't mean "all and only clean
cache first".

But it's true that I throw cpu at the work; there's no other way
without more invasive changes, and the cpu load is not significant
during swapping anyway, so it's not urgent to improve the vm further in
2.4. Just using an anchor to separate the clean scan from the dirty
scan would improve things, but that too is low priority.

> The FreeBSD people found this out the hard way, even on
> smaller systems...

The last thing I would do is take examples from freebsd or other unixes
(and not only for legal reasons).

> > > Shouldn't be too critical, unless you've got more than maybe
> > > a hundred GB of memory, which should be a year off.
> > 
> > I think these effects starts to be visible over 8G, the worst thing is
> > that you can have 4G in a row of swapcache, in smaller systems the
> > lru tends to be more intermixed.
> 
> I've even seen the problem on small systems, where I used a
> "smart" algorithm that freed the clean pages first and only
> cleaned the dirty pages later.
> 
> On my 128 MB desktop system everything was smooth, until
> the point where the cache was gone and the system suddenly
> faced an inactive list entirely filled with dirty pages.
> 
> Because of this, we should do some (limited) pre-cleaning
> of inactive pages. The key word here is "limited" ;)

Correct, this is why it's not a full scan; I provide a sysctl to tune
it, called vm_cache_scan_ratio, as I wrote in the original email. If it
doesn't swap enough, increasing the sysctl will make it swap more.

> > I think you mean he's using an anchor in the lru too in the same way I
> > proposed here, but I doubt he's using it nearly as I would, there seems
> > to be a fundamental difference in the two algorithms, with mine partly
> > covering the work done by his, and not the other way around.
> 
> An anchor in the lru list is definately needed. Some
> companies want to run Linux on systems with 256 GB or
> more memory.  In those systems the amount of CPU time
> used to search the inactive list will become a problem,
> unless we use a smartly placed anchor.
> 
> Note that I wouldn't want to use the current O(1) VM
> code on such a system, because the placement of the
> anchor isn't quite smart enough ...

This was my point about the o1 vm: getting the placement of the anchor
right is very non-obvious, while the idea itself sounds fine.

> 
> Lets try combining your ideas and Arjan's ideas into
> something that fixes all these problems.

So here you agree they're different things. I'm not sure my idea is the
best for the long run either, but it's certainly needed in 2.4 to
handle such a load, and an equivalent solution (the o1 vm is not enough
IMO) will be needed in 2.6 as well. The basic idea behind my patch may
be the right one for the long term, though. However this is all low
prio for 2.6 at the moment, and I didn't even list this part in the
roadmap, because first I need to avoid the rmap lockup before the
machine starts swapping; then I can think about this. (As long as I
keep swapoff -a, this is not an issue.)


* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 20:29       ` Andrew Morton
@ 2004-02-27 20:49         ` Rik van Riel
  2004-02-27 20:55           ` Andrew Morton
                             ` (2 more replies)
  2004-02-27 21:15         ` Andrea Arcangeli
                           ` (2 subsequent siblings)
  3 siblings, 3 replies; 100+ messages in thread
From: Rik van Riel @ 2004-02-27 20:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, linux-kernel

On Fri, 27 Feb 2004, Andrew Morton wrote:

> But first someone would need to demonstrate that pte_chains+4g/4g are
> for some reason unacceptable for some real-world setup.

Agreed.  The current 2.6 VM is well tuned already so
we should be extremely cautious not to upset it.

> > Same for Arjan's O(1) VM.  For machines in the single and low
> > double digit number of gigabytes of memory either would work
> > similarly well ...
> 
> I had previously found some workloads in which the 2.4 VM collapsed for
> similar reasons and those were fixed with the rotate_reclaimable_page()
> logic.  Without testcases we will not be able to verify that anything else
> needs doing.

Duh, I forgot all about the rotate_reclaimable_page() stuff.
That may well fix all problems 2.6 would have otherwise had
in this area.

I really hope we won't need anything like the O(1) VM stuff
in 2.6, since that would leave me more time to work on other
cool stuff (like resource management ;)).

> > Because of this, we should do some (limited) pre-cleaning
> > of inactive pages. The key word here is "limited" ;)
> 
> Current 2.6 will write out nr_inactive>>DEF_PRIORITY pages,

That may be a bit much on extremely huge systems, but that should
require no more than a little tweaking to fix.  Certainly no code
changes should be needed ...

> will then throttle behind I/O and then will start reclaiming clean pages
> from the tail of the LRU which were moved there at interrupt time.

That may well be much better than either the O(1) VM stuff or the
stuff Andrea proposed...

Forget about me proposing the O(1) VM stuff ;)

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 20:49         ` Rik van Riel
@ 2004-02-27 20:55           ` Andrew Morton
  2004-02-27 21:28           ` Andrea Arcangeli
  2004-03-01 11:10           ` Nikita Danilov
  2 siblings, 0 replies; 100+ messages in thread
From: Andrew Morton @ 2004-02-27 20:55 UTC (permalink / raw)
  To: Rik van Riel; +Cc: andrea, linux-kernel

Rik van Riel <riel@redhat.com> wrote:
>
> > Current 2.6 will write out nr_inactive>>DEF_PRIORITY pages,
> 
>  That may be a bit much on extremely huge systems, but that should
>  require no more than a little tweaking to fix.  Certainly no code
>  changes should be needed ...

hmm, with 4 million pages on the inactive list that's 1000 pages.  It might
be OK.

Bear in mind that under usual circumstances the direct-reclaim path will
refuse to block on request queue exhaustion so we might end up just
scanning past some dirty pages without starting I/O against them at all. 
End result: some jumbling up of the LRU order.  I suspect that's a
second-order problem though.  But hey, if we have a testcase, we can fix it!


* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 20:29       ` Andrew Morton
  2004-02-27 20:49         ` Rik van Riel
@ 2004-02-27 21:15         ` Andrea Arcangeli
  2004-02-27 22:03           ` Martin J. Bligh
  2004-02-27 21:42         ` Hugh Dickins
  2004-02-27 23:18         ` Marcelo Tosatti
  3 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-27 21:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, linux-kernel

On Fri, Feb 27, 2004 at 12:29:36PM -0800, Andrew Morton wrote:
> Rik van Riel <riel@redhat.com> wrote:
> >
> > First, let me start with one simple request.  Whatever you do,
> > please send changes upstream in small, manageable chunks so we
> > can merge your improvements without destabilising the kernel.
> > 
> > We should avoid the kind of disaster we had around 2.4.10...
> 
> We need to understand that right now, 2.6.x is 2.7-pre.  Once 2.7 forks off
> we are more at liberty to merge nasty highmem hacks which will die when 2.6
> is end-of-lined.
> 
> I plan to merge the 4g split immediately after 2.7 forks.  I wouldn't be

note that the 4:4 split is wrong in 99% of the cases where people need
64G. I'm advocating strongly for the 2:2 split to everybody I talk to;
I'm trying to spread the 2:2 idea because IMHO it's an order of
magnitude simpler and an order of magnitude superior. Unfortunately I
could not get a single number to back my 2:2 claims, since the 4:4
buzzword is spreading and people only test with 4:4, so it's pretty
hard for me to spread the 2:2 buzzword.

4:4 makes no sense at all; the only advantage of 4:4 w.r.t. 2:2 is that
they can map 2.7G of shm per task instead of 1.7G per task. Oh, they
also have 1G more of normal zone, but that's useless since 32G with 3:1
works perfectly and more zone-normal won't help at all (and if you
leave rmap in, the kernel will lock up no matter whether it's 4:4 or
3:1 or 2:2 or 1:3). But the one more gig they can map per task will
give them nothing, since they flush the tlb on every syscall and every
irq. So it's utterly stupid to map one more gig per task with the end
result that you have to switch_mm on every syscall and irq. I expect
the databases will run an order of magnitude faster with _2:2_ in a 64G
configuration, with _1.7G_ of shm mapped per process, than with their
4:4 split with 2.7G (or more, up to 3.9 ;) mapped per task.

I don't mind if 4:4 gets merged, but I recommend db vendors benchmark
_2:2_ against 4:4 before remotely considering deploying 4:4 in
production. Then of course let me know, since I haven't had the luck to
get any numbers back and I have no access to any 64G box.

I don't care about 256G with 2:2 split, since intel and hp are now going
x86-64 too.

going past 32G, bigpages make a huge difference, not just for the pte
memory overhead but for the tlb caching; this makes me very comfortable
claiming that 2:2 will pay off big compared to 4:4.

> averse to objrmap for file-backed mappings either - I agree that the search
> problems which were demonstrated are unlikely to bite in real life.

cool.

Martin's patch from IBM is a great start IMHO. I found a bug in the vma
flags check though: VM_RESERVED should be checked too, not only
VM_LOCKED (unless I'm missing something), but it's a minor issue.

The other scary part is if the trylocking fails too often; it would be
nice to be able to spin instead of trylock, I would feel safer. In 2.4
I don't care since it's best-effort: I don't depend on the trylocking
succeeding to unmap pages, the original walk still runs and it spins.

> But first someone would need to demonstrate that pte_chains+4g/4g are
> for some reason unacceptable for some real-world setup.

with an rmap kernel the limit goes from 150 users with 3:1 to around
700 users with 4:4. In 2.4 I can handle ~6k users at full speed with
3:1. And the 4:4 slowdown is such a big order of magnitude that I
believe it's crazy to use 4:4 even on a 64G box, where I advocate 2:2
instead. And if you leave rmap in, 4:4 will be needed even on an 8G box
(not only on 64G boxes) to get past 700 users.

> Apart from the search problem, my main gripe with objrmap is that it
> creates different handling for file-backed and anonymous memory.  And the
> code which extends it to anonymous memory is complex and large.  One ends
> up needing to seriously ask oneself what is being gained from it all.

I don't have a definitive answer, but trying to use objrmap for anon
too is my goal. It's not clear whether it's worth it or not, though.
But this is lower prio.

> > We've heard the "no real app runs into it" argument before,
> > about various other subjects.  I remember using it myself,
> > too, and every single time I used the "no real apps run into
> > it" argument I turned out to be wrong in the end.
> > 
> 
> heh.

my answer to this is that truncate() may already be running into weird
apps. Sure, the vm has a higher probability since truncate of mapped
files isn't too frequent, but if you really expect bad luck we already
have a window open for that bad luck ;) I try to be an optimist ;).
Let's say I know that at least the most important apps won't run into
this. Currently the most important apps will lock up. So I don't have
much choice.

> Oh, and can we please have testcases?  It's all very well to assert "it
> sucks doing X and I fixed it" but it's a lot more useful if one can
> distrubute testcases as well so others can evaluate the fix and can explore
> alternative solutions.
> 
> Andrea, this shmem problem is a case in point, please.

I don't have the real life testcase myself (I lack both the software
and the hardware to reproduce it, and it's not easy to run the thing
either), but I think it's possible to test it as we move to 2.6. At the
moment it's pointless to try due to rmap, but as soon as it's in good
shape and the math gives the ok I will try to get the stuff tested in
practice (at the moment I only verified that rmap is a showstopper, as
the math says, but just to be sure).

We can write a testcase ourselves, it's pretty easy: just create a 2.7G
file in /dev/shm, mmap(MAP_SHARED) it from 1k processes and fault in
all the pagetables from all tasks touching the shm vma. Then run a
second copy until the machine starts swapping and see how things go. To
do this you probably need 8G; this is why I didn't write the testcase
myself yet ;). Maybe I can simulate it with less shm and fewer tasks on
1G boxes too, but the extreme lru effects of point 3 won't be visible
there: the very same software configuration works fine on 1/2G boxes on
stock 2.4. The problems show up when the lru grows, because the
algorithm doesn't contemplate a million dirty swapcache pages in a row
at the end of the lru and some gigs of free cache at the head of the
lru. The rmap-only issues can also be tested with math, no testcase is
needed for that.
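
Something along these lines, for instance (untested sketch; the shm
size in megabytes and the number of tasks are arguments, to be scaled
to the box):

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* defaults: 2700M of shm mapped by 100 tasks */
	unsigned long long size, off;
	int tasks = argc > 2 ? atoi(argv[2]) : 100;
	char *map;
	int fd, i;

	size = (argc > 1 ? strtoull(argv[1], NULL, 0) : 2700ULL) << 20;

	fd = open("/dev/shm/bigshm", O_RDWR | O_CREAT, 0600);
	if (fd < 0 || ftruncate(fd, size) < 0) {
		perror("shm file");
		return 1;
	}

	for (i = 0; i < tasks; i++) {
		if (fork())
			continue;
		/* child: map the whole file, fault in every pagetable */
		map = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		if (map == MAP_FAILED)
			_exit(1);
		for (off = 0; off < size; off += 4096)
			map[off]++;	/* dirty every page */
		pause();		/* keep the mapping alive */
		_exit(0);
	}
	/* the parent just waits: kill everything to drop the mappings */
	while (wait(NULL) > 0)
		;
	return 0;
}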

> Maybe not.  Testcase, please ;)

I think a more efficient algorithm to achieve the o1 vm objective
(which is only one of the issues that my point 3 solves) could be to
have a separate lru (not an anchor) protected by an irq spinlock (a
spinlock because it will be accessed by the I/O completion routine)
that we check once per second and not more often, so the variable
becomes "time" instead of "frequency of allocations", since swapout
disk I/O is going to depend mostly on fixed time rather than on the
allocation rate, which isn't really fixed. This way we know we won't
throw too huge an amount of cpu at locked pages, and it avoids the
anchor and in turn the non-obvious anchor placement problem. However
the coding of this would be complex, though maybe not more complex than
the o1 vm code.

The use I wanted to make of an anchor in 2.4 is completely different
from the use the o1 vm makes of it, IMHO.

thanks for the help!


* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 20:49         ` Rik van Riel
  2004-02-27 20:55           ` Andrew Morton
@ 2004-02-27 21:28           ` Andrea Arcangeli
  2004-02-27 21:37             ` Andrea Arcangeli
  2004-02-28  3:22             ` Andrea Arcangeli
  2004-03-01 11:10           ` Nikita Danilov
  2 siblings, 2 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-27 21:28 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel

On Fri, Feb 27, 2004 at 03:49:03PM -0500, Rik van Riel wrote:
> On Fri, 27 Feb 2004, Andrew Morton wrote:
> 
> > But first someone would need to demonstrate that pte_chains+4g/4g are
> > for some reason unacceptable for some real-world setup.
> 
> Agreed.  The current 2.6 VM is well tuned already so
> we should be extremely cautious not to upset it.

this is very easy:

	2.7*1024*1024*1024/4096*8*700/1024/1024 = 3780M

at ~700 tasks 4:4 will lock up with rmap.

with 3:1 and rmap the limit is down to around 150 users.

that's nothing, way too low; a normal regression test on 12G uses >1k
tasks.
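
spelled out, this is the back of the envelope behind that number (a minimal
sketch; the 8 bytes of pte_chain overhead per mapped page per task is the
assumption in the formula above):

#include <stdio.h>

int main(void)
{
  double shm_per_task = 2.7 * 1024 * 1024 * 1024;	/* mapped shm per task */
  double page_size = 4096;
  double chain_bytes = 8;	/* rmap metadata per page per mapping */
  double tasks = 700;

  /* prints ~3780 MB of pte_chains, the number above */
  printf("%.0f MB\n", shm_per_task / page_size * chain_bytes * tasks
         / 1024 / 1024);
  return 0;
}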

Regardless, the 4:4 buzzword sounds like a terribly wrong idea to me for
99% of its proposed 64G usages (and with rmap, 4:4 is needed even for an
8G box, not just for the 64G box, which sounds unacceptable).

My 2:2 proposal sounds like it has a lot more technical potential than the
4:4. I know for sure that if it was me owning that 64G hardware and running
that big software, I would first of all try 2:2 with 1.7G per process and
then compare with 4:4. I would like to get number comparisons for my 2:2
buzzword too but I have failed so far (the first time I asked was August
2003 and it was for 2.4, where 2:2 is easy too). Some resources will have
to be allocated soon to test my 2:2 idea; if it turns out to do as well as I
expect compared to 4:4, I won't personally need to deal with the 4:4
2.0 design (yes, a 2.0 design), so I try to be optimistic for this too ;).
If I'm wrong, then we may be forced to allow a special 4:4 option to use
on the 64G boxes. I wanted to do page clustering but there are too many
other things to do first, so it may be too late for 2.6 for the page
clustering (for mainline it's pretty much obviously too late, I was
thinking of 2.6-aa here).

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 21:28           ` Andrea Arcangeli
@ 2004-02-27 21:37             ` Andrea Arcangeli
  2004-02-28  3:22             ` Andrea Arcangeli
  1 sibling, 0 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-27 21:37 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel

On Fri, Feb 27, 2004 at 10:28:44PM +0100, Andrea Arcangeli wrote:
> expect compared to 4:4, I won't personally need to deal with the 4:4

<joke>
and btw, if I have the luck of not having to deal with the 4:4 2.0
kernel slowdown, it's also because AMD is effectively saving the souls of
the vm hackers ;)
</joke>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 20:29       ` Andrew Morton
  2004-02-27 20:49         ` Rik van Riel
  2004-02-27 21:15         ` Andrea Arcangeli
@ 2004-02-27 21:42         ` Hugh Dickins
  2004-02-27 23:18         ` Marcelo Tosatti
  3 siblings, 0 replies; 100+ messages in thread
From: Hugh Dickins @ 2004-02-27 21:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, andrea, linux-kernel

On Fri, 27 Feb 2004, Andrew Morton wrote:
> 
> Apart from the search problem, my main gripe with objrmap is that it
> creates different handling for file-backed and anonymous memory.  And the
> code which extends it to anonymous memory is complex and large.

I challenge that: anobjrmap ventured into more files than you wanted to
change at the time, but it was not complex, and removed more than it added.

Hugh


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 21:15         ` Andrea Arcangeli
@ 2004-02-27 22:03           ` Martin J. Bligh
  2004-02-27 22:23             ` Andrew Morton
  2004-02-28  2:32             ` Andrea Arcangeli
  0 siblings, 2 replies; 100+ messages in thread
From: Martin J. Bligh @ 2004-02-27 22:03 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton; +Cc: Rik van Riel, linux-kernel

> note that the 4:4 split is wrong in 99% of the cases where people need 64
> gigs. I'm advocating strongly for the 2:2 split to everybody I talk
> with, I'm trying to spread the 2:2 idea because IMHO it's an order of
> magnitude simpler and an order of magnitude superior. Unfortunately I
> could not get a single number to back my 2:2 claims, since the 4:4
> buzzword is spreading and people only test with 4:4. so it's pretty hard
> for me to spread the 2:2 buzzword.

For the record, I for one am not opposed to doing 2:2 instead of 4:4.
What pisses me off is people trying to squeeze large amounts of memory
into 3:1, and distros pretending it's supportable, when it's never 
stable across a broad spectrum of workloads. Between 2:2 and 4:4,
it's just a different overhead tradeoff.

> 4:4 makes no sense at all, the only advantage of 4:4 w.r.t. 2:2 is that
> they can map 2.7G per task of shm instead of 1.7G per task of shm.

Eh? You have a 2GB difference of user address space, and a 1GB difference
of shm size. You lost a GB somewhere ;-) Depending on whether you move
TASK_UNMAPPED_BASE or not, you might mean 2.7 vs 0.7 or at a pinch
3.5 vs 1.5, I'm not sure.

> syscall and irq. I expect the databases will run an order of magnitude
> faster with _2:2_ in a 64G configuration, with _1.7G_ per process of shm
> mapped, instead of their 4:4 split with 2.7G (or more, up to 3.9 ;)
> mapped per task.

That may well be true for some workloads, I suspect it's slower for others.
One could call the tradeoff either way.
 
> I don't mind if 4:4 gets merged but I recommend db vendors to benchmark
> _2:2_ against 4:4 before remotely considering deploying 4:4 in
> production.  Then of course let me know since I had not the luck to get
> any number back and I've no access to any 64G box.

If you send me a *simple* simulation test, I'll gladly run it for you ;-)
But I'm not going to go fiddle with Oracle, and thousands of disks ;-)

> I don't care about 256G with 2:2 split, since intel and hp are now going
> x86-64 too.

Yeah, I don't think we ever need to deal with that kind of insanity ;-)
 
>> averse to objrmap for file-backed mappings either - I agree that the search
>> problems which were demonstrated are unlikely to bite in real life.
> 
> cool.
> 
> Martin's patch from IBM is a great start IMHO. I found a bug in the vma
> flags check though, VM_RESERVED should be checked too, not only
> VM_LOCKED, unless I'm missing something, but it's a minor issue.

I didn't actually write it - that was Dave McCracken ;-) I just suggested
the partial approach (because I'm dirty and lazy ;-)) and carried it
in my tree.

I agree with Andrew's comments though - it's not nice having the dual
approach of the partial, but the complexity of the full approach is a
bit scary and buys you little in real terms (performance and space).
I still believe that creating an "address_space like structure" for
anon memory, shared across VMAs is an idea that might give us cleaner
code - it also fixes other problems like Andi's NUMA API binding.

> We can write a testcase ourselves, it's pretty easy: just create a 2.7G
> file in /dev/shm, and mmap(MAP_SHARED) it from 1k processes and fault in
> all the pagetables from all tasks touching the shm vma. Then run a
> second copy until the machine starts swapping and see how things go. To
> do this you probably need 8G, which is why I didn't write the testcase
> myself yet ;).  maybe I can simulate with less shm and fewer tasks on 1G
> boxes too, but the extreme lru effects of point 3 won't be visible
> there, the very same software configuration works fine on 1/2G boxes on
> stock 2.4. problems show up when the lru grows due to the algorithm not
> contemplating millions of dirty swapcache pages in a row at the end of the
> lru and some gigs of free cache at the head of the lru. the rmap-only issues
> can also be tested with math, no testcase is needed for that.

I don't have time to go write it at the moment, but I can certainly run it on large end hardware if that helps.

M.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 22:03           ` Martin J. Bligh
@ 2004-02-27 22:23             ` Andrew Morton
  2004-02-28  2:32             ` Andrea Arcangeli
  1 sibling, 0 replies; 100+ messages in thread
From: Andrew Morton @ 2004-02-27 22:23 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: andrea, riel, linux-kernel

"Martin J. Bligh" <mbligh@aracnet.com> wrote:
>
> > We can write a testcase ourselves, it's pretty easy: just create a 2.7G
> > file in /dev/shm, and mmap(MAP_SHARED) it from 1k processes and fault in
> > all the pagetables from all tasks touching the shm vma. Then run a
> > second copy until the machine starts swapping and see how things go. To
> > do this you probably need 8G, which is why I didn't write the testcase
> > myself yet ;).  maybe I can simulate with less shm and fewer tasks on 1G
> > boxes too, but the extreme lru effects of point 3 won't be visible
> > there, the very same software configuration works fine on 1/2G boxes on
> > stock 2.4. problems show up when the lru grows due to the algorithm not
> > contemplating millions of dirty swapcache pages in a row at the end of the
> > lru and some gigs of free cache at the head of the lru. the rmap-only issues
> > can also be tested with math, no testcase is needed for that.
> 
> I don't have time to go write it at the moment, but I can certainly run it on large end hardware if that helps.

I think just

	usemem -m 2700 -f test-file -r 10 -n 1000

will do it.  I need to verify that.

http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 23:18         ` Marcelo Tosatti
@ 2004-02-27 22:39           ` Andrew Morton
  0 siblings, 0 replies; 100+ messages in thread
From: Andrew Morton @ 2004-02-27 22:39 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: riel, andrea, linux-kernel

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> Andrew, are your testcases online somewhere?

Well the tools are in ext3 CVS (http://www.zip.com.au/~akpm/linux/ext3/)
but the issue is how to drive them to create a particular scenario.

I never wrote that down, but there's heaps and heaps of info in the
changelogs:

http://linux.bkbits.net:8080/linux-2.5/user=akpm/ChangeSet?nav=!-|index.html|stats|!+|index.html

22000 lines of stuff there, so `bk revtool' and a fast computer may be a
more convenient navigation system.

That's not particularly useful, sorry.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 20:29       ` Andrew Morton
                           ` (2 preceding siblings ...)
  2004-02-27 21:42         ` Hugh Dickins
@ 2004-02-27 23:18         ` Marcelo Tosatti
  2004-02-27 22:39           ` Andrew Morton
  3 siblings, 1 reply; 100+ messages in thread
From: Marcelo Tosatti @ 2004-02-27 23:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, andrea, linux-kernel



On Fri, 27 Feb 2004, Andrew Morton wrote:

> Oh, and can we please have testcases?  It's all very well to assert "it
> sucks doing X and I fixed it" but it's a lot more useful if one can
> distribute testcases as well so others can evaluate the fix and can explore
> alternative solutions.
>
> Andrea, this shmem problem is a case in point, please.
>
> > > in small machines the current 2.4 stock algo works just fine too, it's
> > > only when the lru has the million pages queued that without my new vm
> > > algo you'll do million swapouts before freeing the memleak^Wcache.
> >
> > Same for Arjan's O(1) VM.  For machines in the single and low
> > double digit number of gigabytes of memory either would work
> > similarly well ...
>
> Case in point.  We went round the O(1) page reclaim loop a year ago and I
> was never able to obtain a testcase which demonstrated the problem on 2.4,
> let alone on 2.6.
>
> I had previously found some workloads in which the 2.4 VM collapsed for
> similar reasons and those were fixed with the rotate_reclaimable_page()
> logic.  Without testcases we will not be able to verify that anything else
> needs doing.

Btw,

Andrew, are your testcases online somewhere?

I heard once that someone was going to collect VM tests to make an "official
testing package", but that has never happened AFAIK.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 22:03           ` Martin J. Bligh
  2004-02-27 22:23             ` Andrew Morton
@ 2004-02-28  2:32             ` Andrea Arcangeli
  2004-02-28  4:57               ` Wim Coekaerts
  2004-02-28  6:10               ` Martin J. Bligh
  1 sibling, 2 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-28  2:32 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrew Morton, Rik van Riel, linux-kernel

> > 4:4 makes no sense at all, the only advantage of 4:4 w.r.t. 2:2 is that
> > they can map 2.7G per task of shm instead of 1.7G per task of shm.

On Fri, Feb 27, 2004 at 02:03:07PM -0800, Martin J. Bligh wrote:
> 
> Eh? You have a 2GB difference of user address space, and a 1GB difference
> of shm size. You lost a GB somewhere ;-) Depending on whether you move
> TASK_UNMAPPED_BASE or not, you might mean 2.7 vs 0.7 or at a pinch
> 3.5 vs 1.5, I'm not sure.

the numbers I wrote are right. No shm size is lost. The shm size is >20G,
it doesn't fit in 4g of address space of 4:4 like it doesn't fit in 3G
of address space of 3:1 like it doesn't fit in 2:2.

I think nobody tested 2:2 seriously on 64G boxes yet, I'm simply asking
for that.

And I agree with you that using 64G with 3:1 is not feasible for applications
like databases; it's feasible for other apps, for example ones needing big
caches (if you can manage to boot the machine ;). It's not a matter of
opinion, it's a matter of fact: for a generic misc load the high limit of
3:1 is mem=48G, which is not too bad.

What changes between 3:1 and 2:2 is the "view" on the 20G shm file, not
the size of the shm. you can do less simultaneous mmap with a 1.7G view
instead of a 2.7G view. the nonlinear vma will be 1.7G in size with 2:2,
instead of 2.7G in size with 3:1 or 4:4 (300M are as usual left for some
hole, the binary itself and the stack)

> > syscall and irq. I expect the databases will run an order of magnitude
> > faster with _2:2_ in a 64G configuration, with _1.7G_ per process of shm
> > mapped, instead of their 4:4 split with 2.7G (or more, up to 3.9 ;)
> > mapped per task.
> 
> That may well be true for some workloads, I suspect it's slower for others.
> One could call the tradeoff either way.

the only chance it's faster is if you never use syscalls and you drive
all interrupts to other cpus and you have an advantage by mapping >2G in
the same address space. If you use syscalls and irqs, then you'll keep
flushing the address space, so you can as well use mmap and flush
_by_hand_ only the interesting bits when you really run into a
view-miss, so you can run at full speed in the fast path including
syscalls and irqs. Most of the time the view will be enough, there's
some aging technique to apply to the collection of the old buckets too. So
I've some doubt 4:4 runs faster anywhere. I could be wrong though.
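
to be concrete about flushing only the interesting bits by hand, here is a
minimal sketch of the windowing idea (the file name and the 512m window size
are made up for the example, and a real implementation would also age the old
buckets instead of keeping a single window):

#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

#define WINDOW_SIZE (512UL * 1024 * 1024)	/* view into the big shm file */

static int shm_fd;
static char * window;			/* current mapping of the window */
static off_t window_off = -1;		/* file offset currently mapped */

/* return a pointer to offset "off" of the shm file, remapping the window
   by hand only on a view-miss instead of flushing the whole address space */
static char * shm_access(off_t off)
{
  off_t base = (off / (off_t) WINDOW_SIZE) * (off_t) WINDOW_SIZE;

  if (base != window_off) {
    if (window)
      munmap(window, WINDOW_SIZE);	/* flush only this view */
    window = mmap(0, WINDOW_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED,
                  shm_fd, base);
    if (window == MAP_FAILED)
      perror("mmap"), exit(1);
    window_off = base;
  }
  return window + (off - base);
}

int main(void)
{
  shm_fd = open("/dev/shm/big-shm", O_CREAT|O_RDWR, 0600);
  if (shm_fd < 0)
    perror("open"), exit(1);
  if (ftruncate(shm_fd, 3 * (off_t) WINDOW_SIZE) < 0)	/* stand-in for 20G */
    perror("ftruncate"), exit(1);

  *shm_access(0) = 1;				/* view hit after the first map */
  *shm_access(2 * (off_t) WINDOW_SIZE + 42) = 1;	/* view miss: remap by hand */
  return 0;
}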

> > I don't mind if 4:4 gets merged but I recommend db vendors to benchmark
> > _2:2_ against 4:4 before remotely considering deploying 4:4 in
> > production.  Then of course let me know since I had not the luck to get
> > any number back and I've no access to any 64G box.
> 
> If you send me a *simple* simulation test, I'll gladly run it for you ;-)
> But I'm not going to go fiddle with Oracle, and thousands of disks ;-)

:)

thanks for the offer! ;) I would prefer a real life db bench since
syscalls and irqs are an important part of the load that hurts 4:4 most,
it doesn't necessarily need to be oracle though. And if it's a cpu with
big tlb cache like p4 it would be preferable. maybe we should talk
about this offline.

> > I don't care about 256G with 2:2 split, since intel and hp are now going
> > x86-64 too.
> 
> Yeah, I don't think we ever need to deal with that kind of insanity ;-)

;)

> >> averse to objrmap for file-backed mappings either - I agree that the search
> >> problems which were demonstrated are unlikely to bite in real life.
> > 
> > cool.
> > 
> > Martin's patch from IBM is a great start IMHO. I found a bug in the vma
> > flags check though, VM_RESERVED should be checked too, not only
> > VM_LOCKED, unless I'm missing something, but it's a minor issue.
> 
> I didn't actually write it - that was Dave McCracken ;-) I just suggested
> the partial approach (because I'm dirty and lazy ;-)) and carried it
> in my tree.

I know you didn't write it but I forgot who was the author so I just
gave credit to IBM at large ;). thanks for giving the due credit to
Dave ;)

> I agree with Andrew's comments though - it's not nice having the dual
> approach of the partial, but the complexity of the full approach is a
> bit scary and buys you little in real terms (performance and space).
> I still believe that creating an "address_space like structure" for
> anon memory, shared across VMAs is an idea that might give us cleaner
> code - it also fixes other problems like Andi's NUMA API binding.

agreed. It's just lower prio at the moment since anon memory doesn't
tend to be that much shared, so the overhead is minimal.

> I don't have time to go write it at the moment, but I can certainly
> run it on large end hardware if that helps.

thanks, we should write it someday. that testcase isn't the one suitable
for the 4:4 vs 2:2 thing though, for that a real life thing is needed
since irqs, syscalls (and possibly page faults but not that many with a
db) are fundamental parts of the load.  we could write a smarter
testcase as well, but I guess using a db is simpler, evaluating 2:2 vs
4:4 is more a do-once thing, results won't change over time.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 21:28           ` Andrea Arcangeli
  2004-02-27 21:37             ` Andrea Arcangeli
@ 2004-02-28  3:22             ` Andrea Arcangeli
  1 sibling, 0 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-28  3:22 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel

On Fri, Feb 27, 2004 at 10:28:44PM +0100, Andrea Arcangeli wrote:
> on the 64G boxes. I wanted to do page clustering but there are too many

for the record, with page clustering above I meant the patch developed
originally by Hugh for 2.4.7 and then further developed and currently
maintained by William on kernel.org.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28  2:32             ` Andrea Arcangeli
@ 2004-02-28  4:57               ` Wim Coekaerts
  2004-02-28  6:18                 ` Andrea Arcangeli
  2004-02-28  6:10               ` Martin J. Bligh
  1 sibling, 1 reply; 100+ messages in thread
From: Wim Coekaerts @ 2004-02-28  4:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Martin J. Bligh, Andrew Morton, Rik van Riel, linux-kernel

On Sat, Feb 28, 2004 at 03:32:36AM +0100, Andrea Arcangeli wrote:
> On Fri, Feb 27, 2004 at 02:03:07PM -0800, Martin J. Bligh wrote:
> > 
> > Eh? You have a 2GB difference of user address space, and a 1GB difference
> > of shm size. You lost a GB somewhere ;-) Depending on whether you move
> > TASK_UNMAPPED_BASE or not, you might mean 2.7 vs 0.7 or at a pinch
> > 3.5 vs 1.5, I'm not sure.
> 
> the numbers I wrote are right. No shm size is lost. The shm size is >20G,
> it doesn't fit in 4g of address space of 4:4 like it doesn't fit in 3G
> of address space of 3:1 like it doesn't fit in 2:2.

Andrea, one thing I don't think we have discussed before is that aside
from mapping into shmfs or hugetlbfs, there is also the regular shmem
segment (shmget) we always use. the way we currently allocate memory is
like this :

just a big shmem segment w/ shmget() up to like 1.7 or 2.5 gb,
containing the entire in memory part

or 

shm (reasonably sized segment, between 400mb and today on 32bit up to
like 1.7 - 2 gb) which is used for non buffercache (sqlcache, parse
trees etc)
a default of about 16000 mmaps into the shmfs file (or
remap_file_pages) and the total size ranging from a few gb to many gb
	which contains the data buffer cache

we cannot put the sqlcache (shared pool) into shmfs and do the windowing
and this is a big deal for performance as well. eg the larger the
better. it would have to be able to get to a reasonable size, and you
have about 512mb on top of that for the window into shmfs. average sizes
range between 1gb and 1.7gb so a 2/2 split would not be useful here. 
sql/plsql/java cache is quite important for certain things. 

I think Van is running a test on a 32gb box to compare the two but I think
that would be too limiting in general to have only 2gb.

wim






^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28  2:32             ` Andrea Arcangeli
  2004-02-28  4:57               ` Wim Coekaerts
@ 2004-02-28  6:10               ` Martin J. Bligh
  2004-02-28  6:43                 ` Andrea Arcangeli
  2004-03-02  9:10                 ` Kurt Garloff
  1 sibling, 2 replies; 100+ messages in thread
From: Martin J. Bligh @ 2004-02-28  6:10 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Rik van Riel, linux-kernel

>> > 4:4 makes no sense at all, the only advantage of 4:4 w.r.t. 2:2 is that
>> > they can map 2.7G per task of shm instead of 1.7G per task of shm.
> 
> On Fri, Feb 27, 2004 at 02:03:07PM -0800, Martin J. Bligh wrote:
>> 
>> Eh? You have a 2GB difference of user address space, and a 1GB difference
>> of shm size. You lost a GB somewhere ;-) Depending on whether you move
>> TASK_UNMAPPED_BASE or not, you might mean 2.7 vs 0.7 or at a pinch
>> 3.5 vs 1.5, I'm not sure.
> 
> the numbers I wrote are right. No shm size is lost. The shm size is >20G,
> it doesn't fit in 4g of address space of 4:4 like it doesn't fit in 3G
> of address space of 3:1 like it doesn't fit in 2:2.

OK, I understand you can window it, but I still don't get where your
figures of 2.7GB/task vs 1.7GB per task come from?
 
> I think nobody tested 2:2 seriously on 64G boxes yet, I'm simply asking
> for that.
> 
> And I agree with you that using 64G with 3:1 is not feasible for applications
> like databases; it's feasible for other apps, for example ones needing big
> caches (if you can manage to boot the machine ;). It's not a matter of
> opinion, it's a matter of fact: for a generic misc load the high limit of
> 3:1 is mem=48G, which is not too bad.

48GB is sailing damned close to the wind. The problem I've had before is
distros saying "we support X GB of RAM", but it only works for some
workloads, and falls over on others. Oddly enough, that tends to upset
the customers quite a bit ;-) I'd agree with what you say - for a generic
misc load, it might work ... but I'd hate a customer to hear that and
misinterpret it.
 
> What changes between 3:1 and 2:2 is the "view" on the 20G shm file, not
> the size of the shm. you can do less simultaneous mmap with a 1.7G view
> instead of a 2.7G view. the nonlinear vma will be 1.7G in size with 2:2,
> instead of 2.7G in size with 3:1 or 4:4 (300M are as usual left for some
> hole, the binary itself and the stack)

Why is it 2.7GB with both 3:1 and 4:4 ... surely it can get bigger on 
4:4 ???

> the only chance it's faster is if you never use syscalls and you drive
> all interrupts to other cpus and you have an advantage by mapping >2G in
> the same address space. 

I think that's the key - when you need to map a LOT of data into the
address space. Unfortunately, I think that's the kind of app that the
large machines run.

> I've some doubt 4:4 runs faster anywhere. I could be wrong though.

There's only one real way to tell ;-)

>> If you send me a *simple* simulation test, I'll gladly run it for you ;-)
>> But I'm not going to go fiddle with Oracle, and thousands of disks ;-)
> 
> :)
> 
> thanks for the offer! ;) I would prefer a real life db bench since
> syscalls and irqs are an important part of the load that hurts 4:4 most,
> it doesn't necessarily need to be oracle though. And if it's a cpu with
> big tlb cache like p4 it would be preferable. maybe we should talk
> about this offline.

I've been talking with others here about running a database workload
test, but it'll probably be on a machine with only 8GB or so. I still
think that's enough to show us something interesting.
 
> agreed. It's just lower prio at the moment since anon memory doesn't
> tend to be that much shared, so the overhead is minimal.

Yup, that's what my analysis found, most of it falls under the pte_direct
optimisation. The only problem seems to be that at fork/exec time we
set up the chain, then tear it down again, which is ugly. That's the bit
where I like Hugh's stuff.
 
>> I don't have time to go write it at the moment, but I can certainly
>> run it on large end hardware if that helps.
> 
> thanks, we should write it someday. that testcase isn't the one suitable
> for the 4:4 vs 2:2 thing though, for that a real life thing is needed
> since irqs, syscalls (and possibly page faults but not that many with a
> db) are fundamental parts of the load.  we could write a smarter
> testcase as well, but I guess using a db is simpler, evaluating 2:2 vs
> 4:4 is more a do-once thing, results won't change over time.

OK, I'll see what people here can do about that ;-)

M.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28  4:57               ` Wim Coekaerts
@ 2004-02-28  6:18                 ` Andrea Arcangeli
  2004-02-28  6:45                   ` Martin J. Bligh
       [not found]                   ` <20040228061838.GO8834@dualathlon.random.suse.lists.linux.kernel>
  0 siblings, 2 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-28  6:18 UTC (permalink / raw)
  To: Wim Coekaerts; +Cc: Martin J. Bligh, Andrew Morton, Rik van Riel, linux-kernel

On Fri, Feb 27, 2004 at 08:57:14PM -0800, Wim Coekaerts wrote:
> On Sat, Feb 28, 2004 at 03:32:36AM +0100, Andrea Arcangeli wrote:
> > On Fri, Feb 27, 2004 at 02:03:07PM -0800, Martin J. Bligh wrote:
> > > 
> > > Eh? You have a 2GB difference of user address space, and a 1GB difference
> > > of shm size. You lost a GB somewhere ;-) Depending on whether you move
> > > TASK_UNMAPPED_BASE or not, you might mean 2.7 vs 0.7 or at a pinch
> > > 3.5 vs 1.5, I'm not sure.
> > 
> > the numbers I wrote are right. No shm size is lost. The shm size is >20G,
> > it doesn't fit in 4g of address space of 4:4 like it doesn't fit in 3G
> > of address space of 3:1 like it doesn't fit in 2:2.
> 
> Andrea, one thing I don't think we have discussed before is that aside
> from mapping into shmfs or hugetlbfs, there is also the regular shmem
> segment (shmget) we always use. the way we currently allocate memory is
> like this :
> 
> just a big shmem segment w/ shmget() up to like 1.7 or 2.5 gb,
> containing the entire in memory part
> 
> or 
> 
> shm (reasonably sized segment, between 400mb and today on 32bit up to
> like 1.7 - 2 gb) which is used for non buffercache (sqlcache, parse
> trees etc)
> a default of about 16000 mmaps into the shmfs file (or
> remap_file_pages) and the total size ranging from a few gb to many gb
> 	which contains the data buffer cache
> 
> we cannot put the sqlcache (shared pool) into shmfs and do the windowing
> and this is a big deal for performance as well. eg the larger the
> better. it would have to be able to get to a reasonable size, and you
> have about 512mb on top of that for the window into shmfs. average sizes
> range between 1gb and 1.7gb so a 2/2 split would not be useful here. 
> sql/plsql/java cache is quite important for certain things. 

I see, so losing 1g sounds too much.

> 
> I think Van is running a test on a 32gb box to compare the two but I think
> that would be too limiting in general to have only 2gb.

thanks for giving it a spin (btw I assume it's 2.4, that's fine for
a quick test, and I seem not to find the 2:2 and 1:3 options in the 2.6
kernel anymore ;).

What I probably didn't specify yet is that 2.5:1.5 is feasible too, I've
a fairly small and straightforward patch here from ibm that implements
3.5:0.5 for PAE mode (for a completely different matter, but I mean,
it's not really a problem to do 2.5:1.5 either if needed, it's the same
as the current PAE mode 3.5:0.5).

starting with the assumption that 32G machines work with 3:1 (like they
do in 2.4), and assuming the size of a page structure is 48 bytes (like in 2.4, in
2.6 it's a bit bigger but we can most certainly shrink it, for example
removing rmap for anon pages will immediately release 128M of kernel
memory), moving from 32G to 64G means losing 384M of those additional
512M in pages, you can use the remaining additional 512M-384M=128M for
vmas, task structs, files etc...  So 2.5:1.5 should be enough as far as
the kernel is concerned to run on 64G machines (provided the page_t is
not bigger than 2.4 which sounds feasible too).
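
in numbers (a minimal sketch; the 48-byte page_t and the extra 512M of
lowmem from going 3:1 -> 2.5:1.5 are the assumptions above):

#include <stdio.h>

int main(void)
{
  double extra_ram = 32.0 * 1024 * 1024 * 1024;	/* going from 32G to 64G */
  double page_size = 4096;
  double page_t_size = 48;			/* 2.4-sized struct page */
  double extra_lowmem_mb = 512;			/* 3:1 -> 2.5:1.5 */

  double mem_map_mb = extra_ram / page_size * page_t_size / 1024 / 1024;

  /* ~384M of the extra 512M goes to page structures, ~128M is left
     for vmas, task structs, files etc. */
  printf("extra mem_map %.0f MB, left over %.0f MB\n",
         mem_map_mb, extra_lowmem_mb - mem_map_mb);
  return 0;
}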

we can add a config option to enable together with 2.5:1.5 to drop the
gap page in vmalloc, and to reduce the vmalloc space, so that we can
sneak another few "free" dozen megs back for the 64G kernel just to get
more margin even if we don't strictly need it. (btw, the vmalloc space
is also tunable at boot, so this config option would just change the
default value)

So as long as 32G works with 3:1, 2.5:1.5 is going to be more than enough
to handle 64G.

the question that remains is whether you can live with only 2.5G of address
space, so whether losing 512m is a blocker or not. I see losing 1G was way
too much, but the kernel doesn't need 1G more, an additional 512m is enough
to make the kernel happy. If losing 512m is a big problem too, I don't think
we can drop less than 512m of address space from userspace, so 4:4 would
remain the only way to handle 64G and we can forget about this suggestion
of mine. Certainly I believe 2.5:1.5 has a very good chance to
significantly outperform 4:4 if you can make the ipc shm 1.7G and the
window on the shm 512m (that leaves you 300m for holes, stack, binary,
anonymous memory and similar minor allocations).

thanks.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28  6:10               ` Martin J. Bligh
@ 2004-02-28  6:43                 ` Andrea Arcangeli
  2004-02-28  7:00                   ` Martin J. Bligh
  2004-03-02  9:10                 ` Kurt Garloff
  1 sibling, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-28  6:43 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrew Morton, Rik van Riel, linux-kernel

On Fri, Feb 27, 2004 at 10:10:22PM -0800, Martin J. Bligh wrote:
> OK, I understand you can window it, but I still don't get where your
> figures of 2.7GB/task vs 1.7GB per task come from?

2.7G is what you can map right now in 2.6 with 3:1.

dropping 1G from the userspace means reducing the shm mappings to 1.7G
of address space.

sorry if I caused some confusion.

> 48GB is sailing damned close to the wind. The problem I've had before is

;)

> distros saying "we support X GB of RAM", but it only works for some
> workloads, and falls over on others. Oddly enough, that tends to upset
> the customers quite a bit ;-) I'd agree with what you say - for a generic
> misc load, it might work ... but I'd hate a customer to hear that and
> misinterpret it.

I see...

> > What changes between 3:1 and 2:2 is the "view" on the 20G shm file, not
> > the size of the shm. you can do less simultaneous mmap with a 1.7G view
> > instead of a 2.7G view. the nonlinear vma will be 1.7G in size with 2:2,
> > instead of 2.7G in size with 3:1 or 4:4 (300M are as usual left for some
> > hole, the binary itself and the stack)
> 
> Why is it 2.7GB with both 3:1 and 4:4 ... surely it can get bigger on 
> 4:4 ???

yes it can be bigger there. I wrote it to simplify, I mean it doesn't
need to be bigger, but it can.

> > the only chance it's faster is if you never use syscalls and you drive
> > all interrupts to other cpus and you have an advantage by mapping >2G in
> > the same address space. 
> 
> I think that's the key - when you need to map a LOT of data into the
> address space. Unfortunately, I think that's the kind of app that the
> large machines run.

agreed.

> > I've some doubt 4:4 runs faster anywhere. I could be wrong though.
> 
> There's only one real way to tell ;-)

indeed ;)

> >> If you send me a *simple* simulation test, I'll gladly run it for you ;-)
> >> But I'm not going to go fiddle with Oracle, and thousands of disks ;-)
> > 
> > :)
> > 
> > thanks for the offer! ;) I would prefer a real life db bench since
> > syscalls and irqs are an important part of the load that hurts 4:4 most,
> > it doesn't necessarily need to be oracle though. And if it's a cpu with
> > big tlb cache like p4 it would be preferable. maybe we should talk
> > about this offline.
> 
> I've been talking with others here about running a database workload
> test, but it'll probably be on a machine with only 8GB or so. I still
> think that's enough to show us something interesting.

yes, it should be enough to show something interesting. However the best
would be to really run it on a 32G box, 32G should really show the
divergence. getting results w/ and w/o hugetlbfs may be interesting too
(it's not clear if 4:4 will benefit more or less from hugetlbfs, it will
walk only twice to reach the physical page, but OTOH flushing the tlb so
frequently will partly invalidate the huge tlb behaviour).

> > agreed. It's just lower prio at the moment since anon memory doesn't
> > tend to be that much shared, so the overhead is minimal.
> 
> Yup, that's what my analysis found, most of it falls under the pte_direct
> optimisation. The only problem seems to be that at fork/exec time we
> set up the chain, then tear it down again, which is ugly. That's the bit
> where I like Hugh's stuff.

Me too. I've a testcase here that runs 50% slower in 2.6 than in 2.4, due
to the slowdown in fork/pagefaults etc. (real apps of course don't show
it, this is a "malicious" testcase ;).

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

#define SIZE (1024*1024*1024)

int main(int argc, char ** argv)
{
  int fd, level, max_level;
  char * start, * end, * tmp;

  max_level = atoi(argv[1]);

  fd = open("/tmp/x", O_CREAT|O_RDWR, 0600);	/* 0600: O_CREAT needs an explicit mode */
  if (fd < 0)
    perror("open"), exit(1);
  if (ftruncate(fd, SIZE) < 0)
    perror("truncate"), exit(1);
  if ((start = mmap(0, SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0)) == MAP_FAILED)
    perror("mmap"), exit(1);
  end = start + SIZE;
  
  /* write-fault every page once to populate the pagetables */
  for (tmp = start; tmp < end; tmp += 4096) {
    *tmp = 0;
  }

  /* both parent and child keep looping: 2^max_level tasks in the end,
     each remapping and re-faulting the whole 1G file mapping */
  for (level = 0; level < max_level; level++) {
    if (fork() < 0)
      perror("fork"), exit(1);
    if (munmap(start, SIZE) < 0)
      perror("munmap"), exit(1);
    if ((start = mmap(0, SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0)) == MAP_FAILED)
      perror("mmap"), exit(1);
    end = start + SIZE;
  
    /* read-fault every page again in every task */
    for (tmp = start; tmp < end; tmp += 4096) {
      *(volatile char *)tmp;
    }
  }
  return 0;
}

(it's insecure since "/tmp/x" is fixed, change that file if you need
local security).
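
(to run it, assuming the file is saved as forktest.c: gcc forktest.c -o
forktest and then e.g. ./forktest 8, which ends up with 2^8 tasks each
re-touching the same 1G MAP_PRIVATE mapping; timing that on 2.4 vs 2.6
should show the difference.)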

> >> I don't have time at the moment to go write it at the moment, but I
> >> can certainly run it on large end hardware if that helps.
> > 
> > thanks, we should write it someday. that testcase isn't the one suitable
> > for the 4:4 vs 2:2 thing though, for that a real life thing is needed
> > since irqs, syscalls (and possibly page faults but not that many with a
> > db) are fundamental parts of the load.  we could write a smarter
> > testcase as well, but I guess using a db is simpler, evaluating 2:2 vs
> > 4:4 is more a do-once thing, results won't change over time.
> 
> OK, I'll see what people here can do about that ;-)

cool ;)

as I wrote to Wim to make it more acceptable we'll have to modify your
3.5:0.5 PAE patch to do 2.5:1.5 too, to give userspace another 512m that
the kernel actually doesn't need. And still I'm not sure if Wim can live
with 1.7G ipcshm and 512m of shmfs window, if that's not enough user
address space then it's unlikely this thread will go anywhere since 512m
are needed to handle an additional 32G with a reasonable margin (even
after shrinking the page_t to the 2.4 levels).

The last issue that we may run into are apps assuming the stack is at 3G
fixed, some jvm assumed that, but they should be fixed by now (at the
very least it's not hard at all to fix those).

It also depends on the performance difference whether this is worthwhile; if
the difference isn't very significant 4:4 will certainly be preferable
so you can also allocate 4G in the same task for apps not using syscalls
or page faults or floods of network irqs.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28  6:18                 ` Andrea Arcangeli
@ 2004-02-28  6:45                   ` Martin J. Bligh
  2004-02-28  7:05                     ` Andrea Arcangeli
       [not found]                   ` <20040228061838.GO8834@dualathlon.random.suse.lists.linux.kernel>
  1 sibling, 1 reply; 100+ messages in thread
From: Martin J. Bligh @ 2004-02-28  6:45 UTC (permalink / raw)
  To: Andrea Arcangeli, Wim Coekaerts, Hugh Dickins
  Cc: Andrew Morton, Rik van Riel, linux-kernel


> thanks for giving it a spin (btw I assume it's 2.4, that's fine for
> a quick test, and I seem not to find the 2:2 and 1:3 options in the 2.6
> kernel anymore ;).

ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.6.3/2.6.3-mjb1/212-config_page_offset
(which sits on top of the 4/4 patches, so might need some massaging to apply)

> What I probably didn't specify yet is that 2.5:1.5 is feasible too, I've
> a fairly small and straightforward patch here from ibm that implements
> 3.5:0.5 for PAE mode (for a completely different matter, but I mean,
> it's not really a problem to do 2.5:1.5 either if needed, it's the same
> as the current PAE mode 3.5:0.5).

I'm not sure it's that straightforward really - doing the non-pgd aligned
split is messy. 2.5 might actually be much cleaner than 3.5 though, as we
never updated the mappings of the PMD that's shared between user and kernel.
Hmmm ... that's quite tempting.

> starting with the assumption that 32G machines work with 3:1 (like they
> do in 2.4), and assuming the size of a page structure is 48 bytes (like in 2.4, in
> 2.6 it's a bit bigger but we can most certainly shrink it, for example
> removing rmap for anon pages will immediately release 128M of kernel
> memory), moving from 32G to 64G means losing 384M of those additional
> 512M in pages, you can use the remaining additional 512M-384M=128M for
> vmas, task structs, files etc...  So 2.5:1.5 should be enough as far as
> the kernel is concerned to run on 64G machines (provided the page_t is
> not bigger than 2.4 which sounds feasible too).

Shrinking struct page sounds nice. Did Hugh's patch actually end up doing
that? I don't recall that, but I don't see why it wouldn't.

M.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28  6:43                 ` Andrea Arcangeli
@ 2004-02-28  7:00                   ` Martin J. Bligh
  2004-02-28  7:29                     ` Andrea Arcangeli
  0 siblings, 1 reply; 100+ messages in thread
From: Martin J. Bligh @ 2004-02-28  7:00 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Rik van Riel, linux-kernel

> The last issue that we may run into are apps assuming the stack is at 3G
> fixed, some jvm assumed that, but they should be fixed by now (at the
> very least it's not hard at all to fix those).

All the potential solutions we're discussing hit that problem so I don't
see it matters much which one we choose ;-)
 
> It also depends on the performance difference if this is worthwhile, if
> the difference isn't very significant 4:4 will certainly be preferable
> so you can also allocate 4G in the same task for apps not using syscalls
> or page faults or flood of network irqs.

There are some things that may well help here: one is vsyscall gettimeofday,
which will fix up the worst of the issues (the 30% figure you mentioned
to me in Ottawa), the other is NAPI, which would help with the network
stuff.

Bill had a patch to allocate mmaps, etc down from the top of memory and
thus eliminate TASK_UNMAPPED_BASE, and shift the stack back into the
empty hole from 0-128MB of memory where it belongs (according to the spec).
Getting rid of those two problems gives us back a little more userspace 
as well. 

Unfortunately it does seem to break some userspace apps making stupid 
assumptions, but if we have a neat way to mark the binaries (Andi was
talking about personalities or something), we could at least get the
big mem hogs to do that (databases, java, etc).

I have a copy of Bill's patch in my tree if you want to take a look:

ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.6.3/2.6.3-mjb1/410-topdown

That might make your 2.5/1.5 proposal more feasible with less loss of
userspace.

M

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28  6:45                   ` Martin J. Bligh
@ 2004-02-28  7:05                     ` Andrea Arcangeli
  2004-02-28  9:19                       ` Dave Hansen
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-28  7:05 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Wim Coekaerts, Hugh Dickins, Andrew Morton, Rik van Riel, linux-kernel

On Fri, Feb 27, 2004 at 10:45:21PM -0800, Martin J. Bligh wrote:
> 
> > thanks for giving it a spin (btw I assume it's 2.4, that's fine for
> > a quick test, and I seem not to find the 2:2 and 1:3 options in the 2.6
> > kernel anymore ;).
> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.6.3/2.6.3-mjb1/212-config_page_offset
> (which sits on top of the 4/4 patches, so might need some massaging to apply)

thanks for maintaining this bit too, very helpful!

> > What I probably didn't specify yet is that 2.5:1.5 is feasible too, I've
> > a fairly small and straightforward patch here from ibm that implements
> > 3.5:0.5 for PAE mode (for a completely different matter, but I mean,
> > it's not really a problem to do 2.5:1.5 either if needed, it's the same
> > as the current PAE mode 3.5:0.5).
> 
> I'm not sure it's that straightforward really - doing the non-pgd aligned
> split is messy. 2.5 might actually be much cleaner than 3.5 though, as we
> never updated the mappings of the PMD that's shared between user and kernel.
> Hmmm ... that's quite tempting.

I read the 3.5:0.5 PAE sometime last year and it was pretty
straightforward too, the only reason I didn't merge it is that
it had the problem that it changed common code that every arch depends
on, so it broke all other archs, but it's not really a matter of
difficult code, at worst it just needs a few-line change in every arch
to make them compile again. So I'm quite optimistic 2.5:1.5 will be
doable with a reasonably clean patch and with ~zero performance downside
compared to 3:1 and 2:2.

In the meantime testing 2:2 against 4:4 (with a very/too reduced ipcshm
in the 2:2 test) still sounds very interesting.

> > starting with the assumption that 32G machines work with 3:1 (like they
> > do in 2.4), and assuming the size of a page structure is 48 bytes (like in 2.4, in
> > 2.6 it's a bit bigger but we can most certainly shrink it, for example
> > removing rmap for anon pages will immediately release 128M of kernel
> > memory), moving from 32G to 64G means losing 384M of those additional
> > 512M in pages, you can use the remaining additional 512M-384M=128M for
> > vmas, task structs, files etc...  So 2.5:1.5 should be enough as far as
> > the kernel is concerned to run on 64G machines (provided the page_t is
> > not bigger than 2.4 which sounds feasible too).
> 
> Shrinking struct page sounds nice. Did Hugh's patch actually end up doing
> that? I don't recall that, but I don't see why it wouldn't.

full objrmap can certainly release 8 bytes per page, 128M total, so
quite a huge amount of ram (that is also why I'd like to do the full
objrmap and not stop only at the file mappings ;).

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28  7:00                   ` Martin J. Bligh
@ 2004-02-28  7:29                     ` Andrea Arcangeli
  2004-02-28 14:55                       ` Rik van Riel
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-28  7:29 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrew Morton, Rik van Riel, linux-kernel

On Fri, Feb 27, 2004 at 11:00:44PM -0800, Martin J. Bligh wrote:
> > The last issue that we may run into are apps assuming the stack is at 3G
> > fixed, some jvm assumed that, but they should be fixed by now (at the
> > very least it's not hard at all to fix those).
> 
> All the potential solutions we're discussing hit that problem so I don't
> see it matters much which one we choose ;-)

I agree, I thought about it too, but I didn't mention that since
theoretically 4:4 has a chance to start with a stack at 3G and to depend
on the userspace startup to relocate it at 4G ;). x86-64 does something
like that to guarantee 100% compatibility.

> > It also depends on the performance difference if this is worthwhile, if
> > the difference isn't very significant 4:4 will certainly be preferable
> > so you can also allocate 4G in the same task for apps not using syscalls
> > or page faults or flood of network irqs.
> 
> There are some things that may well help here: one is vsyscall gettimeofday,
> which will fix up the worst of the issues (the 30% figure you mentioned
> to me in Ottawa), the other is NAPI, which would help with the network
> stuff.

I think it's very fair to benchmark vsyscalls with 2.5:1.5 vs vsyscalls
with 4:4.

However remember you said you want a generic kernel for 64G, right? Not
all userspaces will use vsyscalls, and it's not just one app using
gettimeofday. As of today no production userspace uses vgettimeofday in
x86 yet. I mean, we can tell people to always use vsyscalls with the 4:4
kernel and it's acceptable, but it's not as generic as 2.5:1.5.

> Bill had a patch to allocate mmaps, etc down from the top of memory and
> thus eliminate TASK_UNMAPPED_BASE, and shift the stack back into the
> empty hole from 0-128MB of memory where it belongs (according to the spec).
> Getting rid of those two problems gives us back a little more userspace 
> as well. 
> 
> Unfortunately it does seem to break some userspace apps making stupid 
> assumptions, but if we have a neat way to mark the binaries (Andi was
> talking about personalities or something), we could at least get the
> big mem hogs to do that (databases, java, etc).

I read something about this issue. I agree it must be definitely marked.
apps may very well make assumptions about that space being empty below
128m and overwrite it with a mmap() (mmap will just silently overwrite),
and I'm unsure if we can claim that to be a userspace bug..., I guess
most people will blame the kernel ;)

Now that x86 is dying it's probably not worth marking the binaries; the
few apps needing this should relocate the stack by hand and set up the
growsdown bitflag, plus they should lower the mapped base by hand with the
/proc tweak like we do in 2.4.

I agree having the stack growsdown at 128 is the best for the db setup,
but I doubt we can make it generic and automatic for all apps. Also it's
not really the stack that is the problem in terms of genericity; in fact with
recursive algos the stack may need to grow a lot, and having it at 128m
could segfault. As for mapped-base the space between 128 and 1G may as
well be assumed empty by the apps, so relocation is possible on demand
by the app. I doubt we can do better than the above without taking risks ;)

> I have a copy of Bill's patch in my tree if you want to take a look:
> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.6.3/2.6.3-mjb1/410-topdown

thanks for the pointer.

> 
> That might make your 2.5/1.5 proposal more feasible with less loss of
> userspace.

Yes. I was sort of assuming that we would use the mapped-base tweak for
achieving that, the relocation of the stack is a good idea, and it's
doable all in userspace (though it's not generic/automatic).

thanks.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28  7:05                     ` Andrea Arcangeli
@ 2004-02-28  9:19                       ` Dave Hansen
  2004-03-18  2:44                         ` Andrea Arcangeli
  0 siblings, 1 reply; 100+ messages in thread
From: Dave Hansen @ 2004-02-28  9:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Martin J. Bligh, Wim Coekaerts, Hugh Dickins, Andrew Morton,
	Rik van Riel, Linux Kernel Mailing List

On Fri, 2004-02-27 at 23:05, Andrea Arcangeli wrote:
> > I'm not sure it's that straightforward really - doing the non-pgd aligned
> > split is messy. 2.5 might actually be much cleaner than 3.5 though, as we
> > never updated the mappings of the PMD that's shared between user and kernel.
> > Hmmm ... that's quite tempting.
> 
> I read the 3.5:0.5 PAE sometime last year and it was pretty
> straightforward too, the only reason I didn't merge it is that
> it had the problem that it changed common code that every arch depends
> on, so it broke all other archs, but it's not really a matter of
> difficult code, at worst it just needs a few-line change in every arch
> to make them compile again. So I'm quite optimistic 2.5:1.5 will be
> doable with a reasonably clean patch and with ~zero performance downside
> compared to 3:1 and 2:2.

The only performance problem with using PMDs which are shared between
kernel and user PTE pages is that you can potentially be required to
instantiate the kernel portion of the shared PMD each time you need a
new set of page tables.  A slab for these partial PMDs is quite helpful
in this case.  

The real logistical problem with partial PMDs is just making sure that
all of the 0 ... PTRS_PER_PMD loops are correct.  The last few times
I've implemented it, I just made PTRS_PER_PMD take a PGD index, and made
sure to start all of the loops from things like pmd_index(PAGE_OFFSET)
instead of 0.  

Here are a couple of patches that allowed partial user/kernel PMDs. 
These conflicted with 4:4 and got dropped somewhere along the way, but
the generic approaches worked.  I believe they at least compiled on all
of the arches, too.  

ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.5.68/2.5.68-mjb1/540-separate_pmd
ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.5.68/2.5.68-mjb1/650-banana_split

-- dave


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
       [not found]                   ` <20040228061838.GO8834@dualathlon.random.suse.lists.linux.kernel>
@ 2004-02-28 12:46                     ` Andi Kleen
  2004-02-29  1:39                       ` Andrea Arcangeli
  0 siblings, 1 reply; 100+ messages in thread
From: Andi Kleen @ 2004-02-28 12:46 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:
> 
> we can add a config option to enable together with 2.5:1.5 to drop the
> gap page in vmalloc, and to reduce the vmalloc space, so that we can
> sneak another few "free" dozen megs back for the 64G kernel just to get
> more margin even if we don't strictly need it. (btw, the vmalloc space
> is also tunable at boot, so this config option would just change the
> default value)

Not sure if that would help, but you could relatively easily save
8 bytes on 32bit for each vma too. Replace vm_next with rb_next()
and move vm_rb.color into vm_flags. It would be a lot of editing
work though. NUMA API will add new 4 bytes again. 

-Andi

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28  7:29                     ` Andrea Arcangeli
@ 2004-02-28 14:55                       ` Rik van Riel
  2004-02-28 15:06                         ` Arjan van de Ven
  2004-02-29  1:43                         ` Andrea Arcangeli
  0 siblings, 2 replies; 100+ messages in thread
From: Rik van Riel @ 2004-02-28 14:55 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Martin J. Bligh, Andrew Morton, linux-kernel

On Sat, 28 Feb 2004, Andrea Arcangeli wrote:

> I agree, I thought about it too, but I didn't mention that since
> theoretically 4:4 has a chance to start with a stack at 3G and to depend
> on the userspace startup to relocate it at 4G ;). x86-64 does something
> like that to guarantee 100% compatibility.

Personalities work fine for that kind of thing.  The few buggy
apps that can't deal with addresses >3GB (IIRC the JVM in the
Oracle installer) get their stack at 3GB, the others get their
stack at 4GB.

> I think it's very fair to benchmark vsyscalls with 2.5:1.5 vs vsyscalls
> with 4:4.

The different setups should definitely be benchmarked.  I know
we expected the 4:4 kernel to be slower at everything, but the
folks at Oracle actually ran into a few situations where the 4:4
kernel was _faster_ than a 3:1 kernel.

Definitely not what we expected, but a nice surprise nonetheless.

> Now that x86 is dying it's probably not worth marking the binaries,

All you need to do for that is to copy some code from RHEL3 ;)

> I agree having the stack growsdown at 128 is the best for the db setup,

Alternatively, you start the mmap at "stack start - stack ulimit" and
grow it down from there. That still gives you 3.8GB of usable address
space on x86, with the 4:4 split. ;)

cheers,

Rik
-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28 14:55                       ` Rik van Riel
@ 2004-02-28 15:06                         ` Arjan van de Ven
  2004-02-29  1:43                         ` Andrea Arcangeli
  1 sibling, 0 replies; 100+ messages in thread
From: Arjan van de Ven @ 2004-02-28 15:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Martin J. Bligh, Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 197 bytes --]


> > Now that x86 is dying it's probably not worth marking the binaries,
> 
> All you need to do for that is to copy some code from RHEL3 ;)

which we in turn copied from Andi, eg x86_64



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28 12:46                     ` Andi Kleen
@ 2004-02-29  1:39                       ` Andrea Arcangeli
  2004-02-29  2:29                         ` Andi Kleen
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-29  1:39 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Sat, Feb 28, 2004 at 01:46:47PM +0100, Andi Kleen wrote:
> Andrea Arcangeli <andrea@suse.de> writes:
> > 
> > we can add a config option to enable together with 2.5:1.5 to drop the
> > gap page in vmalloc, and to reduce the vmalloc space, so that we can
> > sneak another few "free" dozen megs back for the 64G kernel just to get
> > more margin even if we don't strictly need it. (btw, the vmalloc space
> > is also tunable at boot, so this config option would just change the
> > default value)
> 
> Not sure if that would help, but you could relatively easily save
> 8 bytes on 32bit for each vma too. Replace vm_next with rb_next()
> and move vm_rb.color into vm_flags. It would be a lot of editing

the vm_flags rb_color thing is a smart idea indeed, I never thought
about using vm_flags itself for it, however it clearly needs a generic
wrapper since we want to keep the rbtree completely generic. David
Woodhouse once suggested to me using the least significant bit of one of
the pointers to save the rb_color; that could work, but it really
messes the code up since such a pointer would need to be masked every
time, and it's not self contained. Using vm_flags sounds more
interesting since the pointers are still usable in raw mode; one only
needs to be careful about the locking: vm_flags seems pretty much a
readonly thing so it's probably ok, but if there were other writers
outside the rbtree code then we'd need to make sure they're serialized.
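
For what it's worth, a minimal sketch of the vm_flags idea (VM_RB_RED is an
assumed free bit and these accessors don't exist in the stock tree; a generic
rbtree would have to reach the colour through wrappers like these instead of
touching the node directly):

/* Sketch only: keep the rb colour of the vma's tree node in vm_flags.
 * This is safe only because vm_flags and the tree are written under the
 * same serialization (mmap_sem/vmlist_lock), as discussed above. */
#define VM_RB_RED               0x80000000UL    /* assumed unused bit */

#define vma_rb_is_red(vma)      (((vma)->vm_flags & VM_RB_RED) != 0)
#define vma_rb_set_red(vma)     ((vma)->vm_flags |= VM_RB_RED)
#define vma_rb_set_black(vma)   ((vma)->vm_flags &= ~VM_RB_RED)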

you're wrong about s/vm_next/rb_next()/: walking the tree like in
get_unmapped_area would require recursive algorithms without vm_next, or
significant heap allocations. That's the only thing vm_next is needed
for (i.e. to walk the tree in order efficiently). Only if we drop all
tree walks can we nuke vm_next.

> work though. NUMA API will add 4 new bytes again. 

saving memory on vmas is partly already accomplished by remap_file_pages,
so I don't rate vma size as critical.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28 14:55                       ` Rik van Riel
  2004-02-28 15:06                         ` Arjan van de Ven
@ 2004-02-29  1:43                         ` Andrea Arcangeli
       [not found]                           ` < 1078370073.3403.759.camel@abyss.local>
  2004-03-04  3:14                           ` Peter Zaitsev
  1 sibling, 2 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-29  1:43 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Martin J. Bligh, Andrew Morton, linux-kernel

On Sat, Feb 28, 2004 at 09:55:14AM -0500, Rik van Riel wrote:
> The different setups should definitely be benchmarked.  I know
> we expected the 4:4 kernel to be slower at everything, but the
> folks at Oracle actually ran into a few situations where the 4:4
> kernel was _faster_ than a 3:1 kernel.
> 
> Definitely not what we expected, but a nice surprise nonetheless.

this is the first time I hear something like this. Maybe you mean the
4:4 was actually using more ram for the SGA? Just curious.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-29  1:39                       ` Andrea Arcangeli
@ 2004-02-29  2:29                         ` Andi Kleen
  2004-02-29 16:34                           ` Andrea Arcangeli
  0 siblings, 1 reply; 100+ messages in thread
From: Andi Kleen @ 2004-02-29  2:29 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

On Sun, 29 Feb 2004 02:39:24 +0100
Andrea Arcangeli <andrea@suse.de> wrote:

> you're wrong about s/vm_next/rb_next()/: walking the tree like in
> get_unmapped_area would require recursive algorithms without vm_next, or
> significant heap allocations. That's the only thing vm_next is needed
> for (i.e. to walk the tree in order efficiently). Only if we drop all
> tree walks can we nuke vm_next.

Not sure what you mean here. rb_next() is not recursive.

-Andi

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 19:08     ` Rik van Riel
  2004-02-27 20:29       ` Andrew Morton
  2004-02-27 20:31       ` Andrea Arcangeli
@ 2004-02-29  6:34       ` Mike Fedyk
  2 siblings, 0 replies; 100+ messages in thread
From: Mike Fedyk @ 2004-02-29  6:34 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrea Arcangeli, linux-kernel, Andrew Morton

Rik van Riel wrote:
> On Fri, 27 Feb 2004, Andrea Arcangeli wrote:
>>>Then again, your stuff will also find pages the moment they're
>>>cleaned, just at the cost of a (little?) bit more CPU time.
>>
>>exactly, that's an important effect of my patch and that's the only
>>thing that o1 vm is taking care of, I don't think it's enough since the
>>gigs of cache would still be like a memleak without my code.
> 
> 
> ... however, if you have a hundred gigabyte of memory, or
> even more, then you cannot afford to search the inactive
> list for clean pages on swapout. It will end up using too
> much CPU time.
> 
> The FreeBSD people found this out the hard way, even on
> smaller systems...

So that's what the inact_clean list is for in 2.4-rmap.

But your inactive lists are always much smaller than the active list on 
the smallish (< 1.5G) machines...

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-29  2:29                         ` Andi Kleen
@ 2004-02-29 16:34                           ` Andrea Arcangeli
  0 siblings, 0 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-29 16:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Sun, Feb 29, 2004 at 03:29:47AM +0100, Andi Kleen wrote:
> On Sun, 29 Feb 2004 02:39:24 +0100
> Andrea Arcangeli <andrea@suse.de> wrote:
> 
> > you're wrong about s/vm_next/rb_next()/: walking the tree like in
> > get_unmapped_area would require recursive algorithms without vm_next, or
> > significant heap allocations. That's the only thing vm_next is needed
> > for (i.e. to walk the tree in order efficiently). Only if we drop all
> > tree walks can we nuke vm_next.
> 
> Not sure what you mean here. rb_next() is not recursive.

if you don't allocate the memory with recursion-like algorithms, you'll
burn too much cpu in a loop like this with rb_next(). So it's worth
keeping vm_next either for performance reasons or for memory allocation
reasons.

	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
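
For comparison, the same in-order walk done purely through the rbtree would
look roughly like this (a sketch assuming rb_next()/rb_entry() as proposed
above; each step chases parent/child pointers instead of loading a single
vm_next field):

        struct vm_area_struct *vma = find_vma(mm, addr);
        struct rb_node *nd = vma ? &vma->vm_rb : NULL;

        for (; nd; nd = rb_next(nd)) {
                vma = rb_entry(nd, struct vm_area_struct, vm_rb);
                /* ... same loop body as the vm_next version above ... */
        }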

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 20:49         ` Rik van Riel
  2004-02-27 20:55           ` Andrew Morton
  2004-02-27 21:28           ` Andrea Arcangeli
@ 2004-03-01 11:10           ` Nikita Danilov
  2 siblings, 0 replies; 100+ messages in thread
From: Nikita Danilov @ 2004-03-01 11:10 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, andrea, linux-kernel

Rik van Riel writes:

[...]

 > 
 > Duh, I forgot all about the rotate_reclaimable_page() stuff.
 > That may well fix all problems 2.6 would have otherwise had
 > in this area.
 > 
 > I really hope we won't need anything like the O(1) VM stuff
 > in 2.6, since that would leave me more time to work on other
 > cool stuff (like resource management ;)).

Page-out from the end of the inactive list is not efficient, because pages
are submitted for IO in more or less random order and this results in
a lot of seeks. Test-case: replace ->writepage() with

/* no-op ->writepage(): pretend the page was written without doing any IO */
int foofs_writepage(struct page *page)
{
        SetPageDirty(page);     /* keep the page dirty */
        unlock_page(page);      /* drop the page lock taken by the VM */
        return 0;
}

and run

$ time cp /tmpfs/huge-data-set /foofs

File systems (and anonymous memory) want clustered write-out, and VM
designs with a separate write-out queue (like O(1) VM) are better suited
for this.

Nikita.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28  6:10               ` Martin J. Bligh
  2004-02-28  6:43                 ` Andrea Arcangeli
@ 2004-03-02  9:10                 ` Kurt Garloff
  2004-03-02 15:32                   ` Martin J. Bligh
  1 sibling, 1 reply; 100+ messages in thread
From: Kurt Garloff @ 2004-03-02  9:10 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Linux kernel list

[-- Attachment #1: Type: text/plain, Size: 503 bytes --]

On Fri, Feb 27, 2004 at 10:10:22PM -0800, Martin J. Bligh wrote:
> Why is it 2.7GB with both 3:1 and 4:4 ... surely it can get bigger on 
> 4:4 ???

You could use 3.7 on 4:4, but what's the point if you throw away the
mapping constantly by flushing the TLB?

Regards,
-- 
Kurt Garloff                   <kurt@garloff.de>             [Koeln, DE]
Physics:Plasma modeling <garloff@plasimo.phys.tue.nl> [TU Eindhoven, NL]
Linux: SUSE Labs (Head)        <garloff@suse.de>    [SUSE Nuernberg, DE]

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-02  9:10                 ` Kurt Garloff
@ 2004-03-02 15:32                   ` Martin J. Bligh
  0 siblings, 0 replies; 100+ messages in thread
From: Martin J. Bligh @ 2004-03-02 15:32 UTC (permalink / raw)
  To: Kurt Garloff; +Cc: Linux kernel list

> On Fri, Feb 27, 2004 at 10:10:22PM -0800, Martin J. Bligh wrote:
>> Why is it 2.7GB with both 3:1 and 4:4 ... surely it can get bigger on 
>> 4:4 ???
> 
> You could use 3.7 on 4:4, but what's the point if you throw away the
> mapping constantly by flushing the TLB?

Normally, a bigger shm segment = higher performance. Throwing the TLB
away means lower performance. Depending on the workload, the tradeoff
could work out either way ... the only thing I've seen so far from
someone who has measured it was hints that 4/4 was faster in some 
situations ... we're trying to do some more runs to confirm / deny that.
Hopefully others will do the same ;-)

M.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-29  1:43                         ` Andrea Arcangeli
       [not found]                           ` < 1078370073.3403.759.camel@abyss.local>
@ 2004-03-04  3:14                           ` Peter Zaitsev
  2004-03-04  3:33                             ` Andrew Morton
  1 sibling, 1 reply; 100+ messages in thread
From: Peter Zaitsev @ 2004-03-04  3:14 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Martin J. Bligh, Andrew Morton, linux-kernel

On Sat, 2004-02-28 at 17:43, Andrea Arcangeli wrote:

> > 
> > Definitely not what we expected, but a nice surprise nonetheless.
> 
> this is the first time I hear something like this. Maybe you mean the
> 4:4 was actually using more ram for the SGA? Just curious.

I actually recently did MySQL benchmarks using the DBT2 MySQL port.

The test box was a 4-way Xeon with HT, 4GB RAM, 8 SATA disks in RAID10.

I used RH AS 3.0 for the tests (2.4.21-9.ELxxx).

For Disk Bound workloads (200 Warehouse) I got 1250TPM for "hugemem" vs
1450TPM for "smp" kernel, which is some 14% slowdown.

For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM,
which is over 35% slowdown.





-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
  http://www.mysql.com/uc2004/


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04  3:14                           ` Peter Zaitsev
@ 2004-03-04  3:33                             ` Andrew Morton
  2004-03-04  3:44                               ` Peter Zaitsev
  0 siblings, 1 reply; 100+ messages in thread
From: Andrew Morton @ 2004-03-04  3:33 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: andrea, riel, mbligh, linux-kernel

Peter Zaitsev <peter@mysql.com> wrote:
>
> On Sat, 2004-02-28 at 17:43, Andrea Arcangeli wrote:
> 
> > > 
> > > Definitely not what we expected, but a nice surprise nonetheless.
> > 
> > this is the first time I hear something like this. Maybe you mean the
> > 4:4 was actually using more ram for the SGA? Just curious.
> 
> I actually recently did MySQL benchmarks using the DBT2 MySQL port.
> 
> The test box was a 4-way Xeon with HT, 4GB RAM, 8 SATA disks in RAID10.
> 
> I used RH AS 3.0 for the tests (2.4.21-9.ELxxx).
> 
> For Disk Bound workloads (200 Warehouse) I got 1250TPM for "hugemem" vs
> 1450TPM for "smp" kernel, which is some 14% slowdown.

Please define these terms.  What is the difference between "hugemem" and
"smp"?

> For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM,
> which is over 35% slowdown.

Well no, it is a 56% speedup.   Please clarify.  Lots.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04  3:33                             ` Andrew Morton
@ 2004-03-04  3:44                               ` Peter Zaitsev
  2004-03-04  4:07                                 ` Andrew Morton
  2004-03-05 10:33                                 ` Ingo Molnar
  0 siblings, 2 replies; 100+ messages in thread
From: Peter Zaitsev @ 2004-03-04  3:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, riel, mbligh, linux-kernel

On Wed, 2004-03-03 at 19:33, Andrew Morton wrote:



> > 
> > For Disk Bound workloads (200 Warehouse) I got 1250TPM for "hugemem" vs
> > 1450TPM for "smp" kernel, which is some 14% slowdown.
> 
> Please define these terms.  What is the difference between "hugemem" and
> "smp"?

Andrew,


Sorry if I was unclear.  These are suffixes from RH AS 3.0 kernel
naming.  "SMP" corresponds to the normal SMP kernel they have,  "hugemem"
is the kernel with the 4G/4G split.

> 
> > For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM,
> > which is over 35% slowdown.
> 
> Well no, it is a 56% speedup.   Please clarify.  Lots.

Huh. The numbers should be the other way around of course :)   The "smp"
kernel had the better performance, some 7000TPM, compared to 4500TPM with
the HugeMem kernel.

Swap was disabled in both cases.


-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
  http://www.mysql.com/uc2004/


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04  3:44                               ` Peter Zaitsev
@ 2004-03-04  4:07                                 ` Andrew Morton
  2004-03-04  4:44                                   ` Peter Zaitsev
                                                     ` (2 more replies)
  2004-03-05 10:33                                 ` Ingo Molnar
  1 sibling, 3 replies; 100+ messages in thread
From: Andrew Morton @ 2004-03-04  4:07 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: andrea, riel, mbligh, linux-kernel

Peter Zaitsev <peter@mysql.com> wrote:
>
> Sorry if I was unclear.  These are suffixes from RH AS 3.0 kernel
>  naming.  "SMP" corresponds to the normal SMP kernel they have,  "hugemem"
>  is the kernel with the 4G/4G split.
> 
>  > 
>  > > For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM,
>  > > which is over 35% slowdown.
>  > 
>  > Well no, it is a 56% speedup.   Please clarify.  Lots.
> 
>  Huh. The numbers should be the other way around of course :)   The "smp"
>  kernel had the better performance, some 7000TPM, compared to 4500TPM with
>  the HugeMem kernel.

That's a larger difference than I expected.  But then, everyone has been
mysteriously quiet with the 4g/4g benchmarking.

A kernel profile would be interesting.  As would an optimisation effort,
which, as far as I know, has never been undertaken.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04  4:07                                 ` Andrew Morton
@ 2004-03-04  4:44                                   ` Peter Zaitsev
  2004-03-04  4:52                                   ` Andrea Arcangeli
  2004-03-04 17:35                                   ` Martin J. Bligh
  2 siblings, 0 replies; 100+ messages in thread
From: Peter Zaitsev @ 2004-03-04  4:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, riel, mbligh, linux-kernel

On Wed, 2004-03-03 at 20:07, Andrew Morton wrote:

> >  Huh. The numbers should be the other way around of course :)   The "smp"
> >  kernel had the better performance, some 7000TPM, compared to 4500TPM with
> >  the HugeMem kernel.
> 
> That's a larger difference than I expected.  But then, everyone has been
> mysteriously quiet with the 4g/4g benchmarking.

Yes. It is larger than I expected as well, but the numbers are pretty
reliable.

> 
> A kernel profile would be interesting.  As would an optimisation effort,
> which, as far as I know, has never been undertaken.

Just let me know which information you would like me to gather and how
and I'll get it for you.



-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
  http://www.mysql.com/uc2004/


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04  4:07                                 ` Andrew Morton
  2004-03-04  4:44                                   ` Peter Zaitsev
@ 2004-03-04  4:52                                   ` Andrea Arcangeli
  2004-03-04  5:10                                     ` Andrew Morton
                                                       ` (2 more replies)
  2004-03-04 17:35                                   ` Martin J. Bligh
  2 siblings, 3 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-04  4:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Peter Zaitsev, riel, mbligh, linux-kernel

On Wed, Mar 03, 2004 at 08:07:04PM -0800, Andrew Morton wrote:
> That's a larger difference than I expected.  But then, everyone has been

mysql is threaded (it's not using processes that force tlb flushes at
every context switch), so the only time a tlb flush ever happens is when
a syscall or an irq or a page fault happens with 4:4. No tlb flush
would ever happen with 3:1 in the whole workload (yeah, some background
tlb flushing happens anyway when you type a char in bash or move the
mouse, of course, but it's very low frequency)

(to be fair, because it's threaded it means they also find 512m of
address space lost more problematic than the db using processes, though
besides the reduced address space there would be no measurable slowdown
with 2.5:1.5)

Also the 4:4 pretty much depends on the vgettimeofday being backported
from the x86-64 tree and a userspace that uses it, so the test may be
repeated with vgettimeofday, though it's very possible mysql isn't using
gettimeofday as much as other databases; especially the I/O bound
workload shouldn't be affected that much by gettimeofday.

another reason could be the xeon bit, all numbers I've seen were on p3,
that's why I was asking about xeon and p4 or more recent.

all random ideas, just guessing.

> mysteriously quiet with the 4g/4g benchmarking.

indeed.

> A kernel profile would be interesting.  As would an optimisation effort,
> which, as far as I know, has never been undertaken.

yes, though I doubt you'll find anything interesting in the kernel, the
slowdown should happen because the userspace runs slower, it's like
underclocking the cpu; it's not a bottleneck in the kernel that can be
optimized (at least unless there are bugs in the patch, which I don't
think is the case).

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04  4:52                                   ` Andrea Arcangeli
@ 2004-03-04  5:10                                     ` Andrew Morton
  2004-03-04  5:27                                       ` Andrea Arcangeli
  2004-03-05 20:19                                       ` Jamie Lokier
  2004-03-04 12:12                                     ` Rik van Riel
  2004-03-04 16:21                                     ` Peter Zaitsev
  2 siblings, 2 replies; 100+ messages in thread
From: Andrew Morton @ 2004-03-04  5:10 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: peter, riel, mbligh, linux-kernel

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Wed, Mar 03, 2004 at 08:07:04PM -0800, Andrew Morton wrote:
>  > That's a larger difference than I expected.  But then, everyone has been
> 
>  mysql is threaded

There is a patch in -mm's 4g/4g implementation
(4g4g-locked-userspace-copy.patch) which causes all kernel<->userspace
copies to happen under page_table_lock.  In some threaded apps on SMP this
is likely to cause utterly foul performance.

That's why I'm keeping it as a separate patch.  The problem which it fixes
is very obscure indeed and I suspect most implementors will simply drop it
after they've had a two-second peek at the profile results.

hm, I note that the changelog in that patch is junk.  I'll fix that up.

Something like:

  The current 4g/4g implementation does not guarantee the atomicity of
  mprotect() on SMP machines.  If one CPU is in the middle of a read() into
  a user memory region and another CPU is in the middle of an
  mprotect(!PROT_READ) of that region, it is possible for a race to occur
  which will result in that read successfully completing _after_ the other
  CPU's mprotect() call has returned.

  We believe that this could cause misbehaviour of such things as the
  boehm garbage collector.  This patch provides the mprotect() atomicity by
  performing all userspace copies under page_table_lock.
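
A minimal user-space illustration of the window being described (the file name
is made up; the question is whether the kernel's copy into the buffer may
still be in flight after mprotect() has returned):

/* Two threads racing a read() against an mprotect() of the same buffer.
 * With the locked-usercopy patch the copy into 'buf' and the protection
 * change are serialized; without it the copy can complete after
 * mprotect() has already returned. */
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

static char *buf;

static void *reader(void *arg)
{
        int fd = open("/tmp/somefile", O_RDONLY);       /* any readable file */

        read(fd, buf, 4096);            /* kernel copies into buf */
        close(fd);
        return NULL;
}

static void *protector(void *arg)
{
        mprotect(buf, 4096, PROT_NONE); /* races with the usercopy */
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        pthread_create(&t1, NULL, reader, NULL);
        pthread_create(&t2, NULL, protector, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}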


It is a judgement call.  Personally, I wouldn't ship a production kernel
with this patch.  People need to be aware of the tradeoff and to think and
test very carefully.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04  5:10                                     ` Andrew Morton
@ 2004-03-04  5:27                                       ` Andrea Arcangeli
  2004-03-04  5:38                                         ` Andrew Morton
  2004-03-05 20:19                                       ` Jamie Lokier
  1 sibling, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-04  5:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: peter, riel, mbligh, linux-kernel

On Wed, Mar 03, 2004 at 09:10:42PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > On Wed, Mar 03, 2004 at 08:07:04PM -0800, Andrew Morton wrote:
> >  > That's a larger difference than I expected.  But then, everyone has been
> > 
> >  mysql is threaded
> 
> There is a patch in -mm's 4g/4g implementation
> (4g4g-locked-userspace-copy.patch) which causes all kernel<->userspace
> copies to happen under page_table_lock.  In some threaded apps on SMP this
> is likely to cause utterly foul performance.

I see, I wasn't aware of this issue with the copy-user code, thanks
for the info. I definitely agree having a profile of the run would be
nice, since maybe part of the overhead is due to this lock (though I
doubt it's most of the overhead), so we can see if that spinlock was
generating part of the slowdown.

> That's why I'm keeping it as a separate patch.  The problem which it fixes
> is very obscure indeed and I suspect most implementors will simply drop it
> after they've had a two-second peek at the profile results.

I doubt one can ship without it without feeling a bit like cheating; the
garbage collectors sometimes depend on mprotect to generate protection
faults, so it's not like nothing is using mprotect in racy ways against
other threads.

> It is a judgement call.  Personally, I wouldn't ship a production kernel
> with this patch.  People need to be aware of the tradeoff and to think and
> test very carefully.

test what? there's no way to know what sort of proprietary software
people will run on the thing.

Personally I wouldn't feel safe shipping a kernel with a known race
condition add-on. I mean, if you don't know about it and it's an
implementation bug, you know nobody is perfect and you try to fix it if
it happens, but if you know about it and you don't apply it, that's
pretty bad if something goes wrong.  Especially because it's a race,
even if you test it, it may still happen only a long time later during
production. I would never trade safety for performance; if anything I'd
try to find a more complex way to serialize against the vmas or similar.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04  5:27                                       ` Andrea Arcangeli
@ 2004-03-04  5:38                                         ` Andrew Morton
  0 siblings, 0 replies; 100+ messages in thread
From: Andrew Morton @ 2004-03-04  5:38 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: peter, riel, mbligh, linux-kernel

Andrea Arcangeli <andrea@suse.de> wrote:
>
>  > It is a judgement call.  Personally, I wouldn't ship a production kernel
>  > with this patch.  People need to be aware of the tradeoff and to think and
>  > test very carefully.
> 
>  test what? there's no way to know what sort of proprietary software
>  people will run on the thing.

In the vast majority of cases the application was already racy.  It took
davem a very long time to convince me that this was really a bug ;)

>  Personally I wouldn't feel safe shipping a kernel with a known race
>  condition add-on. I mean, if you don't know about it and it's an
>  implementation bug, you know nobody is perfect and you try to fix it if
>  it happens, but if you know about it and you don't apply it, that's
>  pretty bad if something goes wrong.  Especially because it's a race,
>  even if you test it, it may still happen only a long time later during
>  production. I would never trade safety for performance; if anything I'd
>  try to find a more complex way to serialize against the vmas or similar.

Well first people need to understand the problem and convince themselves
that this really is a bug.  And yes, there are surely other ways of fixing
it up.  One might be to put some sequence counter in the mm_struct and
rerun the mprotect if it detects that someone else snuck in with a
usercopy.  Or add an rwsem to the mm_struct, take it for writing in
mprotect.
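
A rough sketch of the sequence-counter alternative (all names are invented for
illustration; nothing like this exists in the patch being discussed):

/* Hypothetical retry scheme: usercopies bump a counter in the mm, and
 * mprotect redoes its page table update if a copy snuck in meanwhile. */
static inline void note_usercopy(struct mm_struct *mm)
{
        atomic_inc(&mm->usercopy_seq);          /* assumed new field */
}

static void mprotect_with_retry(struct mm_struct *mm, unsigned long start,
                                unsigned long len, unsigned long prot)
{
        int seq;

        do {
                seq = atomic_read(&mm->usercopy_seq);
                change_protection_range(mm, start, len, prot);  /* assumed helper */
        } while (seq != atomic_read(&mm->usercopy_seq));
}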


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04  4:52                                   ` Andrea Arcangeli
  2004-03-04  5:10                                     ` Andrew Morton
@ 2004-03-04 12:12                                     ` Rik van Riel
  2004-03-04 16:21                                     ` Peter Zaitsev
  2 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2004-03-04 12:12 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Peter Zaitsev, mbligh, linux-kernel

On Thu, 4 Mar 2004, Andrea Arcangeli wrote:
> On Wed, Mar 03, 2004 at 08:07:04PM -0800, Andrew Morton wrote:

> > A kernel profile would be interesting.  As would an optimisation effort,
> > which, as far as I know, has never been undertaken.
> 
> yes, though I doubt you'll find anything interesting in the kernel,

Oh, but there is a big bottleneck left, at least in RHEL3.

All the CPUs use the _same_ mm_struct in kernel space, so
all VM operations inside the kernel are effectively single 
threaded.

Ingo had a patch to fix that, but it wasn't ready in time.
Maybe it is in the 2.6 patch set, maybe not ...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04  4:52                                   ` Andrea Arcangeli
  2004-03-04  5:10                                     ` Andrew Morton
  2004-03-04 12:12                                     ` Rik van Riel
@ 2004-03-04 16:21                                     ` Peter Zaitsev
  2004-03-04 18:13                                       ` Andrea Arcangeli
  2 siblings, 1 reply; 100+ messages in thread
From: Peter Zaitsev @ 2004-03-04 16:21 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, riel, mbligh, linux-kernel

On Wed, 2004-03-03 at 20:52, Andrea Arcangeli wrote:

Andrea,

> mysql is threaded (it's not using processes that force tlb flushes at
> every context switch), so the only time a tlb flush ever happens is when
> a syscall or an irq or a page fault happens with 4:4. No tlb flush
> would ever happen with 3:1 in the whole workload (yeah, some background
> tlb flushing happens anyway when you type a char in bash or move the
> mouse, of course, but it's very low frequency)

Don't we also get TLB flushes due to latching, or are pthread_mutex_lock
etc. implemented without one nowadays?

> 
> (to be fair, because it's threaded it means they also find 512m of
> address space lost more problematic than the db using processes, though
> besides the reduced address space there would be no measurable slowdown
> with 2.5:1.5)

Hm. What 512MB of address space loss are you speaking of here? Are threaded
programs only able to use 2.5G with the 3G/1G memory split?


> 
> Also the 4:4 pretty much depends on the vgettimeofday being backported
> from the x86-64 tree and a userspace that uses it, so the test may be
> repeated with vgettimeofday, though it's very possible mysql isn't using
> gettimeofday as much as other databases; especially the I/O bound
> workload shouldn't be affected that much by gettimeofday.

You're right.  MySQL does not use gettimeofday very frequently now,
actually it uses time() most of the time, as some platforms used to have
huge performance problems with gettimeofday() in the past.

The amount of gettimeofday() use will increase dramatically in the
future so it is good to know about this matter.


-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
  http://www.mysql.com/uc2004/


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04  4:07                                 ` Andrew Morton
  2004-03-04  4:44                                   ` Peter Zaitsev
  2004-03-04  4:52                                   ` Andrea Arcangeli
@ 2004-03-04 17:35                                   ` Martin J. Bligh
  2004-03-04 18:16                                     ` Andrea Arcangeli
  2004-03-04 20:21                                     ` Peter Zaitsev
  2 siblings, 2 replies; 100+ messages in thread
From: Martin J. Bligh @ 2004-03-04 17:35 UTC (permalink / raw)
  To: Andrew Morton, Peter Zaitsev; +Cc: andrea, riel, linux-kernel

> Peter Zaitsev <peter@mysql.com> wrote:
>> 
>> Sorry if I was unclear.  These are suffixes from RH AS 3.0 kernel
>>  naming.  "SMP" corresponds to the normal SMP kernel they have,  "hugemem"
>>  is the kernel with the 4G/4G split.
>> 
>>  > 
>>  > > For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM,
>>  > > which is over 35% slowdown.
>>  > 
>>  > Well no, it is a 56% speedup.   Please clarify.  Lots.
>> 
>>  Huh. The numbers should be the other way around of course :)   The "smp"
>>  kernel had the better performance, some 7000TPM, compared to 4500TPM with
>>  the HugeMem kernel.
> 
> That's a larger difference than I expected.  But then, everyone has been
> mysteriously quiet with the 4g/4g benchmarking.
> 
> A kernel profile would be interesting.  As would an optimisation effort,
> which, as far as I know, has never been undertaken.

In particular:

1. a diffprofile between the two would be interesting (assuming it's
at least partly an increase in kernel time), or any other way to see exactly
why it's slower (well, TLB flushes, obviously, but what's causing them).

2. If it's gettimeofday hammering it (which it probably is, from previous
comments by others, and my own experience), then vsyscall gettimeofday
(John's patch) may well fix it up.

3. Are you using the extra user address space? Otherwise yes, it'll be 
all downside. And 4/4 vs 3/1 isn't really a fair comparison ... 4/4 is
designed for bigboxen, so 4/4 vs 2/2 would be better, IMHO. People have
said before that DB performance can increase linearly with shared area
sizes (for some workloads), so that'd bring you a 100% or so increase
in performance for 4/4 to counter the loss.

M.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04 16:21                                     ` Peter Zaitsev
@ 2004-03-04 18:13                                       ` Andrea Arcangeli
  0 siblings, 0 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-04 18:13 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: Andrew Morton, riel, mbligh, linux-kernel

On Thu, Mar 04, 2004 at 08:21:26AM -0800, Peter Zaitsev wrote:
> On Wed, 2004-03-03 at 20:52, Andrea Arcangeli wrote:
> 
> Andrea,
> 
> > mysql is threaded (it's not using processes that force tlb flushes at
> > every context switch), so the only time a tlb flush ever happens is when
> > a syscall or an irq or a page fault happens with 4:4. No tlb flush
> > would ever happen with 3:1 in the whole workload (yeah, some background
> > tlb flushing happens anyway when you type a char in bash or move the
> > mouse, of course, but it's very low frequency)
> 
> Don't we also get TLB flushes due to latching, or are pthread_mutex_lock
> etc. implemented without one nowadays?

pthread mutexes use futexes in nptl and ngpt, or sched_yield in
linuxthreads; either way they don't need to flush the tlb. The address
space is the same, so there's no need to change address space for the
mutex (otherwise mutexes would be very detrimental too). Kernel threads
as well don't require a tlb flush.
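
A very simplified sketch of why that is (this is nothing like the real NPTL
code, just the bare futex idea):

/* Bare-bones futex lock: the uncontended lock path is one atomic on a
 * word in the shared address space; waiting and waking go through the
 * futex syscall, but nothing ever switches mm or flushes the TLB.
 * (A real implementation also skips the wake when there are no waiters.) */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_lock(int *w)          /* *w: 0 = free, 1 = locked */
{
        while (__sync_lock_test_and_set(w, 1))
                syscall(SYS_futex, w, FUTEX_WAIT, 1, NULL, NULL, 0);
}

static void futex_unlock(int *w)
{
        __sync_lock_release(w);
        syscall(SYS_futex, w, FUTEX_WAKE, 1, NULL, NULL, 0);
}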

> > (to be fair, because it's threaded it means they also find 512m of
> > address space lost more problematic than the db using processes, though
> > besides the reduced address space there would be no measurable slowdown
> > with 2.5:1.5)
> 
> Hm. What 512MB of address space loss are you speaking of here? Are threaded
> programs only able to use 2.5G with the 3G/1G memory split?

I was talking about the 2.5:1.5 split here: 3:1 gives you 3G of address
space (both for threads and processes), while 2.5:1.5 would give you only
2.5G of address space to use instead (with a loss of 512m that is being
used by the kernel to properly handle a 64G box).

> > Also the 4:4 pretty much depends on the vgettimeofday being backported
> > from the x86-64 tree and a userspace that uses it, so the test may be
> > repeated with vgettimeofday, though it's very possible mysql isn't using
> > gettimeofday as much as other databases; especially the I/O bound
> > workload shouldn't be affected that much by gettimeofday.
> 
> You're right.  MySQL does not use gettimeofday very frequently now,
> actually it uses time() most of the time, as some platforms used to have
> huge performance problems with gettimeofday() in the past.
> 
> The amount of gettimeofday() use will increase dramatically in the
> future so it is good to know about this matter.

If you noticed, Martin mentioned a >30% figure due to gettimeofday being
called frequently (w/o vsyscalls implementing vgettimeofday like on
x86-64). That figure certainly won't add to your current number
linearly, but you can expect a significant further loss by calling
gettimeofday dramatically more frequently.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04 17:35                                   ` Martin J. Bligh
@ 2004-03-04 18:16                                     ` Andrea Arcangeli
  2004-03-04 19:31                                       ` Martin J. Bligh
  2004-03-04 20:21                                     ` Peter Zaitsev
  1 sibling, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-04 18:16 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrew Morton, Peter Zaitsev, riel, linux-kernel

On Thu, Mar 04, 2004 at 09:35:13AM -0800, Martin J. Bligh wrote:
> designed for bigboxen, so 4/4 vs 2/2 would be better, IMHO. People have
> said before that DB performance can increase linearly with shared area
> sizes (for some workloads), so that'd bring you a 100% or so increase
> in performance for 4/4 to counter the loss.

that's a nice theory with the benchmarks that run with a 64G working
set, but if your working set is smaller than 32G 99% of the time and
you install the 64G to handle the peak load happening 1% of the time
faster, you'll run 30% slower 99% of the time, even if the benchmark
that only stresses the 64G working set runs a lot faster than with 32G only.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04 18:16                                     ` Andrea Arcangeli
@ 2004-03-04 19:31                                       ` Martin J. Bligh
  0 siblings, 0 replies; 100+ messages in thread
From: Martin J. Bligh @ 2004-03-04 19:31 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Peter Zaitsev, riel, linux-kernel

> On Thu, Mar 04, 2004 at 09:35:13AM -0800, Martin J. Bligh wrote:
>> designed for bigboxen, so 4/4 vs 2/2 would be better, IMHO. People have
>> said before that DB performance can increase linearly with shared area
>> sizes (for some workloads), so that'd bring you a 100% or so increase
>> in performance for 4/4 to counter the loss.
> 
> that's a nice theory with the benchmarks that run with a 64G working
> set, but if your working set is smaller than 32G 99% of the time and
> you install the 64G to handle the peak load happening 1% of the time
> faster, you'll run 30% slower 99% of the time, even if the benchmark
> that only stresses the 64G working set runs a lot faster than with 32G only.

The amount of ram in the system, and the amount consumed by mem_map can,
I think, be taken as static for the purposes of this argument. So I don't
see why the total working set of the machine matters.

What does matter is the per-process user address space set - if the same
argument applied to that (ie most of the time, processes only use 1GB
of shmem each), then I'd agree with you. I don't know whether that's
true or not though ... I'll let the DB people argue that one out. 

Much though people hate benchmarks, it's also important to be able to
prove that Linux can run as fast as RandomOtherOS in order to ensure
total world domination for Linux ;-) So it would be nice to ensure the
benchmarks at least have an option to be able to run as fast as possible.

M.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04 17:35                                   ` Martin J. Bligh
  2004-03-04 18:16                                     ` Andrea Arcangeli
@ 2004-03-04 20:21                                     ` Peter Zaitsev
  1 sibling, 0 replies; 100+ messages in thread
From: Peter Zaitsev @ 2004-03-04 20:21 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrew Morton, andrea, riel, linux-kernel

On Thu, 2004-03-04 at 09:35, Martin J. Bligh wrote:

> 
> 2. If it's gettimeofday hammering it (which it probably is, from previous
> comments by others, and my own experience), then vsyscall gettimeofday
> (John's patch) may well fix it up.

Well, as I wrote, MySQL does not use a lot of gettimeofday.   It rather
has 2-3 calls to time() per query, but that is a very small number compared
to other syscalls.

> 
> 3. Are you using the extra user address space? Otherwise yes, it'll be 
> all downside. And 4/4 vs 3/1 isn't really a fair comparison ... 4/4 is
> designed for bigboxen, so 4/4 vs 2/2 would be better, IMHO. People have
> said before that DB performance can increase linearly with shared area
> sizes (for some workloads), so that'd bring you a 100% or so increase
> in performance for 4/4 to counter the loss.

I do not really understand this :)

I know 4/4 was designed for big boxes; however, we're more interested in
the side effect we get - having 4G per user process instead of 3G with
the 3G/1G split. As MySQL is designed as a single process, this is what
is rather important for us. 

I was not using the extra address space in this test, as the idea was to
see how much slowdown the 4G/4G split gives you with everything else
being the same. 

Based on other benchmarks I know the extra performance an extra 1GB used
as buffers can give. 

Bringing these numbers together, I would conclude that 4G/4G does not
make sense for most MySQL loads, as 1GB used for internal buffers (vs
1GB used for file cache) will not give a high enough performance gain to
cover such a major speed loss. 

There are exceptions of course, for example the case where your full
workload fits in a 3G cache but does not fit in 2G (a very edge case),
or the case where you need 4G just to manage 10000+ connections with
reasonable buffers etc., which is also far from the most typical scenario.

For "big boxes" I just would not advise a 32-bit configuration at
all - happily, nowadays you can get 64-bit pretty cheap.





-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
  http://www.mysql.com/uc2004/


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04  3:44                               ` Peter Zaitsev
  2004-03-04  4:07                                 ` Andrew Morton
@ 2004-03-05 10:33                                 ` Ingo Molnar
  2004-03-05 14:15                                   ` Andrea Arcangeli
  1 sibling, 1 reply; 100+ messages in thread
From: Ingo Molnar @ 2004-03-05 10:33 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: Andrew Morton, andrea, riel, mbligh, linux-kernel


* Peter Zaitsev <peter@mysql.com> wrote:

> > > For Disk Bound workloads (200 Warehouse) I got 1250TPM for "hugemem" vs
> > > 1450TPM for "smp" kernel, which is some 14% slowdown.
> > 
> > Please define these terms.  What is the difference between "hugemem" and
> > "smp"?
> 
> Sorry if I was unclear.  These are suffixes from RH AS 3.0 kernel
> naming.  "SMP" corresponds to the normal SMP kernel they have, "hugemem"
> is the kernel with the 4G/4G split.

the 'hugemem' kernel also has config_highpte defined which is a bit
redundant - that complexity one could avoid with the 4/4 split. Another
detail: the hugemem kernel also enables PAE, which adds another 2 usecs
to every syscall (!). So these performance numbers only hold if you are
running mysql on x86 using more than 4GB of RAM. (which, given mysql's
threaded design, doesn't make all that much sense.)

But no doubt, the 4/4 split is not for free. If a workload does lots of
high-frequency system calls then the cost can be pretty high.

vsyscall-sys_gettimeofday and vsyscall-sys_time could help quite some
for mysql. Also, the highly threaded nature of mysql on the same MM
which is pretty much the worst-case for the 4/4 design. If it's an
issue, there are multiple ways to mitigate this cost.

but 4/4 is mostly a life-extender for the high end of the x86 platform -
which is dying fast. If I were to decide between some of the highly
intrusive architectural highmem solutions (which all revolve around the
concept of dynamically mapping back and forth) and the simplicity of
4/4, I'd go for 4/4 unless forced otherwise.

	Ingo

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 10:33                                 ` Ingo Molnar
@ 2004-03-05 14:15                                   ` Andrea Arcangeli
  2004-03-05 14:32                                     ` Ingo Molnar
  2004-03-05 14:34                                     ` Ingo Molnar
  0 siblings, 2 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-05 14:15 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel

On Fri, Mar 05, 2004 at 11:33:08AM +0100, Ingo Molnar wrote:
> 
> * Peter Zaitsev <peter@mysql.com> wrote:
> 
> > > > For Disk Bound workloads (200 Warehouse) I got 1250TPM for "hugemem" vs
> > > > 1450TPM for "smp" kernel, which is some 14% slowdown.
> > > 
> > > Please define these terms.  What is the difference between "hugemem" and
> > > "smp"?
> > 
> > Sorry if I was unclear.  These are suffixes from RH AS 3.0 kernel
> > naming.  "SMP" corresponds to the normal SMP kernel they have, "hugemem"
> > is the kernel with the 4G/4G split.
> 
> the 'hugemem' kernel also has config_highpte defined which is a bit
> redundant - that complexity one could avoid with the 4/4 split. Another

the machine only has 4G of ram and you have a huge zone-normal, so I
guess it will offset it by not more than a percentage point or so.

> detail: the hugemem kernel also enables PAE, which adds another 2 usecs
> to every syscall (!). So these performance numbers only hold if you are
> running mysql on x86 using more than 4GB of RAM. (which, given mysql's
> threaded design, doesn't make all that much sense.)

are you saying you force _all_ people with >4G of ram to use 4:4?!?
that would be way, way overkill. 8/16/32G boxes work perfectly with 3:1
with the stock 2.4 VM (after you nuke rmap).

> vsyscall-sys_gettimeofday and vsyscall-sys_time could help quite some
> for mysql. Also, the highly threaded nature of mysql on the same MM

he said he doesn't use gettimeofday frequently, so most of the flushes
are from other syscalls.

> which is pretty much the worst-case for the 4/4 design. If it's an

definitely agreed.

> issue, there are multiple ways to mitigate this cost.

how? just curious.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 14:15                                   ` Andrea Arcangeli
@ 2004-03-05 14:32                                     ` Ingo Molnar
  2004-03-05 14:58                                       ` Andrea Arcangeli
  2004-03-05 14:34                                     ` Ingo Molnar
  1 sibling, 1 reply; 100+ messages in thread
From: Ingo Molnar @ 2004-03-05 14:32 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel


* Andrea Arcangeli <andrea@suse.de> wrote:

> [...] 8/16/32G boxes work perfectly with 3:1 with the stock 2.4 VM
> (after you nuke rmap).

the mem_map[] on 32G is 400 MB (using the stock 2.4 struct page). This
leaves ~500 MB for the lowmem zone. It's ridiculously easy to use up 500
MB of lowmem. 500 MB is a lowmem:RAM ratio of 1:60. With 4/4 you have 6
times more lowmem. So starting at 32 GB (but often much earlier) the 3/1
split breaks down. And obviously it's a no-go at 64 GB.
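
For reference, the arithmetic behind those figures, assuming 4K pages and
roughly 50 bytes per stock 2.4 struct page (the exact size depends on the
config):

#include <stdio.h>

int main(void)
{
        unsigned long long ram       = 32ULL << 30;     /* 32 GB of RAM       */
        unsigned long long page_size = 4096;            /* 4K pages           */
        unsigned long long page_desc = 50;              /* ~bytes/struct page */
        unsigned long long pages     = ram / page_size; /* 8M struct pages    */
        unsigned long long mem_map   = pages * page_desc;

        /* ~400 MB of mem_map[] out of the ~896 MB lowmem zone with 3:1 */
        printf("mem_map: %llu MB\n", mem_map >> 20);
        return 0;
}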

inbetween it all depends on the workload. If the 3:1 split works fine
then sure, use it. There's no one kernel that fits all sizes.

	Ingo

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 14:15                                   ` Andrea Arcangeli
  2004-03-05 14:32                                     ` Ingo Molnar
@ 2004-03-05 14:34                                     ` Ingo Molnar
  2004-03-05 14:59                                       ` Andrea Arcangeli
  1 sibling, 1 reply; 100+ messages in thread
From: Ingo Molnar @ 2004-03-05 14:34 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel


* Andrea Arcangeli <andrea@suse.de> wrote:

> > vsyscall-sys_gettimeofday and vsyscall-sys_time could help quite some
                                  ^^^^^^^^^^^^^^^^^
> > for mysql. Also, the highly threaded nature of mysql on the same MM
> 
> he said he doesn't use gettimeofday frequently, so most of the flushes
> are from other syscalls.

you are not reading Pete's and my emails too carefully, are you? Pete
said:

> [...] MySQL does not use gettimeofday very frequently now, actually it
> uses time() most of the time, as some platforms used to have huge
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> performance problems with gettimeofday() in the past.
>
> The amount of gettimeofday() use will increase dramatically in the
> future so it is good to know about this matter.

	Ingo

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 14:32                                     ` Ingo Molnar
@ 2004-03-05 14:58                                       ` Andrea Arcangeli
  2004-03-05 15:26                                         ` Ingo Molnar
  2004-03-05 18:42                                         ` Martin J. Bligh
  0 siblings, 2 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-05 14:58 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel

On Fri, Mar 05, 2004 at 03:32:10PM +0100, Ingo Molnar wrote:
> 
> * Andrea Arcangeli <andrea@suse.de> wrote:
> 
> > [...] 8/16/32G boxes work perfectly with 3:1 with the stock 2.4 VM
> > (after you nuke rmap).
> 
> the mem_map[] on 32G is 400 MB (using the stock 2.4 struct page). This
> leaves ~500 MB for the lowmem zone. It's ridiculously easy to use up 500

yes, mem_map_t takes 384M, which leaves us 879-384 = 495Mbyte of
zone-normal.

> MB of lowmem. 500 MB is a lowmem:RAM ratio of 1:60. With 4/4 you have 6
> times more lowmem. So starting at 32 GB (but often much earlier) the 3/1
> split breaks down. And obviously it's a no-go at 64 GB.

It's a nogo for 64G but I would be really pleased to see a workload
triggering the zone-normal shortage in 32G, I've never seen one. And
16G has even more margin.

Note that on a 32G box with my google-logic a correct kernel like latest
2.4 mainline reserves 100% of the zone-normal for allocations that cannot
go in highmem, plus the vm highmem fixes like the bh and inode zone-normal
related reclaims. Without that logic it would be easy to run oom due to
highmem allocations going into zone-normal, but that's just a vm issue
and it's fixed (all fixes should be in mainline already).

> inbetween it all depends on the workload. If the 3:1 split works fine
> then sure, use it. There's no one kernel that fits all sizes.

yes, the in-between sizes definitely work fine, but there's always plenty of
margin even on the 32G boxes in all heavy workloads I've seen. I haven't got
a single pending report for 32G boxes; all the bug reports start at >=48G,
and that tells you those 32G users had 198M of margin free to use for
the peak loads, which is more than enough in practice. I agree it's not
a huge margin, but it's quite reasonable considering they have only 60-70%
of the zone-normal pinned during the workload.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 14:34                                     ` Ingo Molnar
@ 2004-03-05 14:59                                       ` Andrea Arcangeli
  2004-03-05 15:02                                         ` Ingo Molnar
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-05 14:59 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel

On Fri, Mar 05, 2004 at 03:34:25PM +0100, Ingo Molnar wrote:
> 
> * Andrea Arcangeli <andrea@suse.de> wrote:
> 
> > > vsyscall-sys_gettimeofday and vsyscall-sys_time could help quite some
>                                   ^^^^^^^^^^^^^^^^^
> > > for mysql. Also, the highly threaded nature of mysql on the same MM
> > 
> > he said he doesn't use gettimeofday frequently, so most of the flushes
> > are from other syscalls.
> 
> you are not reading Pete's and my emails too carefully, are you? Pete
> said:

I thought time() wouldn't be called more than once per second anyway; why
would anyone call time() more than once per second?

> 
> > [...] MySQL does not use gettimeofday very frequently now, actually it
> > uses time() most of the time, as some platforms used to have huge
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > performance problems with gettimeofday() in the past.
> >
> > The amount of gettimeofday() use will increase dramatically in the
> > future so it is good to know about this matter.
> 
> 	Ingo

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 14:59                                       ` Andrea Arcangeli
@ 2004-03-05 15:02                                         ` Ingo Molnar
       [not found]                                           ` <20040305150225.GA13237@elte.hu.suse.lists.linux.kernel>
                                                             ` (2 more replies)
  0 siblings, 3 replies; 100+ messages in thread
From: Ingo Molnar @ 2004-03-05 15:02 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel


* Andrea Arcangeli <andrea@suse.de> wrote:

> I thought time() wouldn't be called more than once per second anyway;
> why would anyone call time() more than once per second?

if mysql in fact calls time() frequently, then it should rather start a
worker thread that updates a global time variable every second.
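
A minimal user-space sketch of that scheme (illustrative only, not MySQL code):

#include <pthread.h>
#include <time.h>
#include <unistd.h>

static volatile time_t cached_time;     /* written by one thread, read by all */

static void *time_updater(void *arg)
{
        for (;;) {
                cached_time = time(NULL);       /* one syscall per second */
                sleep(1);
        }
        return NULL;
}

static void start_time_updater(void)
{
        pthread_t tid;

        pthread_create(&tid, NULL, time_updater, NULL);
}

/* hot paths then read cached_time instead of calling time() per query */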

	Ingo

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 14:58                                       ` Andrea Arcangeli
@ 2004-03-05 15:26                                         ` Ingo Molnar
  2004-03-05 15:53                                           ` Andrea Arcangeli
  2004-03-05 21:28                                           ` Martin J. Bligh
  2004-03-05 18:42                                         ` Martin J. Bligh
  1 sibling, 2 replies; 100+ messages in thread
From: Ingo Molnar @ 2004-03-05 15:26 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel


* Andrea Arcangeli <andrea@suse.de> wrote:

> It's a nogo for 64G but I would be really pleased to see a workload
> triggering the zone-normal shortage in 32G, I've never seen one. 
> [...]

have you tried TPC-C/TPC-H?

	Ingo

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
       [not found]                                           ` <20040305150225.GA13237@elte.hu.suse.lists.linux.kernel>
@ 2004-03-05 15:51                                             ` Andi Kleen
  2004-03-05 16:23                                               ` Ingo Molnar
  0 siblings, 1 reply; 100+ messages in thread
From: Andi Kleen @ 2004-03-05 15:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrea Arcangeli, Peter Zaitsev, Andrew Morton, riel, mbligh,
	linux-kernel

Ingo Molnar <mingo@elte.hu> writes:

> * Andrea Arcangeli <andrea@suse.de> wrote:
> 
> > I thought time() wouldn't be called more than once per second anyway;
> > why would anyone call time() more than once per second?
> 
> if mysql in fact calls time() frequently, then it should rather start a
> worker thread that updates a global time variable every second.

I just fixed the x86-64 vsyscall vtime() to only read the user mapped
__xtime.tv_sec.  This should be equivalent. Only drawback is that if a
timer tick is delayed for too long it won't fix that, but I guess
that's reasonable for a 1s resolution.
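
The idea, roughly (a sketch only; the real x86-64 vsyscall page is more
involved and the declaration of __xtime here is approximate):

#include <time.h>

/* vtime() never enters the kernel: it just loads the seconds field of
 * the kernel's xtime, which is mapped read-only into every process and
 * updated by the timer tick. */
extern struct timespec __xtime;         /* name from the mail, type approximate */

time_t vtime(time_t *t)
{
        time_t secs = __xtime.tv_sec;

        if (t)
                *t = secs;
        return secs;
}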

-Andi
 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 15:26                                         ` Ingo Molnar
@ 2004-03-05 15:53                                           ` Andrea Arcangeli
  2004-03-07  8:41                                             ` Ingo Molnar
  2004-03-05 21:28                                           ` Martin J. Bligh
  1 sibling, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-05 15:53 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel

On Fri, Mar 05, 2004 at 04:26:22PM +0100, Ingo Molnar wrote:
> have you tried TPC-C/TPC-H?

not sure, I'm not the one dealing with the testing, but most relevant
data is public on the official websites. The limit reached is around 5k
users with 8 cpus and 32G, and I don't recall that limit being zone-normal
bound.  With 2.6 and bio and remap_file_pages we may reduce the
zone-normal usage as well (after dropping rmap).

But I definitely agree going past that with 3:1 is not feasible.

Overall we may argue about the 32G case (especially since a 32-way would
be more problematic due to the 4 times higher per-cpu memory reservation
in zone-normal, I mean 48M of zone-normal are just wasted in the page
allocator per-cpu logic, without counting the other per-cpu stuff; all of
it would be easily fixable by limiting the per-cpu sizes, though for 2.4
it probably isn't worth it), but I'm quite comfortable saying that up to
16G (included) 4:4 is worthless unless you have to deal with the rmap
waste IMHO. And <= 16G probably counts for 99% of the machines out there,
which are handled optimally by 3:1.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 15:51                                             ` Andi Kleen
@ 2004-03-05 16:23                                               ` Ingo Molnar
  2004-03-05 16:39                                                 ` Andrea Arcangeli
  2004-03-10 13:21                                                 ` Andi Kleen
  0 siblings, 2 replies; 100+ messages in thread
From: Ingo Molnar @ 2004-03-05 16:23 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrea Arcangeli, Peter Zaitsev, Andrew Morton, riel, mbligh,
	linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> > if mysql in fact calls time() frequently, then it should rather start a
> > worker thread that updates a global time variable every second.
> 
> I just fixed the x86-64 vsyscall vtime() to only read the user mapped
> __xtime.tv_sec.  This should be equivalent. [...]

yeah - nice!

> [...] Only drawback is that if a timer tick is delayed for too long it
> won't fix that, but I guess that's reasonable for a 1s resolution.

what do you mean by delayed?

	Ingo

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 16:23                                               ` Ingo Molnar
@ 2004-03-05 16:39                                                 ` Andrea Arcangeli
  2004-03-07  8:16                                                   ` Ingo Molnar
  2004-03-10 13:21                                                 ` Andi Kleen
  1 sibling, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-05 16:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel

On Fri, Mar 05, 2004 at 05:23:19PM +0100, Ingo Molnar wrote:
> what do you mean by delayed?

if the timer softirq doesn't run and wall_jiffies doesn't increase, we
won't be able to account for it, so time() will return a time in the
past; it will potentially go backwards by precisely 1/HZ seconds for every
tick that isn't executing the timer softirq. I tend to agree that for a
1 sec resolution that's not a big deal. Though, if you run:

	gettimeofday()
	time()

gettimeofday may say the time of day is 17:39:10 while time may tell you
17:39:09
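
A tiny userspace check (illustrative, not from any project) showing the
kind of skew being described when the two calls are made back to back:

	#include <stdio.h>
	#include <time.h>
	#include <sys/time.h>

	int main(void)
	{
		struct timeval tv;
		time_t t;

		gettimeofday(&tv, NULL);
		t = time(NULL);			/* called after gettimeofday() */
		if (t < tv.tv_sec)
			printf("time() is %ld second(s) behind gettimeofday()\n",
			       (long)(tv.tv_sec - t));
		return 0;
	}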

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-10 13:21                                                 ` Andi Kleen
@ 2004-03-05 16:42                                                   ` Andrea Arcangeli
  2004-03-05 16:49                                                   ` Ingo Molnar
  1 sibling, 0 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-05 16:42 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ingo Molnar, peter, akpm, riel, mbligh, linux-kernel

On Wed, Mar 10, 2004 at 02:21:25PM +0100, Andi Kleen wrote:
> On Fri, 5 Mar 2004 17:23:19 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > [...] Only drawback is that if a timer tick is delayed for too long it
> > > won't fix that, but I guess that's reasonable for a 1s resolution.
> > 
> > what do you mean by delayed?
> 
> Normal gettimeofday can "fix" lost timer ticks because it computes the true
> offset to the last timer interrupt using the TSC or other means. xtime
> is always the last tick without any correction. If it got delayed too much 
> the result will be out of date.

lost timer ticks don't worry me that much, they mess up the system
time persistently anyway with 2.4 (and not all platforms use the tsc
anyway, even on x86); it's only the lost softirqs that concern me.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-10 13:21                                                 ` Andi Kleen
  2004-03-05 16:42                                                   ` Andrea Arcangeli
@ 2004-03-05 16:49                                                   ` Ingo Molnar
  2004-03-05 16:58                                                     ` Andrea Arcangeli
  1 sibling, 1 reply; 100+ messages in thread
From: Ingo Molnar @ 2004-03-05 16:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: andrea, peter, akpm, riel, mbligh, linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> > > [...] Only drawback is that if a timer tick is delayed for too long it
> > > won't fix that, but I guess that's reasonable for a 1s resolution.
> > 
> > what do you mean by delayed?
> 
> Normal gettimeofday can "fix" lost timer ticks because it computes the
> true offset to the last timer interrupt using the TSC or other means.
> xtime is always the last tick without any correction. If it got
> delayed too much the result will be out of date.

yeah - i doubt the softirq delay is a real issue.

	Ingo

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 16:49                                                   ` Ingo Molnar
@ 2004-03-05 16:58                                                     ` Andrea Arcangeli
  0 siblings, 0 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-05 16:58 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andi Kleen, peter, akpm, riel, mbligh, linux-kernel

On Fri, Mar 05, 2004 at 05:49:02PM +0100, Ingo Molnar wrote:
> 
> * Andi Kleen <ak@suse.de> wrote:
> 
> > > > [...] Only drawback is that if a timer tick is delayed for too long it
> > > > won't fix that, but I guess that's reasonable for a 1s resolution.
> > > 
> > > what do you mean by delayed?
> > 
> > Normal gettimeofday can "fix" lost timer ticks because it computes the
> > true offset to the last timer interrupt using the TSC or other means.
> > xtime is always the last tick without any correction. If it got
> > delayed too much the result will be out of date.
> 
> yeah - i doubt the softirq delay is a real issue.

Do you think it's more likely that the irq is lost? I think it's more
likely that the softirq takes more than 1 msec than that the irq is lost.
If the softirq takes more than 1 msec we don't necessarily need to fix
that: the timer code is designed to handle that case properly and the
softirq is the place where to do the bulk of the work. If the irq is lost
we definitely need to fix that.

Anyway, either way time may go backwards w.r.t. gettimeofday.

I'm not saying it's a real issue though.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 14:58                                       ` Andrea Arcangeli
  2004-03-05 15:26                                         ` Ingo Molnar
@ 2004-03-05 18:42                                         ` Martin J. Bligh
  2004-03-05 19:13                                           ` Andrea Arcangeli
  1 sibling, 1 reply; 100+ messages in thread
From: Martin J. Bligh @ 2004-03-05 18:42 UTC (permalink / raw)
  To: Andrea Arcangeli, Ingo Molnar
  Cc: Peter Zaitsev, Andrew Morton, riel, linux-kernel

> It's a nogo for 64G but I would be really pleased to see a workload
> triggering the zone-normal shortage in 32G, I've never seen any one. And
> 16G has even more margin.

The things I've seen consume ZONE_NORMAL (which aren't reclaimable) are:

1. mem_map (obviously) (64GB = 704MB of mem_map)

2. Buffer_heads (much improved in 2.6, though not completely gone IIRC)

3. Pagetables (pte_highmem helps; pmds still exist, but are less of a problem,
10,000 tasks would be 117MB)

4. Kernel stacks (10,000 tasks would be 78MB - 4K stacks would help obviously)

5. rmap chains - this is the real killer without objrmap (even 1000 tasks 
sharing a 2GB shmem segment will kill you without large pages).

6. vmas - weirdo Oracle things before remap_file_pages especially.

I may have forgotten some, but I think those were the main ones. 10,000 tasks
is a little heavy, but it's easy to scale the numbers around. I guess my main
point is that it's often as much to do with the number of tasks as it is
with just the larger amount of memory - but bigger machines tend to run more
tasks, so it often goes hand-in-hand.
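
A quick back-of-the-envelope check of the figures above (a sketch only; it
assumes 4K pages, a ~44-byte 2.4-era x86 struct page, 8K kernel stacks and
the 3 pagetable pages per task counted here - all assumptions, not numbers
taken from any kernel source):

	#include <stdio.h>

	int main(void)
	{
		unsigned long pages_64g  = (64UL * 1024 * 1024) / 4;		/* 64G / 4K pages  */
		unsigned long mem_map_mb = pages_64g * 44 / (1024 * 1024);	/* ~704 MB         */
		unsigned long stacks_mb  = 10000UL * 8192 / (1024 * 1024);	/* ~78 MB          */
		unsigned long pmds_mb    = 10000UL * 3 * 4096 / (1024 * 1024);	/* ~117 MB         */

		printf("mem_map %luMB, stacks %luMB, pagetables %luMB\n",
		       mem_map_mb, stacks_mb, pmds_mb);
		return 0;
	}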

Also bear in mind that as memory gets tight, the reclaimable things like
dcache and icache will get shrunk, which will hurt performance itself too,
so some of the cost of 4/4 is paid back there. Without shared pagetables,
we may need highpte even on 4/4, which kind of sucks (it can be a 10% or so hit).

M.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 18:42                                         ` Martin J. Bligh
@ 2004-03-05 19:13                                           ` Andrea Arcangeli
  2004-03-05 19:55                                             ` Martin J. Bligh
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-05 19:13 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Ingo Molnar, Peter Zaitsev, Andrew Morton, riel, linux-kernel

On Fri, Mar 05, 2004 at 10:42:55AM -0800, Martin J. Bligh wrote:
> > It's a nogo for 64G but I would be really pleased to see a workload
> > triggering the zone-normal shortage in 32G, I've never seen any one. And
> > 16G has even more margin.
> 
> The things I've seen consume ZONE_NORMAL (which aren't reclaimable) are:
> 
> 1. mem_map (obviously) (64GB = 704MB of mem_map)

I was asking 32G, that's half of that and it leaves 500M free. 64G is a
no-way with 3:1.

> 
> 2. Buffer_heads (much improved in 2.6, though not completely gone IIRC)

the vm is able to reclaim them before running oom, though it has a
performance cost.

> 3. Pagetables (pte_highmem helps, pmds are existant, but less of a problem,
> 10,000 tasks would be 117MB)

pmds seems 13M for 10000 tasks, but maybe I did the math wrong.

> 
> 4. Kernel stacks (10,000 tasks would be 78MB - 4K stacks would help obviously)

4k stacks then need the task struct to be allocated in the heap; it
still saves ram, but it's not very different.

> 
> 5. rmap chains - this is the real killer without objrmap (even 1000 tasks 
> sharing a 2GB shmem segment will kill you without large pages).

this overhead doesn't exist in 2.4.

> 6. vmas - wierdo Oracle things before remap_file_pages especially.

this is one of the main issues of 2.4.

> I may have forgotten some, but I think those were the main ones. 10,000 tasks
> is a little heavy, but it's easy to scale the numbers around. I guess my main
> point is that it's often as much to do with the number of tasks as it is
> with just the larger amount of memory - but bigger machines tend to run more
> tasks, so it often goes hand-in-hand.

yes, it's unlikely that an 8-way with 32G can scale up to 10000 tasks
regardless, but maybe things change with a 32-way 32G.

The main thing you didn't mention is the overhead in the per-cpu data
structures, that alone generates an overhead of several dozen mbytes
only in the page allocator, without accounting the slab caches,
pagetable caches etc.. putting an high limit to the per-cpu caches
should make a 32-way 32G work fine with 3:1 too though. 8-way is
fine with 32G currently.

other relevant things are the fs stuff like file handles per task and
other pinned slab things.

> Also bear in mind that as memory gets tight, the reclaimable things like
> dcache and icache will get shrunk, which will hurt performance itself too,

for these workloads (the 10000-task ones are workloads we know very
well) dcache/icache don't matter, and I still find 3:1 a more generic
kernel than 4:4 for random workloads too. And if you don't run the
10000-task workload then you have the normal zone free to use for dcache
anyway.

> so some of the cost of 4/4 is paid back there too. Without shared pagetables,
> we may need highpte even on 4/4, which kind of sucks (can be 10% or so hit).

I think pte-highmem is definitely needed on 4:4 too; even if you use
hugetlbfs, that won't cover PAE and the granular window, which is quite a
lot of the ram.

Overall shared pagetables don't pay off for their complexity; rather
than sharing the pagetables it's better not to allocate them in the
first place ;) (hugetlbfs/largepages).

The practical limit of the hardware was 5k tasks, not a kernel issue.
Your 10k example has never been tested, but obviously at some point a
limit will trigger (eventually get_pid will stop finding a free pid
too ;)

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 19:13                                           ` Andrea Arcangeli
@ 2004-03-05 19:55                                             ` Martin J. Bligh
  2004-03-05 20:29                                               ` Andrea Arcangeli
  0 siblings, 1 reply; 100+ messages in thread
From: Martin J. Bligh @ 2004-03-05 19:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Peter Zaitsev, Andrew Morton, riel, linux-kernel

>> The things I've seen consume ZONE_NORMAL (which aren't reclaimable) are:
>> 
>> 1. mem_map (obviously) (64GB = 704MB of mem_map)
> 
> I was asking 32G, that's half of that and it leaves 500M free. 64G is a
> no-way with 3:1.

Yup.

>> 2. Buffer_heads (much improved in 2.6, though not completely gone IIRC)
> 
> the vm is able to reclaim them before running oom, though it has a
> performance cost.

It didn't use to in SLES8 at least. Maybe it does in 2.6 now; I know Andrew
worked on that a lot.
 
>> 3. Pagetables (pte_highmem helps, pmds are existant, but less of a problem,
>> 10,000 tasks would be 117MB)
> 
> pmds seems 13M for 10000 tasks, but maybe I did the math wrong.

3 pages per task = 12KB per task = 120,000KB. Or that's the way I figured
it, at least.

>> 4. Kernel stacks (10,000 tasks would be 78MB - 4K stacks would help obviously)
> 
> 4k stacks then need to allocate the task struct in the heap, though it
> still saves ram, but it's not very different.

In 2.6, I think the task struct is outside the kernel stack either way.
Maybe you were pointing out something else? not sure.
 
> The main thing you didn't mention is the overhead in the per-cpu data
> structures, that alone generates an overhead of several dozen mbytes
> only in the page allocator, without accounting the slab caches,
> pagetable caches etc.. putting an high limit to the per-cpu caches
> should make a 32-way 32G work fine with 3:1 too though. 8-way is
> fine with 32G currently.

Humpf. Do you have a hard figure on how much it actually is per cpu?
 
> other relevant things are the fs stuff like file handles per task and
> other pinned slab things.

Yeah, that was a huge one we forgot ... sysfs. Particularly with large
numbers of disks, IIRC, though other resources might generate similar
issues.

> I think pte-highmem is definitely needed on 4:4 too, even if you use
> hugetlbfs that won't cover PAE and the granular window which is quite a
> lot of the ram.
> 
> Overall shared pageteables doesn't payoff for its complexity, rather
> than sharing the pagetables it's better not to allocate them in the
> first place ;) (hugetlbfs/largepages).

That might be another approach, yes ... some more implicit allocation 
stuff would help here - modifying ISV apps is a PITA to get done, and 
takes *forever*. Adam wrote some patches that are sitting in my tree,
some of which were ported forward from SLES8. But then we get into
massive problems with them not being swappable, so you need capabilities,
etc, etc. Ugh.

> The pratical limit of the hardware was 5k tasks, not a kernel issue.
> Your 10k example has never been tested, but obviously at some point a
> limit will trigger (eventually the get_pid will stop finding a free pid
> too ;)

You mean with the 8cpu box you mentioned above? Yes, probably 5K. Larger 
boxes will get progressively scarier ;-)

What scares me more is that we can sit playing counting games all day,
but there's always something we will forget. So I'm not keen on playing
brinkmanship games with customers' systems ;-)

M.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 15:02                                         ` Ingo Molnar
       [not found]                                           ` <20040305150225.GA13237@elte.hu.suse.lists.linux.kernel>
@ 2004-03-05 20:11                                           ` Jamie Lokier
  2004-03-06  5:12                                             ` Jamie Lokier
  2004-03-07 11:55                                             ` Ingo Molnar
  2004-03-07  6:50                                           ` Peter Zaitsev
  2 siblings, 2 replies; 100+ messages in thread
From: Jamie Lokier @ 2004-03-05 20:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrea Arcangeli, Peter Zaitsev, Andrew Morton, riel, mbligh,
	linux-kernel

Ingo Molnar wrote:
> if mysql in fact calls time() frequently, then it should rather start a
> worker thread that updates a global time variable every second.

That has the same problem as discussed later in this thread with
vsyscall-time: the worker thread may not run immediately it is woken,
and also setitimer() and select() round up the delay a little more
than expected, so sometimes the global time variable will be out of
date and misordered w.r.t. gettimeofday() and stat() results of
recently modified files.

Also, if there's paging the variable may be out of date by quite a
long time, so mlock() should be used to remove that aspect of the delay.

I don't know if such delays are a problem for MySQL.
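
A minimal sketch of the caching scheme under discussion (the names
cached_time and time_updater are purely illustrative, not from MySQL or
the kernel); both caveats above - scheduling delay and sleep() rounding
up - apply to it:

	#include <pthread.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	static volatile time_t cached_time;

	static void *time_updater(void *arg)
	{
		(void)arg;
		for (;;) {
			cached_time = time(NULL);	/* lags if this thread is delayed */
			sleep(1);			/* sleep()/setitimer() can round up */
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t tid;

		cached_time = time(NULL);
		pthread_create(&tid, NULL, time_updater, NULL);
		/* callers read cached_time instead of calling time() per query */
		sleep(2);
		printf("cached second: %ld\n", (long)cached_time);
		return 0;
	}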

-- Jamie

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04  5:10                                     ` Andrew Morton
  2004-03-04  5:27                                       ` Andrea Arcangeli
@ 2004-03-05 20:19                                       ` Jamie Lokier
  2004-03-05 20:33                                         ` Andrea Arcangeli
  1 sibling, 1 reply; 100+ messages in thread
From: Jamie Lokier @ 2004-03-05 20:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, peter, riel, mbligh, linux-kernel

Andrew Morton wrote:
>   We believe that this could cause misbehaviour of such things as the
>   boehm garbage collector.  This patch provides the mprotect() atomicity by
>   performing all userspace copies under page_table_lock.

Can you use a read-write lock, so that userspace copies only need to
take the lock for reading?  That doesn't eliminate cacheline bouncing
but does eliminate the serialisation.

Or did you do that already, and found performance is still very low?

> It is a judgement call.  Personally, I wouldn't ship a production kernel
> with this patch.  People need to be aware of the tradeoff and to think and
> test very carefully.

If this isn't fixed, _please_ provide a way for a garbage collector to
query the kernel as to whether this race condition is present.

-- Jamie

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 19:55                                             ` Martin J. Bligh
@ 2004-03-05 20:29                                               ` Andrea Arcangeli
  2004-03-05 20:41                                                 ` Andrew Morton
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-05 20:29 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Ingo Molnar, Peter Zaitsev, Andrew Morton, riel, linux-kernel

On Fri, Mar 05, 2004 at 11:55:05AM -0800, Martin J. Bligh wrote:
> Didn't used to in SLES8 at least. maybe it does in 2.6 now, I know Andrew
> worked on that a lot.

it should do so in every SLES8 kernel out there too (it wasn't in mainline
until very recently); see the related bhs stuff.

> In 2.6, I think the task struct is outside the kernel stack either way.
> Maybe you were pointing out something else? not sure.

I meant that making the kernel stack 4k pretty much requires moving the
task_struct out of it; making it 4k without removing the task_struct
sounds too small.

> > The main thing you didn't mention is the overhead in the per-cpu data
> > structures, that alone generates an overhead of several dozen mbytes
> > only in the page allocator, without accounting the slab caches,
> > pagetable caches etc.. putting an high limit to the per-cpu caches
> > should make a 32-way 32G work fine with 3:1 too though. 8-way is
> > fine with 32G currently.
> 
> Humpf. Do you have a hard figure on how much it actually is per cpu?

not a definitive one, but it's sure more than 2m per cpu, could be 3m
per cpu.

> > other relevant things are the fs stuff like file handles per task and
> > other pinned slab things.
> 
> Yeah, that was a huge one we forgot ... sysfs. Particularly with large
> numbers of disks, IIRC, though other resources might generate similar
> issues.

which doesn't need to be mounted during production; hotplug should
mount it, read it and unmount it. It's worthless to leave it mounted. Only
root-only hardware related stuff should be in sysfs, everything else
that has been abstracted at the kernel level (transparent to
applications) should remain in /proc. Unmounting /proc hurts
production systems, unmounting sysfs should not.

> You mean with the 8cpu box you mentioned above? Yes, probably 5K. Larger 
> boxes will get progressively scarier ;-)

yes.

> What scares me more is that we can sit playing counting games all day,
> but there's always something we will forget. So I'm not keen on playing
> brinkmanship games with customers systems ;-)

this is true for 4:4 too. Also, with 2.4 the system will return -ENOMEM,
unlike 2.6 which locks up the box. So it's not a fatal thing if a certain
kernel can't sustain a certain workload on certain hardware, just like
it's not a fatal thing if you run out of memory for the pagetables on a
64bit architecture with a 64bit kernel. My only objective is to make it
feasible to run the most high end workloads on the most high end hardware
with a good safety margin, knowing that if something goes wrong the worst
that can happen is that a syscall returns -ENOMEM. There will always be a
malicious workload able to fill the zone-normal: if you fork off a ton
of tasks, open a gazillion sockets and flood all of them at the same time
to fill all the receive windows, you'll fill your cool 4G zone-normal of
4:4 in half a second with a 10gigabit NIC.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 20:19                                       ` Jamie Lokier
@ 2004-03-05 20:33                                         ` Andrea Arcangeli
  2004-03-05 21:44                                           ` Jamie Lokier
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-05 20:33 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andrew Morton, peter, riel, mbligh, linux-kernel

On Fri, Mar 05, 2004 at 08:19:55PM +0000, Jamie Lokier wrote:
> Andrew Morton wrote:
> >   We believe that this could cause misbehaviour of such things as the
> >   boehm garbage collector.  This patch provides the mprotect() atomicity by
> >   performing all userspace copies under page_table_lock.
> 
> Can you use a read-write lock, so that userspace copies only need to
> take the lock for reading?  That doesn't eliminate cacheline bouncing
> but does eliminate the serialisation.

normally the bouncing would be the only overhead, but here I also think
the serialization is a significant factor of the contention because the
critical section is taking lots of time. So I would expect some
improvement by using a read/write lock.
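
A rough shape of what that could look like (purely illustrative:
mm->usercopy_sem is an invented member, not an existing field, and the
function names are made up):

	/* usercopy path: shared side, copies from different CPUs can run
	 * in parallel, only the protection changers are excluded */
	static unsigned long guarded_copy_to_user(struct mm_struct *mm,
						  void __user *dst,
						  const void *src,
						  unsigned long len)
	{
		unsigned long left;

		down_read(&mm->usercopy_sem);
		left = copy_to_user(dst, src, len);
		up_read(&mm->usercopy_sem);
		return left;
	}

	/* mprotect path: exclusive side, atomic w.r.t. all user copies */
	static void guarded_mprotect_range(struct mm_struct *mm,
					   unsigned long start,
					   unsigned long end)
	{
		down_write(&mm->usercopy_sem);
		/* ... the existing mprotect protection-change work ... */
		up_write(&mm->usercopy_sem);
	}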

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 20:29                                               ` Andrea Arcangeli
@ 2004-03-05 20:41                                                 ` Andrew Morton
  2004-03-05 21:07                                                   ` Andrea Arcangeli
  0 siblings, 1 reply; 100+ messages in thread
From: Andrew Morton @ 2004-03-05 20:41 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: mbligh, mingo, peter, riel, linux-kernel

Andrea Arcangeli <andrea@suse.de> wrote:
>
> > > The main thing you didn't mention is the overhead in the per-cpu data
>  > > structures, that alone generates an overhead of several dozen mbytes
>  > > only in the page allocator, without accounting the slab caches,
>  > > pagetable caches etc.. putting an high limit to the per-cpu caches
>  > > should make a 32-way 32G work fine with 3:1 too though. 8-way is
>  > > fine with 32G currently.
>  > 
>  > Humpf. Do you have a hard figure on how much it actually is per cpu?
> 
>  not a definitive one, but it's sure more than 2m per cpu, could be 3m
>  per cpu.

It'll average out to 68 pages per cpu.  (4 in ZONE_DMA, 64 in ZONE_NORMAL).

That's eight megs on 32-way.  Maybe it can be trimmed back a bit, but on
32-way you probably want the locking amortisation more than the 8 megs.

The settings we have in there are still pretty much guesswork.  I don't
think anyone has done any serious tuning on them.  Any differences are
likely to be small.



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 20:41                                                 ` Andrew Morton
@ 2004-03-05 21:07                                                   ` Andrea Arcangeli
  2004-03-05 22:12                                                     ` Andrew Morton
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-05 21:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mbligh, mingo, peter, riel, linux-kernel

On Fri, Mar 05, 2004 at 12:41:19PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > > > The main thing you didn't mention is the overhead in the per-cpu data
> >  > > structures, that alone generates an overhead of several dozen mbytes
> >  > > only in the page allocator, without accounting the slab caches,
> >  > > pagetable caches etc.. putting an high limit to the per-cpu caches
> >  > > should make a 32-way 32G work fine with 3:1 too though. 8-way is
> >  > > fine with 32G currently.
> >  > 
> >  > Humpf. Do you have a hard figure on how much it actually is per cpu?
> > 
> >  not a definitive one, but it's sure more than 2m per cpu, could be 3m
> >  per cpu.
> 
> It'll average out to 68 pages per cpu.  (4 in ZONE_DMA, 64 in ZONE_NORMAL).

3m per cpu with all 3m in zone normal.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 15:26                                         ` Ingo Molnar
  2004-03-05 15:53                                           ` Andrea Arcangeli
@ 2004-03-05 21:28                                           ` Martin J. Bligh
  1 sibling, 0 replies; 100+ messages in thread
From: Martin J. Bligh @ 2004-03-05 21:28 UTC (permalink / raw)
  To: Ingo Molnar, Andrea Arcangeli
  Cc: Peter Zaitsev, Andrew Morton, riel, linux-kernel

> * Andrea Arcangeli <andrea@suse.de> wrote:
> 
>> It's a nogo for 64G but I would be really pleased to see a workload
>> triggering the zone-normal shortage in 32G, I've never seen any one. 
>> [...]
> 
> have you tried TPC-C/TPC-H?

We're doing those here. Publishing results will be tricky due to their
draconian rules, but I'm sure you'll be able to read between the lines ;-)

OASB (Oracle apps) is the other total killer I've found in the past.

M.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 20:33                                         ` Andrea Arcangeli
@ 2004-03-05 21:44                                           ` Jamie Lokier
  0 siblings, 0 replies; 100+ messages in thread
From: Jamie Lokier @ 2004-03-05 21:44 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, peter, riel, mbligh, linux-kernel

Andrea Arcangeli wrote:
> > Can you use a read-write lock, so that userspace copies only need to
> > take the lock for reading?  That doesn't eliminate cacheline bouncing
> > but does eliminate the serialisation.
> 
> normally the bouncing would be the only overhead, but here I also think
> the serialization is a significant factor of the contention because the
> critical section is taking lots of time. So I would expect some
> improvement by using a read/write lock.

For something as significant as user<->kernel data transfers, it might
be worth eliminating the bouncing as well - by using per-CPU * per-mm
spinlocks.

User<->kernel data transfers would take the appropriate per-CPU lock
for the current mm, and not take page_table_lock.  Everything that
normally takes page_table_lock would still do so, and would also take all
of the per-CPU locks.

That does require a set of per-CPU spinlocks to be allocated whenever
a new mm is allocated (although the sets could be cached so it needn't
be slow).
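
A sketch of the shape of that idea (illustrative only; struct mm_copy_locks
and its fields are invented, and the lock ordering/allocation details are
glossed over):

	struct mm_copy_locks {
		spinlock_t cpu[NR_CPUS];
	};

	/* user<->kernel copy path: take only the current CPU's lock */
	static inline spinlock_t *usercopy_lock(struct mm_copy_locks *l)
	{
		spinlock_t *lock = &l->cpu[get_cpu()];	/* pins the CPU */

		spin_lock(lock);
		return lock;
	}

	static inline void usercopy_unlock(spinlock_t *lock)
	{
		spin_unlock(lock);
		put_cpu();
	}

	/* paths that used to rely on page_table_lock for exclusion must
	 * take every per-CPU lock, always in the same (index) order */
	static void usercopy_lock_all(struct mm_copy_locks *l)
	{
		int i;

		for (i = 0; i < NR_CPUS; i++)
			spin_lock(&l->cpu[i]);
	}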

-- Jamie

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 21:07                                                   ` Andrea Arcangeli
@ 2004-03-05 22:12                                                     ` Andrew Morton
  0 siblings, 0 replies; 100+ messages in thread
From: Andrew Morton @ 2004-03-05 22:12 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: mbligh, mingo, peter, riel, linux-kernel

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Fri, Mar 05, 2004 at 12:41:19PM -0800, Andrew Morton wrote:
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > >
> > > > > The main thing you didn't mention is the overhead in the per-cpu data
> > >  > > structures, that alone generates an overhead of several dozen mbytes
> > >  > > only in the page allocator, without accounting the slab caches,
> > >  > > pagetable caches etc.. putting an high limit to the per-cpu caches
> > >  > > should make a 32-way 32G work fine with 3:1 too though. 8-way is
> > >  > > fine with 32G currently.
> > >  > 
> > >  > Humpf. Do you have a hard figure on how much it actually is per cpu?
> > > 
> > >  not a definitive one, but it's sure more than 2m per cpu, could be 3m
> > >  per cpu.
> > 
> > It'll average out to 68 pages per cpu.  (4 in ZONE_DMA, 64 in ZONE_NORMAL).
> 
> 3m per cpu with all 3m in zone normal.

In the page allocator?  How did you arrive at this figure?

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 20:11                                           ` Jamie Lokier
@ 2004-03-06  5:12                                             ` Jamie Lokier
  2004-03-06 12:56                                               ` Magnus Naeslund(t)
  2004-03-07 11:55                                             ` Ingo Molnar
  1 sibling, 1 reply; 100+ messages in thread
From: Jamie Lokier @ 2004-03-06  5:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrea Arcangeli, Peter Zaitsev, Andrew Morton, riel, mbligh,
	linux-kernel

Jamie Lokier wrote:
> Ingo Molnar wrote:
> > if mysql in fact calls time() frequently, then it should rather start a
> > worker thread that updates a global time variable every second.
> 
> That has the same problem as discussed later in this thread with
> vsyscall-time: the worker thread may not run immediately it is woken,
> and also setitimer() and select() round up the delay a little more
> then expected, so sometimes the global time variable will be out of
> date and misordered.
>
> I don't know if such delays a problem for MySQL.

I still don't know about MySQL, but I have just encountered some code of
my own which does break if time() returns significantly out of date
values.

Any code which is structured like this will break:

	time_t timeout = time(0) + TIMEOUT_IN_SECONDS;

	do {
		/* Do some stuff which takes a little while. */
	} while (time(0) <= timeout);

It goes wrong when time() returns a value that is in the past, and
then jumps forward to the correct time suddenly.  The timeout of the
above code is reduced by the size of that jump.  If the jump is larger
than TIMEOUT_IN_SECONDS, the timeout mechanism is defeated completely.

That sort of code is a prime candidate for the method of using a
worker thread updating a global variable, so it's really important
to take care when using it.

-- Jamie

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-06  5:12                                             ` Jamie Lokier
@ 2004-03-06 12:56                                               ` Magnus Naeslund(t)
  2004-03-06 13:13                                                 ` Magnus Naeslund(t)
  0 siblings, 1 reply; 100+ messages in thread
From: Magnus Naeslund(t) @ 2004-03-06 12:56 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

Jamie Lokier wrote:
[snip]
> 
> Any code which is structured like this will break:
> 
> 	time_t timeout = time(0) + TIMEOUT_IN_SECONDS;
> 
> 	do {
> 		/* Do some stuff which takes a little while. */
> 	} while (time(0) <= timeout);
> 
> It goes wrong when time() returns a value that is in the past, and
> then jumps forward to the correct time suddenly.  The timeout of the
> above code is reduced by the size of that jump.  If the jump is larger
> than TIMEOUT_IN_SECONDS, the timeout mechanism is defeated completely.
> 
> That sort of code is a prime candidate for the method of using a
> worker thread updating a global variable, so it's really important to
> to take care when using it.
> 

But isn't this kind of code a known buggy way of implementing timeouts?
Shouldn't it be like:

time_t x = time(0);
do {
   ...
} while (time(0) - x >= TIMEOUT_IN_SECONDS);

Of course it can't handle times in the past, but it won't easily get hung
up with regard to leaps or wraparounds (if used with other functions).

Regards

Magnus



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-06 12:56                                               ` Magnus Naeslund(t)
@ 2004-03-06 13:13                                                 ` Magnus Naeslund(t)
  0 siblings, 0 replies; 100+ messages in thread
From: Magnus Naeslund(t) @ 2004-03-06 13:13 UTC (permalink / raw)
  To: Magnus Naeslund(t); +Cc: Jamie Lokier, linux-kernel

Magnus Naeslund(t) wrote:
> 
> But isn't this kind of code a known buggy way of implementing timeouts?
> Shouldn't it be like:
> 
> time_t x = time(0);
> do {
>   ...
> } while (time(0) - x >= TIMEOUT_IN_SECONDS);

I meant:
  } while (time(0) - x < TIMEOUT_IN_SECONDS);

Also if time_t is signed, that needs to be taken care of.
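
One illustrative way to address the signed-time_t caveat (a sketch, not
from any project): do the comparison on an unsigned difference, similar in
spirit to the kernel's time_after() jiffies idiom:

	#include <time.h>

	static int timed_out(time_t start, unsigned long timeout_secs)
	{
		/* subtract in unsigned arithmetic so a signed time_t
		 * wrapping or stepping doesn't cause undefined behaviour */
		return ((unsigned long)time(NULL) - (unsigned long)start)
			>= timeout_secs;
	}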

Magnus - butterfingers


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 15:02                                         ` Ingo Molnar
       [not found]                                           ` <20040305150225.GA13237@elte.hu.suse.lists.linux.kernel>
  2004-03-05 20:11                                           ` Jamie Lokier
@ 2004-03-07  6:50                                           ` Peter Zaitsev
  2 siblings, 0 replies; 100+ messages in thread
From: Peter Zaitsev @ 2004-03-07  6:50 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrea Arcangeli, Andrew Morton, riel, mbligh, linux-kernel

On Fri, 2004-03-05 at 07:02, Ingo Molnar wrote:
> * Andrea Arcangeli <andrea@suse.de> wrote:
> 
> > I thought time() wouldn't be called more than 1 per second anyways,
> > why would anyone call time more than 1 per second?
> 
> if mysql in fact calls time() frequently, then it should rather start a
> worker thread that updates a global time variable every second.

Ingo, Andrea,

I would not say MySQL calls time() that often; it is normally 2 times per
query (to measure query execution time), maybe a couple of times more.

Looking at typical profiling results it takes much less than 1% of the
time, even for very simple query loads.

Rather than changing the design of how time is computed, I think we would
do better to go for better accuracy - nowadays 1 second resolution is far
too coarse.


-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
  http://www.mysql.com/uc2004/


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 16:39                                                 ` Andrea Arcangeli
@ 2004-03-07  8:16                                                   ` Ingo Molnar
  0 siblings, 0 replies; 100+ messages in thread
From: Ingo Molnar @ 2004-03-07  8:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andi Kleen, Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel


* Andrea Arcangeli <andrea@suse.de> wrote:

> > what do you mean by delayed?
> 
> if the timer softirq doesn't run and wall_jiffies doesn't increase, we
> won't be able to account for it, so time() will return a time in the
> past, it will potentially go backwards precisely 1/HZ seconds every
> tick that isn't executing the timer softirq. [...]

we agree that this all is not an issue, but the reasons are different 
from what you describe.

wall_jiffies (and, more importantly, xtime.tv_sec - which is the clock
source used by sys_time()) is updated from hardirq context - so softirq
delay cannot impact it.

gettimeofday() and time() are unsynchronized clocks, and time() will
almost always return a time less than the current time - due to rounding
down.

in the moments where there's a timer IRQ pending (or the timer IRQ's
time update effect is delayed eg. due to contention on xtime_lock)
gettimeofday() can estimate the current time past the timer tick, at
which moment the inaccuracy of time() can be briefly higher than 1
second. (in most cases it should be 1 second + delta)

> [...] I tend to agree for a 1sec resultion that's not a big deal
> though if you run:
> 
> 	gettimeofday()
> 	time()
> 
> gettimeofday may say the time of the day is 17:39:10 and time may tell
> 17:39:09

nobody should rely on gettimeofday() and time() being synchronized on
the second level. Typically the delta will be [0 ... 0.999999 ] seconds,
occasionally it can get larger.

and this has nothing to do with using vsyscalls and it can already
happen. xtime.tv_sec is used without any synchronization so even if
xtime were synchronized with gettimeofday() [eg. by do_gettimeofday()
noticing that xtime.tv_sec needs an update] - the access is not
serialized on SMP.
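
In pseudo-kernel terms (simplified, not actual source; xtime_sec, xtime_usec
and arch_timer_offset() are placeholders for the real xtime fields and the
arch-specific sub-tick interpolation):

	long sys_time_sketch(void)
	{
		return xtime_sec;		/* last tick, rounded down */
	}

	void gettimeofday_sketch(struct timeval *tv)
	{
		tv->tv_sec  = xtime_sec;	/* same base value as time() */
		tv->tv_usec = xtime_usec + arch_timer_offset();

		while (tv->tv_usec >= 1000000) {	/* interpolation can step   */
			tv->tv_usec -= 1000000;		/* tv_sec past what time()  */
			tv->tv_sec++;			/* just returned            */
		}
	}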

	Ingo

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 15:53                                           ` Andrea Arcangeli
@ 2004-03-07  8:41                                             ` Ingo Molnar
  2004-03-07 10:29                                               ` Nick Piggin
  2004-03-07 17:24                                               ` Andrea Arcangeli
  0 siblings, 2 replies; 100+ messages in thread
From: Ingo Molnar @ 2004-03-07  8:41 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel


* Andrea Arcangeli <andrea@suse.de> wrote:

> [...] but I'm quite confortable to say that up to 16G (included) 4:4
> is worthless unless you've to deal with the rmap waste IMHO. [...]

i've seen workloads on 8G RAM systems that easily filled up the ~800 MB
lowmem zone. (it had to do with many files and having them as a big
dentry cache, so yes, it's unfixable unless you start putting inodes
into highmem which is crazy. And yes, performance broke down unless most
of the dentries/inodes were cached in lowmem.)

as i said - it all depends on the workload, and users are amazingly
creative at finding all sorts of workloads. Whether 4:4 or 3:1 is thus
workload dependent.

should lowmem footprint be reduced? By all means yes, but only as long
as it doesnt jeopardize the real 64-bit platforms. Is 3:1 adequate as a
generic x86 kernel for absolutely everything up to and including 16 GB? 
Strong no. [not to mention that 'up to 16 GB' is an artificial thing
created by us which wont satisfy an IHV that has a hw line with RAM up
to 32 or 64 GB. It doesnt matter that 90% of the customers wont have
that much RAM, it's a basic "can it scale to that much RAM" question.]

so i think the right answer is to have 4:4 around to cover the bases -
and those users who have workloads that will run fine on 3:1 should run
3:1.

(not to mention the range of users who need 4GB _userspace_.)

but i'm quite strongly convinced that 'getting rid' of the 'pte chain
overhead' in favor of questionable lowmem space gains for a dying
(high-end server) platform is very shortsighted. [getting rid of them
for purposes of the 64-bit platforms could be OK, but the argumentation
isnt that strong there i think.]

	Ingo

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-07  8:41                                             ` Ingo Molnar
@ 2004-03-07 10:29                                               ` Nick Piggin
  2004-03-07 17:33                                                 ` Andrea Arcangeli
  2004-03-07 17:24                                               ` Andrea Arcangeli
  1 sibling, 1 reply; 100+ messages in thread
From: Nick Piggin @ 2004-03-07 10:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrea Arcangeli, Peter Zaitsev, Andrew Morton, riel, mbligh,
	linux-kernel



Ingo Molnar wrote:

>* Andrea Arcangeli <andrea@suse.de> wrote:
>
>
>>[...] but I'm quite confortable to say that up to 16G (included) 4:4
>>is worthless unless you've to deal with the rmap waste IMHO. [...]
>>
>
>i've seen workloads on 8G RAM systems that easily filled up the ~800 MB
>lowmem zone. (it had to do with many files and having them as a big
>dentry cache, so yes, it's unfixable unless you start putting inodes
>into highmem which is crazy. And yes, performance broke down unless most
>of the dentries/inodes were cached in lowmem.)
>
>

If you still have any of these workloads around, they would be
good to test on the memory management changes in Andrew's mm tree
which should correctly balance slab on highmem systems. Linus'
tree has a few problems here.

But if you really have a lot more than 800MB of active dentries,
then maybe 4:4 would be a win?


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 20:11                                           ` Jamie Lokier
  2004-03-06  5:12                                             ` Jamie Lokier
@ 2004-03-07 11:55                                             ` Ingo Molnar
  1 sibling, 0 replies; 100+ messages in thread
From: Ingo Molnar @ 2004-03-07 11:55 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andrea Arcangeli, Peter Zaitsev, Andrew Morton, riel, mbligh,
	linux-kernel


* Jamie Lokier <jamie@shareable.org> wrote:

> Ingo Molnar wrote:
> > if mysql in fact calls time() frequently, then it should rather start a
> > worker thread that updates a global time variable every second.
> 
> That has the same problem as discussed later in this thread with
> vsyscall-time: the worker thread may not run immediately it is woken,
> and also setitimer() and select() round up the delay a little more
> then expected, so sometimes the global time variable will be out of
> date and misordered w.r.t. gettimeofday() and stat() results of
> recently modified files.

we dont have any guarantees wrt. the synchronization of the time() and
the gettimeofday() clocks - irrespective of vsyscalls, do we?

	Ingo

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-07  8:41                                             ` Ingo Molnar
  2004-03-07 10:29                                               ` Nick Piggin
@ 2004-03-07 17:24                                               ` Andrea Arcangeli
  1 sibling, 0 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-07 17:24 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel

On Sun, Mar 07, 2004 at 09:41:20AM +0100, Ingo Molnar wrote:
> 
> * Andrea Arcangeli <andrea@suse.de> wrote:
> 
> > [...] but I'm quite confortable to say that up to 16G (included) 4:4
> > is worthless unless you've to deal with the rmap waste IMHO. [...]
> 
> i've seen workloads on 8G RAM systems that easily filled up the ~800 MB
> lowmem zone. (it had to do with many files and having them as a big

was that a kernel with rmap or w/o rmap?

> but i'm quite strongly convinced that 'getting rid' of the 'pte chain
> overhead' in favor of questionable lowmem space gains for a dying
> (high-end server) platform is very shortsighted. [getting rid of them
> for purposes of the 64-bit platforms could be OK, but the argumentation
> isnt that strong there i think.]

I disagree; the reason I'm doing it is for the 64bit platforms, I
couldn't care less about x86. The vm is dog slow with rmap.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-07 10:29                                               ` Nick Piggin
@ 2004-03-07 17:33                                                 ` Andrea Arcangeli
  2004-03-08  5:15                                                   ` Nick Piggin
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-07 17:33 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel

On Sun, Mar 07, 2004 at 09:29:37PM +1100, Nick Piggin wrote:
> 
> 
> Ingo Molnar wrote:
> 
> >* Andrea Arcangeli <andrea@suse.de> wrote:
> >
> >
> >>[...] but I'm quite confortable to say that up to 16G (included) 4:4
> >>is worthless unless you've to deal with the rmap waste IMHO. [...]
> >>
> >
> >i've seen workloads on 8G RAM systems that easily filled up the ~800 MB
> >lowmem zone. (it had to do with many files and having them as a big
> >dentry cache, so yes, it's unfixable unless you start putting inodes
> >into highmem which is crazy. And yes, performance broke down unless most
> >of the dentries/inodes were cached in lowmem.)
> >
> >
> 
> If you still have any of these workloads around, they would be

I also have workloads that would die with 4:4 and rmap.

the question is whether they tested this on the stock 2.4 or 2.4-aa VM, or
whether this was tested on kernels with rmap.

most kernels are also broken w.r.t. lowmem reservation; there are huge
vm design breakages in tons of 2.4 kernels out there, and those breakages
would generate lowmem shortages too, so just saying the 8G box runs out of
lowmem is meaningless unless we know exactly which kind of 2.4
incarnation was running on that box.

For instance, google was running out of the lowmem zone even on 2.5G boxes
until I fixed it, and the fix was merged in mainline only around 2.4.23,
so unless I'm sure all the relevant fixes were applied, "the 8G box runs
out of lowmem" means nothing to me, since it was running out of lowmem for
me too for ages even on 4G boxes until I fixed all those issues in the vm,
not related to the pinned amount of memory.

alternatively, if they can count the number of tasks and the number of
files open, we can do the math and count the mbytes of lowmem pinned;
that as well could demonstrate it was a limitation of 3:1 and not a
design bug of the vm in use on that box.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-07 17:33                                                 ` Andrea Arcangeli
@ 2004-03-08  5:15                                                   ` Nick Piggin
  0 siblings, 0 replies; 100+ messages in thread
From: Nick Piggin @ 2004-03-08  5:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel



Andrea Arcangeli wrote:

>On Sun, Mar 07, 2004 at 09:29:37PM +1100, Nick Piggin wrote:
>
>>
>>Ingo Molnar wrote:
>>
>>
>>>* Andrea Arcangeli <andrea@suse.de> wrote:
>>>
>>>
>>>
>>>>[...] but I'm quite confortable to say that up to 16G (included) 4:4
>>>>is worthless unless you've to deal with the rmap waste IMHO. [...]
>>>>
>>>>
>>>i've seen workloads on 8G RAM systems that easily filled up the ~800 MB
>>>lowmem zone. (it had to do with many files and having them as a big
>>>dentry cache, so yes, it's unfixable unless you start putting inodes
>>>into highmem which is crazy. And yes, performance broke down unless most
>>>of the dentries/inodes were cached in lowmem.)
>>>
>>>
>>>
>>If you still have any of these workloads around, they would be
>>
>
>I also have workloads that would die with 4:4 and rmap.
>
>

I don't doubt that, and of course no amount of tinkering with
reclaim will help where you are dying due to pinned lowmem.

Ingo's workload sounded like one where the slab cache reclaim improvements
in recent -mm kernels might possibly help. I was purely interested
in it for testing the reclaim changes.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-05 16:23                                               ` Ingo Molnar
  2004-03-05 16:39                                                 ` Andrea Arcangeli
@ 2004-03-10 13:21                                                 ` Andi Kleen
  2004-03-05 16:42                                                   ` Andrea Arcangeli
  2004-03-05 16:49                                                   ` Ingo Molnar
  1 sibling, 2 replies; 100+ messages in thread
From: Andi Kleen @ 2004-03-10 13:21 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: andrea, peter, akpm, riel, mbligh, linux-kernel

On Fri, 5 Mar 2004 17:23:19 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> > [...] Only drawback is that if a timer tick is delayed for too long it
> > won't fix that, but I guess that's reasonable for a 1s resolution.
> 
> what do you mean by delayed?

Normal gettimeofday can "fix" lost timer ticks because it computes the true
offset to the last timer interrupt using the TSC or other means. xtime
is always the last tick without any correction. If it got delayed too much 
the result will be out of date.

-Andi

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-28  9:19                       ` Dave Hansen
@ 2004-03-18  2:44                         ` Andrea Arcangeli
  0 siblings, 0 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-18  2:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Martin J. Bligh, Wim Coekaerts, Hugh Dickins, Andrew Morton,
	Rik van Riel, Linux Kernel Mailing List

On Sat, Feb 28, 2004 at 01:19:01AM -0800, Dave Hansen wrote:
> On Fri, 2004-02-27 at 23:05, Andrea Arcangeli wrote:
> > > I'm not sure it's that straightforward really - doing the non-pgd aligned
> > > split is messy. 2.5 might actually be much cleaner than 3.5 though, as we
> > > never updated the mappings of the PMD that's shared between user and kernel.
> > > Hmmm ... that's quite tempting.
> > 
> > I read the 3.5:0.5 PAE sometime last year and it was pretty
> > strightforward too, the only single reason I didn't merge it is that
> > it had the problem that it changed common code that every archs depends
> > on, so it broke all other archs, but it's not really a matter of
> > difficult code, as worse it just needs a few liner change in every arch
> > to make them compile again. So I'm quite optimistic 2.5:1.5 will be
> > doable with a reasonably clean patch and with ~zero performance downside
> > compared to 3:1 and 2:2.
> 
> The only performance problem with using PMDs which are shared between
> kernel and user PTE pages is that you have a potential to be required to
> instantiate the kernel portion of the shared PMD each time you need a
> new set of page tables.  A slab for these partial PMDs is quite helpful
> in this case.  

that's a bigger cost during context switch but it's still zero cost for
the syscalls, and it never flushes away the user address space
unnecessarily. So I doubt it's measurable (unlike 4:4, which is a big hit).

> The real logistical problem with partial PMDs is just making sure that
> all of the 0 ... PTRS_PER_PMD loops are correct.  The last few times
> I've implemented it, I just made PTRS_PER_PMD take a PGD index, and made
> sure to start all of the loops from things like pmd_index(PAGE_OFFSET)
> instead of 0.  

it is indeed tricky, though your last patch for 3.5G on PAE looked fine.
But now I would like to include 2.5:1.5, not 3.5:0.5 ;); maybe we can
support 3.5:0.5 too at the same time (though 3.5:0.5 is secondary).

> Here are a couple of patches that allowed partial user/kernel PMDs. 
> These conflicted with 4:4 and got dropped somewhere along the way, but
> the generic approaches worked.  I believe they at least compiled on all
> of the arches, too.  
> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.5.68/2.5.68-mjb1/540-separate_pmd
> ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.5.68/2.5.68-mjb1/650-banana_split

would you be willing to implement config 15GB too (2.5:1.5)? In the next
few days I'm going to work on the rbtree for the objrmap (just in case
somebody wants to swap the shm with vlm instead of mlocking it), but I
would like to get this done too ;).

you can see my current tree in 2.6.5-rc1-aa1 on the ftp site, but you can
use any other kernel too, since the code you will touch should be the
same for all of 2.6.

It's up to you, only if you are interested, thanks.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-12 21:15                       ` Andi Kleen
@ 2004-03-18 19:50                         ` Peter Zaitsev
  0 siblings, 0 replies; 100+ messages in thread
From: Peter Zaitsev @ 2004-03-18 19:50 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Fri, 2004-03-12 at 13:15, Andi Kleen wrote:
> Peter Zaitsev <peter@mysql.com> writes:
> >
> > Rather than changing design how time is computed I think we would better
> > to go to better accuracy - nowadays 1 second is far too raw.
> 
> Just call gettimeofday(). In near all kernels time internally does that
> anyways.

Right, 

gettimeofday() was much slower some years ago on some other Unix
platform, which is why time() was used instead.

Now we just need to fix a lot of places (datatypes, prints, etc.) to move
to gettimeofday().



-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
  http://www.mysql.com/uc2004/


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
       [not found]                       ` <1x293-2nT-7@gated-at.bofh.it>
@ 2004-03-12 21:25                         ` Andi Kleen
  0 siblings, 0 replies; 100+ messages in thread
From: Andi Kleen @ 2004-03-12 21:25 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, andrea

Ingo Molnar <mingo@elte.hu> writes:

> but i'm quite strongly convinced that 'getting rid' of the 'pte chain
> overhead' in favor of questionable lowmem space gains for a dying
> (high-end server) platform is very shortsighted. [getting rid of them
> for purposes of the 64-bit platforms could be OK, but the argumentation
> isnt that strong there i think.]

pte chain locking seems to be still quite far up in profile logs of
2.6 on x86-64 for common workloads. It's nonexistent in mainline
2.4. I would consider this a strong reason to do something about that.

-Andi


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
       [not found]                     ` <1x0qG-Dr-3@gated-at.bofh.it>
@ 2004-03-12 21:15                       ` Andi Kleen
  2004-03-18 19:50                         ` Peter Zaitsev
  0 siblings, 1 reply; 100+ messages in thread
From: Andi Kleen @ 2004-03-12 21:15 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: linux-kernel

Peter Zaitsev <peter@mysql.com> writes:
>
> Rather than changing the design of how time is computed, I think we would
> do better to move to higher accuracy - nowadays 1 second is far too coarse.

Just call gettimeofday(). In nearly all kernels time() internally does that
anyway.

-Andi


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04 23:24   ` Andrea Arcangeli
@ 2004-03-05  3:43     ` Rik van Riel
  0 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2004-03-05  3:43 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Peter Zaitsev, mbligh, linux-kernel

On Fri, 5 Mar 2004, Andrea Arcangeli wrote:
> On Thu, Mar 04, 2004 at 05:14:30PM -0500, Rik van Riel wrote:
> > > Or maybe you mean the page_table_lock held during copy-user that Andrew
> > > mentioned? (copy-user doesn't mean "all VM operations"; not sure if you
> > > meant this or the usual locking of every 2.4/2.6 kernel out there)
> > 
> > True, there are some other operations.  However, when
> 
> Could you name one that is serialized in 4:4 and not in 3:1 with an mm
> lock? Just curious. There are tons of VM operations serialized by the
> page_table_lock that hurt with threads in 3:1 too. I understood only
> copy-user needs the additional locking.

Yeah, in the case of a threaded workload you're right.

For a many-processes workload the locking optimisations
definitely made a difference, IIRC.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04 22:14 ` Rik van Riel
@ 2004-03-04 23:24   ` Andrea Arcangeli
  2004-03-05  3:43     ` Rik van Riel
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-04 23:24 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, Peter Zaitsev, mbligh, linux-kernel

On Thu, Mar 04, 2004 at 05:14:30PM -0500, Rik van Riel wrote:
> > Or maybe you mean the page_table_lock held during copy-user that Andrew
> > mentioned? (copy-user doesn't mean "all VM operations"; not sure if you
> > meant this or the usual locking of every 2.4/2.6 kernel out there)
> 
> True, there are some other operations.  However, when

Could you name one that is serialized in 4:4 and not in 3:1 with an mm
lock? Just curious. There are tons of VM operations serialized by the
page_table_lock that hurt with threads in 3:1 too. I understood only
copy-user needs the additional locking.

> you consider the fact that copy-user operations are
> needed for so many things, they are the big bottleneck.
> 
> Making it possible to copy things to and from userspace
> in a lockless way will help performance quite a bit...

I don't expect a huge speedup, but it would certainly be measurable.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
       [not found] <20040304175821.GO4922@dualathlon.random>
@ 2004-03-04 22:14 ` Rik van Riel
  2004-03-04 23:24   ` Andrea Arcangeli
  0 siblings, 1 reply; 100+ messages in thread
From: Rik van Riel @ 2004-03-04 22:14 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Peter Zaitsev, mbligh, linux-kernel

On Thu, 4 Mar 2004, Andrea Arcangeli wrote:
> On Thu, Mar 04, 2004 at 07:12:23AM -0500, Rik van Riel wrote:

> > All the CPUs use the _same_ mm_struct in kernel space, so
> > all VM operations inside the kernel are effectively single 
> > threaded.
> 
> So what, the 3:1 has the same bottleneck too.

Not true, in the 3:1 split every process has its own
mm_struct and they all happen to share the top GB with
kernel stuff.  You can do a copy_to_user on multiple
CPUs efficiently.

> Or maybe you mean the page_table_lock held during copy-user that Andrew
> mentioned? (copy-user doesn't mean "all VM operations"; not sure if you
> meant this or the usual locking of every 2.4/2.6 kernel out there)

True, there are some other operations.  However, when
you consider the fact that copy-user operations are
needed for so many things, they are the big bottleneck.

Making it possible to copy things to and from userspace
in a lockless way will help performance quite a bit...
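
To make that concrete, a conceptual outline of where the extra
serialization comes from. This is not the actual 4:4 patch code, just a
sketch assuming a copy that fits in a single page; the function names
are illustrative only:

#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/string.h>
#include <linux/errno.h>
#include <asm/uaccess.h>

/* 3:1 split: the user mapping lives in the current page tables, so the
 * copy is a plain access with exception fixup and takes no VM lock.
 * Each process has its own mm/page tables, so copies scale across CPUs.
 * Returns the number of bytes not copied, like copy_to_user(). */
static unsigned long copy_out_31(void *uptr, const void *kbuf, size_t len)
{
	return copy_to_user(uptr, kbuf, len);
}

/* 4:4 split: kernel mode runs on page tables without the user mappings,
 * so the user page must first be looked up through the target mm
 * (mmap_sem here, and the page_table_lock inside get_user_pages), then
 * mapped temporarily -- that lookup is what serializes concurrent
 * copies on the same mm. */
static long copy_out_44_sketch(struct mm_struct *mm, void *uptr,
			       const void *kbuf, size_t len)
{
	struct page *page;
	void *kaddr;
	int ret;

	down_read(&mm->mmap_sem);
	ret = get_user_pages(current, mm, (unsigned long)uptr & PAGE_MASK,
			     1, 1 /* write */, 0, &page, NULL);
	up_read(&mm->mmap_sem);
	if (ret != 1)
		return -EFAULT;

	kaddr = kmap_atomic(page, KM_USER0);
	memcpy(kaddr + ((unsigned long)uptr & ~PAGE_MASK), kbuf, len);
	kunmap_atomic(kaddr, KM_USER0);

	set_page_dirty(page);	/* we wrote to user memory */
	put_page(page);
	return 0;
}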

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 100+ messages in thread

end of thread, other threads:[~2004-03-18 19:52 UTC | newest]

Thread overview: 100+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-02-27  1:33 2.4.23aa2 (bugfixes and important VM improvements for the high end) Andrea Arcangeli
2004-02-27  4:38 ` Rik van Riel
2004-02-27 17:32   ` Andrea Arcangeli
2004-02-27 19:08     ` Rik van Riel
2004-02-27 20:29       ` Andrew Morton
2004-02-27 20:49         ` Rik van Riel
2004-02-27 20:55           ` Andrew Morton
2004-02-27 21:28           ` Andrea Arcangeli
2004-02-27 21:37             ` Andrea Arcangeli
2004-02-28  3:22             ` Andrea Arcangeli
2004-03-01 11:10           ` Nikita Danilov
2004-02-27 21:15         ` Andrea Arcangeli
2004-02-27 22:03           ` Martin J. Bligh
2004-02-27 22:23             ` Andrew Morton
2004-02-28  2:32             ` Andrea Arcangeli
2004-02-28  4:57               ` Wim Coekaerts
2004-02-28  6:18                 ` Andrea Arcangeli
2004-02-28  6:45                   ` Martin J. Bligh
2004-02-28  7:05                     ` Andrea Arcangeli
2004-02-28  9:19                       ` Dave Hansen
2004-03-18  2:44                         ` Andrea Arcangeli
     [not found]                   ` <20040228061838.GO8834@dualathlon.random.suse.lists.linux.kernel>
2004-02-28 12:46                     ` Andi Kleen
2004-02-29  1:39                       ` Andrea Arcangeli
2004-02-29  2:29                         ` Andi Kleen
2004-02-29 16:34                           ` Andrea Arcangeli
2004-02-28  6:10               ` Martin J. Bligh
2004-02-28  6:43                 ` Andrea Arcangeli
2004-02-28  7:00                   ` Martin J. Bligh
2004-02-28  7:29                     ` Andrea Arcangeli
2004-02-28 14:55                       ` Rik van Riel
2004-02-28 15:06                         ` Arjan van de Ven
2004-02-29  1:43                         ` Andrea Arcangeli
     [not found]                           ` <1078370073.3403.759.camel@abyss.local>
2004-03-04  3:14                           ` Peter Zaitsev
2004-03-04  3:33                             ` Andrew Morton
2004-03-04  3:44                               ` Peter Zaitsev
2004-03-04  4:07                                 ` Andrew Morton
2004-03-04  4:44                                   ` Peter Zaitsev
2004-03-04  4:52                                   ` Andrea Arcangeli
2004-03-04  5:10                                     ` Andrew Morton
2004-03-04  5:27                                       ` Andrea Arcangeli
2004-03-04  5:38                                         ` Andrew Morton
2004-03-05 20:19                                       ` Jamie Lokier
2004-03-05 20:33                                         ` Andrea Arcangeli
2004-03-05 21:44                                           ` Jamie Lokier
2004-03-04 12:12                                     ` Rik van Riel
2004-03-04 16:21                                     ` Peter Zaitsev
2004-03-04 18:13                                       ` Andrea Arcangeli
2004-03-04 17:35                                   ` Martin J. Bligh
2004-03-04 18:16                                     ` Andrea Arcangeli
2004-03-04 19:31                                       ` Martin J. Bligh
2004-03-04 20:21                                     ` Peter Zaitsev
2004-03-05 10:33                                 ` Ingo Molnar
2004-03-05 14:15                                   ` Andrea Arcangeli
2004-03-05 14:32                                     ` Ingo Molnar
2004-03-05 14:58                                       ` Andrea Arcangeli
2004-03-05 15:26                                         ` Ingo Molnar
2004-03-05 15:53                                           ` Andrea Arcangeli
2004-03-07  8:41                                             ` Ingo Molnar
2004-03-07 10:29                                               ` Nick Piggin
2004-03-07 17:33                                                 ` Andrea Arcangeli
2004-03-08  5:15                                                   ` Nick Piggin
2004-03-07 17:24                                               ` Andrea Arcangeli
2004-03-05 21:28                                           ` Martin J. Bligh
2004-03-05 18:42                                         ` Martin J. Bligh
2004-03-05 19:13                                           ` Andrea Arcangeli
2004-03-05 19:55                                             ` Martin J. Bligh
2004-03-05 20:29                                               ` Andrea Arcangeli
2004-03-05 20:41                                                 ` Andrew Morton
2004-03-05 21:07                                                   ` Andrea Arcangeli
2004-03-05 22:12                                                     ` Andrew Morton
2004-03-05 14:34                                     ` Ingo Molnar
2004-03-05 14:59                                       ` Andrea Arcangeli
2004-03-05 15:02                                         ` Ingo Molnar
     [not found]                                           ` <20040305150225.GA13237@elte.hu.suse.lists.linux.kernel>
2004-03-05 15:51                                             ` Andi Kleen
2004-03-05 16:23                                               ` Ingo Molnar
2004-03-05 16:39                                                 ` Andrea Arcangeli
2004-03-07  8:16                                                   ` Ingo Molnar
2004-03-10 13:21                                                 ` Andi Kleen
2004-03-05 16:42                                                   ` Andrea Arcangeli
2004-03-05 16:49                                                   ` Ingo Molnar
2004-03-05 16:58                                                     ` Andrea Arcangeli
2004-03-05 20:11                                           ` Jamie Lokier
2004-03-06  5:12                                             ` Jamie Lokier
2004-03-06 12:56                                               ` Magnus Naeslund(t)
2004-03-06 13:13                                                 ` Magnus Naeslund(t)
2004-03-07 11:55                                             ` Ingo Molnar
2004-03-07  6:50                                           ` Peter Zaitsev
2004-03-02  9:10                 ` Kurt Garloff
2004-03-02 15:32                   ` Martin J. Bligh
2004-02-27 21:42         ` Hugh Dickins
2004-02-27 23:18         ` Marcelo Tosatti
2004-02-27 22:39           ` Andrew Morton
2004-02-27 20:31       ` Andrea Arcangeli
2004-02-29  6:34       ` Mike Fedyk
     [not found] <20040304175821.GO4922@dualathlon.random>
2004-03-04 22:14 ` Rik van Riel
2004-03-04 23:24   ` Andrea Arcangeli
2004-03-05  3:43     ` Rik van Riel
     [not found] <1u7eQ-6Bz-1@gated-at.bofh.it>
     [not found] ` <1ue6M-45w-11@gated-at.bofh.it>
     [not found]   ` <1uofN-4Rh-25@gated-at.bofh.it>
     [not found]     ` <1vRz3-5p2-11@gated-at.bofh.it>
     [not found]       ` <1vRSn-5Fc-11@gated-at.bofh.it>
     [not found]         ` <1vS26-5On-21@gated-at.bofh.it>
     [not found]           ` <1wkUr-3QW-11@gated-at.bofh.it>
     [not found]             ` <1wolx-7ET-31@gated-at.bofh.it>
     [not found]               ` <1woEM-7Yx-41@gated-at.bofh.it>
     [not found]                 ` <1wp8b-7x-3@gated-at.bofh.it>
     [not found]                   ` <1wp8l-7x-25@gated-at.bofh.it>
     [not found]                     ` <1x0qG-Dr-3@gated-at.bofh.it>
2004-03-12 21:15                       ` Andi Kleen
2004-03-18 19:50                         ` Peter Zaitsev
     [not found]               ` <1woEJ-7Yx-25@gated-at.bofh.it>
     [not found]                 ` <1wp8c-7x-5@gated-at.bofh.it>
     [not found]                   ` <1wprd-qI-21@gated-at.bofh.it>
     [not found]                     ` <1wpUz-Tw-21@gated-at.bofh.it>
     [not found]                       ` <1x293-2nT-7@gated-at.bofh.it>
2004-03-12 21:25                         ` Andi Kleen
