linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: Subtle MM bug
@ 2001-01-08  5:29 Wayne Whitney
  2001-01-08  5:42 ` Andi Kleen
  2001-01-08 17:16 ` Rik van Riel
  0 siblings, 2 replies; 95+ messages in thread
From: Wayne Whitney @ 2001-01-08  5:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: William A. Stein


On Sunday, January 7, 20001, Rik van Riel <riel@conectiva.com.br> wrote:

> Now if 2.4 has worse _performance_ than 2.2 due to one reason or
> another, that I'd like to hear about ;)

Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,
and as it is the usual workload on my little cluster of 3 machines, they
are all running 2.2.19pre:

The application is some mathematics computations (modular symbols) using a
package called MAGMA;  at times this requires very large matrices.  The
RSS can get up to 870MB; for some reason a MAGMA process under linux
thinks it has run out of memory at 870MB, regardless of the actual
memory/swap in the machine.  MAGMA is single-threaded.

The typical machine is a dual Intel box with 512MB RAM and 512MB swap.
There is no problem with just one MAGMA process, it just hits that 870MB
barrier and gracefully exits.  But if I do the following test, I notice
very different behaviour under 2.2 and 2.4:  while running 'top d 1' I
simultaneously launch two instances of a job which actually requires more
than 870MB of memory to complete.  So each instance will slowly grow in
RSS until it gets killed by OOM or hits that 870MB limit.

Under 2.2, everything proceeds smoothly: before physical RAM is exhausted,
top updates every second, and the jobs have all the CPU.  When swapping
kicks in, top updates every 1-2 seconds and lists most of the CPU as
'system' (kswapd), but I perceive not much loss of interactivity.
Eventually the 1GB of virtual memory is exhausted, the OOM killer kills
one of the MAGMA's, and the other runs till it hits the 870MB barrier and
exits.

But under 2.4, interactivity suffers as soon as physical RAM is exhausted.
Top only updates every 2-10 seconds, the load average hits 3-4, and top
reports the CPUs are 90% idle.  Eventually, the OOM killer kicks in and
all returns to normal.  For practical purposes, the machine is unusual
while swapping like this.

I have heard 'vmstat' mentioned here, so below is the output of a 'vmstat
1' concommitant with the test above (top and the two MAGMA jobs).  I would
be more than happy to provide any other relevant information about this.

I read the LKML via an archive that updates once a day, so please cc: me
if you would like a speedier response.  I wish I knew of a newsgroup
interface to the LKML, then I could read it more often :-).

Cheers,
Wayne


   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0  49180 447840    840  54104 269 969    84   244   76   236  10   4  86
 1  0  0  49180 443276    852  55972   0   0   470     0  163   150  15   2  83
 2  0  0  49180 440060    852  56292   0   0    80     0  115    60  93   1   6
 2  0  0  49180 438236    856  56292   0   0     1     0  107    53  99   1   0
 2  0  0  49180 429468    856  56392   0   0    25     0  109    16  99   0   0
 2  0  0  49180 421296    856  56392   0   0     0     0  104    13  98   2   0
 2  0  0  49180 421132    856  56392   0   0     0     0  108    53 100   0   0
 2  0  0  49180 421128    856  56392   0   0     0     0  108    47 100   0   0
 2  0  0  49180 397520    856  56392   0   0     0     1  107    49  96   4   0
 2  0  0  49180 364860    856  56392   0   0     0     0  106    47  95   5   0
 2  0  0  49180 332244    856  56392   0   0     0     0  106    49  95   5   0
 2  0  0  49180 299660    856  56392   0   0     0     0  106    54  92   8   0
 2  0  0  49180 267076    856  56392   0   0     0     0  109    56  95   5   0
 2  0  0  49180 234632    856  56392   0   0     0     0  110    57  94   6   0
 2  0  0  49180 202096    872  56448  32   0    18     0  117    70  95   5   0
 2  0  0  49180 169544    872  56448   0   0     0     0  103    13  96   4   0
 2  0  0  49180 137108    872  56448   0   0     0     0  107    49  93   7   0
 2  0  0  49180 104600    872  56448   0   0     0     0  107    51  94   6   0
 2  0  0  49180  72368    872  56448   0   0     0    52  136    54  93   7   0
 2  0  0  49180  39964    872  56448   0   0     0     0  110    59  92   8   0
 2  0  2   7296   1576     96  13072   0 720     0   184  130   465  74  22   4
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  2  2  53620   1564    116  23512 1012 31876   565  7969  883  3802   1   8  92
 2  1  2  68800   1560     96  20128  68 15396    17  3850  291  2775   1   7  92
 3  0  1  99484   1556     96  26096  84 29552    21  7388  594  3832   1   4  95
 1  3  2 114708   1560    104  32528 284 14696   161  3674  374  3125   0   4  96
 1  4  2 175484   1560    124  31112 360 63000   237 15753 1404 14952   1   5  94
 1  2  2 205900   1560     96  32748  12 30080     3  7520  606  8356   1   5  94
 2  1  2 221156   1560     96  17848 412 14256   103  3564  308  8450   1  10  89
 1  2  2 222128   1564     96  12736   0 16100     7  4025  346  1010   0   5  95
 1  2  2 236580   1560    108  15220 276 13988    97  3497  347  4102   0   7  92
 2  1  2 267488   1560    104  32044 260 17376    69  4346  405  1265   0   7  93
 3  1  1 282756   1560     96  29380  16 15304     4  3827  335  4359   1   7  92
 2  1  2 282756   1580     96  11460  92 14948    23  3737  332  4120   1   5  94
 2  1  2 313496   1560    100  30476 200 15484    54  3871  318  2359   0   9  90
 2  1  2 313496   1560    100  14148   0 13076     1  3270  246  5165   1   8  91
 3  1  1 344564   1572     96  23892  16 18444    11  4613  419  1555   0   7  93
 2  1  2 375020   1560     96  25400 172 26988    43  6747  556  2910   1   7  93
 1  2  2 375020   1968     96  22760   8 17136     2  4284  378   787   0   2  98
 2  1  2 406056   1568     96  20432 212 17320    53  4330  393  2704   1  10  89
 3  0  3 421316   1560     96  25056  72 14416    18  3604  281  1731   0   5  94
 1  3  0 452120   1544    100  21216 240 31480   116  7870  715  2681   1   6  94
 2  2  2 467488   1588    108  27248 440 15056   123  3765  385  2206   0   5  94
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2  1  0 467488   1564    136  13352  88 15376    49  3844  368  2913   1   4  95
 3  0  1 482864   1560     96  15256 128 15384    32  3846  296   986   1   7  92
 3  0  1 497920   1560     96  14144   0 12636     0  3159  245  2302   1   9  90
 3  1  1 529844   1540     96  18632 940 33340   569  8336 1104  1366   1  10  88
 0  1  0 269856 205944    148  21772 2628   0  1196     2  267   313   0   3  97
 0  1  0 269856 182736    156  33180 11180   0  2854     0  309   451   6   3  91
 0  1  0 269856 158668    156  44696 11516   0  2879     0  314   462  12   4  83
 0  1  0 269856 131928    156  57588 12892   0  3223     0  312   466   8   4  88
 0  1  0 269856 105176    156  70448 12864   0  3216     0  332   506  12   3  85
 0  1  0 269856  79056    156  82644 12196   0  3049     0  456   602  10   6  83
 1  1  0 269856  46948    156  96900 14252   0  3563     0  359   518  21   7  72

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08  5:29 Subtle MM bug Wayne Whitney
@ 2001-01-08  5:42 ` Andi Kleen
  2001-01-08  6:04   ` Linus Torvalds
  2001-01-08 17:16 ` Rik van Riel
  1 sibling, 1 reply; 95+ messages in thread
From: Andi Kleen @ 2001-01-08  5:42 UTC (permalink / raw)
  To: Wayne Whitney; +Cc: linux-kernel, William A. Stein

On Sun, Jan 07, 2001 at 09:29:29PM -0800, Wayne Whitney wrote:
> The application is some mathematics computations (modular symbols) using a
> package called MAGMA;  at times this requires very large matrices.  The
> RSS can get up to 870MB; for some reason a MAGMA process under linux
> thinks it has run out of memory at 870MB, regardless of the actual
> memory/swap in the machine.  MAGMA is single-threaded.

I think it's caused by the way malloc maps its memory. 
Newer glibc should work a bit better by falling back to mmap even for smaller
allocations (older does it only for very big ones) 



-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08  5:42 ` Andi Kleen
@ 2001-01-08  6:04   ` Linus Torvalds
  2001-01-08 17:44     ` Rik van Riel
  0 siblings, 1 reply; 95+ messages in thread
From: Linus Torvalds @ 2001-01-08  6:04 UTC (permalink / raw)
  To: linux-kernel

In article <20010108064225.B29026@gruyere.muc.suse.de>,
Andi Kleen  <ak@suse.de> wrote:
>On Sun, Jan 07, 2001 at 09:29:29PM -0800, Wayne Whitney wrote:
>> The application is some mathematics computations (modular symbols) using a
>> package called MAGMA;  at times this requires very large matrices.  The
>> RSS can get up to 870MB; for some reason a MAGMA process under linux
>> thinks it has run out of memory at 870MB, regardless of the actual
>> memory/swap in the machine.  MAGMA is single-threaded.
>
>I think it's caused by the way malloc maps its memory. 
>Newer glibc should work a bit better by falling back to mmap even for smaller
>allocations (older does it only for very big ones) 

That doesn't resolve the "2.4.x behaves badly" thing, though.

I've seen that one myself, and it seems to be simply due to the fact
that we're usually so good at gettign memory from page_launder() that we
never bother to try to swap stuff out. And when we _do_ start swapping
stuff out it just moves to the dirty list, and page_launder() will take
care of it.

So far so good. The problem appears to be that we don't swap stuff out
smoothly: we start doing the VM scanning, but when we get enough dirty
pages, we'll let it be, and go back to page_launder() again. Which means
that we don't walk theough the whole VM space, we just do some "spot
cleaning".

		Linus 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08  5:29 Subtle MM bug Wayne Whitney
  2001-01-08  5:42 ` Andi Kleen
@ 2001-01-08 17:16 ` Rik van Riel
  2001-01-08 17:58   ` Linus Torvalds
  2001-01-08 21:30   ` Wayne Whitney
  1 sibling, 2 replies; 95+ messages in thread
From: Rik van Riel @ 2001-01-08 17:16 UTC (permalink / raw)
  To: Wayne Whitney; +Cc: linux-kernel, Linus Torvalds, William A. Stein

On Sun, 7 Jan 2001, Wayne Whitney wrote:

> Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,

> The typical machine is a dual Intel box with 512MB RAM and 512MB swap.

How does 2.4 perform when you add an extra GB of swap ?

2.4 keeps dirty pages in the swap cache, so you will need
more swap to run the same programs...

Linus: is this something we want to keep or should we give
the user the option to run in a mode where swap space is
freed when we swap in something non-shared ?

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08  6:04   ` Linus Torvalds
@ 2001-01-08 17:44     ` Rik van Riel
  2001-01-08 18:02       ` Linus Torvalds
  0 siblings, 1 reply; 95+ messages in thread
From: Rik van Riel @ 2001-01-08 17:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On 7 Jan 2001, Linus Torvalds wrote:

> That doesn't resolve the "2.4.x behaves badly" thing, though.
> 
> I've seen that one myself, and it seems to be simply due to the
> fact that we're usually so good at gettign memory from
> page_launder() that we never bother to try to swap stuff out.
> And when we _do_ start swapping stuff out it just moves to the
> dirty list, and page_launder() will take care of it.
> 
> So far so good. The problem appears to be that we don't swap
> stuff out smoothly: we start doing the VM scanning, but when we
> get enough dirty pages, we'll let it be, and go back to
> page_launder() again. Which means that we don't walk theough the
> whole VM space, we just do some "spot cleaning".

You are right in that we need to refill the inactive list
before calling page_launder(), but we'll also need a few
other modifications:

1. adopt the latest FreeBSD tactic in page_launder()
	- mark dirty pages we see but don't flush
	- in the first loop, flush up to maxlaunder of the
	  already seen dirty pages
	- in the second loop, flush as many pages as we
	  need to refill the free&inactive_clean list

2. go back to having a _static_ free target, at
   max(freepages.high, SUM(zone->pages_high) ... this
   means free_shortage() will never be very big

3. keep track of how many pages we need to free in
   page_launder() and substract one from the target
   when we submit a page for IO ... no need to flush
   20MB of dirty pages when we only need 1MB pages
   cleaned

I have these things in my local tree and it seems to smooth
out the load quite well for a very large haskell run and for
the fillmem program from Juan Quintela's memtest suite.

When combined with your idea of refilling the freelist _first_,
we should be able to get the VM quite a bit smoother under loads
with lots of dirty pages.

I will work on this while travelling to and being in Australia.
Expect a clean patch to fix this problem once the 2.4 bugfix-only
period is over.

Other people on this list are invited to apply the VM patches from
my home page and give them a good beating. I want to be able to
submit a well-tested, known-good patch to Linus once 2.4 is out of
the bugfix-only period...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08 17:16 ` Rik van Riel
@ 2001-01-08 17:58   ` Linus Torvalds
  2001-01-08 23:41     ` Zlatko Calusic
  2001-01-08 21:30   ` Wayne Whitney
  1 sibling, 1 reply; 95+ messages in thread
From: Linus Torvalds @ 2001-01-08 17:58 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Wayne Whitney, linux-kernel, William A. Stein



On Mon, 8 Jan 2001, Rik van Riel wrote:

> On Sun, 7 Jan 2001, Wayne Whitney wrote:
> 
> > Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,
> 
> > The typical machine is a dual Intel box with 512MB RAM and 512MB swap.
> 
> How does 2.4 perform when you add an extra GB of swap ?
> 
> 2.4 keeps dirty pages in the swap cache, so you will need
> more swap to run the same programs...
> 
> Linus: is this something we want to keep or should we give
> the user the option to run in a mode where swap space is
> freed when we swap in something non-shared ?

I'd prefer just documenting it and keeping it. I'd hate to have two fairly
different modes of behaviour. It's always been the suggested "twice the
amount of RAM", although there's historically been the "Linux doesn't
really need that much" that we just killed with 2.4.x.

If you have 512MB or RAM, you can probably afford another 40GB or so of
harddisk. They are disgustingly cheap these days.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08 17:44     ` Rik van Riel
@ 2001-01-08 18:02       ` Linus Torvalds
  0 siblings, 0 replies; 95+ messages in thread
From: Linus Torvalds @ 2001-01-08 18:02 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel



On Mon, 8 Jan 2001, Rik van Riel wrote:
> 
> You are right in that we need to refill the inactive list
> before calling page_launder(), but we'll also need a few
> other modifications:

NONE of your three additions do _anything_ to help us at all if we don't
even see the dirty bit because the page is on the active list and the
dirty bit is in somebodys VM space.

I agree that they look ok, but they are all complicating the code. I
propose getting rid of complications, and getting rid of the precarious
"when do we actually scan the VM tables" balancing issue.

Quite frankly, I'd rather see somebody try the vmscan stuff FIRST. Your
suggestions look fine, but apart from the "let dirty pages go twice
through the list" they look like tweaks that would need re-tweaking after
the balancing stuff is ripped out.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08 17:16 ` Rik van Riel
  2001-01-08 17:58   ` Linus Torvalds
@ 2001-01-08 21:30   ` Wayne Whitney
  1 sibling, 0 replies; 95+ messages in thread
From: Wayne Whitney @ 2001-01-08 21:30 UTC (permalink / raw)
  To: Rik van Riel; +Cc: LKML, Linus Torvalds, William A. Stein

On Mon, 8 Jan 2001, Rik van Riel wrote:

> How does 2.4 perform when you add an extra GB of swap ?

OK, some more data:

First, I tried booting 2.4.0 with "nosmp" to see if the behavior I observe
is SMP related.  It isn't, there was no difference under 2.4.0 between
512MB/512MB/1CPU and 512MB/512MB/2CPUs.

Second, I tried going to 2GB of swap with 2.4.0, so 512MB/2GB/2CPUs.
Again, there is no difference:  as soon as swapping begins with two MAGMA
processes, interactivity suffers.  I notice that while swapping in this
situation, the HD light is blinking only intermittently.

I also tried logging in to a fourth VT during this second test, and it got
nowhere.  In fact, this stopped the top updates completely and the HD
light also stopped.  After 30 seconds of nothing (all I could do is switch
VT's), I gave up and sent a ^Z to one MAGMA process; this eventually was
received, and the system immediately recovered.

Perhaps there is some sort of I/O starvation triggered by two swapping
processes?

Again, under 2.2.19pre6, the exact same tests yield hardly any loss of
interactivity, I can log in fine (a little slowly) during the top / two
MAGMA process test.  And once swapping begins, the HD light is continually
lit.

Again, I'd be happy to do any additional tests, provide more info about my
machine, etc.

Cheers,
Wayne




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08 17:58   ` Linus Torvalds
@ 2001-01-08 23:41     ` Zlatko Calusic
  2001-01-09  2:58       ` Linus Torvalds
  2001-01-09  6:20       ` Eric W. Biederman
  0 siblings, 2 replies; 95+ messages in thread
From: Zlatko Calusic @ 2001-01-08 23:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> On Mon, 8 Jan 2001, Rik van Riel wrote:
> 
> > On Sun, 7 Jan 2001, Wayne Whitney wrote:
> > 
> > > Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,
> > 
> > > The typical machine is a dual Intel box with 512MB RAM and 512MB swap.
> > 
> > How does 2.4 perform when you add an extra GB of swap ?
> > 
> > 2.4 keeps dirty pages in the swap cache, so you will need
> > more swap to run the same programs...
> > 
> > Linus: is this something we want to keep or should we give
> > the user the option to run in a mode where swap space is
> > freed when we swap in something non-shared ?
> 
> I'd prefer just documenting it and keeping it. I'd hate to have two fairly
> different modes of behaviour. It's always been the suggested "twice the
> amount of RAM", although there's historically been the "Linux doesn't
> really need that much" that we just killed with 2.4.x.
> 
> If you have 512MB or RAM, you can probably afford another 40GB or so of
> harddisk. They are disgustingly cheap these days.
> 

Yes, but a lot more data on the swap also means degraded performance,
because the disk head has to seek around in the much bigger area. Are
you sure this is all OK?
-- 
Zlatko
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08 23:41     ` Zlatko Calusic
@ 2001-01-09  2:58       ` Linus Torvalds
  2001-01-09  6:20       ` Eric W. Biederman
  1 sibling, 0 replies; 95+ messages in thread
From: Linus Torvalds @ 2001-01-09  2:58 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Rik van Riel, linux-kernel



On 9 Jan 2001, Zlatko Calusic wrote:
> 
> Yes, but a lot more data on the swap also means degraded performance,
> because the disk head has to seek around in the much bigger area. Are
> you sure this is all OK?

Yes and no.

I'm not _sure_, obviously.

However, one thing I _am_ sure of is that the sticky page-cache simplifies
some things enormously, and make some things possible that simply weren't
possible before. 

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08 23:41     ` Zlatko Calusic
  2001-01-09  2:58       ` Linus Torvalds
@ 2001-01-09  6:20       ` Eric W. Biederman
  2001-01-09  7:27         ` Linus Torvalds
  1 sibling, 1 reply; 95+ messages in thread
From: Eric W. Biederman @ 2001-01-09  6:20 UTC (permalink / raw)
  To: zlatko; +Cc: Linus Torvalds, Rik van Riel, linux-kernel

Zlatko Calusic <zlatko@iskon.hr> writes:

> 
> Yes, but a lot more data on the swap also means degraded performance,
> because the disk head has to seek around in the much bigger area. Are
> you sure this is all OK?

I don't think we have more data on the swap, just more data has an
allocated home on the swap.  With the earlier allocation we should
(I haven't verified) allocate contiguous chunks of memory contiguously
on the swap.   And reusing the same swap pages helps out with this.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-09  6:20       ` Eric W. Biederman
@ 2001-01-09  7:27         ` Linus Torvalds
  2001-01-09 11:38           ` Eric W. Biederman
  2001-01-09 12:29           ` Zlatko Calusic
  0 siblings, 2 replies; 95+ messages in thread
From: Linus Torvalds @ 2001-01-09  7:27 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: zlatko, Rik van Riel, linux-kernel



On 8 Jan 2001, Eric W. Biederman wrote:

> Zlatko Calusic <zlatko@iskon.hr> writes:> 
> > 
> > Yes, but a lot more data on the swap also means degraded performance,
> > because the disk head has to seek around in the much bigger area. Are
> > you sure this is all OK?
> 
> I don't think we have more data on the swap, just more data has an
> allocated home on the swap.

I think Zlatko's point is that because of the extra allocations, we will
have worse locality (more seeks etc). 

Clearly we should not actually do any more actual IO. But the sticky
allocation _might_ make the IO we do be more spread out.

To offset that, I think the sticky allocation makes us much better able to
handle things like clustering etc more intelligently, which is why I think
it's very much worth it.  But let's not close our eyes to potential
downsides.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-09  7:27         ` Linus Torvalds
@ 2001-01-09 11:38           ` Eric W. Biederman
  2001-01-09 12:29           ` Zlatko Calusic
  1 sibling, 0 replies; 95+ messages in thread
From: Eric W. Biederman @ 2001-01-09 11:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: zlatko, Rik van Riel, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> On 8 Jan 2001, Eric W. Biederman wrote:
> 
> > Zlatko Calusic <zlatko@iskon.hr> writes:> 
> > > 
> > > Yes, but a lot more data on the swap also means degraded performance,
> > > because the disk head has to seek around in the much bigger area. Are
> > > you sure this is all OK?
> > 
> > I don't think we have more data on the swap, just more data has an
> > allocated home on the swap.
> 
> I think Zlatko's point is that because of the extra allocations, we will
> have worse locality (more seeks etc). 
> 
> Clearly we should not actually do any more actual IO. But the sticky
> allocation _might_ make the IO we do be more spread out.

The tradeoff when implemented correctly is that writes will tend to be
more spread out and reads should be better clustered together. 

> To offset that, I think the sticky allocation makes us much better able to
> handle things like clustering etc more intelligently, which is why I think
> it's very much worth it.  But let's not close our eyes to potential
> downsides.

Certainly, keeping ours eyes open is a good a good thing.

But it has been apparent for a long time that by doing allocation as
we were doing it, that when it came to heavy swapping we were taking a
performance hit.  So I'm relieved that we are now being more aggressive.

>From the sounds of it what we are currently doing actually sucks worse
for some heavy loads.  But it still feels like the right direction.

It's been my impression that work loads where we are actively swapping
are a lot different from work loads where we really don't swap.  To
the extent that it might make sense to make the actively swapping case
a config option to get our attention in the code.  It would be nice
to have a linux kernel for once that handles heavy swapping (below
the level of thrashing) gracefully. :)

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-09  7:27         ` Linus Torvalds
  2001-01-09 11:38           ` Eric W. Biederman
@ 2001-01-09 12:29           ` Zlatko Calusic
  2001-01-09 18:47             ` Linus Torvalds
  1 sibling, 1 reply; 95+ messages in thread
From: Zlatko Calusic @ 2001-01-09 12:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eric W. Biederman, Rik van Riel, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> On 8 Jan 2001, Eric W. Biederman wrote:
> 
> > Zlatko Calusic <zlatko@iskon.hr> writes:> 
> > > 
> > > Yes, but a lot more data on the swap also means degraded performance,
> > > because the disk head has to seek around in the much bigger area. Are
> > > you sure this is all OK?
> > 
> > I don't think we have more data on the swap, just more data has an
> > allocated home on the swap.
> 
> I think Zlatko's point is that because of the extra allocations, we will
> have worse locality (more seeks etc).

Yes that was my concern.

But in the end I'm not sure. I made two simple tests and haven't found
any problems with 2.4.0 mm logic (opposed to 2.2.17). In fact, the new
kernel was faster in the more interesting (make -j32) test.

Also I have found that new kernel allocates 4 times more swap space
under some circumstances. That may or may not be alarming, it remains
to be seen.

-- 
Zlatko
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-09 12:29           ` Zlatko Calusic
@ 2001-01-09 18:47             ` Linus Torvalds
  2001-01-09 19:09               ` Daniel Phillips
                                 ` (2 more replies)
  0 siblings, 3 replies; 95+ messages in thread
From: Linus Torvalds @ 2001-01-09 18:47 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Eric W. Biederman, Rik van Riel, linux-kernel



On 9 Jan 2001, Zlatko Calusic wrote:
> 
> But in the end I'm not sure. I made two simple tests and haven't found
> any problems with 2.4.0 mm logic (opposed to 2.2.17). In fact, the new
> kernel was faster in the more interesting (make -j32) test.

I personally think 2.4.x is going to be as fast or faster at just about
anything. We do have some MM issues still to hash out, and tuning to do,
but I'm absolutely convinced that 2.4.x is going to be a _lot_ easier to
tune than 2.2.x ever was. The "scan the page tables without doing any IO"
thing just makes the 2.4.x memory management several orders of magnitude
more flexible than 2.2.x ever was.

(This is why I worked so hard at getting the PageDirty semantics right in
the last two months or so - and why I released 2.4.0 when I did. Getting
PageDirty right was the big step to make all of the VM stuff possible in
the first place. Even if it probably looked a bit foolhardy to change the
semantics of "writepage()" quite radically just before 2.4 was released).

> Also I have found that new kernel allocates 4 times more swap space
> under some circumstances. That may or may not be alarming, it remains
> to be seen.

Yes. The new VM will allocate the swap space a _lot_ more aggressively.
Many of those allocations will not necessarily ever actually be used, but
the fact that we _have_ allocated backing store for a page is what allows
us to drop it from the VM page tables, so that it can be processed by
page_launder().

And this _is_ a downside, there's no question about it. There's the worry
about the potential loss of locality, but there's also the fact that you
effectively need a bigger swap partition with 2.4.x - never mind that
large portions of the allocations may never be used. You still need the
disk space for good VM behaviour.

There are always trade-offs, I think the 2.4.x tradeoff is a good one.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-09 18:47             ` Linus Torvalds
@ 2001-01-09 19:09               ` Daniel Phillips
  2001-01-09 19:29                 ` Trond Myklebust
                                   ` (2 more replies)
  2001-01-09 19:53               ` Simon Kirby
  2001-01-10  1:45               ` David Woodhouse
  2 siblings, 3 replies; 95+ messages in thread
From: Daniel Phillips @ 2001-01-09 19:09 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

Linus Torvalds wrote:
> (This is why I worked so hard at getting the PageDirty semantics right in
> the last two months or so - and why I released 2.4.0 when I did. Getting
> PageDirty right was the big step to make all of the VM stuff possible in
> the first place. Even if it probably looked a bit foolhardy to change the
> semantics of "writepage()" quite radically just before 2.4 was released).

On the topic of writepage, it's not symmetric with readpage at the
moment - it still takes (struct file *).  Is this in the cleanup
pipeline?  It looks like nfs_readpage already ignores the struct file *,
but maybe some other net filesystems are still depending on it.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-09 19:09               ` Daniel Phillips
@ 2001-01-09 19:29                 ` Trond Myklebust
  2001-01-10 17:32                   ` Andi Kleen
  2001-01-09 19:37                 ` Linus Torvalds
  2001-01-17  8:46                 ` Rik van Riel
  2 siblings, 1 reply; 95+ messages in thread
From: Trond Myklebust @ 2001-01-09 19:29 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linus Torvalds, linux-kernel

>>>>> " " == Daniel Phillips <phillips@innominate.de> writes:

     > Linus Torvalds wrote:
    >> (This is why I worked so hard at getting the PageDirty
    >> semantics right in the last two months or so - and why I
    >> released 2.4.0 when I did. Getting PageDirty right was the big
    >> step to make all of the VM stuff possible in the first
    >> place. Even if it probably looked a bit foolhardy to change the
    >> semantics of "writepage()" quite radically just before 2.4 was
    >> released).

     > On the topic of writepage, it's not symmetric with readpage at
     > the moment - it still takes (struct file *).  Is this in the
     > cleanup pipeline?  It looks like nfs_readpage already ignores
     > the struct file *, but maybe some other net filesystems are
     > still depending on it.

NO! We definitely want to pass the struct file down to nfs_readpage()
when it's available.

Al has mentioned that he wants us to move towards a *BSD-like system
of credentials (i.e. struct ucred) that could be used here, but that's
in the far future. In the meantime, we cache RPC credentials in the
struct file...

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-09 19:09               ` Daniel Phillips
  2001-01-09 19:29                 ` Trond Myklebust
@ 2001-01-09 19:37                 ` Linus Torvalds
  2001-01-17  8:46                 ` Rik van Riel
  2 siblings, 0 replies; 95+ messages in thread
From: Linus Torvalds @ 2001-01-09 19:37 UTC (permalink / raw)
  To: linux-kernel

In article <3A5B61F7.FB0E79C1@innominate.de>,
Daniel Phillips  <phillips@innominate.de> wrote:
>Linus Torvalds wrote:
>> (This is why I worked so hard at getting the PageDirty semantics right in
>> the last two months or so - and why I released 2.4.0 when I did. Getting
>> PageDirty right was the big step to make all of the VM stuff possible in
>> the first place. Even if it probably looked a bit foolhardy to change the
>> semantics of "writepage()" quite radically just before 2.4 was released).
>
>On the topic of writepage, it's not symmetric with readpage at the
>moment - it still takes (struct file *).  Is this in the cleanup
>pipeline?  It looks like nfs_readpage already ignores the struct file *,
>but maybe some other net filesystems are still depending on it.

readpage() is always a synchronous operation, and is actually much more
closely linked to "prepare_write()"/"commit_write()" than to writepage,
despite the naming similarities.

So no, the two are not symmetric, and they really shouldn't be. 

"readpage()" is for reading a page into the page cache, and is always
synchronous with the reader (even prefetching is "synchronous" in the
sense that it's done by the reader: it's asynchronous in the sense that
we don't wait for the results, but the _calling_ of readpage() is
synchronous, if you see what I mean).

Similarly, prepare_write() and commit_write() are synchronous to the
writer (again, we do not wait for the writes to have actually
_happened_, but we call the functions synchronously and they can choose
to let the actual IO happen asynchronously - the VM doesn't care about
that small detail). 

So "readpage()" and "prepare_write()/commit_write()" are pairs.  They
are different simply because reading is assumed to be a cacheable and
prefetchable operation (think regular CPU caches), while writing
obviously has to give a much stricter "write _these_ bytes, not the
whole cache line". 

In contrast, writepage() is a completely different animal. It's
basically a cache eviction notice, and happens asynchronously to any
operations that actually fill or dirty the cache. So despite the name,
it really as an operation has absolutely nothing in common with
readpage(), other than the fact that it is supposed to obviously do the
IO associated with the name.

Writepage has a friend in "sync_page()", which is another asynchronous
call-back that basically says "we want you to start your IO _now_". It's
similar to "writepage()" in that it's a kind of cache state
notification: while writepage() notifies that the cached page wants to
be evicted, "sync_page()" notifies that the cached page is waited upon
by somebody else and that we want to speed up any background IO on it.

You'll notice that writepage()/sync_page() have similar calling
convention, while readpage/prepare_write/commit_write have similar
calling conventions.

The one operation that _really_ stands out is "bmap()".  It has
absolutely no calling convention at all, and is not symmetric with
anything. Pretty ugly, but easily supported.

			Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-09 18:47             ` Linus Torvalds
  2001-01-09 19:09               ` Daniel Phillips
@ 2001-01-09 19:53               ` Simon Kirby
  2001-01-09 20:08                 ` Linus Torvalds
  2001-01-09 20:10                 ` Zlatko Calusic
  2001-01-10  1:45               ` David Woodhouse
  2 siblings, 2 replies; 95+ messages in thread
From: Simon Kirby @ 2001-01-09 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel

On Tue, Jan 09, 2001 at 10:47:57AM -0800, Linus Torvalds wrote:

> And this _is_ a downside, there's no question about it. There's the worry
> about the potential loss of locality, but there's also the fact that you
> effectively need a bigger swap partition with 2.4.x - never mind that
> large portions of the allocations may never be used. You still need the
> disk space for good VM behaviour.
> 
> There are always trade-offs, I think the 2.4.x tradeoff is a good one.

Hmm, perhaps you could clarify...

For boxes that rarely ever use swap with 2.2, will they now need more
swap space on 2.4 to perform well, or just boxes which don't have enough
RAM to handle everything nicely?

I've always been tending to make swap partitions smaller lately, as it
helps in the case where we have to wait for a runaway process to eat up
all of the swap space before it gets killed.  Making the swap size
smaller speeds up the time it takes for this to happen, albeit something
which isn't supposed to happen anyway.

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[       sim@stormix.com       ][       sim@netnation.com        ]
[ Opinions expressed are not necessarily those of my employers. ]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-09 19:53               ` Simon Kirby
@ 2001-01-09 20:08                 ` Linus Torvalds
  2001-01-09 20:10                 ` Zlatko Calusic
  1 sibling, 0 replies; 95+ messages in thread
From: Linus Torvalds @ 2001-01-09 20:08 UTC (permalink / raw)
  To: Simon Kirby; +Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel



On Tue, 9 Jan 2001, Simon Kirby wrote:
>
> On Tue, Jan 09, 2001 at 10:47:57AM -0800, Linus Torvalds wrote:
> 
> > And this _is_ a downside, there's no question about it. There's the worry
> > about the potential loss of locality, but there's also the fact that you
> > effectively need a bigger swap partition with 2.4.x - never mind that
> > large portions of the allocations may never be used. You still need the
> > disk space for good VM behaviour.
> > 
> > There are always trade-offs, I think the 2.4.x tradeoff is a good one.
> 
> Hmm, perhaps you could clarify...
> 
> For boxes that rarely ever use swap with 2.2, will they now need more
> swap space on 2.4 to perform well, or just boxes which don't have enough
> RAM to handle everything nicely?

If you don't have any swap, or if you run out of swap, the major
difference between 2.2.x and 2.4.x is probably going to be the oom
handling: I suspect that 2.4.x might be more likely to kill things off
sooner (but it tries to be graceful about which processes to kill).

Not having any swap is going to be a performance issue for both 2.2.x and
2.4.x - Linux likes to push inactive dirty pages out to swap where they
can lie around without bothering anybody, even if there is no _major_
memory crunch going on.

If you do have swap, but it's smaller than your available physical RAM, I
suspect that the Linux-2.4 swap pre-allocate may cause that kind of
performance degradation earlier than 2.2.x would have. Another way of
putting this: in 2.2.x you could use a fairly small swap partition to pick
up some of the slack, and in 2.4.x a really small swap-partition doesn't
really buy you much anything.

> I've always been tending to make swap partitions smaller lately, as it
> helps in the case where we have to wait for a runaway process to eat up
> all of the swap space before it gets killed.  Making the swap size
> smaller speeds up the time it takes for this to happen, albeit something
> which isn't supposed to happen anyway.

Yes, that kind of swap size tuning will still work in 2.4.x, but the sizes
you tune for would be different, I'm afraid. If you have, say, 128MB or
RAM, and you used to make a smallish partition of 64MB for "slop" in
2.2.x, I really suspect that you might like to increase it to 128MB or
196MB.

Of course, if you really only used your swap for "slop", I don't think
you'll necessarily notice the difference.

NOTE! The above guide-lines are pure guesses. The machines I use have had
big swap-partitions or none at all, so I think we'll just have to wait and
see.

			Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-09 19:53               ` Simon Kirby
  2001-01-09 20:08                 ` Linus Torvalds
@ 2001-01-09 20:10                 ` Zlatko Calusic
  1 sibling, 0 replies; 95+ messages in thread
From: Zlatko Calusic @ 2001-01-09 20:10 UTC (permalink / raw)
  To: Simon Kirby; +Cc: Linus Torvalds, Eric W. Biederman, Rik van Riel, linux-kernel

Simon Kirby <sim@stormix.com> writes:

> On Tue, Jan 09, 2001 at 10:47:57AM -0800, Linus Torvalds wrote:
> 
> > And this _is_ a downside, there's no question about it. There's the worry
> > about the potential loss of locality, but there's also the fact that you
> > effectively need a bigger swap partition with 2.4.x - never mind that
> > large portions of the allocations may never be used. You still need the
> > disk space for good VM behaviour.
> > 
> > There are always trade-offs, I think the 2.4.x tradeoff is a good one.
> 
> Hmm, perhaps you could clarify...
> 
> For boxes that rarely ever use swap with 2.2, will they now need more
> swap space on 2.4 to perform well, or just boxes which don't have enough
> RAM to handle everything nicely?
>

Just boxes that were already short on memory (swapped a lot) will need
more swap, empirically up to 4 times as much. If you already had
enough memory than things will stay almost the same for you.

But anyway, after some testing I've done recently I would now not
recommend anybody to have less than 2 x RAM size swap partition.

> I've always been tending to make swap partitions smaller lately, as it
> helps in the case where we have to wait for a runaway process to eat up
> all of the swap space before it gets killed.  Making the swap size
> smaller speeds up the time it takes for this to happen, albeit something
> which isn't supposed to happen anyway.
> 

Well, if you continue with that practice now you will be even more
successful in killing such processes, I would say. :)
-- 
Zlatko
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-09 18:47             ` Linus Torvalds
  2001-01-09 19:09               ` Daniel Phillips
  2001-01-09 19:53               ` Simon Kirby
@ 2001-01-10  1:45               ` David Woodhouse
  2001-01-10  2:26                 ` Andrea Arcangeli
  2001-01-10  6:57                 ` Linus Torvalds
  2 siblings, 2 replies; 95+ messages in thread
From: David Woodhouse @ 2001-01-10  1:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel

On Tue, 9 Jan 2001, Linus Torvalds wrote:

> And this _is_ a downside, there's no question about it. There's the worry
> about the potential loss of locality, but there's also the fact that you
> effectively need a bigger swap partition with 2.4.x - never mind that
> large portions of the allocations may never be used. You still need the
> disk space for good VM behaviour.
>
> There are always trade-offs, I think the 2.4.x tradeoff is a good one.

How does this affect embedded systems with no swap space at all?

-- 
dwmw2


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10  1:45               ` David Woodhouse
@ 2001-01-10  2:26                 ` Andrea Arcangeli
  2001-01-10  6:57                 ` Linus Torvalds
  1 sibling, 0 replies; 95+ messages in thread
From: Andrea Arcangeli @ 2001-01-10  2:26 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, Zlatko Calusic, Eric W. Biederman, Rik van Riel,
	linux-kernel

On Wed, Jan 10, 2001 at 01:45:47AM +0000, David Woodhouse wrote:
> How does this affect embedded systems with no swap space at all?

If there's no swap the swap-cache dirty-sticky issue can't arise.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10  1:45               ` David Woodhouse
  2001-01-10  2:26                 ` Andrea Arcangeli
@ 2001-01-10  6:57                 ` Linus Torvalds
  2001-01-10 11:46                   ` David Woodhouse
  1 sibling, 1 reply; 95+ messages in thread
From: Linus Torvalds @ 2001-01-10  6:57 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel



On Wed, 10 Jan 2001, David Woodhouse wrote:
> 
> How does this affect embedded systems with no swap space at all?

The no-swap behaviour shoul dactually be pretty much identical, simply
because both 2.2 and 2.4 will do the same thing: just skip dirty pages in
the page tables because they cannot do anything about them.

That said, the _other_ VM differences in 2.4.x may obviously make a
difference, just not the sticky swap cache one..

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10  6:57                 ` Linus Torvalds
@ 2001-01-10 11:46                   ` David Woodhouse
  2001-01-10 14:56                     ` Andrea Arcangeli
  2001-01-10 17:03                     ` Linus Torvalds
  0 siblings, 2 replies; 95+ messages in thread
From: David Woodhouse @ 2001-01-10 11:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel


torvalds@transmeta.com said:
>  The no-swap behaviour shoul dactually be pretty much identical,
> simply because both 2.2 and 2.4 will do the same thing: just skip
> dirty pages in the page tables because they cannot do anything about
> them. 

So the VM code spends a fair amount of time scanning lists of pages which 
it really can't do anything about?

Would it be possible to put such pages on different list, so that the VM
code doesn't have to keep skipping them?

(forgive me if I'm displaying my utter ignorance of the VM code here)

--
dwmw2


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 11:46                   ` David Woodhouse
@ 2001-01-10 14:56                     ` Andrea Arcangeli
  2001-01-10 17:46                       ` Eric W. Biederman
  2001-01-10 17:03                     ` Linus Torvalds
  1 sibling, 1 reply; 95+ messages in thread
From: Andrea Arcangeli @ 2001-01-10 14:56 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, Zlatko Calusic, Eric W. Biederman, Rik van Riel,
	linux-kernel

On Wed, Jan 10, 2001 at 11:46:03AM +0000, David Woodhouse wrote:
> So the VM code spends a fair amount of time scanning lists of pages which 
> it really can't do anything about?

Yes.

> Would it be possible to put such pages on different list, so that the VM

Currently to unmap the other pages we have to waste time on those unfreeable
pages as well.

Once I or other developer finishes with the reverse lookup from page to
pte-chain (an implementation from DaveM just exists) we'll be able to put them
in a separate lru, but it's certainly not a 2.4.1-pre2 thing.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 11:46                   ` David Woodhouse
  2001-01-10 14:56                     ` Andrea Arcangeli
@ 2001-01-10 17:03                     ` Linus Torvalds
  2001-01-11 14:36                       ` Jim Gettys
  1 sibling, 1 reply; 95+ messages in thread
From: Linus Torvalds @ 2001-01-10 17:03 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel



On Wed, 10 Jan 2001, David Woodhouse wrote:

> 
> torvalds@transmeta.com said:
> >  The no-swap behaviour shoul dactually be pretty much identical,
> > simply because both 2.2 and 2.4 will do the same thing: just skip
> > dirty pages in the page tables because they cannot do anything about
> > them. 
> 
> So the VM code spends a fair amount of time scanning lists of pages which 
> it really can't do anything about?

It can do _tons_ of stuff.

Remember, on platforms like this, one of the reasons for being low on
memory is things like running X and netscape: maybe you have 64MB of RAM
and you don't think you need a swap device, and you want to have a web
browser.

The fact that we cannot touch _dirty_ pages doesn't mean that there's
nothing to do: instead of running out of memory we can at least make the
machine usable by dropping the text pages and the page cache..

> Would it be possible to put such pages on different list, so that the VM
> code doesn't have to keep skipping them?

If we don't have any swapspace, the dirty pages will not be on any lists:
they will never have exited the page tables, and they will just be dirty
anonymous, unlisted pages.

We'll still scan the page tables (and see them there), but we have to do
that to find the clean and unreferenced pages - we don't have separate
"dirty page tables" and "clean page tables" ;)

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-09 19:29                 ` Trond Myklebust
@ 2001-01-10 17:32                   ` Andi Kleen
  2001-01-10 19:31                     ` Alan Cox
  0 siblings, 1 reply; 95+ messages in thread
From: Andi Kleen @ 2001-01-10 17:32 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Daniel Phillips, Linus Torvalds, linux-kernel

On Tue, Jan 09, 2001 at 08:29:02PM +0100, Trond Myklebust wrote:
> Al has mentioned that he wants us to move towards a *BSD-like system
> of credentials (i.e. struct ucred) that could be used here, but that's
> in the far future. In the meantime, we cache RPC credentials in the
> struct file...

struct ucred is also needed to get LinuxThreads POSIX compliant (sharing
credentials between threads, but still keeping system calls atomic in
relation to credential changes) 


-Andi (who doesn't want to know how many security holes are in linux ported
programs using threads and set*id() because of that) 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 14:56                     ` Andrea Arcangeli
@ 2001-01-10 17:46                       ` Eric W. Biederman
  2001-01-10 18:33                         ` Andrea Arcangeli
  2001-01-10 19:03                         ` Linus Torvalds
  0 siblings, 2 replies; 95+ messages in thread
From: Eric W. Biederman @ 2001-01-10 17:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Woodhouse, Linus Torvalds, Zlatko Calusic,
	Eric W. Biederman, Rik van Riel, linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:

> On Wed, Jan 10, 2001 at 11:46:03AM +0000, David Woodhouse wrote:
> > So the VM code spends a fair amount of time scanning lists of pages which 
> > it really can't do anything about?
> 
> Yes.
> 
> > Would it be possible to put such pages on different list, so that the VM
> 
> Currently to unmap the other pages we have to waste time on those unfreeable
> pages as well.
> 
> Once I or other developer finishes with the reverse lookup from page to
> pte-chain (an implementation from DaveM just exists) we'll be able to put them
> in a separate lru, but it's certainly not a 2.4.1-pre2 thing.

Why do we even want to do reverse page tables?
It seems everyone is assuming this is a good thing and except for being
a touch more flexible I don't see what this buys us (besides more locked memory).

My impression with the MM stuff is that everyone except linux is
trying hard to clone BSD instead of thinking through the issues
ourselves.

And because of the extra overhead this doesn't look to be a win on a
heavily loaded box with no swap.  And probably only glibc mmaped.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 17:46                       ` Eric W. Biederman
@ 2001-01-10 18:33                         ` Andrea Arcangeli
  2001-01-17 14:26                           ` Rik van Riel
  2001-01-10 19:03                         ` Linus Torvalds
  1 sibling, 1 reply; 95+ messages in thread
From: Andrea Arcangeli @ 2001-01-10 18:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Woodhouse, Linus Torvalds, Zlatko Calusic, Rik van Riel,
	linux-kernel

On Wed, Jan 10, 2001 at 10:46:07AM -0700, Eric W. Biederman wrote:
> Why do we even want to do reverse page tables?
> It seems everyone is assuming this is a good thing and except for being

I'm not assuming it's a good thing, but I believe it's something to try.

> My impression with the MM stuff is that everyone except linux is
> trying hard to clone BSD instead of thinking through the issues
> ourselves.

I wasn't even thinking about BSD and I always though about the issues myself,
no panic ;).

> And because of the extra overhead this doesn't look to be a win on a
> heavily loaded box with no swap.  And probably only glibc mmaped.

It can make sense also without swap. We could drop clean pages from the lru
directly that way without wasting time on pages that we don't have a chance to
free (incidentally it's exactly the optimization requested by David W. for
embedded systems).  Note that I'm not convinced that it would be worthwhile to
separate the anonymous and shm pages from the other mapped pages but in theory
we could do that.

I didn't meant that it is certainly the right way to go, but with reverse
lookup we could do very ""interesting"" things and I think it's worthwhile to
research and benchmark what happens (note also that depending on the
implementation very different things can happen at runtime)

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 17:46                       ` Eric W. Biederman
  2001-01-10 18:33                         ` Andrea Arcangeli
@ 2001-01-10 19:03                         ` Linus Torvalds
  2001-01-10 19:27                           ` David S. Miller
                                             ` (2 more replies)
  1 sibling, 3 replies; 95+ messages in thread
From: Linus Torvalds @ 2001-01-10 19:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrea Arcangeli, David Woodhouse, Zlatko Calusic, Rik van Riel,
	linux-kernel



On 10 Jan 2001, Eric W. Biederman wrote:

> Andrea Arcangeli <andrea@suse.de> writes:
> > 
> > Once I or other developer finishes with the reverse lookup from page to
> > pte-chain (an implementation from DaveM just exists) we'll be able to put them
> > in a separate lru, but it's certainly not a 2.4.1-pre2 thing.
> 
> Why do we even want to do reverse page tables?

We don't.

But it does come up every once in a while, and it will probably continue
to do so.

I looked at it a year or two ago myself, and came to the conclusion that I
don't want to blow up our page table size by a factor of three or more, so
I'm not personally interested any more. Maybe somebody else comes up with
a better way to do it, or with a really compelling reason to.

"Feel free to try" is definitely the open source motto.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:03                         ` Linus Torvalds
@ 2001-01-10 19:27                           ` David S. Miller
  2001-01-10 19:36                           ` Alan Cox
  2001-01-17 14:28                           ` Subtle MM bug Rik van Riel
  2 siblings, 0 replies; 95+ messages in thread
From: David S. Miller @ 2001-01-10 19:27 UTC (permalink / raw)
  To: torvalds; +Cc: ebiederm, andrea, dwmw2, zlatko, riel, linux-kernel

   Date: 	Wed, 10 Jan 2001 11:03:21 -0800 (PST)
   From: Linus Torvalds <torvalds@transmeta.com>

   "Feel free to try" is definitely the open source motto.

I basically came to the conclusion that it sucks when I
gave it a go.

In my scheme I tried to save space by using very small descriptors to
keep track of anonymous areas in processes.  This was essentially a
vma->vm_anon pointer that kept track of the pages for you.

After trying to fight this for a few days I determined that this
doesn't work at all because of how COW dups the pages around on you.
Also it was a devil to work out anonymous pages created due to writes
to private mmaps of a file, as soon as one of these were made for the
first time on a vma you had to cook up one of the anon descriptors.

Yeah, I got the anon descriptor down to 2 pointers and an atomic
counter, but it didn't work so this achievement was worthless :-)

There are a few approaches that work, but they tend to take up too
much space to be considerable, as Linus mentioned.

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 17:32                   ` Andi Kleen
@ 2001-01-10 19:31                     ` Alan Cox
  2001-01-10 19:33                       ` Andi Kleen
  2001-01-10 20:11                       ` Linus Torvalds
  0 siblings, 2 replies; 95+ messages in thread
From: Alan Cox @ 2001-01-10 19:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Trond Myklebust, Daniel Phillips, Linus Torvalds, linux-kernel

> struct ucred is also needed to get LinuxThreads POSIX compliant (sharing
> credentials between threads, but still keeping system calls atomic in
> relation to credential changes) 

That is extremely undesirable behaviour. setuid() changes for pthreads crud
should be done by the library emulation layer. Many people have very real
and very good reasons for running multiple parallel ids. Just try writing
a threaded ftp daemon (non anonymous) without that, or an nfs server

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:31                     ` Alan Cox
@ 2001-01-10 19:33                       ` Andi Kleen
  2001-01-10 19:40                         ` Alan Cox
  2001-01-10 20:11                       ` Linus Torvalds
  1 sibling, 1 reply; 95+ messages in thread
From: Andi Kleen @ 2001-01-10 19:33 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Trond Myklebust, Daniel Phillips, Linus Torvalds,
	linux-kernel

On Wed, Jan 10, 2001 at 07:31:52PM +0000, Alan Cox wrote:
> > struct ucred is also needed to get LinuxThreads POSIX compliant (sharing
> > credentials between threads, but still keeping system calls atomic in
> > relation to credential changes) 
> 
> That is extremely undesirable behaviour. setuid() changes for pthreads crud
> should be done by the library emulation layer. Many people have very real
> and very good reasons for running multiple parallel ids. Just try writing
> a threaded ftp daemon (non anonymous) without that, or an nfs server

Of course not by default, it would be a new clone flag (with default to on in
linuxthreads though, to not cause security holes in ported programs like today) 


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:03                         ` Linus Torvalds
  2001-01-10 19:27                           ` David S. Miller
@ 2001-01-10 19:36                           ` Alan Cox
  2001-01-10 23:56                             ` David Weinehall
  2001-01-17 14:28                           ` Subtle MM bug Rik van Riel
  2 siblings, 1 reply; 95+ messages in thread
From: Alan Cox @ 2001-01-10 19:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Andrea Arcangeli, David Woodhouse,
	Zlatko Calusic, Rik van Riel, linux-kernel

> I looked at it a year or two ago myself, and came to the conclusion that I
> don't want to blow up our page table size by a factor of three or more, so
> I'm not personally interested any more. Maybe somebody else comes up with
> a better way to do it, or with a really compelling reason to.

There is only one reason I know for reverse page tables. That is ARM2/ARM3 
support - which is still not fully merged because of this issue

The MMU on these systems is a CAM, and the mmu table is thus backwards to
convention. (It also means you can notionally map two physical addresses to
one virtual but thats undefined in the implementation ;))


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:33                       ` Andi Kleen
@ 2001-01-10 19:40                         ` Alan Cox
  2001-01-10 19:43                           ` Andi Kleen
  0 siblings, 1 reply; 95+ messages in thread
From: Alan Cox @ 2001-01-10 19:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, Andi Kleen, Trond Myklebust, Daniel Phillips,
	Linus Torvalds, linux-kernel

> Of course not by default, it would be a new clone flag (with default to on in
> linuxthreads though, to not cause security holes in ported programs like today) 

I've seen exactly nil cases where there are any security holes in apps caused
by that pthreads api non adherance. There are also far too many overheads
imposed by implementing something in kernel space that is nearly useless,
not needed for any application 99.9999% of users (possibly 100%) have and can
be done just as well in the pthreads library glue - where it will only be
a penalty to pthread using apps.

Making everyone suffer for a bad standard corner case is bad. Especially when
the 'security hole' is pure FUD


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:40                         ` Alan Cox
@ 2001-01-10 19:43                           ` Andi Kleen
  2001-01-10 19:48                             ` Alan Cox
  2001-01-11  9:51                             ` Trond Myklebust
  0 siblings, 2 replies; 95+ messages in thread
From: Andi Kleen @ 2001-01-10 19:43 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Trond Myklebust, Daniel Phillips, Linus Torvalds,
	linux-kernel

On Wed, Jan 10, 2001 at 07:40:49PM +0000, Alan Cox wrote:
> > Of course not by default, it would be a new clone flag (with default to on in
> > linuxthreads though, to not cause security holes in ported programs like today) 
> 
> I've seen exactly nil cases where there are any security holes in apps caused
> by that pthreads api non adherance. There are also far too many overheads
> imposed by implementing something in kernel space that is nearly useless,
> not needed for any application 99.9999% of users (possibly 100%) have and can
> be done just as well in the pthreads library glue - where it will only be
> a penalty to pthread using apps.

I have not seen a good way to implement it in user space yet.

> Making everyone suffer for a bad standard corner case is bad. Especially when
> the 'security hole' is pure FUD
>
As the thread started it's not only only needed for pthreads, but also for NFS
and setuid (actually NFS already implements it privately), and probably other network
file systems too.  So it's far from being only a "bad standard corner case". 


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:43                           ` Andi Kleen
@ 2001-01-10 19:48                             ` Alan Cox
  2001-01-10 19:48                               ` Andi Kleen
  2001-01-11  9:51                             ` Trond Myklebust
  1 sibling, 1 reply; 95+ messages in thread
From: Alan Cox @ 2001-01-10 19:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, Andi Kleen, Trond Myklebust, Daniel Phillips,
	Linus Torvalds, linux-kernel

> As the thread started it's not only only needed for pthreads, but also for NFS
> and setuid (actually NFS already implements it privately), and probably other network
> file systems too.  So it's far from being only a "bad standard corner case". 

I wonder how Linux 2.2 worked, that doesnt have them. Now if its a clean way
of sorting out a pile of other things and it does pthreads as a side effect
I've no problem, but arguing for it because of a tiny pthreads corner case
is coming from the wrong end

Alan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:48                             ` Alan Cox
@ 2001-01-10 19:48                               ` Andi Kleen
  0 siblings, 0 replies; 95+ messages in thread
From: Andi Kleen @ 2001-01-10 19:48 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Trond Myklebust, Daniel Phillips, Linus Torvalds,
	linux-kernel

On Wed, Jan 10, 2001 at 07:48:04PM +0000, Alan Cox wrote:
> > As the thread started it's not only only needed for pthreads, but also for NFS
> > and setuid (actually NFS already implements it privately), and probably other network
> > file systems too.  So it's far from being only a "bad standard corner case". 
> 
> I wonder how Linux 2.2 worked, that doesnt have them. Now if its a clean way
> of sorting out a pile of other things and it does pthreads as a side effect

Linux 2.2 setuid in nfs never worked quite like traditional Unix, and there
were lots of reports because users were regularly rediscovering it.

I think the nfs patches merged in 2.2.18 fixed it (?) 

> I've no problem, but arguing for it because of a tiny pthreads corner case
> is coming from the wrong end

I'm not so sure the thread corner case is that tiny. 

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:31                     ` Alan Cox
  2001-01-10 19:33                       ` Andi Kleen
@ 2001-01-10 20:11                       ` Linus Torvalds
  2001-01-11 12:56                         ` Stephen C. Tweedie
  2001-01-11 13:12                         ` Trond Myklebust
  1 sibling, 2 replies; 95+ messages in thread
From: Linus Torvalds @ 2001-01-10 20:11 UTC (permalink / raw)
  To: Alan Cox; +Cc: Andi Kleen, Trond Myklebust, Daniel Phillips, linux-kernel



On Wed, 10 Jan 2001, Alan Cox wrote:
> 
> That is extremely undesirable behaviour. setuid() changes for pthreads crud
> should be done by the library emulation layer. Many people have very real
> and very good reasons for running multiple parallel ids. Just try writing
> a threaded ftp daemon (non anonymous) without that, or an nfs server

I absolutely think that "one thread, one ID" is the way to go.

That said, we can easily support the notion of CLONE_CRED if we absolutely
have to (and sane people just shouldn't use it), so if somebody wants to
work on this for 2.5.x...

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:36                           ` Alan Cox
@ 2001-01-10 23:56                             ` David Weinehall
  2001-01-11  0:24                               ` Alan Cox
  2001-01-12  5:56                               ` Ralf Baechle
  0 siblings, 2 replies; 95+ messages in thread
From: David Weinehall @ 2001-01-10 23:56 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Eric W. Biederman, Andrea Arcangeli,
	David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel

On Wed, Jan 10, 2001 at 07:36:43PM +0000, Alan Cox wrote:
> > I looked at it a year or two ago myself, and came to the conclusion that I
> > don't want to blow up our page table size by a factor of three or more, so
> > I'm not personally interested any more. Maybe somebody else comes up with
> > a better way to do it, or with a really compelling reason to.
> 
> There is only one reason I know for reverse page tables. That is ARM2/ARM3 
> support - which is still not fully merged because of this issue
> 
> The MMU on these systems is a CAM, and the mmu table is thus backwards to
> convention. (It also means you can notionally map two physical addresses to
> one virtual but thats undefined in the implementation ;))

Are there any other (not yet supported) platforms with similar (or other
unrelated, but hard to support because of the current architecture of
the kernel) problems?

(No, I have no secret trumps up my sleeve, I'm just curious.)


/David
  _                                                                 _
 // David Weinehall <tao@acc.umu.se> /> Northern lights wander      \\
//  Project MCA Linux hacker        //  Dance across the winter sky //
\>  http://www.acc.umu.se/~tao/    </   Full colour fire           </
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 23:56                             ` David Weinehall
@ 2001-01-11  0:24                               ` Alan Cox
  2001-01-12  5:56                               ` Ralf Baechle
  1 sibling, 0 replies; 95+ messages in thread
From: Alan Cox @ 2001-01-11  0:24 UTC (permalink / raw)
  To: David Weinehall
  Cc: Alan Cox, Linus Torvalds, Eric W. Biederman, Andrea Arcangeli,
	David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel

> > The MMU on these systems is a CAM, and the mmu table is thus backwards to
> > convention. (It also means you can notionally map two physical addresses to
> > one virtual but thats undefined in the implementation ;))
> 
> Are there any other (not yet supported) platforms with similar (or other
> unrelated, but hard to support because of the current architecture of
> the kernel) problems?

I believe its uniquely deranged. There are people who have asked for reverse
tables for other purposes (eg cache flush handling) but their mmu is the normal
way around.

Alan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:43                           ` Andi Kleen
  2001-01-10 19:48                             ` Alan Cox
@ 2001-01-11  9:51                             ` Trond Myklebust
  1 sibling, 0 replies; 95+ messages in thread
From: Trond Myklebust @ 2001-01-11  9:51 UTC (permalink / raw)
  To: Alan Cox; +Cc: Andi Kleen, Daniel Phillips, Linus Torvalds, linux-kernel

>>>>> " " == Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

    >> As the thread started it's not only only needed for pthreads,
    >> but also for NFS and setuid (actually NFS already implements it
    >> privately), and probably other network file systems too.  So
    >> it's far from being only a "bad standard corner case".

     > I wonder how Linux 2.2 worked, that doesnt have them. Now if
     > its a clean way of sorting out a pile of other things and it
     > does pthreads as a side effect I've no problem, but arguing for
     > it because of a tiny pthreads corner case is coming from the
     > wrong end


How about this then:

Sure NFS can work without ucreds, but there are limitations.  For
instance the MVFS folks recently complained. They're trying to keep
mmap consistency between their own filesystem layer and the underlying
storage filesystem using i_mapping (a la CODAfs). The problem then is
that the vma will be using the wrong 'struct file' to call the
underlying storage.

This sort of problem would indeed disappear if we have a generic
credential stored in the struct file as we could make the VFS pass the
credential directly to readpage (and writepage?) rather than passing
the whole struct file.

If you use the same credentials in the task structure, then there are
other advantages even to NFS itself.
You may for example want to attach an ACL cache at some point in time
(to avoid the messiness of calling NFSv3/v4 permissions routines at
each and every file lookup). Ditto for strong RPC authentication
schemes that require an upcall to some userspace daemon.

That said, we'd first have to find a way to reconcile fsuid/fsgid with
the BSD model in some way: I'd rather not have 2 'ucred's per task (1
for threads + 1 for filesystems).

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 20:11                       ` Linus Torvalds
@ 2001-01-11 12:56                         ` Stephen C. Tweedie
  2001-01-11 13:10                           ` Andi Kleen
                                             ` (2 more replies)
  2001-01-11 13:12                         ` Trond Myklebust
  1 sibling, 3 replies; 95+ messages in thread
From: Stephen C. Tweedie @ 2001-01-11 12:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Andi Kleen, Trond Myklebust, Daniel Phillips,
	linux-kernel, Stephen Tweedie

Hi,

On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds wrote:
> 
> That said, we can easily support the notion of CLONE_CRED if we absolutely
> have to (and sane people just shouldn't use it), so if somebody wants to
> work on this for 2.5.x...

But is it really worth the pain?  I'd hate to have to audit the entire
VFS to make sure that it works if another thread changes our
credentials in the middle of a syscall, so we either end up having to
lock the credentials over every VFS syscall, or take a copy of the
credentials and pass it through every VFS internal call that we make.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-11 12:56                         ` Stephen C. Tweedie
@ 2001-01-11 13:10                           ` Andi Kleen
  2001-01-11 16:50                           ` Albert D. Cahalan
  2001-01-11 19:01                           ` Alexander Viro
  2 siblings, 0 replies; 95+ messages in thread
From: Andi Kleen @ 2001-01-11 13:10 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Trond Myklebust,
	Daniel Phillips, linux-kernel

On Thu, Jan 11, 2001 at 12:56:04PM +0000, Stephen C. Tweedie wrote:
> Hi,
> 
> On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds wrote:
> > 
> > That said, we can easily support the notion of CLONE_CRED if we absolutely
> > have to (and sane people just shouldn't use it), so if somebody wants to
> > work on this for 2.5.x...
> 
> But is it really worth the pain?  I'd hate to have to audit the entire
> VFS to make sure that it works if another thread changes our
> credentials in the middle of a syscall, so we either end up having to
> lock the credentials over every VFS syscall, or take a copy of the
> credentials and pass it through every VFS internal call that we make.

That is what NFS does already, it would just move into generic VFS then.
(NFS copies) 


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 20:11                       ` Linus Torvalds
  2001-01-11 12:56                         ` Stephen C. Tweedie
@ 2001-01-11 13:12                         ` Trond Myklebust
  2001-01-11 14:13                           ` Stephen C. Tweedie
  1 sibling, 1 reply; 95+ messages in thread
From: Trond Myklebust @ 2001-01-11 13:12 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Daniel Phillips, linux-kernel

>>>>> " " == Stephen C Tweedie <sct@redhat.com> writes:

     > Hi, On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds
     > wrote:
    >>
    >> That said, we can easily support the notion of CLONE_CRED if we
    >> absolutely have to (and sane people just shouldn't use it), so
    >> if somebody wants to work on this for 2.5.x...

     > But is it really worth the pain?  I'd hate to have to audit the
     > entire VFS to make sure that it works if another thread changes
     > our credentials in the middle of a syscall, so we either end up
     > having to lock the credentials over every VFS syscall, or take
     > a copy of the credentials and pass it through every VFS
     > internal call that we make.

 What's wrong with copy-on-write style semantics? IOW, anyone who
wants to change the credentials needs to make a private copy of the
existing structure first.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-11 13:12                         ` Trond Myklebust
@ 2001-01-11 14:13                           ` Stephen C. Tweedie
  2001-01-11 19:03                             ` Alexander Viro
  0 siblings, 1 reply; 95+ messages in thread
From: Stephen C. Tweedie @ 2001-01-11 14:13 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Andi Kleen,
	Daniel Phillips, linux-kernel

Hi,

On Thu, Jan 11, 2001 at 02:12:05PM +0100, Trond Myklebust wrote:
> 
>  What's wrong with copy-on-write style semantics? IOW, anyone who
> wants to change the credentials needs to make a private copy of the
> existing structure first.

Because COW only solves the problem if each task is only changing its
own, local, private copy of the credentials.  Posix threads demand
that one thread changing credentials also affects all the other
threads immediately, and making your own local private copy won't help
you to change the other tasks' credentials safely.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 17:03                     ` Linus Torvalds
@ 2001-01-11 14:36                       ` Jim Gettys
  0 siblings, 0 replies; 95+ messages in thread
From: Jim Gettys @ 2001-01-11 14:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Zlatko Calusic, Eric W. Biederman, Rik van Riel,
	linux-kernel


> Sender: linux-kernel-owner@vger.kernel.org
> From: Linus Torvalds <torvalds@transmeta.com>
> Date: 	Wed, 10 Jan 2001 09:03:03 -0800 (PST)
> To: David Woodhouse <dwmw2@infradead.org>
> Cc: Zlatko Calusic <zlatko@iskon.hr>,
>         "Eric W. Biederman" <ebiederm@xmission.com>,
>         Rik van Riel <riel@conectiva.com.br>, linux-kernel@vger.kernel.org
> Subject: Re: Subtle MM bug
> -----
> On Wed, 10 Jan 2001, David Woodhouse wrote:
> 
> >
> > torvalds@transmeta.com said:
> > >  The no-swap behaviour shoul dactually be pretty much identical,
> > > simply because both 2.2 and 2.4 will do the same thing: just skip
> > > dirty pages in the page tables because they cannot do anything about
> > > them.
> >
> > So the VM code spends a fair amount of time scanning lists of pages which
> > it really can't do anything about?
> 
> It can do _tons_ of stuff.
> 
> Remember, on platforms like this, one of the reasons for being low on
> memory is things like running X and netscape: maybe you have 64MB of RAM
> and you don't think you need a swap device, and you want to have a web
> browser.
> 
> The fact that we cannot touch _dirty_ pages doesn't mean that there's
> nothing to do: instead of running out of memory we can at least make the
> machine usable by dropping the text pages and the page cache..
> 

And pushing out old text pages is a very good idea on most embedded systems.
Getting the pages back is a (relatively) cheap operation: no disk seeks,
some joules spent on decompression (if on CRAMFS or other compressed file
system).

There is an interesting question on such devices as to whether you are
better off dropping text pages or pages out of the page cache first,
or to what degree... 
				- Jim

--
Jim Gettys
Technology and Corporate Development
Compaq Computer Corporation
jg@pa.dec.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-11 12:56                         ` Stephen C. Tweedie
  2001-01-11 13:10                           ` Andi Kleen
@ 2001-01-11 16:50                           ` Albert D. Cahalan
  2001-01-11 17:35                             ` Stephen C. Tweedie
  2001-01-11 19:01                           ` Alexander Viro
  2 siblings, 1 reply; 95+ messages in thread
From: Albert D. Cahalan @ 2001-01-11 16:50 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Trond Myklebust,
	Daniel Phillips, linux-kernel, Stephen Tweedie

Stephen C. Tweedie writes:
> On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds wrote:

>> That said, we can easily support the notion of CLONE_CRED if
>> we absolutely have to (and sane people just shouldn't use it),
>> so if somebody wants to work on this for 2.5.x...
>
> But is it really worth the pain?  I'd hate to have to audit the
> entire VFS to make sure that it works if another thread changes our
> credentials in the middle of a syscall, so we either end up having to
> lock the credentials over every VFS syscall, or take a copy of the
> credentials and pass it through every VFS internal call that we make.

1. each thread has a copy, and doesn't need to lock it
2. threads are commanded to change their own copy

Credentials could be changed on syscall exit. It is a bit like
doing signals I think, with less overhead than making userspace
muck around with signal handlers and synchronization crud.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-11 16:50                           ` Albert D. Cahalan
@ 2001-01-11 17:35                             ` Stephen C. Tweedie
  2001-01-11 19:38                               ` Albert D. Cahalan
  0 siblings, 1 reply; 95+ messages in thread
From: Stephen C. Tweedie @ 2001-01-11 17:35 UTC (permalink / raw)
  To: Albert D. Cahalan
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Andi Kleen,
	Trond Myklebust, Daniel Phillips, linux-kernel

Hi,

On Thu, Jan 11, 2001 at 11:50:21AM -0500, Albert D. Cahalan wrote:
> Stephen C. Tweedie writes:
> >
> > But is it really worth the pain?  I'd hate to have to audit the
> > entire VFS to make sure that it works if another thread changes our
> > credentials in the middle of a syscall, so we either end up having to
> > lock the credentials over every VFS syscall, or take a copy of the
> > credentials and pass it through every VFS internal call that we make.
> 
> 1. each thread has a copy, and doesn't need to lock it

We already have that...

> 2. threads are commanded to change their own copy

We already do that: that's how the current pthreads works.
 
> Credentials could be changed on syscall exit. It is a bit like
> doing signals I think, with less overhead than making userspace
> muck around with signal handlers and synchronization crud.

Yuck.  Far better to send a signal than to pollute the syscall exit
path.  And what about syscalls which block indefinitely?  We _want_
the signal so that they get woken up to do the credentials change.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-11 12:56                         ` Stephen C. Tweedie
  2001-01-11 13:10                           ` Andi Kleen
  2001-01-11 16:50                           ` Albert D. Cahalan
@ 2001-01-11 19:01                           ` Alexander Viro
  2 siblings, 0 replies; 95+ messages in thread
From: Alexander Viro @ 2001-01-11 19:01 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Trond Myklebust,
	Daniel Phillips, linux-kernel



On Thu, 11 Jan 2001, Stephen C. Tweedie wrote:

> Hi,
> 
> On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds wrote:
> > 
> > That said, we can easily support the notion of CLONE_CRED if we absolutely
> > have to (and sane people just shouldn't use it), so if somebody wants to
> > work on this for 2.5.x...
> 
> But is it really worth the pain?  I'd hate to have to audit the entire
> VFS to make sure that it works if another thread changes our
> credentials in the middle of a syscall, so we either end up having to
> lock the credentials over every VFS syscall, or take a copy of the
> credentials and pass it through every VFS internal call that we make.

COW. Pthreads are simply irrelevant here - if you want set*id in one
thread to change the credentials of the rest you can do it in libpthreads.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-11 14:13                           ` Stephen C. Tweedie
@ 2001-01-11 19:03                             ` Alexander Viro
  2001-01-11 19:47                               ` Stephen C. Tweedie
  0 siblings, 1 reply; 95+ messages in thread
From: Alexander Viro @ 2001-01-11 19:03 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Trond Myklebust, Linus Torvalds, Alan Cox, Andi Kleen,
	Daniel Phillips, linux-kernel



On Thu, 11 Jan 2001, Stephen C. Tweedie wrote:

> Hi,
> 
> On Thu, Jan 11, 2001 at 02:12:05PM +0100, Trond Myklebust wrote:
> > 
> >  What's wrong with copy-on-write style semantics? IOW, anyone who
> > wants to change the credentials needs to make a private copy of the
> > existing structure first.
> 
> Because COW only solves the problem if each task is only changing its
> own, local, private copy of the credentials.  Posix threads demand
> that one thread changing credentials also affects all the other
> threads immediately, and making your own local private copy won't help
> you to change the other tasks' credentials safely.

And how is that different from the current situation?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-11 17:35                             ` Stephen C. Tweedie
@ 2001-01-11 19:38                               ` Albert D. Cahalan
  0 siblings, 0 replies; 95+ messages in thread
From: Albert D. Cahalan @ 2001-01-11 19:38 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Albert D. Cahalan, Stephen C. Tweedie, Linus Torvalds, Alan Cox,
	Andi Kleen, Trond Myklebust, Daniel Phillips, linux-kernel

Stephen C. Tweedie writes:
> On Thu, Jan 11, 2001 at 11:50:21AM -0500, Albert D. Cahalan wrote:
>> Stephen C. Tweedie writes:

>>> But is it really worth the pain?  I'd hate to have to audit the
>>> entire VFS to make sure that it works if another thread changes our
>>> credentials in the middle of a syscall, so we either end up having to
>>> lock the credentials over every VFS syscall, or take a copy of the
>>> credentials and pass it through every VFS internal call that we make.
>>
>> 1. each thread has a copy, and doesn't need to lock it
>
> We already have that...
>
>> 2. threads are commanded to change their own copy
>
> We already do that: that's how the current pthreads works.

I thought it was unimplemented. Even so, it is at least one
extra round trip to/from the kernel. (I'd guess trips>1)

>> Credentials could be changed on syscall exit. It is a bit like
>> doing signals I think, with less overhead than making userspace
>> muck around with signal handlers and synchronization crud.
>
> Yuck.  Far better to send a signal than to pollute the syscall exit
> path.  And what about syscalls which block indefinitely?  We _want_
> the signal so that they get woken up to do the credentials change.

The syscall exit path itself need not be polluted. Changes to
recalc_sigpending and do_signal would get the job done.
For the former, either add an extra word of kernel-internal
signal data or just check a simple flag. For do_signal, maybe
add an extra "if(foo)" at the top of the main loop. (that would
depend on what was done to recalc_sigpending)

I suppose the goodness or badness of this depends partly on how
much you are willing to pay for pthreads that are fast and correct.
People around here seem to like burying their heads in hope that
pthreads will just go away, while app developers stubbornly try to
use the API.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-11 19:03                             ` Alexander Viro
@ 2001-01-11 19:47                               ` Stephen C. Tweedie
  2001-01-11 19:57                                 ` Alexander Viro
  0 siblings, 1 reply; 95+ messages in thread
From: Stephen C. Tweedie @ 2001-01-11 19:47 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Stephen C. Tweedie, Trond Myklebust, Linus Torvalds, Alan Cox,
	Andi Kleen, Daniel Phillips, linux-kernel

Hi,

On Thu, Jan 11, 2001 at 02:03:48PM -0500, Alexander Viro wrote:
> On Thu, 11 Jan 2001, Stephen C. Tweedie wrote:
> 
> > On Thu, Jan 11, 2001 at 02:12:05PM +0100, Trond Myklebust wrote:
> > > 
> > >  What's wrong with copy-on-write style semantics? IOW, anyone who
> > > wants to change the credentials needs to make a private copy of the
> > > existing structure first.
> > 
> > Because COW only solves the problem if each task is only changing its
> > own, local, private copy of the credentials.  Posix threads demand
> > that one thread changing credentials also affects all the other
> > threads immediately, and making your own local private copy won't help
> > you to change the other tasks' credentials safely.
> 
> And how is that different from the current situation?

It's not, which is the point I was making: COW doesn't actually solve
the pthreads problem.  Far better to do it in user space.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-11 19:47                               ` Stephen C. Tweedie
@ 2001-01-11 19:57                                 ` Alexander Viro
  0 siblings, 0 replies; 95+ messages in thread
From: Alexander Viro @ 2001-01-11 19:57 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Trond Myklebust, Linus Torvalds, Alan Cox, Andi Kleen,
	Daniel Phillips, linux-kernel



On Thu, 11 Jan 2001, Stephen C. Tweedie wrote:

> > And how is that different from the current situation?
> 
> It's not, which is the point I was making: COW doesn't actually solve
> the pthreads problem.  Far better to do it in user space.

Oh, certainly. We need COW for completely unrelated reasons - suppose
you open() a file and then change your *ID. You definitely want credentials
on the opened file to stay unchanged.

Pthreads are non-issue as far as I'm concerned. I'ld rather avoid mixing
them with credentials' cache. BTW, what about *BSD implementations? Do
they change creds of all threads upon set*id(2)?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 23:56                             ` David Weinehall
  2001-01-11  0:24                               ` Alan Cox
@ 2001-01-12  5:56                               ` Ralf Baechle
  2001-01-12 16:10                                 ` Eric W. Biederman
  1 sibling, 1 reply; 95+ messages in thread
From: Ralf Baechle @ 2001-01-12  5:56 UTC (permalink / raw)
  To: David Weinehall
  Cc: Alan Cox, Linus Torvalds, Eric W. Biederman, Andrea Arcangeli,
	David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel

On Thu, Jan 11, 2001 at 12:56:57AM +0100, David Weinehall wrote:

> > The MMU on these systems is a CAM, and the mmu table is thus backwards to
> > convention. (It also means you can notionally map two physical addresses to
> > one virtual but thats undefined in the implementation ;))
> 
> Are there any other (not yet supported) platforms with similar (or other
> unrelated, but hard to support because of the current architecture of
> the kernel) problems?
> 
> (No, I have no secret trumps up my sleeve, I'm just curious.)

Having a reverse mappings is the least sucky way to handle virtual aliases
of certain types of MIPS caches.

  Ralf

--
"Embrace, Enhance, Eliminate" - it worked for the pope, it'll work for Bill.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-12  5:56                               ` Ralf Baechle
@ 2001-01-12 16:10                                 ` Eric W. Biederman
  2001-01-12 21:11                                   ` Russell King
  2001-01-15  2:53                                   ` Ralf Baechle
  0 siblings, 2 replies; 95+ messages in thread
From: Eric W. Biederman @ 2001-01-12 16:10 UTC (permalink / raw)
  To: Ralf Baechle
  Cc: David Weinehall, Alan Cox, Linus Torvalds, Andrea Arcangeli,
	David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel

Ralf Baechle <ralf@conectiva.com.br> writes:

> On Thu, Jan 11, 2001 at 12:56:57AM +0100, David Weinehall wrote:
> 
> > > The MMU on these systems is a CAM, and the mmu table is thus backwards to
> > > convention. (It also means you can notionally map two physical addresses to
> > > one virtual but thats undefined in the implementation ;))
> > 
> > Are there any other (not yet supported) platforms with similar (or other
> > unrelated, but hard to support because of the current architecture of
> > the kernel) problems?
> > 
> > (No, I have no secret trumps up my sleeve, I'm just curious.)
> 
> Having a reverse mappings is the least sucky way to handle virtual aliases
> of certain types of MIPS caches.

Hmm.  I would think that increasing the logical page size in the kernel would
be the trivial way to handle virtual aliases.  (i.e.) with a large enough page
size you can't actually have a virtual alias.

You could also play some games with simply allocating pages only with the proper 
proper high bits.   These games might also be useful on architectures for L2 caches
who have significant physical bits than PAGE_SHIFT bits.

But how does a reverse mapping help to handle virtual aliases?  What are those
caches doing?  The only model in my head is having a virtually indexed cache
where you have more index bits than PAGE_SHIFT bits.

Eric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-12 16:10                                 ` Eric W. Biederman
@ 2001-01-12 21:11                                   ` Russell King
  2001-01-15  2:56                                     ` Ralf Baechle
  2001-01-15  2:53                                   ` Ralf Baechle
  1 sibling, 1 reply; 95+ messages in thread
From: Russell King @ 2001-01-12 21:11 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Ralf Baechle, riel, Andrea Arcangeli, linux-kernel

Eric W. Biederman writes:
> Hmm.  I would think that increasing the logical page size in the kernel
> would be the trivial way to handle virtual aliases.  (i.e.) with a large
> enough page size you can't actually have a virtual alias.

There are types of caches out there that no matter how large the page size,
you will always have alias issues.  These are ones where the cache lines
are indexed independent of virtual address (and therefore can have funny
cache line replacement algorithms).

And yes, you guessed which processor has it. ;)

(Sorry the CC list got trimmed, elm ate some of it.  I'm sure most of the
people who where on it were on lkml anyway)
   _____
  |_____| ------------------------------------------------- ---+---+-
  |   |         Russell King        rmk@arm.linux.org.uk      --- ---
  | | | | http://www.arm.linux.org.uk/personal/aboutme.html   /  /  |
  | +-+-+                                                     --- -+-
  /   |               THE developer of ARM Linux              |+| /|\
 /  | | |                                                     ---  |
    +-+-+ -------------------------------------------------  /\\\  |
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-12 16:10                                 ` Eric W. Biederman
  2001-01-12 21:11                                   ` Russell King
@ 2001-01-15  2:53                                   ` Ralf Baechle
  2001-01-15  8:41                                     ` Caches, page coloring, virtual indexed caches, and more Eric W. Biederman
  1 sibling, 1 reply; 95+ messages in thread
From: Ralf Baechle @ 2001-01-15  2:53 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Weinehall, Alan Cox, Linus Torvalds, Andrea Arcangeli,
	David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel

On Fri, Jan 12, 2001 at 09:10:54AM -0700, Eric W. Biederman wrote:

> > Having a reverse mappings is the least sucky way to handle virtual aliases
> > of certain types of MIPS caches.
> 
> Hmm.  I would think that increasing the logical page size in the kernel would
> be the trivial way to handle virtual aliases.  (i.e.) with a large enough page
> size you can't actually have a virtual alias.

That's a possible solution; I'm not clear how bad the overhead would be.
Right now a virtual alias is a relativly rare event and we don't want the
common case of no virtual alias to make pay a high price.  Or?

> You could also play some games with simply allocating pages only with the
> proper proper high bits.   These games might also be useful on architectures
> for L2 caches who have significant physical bits than PAGE_SHIFT bits.

An alternative but less efficient solution.  I tried to implement it; I ran
into problems with running out of larger pages soon as I had to split order 2
pages into 4 order 0 pages to implement this; the fragmentation was _really_
bad.

> But how does a reverse mapping help to handle virtual aliases?  What are those
> caches doing?

You leave only mappings of one color accessible.  All other mappings are made
unaccessible in the page table, so accessing will result in a TLB fault.
The TLB fault handler then flushes the active mappings, makes them
unaccessible by clearing the MIPS hw dirty / accessible bits, then makes the
mapping of the new color accessible in the page table.  This is already
possible right now but doing the necessary reverse mappings can be rather
inefficient as is.

> The only model in my head is having a virtually indexed cache where you
> have more index bits than PAGE_SHIFT bits.

Which is exactly what many MIPS implementations are suffering from.  At
least they're tagged with the physical address, so no flushes on context
switch necessary.

  Ralf

--
"Embrace, Enhance, Eliminate" - it worked for the pope, it'll work for Bill.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-12 21:11                                   ` Russell King
@ 2001-01-15  2:56                                     ` Ralf Baechle
  2001-01-15  6:59                                       ` Eric W. Biederman
  0 siblings, 1 reply; 95+ messages in thread
From: Ralf Baechle @ 2001-01-15  2:56 UTC (permalink / raw)
  To: Russell King; +Cc: Eric W. Biederman, riel, Andrea Arcangeli, linux-kernel

On Fri, Jan 12, 2001 at 09:11:43PM +0000, Russell King wrote:

> Eric W. Biederman writes:
> > Hmm.  I would think that increasing the logical page size in the kernel
> > would be the trivial way to handle virtual aliases.  (i.e.) with a large
> > enough page size you can't actually have a virtual alias.
> 
> There are types of caches out there that no matter how large the page size,
> you will always have alias issues.  These are ones where the cache lines
> are indexed independent of virtual address (and therefore can have funny
> cache line replacement algorithms).
> 
> And yes, you guessed which processor has it. ;)

I recently spoke with some CPU architecture researcher at some university
about cache architectures; I suspect in the near future we'll see more
funny cache indexing and replacment algorithems ...

  Ralf

--
"Embrace, Enhance, Eliminate" - it worked for the pope, it'll work for Bill.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-15  2:56                                     ` Ralf Baechle
@ 2001-01-15  6:59                                       ` Eric W. Biederman
  0 siblings, 0 replies; 95+ messages in thread
From: Eric W. Biederman @ 2001-01-15  6:59 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Russell King, riel, Andrea Arcangeli, linux-kernel

Ralf Baechle <ralf@uni-koblenz.de> writes:

> On Fri, Jan 12, 2001 at 09:11:43PM +0000, Russell King wrote:
> 
> > Eric W. Biederman writes:
> > > Hmm.  I would think that increasing the logical page size in the kernel
> > > would be the trivial way to handle virtual aliases.  (i.e.) with a large
> > > enough page size you can't actually have a virtual alias.
> > 
> > There are types of caches out there that no matter how large the page size,
> > you will always have alias issues.  These are ones where the cache lines
> > are indexed independent of virtual address (and therefore can have funny
> > cache line replacement algorithms).
> > 
> > And yes, you guessed which processor has it. ;)

Odd.  Does this affect correctness?

> I recently spoke with some CPU architecture researcher at some university
> about cache architectures; I suspect in the near future we'll see more
> funny cache indexing and replacment algorithems ...

But I doubt many of those will run incorrectly if just less efficiently if
the OS doesn't help you avoid aliases.  


Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Caches, page coloring, virtual indexed caches, and more
  2001-01-15  2:53                                   ` Ralf Baechle
@ 2001-01-15  8:41                                     ` Eric W. Biederman
  2001-01-15 11:54                                       ` Ralf Baechle
  2001-01-15 12:51                                       ` Anton Blanchard
  0 siblings, 2 replies; 95+ messages in thread
From: Eric W. Biederman @ 2001-01-15  8:41 UTC (permalink / raw)
  To: Ralf Baechle
  Cc: David Weinehall, Alan Cox, Linus Torvalds, Andrea Arcangeli,
	David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel,
	linux-mm

Ralf Baechle <ralf@uni-koblenz.de> writes:

> On Fri, Jan 12, 2001 at 09:10:54AM -0700, Eric W. Biederman wrote:
> 
> > > Having a reverse mappings is the least sucky way to handle virtual aliases
> > > of certain types of MIPS caches.
> > 
> > Hmm.  I would think that increasing the logical page size in the kernel would
> > be the trivial way to handle virtual aliases.  (i.e.) with a large enough page

O.k. I stepped back and took a little refresher to make certain I know what
is going on.  The only problem besides context switches with a virtually mapped
cache is that without some care you can have multiple cache blocks for
the same data. This is what we must avoid to be correct.

I admit that using a reverse mapping is one way we could prevent these
duplicate blocks. 

#define VIRT_INDEX_BITS 18 /* number of bits in the L1 virtually indexed cache */

These are the places I know of in the kernel that create page mappings.
fork, anonymous pages, mmap, sysv shared memory, mremap, kmap

fork just duplicates something that is already there but in a
different mm, so no bad virtual aliases are created.

anonymous pages only belong to one process, and have effectively only
one mapping so again not a problem.  Unless you need kmap.  To make
that work well we'd have to make the restriction that the swap cache
index and the virtual address are identical in their VIRT_INDEX_BITS.
That's better than doing it in alloc_pages especially as you never
alloc high order swap pages but it worries me a little.   This is
fairly close to what we do with swap clustering but it's still
a pain.

shared mmap.  This is the important one.  Since we have a logical
backing store this is easy to handle.  We just enforce that the
virtual address in a process that we mmap something to must match the
logical address to VIRT_INDEX_BITS.  The effect is the same as a
larger page size with virtually no overhead.

sysv shared memory is exactly the same as shared mmap.  Except instead
of a file offset you have an offset into the sysv segment.

mremap.  Linux specific but pretty much the same as mmap, but easier.
We just enforce that the virtual address of the source of mremap,
and the destination of mremap match on VIRT_INDEX_BITS.

kmap is a little different.  using VIRT_INDEX_BITS is a little
subtle but should work.  Currently kmap is used only with the page
cache so we can take advantage of the page->index field.  From page->index 
we can compute the logical offset of the page and make certain the
page mapped with all VIRT_INDEX_BITS the same as a mmap alias.

kmap and the swap cache are a little different.  Since index holds
the location of a page on the swap file we'd have to make that index
be the same for VIRT_INDEX_BITS as well.


> 
> > size you can't actually have a virtual alias.
> 
> That's a possible solution; I'm not clear how bad the overhead would be.
> Right now a virtual alias is a relativly rare event and we don't want the
> common case of no virtual alias to make pay a high price.  Or?

I guess the question is how big would these logical pages need to be?
Answer big enough to turn your virtually indexed cache into a
physically indexed cache.  Which means they would have to be cache
size.  

Increasing PAGE_SIZE a few bits shouldn't be bad but going up two
orders of magnitude would likely skewer your swapping, and memory
management performance.  You'd just have way to few pages.

But I have a better suggestion so see above.

> > You could also play some games with simply allocating pages only with the
> > proper proper high bits.   These games might also be useful on architectures
> > for L2 caches who have significant physical bits than PAGE_SHIFT bits.
> 
> An alternative but less efficient solution.  I tried to implement it; I ran
> into problems with running out of larger pages soon as I had to split order 2
> pages into 4 order 0 pages to implement this; the fragmentation was _really_
> bad.

O.k. this is scratched off my list of possible good ideas.  Duh.  This
fails for exactly the same reason as increasing as increasing page
size.  at 256K cache and 4K PAGE_SIZE you'd need 256/4 = 64 different
types of pages, fairly nasty.
 
> > But how does a reverse mapping help to handle virtual aliases?  What are those
> 
> > caches doing?
> 
> You leave only mappings of one color accessible.  All other mappings are made
> unaccessible in the page table, so accessing will result in a TLB fault.
> The TLB fault handler then flushes the active mappings, makes them
> unaccessible by clearing the MIPS hw dirty / accessible bits, then makes the
> mapping of the new color accessible in the page table.  This is already
> possible right now but doing the necessary reverse mappings can be rather
> inefficient as is.

Hmm.  This doesn't sound right.  And this sounds like a silly way to
use reverse mappings anyway, since you can do it up front in mmap and
their kin.  Which means you don't have to slow any of the page fault
logic up.

> 
> > The only model in my head is having a virtually indexed cache where you
> > have more index bits than PAGE_SHIFT bits.
> 
> Which is exactly what many MIPS implementations are suffering from.  At
> least they're tagged with the physical address, so no flushes on context
> switch necessary.

Hmm.  Correct.  If you have the page aliases appropriately colored across
address spaces you will always hit the same cache block, and since you
do virtual to physical before the tag compare a false hit won't hurt
either.

Well virtually indexed caches look like worth supporting in the kernel
since it is easy to do, and can be compiled out on architectures that
don't support it.

For keeping cache collisions down I think we probably do a decent job
already.  All we need to do is to continuously cycle through cache
aliases.

For not ensuring too many cache collisions I think we probably do a
decent job already.  Only the least significant bits are significant.
And virtual addresses matter not at all.  In the buddy system where we
walk backward linearly through memory it feels o.k.  Only profiling
would tell if we were helping of if we could even help with that.
Reverse page tables of course are totally irrelevant when you are
dealing in all physical addresses though ;)

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Caches, page coloring, virtual indexed caches, and more
  2001-01-15  8:41                                     ` Caches, page coloring, virtual indexed caches, and more Eric W. Biederman
@ 2001-01-15 11:54                                       ` Ralf Baechle
  2001-01-15 12:53                                         ` Anton Blanchard
                                                           ` (2 more replies)
  2001-01-15 12:51                                       ` Anton Blanchard
  1 sibling, 3 replies; 95+ messages in thread
From: Ralf Baechle @ 2001-01-15 11:54 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, linux-mm

On Mon, Jan 15, 2001 at 01:41:06AM -0700, Eric W. Biederman wrote:

(Cc list truncated since probably not so many people do care ...)

> shared mmap.  This is the important one.  Since we have a logical
> backing store this is easy to handle.  We just enforce that the
> virtual address in a process that we mmap something to must match the
> logical address to VIRT_INDEX_BITS.  The effect is the same as a
> larger page size with virtually no overhead.

I'm told this is going to break software.  Bad since it's otherwise it'd
be such a nice silver bullet solution.

> sysv shared memory is exactly the same as shared mmap.  Except instead
> of a file offset you have an offset into the sysv segment.

No, it's simpler in the MIPS case.  The ABI guys were nice and did define
that the virtual addresses have to be multiple of 256kbyte which is
more than sufficient to kill the problem.

> mremap.  Linux specific but pretty much the same as mmap, but easier.
> We just enforce that the virtual address of the source of mremap,
> and the destination of mremap match on VIRT_INDEX_BITS.

Correct and as mremap doesn't take any address argument we won't break
any expecations on the properties of the returned address in mmap.

> kmap is a little different.  using VIRT_INDEX_BITS is a little
> subtle but should work.  Currently kmap is used only with the page
> cache so we can take advantage of the page->index field.  From page->index 
> we can compute the logical offset of the page and make certain the
> page mapped with all VIRT_INDEX_BITS the same as a mmap alias.

Yup.  It gets somewhat tricker due to the page cache being in in KSEG0,
an memory area which is essentially like a 512mb page that is hardwired
in the CPU.  It's preferable to stick with since it means we never take
any TLB faults for pages in the page cache on MIPS.

> kmap and the swap cache are a little different.  Since index holds
> the location of a page on the swap file we'd have to make that index
> be the same for VIRT_INDEX_BITS as well.

> > That's a possible solution; I'm not clear how bad the overhead would be.
> > Right now a virtual alias is a relativly rare event and we don't want the
> > common case of no virtual alias to make pay a high price.  Or?
> 
> I guess the question is how big would these logical pages need to be?

Depending of the CPU 8kb to 32kb; the hardware supports page sizes 4kb, 16kb,
64kb ... 16mb.

> Answer big enough to turn your virtually indexed cache into a
> physically indexed cache.  Which means they would have to be cache
> size.  

For above mentioned CPU versions which have 8kb rsp. 16kb per primary cache
we want 32kb as mentioned.

> Increasing PAGE_SIZE a few bits shouldn't be bad but going up two
> orders of magnitude would likely skewer your swapping, and memory
> management performance.  You'd just have way to few pages.
> 
> But I have a better suggestion so see above.

> O.k. this is scratched off my list of possible good ideas.  Duh.  This
> fails for exactly the same reason as increasing as increasing page
> size.  at 256K cache and 4K PAGE_SIZE you'd need 256/4 = 64 different
> types of pages, fairly nasty.

You say it; yet it seems like it could be part of a good solution.  Just
forcefully allocating a single page by splitting a large page and before
that even swapping until we can actually allocate a higher order page is
bad.

> Hmm.  This doesn't sound right.  And this sounds like a silly way to
> use reverse mappings anyway, since you can do it up front in mmap and
> their kin.  Which means you don't have to slow any of the page fault
> logic up.

Then how do you handle something like:

  fd = open(TESTFILE, O_RDWR | O_CREAT, 664);
  res = write(fd, one, 4096);
  mmap(addr            , PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
  mmap(addr + PAGE_SIZE, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

If both mappings are immediately created accessible you'll directly endup
with aliases.  There is no choice, if the pagesize is only 4kb an R4x00
will create aliases in the case.  Bad.

> Hmm.  Correct.  If you have the page aliases appropriately colored across
> address spaces you will always hit the same cache block, and since you
> do virtual to physical before the tag compare a false hit won't hurt
> either.

As above example shows you may even get aliases in a single address space.

> Well virtually indexed caches look like worth supporting in the kernel
> since it is easy to do, and can be compiled out on architectures that
> don't support it.

At least for sparc it's already supported.  Right now I don't feel like
looking into the 2.4 solution but checkout srmmu_vac_update_mmu_cache in
the 2.2 kernel.

> For keeping cache collisions down I think we probably do a decent job
> already.  All we need to do is to continuously cycle through cache
> aliases.
>
> For not ensuring too many cache collisions I think we probably do a
> decent job already.

Virtual aliases are the kind of harmful collision that must be avoid or
data corruption will result.  We just happen to be lucky that there are
only very few applications which will actually suffer from this problem.
(Which is why we don't handle it correctly for all MIPSes ...)

>                       Only the least significant bits are significant.
> And virtual addresses matter not at all.  In the buddy system where we
> walk backward linearly through memory it feels o.k.  Only profiling
> would tell if we were helping of if we could even help with that.

Other Unices have choosen this implementation; of course they probably
already had the reverse mapping facilities present and didn't implement
them just for this purpose.

  Ralf

--
"Embrace, Enhance, Eliminate" - it worked for the pope, it'll work for Bill.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Caches, page coloring, virtual indexed caches, and more
  2001-01-15  8:41                                     ` Caches, page coloring, virtual indexed caches, and more Eric W. Biederman
  2001-01-15 11:54                                       ` Ralf Baechle
@ 2001-01-15 12:51                                       ` Anton Blanchard
  1 sibling, 0 replies; 95+ messages in thread
From: Anton Blanchard @ 2001-01-15 12:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ralf Baechle, David Weinehall, Alan Cox, Linus Torvalds,
	Andrea Arcangeli, David Woodhouse, Zlatko Calusic, Rik van Riel,
	linux-kernel, linux-mm

 
Hi,

> shared mmap.  This is the important one.  Since we have a logical
> backing store this is easy to handle.  We just enforce that the
> virtual address in a process that we mmap something to must match the
> logical address to VIRT_INDEX_BITS.  The effect is the same as a
> larger page size with virtually no overhead.

Check out arch/sparc64/kernel/sys_sparc32.c. Dave and I fixed this a while
ago. In particular look at the arch specific SPARC_FLAG_MMAPSHARED flag.

> sysv shared memory is exactly the same as shared mmap.  Except instead
> of a file offset you have an offset into the sysv segment.

sysv shared mem when you specify an attach address should work fine (ie
aligned to SHMLBA). On the other hand sysv shared mem without requesting
an address is broken and I havent got around to fixing it yet.

> mremap.  Linux specific but pretty much the same as mmap, but easier.
> We just enforce that the virtual address of the source of mremap,
> and the destination of mremap match on VIRT_INDEX_BITS.

See above.

Cheers,
Anton
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Caches, page coloring, virtual indexed caches, and more
  2001-01-15 11:54                                       ` Ralf Baechle
@ 2001-01-15 12:53                                         ` Anton Blanchard
  2001-01-15 17:41                                           ` Ralf Baechle
  2001-01-16  9:34                                           ` Eric W. Biederman
  2001-01-15 17:16                                         ` Eric W. Biederman
  2001-01-15 18:22                                         ` Jamie Lokier
  2 siblings, 2 replies; 95+ messages in thread
From: Anton Blanchard @ 2001-01-15 12:53 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Eric W. Biederman, linux-kernel, linux-mm

 
 
> At least for sparc it's already supported.  Right now I don't feel like
> looking into the 2.4 solution but checkout srmmu_vac_update_mmu_cache in
> the 2.2 kernel.

I killed that hack now that we align all shared mmaps to the same virtual
colour :)

Anton
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Caches, page coloring, virtual indexed caches, and more
  2001-01-15 11:54                                       ` Ralf Baechle
  2001-01-15 12:53                                         ` Anton Blanchard
@ 2001-01-15 17:16                                         ` Eric W. Biederman
  2001-01-16  4:58                                           ` Ralf Baechle
  2001-01-15 18:22                                         ` Jamie Lokier
  2 siblings, 1 reply; 95+ messages in thread
From: Eric W. Biederman @ 2001-01-15 17:16 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Eric W. Biederman, linux-kernel, linux-mm

Ralf Baechle <ralf@uni-koblenz.de> writes:

> On Mon, Jan 15, 2001 at 01:41:06AM -0700, Eric W. Biederman wrote:
> 
> (Cc list truncated since probably not so many people do care ...)
> 
> > shared mmap.  This is the important one.  Since we have a logical
> > backing store this is easy to handle.  We just enforce that the
> > virtual address in a process that we mmap something to must match the
> > logical address to VIRT_INDEX_BITS.  The effect is the same as a
> > larger page size with virtually no overhead.
> 
> I'm told this is going to break software.  Bad since it's otherwise it'd
> be such a nice silver bullet solution.

Heck if we wanted to we could even lie about PAGE_SIZE, and say it was huge.
I'd have to have a clear example before I give it up that easily.
mmap has never allowed totally arbitrary offsets, and mmap(MAP_FIXED)
is highly discouraged so I'd like to see it.

And on architectures that don't need this it should compile out with
no overhead.

> 
> > sysv shared memory is exactly the same as shared mmap.  Except instead
> > of a file offset you have an offset into the sysv segment.
> 
> No, it's simpler in the MIPS case.  The ABI guys were nice and did define
> that the virtual addresses have to be multiple of 256kbyte which is
> more than sufficient to kill the problem.

If VIRT_INDEX_BITS == 18 and because you can only map starting at
the beginning of a sysv shared memory segment this is exactly what
my code boils down to.

> 
> > mremap.  Linux specific but pretty much the same as mmap, but easier.
> > We just enforce that the virtual address of the source of mremap,
> > and the destination of mremap match on VIRT_INDEX_BITS.
> 
> Correct and as mremap doesn't take any address argument we won't break
> any expecations on the properties of the returned address in mmap.
> 
> > kmap is a little different.  using VIRT_INDEX_BITS is a little
> > subtle but should work.  Currently kmap is used only with the page
> > cache so we can take advantage of the page->index field.  From page->index 
> > we can compute the logical offset of the page and make certain the
> > page mapped with all VIRT_INDEX_BITS the same as a mmap alias.
> 
> Yup.  It gets somewhat tricker due to the page cache being in in KSEG0,
> an memory area which is essentially like a 512mb page that is hardwired
> in the CPU.  It's preferable to stick with since it means we never take
> any TLB faults for pages in the page cache on MIPS.

Good.  Then we don't need (at least for mips) to worry about this case.
I was just thinking through the general case.  

> > kmap and the swap cache are a little different.  Since index holds
> > the location of a page on the swap file we'd have to make that index
> > be the same for VIRT_INDEX_BITS as well.
> 
> > > That's a possible solution; I'm not clear how bad the overhead would be.
> > > Right now a virtual alias is a relativly rare event and we don't want the
> > > common case of no virtual alias to make pay a high price.  Or?
> > 
> > I guess the question is how big would these logical pages need to be?
> 
> Depending of the CPU 8kb to 32kb; the hardware supports page sizes 4kb, 16kb,
> 64kb ... 16mb.

If all you need is 32kb that is better than the 256K number I had in my head.
Still as far as an application is concerned the results are the same as
my silver bullet above.

> > Answer big enough to turn your virtually indexed cache into a
> > physically indexed cache.  Which means they would have to be cache
> > size.  
> 
> For above mentioned CPU versions which have 8kb rsp. 16kb per primary cache
> we want 32kb as mentioned.
> 
> > Increasing PAGE_SIZE a few bits shouldn't be bad but going up two
> > orders of magnitude would likely skewer your swapping, and memory
> > management performance.  You'd just have way to few pages.
> > 
> > But I have a better suggestion so see above.
> 
> > O.k. this is scratched off my list of possible good ideas.  Duh.  This
> > fails for exactly the same reason as increasing as increasing page
> > size.  at 256K cache and 4K PAGE_SIZE you'd need 256/4 = 64 different
> > types of pages, fairly nasty.
> 
> You say it; yet it seems like it could be part of a good solution.  Just
> forcefully allocating a single page by splitting a large page and before
> that even swapping until we can actually allocate a higher order page is
> bad.

I totally agree.  Larger pages don't suck but are unnecessary.  At least
I haven't been convinced otherwise yet.


> > Hmm.  This doesn't sound right.  And this sounds like a silly way to
> > use reverse mappings anyway, since you can do it up front in mmap and
> > their kin.  Which means you don't have to slow any of the page fault
> > logic up.
> 
> Then how do you handle something like:
> 
>   fd = open(TESTFILE, O_RDWR | O_CREAT, 664);
>   res = write(fd, one, 4096);
>   mmap(addr            , PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>   mmap(addr + PAGE_SIZE, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
> 
> If both mappings are immediately created accessible you'll directly endup
> with aliases.  There is no choice, if the pagesize is only 4kb an R4x00
> will create aliases in the case.  Bad.

If MAP_FIXED isn't being used, I allocate them 256K apart. (Totally legal)
If MAP_FIXED is being used I fail the second(legal), or I lie and say that 
PAGE_SIZE is 256K while I'm at it, so it falls out naturally.

> > Hmm.  Correct.  If you have the page aliases appropriately colored across
> > address spaces you will always hit the same cache block, and since you
> > do virtual to physical before the tag compare a false hit won't hurt
> > either.
> 
> As above example shows you may even get aliases in a single address space.

Right.  I had thought through that case, and by just catching it at mmap
time is sufficient.  And again if we lie about PAGE_SIZE to the application
it must work.

> > Well virtually indexed caches look like worth supporting in the kernel
> > since it is easy to do, and can be compiled out on architectures that
> > don't support it.
> 
> At least for sparc it's already supported.  Right now I don't feel like
> looking into the 2.4 solution but checkout srmmu_vac_update_mmu_cache in
> the 2.2 kernel.

Hmm.  I see.  At least as of 2.2.12 (most recent I have on hand) the idea 
looks o.k. (Though the code itself looks broken).  It's kind of an
expensive idea though.  Even if we had reverse page tables, it's extra
work every page fault.


> > For keeping cache collisions down I think we probably do a decent job
> > already.  All we need to do is to continuously cycle through cache
> > aliases.
> >
> > For not ensuring too many cache collisions I think we probably do a
> > decent job already.
> 
> Virtual aliases are the kind of harmful collision that must be avoid or
> data corruption will result.  We just happen to be lucky that there are
> only very few applications which will actually suffer from this problem.
> (Which is why we don't handle it correctly for all MIPSes ...)

Exactly.  So we must handle this.  If you could comment on which apps
break with my solution, I'd like to hear about it. 

> 
> >                       Only the least significant bits are significant.
> > And virtual addresses matter not at all.  In the buddy system where we
> > walk backward linearly through memory it feels o.k.  Only profiling
> > would tell if we were helping of if we could even help with that.
> 
> Other Unices have choosen this implementation; of course they probably
> already had the reverse mapping facilities present and didn't implement
> them just for this purpose.

Different issue here.  I was thinking to about performance optimization by
avoiding cache contention.  

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Caches, page coloring, virtual indexed caches, and more
  2001-01-15 12:53                                         ` Anton Blanchard
@ 2001-01-15 17:41                                           ` Ralf Baechle
  2001-01-17  4:36                                             ` Anton Blanchard
  2001-01-16  9:34                                           ` Eric W. Biederman
  1 sibling, 1 reply; 95+ messages in thread
From: Ralf Baechle @ 2001-01-15 17:41 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Eric W. Biederman, linux-kernel, linux-mm

On Mon, Jan 15, 2001 at 11:53:40PM +1100, Anton Blanchard wrote:

> > At least for sparc it's already supported.  Right now I don't feel like
> > looking into the 2.4 solution but checkout srmmu_vac_update_mmu_cache in
> > the 2.2 kernel.
> 
> I killed that hack now that we align all shared mmaps to the same virtual
> colour:)

Did you find any software that breaks due to the additional restriction
on the virtual addresses of mappings?

  Ralf
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Caches, page coloring, virtual indexed caches, and more
  2001-01-15 11:54                                       ` Ralf Baechle
  2001-01-15 12:53                                         ` Anton Blanchard
  2001-01-15 17:16                                         ` Eric W. Biederman
@ 2001-01-15 18:22                                         ` Jamie Lokier
  2 siblings, 0 replies; 95+ messages in thread
From: Jamie Lokier @ 2001-01-15 18:22 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Eric W. Biederman, linux-kernel, linux-mm

Ralf Baechle wrote:
> > mremap.  Linux specific but pretty much the same as mmap, but easier.
> > We just enforce that the virtual address of the source of mremap,
> > and the destination of mremap match on VIRT_INDEX_BITS.
> 
> Correct and as mremap doesn't take any address argument we won't break
> any expecations on the properties of the returned address in mmap.

See MREMAP_FIXED.  There is an address argument, not mentioned in the
manpage (man-pages 1.30).

> > Hmm.  This doesn't sound right.  And this sounds like a silly way to
> > use reverse mappings anyway, since you can do it up front in mmap and
> > their kin.  Which means you don't have to slow any of the page fault
> > logic up.
> 
> Then how do you handle something like:
> 
>   fd = open(TESTFILE, O_RDWR | O_CREAT, 664);
>   res = write(fd, one, 4096);
>   mmap(addr            , PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>   mmap(addr + PAGE_SIZE, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
> 
> If both mappings are immediately created accessible you'll directly endup
> with aliases.  There is no choice, if the pagesize is only 4kb an R4x00
> will create aliases in the case.  Bad.

Indeed, a particularly nice way to handle circular buffers for DSP
algorithms provided it works :-)

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Caches, page coloring, virtual indexed caches, and more
  2001-01-15 17:16                                         ` Eric W. Biederman
@ 2001-01-16  4:58                                           ` Ralf Baechle
  0 siblings, 0 replies; 95+ messages in thread
From: Ralf Baechle @ 2001-01-16  4:58 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, linux-mm

On Mon, Jan 15, 2001 at 10:16:57AM -0700, Eric W. Biederman wrote:

> Heck if we wanted to we could even lie about PAGE_SIZE, and say it was huge.
> I'd have to have a clear example before I give it up that easily.
> mmap has never allowed totally arbitrary offsets, and mmap(MAP_FIXED)
> is highly discouraged so I'd like to see it.
> 
> And on architectures that don't need this it should compile out with
> no overhead.

> > > sysv shared memory is exactly the same as shared mmap.  Except instead
> > > of a file offset you have an offset into the sysv segment.
> > 
> > No, it's simpler in the MIPS case.  The ABI guys were nice and did define
> > that the virtual addresses have to be multiple of 256kbyte which is
> > more than sufficient to kill the problem.
> 
> If VIRT_INDEX_BITS == 18 and because you can only map starting at
> the beginning of a sysv shared memory segment this is exactly what
> my code boils down to.

> > > kmap is a little different.  using VIRT_INDEX_BITS is a little
> > > subtle but should work.  Currently kmap is used only with the page
> > > cache so we can take advantage of the page->index field.  From
> > > page->index we can compute the logical offset of the page and make
> > > certain the page mapped with all VIRT_INDEX_BITS the same as a mmap
> > > alias.
> > 
> > Yup.  It gets somewhat tricker due to the page cache being in in KSEG0,
> > an memory area which is essentially like a 512mb page that is hardwired
> > in the CPU.  It's preferable to stick with since it means we never take
> > any TLB faults for pages in the page cache on MIPS.
> 
> Good.  Then we don't need (at least for mips) to worry about this case.
> I was just thinking through the general case.  

Not that good.  Think of mmaping a file, then write(2)ing to it, then reading
from it's mapping.  Same for a mmap(2), read(2) sequence followed by a read
from the memory. This might result in aliases between the page cache and
the userspace.

A solution would be to use a kmap mapping outside KSEG0 for accessing the
pagecache of all data that is also mapped to userspace, if aliases might
occur.

> I totally agree.  Larger pages don't suck but are unnecessary.  At least
> I haven't been convinced otherwise yet.

The big iron guys actually love LARGE pages; I think IRIX on Origins uses
something lie 64kb pages or so and may make use of even larger pages in
it's page tables and mappings to get the TLB fault rate down.  There are
Usenix papers from HP and SGI about the issue; the performance increase
they report for certain apps are impressive.

> > If both mappings are immediately created accessible you'll directly endup
> > with aliases.  There is no choice, if the pagesize is only 4kb an R4x00
> > will create aliases in the case.  Bad.
> 
> If MAP_FIXED isn't being used, I allocate them 256K apart. (Totally legal)
> If MAP_FIXED is being used I fail the second(legal), or I lie and say that 
> PAGE_SIZE is 256K while I'm at it, so it falls out naturally.

MIPS ABI says larger page size is ok; it's just that on Linux a page size
of only 4kb (and 8kb for Alpha) has been hardcoded in tons of places.  Oh
well, let's break what's broken.  Luckily the IA64 guys are already doing
alot of the required fixing.

> > At least for sparc it's already supported.  Right now I don't feel like
> > looking into the 2.4 solution but checkout srmmu_vac_update_mmu_cache in
> > the 2.2 kernel.
> 
> Hmm.  I see.  At least as of 2.2.12 (most recent I have on hand) the idea 
> looks o.k. (Though the code itself looks broken).  It's kind of an
> expensive idea though.

Indeed - which is why I never was able to get myself a barf bag and
implement the same for MIPS ;-)

> Even if we had reverse page tables, it's extra work every page fault.

It's only going to impact pages which actually have aliases.  IRIX for
example uses a dual strategy.  They don't restrict addresses for MAP_FIXED
but try hard to use non-conflicting addresses whereever possible.  The
part with the reverse mappings which I just explained is only the last
alternative when a user actualy enforced the creation of mappings at
conflicting addresses.

Jamie Lokier's posting already mentioned it - mapping the same address
space twice as in the code snipet I gave is a nice way of implementing
circular buffers; I've already seen such code ...  on Intel boxen.

> > Virtual aliases are the kind of harmful collision that must be avoid or
> > data corruption will result.  We just happen to be lucky that there are
> > only very few applications which will actually suffer from this problem.
> > (Which is why we don't handle it correctly for all MIPSes ...)
> 
> Exactly.  So we must handle this.  If you could comment on which apps
> break with my solution, I'd like to hear about it. 

The problem with simply ignoring the problem is that it results in silent
data corruption.  So even if your solution is breaking more code I like it
more - a syscall error return is a obvious problem which application people
know how to handle.

The know application which breaks due to aliases is the lock manager of
some database product running on Cobalt's MIPS boxen.

  Ralf
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Caches, page coloring, virtual indexed caches, and more
  2001-01-15 12:53                                         ` Anton Blanchard
  2001-01-15 17:41                                           ` Ralf Baechle
@ 2001-01-16  9:34                                           ` Eric W. Biederman
  2001-01-17  4:43                                             ` Anton Blanchard
  1 sibling, 1 reply; 95+ messages in thread
From: Eric W. Biederman @ 2001-01-16  9:34 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Ralf Baechle, linux-kernel, linux-mm

Anton Blanchard <anton@linuxcare.com.au> writes:

>  
>  
> > At least for sparc it's already supported.  Right now I don't feel like
> > looking into the 2.4 solution but checkout srmmu_vac_update_mmu_cache in
> > the 2.2 kernel.
> 
> I killed that hack now that we align all shared mmaps to the same virtual
> colour :)

Nice.

Where do you do this?  And how do you handle the case of aliases with kseg,
the giant kernel mapping.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Caches, page coloring, virtual indexed caches, and more
  2001-01-15 17:41                                           ` Ralf Baechle
@ 2001-01-17  4:36                                             ` Anton Blanchard
  0 siblings, 0 replies; 95+ messages in thread
From: Anton Blanchard @ 2001-01-17  4:36 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Eric W. Biederman, linux-kernel, linux-mm

 
> Did you find any software that breaks due to the additional restriction
> on the virtual addresses of mappings?

Not yet. A good test of shared mmap coherency is a recent samba
(2.2 and above) that uses tdb. Tdb relies on shared mmaps heavily and
uncovered the bug when running on a dual ultrasparc pretty quickly.

Anton
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Caches, page coloring, virtual indexed caches, and more
  2001-01-16  9:34                                           ` Eric W. Biederman
@ 2001-01-17  4:43                                             ` Anton Blanchard
  2001-01-17  8:35                                               ` Eric W. Biederman
  0 siblings, 1 reply; 95+ messages in thread
From: Anton Blanchard @ 2001-01-17  4:43 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Ralf Baechle, linux-kernel, linux-mm


Hi,
 
> Where do you do this?  And how do you handle the case of aliases with kseg,
> the giant kernel mapping.

Aliases between user and kernel mappings of a page are handled by
flush_page_to_ram the old interface) or {copy,clear}_user_page,
flush_dcache_page and update_mmu_cache (new interface). Sparc64 already
uses the new interface and there are patches for ppc and ia64 to use it.

The new interface allows flushes to be avoided, leading to rather nice
performance increases.

See Documentation/cachetlb.txt for more info.

Cheers,
Anton
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Caches, page coloring, virtual indexed caches, and more
  2001-01-17  4:43                                             ` Anton Blanchard
@ 2001-01-17  8:35                                               ` Eric W. Biederman
  0 siblings, 0 replies; 95+ messages in thread
From: Eric W. Biederman @ 2001-01-17  8:35 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Ralf Baechle, linux-kernel, linux-mm

Anton Blanchard <anton@linuxcare.com.au> writes:

> Hi,
>  
> > Where do you do this?  And how do you handle the case of aliases with kseg,
> > the giant kernel mapping.
> 
> Aliases between user and kernel mappings of a page are handled by
> flush_page_to_ram the old interface) or {copy,clear}_user_page,
> flush_dcache_page and update_mmu_cache (new interface). Sparc64 already
> uses the new interface and there are patches for ppc and ia64 to use it.
> 
> The new interface allows flushes to be avoided, leading to rather nice
> performance increases.
> 
> See Documentation/cachetlb.txt for more info.

Thanks,

Well they are a step in the right direction....
But they are still racy, especially on SMP.

The bad case is:
Process A in kernel space calls flush_dcache_page.
Then process B in a separate thread writes to the first word in a
cache line. The Process A writes to the last word in the cache line. 

Assuming the virtual addresses from Process A and Process B are of a
different color this gives two non overlapping writes with a well
defined meaning, which the kernel gets wrong.  In particular the ram
will only see one write or the other not both.

What it looks like to me is that SHMLBA needs to be extended to normal
mmapings, making all pages in user space
(page->index << PAGE_SHIFT) % SHMLBA 
virtually aligned.

And whenever we access a page in the page cache that is not
appropriately virtually aligned in the fixed kernel mapping, 
we can use the kmap infrastructure to map it to a better kernel
location.  If we reuse the same optimizations from flush_dcache_page
it shouldn't be any worse, and in the pathological cases it will be
faster.   While removing the races seen above.

Any thoughts?

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-09 19:09               ` Daniel Phillips
  2001-01-09 19:29                 ` Trond Myklebust
  2001-01-09 19:37                 ` Linus Torvalds
@ 2001-01-17  8:46                 ` Rik van Riel
  2001-01-25 22:51                   ` Daniel Phillips
  2 siblings, 1 reply; 95+ messages in thread
From: Rik van Riel @ 2001-01-17  8:46 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linus Torvalds, linux-kernel

On Tue, 9 Jan 2001, Daniel Phillips wrote:
> Linus Torvalds wrote:
> > (This is why I worked so hard at getting the PageDirty semantics right in
> > the last two months or so - and why I released 2.4.0 when I did. Getting
> > PageDirty right was the big step to make all of the VM stuff possible in
> > the first place. Even if it probably looked a bit foolhardy to change the
> > semantics of "writepage()" quite radically just before 2.4 was released).
>
> On the topic of writepage, it's not symmetric with readpage at
> the moment - it still takes (struct file *).  Is this in the
> cleanup pipeline?  It looks like nfs_readpage already ignores
> the struct file *, but maybe some other net filesystems are
> still depending on it.

writepage() and readpage() will never be symmetric...

readpage()
	program can't continue until data is there
	reading in larger clusters eats (wastes?) more memory
	done when we think a process needs data

writepage()
	called after the process has written data and moved on
	writing larger clusters has no influence on memory use
	often done to free up memory

Since readpage() needs to tune readahead behaviour, we will
always want to give it some information (eg. in the file *)
so it can do the extra things it needs to do.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 18:33                         ` Andrea Arcangeli
@ 2001-01-17 14:26                           ` Rik van Riel
  0 siblings, 0 replies; 95+ messages in thread
From: Rik van Riel @ 2001-01-17 14:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Eric W. Biederman, David Woodhouse, Linus Torvalds,
	Zlatko Calusic, linux-kernel

On Wed, 10 Jan 2001, Andrea Arcangeli wrote:
> On Wed, Jan 10, 2001 at 10:46:07AM -0700, Eric W. Biederman wrote:

> > My impression with the MM stuff is that everyone except linux is
> > trying hard to clone BSD instead of thinking through the issues
> > ourselves.
>
> I wasn't even thinking about BSD and I always though about the
> issues myself, no panic ;).

Andrea, if you have the time, please do check out the
FreeBSD and NetBSD VM code.

The FreeBSD code has the original Mach overengineered
abstraction layer, but an absolutely kickass page
replacement strategy.

The NetBSD code has cleaned up the abstraction layer
into something nice and lower overhead, but has a lot
simpler (probably lower performance) page replacement.

It would be cool if some of the Linux hackers could take
the time and look at this code to see if there are some
good ideas we might want to have in Linux.

It might just be the case that we DON'T want to reinvent
the wheel (that others have made into a nice round shape
with 15 years of trial, error and redesigning).

(though I know some people prefer reinventing wheels ;))

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:03                         ` Linus Torvalds
  2001-01-10 19:27                           ` David S. Miller
  2001-01-10 19:36                           ` Alan Cox
@ 2001-01-17 14:28                           ` Rik van Riel
  2001-01-18  1:23                             ` Linus Torvalds
  2 siblings, 1 reply; 95+ messages in thread
From: Rik van Riel @ 2001-01-17 14:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Andrea Arcangeli, David Woodhouse,
	Zlatko Calusic, linux-kernel

On Wed, 10 Jan 2001, Linus Torvalds wrote:

> I looked at it a year or two ago myself, and came to the
> conclusion that I don't want to blow up our page table size by a
> factor of three or more, so I'm not personally interested any
> more. Maybe somebody else comes up with a better way to do it,
> or with a really compelling reason to.

OTOH, it _would_ get rid of all the balancing issues in one
blow. And it would fix the aliasing issues and possibly the
memory fragmentation problem too.

And using something like Davem's lower-overhead reverse
mapping layer, we might just be able to pull off all (or most)
of the advantages with lower overhead ;)

[this is something I will be looking into for 2.5]

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-17 14:28                           ` Subtle MM bug Rik van Riel
@ 2001-01-18  1:23                             ` Linus Torvalds
  2001-01-18 11:48                               ` Rik van Riel
  0 siblings, 1 reply; 95+ messages in thread
From: Linus Torvalds @ 2001-01-18  1:23 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.31.0101180126240.31432-100000@localhost.localdomain>,
Rik van Riel  <riel@conectiva.com.br> wrote:
>On Wed, 10 Jan 2001, Linus Torvalds wrote:
>
>> I looked at it a year or two ago myself, and came to the
>> conclusion that I don't want to blow up our page table size by a
>> factor of three or more, so I'm not personally interested any
>> more. Maybe somebody else comes up with a better way to do it,
>> or with a really compelling reason to.
>
>OTOH, it _would_ get rid of all the balancing issues in one
>blow. And it would fix the aliasing issues and possibly the
>memory fragmentation problem too.

I totally disagree.

It might help fragmentation, but it has absolutely _no_ impact on
balancing. See my comments about not seeing the "accessed" bit until way
too late with a "find by physical" approach.

You simply _cannot_ use "find by physical" for balancing, unless you're
willing to pay the price of doing software accessed bits even on
hardware that does it for you in the page tables.  Which is a price MUCH
too high to pay, I suspect. 

The current vmscanning is the way to go.  Getting PageDirty was a big
step for it, because it is needed so that we can drop pages without
having to do IO like we historically did.  I doubt find-by-physical will
help AT ALL wrt balancing. 

		Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-18  1:23                             ` Linus Torvalds
@ 2001-01-18 11:48                               ` Rik van Riel
  0 siblings, 0 replies; 95+ messages in thread
From: Rik van Riel @ 2001-01-18 11:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On 17 Jan 2001, Linus Torvalds wrote:
> Rik van Riel  <riel@conectiva.com.br> wrote:
> >On Wed, 10 Jan 2001, Linus Torvalds wrote:
> >
> >> I looked at it a year or two ago myself, and came to the
> >> conclusion that I don't want to blow up our page table size by a
> >> factor of three or more, so I'm not personally interested any
> >> more. Maybe somebody else comes up with a better way to do it,
> >> or with a really compelling reason to.
> >
> >OTOH, it _would_ get rid of all the balancing issues in one
> >blow. And it would fix the aliasing issues and possibly the
> >memory fragmentation problem too.
>
> I totally disagree.

I still haven't seen anything that might get us a
"universally correct" balancing between swap_out()
and refill_inactive_scan().

We either scan both categories at the same relative
rate, which gives mapped pages an advantage because
they may get unmapped later than the unmapped pages
get deactivated.

Alternatively, you do the scanning between these two
at different rates, which gives an advantage to one
or the other.

(or am I overlooking something stupid here?)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-17  8:46                 ` Rik van Riel
@ 2001-01-25 22:51                   ` Daniel Phillips
  0 siblings, 0 replies; 95+ messages in thread
From: Daniel Phillips @ 2001-01-25 22:51 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel

Rik van Riel wrote:
> 
> On Tue, 9 Jan 2001, Daniel Phillips wrote:
> > Linus Torvalds wrote:
> > > (This is why I worked so hard at getting the PageDirty semantics right in
> > > the last two months or so - and why I released 2.4.0 when I did. Getting
> > > PageDirty right was the big step to make all of the VM stuff possible in
> > > the first place. Even if it probably looked a bit foolhardy to change the
> > > semantics of "writepage()" quite radically just before 2.4 was released).
> >
> > On the topic of writepage, it's not symmetric with readpage at
> > the moment - it still takes (struct file *).  Is this in the
> > cleanup pipeline?  It looks like nfs_readpage already ignores
> > the struct file *, but maybe some other net filesystems are
> > still depending on it.
> 
> writepage() and readpage() will never be symmetric...
> 
> readpage()
>         program can't continue until data is there
>         reading in larger clusters eats (wastes?) more memory
>         done when we think a process needs data
> 
> writepage()
>         called after the process has written data and moved on
>         writing larger clusters has no influence on memory use
>         often done to free up memory
> 
> Since readpage() needs to tune readahead behaviour, we will
> always want to give it some information (eg. in the file *)
> so it can do the extra things it needs to do.

Which extra information did you have in mind?

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-18  1:32         ` Rik van Riel
@ 2001-04-17 19:37           ` H. Peter Anvin
  0 siblings, 0 replies; 95+ messages in thread
From: H. Peter Anvin @ 2001-04-17 19:37 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.LNX.4.31.0101181230020.31432-100000@localhost.localdomain>
By author:    Rik van Riel <riel@conectiva.com.br>
In newsgroup: linux.dev.kernel
> 
> Suppose you have 8 high-priority tasks waiting on kswapd
> and one lower-priority (but still higher than kswapd)
> process running and preventing kswapd from doing its work.
> Oh .. and also preventing the higher-priority tasks from
> being woken up and continuing...
> 

Classic priority inversion.  In this particular case it seems like it
should be unusually simple to apply priority inheritance, though (the
general case is complicated by the fact that the dependency matrix
usually isn't readily available.)

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-17 18:53       ` Zlatko Calusic
@ 2001-01-18  1:32         ` Rik van Riel
  2001-04-17 19:37           ` H. Peter Anvin
  0 siblings, 1 reply; 95+ messages in thread
From: Rik van Riel @ 2001-01-18  1:32 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: linux-kernel, linux-mm

On 17 Jan 2001, Zlatko Calusic wrote:
> Rik van Riel <riel@conectiva.com.br> writes:
>
> > > Second test: kernel compile make -j32 (empirically this puts the
> > > VM under load, but not excessively!)
> > >
> > > 2.2.17 -> make -j32  392.49s user 47.87s system 168% cpu 4:21.13 total
> > > 2.4.0  -> make -j32  389.59s user 31.29s system 182% cpu 3:50.24 total
> > >
> > > Now, is this great news or what, 2.4.0 is definitely faster.
> >
> > One problem is that these tasks may be waiting on kswapd when
> > kswapd might not get scheduled in on time. On the one hand this
> > will mean lower load and less thrashing, on the other hand it
> > means more IO wait.
>
> Hm, if all tasks are waiting for memory, what is stopping kswapd
> to run? :)

Suppose you have 8 high-priority tasks waiting on kswapd
and one lower-priority (but still higher than kswapd)
process running and preventing kswapd from doing its work.
Oh .. and also preventing the higher-priority tasks from
being woken up and continuing...


Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-17  4:48     ` Rik van Riel
@ 2001-01-17 18:53       ` Zlatko Calusic
  2001-01-18  1:32         ` Rik van Riel
  0 siblings, 1 reply; 95+ messages in thread
From: Zlatko Calusic @ 2001-01-17 18:53 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> > Second test: kernel compile make -j32 (empirically this puts the
> > VM under load, but not excessively!)
> >
> > 2.2.17 -> make -j32  392.49s user 47.87s system 168% cpu 4:21.13 total
> > 2.4.0  -> make -j32  389.59s user 31.29s system 182% cpu 3:50.24 total
> >
> > Now, is this great news or what, 2.4.0 is definitely faster.
> 
> One problem is that these tasks may be waiting on kswapd when
> kswapd might not get scheduled in on time. On the one hand this
> will mean lower load and less thrashing, on the other hand it
> means more IO wait.
> 

Hm, if all tasks are waiting for memory, what is stopping kswapd to
run? :)
-- 
Zlatko
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-09  2:01   ` Zlatko Calusic
@ 2001-01-17  4:48     ` Rik van Riel
  2001-01-17 18:53       ` Zlatko Calusic
  0 siblings, 1 reply; 95+ messages in thread
From: Rik van Riel @ 2001-01-17  4:48 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: linux-kernel, linux-mm

On 9 Jan 2001, Zlatko Calusic wrote:
> Rik van Riel <riel@conectiva.com.br> writes:
>
> > Now if 2.4 has worse _performance_ than 2.2 due to one
> > reason or another, that I'd like to hear about ;)
> >
>
> Oh, well, it seems that I was wrong. :)
>
> First test: hogmem 180 5 = allocate 180MB and dirty it 5 times (on a
> 192MB machine)
>
> kernel | swap usage | speed
> -------------------------------
> 2.2.17 |  48 MB     | 11.8 MB/s
> -------------------------------
> 2.4.0  | 206 MB     | 11.1 MB/s
> -------------------------------
>
> So 2.2 is only marginally faster. Also it can be seen that 2.4
> uses 4 times more swap space. If Linus says it's ok... :)

I have been working on some changes to page_launder() which
might just fix this problem. Quick and dirty patches are on
my home page and I'll try to clean things up and make something
correct & clean later today or tomorrow ;)

> Second test: kernel compile make -j32 (empirically this puts the
> VM under load, but not excessively!)
>
> 2.2.17 -> make -j32  392.49s user 47.87s system 168% cpu 4:21.13 total
> 2.4.0  -> make -j32  389.59s user 31.29s system 182% cpu 3:50.24 total
>
> Now, is this great news or what, 2.4.0 is definitely faster.

One problem is that these tasks may be waiting on kswapd when
kswapd might not get scheduled in on time. On the one hand this
will mean lower load and less thrashing, on the other hand it
means more IO wait.

This is another area where we may be able to improve some things.

(btw, according to Alan the 2.4 kernel is the first one to break
the 1.2 kernel compiling speed record on an 8MB machine he has ;))

cheers,

Rik  (stuck in australia on a conference)
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
@ 2001-01-10 19:57 Chris Wing
  0 siblings, 0 replies; 95+ messages in thread
From: Chris Wing @ 2001-01-10 19:57 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

Alan:

> I've seen exactly nil cases where there are any security holes in apps caused
> by that pthreads api non adherance. 

I don't know of any exploitable bugs that were found in it, but the identd
server included in Red Hat 6.1 (pidentd 3.0.10) unintentionally ran as
root instead of nobody because its programmer used pthreads and assumed
that setuid() would affect all threads.

I pointed this out to the author and Red Hat, and it was fixed in
pidentd 3.0.11 and Red Hat 6.2.

-Chris Wing
wingc@engin.umich.edu

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-07 21:37 ` Rik van Riel
  2001-01-07 22:33   ` Zlatko Calusic
@ 2001-01-09  2:01   ` Zlatko Calusic
  2001-01-17  4:48     ` Rik van Riel
  1 sibling, 1 reply; 95+ messages in thread
From: Zlatko Calusic @ 2001-01-09  2:01 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> Now if 2.4 has worse _performance_ than 2.2 due to one
> reason or another, that I'd like to hear about ;)
> 

Oh, well, it seems that I was wrong. :)


First test: hogmem 180 5 = allocate 180MB and dirty it 5 times (on a
192MB machine)

kernel | swap usage | speed
-------------------------------
2.2.17 |  48 MB     | 11.8 MB/s
-------------------------------
2.4.0  | 206 MB     | 11.1 MB/s
-------------------------------

So 2.2 is only marginally faster. Also it can be seen that 2.4 uses 4
times more swap space. If Linus says it's ok... :)


Second test: kernel compile make -j32 (empirically this puts the VM
under load, but not excessively!)

2.2.17 -> make -j32  392.49s user 47.87s system 168% cpu 4:21.13 total
2.4.0  -> make -j32  389.59s user 31.29s system 182% cpu 3:50.24 total

Now, is this great news or what, 2.4.0 is definitely faster.

-- 
Zlatko
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08 23:30     ` Andrea Arcangeli
@ 2001-01-09  0:37       ` Linus Torvalds
  0 siblings, 0 replies; 95+ messages in thread
From: Linus Torvalds @ 2001-01-09  0:37 UTC (permalink / raw)
  To: linux-kernel

In article <20010109003002.L27646@athlon.random>,
Andrea Arcangeli  <andrea@suse.de> wrote:
>On Mon, Jan 08, 2001 at 03:22:44PM -0800, Wayne Whitney wrote:
>> I guess I conclude that either (1) MAGMA does not use libc's malloc
>> (checking on this, I doubt it) or (2) glibc-2.1.92 knows of these
>> variables but has not yet implemented the tuning (I'll try glibc-2.2) or
>> (3) this is not the problem.
>
>You should monitor the program with strace while it fails (last few syscalls).
>You can breakpoint at exit() and run `cat /proc/pid/maps` to show us the vma
>layout of the task. Then we'll see why it's failing.  With CONFIG_1G in 2.2.x
>or 2.4.x (confinguration option doesn't matter) you should at least reach
>something like 1.5G.

It might be doing its own memory management with brk() directly - some
older UNIX programs will do that (for various reasons - it can be faster
than malloc() etc if you know your access patterns, for example).

If you do that, and you have shared libraries, you'll get a failure
around the point Wayne sees it. 

But your suggestion to check with strace is a good one.

		Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08 23:22   ` Wayne Whitney
@ 2001-01-08 23:30     ` Andrea Arcangeli
  2001-01-09  0:37       ` Linus Torvalds
  0 siblings, 1 reply; 95+ messages in thread
From: Andrea Arcangeli @ 2001-01-08 23:30 UTC (permalink / raw)
  To: Wayne Whitney; +Cc: Szabolcs Szakacsits, LKML, Andi Kleen

On Mon, Jan 08, 2001 at 03:22:44PM -0800, Wayne Whitney wrote:
> I guess I conclude that either (1) MAGMA does not use libc's malloc
> (checking on this, I doubt it) or (2) glibc-2.1.92 knows of these
> variables but has not yet implemented the tuning (I'll try glibc-2.2) or
> (3) this is not the problem.

You should monitor the program with strace while it fails (last few syscalls).
You can breakpoint at exit() and run `cat /proc/pid/maps` to show us the vma
layout of the task. Then we'll see why it's failing.  With CONFIG_1G in 2.2.x
or 2.4.x (confinguration option doesn't matter) you should at least reach
something like 1.5G.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08 21:56 ` Wayne Whitney
@ 2001-01-08 23:22   ` Wayne Whitney
  2001-01-08 23:30     ` Andrea Arcangeli
  0 siblings, 1 reply; 95+ messages in thread
From: Wayne Whitney @ 2001-01-08 23:22 UTC (permalink / raw)
  To: Szabolcs Szakacsits; +Cc: LKML, Andi Kleen

On Mon, 8 Jan 2001, Wayne Whitney wrote:

> On Mon, 8 Jan 2001, Szabolcs Szakacsits wrote:
>
> > AFAIK newer glibc = CVS glibc but the malloc() tune parameters work
> > via environment variables for the current stable ones as well,
>
> I'll arrange a binary linked against glibc2.2, and then your suggestion
> will hopefully do the trick.  Thanks for your kind help!

OK, I now have a binary dynamically linked against /lib/libc.so.6,
(according to ldd), and that points to glibc-2.1.92.  And I tried setting
the environment variables you suggested, I checked that they are set and
checked that they appear in /lib/libc.so.6.  But the behaviour is
unchanged:  MAGMA still hits this barrier at 830M (not 870M, that was a
typo).

I guess I conclude that either (1) MAGMA does not use libc's malloc
(checking on this, I doubt it) or (2) glibc-2.1.92 knows of these
variables but has not yet implemented the tuning (I'll try glibc-2.2) or
(3) this is not the problem.

I'll look at Andrea's hack as well.  Thanks for everybody's help!

Cheers, Wayne






-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08 22:00 ` Wayne Whitney
@ 2001-01-08 22:15   ` Andrea Arcangeli
  0 siblings, 0 replies; 95+ messages in thread
From: Andrea Arcangeli @ 2001-01-08 22:15 UTC (permalink / raw)
  To: Wayne Whitney; +Cc: Szabolcs Szakacsits, LKML, Andi Kleen

On Mon, Jan 08, 2001 at 02:00:19PM -0800, Wayne Whitney wrote:
> I'd ask if this jives with your theory:  if I configure the linux kernel
> to be able to use 2GB of RAM, then the 870MB limit becomes much lower, to
> 230MB.

It's because the virtual address space for userspace tasks gets reduced
from 3G to 2G to give an additional giga of direct mapping to the kernel.

Also the other limit you hit (at around 800mbyte) is partly because
of the too low userspace virtual address space.

You can use this hack by me to allow the tasks to grow up to 3.5G per task on
IA32 on 2.4.0 (equivalent hack exists for 2.2.19pre6aa1 with bigmem, btw it
makes sense also without bigmem if you have lots of swap, that's all about
virtual memory not physical RAM).  However it doesn't work with PAE enabled
yet.

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test11-pre5/per-process-3.5G-IA32-no-PAE-1

If you run your program on any 64bit architecture (in 64bit userspace mode)
supported by linux, you won't run into those per-process address space limits.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08 20:39 Szabolcs Szakacsits
  2001-01-08 21:56 ` Wayne Whitney
@ 2001-01-08 22:00 ` Wayne Whitney
  2001-01-08 22:15   ` Andrea Arcangeli
  1 sibling, 1 reply; 95+ messages in thread
From: Wayne Whitney @ 2001-01-08 22:00 UTC (permalink / raw)
  To: Szabolcs Szakacsits; +Cc: LKML, Andi Kleen

On Mon, 8 Jan 2001, Szabolcs Szakacsits wrote:

> AFAIK newer glibc = CVS glibc but the malloc() tune parameters work
> via environment variables for the current stable ones as well, e.g. to
> overcome the above "out of memory" one could do,
>
> % export MALLOC_MMAP_MAX_=1000000
> % export MALLOC_MMAP_THRESHOLD_=0
> % magma

As I just mentioned, I haven't been able to test this yet due to my
current binary being linked against an older libc with doesn't seem to
have these parameters.  But here's one other data point, I just thought
I'd ask if this jives with your theory:  if I configure the linux kernel
to be able to use 2GB of RAM, then the 870MB limit becomes much lower, to
230MB.

Cheers, Wayne



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-08 20:39 Szabolcs Szakacsits
@ 2001-01-08 21:56 ` Wayne Whitney
  2001-01-08 23:22   ` Wayne Whitney
  2001-01-08 22:00 ` Wayne Whitney
  1 sibling, 1 reply; 95+ messages in thread
From: Wayne Whitney @ 2001-01-08 21:56 UTC (permalink / raw)
  To: Szabolcs Szakacsits; +Cc: LKML

On Mon, 8 Jan 2001, Szabolcs Szakacsits wrote:

> AFAIK newer glibc = CVS glibc but the malloc() tune parameters work
> via environment variables for the current stable ones as well,

Hmm, this must have been introduced in libc6?  Unfortunately, I don't have
the source code to MAGMA, and the binary I have is statically linked.  It
does not contain the names of the environment variables you mentioned.

I'll arrange a binary linked against glibc2.2, and then your suggestion
will hopefully do the trick.  Thanks for your kind help!

Cheers,
Wayne



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
@ 2001-01-08 20:39 Szabolcs Szakacsits
  2001-01-08 21:56 ` Wayne Whitney
  2001-01-08 22:00 ` Wayne Whitney
  0 siblings, 2 replies; 95+ messages in thread
From: Szabolcs Szakacsits @ 2001-01-08 20:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andi Kleen, Wayne Whitney


Andi Kleen <ak@suse.de> wrote:
> On Sun, Jan 07, 2001 at 09:29:29PM -0800, Wayne Whitney wrote:
> > package called MAGMA; at times this requires very large matrices. The
> > RSS can get up to 870MB; for some reason a MAGMA process under linux
> > thinks it has run out of memory at 870MB, regardless of the actual
> > memory/swap in the machine. MAGMA is single-threaded.
> I think it's caused by the way malloc maps its memory.
> Newer glibc should work a bit better by falling back to mmap even
> for smaller allocations (older does it only for very big ones)

AFAIK newer glibc = CVS glibc but the malloc() tune parameters
work via environment variables for the current stable ones as well,
e.g. to overcome the above "out of memory" one could do,
% export MALLOC_MMAP_MAX_=1000000
% export MALLOC_MMAP_THRESHOLD_=0
% magma

At default, on a 32bit Linux current stable glibc malloc uses brk
between 0x08??????-0x40000000 and max (MALLOC_MMAP_MAX_) 128 mmap if
the requested chunk is greater than 128 kB (MALLOC_MMAP_THRESHOLD_).
If MAGMA mallocs memory in less than 128 kB chunks then the above out
of memory behaviour is expected.

	Szaka

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-07 21:37 ` Rik van Riel
@ 2001-01-07 22:33   ` Zlatko Calusic
  2001-01-09  2:01   ` Zlatko Calusic
  1 sibling, 0 replies; 95+ messages in thread
From: Zlatko Calusic @ 2001-01-07 22:33 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> On 7 Jan 2001, Zlatko Calusic wrote:
> 
> > Things go berzerk if you have one big process whose working set
> > is around your physical memory size.
> 
> "go berzerk" in what way?  Does the system cause lots of extra
> swap IO and does it make the system thrash where 2.2 didn't
> even touch the disk ?
>

Well, I think yes. I'll do some testing on the 2.2 before I can tell
you for sure, but definitely the system is behaving badly where I
think it should not.

> > Final effect is that physical memory gets extremely flooded with
> > the swap cache pages and at the same time the system absorbs
> > ridiculous amount of the swap space.
> 
> This is mostly because Linux 2.4 keeps dirty pages in the
> swap cache. Under Linux 2.2 a page would be deleted from the
> swap cache when a program writes to it, but in Linux 2.4 it
> can stay in the swap cache.
>

OK, I can buy that.

> Oh, and don't forget that pages in the swap cache can also
> be resident in the process, so it's not like the swap cache
> is "eating into" the process' RSS ;)
>

So far so good... A little bit weird but not alarming per se.

> > For instance on my 192MB configuration, firing up the hogmem
> > program which allocates let's say 170MB of memory and dirties it
> > leads to 215MB of swap used.
> 
> So that's 170MB of swap space for hogmem and 45MB for
> the other things in the system (daemons, X, ...).
>

Yes, that's it. So it looks like all of my processes are on the
swap. That can't be good. I mean, even Solaris (known to eat swap
space like there's no tomorrow :)) would probably be more polite.

> Sounds pretty ok, except maybe for the fact that now
> Linux allocates (not uses!) a lot more swap space then
> before and some people may need to add some swap space
> to their system ...
>

Yes, I would say really a lot more. Big diffeence.

Also, I don't see a diference between allocated and used swap space on
the Linux. Could you elaborate on that?

> 
> Now if 2.4 has worse _performance_ than 2.2 due to one
> reason or another, that I'd like to hear about ;)
> 

I'll get back to you later with more data. Time to boot 2.2. :)
-- 
Zlatko
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Subtle MM bug
  2001-01-07 20:59 Zlatko Calusic
@ 2001-01-07 21:37 ` Rik van Riel
  2001-01-07 22:33   ` Zlatko Calusic
  2001-01-09  2:01   ` Zlatko Calusic
  0 siblings, 2 replies; 95+ messages in thread
From: Rik van Riel @ 2001-01-07 21:37 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: linux-kernel, linux-mm

On 7 Jan 2001, Zlatko Calusic wrote:

> Things go berzerk if you have one big process whose working set
> is around your physical memory size.

"go berzerk" in what way?  Does the system cause lots of extra
swap IO and does it make the system thrash where 2.2 didn't
even touch the disk ?

> Final effect is that physical memory gets extremely flooded with
> the swap cache pages and at the same time the system absorbs
> ridiculous amount of the swap space.

This is mostly because Linux 2.4 keeps dirty pages in the
swap cache. Under Linux 2.2 a page would be deleted from the
swap cache when a program writes to it, but in Linux 2.4 it
can stay in the swap cache.

Oh, and don't forget that pages in the swap cache can also
be resident in the process, so it's not like the swap cache
is "eating into" the process' RSS ;)

> For instance on my 192MB configuration, firing up the hogmem
> program which allocates let's say 170MB of memory and dirties it
> leads to 215MB of swap used.

So that's 170MB of swap space for hogmem and 45MB for
the other things in the system (daemons, X, ...).

Sounds pretty ok, except maybe for the fact that now
Linux allocates (not uses!) a lot more swap space then
before and some people may need to add some swap space
to their system ...


Now if 2.4 has worse _performance_ than 2.2 due to one
reason or another, that I'd like to hear about ;)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Subtle MM bug
@ 2001-01-07 20:59 Zlatko Calusic
  2001-01-07 21:37 ` Rik van Riel
  0 siblings, 1 reply; 95+ messages in thread
From: Zlatko Calusic @ 2001-01-07 20:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm

I'm trying to get more familiar with the MM code in 2.4.0, as can be
seen from lots of questions I have on the subject. I discovered nasty
mm behaviour under even moderate load (2.2 didn't have troubles).

Things go berzerk if you have one big process whose working set is
around your physical memory size. Typical memory hoggers are good
enough to trigger the bad behaviour. Final effect is that physical
memory gets extremely flooded with the swap cache pages and at the
same time the system absorbs ridiculous amount of the swap space.
xmem is as usual very good at detecting this and you just need to
press Alt-SysReq-M to see that most of the memory (e.g. 90%) is
populated with the swap cache pages.

For instance on my 192MB configuration, firing up the hogmem program
which allocates let's say 170MB of memory and dirties it leads to
215MB of swap used. vmstat 1 shows that the pagecache size is
constantly growing - that is swapcache enlarging in fact - during the
second pass of the hogmem program.

...
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd  free buff  cache   si   so    bi    bo   in    cs  us  sy  id
 0  1  1 131488  1592  400  62384 4172 5188  1092  1298  353  1447   2   4  94
 0  1  1 136584  1592  400  67428 5860 4104  1465  1034  322  1327   3   3  93
 0  1  1 141668  1592  388  72536 5504 4420  1376  1106  323  1423   1   3  95
 0  1  1 146724  1592  380  77592 5996 4236  1499  1060  335  1096   2   3  94
 0  1  1 151876  1600  320  82764 6264 3712  1566   936  327  1226   3   4  93
 0  1  1 157016  1600  320  87908 5284 4268  1321  1068  315  1248   1   2  96
 1  0  0 157016  1600  308  87792 1836 5168   459  1293  281  1324   3   3  94
 0  1  0 162204  1600  304  92892 7784 5236  1946  1315  385  1353   3   5  92
 0  1  0 167216  1600  304  97780 3496 5016   874  1256  301  1222   0   2  97
 0  1  1 177904  1608  284 108276 5160 5168  1290  1300  330  1453   1   4  94
 0  1  2 182008  1588  288 112264 4936 3344  1268   838  293   801   2   3  95
 0  2  1 183620  1588  260 114012 3064 1756   830   445  290   846   0  15  85
 0  2  2 185384  1596  180 115864 2320 2620   635   658  285   722   1  29  70
 0  3  2 187528  1592  220 117892 2488 2224   657   557  273   754   3  30  67
 0  4  1 190512  1592  236 120772 2524 3012   725   760  343  1080   1  14  85
 0  4  1 195780  1592  240 125868 2336 5316   613  1331  381  1624   2   2  96
 1  0  1 200992  1592  248 131052 2080 2176   623   552  234  1044   3  23  74
 0  1  0 200996  1592  252 130948 2208 3048   580   762  256  1065  10  10  80
 0  1  1 206240  1592  252 136076 2988 5252   760  1314  309  1406   7   4  8
 0  2  1 211408  1592  256 141080 5424 5180  1389  1303  395  1885   3   5  91
 0  2  0 214744  1592  264 144280 4756 3328  1223   834  327  1211   1   5  95
 1  0  0 214868  1592  244 144468 4344 5148  1087  1295  303  1189  11   2  86
 0  1  1 214900  1592  248 144496 4360 3244  1098   812  318  1467   7   4  89
 0  1  1 214916  1592  248 144520 4280 3452  1070   865  336  1602   3   3  94
 0  1  1 214964  1592  248 144580 4972 4184  1243  1054  368  1620   3   5  92
 0  2  2 214956  1592  272 144548 3700 4544  1081  1142  665  2952   1   1  98
 0  1  0 214992  1592  272 144588 1220 5088   305  1274  282  1363   1   4  95
 0  1  1 215012  1592  272 144600 3640 4420   910  1106  325  1579   3   2  9

Any thoughts on this?
-- 
Zlatko
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 95+ messages in thread

end of thread, other threads:[~2001-04-17 19:40 UTC | newest]

Thread overview: 95+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-01-08  5:29 Subtle MM bug Wayne Whitney
2001-01-08  5:42 ` Andi Kleen
2001-01-08  6:04   ` Linus Torvalds
2001-01-08 17:44     ` Rik van Riel
2001-01-08 18:02       ` Linus Torvalds
2001-01-08 17:16 ` Rik van Riel
2001-01-08 17:58   ` Linus Torvalds
2001-01-08 23:41     ` Zlatko Calusic
2001-01-09  2:58       ` Linus Torvalds
2001-01-09  6:20       ` Eric W. Biederman
2001-01-09  7:27         ` Linus Torvalds
2001-01-09 11:38           ` Eric W. Biederman
2001-01-09 12:29           ` Zlatko Calusic
2001-01-09 18:47             ` Linus Torvalds
2001-01-09 19:09               ` Daniel Phillips
2001-01-09 19:29                 ` Trond Myklebust
2001-01-10 17:32                   ` Andi Kleen
2001-01-10 19:31                     ` Alan Cox
2001-01-10 19:33                       ` Andi Kleen
2001-01-10 19:40                         ` Alan Cox
2001-01-10 19:43                           ` Andi Kleen
2001-01-10 19:48                             ` Alan Cox
2001-01-10 19:48                               ` Andi Kleen
2001-01-11  9:51                             ` Trond Myklebust
2001-01-10 20:11                       ` Linus Torvalds
2001-01-11 12:56                         ` Stephen C. Tweedie
2001-01-11 13:10                           ` Andi Kleen
2001-01-11 16:50                           ` Albert D. Cahalan
2001-01-11 17:35                             ` Stephen C. Tweedie
2001-01-11 19:38                               ` Albert D. Cahalan
2001-01-11 19:01                           ` Alexander Viro
2001-01-11 13:12                         ` Trond Myklebust
2001-01-11 14:13                           ` Stephen C. Tweedie
2001-01-11 19:03                             ` Alexander Viro
2001-01-11 19:47                               ` Stephen C. Tweedie
2001-01-11 19:57                                 ` Alexander Viro
2001-01-09 19:37                 ` Linus Torvalds
2001-01-17  8:46                 ` Rik van Riel
2001-01-25 22:51                   ` Daniel Phillips
2001-01-09 19:53               ` Simon Kirby
2001-01-09 20:08                 ` Linus Torvalds
2001-01-09 20:10                 ` Zlatko Calusic
2001-01-10  1:45               ` David Woodhouse
2001-01-10  2:26                 ` Andrea Arcangeli
2001-01-10  6:57                 ` Linus Torvalds
2001-01-10 11:46                   ` David Woodhouse
2001-01-10 14:56                     ` Andrea Arcangeli
2001-01-10 17:46                       ` Eric W. Biederman
2001-01-10 18:33                         ` Andrea Arcangeli
2001-01-17 14:26                           ` Rik van Riel
2001-01-10 19:03                         ` Linus Torvalds
2001-01-10 19:27                           ` David S. Miller
2001-01-10 19:36                           ` Alan Cox
2001-01-10 23:56                             ` David Weinehall
2001-01-11  0:24                               ` Alan Cox
2001-01-12  5:56                               ` Ralf Baechle
2001-01-12 16:10                                 ` Eric W. Biederman
2001-01-12 21:11                                   ` Russell King
2001-01-15  2:56                                     ` Ralf Baechle
2001-01-15  6:59                                       ` Eric W. Biederman
2001-01-15  2:53                                   ` Ralf Baechle
2001-01-15  8:41                                     ` Caches, page coloring, virtual indexed caches, and more Eric W. Biederman
2001-01-15 11:54                                       ` Ralf Baechle
2001-01-15 12:53                                         ` Anton Blanchard
2001-01-15 17:41                                           ` Ralf Baechle
2001-01-17  4:36                                             ` Anton Blanchard
2001-01-16  9:34                                           ` Eric W. Biederman
2001-01-17  4:43                                             ` Anton Blanchard
2001-01-17  8:35                                               ` Eric W. Biederman
2001-01-15 17:16                                         ` Eric W. Biederman
2001-01-16  4:58                                           ` Ralf Baechle
2001-01-15 18:22                                         ` Jamie Lokier
2001-01-15 12:51                                       ` Anton Blanchard
2001-01-17 14:28                           ` Subtle MM bug Rik van Riel
2001-01-18  1:23                             ` Linus Torvalds
2001-01-18 11:48                               ` Rik van Riel
2001-01-10 17:03                     ` Linus Torvalds
2001-01-11 14:36                       ` Jim Gettys
2001-01-08 21:30   ` Wayne Whitney
  -- strict thread matches above, loose matches on Subject: below --
2001-01-10 19:57 Chris Wing
2001-01-08 20:39 Szabolcs Szakacsits
2001-01-08 21:56 ` Wayne Whitney
2001-01-08 23:22   ` Wayne Whitney
2001-01-08 23:30     ` Andrea Arcangeli
2001-01-09  0:37       ` Linus Torvalds
2001-01-08 22:00 ` Wayne Whitney
2001-01-08 22:15   ` Andrea Arcangeli
2001-01-07 20:59 Zlatko Calusic
2001-01-07 21:37 ` Rik van Riel
2001-01-07 22:33   ` Zlatko Calusic
2001-01-09  2:01   ` Zlatko Calusic
2001-01-17  4:48     ` Rik van Riel
2001-01-17 18:53       ` Zlatko Calusic
2001-01-18  1:32         ` Rik van Riel
2001-04-17 19:37           ` H. Peter Anvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).