linux-kernel.vger.kernel.org archive mirror
* Re: mempool design
  2001-12-15 19:40 mempool design Ingo Molnar
@ 2001-12-15 18:47 ` Benjamin LaHaise
  2001-12-15 22:18   ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin LaHaise @ 2001-12-15 18:47 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Rik van Riel, linux-kernel

On Sat, Dec 15, 2001 at 08:40:19PM +0100, Ingo Molnar wrote:
> With all respect, even if i had read it before, i'd have done mempool.c
> the same way as it is now. (but i'd obviously have Cc:-ed Ben on it during
> its development.) I'd like to sum up Ben's patch (Ben please correct me if
> i misrepresent your patch in any way):

You're making the assumption that an incomplete patch is useless and
has no design principles behind it.  What I disagree with is the design
of mempool, not the implementation.  The design for reservations is to
use enforced accounting limits to achieve the effect of separate memory
pools.  Mempool's design is to build separate pools on top of existing
pools of memory.  Can't you see the obvious duplication that implies?

The first implementation of the reservation patch is full of bogosities;
I'm the first one to admit that.  But am I going to go off and write an 
entirely new patch that fixes everything and gets the design right to 
replace mempool?  Not with the current rate of patches being ignored.

		-ben

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: mempool design
@ 2001-12-15 19:40 Ingo Molnar
  2001-12-15 18:47 ` Benjamin LaHaise
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2001-12-15 19:40 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Ben LaHaise, linux-kernel


On Sat, 15 Dec 2001, Rik van Riel wrote:

> > such scenarios can only be solved by using/creating independent pools,
> > and/or by using 'composite' pools like raid1.c does. One common
>
> OK, you've convinced me ...
> ... of the fact that you're reinventing Ben's reservation
> mechanism, poorly.

i have to admit that i did not know of Ben's patch until today. I must have
missed it when he released it, and apparently there were no followup
releases(?). I now understand why Ben had to flame me. Anyway, here is his
patch:

	http://lwn.net/2001/0531/a/bcrl-reservation.php3

With all respect, even if i had read it before, i'd have done mempool.c
the same way as it is now. (but i'd obviously have Cc:-ed Ben on it during
its development.) I'd like to sum up Ben's patch (Ben please correct me if
i misrepresent your patch in any way):

the patch adds a reservation feature to the page allocator. It defines a
'reservation structure', which causes the true free pages count of
particular page zones to be decreased artificially, thus creating a
virtual reserve of pages. These reservation structures can be assigned to
processes on a codepath basis. Eg. on IRQ entry the current process gets
assigned the IRQ-atomic reservation - and any original reservation is
restored on IRQ-exit. On swapping-code entry, arbitrary processes get the
swapping reservation. kswapd, kupdated and bdflush have their own,
permanent reservations. Freeing into the reserved pools is done by linking
the reservation structure to its "home-zone", which the __free_pages()
code polls and refills. One process has a single active reservation
structure to allocate from.
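
(to make that summary concrete, here is a rough sketch of what such a
reservation structure could look like - the type and field names are made
up for illustration and are not taken from Ben's actual patch:)

	/*
	 * hypothetical sketch of a per-codepath reservation: the home
	 * zone's visible free-page count is lowered by 'nr_pages' when
	 * the reservation is created, and __free_pages() tops the
	 * reservation back up whenever it is below nr_pages.
	 */
	struct page_reservation {
		zone_t *home_zone;	/* zone that backs this reserve */
		int nr_pages;		/* size of the virtual reserve */
		int nr_used;		/* pages currently taken from it */
	};

	/*
	 * each task would carry a pointer to its currently active
	 * reservation, saved/restored on IRQ entry and exit:
	 *
	 *	struct page_reservation *resv;	(in task_struct)
	 */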

this approach IMO does not answer some fundamental issues:

- Allocations might still fail with NULL. With mempool, allocations in
  process contexts are guaranteed to always succeed.

- it does not allow the reservation of higher order allocations, which can
  be especially important given the poor higher-order behavior of the page
  allocator.

- the reservation patch does not offer deadlock avoidance in critical code
  paths with complex allocation patterns (see the examples from my
  previous email). Just having separate pools of pages is not enough.

- minor nit #1: reservations are tied to zones, while mempool can take
  from different zones, as long as the zones are compatible.

- minor nit #2: reservations are adding overhead to critical code areas
  (and yes, besides oom-only code, the fast-path is touched as well) such
  as rmqueue() and __free_pages(). Mempool does not add overhead to the
  underlying allocator(s).

- perhaps there is a more advanced patch available (Ben?), but right now i
  cannot see how the SLAB allocator can have the same reservation concept
  added, without excessive code duplication.

Rik, it would be nice if you could provide a few technical arguments that
underscore your point. If i'm wrong then i'd like to be proven wrong.

	Ingo


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: mempool design
  2001-12-15 18:47 ` Benjamin LaHaise
@ 2001-12-15 22:18   ` Ingo Molnar
  2001-12-17 15:04     ` Andrea Arcangeli
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2001-12-15 22:18 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Rik van Riel, linux-kernel


On Sat, 15 Dec 2001, Benjamin LaHaise wrote:

> [...] The design for reservations is to use enforced accounting limits
> to achieve the effect of separate memory pools. [...]

how is this going to handle higher-order pools? How is this going to
handle the need for composite allocations?

I think putting in reservation for higher-order pages is going to make
some parts of page_alloc.c *really* complex, and this complexity i don't
think we need.

> [...] Mempool's design is to build separate pools on top of existing
> pools of memory. Can't you see the obvious duplication that implies?

no. Mempool's design is to build preallocated, reserved,
guaranteed-allocation pools on top of simpler allocators. Simpler
allocators will try reasonably hard to get something allocated, but might
fail as well. The majority of allocations within the kernel have no
deadlock relevance at all. If we allocate a new file structure, or create
a new socket, or are trying to page in overcommitted memory then we can
return with -ENOMEM (or OOM) just fine. Allocators are simpler and faster
without built-in deadlock avoidance and 'reserves'.

so the difference in design is that you are trying to add reservations as
a feature of the allocators themselves, while i'm trying to add it as a
generic, allocator-independent subsystem, which also solves a number of
other problems we always had in the IO code. Under this design, the 'pure'
allocators themselves have no concept of reserved pools at all, so there
is no duplication in concepts. (and no duplication of code.)
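
to illustrate the layering, a driver builds its pool on top of the SLAB
allocator with something like the snippet below (a sketch only - the
exact mempool prototypes may differ from the tree, and NR_RESERVED /
my_cachep are illustrative names):

	/* thin wrappers that turn a slab cache into a mempool backend */
	static void *pool_slab_alloc(int gfp_mask, void *data)
	{
		return kmem_cache_alloc((kmem_cache_t *) data, gfp_mask);
	}

	static void pool_slab_free(void *element, void *data)
	{
		kmem_cache_free((kmem_cache_t *) data, element);
	}

	/* the 'pure' allocator (the slab cache) knows nothing about the
	 * reserve - all the deadlock-avoidance logic is in the mempool: */
	pool = mempool_create(NR_RESERVED, pool_slab_alloc,
					pool_slab_free, my_cachep);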

so the fundamental question is, should reservation be a core functionality
of allocators, or should it be a separate subsystem. *If* it can be put
into the core allocators (page allocator, SLAB) in a way that gives us all
the features that mempool gives us today, then i think i'd like that approach.
But i cannot really see how the composite allocation thing can be done via
reservations.

	Ingo


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: mempool design
  2001-12-15 22:18   ` Ingo Molnar
@ 2001-12-17 15:04     ` Andrea Arcangeli
  2001-12-17 15:38       ` Victor Yodaiken
                         ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2001-12-17 15:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Benjamin LaHaise, Rik van Riel, linux-kernel

On Sat, Dec 15, 2001 at 11:18:33PM +0100, Ingo Molnar wrote:
> 
> On Sat, 15 Dec 2001, Benjamin LaHaise wrote:
> 
> > [...] The design for reservations is to use enforced accounting limits
> > to achieve the effect of separate memory pools. [...]
> 
> how is this going to handle higher-order pools? How is this going to
> handle the need for composite allocations?
> 
> I think putting in reservation for higher-order pages is going to make
> some parts of page_alloc.c *really* complex, and this complexity i don't
> think we need.
> 
> > [...] Mempool's design is to build separate pools on top of existing
> > pools of memory. Can't you see the obvious duplication that implies?
> 
> no. Mempool's design is to build preallocated, reserved,
> guaranteed-allocation pools on top of simpler allocators. Simpler
> allocators will try reasonably hard to get something allocated, but might
> fail as well. The majority of allocations within the kernel have no
> deadlock relevance at all. If we allocate a new file structure, or create
> a new socket, or are trying to page in overcommitted memory then we can
> return with -ENOMEM (or OOM) just fine. Allocators are simpler and faster
> without built-in deadlock avoidance and 'reserves'.
> 
> so the difference in design is that you are trying to add reservations as
> a feature of the allocators themselves, while i'm trying to add it as a
> generic, allocator-independent subsystem, which also solves a number of
> other problems we always had in the IO code. Under this design, the 'pure'
> allocators themselves have no concept of reserved pools at all, so there
> is no duplication in concepts. (and no duplication of code.)
> 
> so the fundamental question is, should reservation be a core functionality
> of allocators, or should it be a separate subsystem. *If* it can be put
> into the core allocators (page allocator, SLAB) in a way that gives us all
> the features that mempool gives us today, then i think i'd like that approach.
> But i cannot really see how the composite allocation thing can be done via
> reservations.

This whole long thread can be summed up in two points:

1	mempool reserved memory is "wasted" i.e. not usable as cache
2	if the mempool code is moved inside the memory balancing of the
	VM we could use this memory as clean, atomically-freeable cache

however, option 2 is quite complex: think of when somebody mmaps the page
and we find_lock it etc... we cannot "lock" a reserved page, or it would
become unfreeable, at least unless we're sure this "lock" will go away
without us deadlocking on it while waiting.

so in short solution 1 is much simpler and much more obviously correct,
and the only disadvantage is that it reduces the amount of clean cache
that could potentially be used by the kernel.

If implementation details and code complexity were our last design
priority, solution 2 advocated by Ben, Rik and SCT would obviously be
superior.

At the moment in 2.5 and also in 2.4 we use the "mempool outside VM"
logic just because we can keep it under control without being killed by
the huge complexity of the implementation details with the locking of
clean cache, nesting into the vm etc... Of course I'm considering a
correct implementation of it, not a hack where cache can be mlocked and
the kernel deadlocks because the reserved memory isn't freeable anymore.

Personally I'm more relaxed with the mempool approach because it reduces
the complexity by an order of magnitude, it abstracts the thing without
making the memory balancing more complex and it definitely solves the
problem (if used correctly i.e. not two alloc_bio in a row from the same
pool from multiple tasks at the same time as pointed out by Ingo).

If somebody wants such 1% of ram back he can buy another dimm of ram and
plug it into his hardware. I mean such 1% of ram lost is something that
can be solved by throwing a few euros into the hardware (and people buy
gigabyte boxes anyway so they don't need all of the 100% of ram), the
other complexity cannot be solved with a few euros, that can only be
solved with lots of braincycles and it would be maintenance work as
well. Abstraction and layering definitely helps cutting down the
complexity of the code.

Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: mempool design
  2001-12-17 15:04     ` Andrea Arcangeli
@ 2001-12-17 15:38       ` Victor Yodaiken
  2001-12-17 16:10         ` Andrea Arcangeli
                           ` (2 more replies)
  2001-12-17 17:21       ` Ingo Molnar
  2001-12-18 15:21       ` Alan Cox
  2 siblings, 3 replies; 15+ messages in thread
From: Victor Yodaiken @ 2001-12-17 15:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Benjamin LaHaise, Rik van Riel, linux-kernel

On Mon, Dec 17, 2001 at 04:04:26PM +0100, Andrea Arcangeli wrote:
> If somebody wants such 1% of ram back he can buy another dimm of ram and
> plug it into his hardware. I mean such 1% of ram lost is something that
> can be solved by throwing a few euros into the hardware (and people buy
> gigabyte boxes anyway so they don't need all of the 100% of ram), the
> other complexity cannot be solved with a few euros, that can only be
> solved with lots of braincycles and it would be maintenance work as
> well. Abstraction and layering definitely helps cutting down the
> complexity of the code.

I agree with all your arguments up to here. But being able to run Linux
in 4Meg or even 8M is important to a very large class of applications. 
Even if you are concerned mostly about bigger systems, making sure NT
remains at a serious disadvantage in the embedded boxes is key because
MS will certainly hope to use control of SOHO routers, set-top boxes
etc to set "standards" that will improve their competitiveness in desktop
and beyond. It would be a delicious irony if MS were able to re-use
against Linux the "first control low end" strategy that allowed them 
to vaporize the old-line UNIXes, but irony is not as satisfying as winning.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: mempool design
  2001-12-17 17:21       ` Ingo Molnar
@ 2001-12-17 15:58         ` Andrea Arcangeli
  2001-12-18  0:32           ` Rik van Riel
  0 siblings, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2001-12-17 15:58 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Benjamin LaHaise, Rik van Riel, linux-kernel

On Mon, Dec 17, 2001 at 06:21:53PM +0100, Ingo Molnar wrote:
> 
> On Mon, 17 Dec 2001, Andrea Arcangeli wrote:
> 
> > This whole long thread can be summed up in two points:
> >
> > 1	mempool reserved memory is "wasted" i.e. not usable as cache
> 
> reservations, as in Ben's published (i know, incomplete) implementation,
> are 'wasted' as well.

yes, I was referring only to his long-term design arguments.

> > 2	if the mempool code is moved inside the memory balancing of the
> > 	VM we could use this memory as clean, atomically-freeable cache
> 
> i agree - i proposed something like this to SCT about 3-4 years ago (back
> when the buffer-cache was still reentrant), and it's still not
> implemented. And i'm not betting on it being done soon. Making the
> pagecache structures IRQ-safe looks like the same kind of trouble we had
> with the IRQ-reentrant buffer-cache. It can be done (in fact it's quite
> easy to do the initial bits), but it can bite us in multiple ways. And in
> the real deadlock scenarios we have no clean pages anyway.

in theory those pages should be reserved, so it would be the same as
the pages in the mempool, but while they do nothing they could hold some
cache data; on the other hand they couldn't, for example, be mapped into
any address space etc... at least unless we're able to atomically unmap
pages and flush the tlb on all cpus etc.. :) it would be a mess and it's
not a coincidence that Ben's first implementation wasn't taking advantage
of it and that in 3-4 years it's still not there yet :). Plus as you
mentioned it would add the local_irq_save overhead to the common path as
well, to be able to do things from irqs (which I didn't consider in
the previous email). That would hurt performance.

> i personally get the shivers from any global counters where being off by 1
> in 1% of the cases will bite us only in 1 out of 10000 systems.

yes, and as said it's a problem that doesn't affect performance or
scalability, nor does it waste a _percentage_ of ram; it only wastes a
_fixed_ amount of ram.

> > Personally I'm more relaxed with the mempool approach because it
> > reduces the complexity by an order of magnitude, it abstracts the
> > thing without making the memory balancing more complex and it
> > definitely solves the problem (if used correctly i.e. not two alloc_bio
> > in a row from the same pool from multiple tasks at the same time as
> > pointed out by Ingo).
> 
> yep - and as your VM rewrite has proven as well, reducing complexity
> and interdependencies within the VM is the top priority at the moment and
> brings the most benefits. And the amount of reserved (lost) pool-pages
> does not scale up with more RAM in the system - it scales up with more
> devices (and more mounted filesystems) in the system. And we have
> per-device RAM footprint anyway. So it's not like 'struct page'.

100% agreed (as said above too :).

Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: mempool design
  2001-12-17 15:38       ` Victor Yodaiken
@ 2001-12-17 16:10         ` Andrea Arcangeli
  2001-12-17 17:33         ` kernel panic Geoffrey
  2001-12-18 16:55         ` mempool design Ingo Molnar
  2 siblings, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2001-12-17 16:10 UTC (permalink / raw)
  To: Victor Yodaiken; +Cc: Ingo Molnar, Benjamin LaHaise, Rik van Riel, linux-kernel

On Mon, Dec 17, 2001 at 08:38:02AM -0700, Victor Yodaiken wrote:
> On Mon, Dec 17, 2001 at 04:04:26PM +0100, Andrea Arcangeli wrote:
> > If somebody wants such 1% of ram back he can buy another dimm of ram and
> > plug it into his hardware. I mean such 1% of ram lost is something that
> > can be solved by throwing a few euros into the hardware (and people buy
> > gigabyte boxes anyway so they don't need all of the 100% of ram), the
> > other complexity cannot be solved with a few euros, that can only be
> > solved with lots of braincycles and it would be maintenance work as
> > well. Abstraction and layering definitely helps cutting down the
> > complexity of the code.
> 
> I agree with all your arguments up to here. But being able to run Linux
> in 4Meg or even 8M is important to a very large class of applications. 
> Even if you are concerned mostly about bigger systems, making sure NT
> remains at a serious disadvantage in the embedded boxes is key because
> MS will certainly hope to use control of SOHO routers, set-top boxes
> etc to set "standards" that will improve their competitiveness in desktop
> and beyond. It would be a delicious irony if MS were able to re-use
> against Linux the "first control low end" strategy that allowed them 
> to vaporize the old-line UNIXes, but irony is not as satisfying as winning.

I may have been misleading mentioning a 1%: the 1% doesn't mean that 1% of
ram is wasted (otherwise adding a new dimm couldn't solve it because you
would waste even more ram :). As Ingo also mentioned, it's a fixed amount
of ram that is wasted in the mempool.

For very low end machines you can simply define a very small mempool; it
will potentially reduce scalability during heavy I/O with mem shortage
but it will waste very very little ram (potentially in the simpler case
you only need 1 entry in the pool to guarantee deadlock avoidance).  And
there's nearly nothing to worry about, we always had those mempools
since 2.0 at least, look at buffer.c and search for the async argument
to the functions allocating the bhs. Now with the bio we have more
mempools because lots of people still use the bh, so in the short term
(before 2.6) we can waste some more bytes, but once the bh and
ll_rw_block are dead most of the bio overhead will go away and we'll
only keep the advantages of doing I/O in more than one page with a
single metadata entity (2.6). The other obvious advantage of the mempool
code is that we share it across all the mempool users, so we'll save
some bytes of icache too by avoiding code duplication compared to 2.4 :).
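
just to show the order of magnitude, a sketch (the slab wrappers and
bio_cachep are assumed to exist elsewhere, and the exact prototypes may
not match the current tree):

	/*
	 * low-memory configuration: a single reserved element already
	 * guarantees forward progress, as long as each user allocates
	 * and frees one element at a time - the reserve then costs one
	 * bio-sized object, not a percentage of RAM.
	 */
	bio_pool = mempool_create(1 /* min_nr */, pool_slab_alloc,
					pool_slab_free, bio_cachep);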

In fact solution 2) cannot solve your 4M/8M boot problem either,
since such memory would need to be reserved anyway, and it could act
only as clean filesystem cache. So in short the only difference between 1) and
2) would be a little more fs cache in solution 2), but with huge
implementation complexity and local_irq_save all over the place in the
VM, so with lower performance. It wouldn't make a difference in
functionality (boot or not boot, this is the real problem you worry
about :).

Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: mempool design
  2001-12-17 15:04     ` Andrea Arcangeli
  2001-12-17 15:38       ` Victor Yodaiken
@ 2001-12-17 17:21       ` Ingo Molnar
  2001-12-17 15:58         ` Andrea Arcangeli
  2001-12-18 15:21       ` Alan Cox
  2 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2001-12-17 17:21 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Benjamin LaHaise, Rik van Riel, linux-kernel


On Mon, 17 Dec 2001, Andrea Arcangeli wrote:

> This whole long thread can be summed up in two points:
>
> 1	mempool reserved memory is "wasted" i.e. not usable as cache

reservations, as in Ben's published (i know, incomplete) implementation,
are 'wasted' as well.

> 2	if the mempool code is moved inside the memory balancing of the
> 	VM we could use this memory as clean, atomically-freeable cache

i agree - i proposed something like this to SCT about 3-4 years ago (back
when the buffer-cache was still reentrant), and it's still not
implemented. And i'm not betting on it being done soon. Making the
pagecache structures IRQ-safe looks like the same kind of trouble we had
with the IRQ-reentrant buffer-cache. It can be done (in fact it's quite
easy to do the initial bits), but it can bite us in multiple ways. And in
the real deadlock scenarios we have no clean pages anyway.

i personally get the shivers from any global counters where being off by 1
in 1% of the cases will bite us only in 1 out of 10000 systems.

> Personally I'm more relaxed with the mempool approach because it
> reduces the complexity by an order of magnitude, it abstracts the
> thing without making the memory balancing more complex and it
> definitely solves the problem (if used correctly i.e. not two alloc_bio
> in a row from the same pool from multiple tasks at the same time as
> pointed out by Ingo).

yep - and as your VM rewrite has proven as well, reducing complexity
and interdependencies within the VM is the top priority at the moment and
brings the most benefits. And the amount of reserved (lost) pool-pages
does not scale up with more RAM in the system - it scales up with more
devices (and more mounted filesystems) in the system. And we have
per-device RAM footprint anyway. So it's not like 'struct page'.

	Ingo


^ permalink raw reply	[flat|nested] 15+ messages in thread

* kernel panic
  2001-12-17 15:38       ` Victor Yodaiken
  2001-12-17 16:10         ` Andrea Arcangeli
@ 2001-12-17 17:33         ` Geoffrey
  2001-12-18 16:55         ` mempool design Ingo Molnar
  2 siblings, 0 replies; 15+ messages in thread
From: Geoffrey @ 2001-12-17 17:33 UTC (permalink / raw)
  To: linux-kernel

I'm looking for the proper forum for posting a (possible) kernel bug. 
I'm receiving a panic when attempting to write a cdrw under 2.4.12.

I don't know that this is a bug, but I would expect some other activity
rather than a panic/lockup.

Suggestions as to the proper forum would be appreciated.

--
Until later: Geoffrey		esoteric@3times25.net

"...the system (Microsoft passport) carries significant risks to users
that
are not made adequately clear in the technical documentation available."
- David P. Kormann and Aviel D. Rubin, AT&T Labs - Research
- http://www.avirubin.com/passport.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: mempool design
  2001-12-17 15:58         ` Andrea Arcangeli
@ 2001-12-18  0:32           ` Rik van Riel
  0 siblings, 0 replies; 15+ messages in thread
From: Rik van Riel @ 2001-12-18  0:32 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Ingo Molnar, Benjamin LaHaise, linux-kernel

On Mon, 17 Dec 2001, Andrea Arcangeli wrote:
> On Mon, Dec 17, 2001 at 06:21:53PM +0100, Ingo Molnar wrote:
> > On Mon, 17 Dec 2001, Andrea Arcangeli wrote:
> >
> > > This whole long thread can be summed up in two points:
> > >
> > > 1	mempool reserved memory is "wasted" i.e. not usable as cache
> >
> > reservations, as in Ben's published (i know, incomplete) implementation,
> > are 'wasted' as well.
>
> yes, I was referring only to his long-term design arguments.

Long term design arguments don't have to make the short-term
implementation any more complex. I guess you presented a nice
argument to go with the more flexible solution.

cheers,

Rik
-- 
Shortwave goes a long way:  irc.starchat.net  #swl

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: mempool design
  2001-12-17 15:04     ` Andrea Arcangeli
  2001-12-17 15:38       ` Victor Yodaiken
  2001-12-17 17:21       ` Ingo Molnar
@ 2001-12-18 15:21       ` Alan Cox
  2 siblings, 0 replies; 15+ messages in thread
From: Alan Cox @ 2001-12-18 15:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Benjamin LaHaise, Rik van Riel, linux-kernel

> If somebody wants such 1% of ram back he can buy another dimm of ram and
> plug it into his hardware. I mean such 1% of ram lost is something that
> can be solved by throwing a few euros into the hardware (and people buy
> gigabyte boxes anyway so they don't need all of the 100% of ram), the

How do I add dimms to an embedded board?

> solved with lots of braincycles and it would be maintenance work as
> well. Abstraction and layering definitely helps cutting down the
> complexity of the code.

I'm not too worried. mempool as an API can relatively easily be persuaded
to do reservations on an underlying allocator at some point in the future.

Alan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: mempool design
  2001-12-18 16:55         ` mempool design Ingo Molnar
@ 2001-12-18 16:06           ` Victor Yodaiken
  0 siblings, 0 replies; 15+ messages in thread
From: Victor Yodaiken @ 2001-12-18 16:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Victor Yodaiken, Andrea Arcangeli, Benjamin LaHaise,
	Rik van Riel, linux-kernel

On Tue, Dec 18, 2001 at 05:55:14PM +0100, Ingo Molnar wrote:
> 
> On Mon, 17 Dec 2001, Victor Yodaiken wrote:
> 
> > I agree with all your arguments up to here. But being able to run
> > Linux in 4Meg or even 8M is important to a very large class of
> > applications. [...]
> 
> the amount of reserved RAM should be very low. Especially in embedded
> applications that usually have a very controlled environment, with a low
> number of well-behaving devices, the number of pages that need to be
> reserved is very low. I wouldn't worry about this.


Bueno.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: mempool design
  2001-12-17 15:38       ` Victor Yodaiken
  2001-12-17 16:10         ` Andrea Arcangeli
  2001-12-17 17:33         ` kernel panic Geoffrey
@ 2001-12-18 16:55         ` Ingo Molnar
  2001-12-18 16:06           ` Victor Yodaiken
  2 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2001-12-18 16:55 UTC (permalink / raw)
  To: Victor Yodaiken
  Cc: Andrea Arcangeli, Benjamin LaHaise, Rik van Riel, linux-kernel


On Mon, 17 Dec 2001, Victor Yodaiken wrote:

> I agree with all your arguments up to here. But being able to run
> Linux in 4Meg or even 8M is important to a very large class of
> applications. [...]

the amount of reserved RAM should be very low. Especially in embedded
applications that usually have a very controlled environment, with a low
number of well-behaving devices, the number of pages that need to be
reserved is very low. I wouldn't worry about this.

	Ingo


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: mempool design
  2001-12-15  9:01 Ingo Molnar
@ 2001-12-15 15:39 ` Rik van Riel
  0 siblings, 0 replies; 15+ messages in thread
From: Rik van Riel @ 2001-12-15 15:39 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Ben LaHaise, linux-kernel

On Sat, 15 Dec 2001, Ingo Molnar wrote:

> such scenarios can only be solved by using/creating independent pools,
> and/or by using 'composite' pools like raid1.c does. One common

OK, you've convinced me ...
... of the fact that you're reinventing Ben's reservation
mechanism, poorly.

Please take a look at Ben's code. ;)

cheers,

Rik
-- 
Shortwave goes a long way:  irc.starchat.net  #swl

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: mempool design
@ 2001-12-15  9:01 Ingo Molnar
  2001-12-15 15:39 ` Rik van Riel
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2001-12-15  9:01 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: linux-kernel


On Sat, 15 Dec 2001, Benjamin LaHaise wrote:

> >  - mempool handles allocation in a more deadlock-avoidance-aware way than
> >    a normal allocator would do:
> >
> >         - first it ->alloc()'s atomically
>
> Great.  Function calls through pointers are really not a good idea on
> modern cpus.

Modern cpus with a good BTB have no problems with this. Even a 400 MHz
Celeron has no problem with it:

 <~/loop_perf> ./loop_perf
 0.0405 billion ->fn_ptr() loops per sec.
 0.0412 billion fn() loops per sec.
 0.0274 billion ->fn_ptr(); ->fn_ptr2() loops per sec.
 0.0308 billion fn(); fn2() loops per sec.

and we use function pointers all over the place in the kernel.

> >         - then it tries to take from the pool if the pool is at least
> >           half full
> >         - then it ->alloc()'s non-atomically
> >         - then it takes from the pool if it's non-empty
> >         - then it waits for pool elements to be freed
>
> Oh dear.  Another set of vm logic that has to be kept in sync with the
> behaviour of the slab, alloc_pages and try_to_free_pages. [...]

actually, no. Mempool should be the *only* set of VM logic that should be
deadlock-aware (in the long run; right now the deadlock avoidance code in
the core allocators is still doing more good than harm). Mempool
guarantees deadlock-free allocation regardless of the behavior of the
underlying allocator.
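
in (simplified) code, the allocation order quoted above looks roughly
like this - a sketch of the logic only, with locking omitted and
remove_element()/wait_for_free() as placeholder helpers, not the literal
mempool.c source:

	void *mempool_alloc(mempool_t *pool, int gfp_mask)
	{
		void *element;

		/* 1) ask the underlying allocator, without blocking */
		element = pool->alloc(gfp_mask & ~__GFP_WAIT, pool->pool_data);
		if (element)
			return element;

		/* 2) pool at least half full? hand out a reserved element */
		if (pool->curr_nr >= pool->min_nr / 2)
			return remove_element(pool);

		/* 3) retry the underlying allocator, now allowed to block */
		element = pool->alloc(gfp_mask, pool->pool_data);
		if (element)
			return element;

		/* 4) take any remaining reserved element, otherwise
		 * 5) sleep until mempool_free() refills the pool */
		for (;;) {
			if (pool->curr_nr)
				return remove_element(pool);
			wait_for_free(pool);
		}
	}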

> We're already failing to keep alloc_pages deadlock free; [...]

because the code tried to be too generic, for too many cases, while via
mempool we solve specific, well-defined problems. Also, as i hope you'll
agree after reading my reply, alloc_pages() and kmem_cache_alloc() have a
fundamental problem: they *cannot* keep things deadlock-free.

> [...] how can you be certain that this arbitrary "half full pool"
> condition is not going to cause deadlocks for $random_arbitrary_driver?

the half full pool condition is more of a performance optimization than
deadlock avoidance: it gives out buffers without blocking. It has proven
to cause significant speedups in highmem bouncing for example. But it
indeed also has a deadlock avoidance role: if the underlying allocator is
trying to be too smart and deadlocks in that attempt, then we have
guaranteed that half of the pool elements are allocated already => some
progress will happen.

those places which allocate pool elements without guaranteeing their
release within a reasonable timeout should update their logic to first
allocate all necessary elements via a single pool allocation, and *then*,
if the allocation succeeds, start the IO - whose completion will free
the allocated element(s).
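
in code terms, the composite-pool way to obey that rule looks roughly
like this (a sketch with illustrative names - it is not the actual
raid1.c code):

	/*
	 * make the *pool element* the composite object, so that a single
	 * mempool_alloc() returns everything the request needs at once;
	 * a request then never sits on a partial allocation while
	 * blocking for the rest.
	 */
	struct composite_req {
		struct bio *bios[MAX_MIRRORS];	/* one bio per mirror */
		/* ... whatever else the request needs ... */
	};

	req = mempool_alloc(req_pool, GFP_NOIO);  /* all-or-nothing */
	start_io(req);		/* IO completion does mempool_free(req) */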

Deadlock avoidance has two sides which are equally important: the pool
code has to guarantee allocation, *and* the pool user has to guarantee
freeing latency. If one part is missing then there is no guarantee against
deadlocks.

obviously a SLAB-based (or even a page-based) reservation system cannot
guarantee this, and will never guarantee this. Since it cannot pool
multiple buffers at once, in cases where there might be allocation
interdependencies (such as raid1.c), it cannot resolve some of the more
complex deadlock scenarios. SLAB cannot guarantee that when there is a
sudden 'rush' of allocations from lots of process contexts, all the
reserved elements won't be used to just partially fulfill every such
complex allocation request - it will happily deadlock while every process
context is keeping the reserved elements forever.

Eg., if some code does this:

	bio1 = alloc_bio();
	[... do something ...]
	bio2 = alloc_bio();

if the number of reserved bio's is eg. 128 (random limit), and 128
processes rush/stampede into the same code path, then it might happen that
they allocate bio1 128 times, and all will deadlock on the second
allocation. (This scenario is more common in RL tests than it looks
at first sight.)

The only option to do this via global reserves would be to keep a reserve
of at least max_nr_threads elements for every pool, or to serialize the
allocation path via a global mutex. Both solutions are clearly excessive
and hurt the common case ...

such scenarios can only be solved by using/creating independent pools,
and/or by using 'composite' pools like raid1.c does. One common reserve
*does not* guarantee deadlock-free progress. This is why i asked for
specifics (interface, actual semantics, etc.) about the SLAB-based and
page-based reservation system you envision, because, IMO, if it's done
correctly, it will end up looking very similar to mempool.c, besides being
ugly and duplicating code. Reserves *must be* kept separate.

It's not enough to just say 'we are going to need 100 more elements in the
common reserve'; that can be drained way too easily.

> Again, this is duplicating functionality that doesn't need to be.
> The 1 additional branch for the uncommon case that reservations adds
> is far, far cheaper and easier to understand.

see above - please explain: what interface, what semantics. It's hard to
compare something that is here and is real against something that is only
known by: 'it does everything, and costs less'.

> >  - mempool adds reservation without increasing the complexity of the
> >    underlying allocators.
>
> This is where my basic disagreement with the approach comes from.  As
> I see it, all of the logic that mempools adds is already present in
> the current system (or at the very least should be). [...]

here is where i disagree. The logic *shouldn't be*, and *cannot be* there,
without major modifications to the __alloc_pages(), __free_pages(),
kmem_cache_alloc() and kmem_cache_free() interfaces.

> [...] To give you some insight to how I think reservations should work
> and how they can simplify code in the current allocators, take the
> case of an ordinary memory allocation of a single page.  Quite simply,
> if there are no immediately free pages, we need to wait for another
> page to be returned to the free pool (this is identical to the logic
> you added in mempool that prevents a pool from failing an allocation).
> Right now, memory allocations can fail because we allow ourselves to
> grossly overcommit memory usage.  That you're adding mempool to patch
> over that behaviour, is *wrong*, imo.  The correct way to fix this is
> to make the underlying allocator behave properly: the system has
> enough information at the time of the initial allocation to
> deterministically say "yes, the vm will be able to allocate this page"
> or "no, i have to wait until another user frees up memory".  Yes, you
> can argue that we don't currently keep all the necessary statistics on
> hand to make this determination, but that's a small matter of
> programming.

IMO this is a way too naive picture of the kinds of allocations that
might happen and must be deadlock-free. Take for example the above
multi-bio (or any other complex) allocation, where this picture fails. I
think this simplistic approach to deadlock scenarios is what is causing
page_alloc()'s current failure to address deadlocks. Yes, some of the
deadlocks are as simple as you suggest, and those can be solved via a
simple and common reserved pool. In fact GFP_ATOMIC is an attempt to do
just that, and it has worked to a fair degree for years.

But in the generic case (and in fact in some of the simplest IO cases) it
*cannot* be solved via a single common reserved pool. Sources and drains
of allocations must be kept separate (in places where it's needed) to
resolve deadlocks in mildly complex IO code and drivers.

Just to cite an artificially extreme situation: imagine swapping to a swap
file created on a loopback-mounted filesystem whose loopback file resides
on a filesystem that is mounted over a sw-RAID-5 device that is a
combination of NBD, block-loopback, SCSI, IDE and hardware-RAID devices,
where the NBD devices use IP tunneling.

this setup, no matter how ridiculous, is possible here and today, and i'd
wager that Cerberus will lock up on it within 5 minutes. At no stage
does the sysadmin get warned that 'so far and no farther, deadlock
danger'.

and in fact we used to deadlock in a much simpler and more common scenario
as well:  swapping over RAID1. The solution was: the RAID code avoided the
generic allocators and implemented a per-personality reserved pool of
complex structures. This code was duplicated in three files, with minor
modifications. Now it's all done nicely via mempools; it removed hundreds
of lines of code, and it's even faster that way.

If coded correctly all across the IO and FS layer (block and network IO),
then the above extreme scenario should either say '-ENOMEM' at mount time:
"no more memory to allocate/resize the affected pool(s)", or should work
as expected, no matter how ridiculous the config is. And it should not
deadlock just because 100 processes hit swap space at the same moment. It
might be slow, but it should not deadlock.

eg. mkraid already returns -ENOMEM if it cannot create the necessary pool.

this is the same logic we follow in other, normal allocations: if there is
not enough RAM, then we -ENOMEM. The difference for devices is that they
must be kept 'minimally operational' at all times, so the time to -ENOMEM
is at device creation time.

(admittedly, the mempool code should cap its total footprint at some given
percentage of total RAM [5%?], so that a mistaken configuration cannot take
up nearly all of the RAM for reserved pools. I've added this to my tree.)

> The above looks like a bit more of a rant than I'd meant to write,
> but I think the current allocator is broken and in need of fixing, and
> once fixed there should be no need for yet another layer on top of it.

and i think deadlock avoidance should be taken out of the current
allocators (which will also simplify and speed them up), and deadlock
sources should be identified and resolved explicitly instead. Otherwise
we'll always have the current misery of 'don't do anything too complex
because it might deadlock'. Unfortunately, the road to a deadlock-free
kernel is not as easy as one might think at first sight, but it can be
done via some planning, without any significant amount of added
complexity.

	Ingo


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread

Thread overview: 15+ messages
2001-12-15 19:40 mempool design Ingo Molnar
2001-12-15 18:47 ` Benjamin LaHaise
2001-12-15 22:18   ` Ingo Molnar
2001-12-17 15:04     ` Andrea Arcangeli
2001-12-17 15:38       ` Victor Yodaiken
2001-12-17 16:10         ` Andrea Arcangeli
2001-12-17 17:33         ` kernel panic Geoffrey
2001-12-18 16:55         ` mempool design Ingo Molnar
2001-12-18 16:06           ` Victor Yodaiken
2001-12-17 17:21       ` Ingo Molnar
2001-12-17 15:58         ` Andrea Arcangeli
2001-12-18  0:32           ` Rik van Riel
2001-12-18 15:21       ` Alan Cox
  -- strict thread matches above, loose matches on Subject: below --
2001-12-15  9:01 Ingo Molnar
2001-12-15 15:39 ` Rik van Riel
