linux-kernel.vger.kernel.org archive mirror
* [RFC] [PATCH] A clean approach to writeout throttling
@ 2007-12-06  0:03 Daniel Phillips
  2007-12-06  1:24 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Daniel Phillips @ 2007-12-06  0:03 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Peter Zijlstra

Good afternoon,

In my estimation, each line of code removed from the kernel is worth ten 
lines added.  If lines can be removed while at the same time improving 
performance, that is worth, ah, about 1,000 times more than lines 
added, correct?  Or maybe 1,000,000 times as much, if removing lines 
also removes a deadlock, something like that.

Today's patch is a step towards removing many lines of code from 
mainline and also aims to improve write cache performance, eliminate a 
troublesome class of deadlocks, save kernel memory and make the kernel 
easier to read.

Background

We seriously began to tackle the issue of block writeout vm deadlock 
more than three years ago, and by now we have accreted a creaky 
agglomeration of dirty page limits, dirty page balancing between block 
devices, and miscellaneous other hacks attempting to solve these 
deadlocks.  These bandaids as of 2.6.23 do not in themselves address 
the underlying problem, but they do add a lot of code to core kernel 
and they do impair Linux's write cache performance.  This is a 
classic case of fixing symptoms instead of the cause of a problem.  The 
worst part of it?  The dirty page limit idea did not actually fix the 
deadlock it was supposed to, but instead generated a new class of 
deadlocks that are easier to trigger.

The basis of the writeout deadlock scenario is easy to see: a task 
requests a page of memory, but all pages are currently in use for 
caching among other things.  To recover memory, some disk-backed pages 
must be evicted.  Dirty pages must be written to disk before being 
evicted, so they are passed to the block layer for writeout.  If any 
code in the block writeout path needs to allocate memory to do its 
work, then we can deadlock because the shortage of memory prevents the 
block layer from making progress to recover memory.

So far, I have just summarized what everybody knows.  This deadlock 
scenario has been present in Linux since day one, and has been solved 
for local block devices since very early on, by the simple expedient of 
providing a reserve of "memalloc" pages that a task may access only if 
it is in the call chain of a memory manager task engaged in writing out 
dirty pages.

What was not commonly known three years ago is that the memalloc reserve 
solution fails in some cases, all of which share the common attribute 
that the task trying to allocate memory for block writeout is not the 
same as the task that initiated the writeout.  In this case, 
the "PF_MEMALLOC" task flag strategy fails because the consumer of the 
memory is not in the call chain of the submitter, and thus does not 
inherit the flag.  Examples of such deadlock-prone use cases include:

   * Fancy virtual block devices with helper daemons
   * Network block device accessed locally
   * Swap over network
   * Remote block devices in general
   * Any block device with a user space component
   * Filesystems implemented in user space

To put it bluntly: without solving these deadlocks, Linux is pretty much 
useless in many modern storage roles.

When I started working on this problem years ago, I went at it by 
tackling the nastiest, most icky manifestation of it first, namely the 
possibility that the network layer might be unable to allocate memory 
to receive a reply packet from a remote block device.  Unfortunately, 
the tricky details of the solution to this problem had the unintended 
effect of overshadowing the main issues, just because the details of 
the network receive deadlock are so deliciously obscure.  Ironically, 
we found that the network read deadlock scenario does not occur in 
practice.  It is high time to draw attention back to the main issue.

The meta-mistake I made while tackling this problem was to focus 
mainly on the logistics of providing "writeout helper" tasks with 
access to memory reserves, and just assume that it would be easy to 
place limits on how deep the helpers would dip into the reserves, which 
restriction is obviously necessary to prevent deadlock.  This was so 
obvious in fact that I (we) did not get around to implementing it until 
quite recently.  Then it became abundantly clear that limiting resource 
requirements is actually the main ingredient of any correct solution, 
and that the various complexities of providing access to memory 
reserves do not amount to much more than an interesting side show.
We (Zumastor team) proved this to ourselves by removing the 
entire "PeterZ" patch set we were carrying (a distant descendant of my 
original network deadlock prevention patch) and lo!  No deadlocks.  It 
seems that throttling writeout traffic in an organized way amounts to 
powerful magic indeed.

So that is all by way of saying, today's patch to limit the amount of 
data in flight to a block device is an Important Patch.  We find that 
our own storage project needs very little more than this hundred lines 
of code or so to wave goodbye permanently to writeout deadlocks, and 
thereby make our storage software actually useful.  Extension to the 
other use cases listed above is Obvious[tm].

Theory

Terje Mathisen said "almost all programming can be viewed as an exercise 
in caching."  In fact, almost all of the Linux kernel can be viewed as 
an exercise in caching.  The main job of the kernel (granted, there are 
other side jobs) is to move data back and forth between disk and 
memory.  We can usefully define the from-memory-to-disk half of this 
task as "progress".  So the main reason for the kernel to exist at all 
is to make progress by writing dirty memory to disk.  OK?

With that definition in hand, we can see right away what must be done to 
guarantee progress: the part of the kernel that makes progress must be 
guaranteed to have enough resources to make progress.  Like a shark, 
when the kernel stops swimming it dies.  (Ok, that analogy is admittedly 
stupid but I still cannot get Linus's rabbits and bazookas out of my 
head, so retaliation is in order.)

To guarantee resources, we need to do two things:

   1) Provide a dedicated pool of resources

   2) Ensure that the consumer never requires more than its share of
        the dedicated resource pool in order to make progress

For the problem at hand a suitable resource pool already exists, namely 
the memalloc reserve.  However, examination of the attached patch will 
show that it knows nothing about that particular form of dedicated 
reserve: in fact any form of reserve, for example Ingo's mempool, or 
any other will do.  The key item is (2) above: we require a sane way of 
ensuring that the consumer (the block writeout path) never exceeds its 
allotment of resources.  This we accomplish simply, by imposing a limit 
on the amount of data that can be in flight to any particular block 
device.

Practice

So here is the key idea of today's patch: it provides a simple mechanism 
for imposing a limit on the amount of data that can be in flight to any 
particular block device.

The limit on in-flight data is in fact expressed generically, as it is 
hard to set down any single rule to specify the amount of resources 
any particular bio transfer will require.  Instead, the block layer 
provides a method that a block driver may optionally fill in, to 
calculate the resource bound in units of the block driver's choosing.  
The block driver thus takes upon itself the task of translating its 
own, self-imposed bound into generic resource units that can be treated 
generically by the kernel.  In simple terms, the block driver looks at 
each bio and decides how many pages (at most) of memalloc reserve could 
be needed to fully service the transfer, and translates that 
requirement into generic units for use by the block layer. The block 
layer compares the generic units to a generic bound provided by the 
block driver at initialization time and decides whether to let the bio 
transfer in question proceed, or hold it back.

This idea is Simplicity itself.  Some not so obvious details follow.

For one thing, a block device these days may not be just a single 
device, but may be a stack of devices connected together by a generic 
mechanism such as device mapper, or a hardcoded stack such as 
multi-disk or network block device.  It is necessary to consider the 
resource requirements of the stack as a whole _before_ letting a 
transfer proceed into any layer of the stack, otherwise deadlock on 
many partially completed transfers becomes a possibility.  For this 
reason, the bio throttling is only implemented at the initial, highest 
level submission of the bio to the block layer and not for any recursive 
submission of the same bio to a lower level block device in a stack.

This in turn has rather far reaching implications: the top level device 
in a stack must take care of inspecting the entire stack in order to 
determine how to calculate its resource requirements, thus becoming
the boss device for the entire stack.  Though this intriguing idea could 
easily become the cause of endless design work and many thousands of 
lines of fancy code, today I sidestep the question entirely using 
the "just provide lots of reserve" strategy.  Horrifying as it may seem 
to some, this is precisely the strategy that Linux has used in the 
context of resource management in general, from the very beginning and 
likely continuing for quite some time into the future.  My strongly held 
opinion in this matter is that we need to solve the real, underlying 
problems definitively with nice code before declaring the opening of 
fancy patch season.  So I am leaving further discussion of automatic 
resource discovery algorithms and the like out of this post.

Today's patch implements _only_ the inflight-data limiting, and does not 
include any code to provide access to reserves.  In practice, simply 
ORing PF_MEMALLOC into the flags of each writeout helper daemon does 
the trick.  We also found that we had to disable or work around the 
code that enforces memory dirty limits (congestion_wait), which 
otherwise causes a whole new class of deadlocks that are in fact 
easier to trigger than the deadlocks it attempts to fix.

Overhead

This patch adds a small amount of overhead to the bio processing 
path, consisting of one new test in submit_bio (generic_make_request) 
and one new test in bio_endio, just to see if throttling methods are 
present.  I do not think that the extra cpu consumed is measurable.  
There is also a slight increase in the size of struct bio to hold two 
new fields, one to reference the struct queue of the block device in 
question and another to remember how many units of resources were 
assigned to the bio in order that exactly that many may be released on 
completion.  In fact, total memory use due to struct bio is typically 
reduced by a not insignificant amount by limiting the number of bios 
concurrently in flight.  Provided of course that local block devices 
also adopt this mechanism, which in fact would be a Very Good Thing[1] 
for reasons I will not get into here.

That is enough for today, and arguably too much, because in the past, 
communicating these simple ideas has always seemed to founder on the 
intrusion of many peripherally related ideas hijacking the discussion.  
Either that, or the subject appears so involved that nobody will risk 
anything more than a cosmetic reply.  Hopefully not this time.

Let me close with perhaps the most relevant remarks: the attached code 
has been in heavy testing and in production for months now.  Thus there 
is nothing theoretical when I say it works, and the patch speaks for 
itself in terms of obvious correctness.  What I hope to add to this in 
the not too distant future is the news that we have removed hundreds of 
lines of existing kernel code, maintaining stability and improving 
performance.

Regards,

Daniel


--- 2.6.24-rc3-mm.clean/block/ll_rw_blk.c	2007-12-04 14:45:25.000000000 -0800
+++ 2.6.24-rc3-mm/block/ll_rw_blk.c	2007-12-04 14:01:18.000000000 -0800
@@ -3210,7 +3210,7 @@ static inline int bio_check_eod(struct b
  */
 static inline void __generic_make_request(struct bio *bio)
 {
-	struct request_queue *q;
+	request_queue_t *q = bdev_get_queue(bio->bi_bdev);
 	sector_t old_sector;
 	int ret, nr_sectors = bio_sectors(bio);
 	dev_t old_dev;
@@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
 	if (bio_check_eod(bio, nr_sectors))
 		goto end_io;
 
+	if (q && q->metric && !bio->bi_queue) {
+		int need = bio->bi_throttle = q->metric(bio);
+		bio->bi_queue = q;
+		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
+		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
 	/*
 	 * Resolve the mapping until finished. (drivers are
 	 * still free to implement/resolve their own stacking
@@ -3234,7 +3241,6 @@ static inline void __generic_make_reques
 	do {
 		char b[BDEVNAME_SIZE];
 
-		q = bdev_get_queue(bio->bi_bdev);
 		if (!q) {
 			printk(KERN_ERR
 			       "generic_make_request: Trying to access "
--- 2.6.24-rc3-mm.clean/drivers/md/dm.c	2007-12-04 14:46:04.000000000 -0800
+++ 2.6.24-rc3-mm/drivers/md/dm.c	2007-12-04 15:26:16.000000000 -0800
@@ -889,6 +889,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -967,6 +972,7 @@ out:
 
 static struct block_device_operations dm_blk_dops;
 
+#define DEFAULT_THROTTLE_CAPACITY 1000
 /*
  * Allocate and initialise a blank device with a given minor.
  */
@@ -1009,6 +1015,11 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	/* A dm device constructor may change the throttle capacity */
+	atomic_set(&md->queue->available, md->queue->capacity = DEFAULT_THROTTLE_CAPACITY);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.24-rc3-mm.clean/fs/bio.c	2007-12-04 14:38:47.000000000 -0800
+++ 2.6.24-rc3-mm/fs/bio.c	2007-12-04 14:14:15.000000000 -0800
@@ -1007,6 +1007,13 @@ void bio_endio(struct bio *bio, int erro
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
 		error = -EIO;
 
+	if (bio->bi_throttle) {
+		struct request_queue *q = bio->bi_queue;
+		atomic_add(bio->bi_throttle, &q->available);
+		bio->bi_throttle = 0; /* or detect multiple endio and err? */
+		wake_up(&q->throttle_wait);
+	}
+
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio, error);
 }
--- 2.6.24-rc3-mm.clean/include/linux/bio.h	2007-12-04 14:39:31.000000000 -0800
+++ 2.6.24-rc3-mm/include/linux/bio.h	2007-12-04 13:56:51.000000000 -0800
@@ -111,6 +111,9 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 	atomic_t		bi_cnt;		/* pin count */
 
+	struct request_queue	*bi_queue;	/* for throttling */
+	unsigned		bi_throttle;	/* throttle metric */
+
 	void			*bi_private;
 
 	bio_destructor_t	*bi_destructor;	/* destructor */
--- 2.6.24-rc3-mm.clean/include/linux/blkdev.h	2007-12-04 14:47:18.000000000 -0800
+++ 2.6.24-rc3-mm/include/linux/blkdev.h	2007-12-04 13:56:51.000000000 -0800
@@ -383,6 +383,10 @@ struct request_queue
 	struct work_struct	unplug_work;
 
 	struct backing_dev_info	backing_dev_info;
+	unsigned (*metric)(struct bio *bio);	/* bio throttle metric */
+	wait_queue_head_t	throttle_wait;
+	atomic_t		available;
+	unsigned		capacity;
 
 	/*
 	 * The queue owner gets to use this for whatever they like.


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06  0:03 [RFC] [PATCH] A clean approach to writeout throttling Daniel Phillips
@ 2007-12-06  1:24 ` Andrew Morton
  2007-12-06  6:21   ` Daniel Phillips
  2007-12-10 10:47 ` Jens Axboe
  2007-12-10 21:31 ` Jonathan Corbet
  2 siblings, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2007-12-06  1:24 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Peter Zijlstra

On Wed, 5 Dec 2007 16:03:01 -0800 Daniel Phillips <phillips@phunq.net> wrote:

> Practice
> 
> So here is the key idea of today's patch: it provides a simple mechanism 
> for imposing a limit on the amount of data that can be in flight to any 
> particular block device.
> 
> The limit on in-flight data is in fact expressed generically, as it is 
> hard to set down any single rule to specify the amount of resources 
> any particular bio transfer will require.  Instead, the block layer 
> provides a method that a block driver may optionally fill in, to 
> calculate the resource bound in units of the block driver's choosing.  
> The block driver thus takes upon itself the task of translating its 
> own, self-imposed bound into generic resource units that can be treated 
> generically by the kernel.  In simple terms, the block driver looks at 
> each bio and decides how many pages (at most) of memalloc reserve could 
> be needed to fully service the transfer, and translates that 
> requirement into generic units for use by the block layer. The block 
> layer compares the generic units to a generic bound provided by the 
> block driver at initialization time and decides whether to let the bio 
> transfer in question proceed, or hold it back.
> 
> This idea is Simplicity itself.  Some not so obvious details follow.
> 
> For one thing, a block device these days may not be just a single 
> device, but may be a stack of devices connected together by a generic 
> mechanism such as device mapper, or a hardcoded stack such as 
> multi-disk or network block device.  It is necessary to consider the 
> resource requirements of the stack as a whole _before_ letting a 
> transfer proceed into any layer of the stack, otherwise deadlock on 
> many partially completed transfers becomes a possibility.  For this 
> reason, the bio throttling is only implemented at the initial, highest 
> level submission of the bio to the block layer and not for any recursive 
> submission of the same bio to a lower level block device in a stack.
> 
> This in turn has rather far reaching implications: the top level device 
> in a stack must take care of inspecting the entire stack in order to 
> determine how to calculate its resource requirements, thus becoming
> the boss device for the entire stack.  Though this intriguing idea could 
> easily become the cause of endless design work and many thousands of 
> lines of fancy code, today I sidestep the question entirely using 
> the "just provide lots of reserve" strategy.  Horrifying as it may seem 
> to some, this is precisely the strategy that Linux has used in the 
> context of resource management in general, from the very beginning and 
> likely continuing for quite some time into the future.  My strongly held 
> opinion in this matter is that we need to solve the real, underlying 
> problems definitively with nice code before declaring the opening of 
> fancy patch season.  So I am leaving further discussion of automatic 
> resource discovery algorithms and the like out of this post.

Rather than asking the stack "how much memory will this request consume"
you could instead ask "how much memory are you currently using".

ie: on entry to the stack, do

	current->account_block_allocations = 1;
	make_request(...);
	rq->used_memory += current->pages_used_for_block_allocations;

and in the page allocator do

	if (!in_interrupt() && current->account_block_allocations)
		current->pages_used_for_block_allocations++;

and then somehow handle deallocation too ;)

The basic idea being to know in real time how much memory a particular
block stack is presently using.  Then, on entry to that stack, if the
stack's current usage is too high, wait for it to subside.


otoh we already have mechanisms for limiting the number of requests in
flight.  This is approximately proportional to the amount of memory which
was allocated to service those requests.  Why not just use that?

> @@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
>  	if (bio_check_eod(bio, nr_sectors))
>  		goto end_io;
>  
> +	if (q && q->metric && !bio->bi_queue) {
> +		int need = bio->bi_throttle = q->metric(bio);
> +		bio->bi_queue = q;
> +		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
> +		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);

This will fall straight through if signal_pending() and (I assume) bad
stuff will happen.  uninterruptible sleep, methinks.




* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06  1:24 ` Andrew Morton
@ 2007-12-06  6:21   ` Daniel Phillips
  2007-12-06  7:31     ` Andrew Morton
  2007-12-06 21:53     ` Bill Davidsen
  0 siblings, 2 replies; 40+ messages in thread
From: Daniel Phillips @ 2007-12-06  6:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Peter Zijlstra

On Wednesday 05 December 2007 17:24, Andrew Morton wrote:
> On Wed, 5 Dec 2007 16:03:01 -0800 Daniel Phillips <phillips@phunq.net> wrote:
> > ...a block device these days may not be just a single 
> > device, but may be a stack of devices connected together by a generic 
> > mechanism such as device mapper, or a hardcoded stack such as 
> > multi-disk or network block device.  It is necessary to consider the 
> > resource requirements of the stack as a whole _before_ letting a 
> > transfer proceed into any layer of the stack, otherwise deadlock on 
> > many partially completed transfers becomes a possibility.  For this 
> > reason, the bio throttling is only implemented at the initial, highest 
> > level submission of the bio to the block layer and not for any recursive 
> > submission of the same bio to a lower level block device in a stack.
> > 
> > This in turn has rather far reaching implications: the top level device 
> > in a stack must take care of inspecting the entire stack in order to 
> > determine how to calculate its resource requirements, thus becoming
> > the boss device for the entire stack.  Though this intriguing idea could 
> > easily become the cause of endless design work and many thousands of 
> > lines of fancy code, today I sidestep the question entirely using 
> > the "just provide lots of reserve" strategy.  Horrifying as it may seem 
> > to some, this is precisely the strategy that Linux has used in the 
> > context of resource management in general, from the very beginning and 
> > likely continuing for quite some time into the future.  My strongly held 
> > opinion in this matter is that we need to solve the real, underlying 
> > problems definitively with nice code before declaring the opening of 
> > fancy patch season.  So I am leaving further discussion of automatic 
> > resource discovery algorithms and the like out of this post.
> 
> Rather than asking the stack "how much memory will this request consume"
> you could instead ask "how much memory are you currently using".
> 
> ie: on entry to the stack, do 
> 
> 	current->account_block_allocations = 1;
> 	make_request(...);
> 	rq->used_memory += current->pages_used_for_block_allocations;
> 
> and in the page allocator do
> 
> 	if (!in_interrupt() && current->account_block_allocations)
> 		current->pages_used_for_block_allocations++;
> 
> and then somehow handle deallocation too ;)

Ah, and how do you ensure that you do not deadlock while making this
inquiry?  Perhaps send a dummy transaction down the pipe?  Even so,
deadlock is possible, quite evidently so in the real life example I have
at hand.

Yours is essentially one of the strategies I had in mind, the other major
one being simply to examine the whole stack, which presupposes some
as-yet-nonexistent kernel-wide method of representing block device
stacks in all their glorious possible topology variations.

> The basic idea being to know in real time how much memory a particular
> block stack is presently using.  Then, on entry to that stack, if the
> stack's current usage is too high, wait for it to subside.

We do not wait for high block device resource usage to subside before
submitting more requests.  The improvement you suggest is aimed at
automatically determining resource requirements by sampling a
running system, rather than requiring a programmer to determine them
arduously by hand.  Something like automatically determining a
workable locking strategy by analyzing running code, wouldn't that be
a treat?  I will hope for one of those under my tree at Christmas.

More practically, I can see a debug mode implemented along the lines
you describe where we automatically detect that a writeout path has
violated its covenant as expressed by its throttle_metric.
 
> otoh we already have mechanisms for limiting the number of requests in
> flight.  This is approximately proportional to the amount of memory which
> was allocated to service those requests.  Why not just use that?

Two reasons.  The minor one is that device mapper bypasses that
mechanism (no elevator) and the major one is that number of requests
does not map well to the amount of resources consumed.  In ddsnap for
example, the amount of memory used by the userspace ddsnapd is
roughly linear vs the number of pages transferred, not the number of
requests.

> > @@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
> >  	if (bio_check_eod(bio, nr_sectors))
> >  		goto end_io;
> >  
> > +	if (q && q->metric && !bio->bi_queue) {
> > +		int need = bio->bi_throttle = q->metric(bio);
> > +		bio->bi_queue = q;
> > +		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
> > +		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
> 
> This will fall straight through if signal_pending() and (I assume) bad
> stuff will happen.  uninterruptible sleep, methinks.

Yes, as a first order repair.  To be done properly I need to express this
in terms of the guts of wait_event_*, and get rid of that race; maybe that
changes the equation.

It would be nice if threads didn't get stuck in D state here, though
_interruptible is probably the wrong idea; we should instead ensure that
whatever is being waited on responds to, e.g., SIGKILL.  This is at the
limits of my scheduler knowledge, and I would appreciate a better
suggestion.  I do detest hangs in D state with SIGKILL immunity, which
behavior unfortunately does not seem all that rare.

Daniel


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06  6:21   ` Daniel Phillips
@ 2007-12-06  7:31     ` Andrew Morton
  2007-12-06  9:48       ` Daniel Phillips
  2007-12-06 21:53     ` Bill Davidsen
  1 sibling, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2007-12-06  7:31 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Peter Zijlstra

On Wed, 5 Dec 2007 22:21:44 -0800 Daniel Phillips <phillips@phunq.net> wrote:

> On Wednesday 05 December 2007 17:24, Andrew Morton wrote:
> > On Wed, 5 Dec 2007 16:03:01 -0800 Daniel Phillips <phillips@phunq.net> wrote:
> > > ...a block device these days may not be just a single 
> > > device, but may be a stack of devices connected together by a generic 
> > > mechanism such as device mapper, or a hardcoded stack such as 
> > > multi-disk or network block device.  It is necessary to consider the 
> > > resource requirements of the stack as a whole _before_ letting a 
> > > transfer proceed into any layer of the stack, otherwise deadlock on 
> > > many partially completed transfers becomes a possibility.  For this 
> > > reason, the bio throttling is only implemented at the initial, highest 
> > > level submission of the bio to the block layer and not for any recursive 
> > > submission of the same bio to a lower level block device in a stack.
> > > 
> > > This in turn has rather far reaching implications: the top level device 
> > > in a stack must take care of inspecting the entire stack in order to 
> > > determine how to calculate its resource requirements, thus becoming
> > > the boss device for the entire stack.  Though this intriguing idea could 
> > > easily become the cause of endless design work and many thousands of 
> > > lines of fancy code, today I sidestep the question entirely using 
> > > the "just provide lots of reserve" strategy.  Horrifying as it may seem 
> > > to some, this is precisely the strategy that Linux has used in the 
> > > context of resource management in general, from the very beginning and 
> > > likely continuing for quite some time into the future.  My strongly held 
> > > opinion in this matter is that we need to solve the real, underlying 
> > > problems definitively with nice code before declaring the opening of 
> > > fancy patch season.  So I am leaving further discussion of automatic 
> > > resource discovery algorithms and the like out of this post.
> > 
> > Rather than asking the stack "how much memory will this request consume"
> > you could instead ask "how much memory are you currently using".
> > 
> > ie: on entry to the stack, do 
> > 
> > 	current->account_block_allocations = 1;
> > 	make_request(...);
> > 	rq->used_memory += current->pages_used_for_block_allocations;
> > 
> > and in the page allocator do
> > 
> > 	if (!in_interrupt() && current->account_block_allocations)
> > 		current->pages_used_for_block_allocations++;
> > 
> > and then somehow handle deallocation too ;)
> 
> Ah, and how do you ensure that you do not deadlock while making this
> inquiry?

It isn't an inquiry - it's a plain old submit_bio() and it runs to
completion in the usual fashion.

Thing is, we wouldn't have called it at all if this queue was already over
its allocation limit.  IOW, we know that it's below its allocation limit,
so we know it won't deadlock.  Given, of course, reasonably pessimistic
error margins.

Which margins can even be observed at runtime: keep a running "max" of this
stack's most-ever memory consumption (for a single call), and only submit a
bio into it when its current allocation is less than (limit - that-max).

>  Perhaps send a dummy transaction down the pipe?  Even so,
> deadlock is possible, quite evidently so in the real life example I have
> at hand.
> 
> Yours is essentially one of the strategies I had in mind, the other major
> one being simply to examine the whole stack, which presupposes some
> as-yet-nonexistent kernel-wide method of representing block device
> stacks in all their glorious possible topology variations.

We already have that, I think: blk_run_backing_dev().  One could envisage a
similar thing which runs up and down the stack accumulating "how much
memory do you need for this request" data, but I think that would be hard to
implement and plain dumb.

> > The basic idea being to know in real time how much memory a particular
> > block stack is presently using.  Then, on entry to that stack, if the
> > stack's current usage is too high, wait for it to subside.
> 
> We do not wait for high block device resource usage to subside before
> submitting more requests.  The improvement you suggest is aimed at
> automatically determining resource requirements by sampling a
> running system, rather than requiring a programmer to determine them
> arduously by hand.  Something like automatically determining a
> workable locking strategy by analyzing running code, wouldn't that be
> a treat?  I will hope for one of those under my tree at Christmas.

I don't see any unviability.

> More practically, I can see a debug mode implemented along the lines
> you describe where we automatically detect that a writeout path has
> violated its covenant as expressed by its throttle_metric.
>  
> > otoh we already have mechanisms for limiting the number of requests in
> > flight.  This is approximately proportional to the amount of memory which
> > was allocated to service those requests.  Why not just use that?
> 
> Two reasons.  The minor one is that device mapper bypasses that
> mechanism (no elevator)

oh.

> and the major one is that number of requests
> does not map well to the amount of resources consumed.  In ddsnap for
> example, the amount of memory used by the userspace ddsnapd is
> roughly linear vs the number of pages transferred, not the number of
> requests.

Yeah, one would need to be pretty pessimal.  Perhaps unacceptably
inaccurate, dunno.

> > > @@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
> > >  	if (bio_check_eod(bio, nr_sectors))
> > >  		goto end_io;
> > >  
> > > +	if (q && q->metric && !bio->bi_queue) {
> > > +		int need = bio->bi_throttle = q->metric(bio);
> > > +		bio->bi_queue = q;
> > > +		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
> > > +		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
> > 
> > This will fall straight through if signal_pending() and (I assume) bad
> > stuff will happen.  uninterruptible sleep, methinks.
> 
> Yes, as a first order repair.  To be done properly I need to express this
> in terms of the guts of wait_event_*, and get rid of that race, maybe that
> changes the equation.

I don't think so.  If you're going to sleep in state TASK_INTERRUPTIBLE
then you *have* to bale out and return to userspace (or whatever) when
signal_pending().  Because schedule() becomes a no-op.

> It would be nice if threads didn't get stuck in D state here, though
> _interruptible is probably the wrong idea, we should instead ensure that
> whatever is being waited on must respond to, e.g., SIGKILL.  This is at
> the limits of my scheduler knowledge; I would appreciate a better
> suggestion.  I do detest hangs in D state with SIGKILL immunity, which
> unfortunately do not seem all that rare.

Well..  See 

add-lock_page_killable.patch
kernel-add-mutex_lock_killable.patch
vfs-use-mutex_lock_killable-in-vfs_readdir.patch

in 2.6.24-rc4-mm1.

But for now, TASK_UNINTERRUPTIBLE is the honest solution.
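
As for the FIXME'd race in the quoted hunk (the atomic_read() in the wait
condition and the later subtraction are not one atomic step), one way to
close it is to make the reservation itself a compare-and-swap loop, so a
waiter either reserves its units or observes an honestly insufficient count
and sleeps.  A user-space sketch of the idea using C11 atomics, purely
illustrative and not kernel code:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Atomically reserve 'need' units from 'available', failing rather than
 * letting the count go negative.  In the kernel this would sit inside
 * the guts of a wait_event()-style loop, sleeping in
 * TASK_UNINTERRUPTIBLE between attempts as discussed above. */
static bool try_reserve(atomic_int *available, int need)
{
	int old = atomic_load(available);

	while (old >= need) {
		/* On failure, 'old' is reloaded and the check repeats,
		 * so a concurrent subtraction cannot slip between the
		 * test and the reservation. */
		if (atomic_compare_exchange_weak(available, &old, old - need))
			return true;
	}
	return false;	/* caller should sleep and retry */
}
```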


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06  7:31     ` Andrew Morton
@ 2007-12-06  9:48       ` Daniel Phillips
  2007-12-06 11:55         ` Andrew Morton
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel Phillips @ 2007-12-06  9:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Peter Zijlstra

On Wednesday 05 December 2007 23:31, Andrew Morton wrote:
> > > Rather than asking the stack "how much memory will this request consume"
> > > you could instead ask "how much memory are you currently using".
> > > 
> > > ie: on entry to the stack, do 
> > > 
> > > 	current->account_block_allocations = 1;
> > > 	make_request(...);
> > > 	rq->used_memory += current->pages_used_for_block_allocations;
> > > 
> > > and in the page allocator do
> > > 
> > > 	if (!in_interrupt() && current->account_block_allocations)
> > > 		current->pages_used_for_block_allocations++;
> > > 
> > > and then somehow handle deallocation too ;)
> > 
> > Ah, and how do you ensure that you do not deadlock while making this
> > inquiry?
> 
> It isn't an inquiry - it's a plain old submit_bio() and it runs to
> completion in the usual fashion.
> 
> Thing is, we wouldn't have called it at all if this queue was already over 
> its allocation limit.  IOW, we know that it's below its allocation limit,
> so we know it won't deadlock.  Given, of course, reasonably pessimistic
> error margins.

OK, I see what you are suggesting.  Yes, one could set the inflight limit
very low and the reserve very high, and run a bio through the stack (what
I meant by "inquiry") to discover the actual usage, then shrink the reserve
accordingly.  By also running a real bio through the stack we can discover
something about the latency.  So we would then know roughly how high
the inflight limit should be set and how much the memalloc reserve
should be increased to handle that particular driver instance.

The big fly in this ointment is that we cannot possibly know that our bio
followed the worst case resource consumption path, whereas it is fairly
easy (hopefully) for a programmer to determine this statically.

> Which margins can even be observed at runtime: keep a running "max" of this
> stack's most-ever memory consumption (for a single call), and only submit a
> bio into it when its current allocation is less than (limit - that-max).

Actually, your mechanism would always have to be operable at runtime,
since inserting a new driver while the system is under heavy memory
load is a perfectly valid operation and has to be reliable.

Anyway, even if you run a bio through the stack lots of times (insert
definition of "lots" here) you still cannot be sure that it has explored the
worst case path.  To put this in perspective, some of the deadlocks we
have hunted down recently have taken days to manifest under artificially
high load.  It just takes that long to randomly explore a sufficient number
of corner cases.

> >  Perhaps send a dummy transaction down the pipe?  Even so,
> > deadlock is possible, quite evidently so in the real life example I have
> > at hand.
> > 
> > Yours is essentially one of the strategies I had in mind, the other major
> > one being simply to examine the whole stack, which presupposes some
> as-yet-nonexistent kernel-wide method of representing block device
> stacks in all their glorious possible topology variations.
> 
> We already have that, I think: blk_run_backing_dev().  One could envisage a
> similar thing which runs up and down the stack accumulating "how much
> memory do you need for this request" data, but I think that would be hard to
> implement and plain dumb.

I don't think I quite communicated there.  We don't actually have any
generic notion of "the block device stack".  Device mapper has its own
model, md has another model, and other stacking devices may have no
model at all, just some through-coded hack.  It would be worth fixing this
problem as part of an effort to generalize the block IO model and make
block devices in general look more like device mapper devices.  But
that would be a pretty big project, the need for which is not generally
recognized.

> > ...Something like automatically determining a
> > workable locking strategy by analyzing running code, wouldn't that be
> > a treat?  I will hope for one of those under my tree at Christmas.
> 
> I don't see any unviability.

A small matter of coding.  It would be a legendary hack.

> > ...number of requests
> > does not map well to the amount of resources consumed.  In ddsnap for
> > example, the amount of memory used by the userspace ddsnapd is
> > roughly linear vs the number of pages transferred, not the number of
> > requests.
> 
> Yeah, one would need to be pretty pessimal.  Perhaps unacceptably
> inaccurate, dunno.

Orders of magnitude more reserve would need to be allocated in the
case of ddsnap, since bio payload can vary through a big range, which
is expected to get bigger as time goes by.  So the few lines of extra
code and the extra bio field needed to get a better fit is well worth the
effort, or even indispensable.

> > > > @@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
> > > >  	if (bio_check_eod(bio, nr_sectors))
> > > >  		goto end_io;
> > > >  
> > > > +	if (q && q->metric && !bio->bi_queue) {
> > > > +		int need = bio->bi_throttle = q->metric(bio);
> > > > +		bio->bi_queue = q;
> > > > +		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
> > > > +		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
> > > 
> > > This will fall straight through if signal_pending() and (I assume) bad
> > > stuff will happen.  uninterruptible sleep, methinks.
> > 
> > Yes, as a first order repair.  To be done properly I need to express this
> > in terms of the guts of wait_event_*, and get rid of that race, maybe that
> > changes the equation.
> 
> I don't think so.  If you're going to sleep in state TASK_INTERRUPTIBLE
> then you *have* to bale out and return to userspace (or whatever) when
> signal_pending().  Because schedule() becomes a no-op.

Thanks for clearing that up.  No, I did not know that important fact about
signal handling, and now it is obvious why it must be so.  Got a pointer
to a doc for that, and other similar scheduler factoids that must surely
be known to those in the know?

> ...for now, TASK_UNINTERRUPTIBLE is the honest solution.

Right, and easy too.  I think we ought to be able to do a fairly good job
of making the in-flight items go away in the presence of a signal, thus
ending the uninterruptible sleep reliably.  Probably.

Regards,

Daniel


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06  9:48       ` Daniel Phillips
@ 2007-12-06 11:55         ` Andrew Morton
  2007-12-06 15:52           ` Rik van Riel
  2007-12-06 20:04           ` Daniel Phillips
  0 siblings, 2 replies; 40+ messages in thread
From: Andrew Morton @ 2007-12-06 11:55 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Peter Zijlstra

On Thu, 6 Dec 2007 01:48:52 -0800 Daniel Phillips <phillips@phunq.net> wrote:

> On Wednesday 05 December 2007 23:31, Andrew Morton wrote:
> > > > Rather than asking the stack "how much memory will this request consume"
> > > > you could instead ask "how much memory are you currently using".
> > > > 
> > > > ie: on entry to the stack, do 
> > > > 
> > > > 	current->account_block_allocations = 1;
> > > > 	make_request(...);
> > > > 	rq->used_memory += current->pages_used_for_block_allocations;
> > > > 
> > > > and in the page allocator do
> > > > 
> > > > 	if (!in_interrupt() && current->account_block_allocations)
> > > > 		current->pages_used_for_block_allocations++;
> > > > 
> > > > and then somehow handle deallocation too ;)
> > > 
> > > Ah, and how do you ensure that you do not deadlock while making this
> > > inquiry?
> > 
> > It isn't an inquiry - it's a plain old submit_bio() and it runs to
> > completion in the usual fashion.
> > 
> > Thing is, we wouldn't have called it at all if this queue was already over 
> > its allocation limit.  IOW, we know that it's below its allocation limit,
> > so we know it won't deadlock.  IOW, we know that it's below its allocation limit,
> > error margins.
> 
> OK, I see what you are suggesting.  Yes, one could set the inflight limit
> very low and the reserve very high, and run a bio through the stack (what
> I meant by "inquiry") to discover the actual usage, then shrink the reserve
> accordingly.  By also running a real bio through the stack we can discover
> something about the latency.  So we would then know roughly how high
> the inflight limit should be set and how much the memalloc reserve
> should be increased to handle that particular driver instance.
> 
> The big fly in this ointment is that we cannot possibly know that our bio
> followed the worst case resource consumption path, whereas it is fairly
> easy (hopefully) for a programmer to determine this statically.

nonono...

Consider an example.

- We a-priori decide to limit a particular stack's peak memory usage to
  1MB

- We empirically discover that the maximum amount of memory which is
  allocated by that stack on behalf of a single BIO is 16kb.  (ie: that's
  the most it has ever used for a single BIO).

- Now, we refuse to feed any more BIOs into the stack when its
  instantaneous memory usage exceeds (1MB - 16kb).

Of course, the _average_ memory-per-BIO is much less than 16kb.  So there
are a lot of BIOs in flight - probably hundreds, but a minimum of 63.

There is a teeny so-small-it-doesn't-matter chance that the stack will
exceed the 1MB limit.  If it happens to be at its (1MB-16kb) limit and all
the memory in the machine is AWOL and then someone throws a
never-seen-before twirly BIO at it.  Not worth worrying about, surely.
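
The arithmetic behind the "minimum of 63" figure follows directly from the
gate condition: with the worst observed per-bio cost subtracted from the cap,
the floor on concurrency is (limit - worst) / worst.  A tiny sketch using the
example's numbers, in units of KB:

```c
#include <assert.h>

/* The gate refuses new bios once instantaneous usage exceeds
 * (limit - worst_per_bio), so even if every in-flight bio costs the
 * observed worst case, at least this many fit under the cap. */
static long min_inflight(long limit, long worst_per_bio)
{
	return (limit - worst_per_bio) / worst_per_bio;
}
```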


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06 11:55         ` Andrew Morton
@ 2007-12-06 15:52           ` Rik van Riel
  2007-12-06 17:34             ` Andrew Morton
  2007-12-06 20:04           ` Daniel Phillips
  1 sibling, 1 reply; 40+ messages in thread
From: Rik van Riel @ 2007-12-06 15:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Daniel Phillips, linux-kernel, Peter Zijlstra

On Thu, 6 Dec 2007 03:55:11 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> - We a-priori decide to limit a particular stack's peak memory usage to
>   1MB
> 
> - We empirically discover that the maximum amount of memory which is
>   allocated by that stack on behalf of a single BIO is 16kb.  (ie: that's
>   the most it has ever used for a single BIO).
> 
> - Now, we refuse to feed any more BIOs into the stack when its
>   instantaneous memory usage exceeds (1MB - 16kb).
> 
> Of course, the _average_ memory-per-BIO is much less than 16kb.  So there
> are a lot of BIOs in flight - probably hundreds, but a minimum of 63.

There is only one problem I can see with this.  With network block
IO, some memory will be consumed upon IO completion.  We need to
make sure we reserve (number of in flight BIOs * maximum amount of
memory consumed upon IO completion) memory, in addition to the
memory you're accounting in your example above.

-- 
All Rights Reversed


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06 15:52           ` Rik van Riel
@ 2007-12-06 17:34             ` Andrew Morton
  2007-12-06 17:48               ` Rik van Riel
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2007-12-06 17:34 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daniel Phillips, linux-kernel, Peter Zijlstra

On Thu, 6 Dec 2007 10:52:42 -0500 Rik van Riel <riel@redhat.com> wrote:

> On Thu, 6 Dec 2007 03:55:11 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > - We a-priori decide to limit a particular stack's peak memory usage to
> >   1MB
> > 
> > - We empirically discover that the maximum amount of memory which is
> >   allocated by that stack on behalf of a single BIO is 16kb.  (ie: that's
> >   the most it has ever used for a single BIO).
> > 
> > - Now, we refuse to feed any more BIOs into the stack when its
> >   instantaneous memory usage exceeds (1MB - 16kb).
> > 
> > Of course, the _average_ memory-per-BIO is much less than 16kb.  So there
> > are a lot of BIOs in flight - probably hundreds, but a minimum of 63.
> 
> There is only one problem I can see with this.  With network block
> IO, some memory will be consumed upon IO completion.  We need to
> make sure we reserve (number of in flight BIOs * maximum amount of
> memory consumed upon IO completion) memory, in addition to the
> memory you're accounting in your example above.
> 

hm, yeah, drat.

What we could do is

- in do_IRQ(): set up a per-cpu pointer to some counter which
  corresponds to this IRQ.

- in the page allocator, if in_irq(), retrieve that per-cpu pointer and
  increment the counter.

- in the network block-io stack we can now look at the number of
  interrupts, number of packets, size of packets and amount of memory
  allocated and work out the max amount of memory which needs to be
  allocated for each frame.

That's all rather handwavy and misses a lot of details and might be
inaccurate too.  Probably sufficient to just work out by hand the amount of
memory which the network stack will need to allocate.  I expect it'll be
two pages..
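
The do_IRQ() idea above can be modeled minimally: on interrupt entry a
pointer to the responsible device's counter is installed, and the page
allocator's hook charges it when allocating in interrupt context.  A
single-threaded user-space sketch, where a plain static variable stands in
for the per-cpu pointer and all names are invented:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the per-cpu pointer set up in do_IRQ(); one counter
 * exists per IRQ source (e.g. per NIC). */
static long *irq_alloc_counter;

static void enter_irq(long *counter) { irq_alloc_counter = counter; }
static void exit_irq(void)           { irq_alloc_counter = NULL; }

/* Page-allocator hook: if the allocation happens in interrupt context,
 * charge it to the interrupting device's counter. */
static void account_page_alloc(int in_irq)
{
	if (in_irq && irq_alloc_counter)
		(*irq_alloc_counter)++;
}
```

The accumulated counter, divided by packets seen, gives the empirical
memory-per-frame figure the paragraph above is after.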


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06 17:34             ` Andrew Morton
@ 2007-12-06 17:48               ` Rik van Riel
  0 siblings, 0 replies; 40+ messages in thread
From: Rik van Riel @ 2007-12-06 17:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Daniel Phillips, linux-kernel, Peter Zijlstra

On Thu, 6 Dec 2007 09:34:32 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> That's all rather handwavy and misses a lot of details and might be
> inaccurate too.  Probably sufficient to just work out by hand the amount of
> memory which the network stack will need to allocate.  I expect it'll be
> two pages..

Doesn't Peter Zijlstra's patch series take care of all those
nasty details already?

-- 
All Rights Reversed


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06 11:55         ` Andrew Morton
  2007-12-06 15:52           ` Rik van Riel
@ 2007-12-06 20:04           ` Daniel Phillips
  2007-12-06 20:27             ` Andrew Morton
  1 sibling, 1 reply; 40+ messages in thread
From: Daniel Phillips @ 2007-12-06 20:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Peter Zijlstra

On Thursday 06 December 2007 03:55, Andrew Morton wrote:
> Consider an example.
>
> - We a-priori decide to limit a particular stack's peak memory usage
> to 1MB

Ah, this is the piece I was missing.

> - We empirically discover that the maximum amount of memory which is
>   allocated by that stack on behalf of a single BIO is 16kb.  (ie:
> that's the most it has ever used for a single BIO).
>
> - Now, we refuse to feed any more BIOs into the stack when its
>   instantaneous memory usage exceeds (1MB - 16kb).

And printk a warning, because the programmer's static analysis was
wrong.

> Of course, the _average_ memory-per-BIO is much less than 16kb.  So
> there are a lot of BIOs in flight - probably hundreds, but a minimum
> of 63.

And progress is being made, everybody is happy.

> There is a teeny so-small-it-doesn't-matter chance that the stack
> will exceed the 1MB limit.  If it happens to be at its (1MB-16kb)
> limit and all the memory in the machine is AWOL and then someone
> throws a never-seen-before twirly BIO at it.  Not worth worrying
> about, surely.

OK, I see where you are going.  The programmer statically determines a 
total reservation for the stack, which is big enough that progress is 
guaranteed.  We then throttle based on actual memory consumption, and 
essentially use a heuristic to decide when we are near the upper limit.  
Workable I think, but...

The main idea, current->pages_used_for_block_allocations++ is valid only 
in direct call context.  If a daemon needs to allocate memory on behalf 
of the IO transfer (not unusual) it won't get accounted, which is 
actually the central issue in this whole class of deadlocks.  Any idea 
how to extend the accounting idea to all tasks involved in a particular 
block device stack?

Regards,

Daniel


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06 20:04           ` Daniel Phillips
@ 2007-12-06 20:27             ` Andrew Morton
  2007-12-06 21:27               ` Daniel Phillips
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2007-12-06 20:27 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, peterz

On Thu, 6 Dec 2007 12:04:14 -0800
Daniel Phillips <phillips@phunq.net> wrote:

> Any idea 
> how to extend the accounting idea to all tasks involved in a particular 
> block device stack?

SMOP, I'd have thought.  As long as each piece of code which handles data
for this stack knows that it's handling data for that stack it should be
able to account its memory allocations.

The tricky part will be networking allocations because a NIC can of course
handle data for all sorts of consumers.  But I expect this can be greatly
simplified with a few heuristics - work out how much memory your typical
networking stack will allocate for a frame and tack that onto the total.
Couple of pages worst case..


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06 20:27             ` Andrew Morton
@ 2007-12-06 21:27               ` Daniel Phillips
  0 siblings, 0 replies; 40+ messages in thread
From: Daniel Phillips @ 2007-12-06 21:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, peterz

On Thursday 06 December 2007 12:27, Andrew Morton wrote:
> On Thu, 6 Dec 2007 12:04:14 -0800
>
> Daniel Phillips <phillips@phunq.net> wrote:
> > Any idea
> > how to extend the accounting idea to all tasks involved in a
> > particular block device stack?
>
> SMOP, I'd have thought.

Agreed, which I realized as soon as the post was one minute old.  Sure, 
each helper for the device registers as a helper which puts a pointer 
in the task struct, which points to the accounting info so only one new 
field in task struct.  The more I ponder, the more doable it seems.
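
That one-new-field scheme might look like the following user-space sketch,
where both the submitting task and a registered helper daemon point at the
same accounting object, so the daemon's allocations are charged to the stack
rather than lost.  All types and names here are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Shared accounting object for one block device stack; every task doing
 * work on the stack's behalf points at the same instance. */
struct block_stack_account {
	long pages_used;
};

/* Stand-in for the one new task_struct field proposed above. */
struct task {
	struct block_stack_account *bsa;
};

/* A helper daemon registers against the stack it serves... */
static void register_helper(struct task *t, struct block_stack_account *a)
{
	t->bsa = a;
}

/* ...so its allocations are charged to the stack, not dropped on the
 * floor as they would be with purely direct-call accounting. */
static void charge_alloc(struct task *t, long pages)
{
	if (t->bsa)
		t->bsa->pages_used += pages;
}
```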

> As long as each piece of code which handles 
> data for this stack knows that it's handling data for that stack it
> should be able to account its memory allocations.

Don't forget that we do not actually have a usable notion of "block 
device stack" yet.  Perhaps you are just assuming that is 
easy/imminent?

> The tricky part will be networking allocations because a NIC can of
> course handle data for all sorts of consumers.  But I expect this can
> be greatly simplified with a few heuristics - work out how much
> memory your typical networking stack will allocate for a frame and
> tack that onto the total. Couple of pages worst case..

Actually, the same pattern that Peter and I developed for handling 
network deadlock extends to this accounting concept.  As you say, it's 
a SMOP.

Daniel


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06  6:21   ` Daniel Phillips
  2007-12-06  7:31     ` Andrew Morton
@ 2007-12-06 21:53     ` Bill Davidsen
  2007-12-07  0:04       ` Daniel Phillips
  1 sibling, 1 reply; 40+ messages in thread
From: Bill Davidsen @ 2007-12-06 21:53 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrew Morton, linux-kernel, Peter Zijlstra

Daniel Phillips wrote:
> On Wednesday 05 December 2007 17:24, Andrew Morton wrote:
>> On Wed, 5 Dec 2007 16:03:01 -0800 Daniel Phillips <phillips@phunq.net> wrote:
>>> ...a block device these days may not be just a single 
>>> device, but may be a stack of devices connected together by a generic 
>>> mechanism such as device mapper, or a hardcoded stack such as 
>>> multi-disk or network block device.  It is necessary to consider the 
>>> resource requirements of the stack as a whole _before_ letting a 
>>> transfer proceed into any layer of the stack, otherwise deadlock on 
>>> many partially completed transfers becomes a possibility.  For this 
>>> reason, the bio throttling is only implemented at the initial, highest 
>>> level submission of the bio to the block layer and not for any recursive 
>>> submission of the same bio to a lower level block device in a stack.
>>>
>>> This in turn has rather far reaching implications: the top level device 
>>> in a stack must take care of inspecting the entire stack in order to 
>>> determine how to calculate its resource requirements, thus becoming
>>> the boss device for the entire stack.  Though this intriguing idea could 
>>> easily become the cause of endless design work and many thousands of 
>>> lines of fancy code, today I sidestep the question entirely using 
>>> the "just provide lots of reserve" strategy.  Horrifying as it may seem 
>>> to some, this is precisely the strategy that Linux has used in the 
>>> context of resource management in general, from the very beginning and 
>>> likely continuing for quite some time into the future  My strongly held 
>>> opinion in this matter is that we need to solve the real, underlying 
>>> problems definitively with nice code before declaring the opening of 
>>> fancy patch season.  So I am leaving further discussion of automatic 
>>> resource discovery algorithms and the like out of this post.
>> Rather than asking the stack "how much memory will this request consume"
>> you could instead ask "how much memory are you currently using".
>>
>> ie: on entry to the stack, do 
>>
>> 	current->account_block_allocations = 1;
>> 	make_request(...);
>> 	rq->used_memory += current->pages_used_for_block_allocations;
>>
>> and in the page allocator do
>>
>> 	if (!in_interrupt() && current->account_block_allocations)
>> 		current->pages_used_for_block_allocations++;
>>
>> and then somehow handle deallocation too ;)
> 
> Ah, and how do you ensure that you do not deadlock while making this
> inquiry?  Perhaps send a dummy transaction down the pipe?  Even so,
> deadlock is possible, quite evidently so in the real life example I have
> at hand.
> 
> Yours is essentially one of the strategies I had in mind, the other major
> one being simply to examine the whole stack, which presupposes some
> as-yet-nonexistent kernel-wide method of representing block device
> stacks in all their glorious possible topology variations.
> 
>> The basic idea being to know in real time how much memory a particular
>> block stack is presently using.  Then, on entry to that stack, if the
>> stack's current usage is too high, wait for it to subside.
> 
> We do not wait for high block device resource usage to subside before
> submitting more requests.  The improvement you suggest is aimed at
> automatically determining resource requirements by sampling a
> running system, rather than requiring a programmer to determine them
> arduously by hand.  Something like automatically determining a
> workable locking strategy by analyzing running code, wouldn't that be
> a treat?  I will hope for one of those under my tree at Christmas.
> 
The problem is that you (a) may or may not know just how bad a worst 
case can be, and (b) may block unnecessarily by being pessimistic.

The dummy transaction would be nice, but it would be perfect if you 
could send the real transaction down with a max memory limit and a flag, 
have each level check and decrement the max by what's actually needed, 
and then return some pass/fail status for that particular transaction. 
Clearly every level in the stack would have to know how to do that. It 
would seem that once excess memory use was detected the transaction 
could be failed without deadlock.
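
The budget-decrement idea sketches directly: each layer charges its declared
cost against the remaining budget, and the transaction fails cleanly rather
than deadlocking once the budget is exhausted.  The layer costs in the test
are made-up numbers:

```c
#include <assert.h>
#include <stdbool.h>

/* One layer of the stack checks and decrements the per-transaction
 * memory budget.  Returning false fails the transaction outright,
 * which is the point: a clean failure instead of a deadlock. */
static bool charge_layer(long *budget, long need)
{
	if (*budget < need)
		return false;
	*budget -= need;
	return true;
}
```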

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06 21:53     ` Bill Davidsen
@ 2007-12-07  0:04       ` Daniel Phillips
  2007-12-07  0:29         ` Andrew Morton
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel Phillips @ 2007-12-07  0:04 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Andrew Morton, linux-kernel, Peter Zijlstra

On Thursday 06 December 2007 13:53, Bill Davidsen wrote:
> Daniel Phillips wrote:
> The problem is that you (a) may or may not know just how bad a worst
> case can be, and (b) may block unnecessarily by being pessimistic.

True, but after a quick introspect I realized that that issue (it's 
really a single issue) is not any worse than the way I planned to wave 
my hands at the issue of programmers constructing their metrics wrongly 
and thereby breaking the throttling assumptions.

Which is to say that I am now entirely convinced by Andrew's argument and 
am prepared to reroll the patch along the lines he suggests.  The 
result will be somewhat bigger.  Only a minor change is required to the 
main mechanism: we will now account things entirely in units of pages 
instead of abstract units, eliminating a whole class of things to go 
wrong.  I like that.  Accounting variables get shifted to a new home, 
maybe.  Must try a few ideas and see what works.

Anyway, the key idea is that task struct will gain a field pointing at a 
handle for the "block device stack", whatever that is (this is sure to 
evolve over time) and alloc_pages will know how to account pages to 
that object.  The submit_bio and bio->endio bits change hardly at all.

The runner up key idea is that we will gain a notion of "block device 
stack" (or block stack for short, so that we may implement block 
stackers) which for the time being will simply be Device Mapper's 
notion of device stack, however many warts that may have.  It's there 
now and we use it for ddsnap.

The other player in this is Peterz's swap over network use case, which 
does not involve a device mapper device.  Maybe it should?  Otherwise 
we will need a variant notion of block device stack, and the two 
threads of work should merge eventually.  There is little harm in 
starting this effort in two different places, quite the contrary.

In the meantime we do have a strategy that works, posted at the head of 
this thread, for anybody who needs it now.

> The dummy transaction would be nice, but it would be perfect if you
> could send the real transaction down with a max memory limit and a
> flag, have each level check and decrement the max by what's actually
> needed, and then return some pass/fail status for that particular
> transaction. Clearly every level in the stack would have to know how
> to do that. It would seem that once excess memory use was detected
> the transaction could be failed without deadlock.

The function of the dummy transaction will be to establish roughly what 
kind of footprint for a single transaction we see on that block IO 
path.  Then we will make the reservation _hugely_ greater than that, to 
accommodate 1000 or so of those.  A transaction blocks if it actually 
tries to use more than that.  We close the "many partial submissions 
all deadlock together" hole by ... <insert handwaving here>.  Can't go 
wrong, right?

Agreed that drivers should pay special attention to our dummy 
transaction and try to use the maximum possible resources when they see 
one.  More handwaving, but this is progress.

Regards,

Daniel


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-07  0:04       ` Daniel Phillips
@ 2007-12-07  0:29         ` Andrew Morton
  2007-12-07  7:13           ` Daniel Phillips
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2007-12-07  0:29 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: davidsen, linux-kernel, peterz

On Thu, 6 Dec 2007 16:04:41 -0800
Daniel Phillips <phillips@phunq.net> wrote:

> The runner up key idea is that we will gain a notion of "block device 
> stack" (or block stack for short, so that we may implement block 
> stackers) which for the time being will simply be Device Mapper's 
> notion of device stack, however many warts that may have.  It's there 
> now and we use it for ddsnap.

Perhaps all we need to track is the outermost point?

submit_bio(...)
{
	bool remove_the_rq = false;

	...
	if (current->the_rq == NULL) {
		current->the_rq = rq;
		remove_the_rq = true;
	}
	...
	if (remove_the_rq)
		current->the_rq = NULL;
}

?


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-07  0:29         ` Andrew Morton
@ 2007-12-07  7:13           ` Daniel Phillips
  2007-12-10  9:20             ` Daniel Phillips
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel Phillips @ 2007-12-07  7:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davidsen, linux-kernel, peterz

On Thursday 06 December 2007 16:29, Andrew Morton wrote:
> On Thu, 6 Dec 2007 16:04:41 -0800
>
> Daniel Phillips <phillips@phunq.net> wrote:
> > The runner up key idea is that we will gain a notion of "block
> > device stack" (or block stack for short, so that we may implement
> > block stackers) which for the time being will simply be Device
> > Mapper's notion of device stack, however many warts that may have. 
> > It's there now and we use it for ddsnap.
>
> Perhaps all we need to track is the outermost point?
>
> submit_bio(...)
> {
> 	bool remove_the_rq = false;
>
> 	...
> 	if (current->the_rq == NULL) {
> 		current->the_rq = rq;
> 		remove_the_rq = true;
> 	}
> 	...
> 	if (remove_the_rq)
> 		current->the_rq = NULL;
> }
>
> ?

The parent patch already has that crucial property in a simple way, see:

       if (q && q->metric && !bio->bi_queue) {
               bio->bi_queue = q;

Regards,

Daniel

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-07  7:13           ` Daniel Phillips
@ 2007-12-10  9:20             ` Daniel Phillips
  0 siblings, 0 replies; 40+ messages in thread
From: Daniel Phillips @ 2007-12-10  9:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davidsen, linux-kernel, peterz

Hi Andrew,

Unfortunately, I agreed with your suggestion too hastily.  Not only 
would it be complex to implement, it does not work.  It took me several 
days to put my finger on exactly why.  Here it is in a nutshell: 
resources may be consumed _after_ the gatekeeper runs the "go, no go" 
throttling decision.  To illustrate, throw 10,000 bios simultaneously 
at a block stack that is supposed to allow only about 1,000 in flight 
at a time.  If the block stack allocates memory somewhat late in its 
servicing scheme (for example, when it sends a network message) then it 
is possible that no actual resource consumption will have taken place 
before all 10,000 bios are allowed past the gate keeper, and deadlock 
is sure to occur sooner or later.

In general, we must throttle against the maximum requirement of inflight 
bios rather than against the measured consumption.  This achieves the 
invariant I have touted, namely that memory consumption on the block 
writeout path must be bounded.  We could therefore possibly use your 
suggestion or something resembling it to implement a debug check that 
the programmer did in fact do their bounds arithmetic correctly, but 
it is not useful for enforcing the bound itself.

In case that coffin needs more nails in it, consider that we would not 
only need to account page allocations, but frees as well.  So what 
tells us that a page has returned to the reserve pool?  Oops, tough 
one.  The page may have been returned to a slab and thus not actually 
freed, though it remains available for satisfying new bio transactions.  
Because of such caching, your algorithm would quickly lose track of 
available resources and grind to a halt.

Never mind that keeping track of page frees is a nasty problem in 
itself.  They can occur in interrupt context, so forget the current-> 
idea.  Even keeping track of page allocations for bio transactions in 
normal context will be a mess, and that is the easy part.  I can just 
imagine the code attempting to implement this approach accreting into a 
monster that gets confusingly close to working without ever actually 
getting there.

We do have a simple, elegant solution posted at the head of this thread, 
which is known to work.

Regards,

Daniel

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06  0:03 [RFC] [PATCH] A clean approach to writeout throttling Daniel Phillips
  2007-12-06  1:24 ` Andrew Morton
@ 2007-12-10 10:47 ` Jens Axboe
  2007-12-10 11:23   ` [RFC] [PATCH] A clean approach " Daniel Phillips
  2007-12-10 11:33   ` [RFC] [PATCH] A clean approach " Daniel Phillips
  2007-12-10 21:31 ` Jonathan Corbet
  2 siblings, 2 replies; 40+ messages in thread
From: Jens Axboe @ 2007-12-10 10:47 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Wed, Dec 05 2007, Daniel Phillips wrote:
> --- 2.6.24-rc3-mm.clean/block/ll_rw_blk.c	2007-12-04 14:45:25.000000000 -0800
> +++ 2.6.24-rc3-mm/block/ll_rw_blk.c	2007-12-04 14:01:18.000000000 -0800
> @@ -3210,7 +3210,7 @@ static inline int bio_check_eod(struct b
>   */
>  static inline void __generic_make_request(struct bio *bio)
>  {
> -	struct request_queue *q;
> +	request_queue_t *q = bdev_get_queue(bio->bi_bdev);
>  	sector_t old_sector;
>  	int ret, nr_sectors = bio_sectors(bio);
>  	dev_t old_dev;
> @@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
>  	if (bio_check_eod(bio, nr_sectors))
>  		goto end_io;
>  
> +	if (q && q->metric && !bio->bi_queue) {
> +		int need = bio->bi_throttle = q->metric(bio);
> +		bio->bi_queue = q;
> +		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
> +		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
> +		atomic_sub(need, &q->available);
> +	}
>  	/*
>  	 * Resolve the mapping until finished. (drivers are
>  	 * still free to implement/resolve their own stacking
> @@ -3234,7 +3241,6 @@ static inline void __generic_make_reques
>  	do {
>  		char b[BDEVNAME_SIZE];
>  
> -		q = bdev_get_queue(bio->bi_bdev);
>  		if (!q) {
>  			printk(KERN_ERR
>  			       "generic_make_request: Trying to access "

Ehm, this patch is so broken it's not even funny - did you even compile?
You would have noticed the warning on request_queue_t, surely. The big
problem is the last hunk here though, how would that work on stacked
devices? Clue: ->bi_bdev is not const, it can change after a call to
->make_request_fn().

-- 
Jens Axboe


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 10:47 ` Jens Axboe
@ 2007-12-10 11:23   ` Daniel Phillips
  2007-12-10 11:41     ` Jens Axboe
  2007-12-10 11:33   ` [RFC] [PATCH] A clean approach " Daniel Phillips
  1 sibling, 1 reply; 40+ messages in thread
From: Daniel Phillips @ 2007-12-10 11:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Monday 10 December 2007 02:47, Jens Axboe wrote:
> Ehm, this patch is so broken it's not even funny - did you even
> compile? You would have noticed the warning on request_queue_t,
> surely. The big problem is the last hunk here though, how would that
> work on stacked devices? Clue: ->bi_bdev is not const, it can change
> after a call to ->make_request_fn().

Such paranoia.  Yes, the patch was compiled.  Yes, the warning 
slipped through.  No, it is not substantive, and in fact it was removed 
from another branch of our tree already.

Ignoring the rhetoric, apparently you missed the line:

+       if (q && q->metric && !bio->bi_queue) {

This prevents any reference to bi_bdev after the initial call to 
generic_make_request.  Thanks to Evgeniy for pointing out the need for 
this measure on the last go round.

"So broken" is a gross exaggeration.  Substantive comments welcome.

Daniel

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 10:47 ` Jens Axboe
  2007-12-10 11:23   ` [RFC] [PATCH] A clean approach " Daniel Phillips
@ 2007-12-10 11:33   ` Daniel Phillips
  1 sibling, 0 replies; 40+ messages in thread
From: Daniel Phillips @ 2007-12-10 11:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 152 bytes --]

On Monday 10 December 2007 02:47, Jens Axboe wrote:
> ...the warning on request_queue_t...

There you go, Jens, service with a smile.

Regards,

Daniel

[-- Attachment #2: bio.throttle-2.6.24-rc3-mm --]
[-- Type: text/x-diff, Size: 3912 bytes --]

--- 2.6.24-rc3-mm.clean/block/ll_rw_blk.c	2007-12-04 14:45:25.000000000 -0800
+++ 2.6.24-rc3-mm/block/ll_rw_blk.c	2007-12-10 03:27:42.000000000 -0800
@@ -3210,7 +3210,7 @@ static inline int bio_check_eod(struct b
  */
 static inline void __generic_make_request(struct bio *bio)
 {
-	struct request_queue *q;
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	sector_t old_sector;
 	int ret, nr_sectors = bio_sectors(bio);
 	dev_t old_dev;
@@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
 	if (bio_check_eod(bio, nr_sectors))
 		goto end_io;
 
+	if (q && q->metric && !bio->bi_queue) {
+		int need = bio->bi_throttle = q->metric(bio);
+		bio->bi_queue = q;
+		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
+		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
 	/*
 	 * Resolve the mapping until finished. (drivers are
 	 * still free to implement/resolve their own stacking
@@ -3234,7 +3241,6 @@ static inline void __generic_make_reques
 	do {
 		char b[BDEVNAME_SIZE];
 
-		q = bdev_get_queue(bio->bi_bdev);
 		if (!q) {
 			printk(KERN_ERR
 			       "generic_make_request: Trying to access "
--- 2.6.24-rc3-mm.clean/drivers/md/dm.c	2007-12-04 14:46:04.000000000 -0800
+++ 2.6.24-rc3-mm/drivers/md/dm.c	2007-12-04 23:31:41.000000000 -0800
@@ -889,6 +889,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -967,6 +972,7 @@ out:
 
 static struct block_device_operations dm_blk_dops;
 
+#define DEFAULT_THROTTLE_CAPACITY 1000
 /*
  * Allocate and initialise a blank device with a given minor.
  */
@@ -1009,6 +1015,11 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	/* A dm device constructor may change the throttle capacity */
+	atomic_set(&md->queue->available, md->queue->capacity = DEFAULT_THROTTLE_CAPACITY);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.24-rc3-mm.clean/fs/bio.c	2007-12-04 14:38:47.000000000 -0800
+++ 2.6.24-rc3-mm/fs/bio.c	2007-12-04 23:31:41.000000000 -0800
@@ -1007,6 +1007,13 @@ void bio_endio(struct bio *bio, int erro
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
 		error = -EIO;
 
+	if (bio->bi_throttle) {
+		struct request_queue *q = bio->bi_queue;
+		atomic_add(bio->bi_throttle, &q->available);
+		bio->bi_throttle = 0; /* or detect multiple endio and err? */
+		wake_up(&q->throttle_wait);
+	}
+
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio, error);
 }
--- 2.6.24-rc3-mm.clean/include/linux/bio.h	2007-12-04 14:39:31.000000000 -0800
+++ 2.6.24-rc3-mm/include/linux/bio.h	2007-12-04 23:31:41.000000000 -0800
@@ -111,6 +111,9 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 	atomic_t		bi_cnt;		/* pin count */
 
+	struct request_queue	*bi_queue;	/* for throttling */
+	unsigned		bi_throttle;	/* throttle metric */
+
 	void			*bi_private;
 
 	bio_destructor_t	*bi_destructor;	/* destructor */
--- 2.6.24-rc3-mm.clean/include/linux/blkdev.h	2007-12-04 14:47:18.000000000 -0800
+++ 2.6.24-rc3-mm/include/linux/blkdev.h	2007-12-04 23:31:41.000000000 -0800
@@ -383,6 +383,10 @@ struct request_queue
 	struct work_struct	unplug_work;
 
 	struct backing_dev_info	backing_dev_info;
+	unsigned (*metric)(struct bio *bio);	/* bio throttle metric */
+	wait_queue_head_t	throttle_wait;
+	atomic_t		available;
+	unsigned		capacity;
 
 	/*
 	 * The queue owner gets to use this for whatever they like.

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 11:23   ` [RFC] [PATCH] A clean approach " Daniel Phillips
@ 2007-12-10 11:41     ` Jens Axboe
  2007-12-10 12:13       ` Daniel Phillips
  0 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2007-12-10 11:41 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Mon, Dec 10 2007, Daniel Phillips wrote:
> On Monday 10 December 2007 02:47, Jens Axboe wrote:
> > Ehm, this patch is so broken it's not even funny - did you even
> > compile? You would have noticed the warning on request_queue_t,
> > surely. The big problem is the last hunk here though, how would that
> > work on stacked devices? Clue: ->bi_bdev is not const, it can change
> > after a call to ->make_request_fn().
> 
> Such paranoia.  Yes, the patch was compiled.   Yes, the warning was 
> slipped through.  No, it is not substantive, and in fact was removed 
> from another branch of our tree already.
> 
> Ignoring the rhetoric, apparently you missed the line:
> 
> +       if (q && q->metric && !bio->bi_queue) {
> 
> This prevents any reference to bi_bdev after the initial call to 
> generic_make_request.  Thanks to Evgeniy for pointing out the need for 
> this measure on the last go round.

Which saves the initial target, for ease of accounting at end io time -
that's not the point. What happens when ->make_request_fn() changes
bio->bi_bdev and returns 1, causing another iteration of the
__generic_make_request() loop? 'q' is no longer the valid target,
bdev_get_queue(bio->bi_bdev) is.

> "So broken" is a gross exaggeration.  Substantive comments welcome.

Or you could try and make an effort to understand the comment instead of
just glancing over it.

-- 
Jens Axboe


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 11:41     ` Jens Axboe
@ 2007-12-10 12:13       ` Daniel Phillips
  2007-12-10 12:16         ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel Phillips @ 2007-12-10 12:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Monday 10 December 2007 03:41, Jens Axboe wrote:
> On Mon, Dec 10 2007, Daniel Phillips wrote:
> > +       if (q && q->metric && !bio->bi_queue) {
> >
> > This prevents any reference to bi_bdev after the initial call to
> > generic_make_request.  Thanks to Evgeniy for pointing out the need
> > for this measure on the last go round.
>
> Which saves the initial target, for ease of accounting at end io time
> - that's not the point. What happens when ->make_request_fn() changes
> bio->bi_bdev and returns 1, causing another iteration of the
> __generic_make_request() loop? 'q' is no longer the valid target,
> bdev_get_queue(bio->bi_bdev) is.

What happens on the second iteration of a recursive submission loop is 
exactly nothing, as is right and proper.  The throttling has already 
been done, and all the state necessary to perform the unthrottle was 
recorded in the bio.  Everything seems to be in order there, and the 
algorithm does indeed perform its function as designed, though to be 
sure we have not tested it on the -mm branch, only on mainline.

Regards,

Daniel

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 12:13       ` Daniel Phillips
@ 2007-12-10 12:16         ` Jens Axboe
  2007-12-10 12:27           ` Daniel Phillips
  0 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2007-12-10 12:16 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Mon, Dec 10 2007, Daniel Phillips wrote:
> On Monday 10 December 2007 03:41, Jens Axboe wrote:
> > On Mon, Dec 10 2007, Daniel Phillips wrote:
> > > +       if (q && q->metric && !bio->bi_queue) {
> > >
> > > This prevents any reference to bi_bdev after the initial call to
> > > generic_make_request.  Thanks to Evgeniy for pointing out the need
> > > for this measure on the last go round.
> >
> > Which saves the initial target, for ease of accounting at end io time
> > - that's not the point. What happens when ->make_request_fn() changes
> > bio->bi_bdev and returns 1, causing another iteration of the
> > __generic_make_request() loop? 'q' is no longer the valid target,
> > bdev_get_queue(bio->bi_bdev) is.
> 
> What happens on the second iteration of a recursive submission loop is 
> exactly nothing, as is right and proper.  The throttling has already 
> been done, and all the state necessary to perform the unthrottle was 
> recorded in the bio.  Everything seems to be in order there, and the 
> algorithm does indeed perform its function as designed, though to be 
> sure we have not tested it on the -mm branch, only on mainline.

OK, let me get the neon out then. This has nothing to do with
throttling, I thought I made it clear that I get why you store the
origin queue in ->bi_queue. I'm concerned with the workings of
redirecting a bio. Previously we looked up the queue associated with
bio->bi_bdev inside the loop in __generic_make_request(), as is REQUIRED
to correctly locate a DIFFERENT queue if bio->bi_bdev has been changed
to point somewhere else.

Clear?

-- 
Jens Axboe


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 12:16         ` Jens Axboe
@ 2007-12-10 12:27           ` Daniel Phillips
  2007-12-10 12:32             ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel Phillips @ 2007-12-10 12:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Monday 10 December 2007 04:16, Jens Axboe wrote:
> OK, let me get the neon out then. This has nothing to do with
> throttling, I thought I made it clear that I get why you store the
> origin queue in ->bi_queue. I'm concerned with the workings of
> redirecting a bio. Previously we looked up the queue associated with
> bio->bi_bdev inside the loop in __generic_make_request(), as is
> REQUIRED to correctly locate a DIFFERENT queue if bio->bi_bdev has
> been changed to point somewhere else.

Rhetoric aside, again.

We are only interested in throttling against the bio->bi_bdev that was 
stored in the bio at the time of the call to generic_make_request, why 
should we care about the redirected value?

Regards,

Daniel

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 12:27           ` Daniel Phillips
@ 2007-12-10 12:32             ` Jens Axboe
  2007-12-10 13:04               ` Daniel Phillips
  0 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2007-12-10 12:32 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Mon, Dec 10 2007, Daniel Phillips wrote:
> On Monday 10 December 2007 04:16, Jens Axboe wrote:
> > OK, let me get the neon out then. This has nothing to do with
> > throttling, I thought I made it clear that I get why you store the
> > origin queue in ->bi_queue. I'm concerned with the workings of
> > redirecting a bio. Previously we looked up the queue associated with
> > bio->bi_bdev inside the loop in __generic_make_request(), as is
> > REQUIRED to correctly locate a DIFFERENT queue if bio->bi_bdev has
> > been changed to point somewhere else.
> 
> Rhetoric aside, again.
> 
> We are only interested in throttling against the bio->bi_bdev that was 
> stored in the bio at the time of the call to generic_make_request, why 
> should we care about the redirected value?

Let me repeat - this has nothing to do with throttling! You are breaking
the bio redirection by killing that bdev_get_queue() in the
__generic_make_request().

I honestly don't know how to make this any clearer than I already did
above. Sleep on it.

-- 
Jens Axboe


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 12:32             ` Jens Axboe
@ 2007-12-10 13:04               ` Daniel Phillips
  2007-12-10 13:19                 ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel Phillips @ 2007-12-10 13:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 365 bytes --]

On Monday 10 December 2007 04:32, Jens Axboe wrote:
> I honestly don't know how to make this any clearer than I already did
> above.

Sure you do: you could cut out the rhetoric and save lots of bandwidth 
thereby.

Yes, the q = bdev_get_queue(bio->bi_bdev) needs to be repeated inside 
the submission loop, that was a flaw, thanks for the catch.

Regards,

Daniel

[-- Attachment #2: bio.throttle-2.6.24-rc3-mm --]
[-- Type: text/x-diff, Size: 4217 bytes --]

--- 2.6.24-rc3-mm.clean/block/ll_rw_blk.c	2007-12-04 14:45:25.000000000 -0800
+++ 2.6.24-rc3-mm/block/ll_rw_blk.c	2007-12-10 04:49:56.000000000 -0800
@@ -3210,9 +3210,9 @@ static inline int bio_check_eod(struct b
  */
 static inline void __generic_make_request(struct bio *bio)
 {
-	struct request_queue *q;
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	sector_t old_sector;
-	int ret, nr_sectors = bio_sectors(bio);
+	int nr_sectors = bio_sectors(bio);
 	dev_t old_dev;
 	int err = -EIO;
 
@@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
 	if (bio_check_eod(bio, nr_sectors))
 		goto end_io;
 
+	if (q && q->metric && !bio->bi_queue) {
+		int need = bio->bi_throttle = q->metric(bio);
+		bio->bi_queue = q;
+		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
+		wait_event(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
 	/*
 	 * Resolve the mapping until finished. (drivers are
 	 * still free to implement/resolve their own stacking
@@ -3231,10 +3238,9 @@ static inline void __generic_make_reques
 	 */
 	old_sector = -1;
 	old_dev = 0;
-	do {
+	while (1) {
 		char b[BDEVNAME_SIZE];
 
-		q = bdev_get_queue(bio->bi_bdev);
 		if (!q) {
 			printk(KERN_ERR
 			       "generic_make_request: Trying to access "
@@ -3282,8 +3288,10 @@ end_io:
 			goto end_io;
 		}
 
-		ret = q->make_request_fn(q, bio);
-	} while (ret);
+		if (!q->make_request_fn(q, bio))
+			return;
+		q = bdev_get_queue(bio->bi_bdev);
+	}
 }
 
 /*
--- 2.6.24-rc3-mm.clean/drivers/md/dm.c	2007-12-04 14:46:04.000000000 -0800
+++ 2.6.24-rc3-mm/drivers/md/dm.c	2007-12-04 23:31:41.000000000 -0800
@@ -889,6 +889,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -967,6 +972,7 @@ out:
 
 static struct block_device_operations dm_blk_dops;
 
+#define DEFAULT_THROTTLE_CAPACITY 1000
 /*
  * Allocate and initialise a blank device with a given minor.
  */
@@ -1009,6 +1015,11 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	/* A dm device constructor may change the throttle capacity */
+	atomic_set(&md->queue->available, md->queue->capacity = DEFAULT_THROTTLE_CAPACITY);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.24-rc3-mm.clean/fs/bio.c	2007-12-04 14:38:47.000000000 -0800
+++ 2.6.24-rc3-mm/fs/bio.c	2007-12-04 23:31:41.000000000 -0800
@@ -1007,6 +1007,13 @@ void bio_endio(struct bio *bio, int erro
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
 		error = -EIO;
 
+	if (bio->bi_throttle) {
+		struct request_queue *q = bio->bi_queue;
+		atomic_add(bio->bi_throttle, &q->available);
+		bio->bi_throttle = 0; /* or detect multiple endio and err? */
+		wake_up(&q->throttle_wait);
+	}
+
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio, error);
 }
--- 2.6.24-rc3-mm.clean/include/linux/bio.h	2007-12-04 14:39:31.000000000 -0800
+++ 2.6.24-rc3-mm/include/linux/bio.h	2007-12-04 23:31:41.000000000 -0800
@@ -111,6 +111,9 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 	atomic_t		bi_cnt;		/* pin count */
 
+	struct request_queue	*bi_queue;	/* for throttling */
+	unsigned		bi_throttle;	/* throttle metric */
+
 	void			*bi_private;
 
 	bio_destructor_t	*bi_destructor;	/* destructor */
--- 2.6.24-rc3-mm.clean/include/linux/blkdev.h	2007-12-04 14:47:18.000000000 -0800
+++ 2.6.24-rc3-mm/include/linux/blkdev.h	2007-12-04 23:31:41.000000000 -0800
@@ -383,6 +383,10 @@ struct request_queue
 	struct work_struct	unplug_work;
 
 	struct backing_dev_info	backing_dev_info;
+	unsigned (*metric)(struct bio *bio);	/* bio throttle metric */
+	wait_queue_head_t	throttle_wait;
+	atomic_t		available;
+	unsigned		capacity;
 
 	/*
 	 * The queue owner gets to use this for whatever they like.

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 13:04               ` Daniel Phillips
@ 2007-12-10 13:19                 ` Jens Axboe
  2007-12-10 13:26                   ` Daniel Phillips
  0 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2007-12-10 13:19 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Mon, Dec 10 2007, Daniel Phillips wrote:
> On Monday 10 December 2007 04:32, Jens Axboe wrote:
> > I honestly don't know how to make this any clearer than I already did
> > above.
> 
> Sure you do, you could cut out the rhetoric and save lots of bandwidth 
> thereby.

I spent three mails explaining it as clearly as I could. So you're welcome
for the review and the reminder of why it's impossible to have a normal
conversation with you.

> Yes, the q = bdev_get_queue(bio->bi_bdev) needs to be repeated inside 
> the submission loop, that was a flaw, thanks for the catch.

Precisely. So forgive me for thinking this patch hasn't seen very varied
testing, that's 2 errors (one simple, one bad - broken was NOT a gross
exaggeration, thanks) in very few lines.

> 
> Regards,
> 
> Daniel

> --- 2.6.24-rc3-mm.clean/block/ll_rw_blk.c	2007-12-04 14:45:25.000000000 -0800
> +++ 2.6.24-rc3-mm/block/ll_rw_blk.c	2007-12-10 04:49:56.000000000 -0800
> @@ -3210,9 +3210,9 @@ static inline int bio_check_eod(struct b
>   */
>  static inline void __generic_make_request(struct bio *bio)
>  {
> -	struct request_queue *q;
> +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  	sector_t old_sector;
> -	int ret, nr_sectors = bio_sectors(bio);
> +	int nr_sectors = bio_sectors(bio);
>  	dev_t old_dev;
>  	int err = -EIO;
>  
> @@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
>  	if (bio_check_eod(bio, nr_sectors))
>  		goto end_io;
>  
> +	if (q && q->metric && !bio->bi_queue) {
> +		int need = bio->bi_throttle = q->metric(bio);
> +		bio->bi_queue = q;
> +		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
> +		wait_event(q->throttle_wait, atomic_read(&q->available) >= need);
> +		atomic_sub(need, &q->available);
> +	}
>  	/*
>  	 * Resolve the mapping until finished. (drivers are
>  	 * still free to implement/resolve their own stacking
> @@ -3231,10 +3238,9 @@ static inline void __generic_make_reques
>  	 */
>  	old_sector = -1;
>  	old_dev = 0;
> -	do {
> +	while (1) {
>  		char b[BDEVNAME_SIZE];
>  
> -		q = bdev_get_queue(bio->bi_bdev);
>  		if (!q) {
>  			printk(KERN_ERR
>  			       "generic_make_request: Trying to access "
> @@ -3282,8 +3288,10 @@ end_io:
>  			goto end_io;
>  		}
>  
> -		ret = q->make_request_fn(q, bio);
> -	} while (ret);
> +		if (!q->make_request_fn(q, bio))
> +			return;
> +		q = bdev_get_queue(bio->bi_bdev);
> +	}

break here please.

> --- 2.6.24-rc3-mm.clean/include/linux/bio.h	2007-12-04 14:39:31.000000000 -0800
> +++ 2.6.24-rc3-mm/include/linux/bio.h	2007-12-04 23:31:41.000000000 -0800
> @@ -111,6 +111,9 @@ struct bio {
>  	bio_end_io_t		*bi_end_io;
>  	atomic_t		bi_cnt;		/* pin count */
>  
> +	struct request_queue	*bi_queue;	/* for throttling */
> +	unsigned		bi_throttle;	/* throttle metric */
> +

I still wish there was a way around this, you are bloating the bio by
about 15% (yeah I know you rambled on about this, but still). Better
placement would help, so there's still low hanging fruit available.

-- 
Jens Axboe


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 13:19                 ` Jens Axboe
@ 2007-12-10 13:26                   ` Daniel Phillips
  2007-12-10 13:30                     ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel Phillips @ 2007-12-10 13:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Monday 10 December 2007 05:19, Jens Axboe wrote:
> Precisely. So forgive me for thinking this patch hasn't seen very
> varied testing, that's 2 errors (one simple, one bad - broken was NOT
> a gross exaggeration, thanks) in very few lines.

See the [RFC]?  If I had meant Request For Flaming, I would have written 
that.  Thank you for the catch.

Regards,

Daniel

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 13:26                   ` Daniel Phillips
@ 2007-12-10 13:30                     ` Jens Axboe
  2007-12-10 13:43                       ` Daniel Phillips
  0 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2007-12-10 13:30 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Mon, Dec 10 2007, Daniel Phillips wrote:
> On Monday 10 December 2007 05:19, Jens Axboe wrote:
> > Precisely. So forgive me for thinking this patch hasn't seen very
> > varied testing, that's 2 errors (one simple, one bad - broken was NOT
> > a gross exaggeration, thanks) in very few lines.
> 
> See the [RFC]?  If I had meant Request For Flaming, I would have written 
> that.  Thank you for the catch.

The same mail that contained this part, copied verbatim:

"Let me close with perhaps the most relevant remarks: the attached code
 has been in heavy testing and in production for months now.  Thus there
 is nothing theoretical when I say it works, and the patch speaks for
 itself in terms of obvious correctness."

We must have differing opinions on what obvious correctness is.

Future replies to /dev/null, please.

-- 
Jens Axboe


* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 13:30                     ` Jens Axboe
@ 2007-12-10 13:43                       ` Daniel Phillips
  2007-12-10 13:53                         ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel Phillips @ 2007-12-10 13:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Monday 10 December 2007 05:30, Jens Axboe wrote:
> On Mon, Dec 10 2007, Daniel Phillips wrote:
> "Let me close with perhaps the most relevant remarks: the attached
> code has been in heavy testing and in production for months now. 
> Thus there is nothing theoretical when I say it works, and the patch
> speaks for itself in terms of obvious correctness."

That is quite correct: even without the redirect the code passed all our 
tests.  Remember, we were testing for deadlock, not every possible 
block IO configuration.

> We must have differing opinions on what obvious correctness is.

Yes we do.  You appear to have missed the plot entirely.  I suppose I 
should remind you: this is about deadlock in _your_ subsystem that has 
been creating bug reports for years.  Block writeout deadlock.  Caused 
by a deficiency in _your_ subsystem.

Got a plan?  Or does endless, pointless flaming feel more like progress 
to you?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 13:43                       ` Daniel Phillips
@ 2007-12-10 13:53                         ` Jens Axboe
  2007-12-10 14:17                           ` Daniel Phillips
  0 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2007-12-10 13:53 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Mon, Dec 10 2007, Daniel Phillips wrote:
> On Monday 10 December 2007 05:30, Jens Axboe wrote:
> > On Mon, Dec 10 2007, Daniel Phillips wrote:
> > "Let me close with perhaps the most relevant remarks: the attached
> > code has been in heavy testing and in production for months now. 
> > Thus there is nothing theoretical when I say it works, and the patch
> > speaks for itself in terms of obvious correctness."
> 
> That is quite correct, even without the redirect the code passed all our 
> tests.  Remember, we were testing for deadlock, not every possible 
> block IO configuration.

And round we go.

> > We must have differing opinions on what obvious correctness is.
> 
> Yes we do.  You appear to have missed the plot entirely.  I suppose I 
> should remind you: this is about deadlock in _your_ subsystem that has 
> been creating bug reports for years.  Block writeout deadlock.  Caused 
> by a deficiency in _your_ subsystem.

And I may remind you that I have participated in this discussion before
and made my points clear there. I suppose I should remind you how the
development process works? Just because I happen to maintain some piece
of code does not mean I'm under some sort of contractual obligation to
fix and write new code for users. I'll happily review patches and
integrate stuff I agree with, as I have been doing for years. This bug
may seem really important to you - guess what, that's the normal nature
of a bug to the user who runs into it.

> Got a plan?  Or does endless, pointless flaming feel more like progress 
> to you?

Please, I'm not flaming you. I reviewed your code and pointed out
errors, which was followed by lots of hand waving on your side instead
of just sitting down and reading/fixing the bug.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 13:53                         ` Jens Axboe
@ 2007-12-10 14:17                           ` Daniel Phillips
  2007-12-11 13:15                             ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel Phillips @ 2007-12-10 14:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Monday 10 December 2007 05:53, Jens Axboe wrote:
> > Got a plan?  Or does endless, pointless flaming feel more like
> > progress to you?
>
> Please, I'm not flaming you. I reviewed your code and pointed out
> errors, which was followed by lots of hand waving on your side
> instead of just sitting down and reading/fixing the bug.

Well that is indeed more civil language, if somewhat revisionist, since 
I distinctly remember being flamed by you from the word go.  Once 
again, thank you for the catch.  A fairly trivial oversight that would 
have been caught sooner or later.  As for the typedef thing, that was 
just a spelling flame, admit it.

Truly, the way you were yelling I thought you had picked up a 
fundamental flaw instead of a simple misplaced line of code.

Now about that block writeout deadlock... it doesn't just affect my 
code, it basically breaks Linux as a storage platform, among other 
things.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-06  0:03 [RFC] [PATCH] A clean approach to writeout throttling Daniel Phillips
  2007-12-06  1:24 ` Andrew Morton
  2007-12-10 10:47 ` Jens Axboe
@ 2007-12-10 21:31 ` Jonathan Corbet
  2007-12-10 22:06   ` Pekka Enberg
  2007-12-11  4:21   ` Daniel Phillips
  2 siblings, 2 replies; 40+ messages in thread
From: Jonathan Corbet @ 2007-12-10 21:31 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrew Morton, Peter Zijlstra, linux-kernel

Hey, Daniel,

I'm just getting around to looking at this.  One thing jumped out at me:

> +	if (bio->bi_throttle) {
> +		struct request_queue *q = bio->bi_queue;
> +		bio->bi_throttle = 0; /* or detect multiple endio and err? */
> +		atomic_add(bio->bi_throttle, &q->available);
> +		wake_up(&q->throttle_wait);
> +	}

I'm feeling like I must be really dumb, but...how can that possibly
work?  You're zeroing ->bi_throttle before adding it back into
q->available, so the latter will never increase...
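[The ordering problem Jon points out is easy to reproduce in a few lines of plain C. The struct and function names below are made up for illustration; this is a toy model, not kernel code.]

```c
#include <assert.h>

/* Toy model of the mis-ordered release: the struct and function
 * names are invented for illustration, this is not kernel code. */
struct mbio { unsigned bi_throttle; };
struct mqueue { int available; };

static void endio_broken(struct mbio *bio, struct mqueue *q)
{
	bio->bi_throttle = 0;			/* zeroed first... */
	q->available += bio->bi_throttle;	/* ...so this always adds 0 */
}

static void endio_fixed(struct mbio *bio, struct mqueue *q)
{
	q->available += bio->bi_throttle;	/* return the units first */
	bio->bi_throttle = 0;			/* then clear the reservation */
}
```

[With the broken ordering, q->available never rises and every waiter on the throttle eventually hangs; swapping the two statements restores the accounting.]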

jon

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 21:31 ` Jonathan Corbet
@ 2007-12-10 22:06   ` Pekka Enberg
  2007-12-11  4:21   ` Daniel Phillips
  1 sibling, 0 replies; 40+ messages in thread
From: Pekka Enberg @ 2007-12-10 22:06 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Daniel Phillips, Andrew Morton, Peter Zijlstra, linux-kernel

Hi,

On Dec 10, 2007 11:31 PM, Jonathan Corbet <corbet@lwn.net> wrote:
> I'm just getting around to looking at this.  One thing jumped out at me:
>
> > +     if (bio->bi_throttle) {
> > +             struct request_queue *q = bio->bi_queue;
> > +             bio->bi_throttle = 0; /* or detect multiple endio and err? */
> > +             atomic_add(bio->bi_throttle, &q->available);
> > +             wake_up(&q->throttle_wait);
> > +     }
>
> I'm feeling like I must be really dumb, but...how can that possibly
> work?  You're zeroing ->bi_throttle before adding it back into
> q->available, so the latter will never increase...

Heh, well, that's ok as long as bio->bi_vcnt is set to zero and I think
some md raid drivers do just that... ;-)

                                Pekka

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 21:31 ` Jonathan Corbet
  2007-12-10 22:06   ` Pekka Enberg
@ 2007-12-11  4:21   ` Daniel Phillips
  1 sibling, 0 replies; 40+ messages in thread
From: Daniel Phillips @ 2007-12-11  4:21 UTC (permalink / raw)
  To: Jonathan Corbet; +Cc: Andrew Morton, Peter Zijlstra, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1560 bytes --]

On Monday 10 December 2007 13:31, Jonathan Corbet wrote:
> Hey, Daniel,
>
> I'm just getting around to looking at this.  One thing jumped out at me:
> > +	if (bio->bi_throttle) {
> > +		struct request_queue *q = bio->bi_queue;
> > +		bio->bi_throttle = 0; /* or detect multiple endio and err? */
> > +		atomic_add(bio->bi_throttle, &q->available);
> > +		wake_up(&q->throttle_wait);
> > +	}
>
> I'm feeling like I must be really dumb, but...how can that possibly
> work?  You're zeroing ->bi_throttle before adding it back into
> q->available, so the latter will never increase...

Hi Jon,

Don't you know?  These days we optimize all our code for modern
processors with tunnelling instructions and metaphysical cache.
On such processors, setting a register to zero does not entirely
destroy all the data that used to be in the register, so subsequent
instructions can make further use of the overwritten data by
reconstructing it from remnants of bits left attached to the edges of
the register.

Um, yeah, that's it.

Actually, I fat-fingered it in the merge to -mm.  Thanks for the catch,
corrected patch attached.

The offending line isn't even a functional part of the algorithm, it is
just supposed to defend against the possibility that, somehow,
bio_endio() gets called multiple times.  Probably it should really be
something like:

		BUG_ON(bio->bi_throttle == -1);
		if (bio->bi_throttle) {
			...
			bio->bi_throttle = -1;

Or perhaps we should just rely on nobody ever making that mistake
and let somebody else catch it if it does.
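[Filled out, the guard Daniel sketches might look like this in a userspace model. The names here are hypothetical and BUG_ON is stood in by assert(); -1 stored in the unsigned field acts as a "already released" sentinel.]

```c
#include <assert.h>

/* Userspace model of the double-completion guard sketched above.
 * Names are illustrative; BUG_ON is modeled with assert(). */
#define BUG_ON(cond) assert(!(cond))
#define THROTTLE_RELEASED ((unsigned)-1)	/* sentinel: already released */

struct tbio { unsigned bi_throttle; };
struct tqueue { int available; };

static void throttle_release(struct tbio *bio, struct tqueue *q)
{
	BUG_ON(bio->bi_throttle == THROTTLE_RELEASED);	/* double endio */
	if (bio->bi_throttle) {
		q->available += bio->bi_throttle;	/* release first */
		bio->bi_throttle = THROTTLE_RELEASED;	/* then poison */
		/* wake_up(&q->throttle_wait) in the real patch */
	}
}
```

[The sentinel turns a silent double release into a loud failure on the second completion.]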

Regards,

Daniel

[-- Attachment #2: bio.throttle-2.6.24-rc3-mm --]
[-- Type: text/x-diff, Size: 4217 bytes --]

--- 2.6.24-rc3-mm.clean/block/ll_rw_blk.c	2007-12-04 14:45:25.000000000 -0800
+++ 2.6.24-rc3-mm/block/ll_rw_blk.c	2007-12-10 04:49:56.000000000 -0800
@@ -3210,9 +3210,9 @@ static inline int bio_check_eod(struct b
  */
 static inline void __generic_make_request(struct bio *bio)
 {
-	struct request_queue *q;
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	sector_t old_sector;
-	int ret, nr_sectors = bio_sectors(bio);
+	int nr_sectors = bio_sectors(bio);
 	dev_t old_dev;
 	int err = -EIO;
 
@@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
 	if (bio_check_eod(bio, nr_sectors))
 		goto end_io;
 
+	if (q && q->metric && !bio->bi_queue) {
+		int need = bio->bi_throttle = q->metric(bio);
+		bio->bi_queue = q;
+		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
+		wait_event(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
 	/*
 	 * Resolve the mapping until finished. (drivers are
 	 * still free to implement/resolve their own stacking
@@ -3231,10 +3238,9 @@ static inline void __generic_make_reques
 	 */
 	old_sector = -1;
 	old_dev = 0;
-	do {
+	while (1) {
 		char b[BDEVNAME_SIZE];
 
-		q = bdev_get_queue(bio->bi_bdev);
 		if (!q) {
 			printk(KERN_ERR
 			       "generic_make_request: Trying to access "
@@ -3282,8 +3288,10 @@ end_io:
 			goto end_io;
 		}
 
-		ret = q->make_request_fn(q, bio);
-	} while (ret);
+		if (!q->make_request_fn(q, bio))
+			return;
+		q = bdev_get_queue(bio->bi_bdev);
+	}
 }
 
 /*
--- 2.6.24-rc3-mm.clean/drivers/md/dm.c	2007-12-04 14:46:04.000000000 -0800
+++ 2.6.24-rc3-mm/drivers/md/dm.c	2007-12-04 23:31:41.000000000 -0800
@@ -889,6 +889,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -967,6 +972,7 @@ out:
 
 static struct block_device_operations dm_blk_dops;
 
+#define DEFAULT_THROTTLE_CAPACITY 1000
 /*
  * Allocate and initialise a blank device with a given minor.
  */
@@ -1009,6 +1015,11 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	/* A dm device constructor may change the throttle capacity */
+	atomic_set(&md->queue->available, md->queue->capacity = DEFAULT_THROTTLE_CAPACITY);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.24-rc3-mm.clean/fs/bio.c	2007-12-04 14:38:47.000000000 -0800
+++ 2.6.24-rc3-mm/fs/bio.c	2007-12-04 23:31:41.000000000 -0800
@@ -1007,6 +1007,13 @@ void bio_endio(struct bio *bio, int erro
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
 		error = -EIO;
 
+	if (bio->bi_throttle) {
+		struct request_queue *q = bio->bi_queue;
+		atomic_add(bio->bi_throttle, &q->available);
+		bio->bi_throttle = 0; /* or detect multiple endio and err? */
+		wake_up(&q->throttle_wait);
+	}
+
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio, error);
 }
--- 2.6.24-rc3-mm.clean/include/linux/bio.h	2007-12-04 14:39:31.000000000 -0800
+++ 2.6.24-rc3-mm/include/linux/bio.h	2007-12-04 23:31:41.000000000 -0800
@@ -111,6 +111,9 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 	atomic_t		bi_cnt;		/* pin count */
 
+	struct request_queue	*bi_queue;	/* for throttling */
+	unsigned		bi_throttle;	/* throttle metric */
+
 	void			*bi_private;
 
 	bio_destructor_t	*bi_destructor;	/* destructor */
--- 2.6.24-rc3-mm.clean/include/linux/blkdev.h	2007-12-04 14:47:18.000000000 -0800
+++ 2.6.24-rc3-mm/include/linux/blkdev.h	2007-12-04 23:31:41.000000000 -0800
@@ -383,6 +383,10 @@ struct request_queue
 	struct work_struct	unplug_work;
 
 	struct backing_dev_info	backing_dev_info;
+	unsigned (*metric)(struct bio *bio);	/* bio throttle metric */
+	wait_queue_head_t	throttle_wait;
+	atomic_t		available;
+	unsigned		capacity;
 
 	/*
 	 * The queue owner gets to use this for whatever they like.

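[Stripped of the kernel plumbing, the accounting in the corrected patch above can be modeled in a few lines of single-threaded C. The sim_* names are invented; metric() mirrors dm_metric from the patch, and the waitqueue is modeled by refusing to submit instead of sleeping.]

```c
#include <assert.h>

/* Single-threaded model of the corrected patch's accounting: submit
 * reserves metric(bio) units from q->available, completion returns
 * them.  Illustrative only; the real patch uses atomic_t and a
 * waitqueue rather than a plain int and an error return. */
enum { THROTTLE_CAPACITY = 1000 };	/* DEFAULT_THROTTLE_CAPACITY */

struct sim_bio { unsigned bi_vcnt; unsigned bi_throttle; };
struct sim_queue { int available; };

static unsigned metric(struct sim_bio *bio)
{
	return bio->bi_vcnt;		/* same metric as dm_metric() */
}

/* Returns 0 on success, -1 where wait_event() would block. */
static int submit(struct sim_queue *q, struct sim_bio *bio)
{
	unsigned need = bio->bi_throttle = metric(bio);

	if (q->available < (int)need)
		return -1;		/* would sleep on throttle_wait */
	q->available -= need;
	return 0;
}

static void complete(struct sim_queue *q, struct sim_bio *bio)
{
	q->available += bio->bi_throttle;	/* release before zeroing */
	bio->bi_throttle = 0;
	/* wake_up(&q->throttle_wait) in the real patch */
}
```

[Every unit taken at submission comes back at completion, so q->available returns to capacity when the queue drains; a bio whose metric exceeds the remaining units is held back instead of going in flight.]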
^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-10 14:17                           ` Daniel Phillips
@ 2007-12-11 13:15                             ` Jens Axboe
  2007-12-11 19:38                               ` Daniel Phillips
  2007-12-11 20:07                               ` Daniel Phillips
  0 siblings, 2 replies; 40+ messages in thread
From: Jens Axboe @ 2007-12-11 13:15 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Mon, Dec 10 2007, Daniel Phillips wrote:
> Now about that block writeout deadlock... it doesn't just affect my 
> code, it basically breaks Linux as a storage platform, among other 
> things.

As written in other similar threads in the past in which you also
participated, I'm still of the opinion that this is a vm issue and should
be solved as such.

As to the patch in question "fixing" it in the block layer, it's a
fairly simple work around and I'm not totally against it. If you get rid
of the ->bi_throttle stuff and just do sanity checks on the count, then
we could look at getting some testing done.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-11 13:15                             ` Jens Axboe
@ 2007-12-11 19:38                               ` Daniel Phillips
  2007-12-11 20:01                                 ` Jens Axboe
  2007-12-11 20:07                               ` Daniel Phillips
  1 sibling, 1 reply; 40+ messages in thread
From: Daniel Phillips @ 2007-12-11 19:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Tuesday 11 December 2007 05:15, Jens Axboe wrote:
> On Mon, Dec 10 2007, Daniel Phillips wrote:
> > Now about that block writeout deadlock... it doesn't just affect my
> > code, it basically breaks Linux as a storage platform, among other
> > things.
>
> As written in other similar threads in the past in which you also
> participated, I'm still of the opinion that this is a vm issue and
> should be solved as such.

The problem is solved.  The main cornerstone of the solution is
bio throttling, simply because the resources in question are consumed by 
bio transactions.

> As to the patch in question "fixing" it in the block layer, it's a
> fairly simple work around and I'm not totally against it. If you get
> rid of the ->bi_throttle stuff and just do sanity checks on the
> count, then we could look at getting some testing done.

Testing is already progressing fine without you, thank you.  If you do 
want to participate, welcome, otherwise it is not a problem.  Thanks 
for picking up that bug yesterday.

Daniel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-11 19:38                               ` Daniel Phillips
@ 2007-12-11 20:01                                 ` Jens Axboe
  2007-12-11 20:11                                   ` Daniel Phillips
  0 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2007-12-11 20:01 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Tue, Dec 11 2007, Daniel Phillips wrote:
> On Tuesday 11 December 2007 05:15, Jens Axboe wrote:
> > On Mon, Dec 10 2007, Daniel Phillips wrote:
> > > Now about that block writeout deadlock... it doesn't just affect my
> > > code, it basically breaks Linux as a storage platform, among other
> > > things.
> >
> > As written in other similar threads in the past in which you also
> > participated, I'm still of the opinion that this is a vm issue and
> > should be solved as such.
> 
> The problem is solved.  The main cornerstone of the solution is
> bio throttling, simply because the resources in question are consumed by 
> bio transactions.

... because too much is pushed out. This isn't a mathematics problem,
there's more than one solution to this problem. Throttling the bio count
is one.

> > As to the patch in question "fixing" it in the block layer, it's a
> > fairly simple work around and I'm not totally against it. If you get
> > rid of the ->bi_throttle stuff and just do sanity checks on the
> > count, then we could look at getting some testing done.
> 
> Testing is already progressing fine without you, thankyou.  If you do 
> want to participate, welcome, otherwise it is not a problem.  Thanks 
> for picking up that bug yesterday.

Here we go again, thanks for picking up your jerky attitude again. I'm
trying to suggest a way to get the patch in a state to be included, but
apparently you are not interested. With 3 bugs so far exposed in your
really short patch, seems you should take all the testing help you can
get.

For what it's worth, your behind-closed-doors testing is worth basically
nothing. It's already been shown that your test coverage wasn't very
wide, if this patch/idea is to have any hope of proceeding further you
need user testing. Period.

Stop cc'ing or replying, what little interest I had is totally gone.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-11 13:15                             ` Jens Axboe
  2007-12-11 19:38                               ` Daniel Phillips
@ 2007-12-11 20:07                               ` Daniel Phillips
  1 sibling, 0 replies; 40+ messages in thread
From: Daniel Phillips @ 2007-12-11 20:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

On Tuesday 11 December 2007 05:15, Jens Axboe wrote:
> As to the patch in question "fixing" it in the block layer, it's a
> fairly simple work around and I'm not totally against it. If you get
> rid of the ->bi_throttle stuff and just do sanity checks on the
> count, then we could look at getting some testing done.

Oh, sorry I missed that olive branch on first read.  Getting rid of 
those 8 bytes that bother you requires an extensive rethink of bio 
handling, in order to make some fields that are not now constant either 
constant or at least restored on change.  Which would be a good thing 
in itself.  There are lots of good improvements that can be made to 
this subsystem along those lines.

But that is properly a separate project.  Quite some time will be needed 
to get it right, and should I mention it, everybody needs to be on the 
same page or the work will never start.  It is therefore a theoretical 
solution.  We have a practical, tested solution, here and now, and it 
is short enough to be understood, unlike any of the previous attempts.

Your argument seems to be that adding 8 bytes to struct bio turns this 
beautiful swan into an ugly duck.  Actually, because the throttling 
reduces the number of bios in flight in a busy system, total memory use 
is reduced.  When the system is not busy, there are few bios hanging 
around so that is not a problem either.  Nice, hmm?

Daniel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] [PATCH] A clean approach to writeout throttling
  2007-12-11 20:01                                 ` Jens Axboe
@ 2007-12-11 20:11                                   ` Daniel Phillips
  0 siblings, 0 replies; 40+ messages in thread
From: Daniel Phillips @ 2007-12-11 20:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Peter Zijlstra

On Tuesday 11 December 2007 12:01, Jens Axboe wrote:
> On Tue, Dec 11 2007, Daniel Phillips wrote:
> > The problem is solved.  The main cornerstone of the solution is
> > bio throttling, simply because the resources in question are
> > consumed by bio transactions.
>
> ... because too much is pushed out. This isn't a mathematics problem,
> there's more than one solution to this problem. Throttling the bio
> count is one.

And nobody has been able to find another.  Funny that.  In fact, every 
solution proposed so far has implicitly required the writeout traffic 
to be throttled, even if that throttling was not part of the patch.  
Without throttling, deadlock.  Simple as that.

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2007-12-11 20:11 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-12-06  0:03 [RFC] [PATCH] A clean approach to writeout throttling Daniel Phillips
2007-12-06  1:24 ` Andrew Morton
2007-12-06  6:21   ` Daniel Phillips
2007-12-06  7:31     ` Andrew Morton
2007-12-06  9:48       ` Daniel Phillips
2007-12-06 11:55         ` Andrew Morton
2007-12-06 15:52           ` Rik van Riel
2007-12-06 17:34             ` Andrew Morton
2007-12-06 17:48               ` Rik van Riel
2007-12-06 20:04           ` Daniel Phillips
2007-12-06 20:27             ` Andrew Morton
2007-12-06 21:27               ` Daniel Phillips
2007-12-06 21:53     ` Bill Davidsen
2007-12-07  0:04       ` Daniel Phillips
2007-12-07  0:29         ` Andrew Morton
2007-12-07  7:13           ` Daniel Phillips
2007-12-10  9:20             ` Daniel Phillips
2007-12-10 10:47 ` Jens Axboe
2007-12-10 11:23   ` [RFC] [PATCH] A clean approach " Daniel Phillips
2007-12-10 11:41     ` Jens Axboe
2007-12-10 12:13       ` Daniel Phillips
2007-12-10 12:16         ` Jens Axboe
2007-12-10 12:27           ` Daniel Phillips
2007-12-10 12:32             ` Jens Axboe
2007-12-10 13:04               ` Daniel Phillips
2007-12-10 13:19                 ` Jens Axboe
2007-12-10 13:26                   ` Daniel Phillips
2007-12-10 13:30                     ` Jens Axboe
2007-12-10 13:43                       ` Daniel Phillips
2007-12-10 13:53                         ` Jens Axboe
2007-12-10 14:17                           ` Daniel Phillips
2007-12-11 13:15                             ` Jens Axboe
2007-12-11 19:38                               ` Daniel Phillips
2007-12-11 20:01                                 ` Jens Axboe
2007-12-11 20:11                                   ` Daniel Phillips
2007-12-11 20:07                               ` Daniel Phillips
2007-12-10 11:33   ` [RFC] [PATCH] A clean approach " Daniel Phillips
2007-12-10 21:31 ` Jonathan Corbet
2007-12-10 22:06   ` Pekka Enberg
2007-12-11  4:21   ` Daniel Phillips

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).