linux-kernel.vger.kernel.org archive mirror
* [PATCH] scsi: fix sense_slab/bio swapping livelock
@ 2008-04-06 22:56 Hugh Dickins
  2008-04-06 23:35 ` James Bottomley
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Hugh Dickins @ 2008-04-06 22:56 UTC (permalink / raw)
  To: James Bottomley
  Cc: Linus Torvalds, Andrew Morton, FUJITA Tomonori, Jens Axboe,
	Christoph Lameter, Pekka Enberg, Peter Zijlstra,
	Rafael J. Wysocki, linux-kernel

Since 2.6.25-rc7, I've been seeing an occasional livelock on one
x86_64 machine, copying kernel trees to tmpfs, paging out to swap.

Signature: 6000 pages under writeback but never getting written;
most tasks of interest trying to reclaim, but each get_swap_bio
waiting for a bio in mempool_alloc's io_schedule_timeout(5*HZ);
every five seconds an atomic page allocation failure report from
kblockd failing to allocate a sense_buffer in __scsi_get_command.

__scsi_get_command has a (one item) free_list to protect against
this, but rc1's [SCSI] use dynamically allocated sense buffer
de25deb18016f66dcdede165d07654559bb332bc upset that slightly.
When it fails to allocate from the separate sense_slab, instead
of giving up, it must fall back to the command free_list, which
is sure to have a sense_buffer attached.

Either my earlier -rc testing missed this, or there's some recent
contributory factor.  One very significant factor is SLUB, which
merges slab caches when it can, and on 64-bit happens to merge
both bio cache and sense_slab cache into kmalloc's 128-byte cache:
so that under this swapping load, bios above are liable to gobble
up all the slots needed for scsi_cmnd sense_buffers below.

That's disturbing behaviour, and I tried a few things to fix it.
Adding a no-op constructor to the sense_slab inhibits SLUB from
merging it, and stops all the allocation failures I was seeing;
but it's rather a hack, and perhaps in different configurations
we have other caches on the swapout path which are ill-merged.

Another alternative is to revert the separate sense_slab, using
cache-line-aligned sense_buffer allocated beyond scsi_cmnd from
the one kmem_cache; but that might waste more memory, and is
only a way of diverting around the known problem.

While I don't like seeing the allocation failures, and hate the
idea of all those bios piled up above a scsi host working one by
one, it does seem to emerge fairly soon with the livelock fix.
So lacking better ideas, stick with that one clear fix for now.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 drivers/scsi/scsi.c |   22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

--- 2.6.25-rc8/drivers/scsi/scsi.c	2008-03-05 10:47:40.000000000 +0000
+++ linux/drivers/scsi/scsi.c	2008-04-05 22:23:40.000000000 +0100
@@ -181,6 +181,18 @@ struct scsi_cmnd *__scsi_get_command(str
 	cmd = kmem_cache_alloc(shost->cmd_pool->cmd_slab,
 			       gfp_mask | shost->cmd_pool->gfp_mask);
 
+	if (likely(cmd)) {
+		buf = kmem_cache_alloc(shost->cmd_pool->sense_slab,
+				       gfp_mask | shost->cmd_pool->gfp_mask);
+		if (likely(buf)) {
+			memset(cmd, 0, sizeof(*cmd));
+			cmd->sense_buffer = buf;
+		} else {
+			kmem_cache_free(shost->cmd_pool->cmd_slab, cmd);
+			cmd = NULL;
+		}
+	}
+
 	if (unlikely(!cmd)) {
 		unsigned long flags;
 
@@ -197,16 +209,6 @@ struct scsi_cmnd *__scsi_get_command(str
 			memset(cmd, 0, sizeof(*cmd));
 			cmd->sense_buffer = buf;
 		}
-	} else {
-		buf = kmem_cache_alloc(shost->cmd_pool->sense_slab,
-				       gfp_mask | shost->cmd_pool->gfp_mask);
-		if (likely(buf)) {
-			memset(cmd, 0, sizeof(*cmd));
-			cmd->sense_buffer = buf;
-		} else {
-			kmem_cache_free(shost->cmd_pool->cmd_slab, cmd);
-			cmd = NULL;
-		}
 	}
 
 	return cmd;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-06 22:56 [PATCH] scsi: fix sense_slab/bio swapping livelock Hugh Dickins
@ 2008-04-06 23:35 ` James Bottomley
  2008-04-07  1:01   ` Hugh Dickins
  2008-04-07  2:48 ` FUJITA Tomonori
  2008-04-07  5:26 ` Christoph Lameter
  2 siblings, 1 reply; 30+ messages in thread
From: James Bottomley @ 2008-04-06 23:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Andrew Morton, FUJITA Tomonori, Jens Axboe,
	Christoph Lameter, Pekka Enberg, Peter Zijlstra,
	Rafael J. Wysocki, linux-kernel

On Sun, 2008-04-06 at 23:56 +0100, Hugh Dickins wrote:
> Since 2.6.25-rc7, I've been seeing an occasional livelock on one
> x86_64 machine, copying kernel trees to tmpfs, paging out to swap.
> 
> Signature: 6000 pages under writeback but never getting written;
> most tasks of interest trying to reclaim, but each get_swap_bio
> waiting for a bio in mempool_alloc's io_schedule_timeout(5*HZ);
> every five seconds an atomic page allocation failure report from
> kblockd failing to allocate a sense_buffer in __scsi_get_command.
> 
> __scsi_get_command has a (one item) free_list to protect against
> this, but rc1's [SCSI] use dynamically allocated sense buffer
> de25deb18016f66dcdede165d07654559bb332bc upset that slightly.
> When it fails to allocate from the separate sense_slab, instead
> of giving up, it must fall back to the command free_list, which
> is sure to have a sense_buffer attached.
> 
> Either my earlier -rc testing missed this, or there's some recent
> contributory factor.  One very significant factor is SLUB, which
> merges slab caches when it can, and on 64-bit happens to merge
> both bio cache and sense_slab cache into kmalloc's 128-byte cache:
> so that under this swapping load, bios above are liable to gobble
> up all the slots needed for scsi_cmnd sense_buffers below.
> 
> That's disturbing behaviour, and I tried a few things to fix it.
> Adding a no-op constructor to the sense_slab inhibits SLUB from
> merging it, and stops all the allocation failures I was seeing;
> but it's rather a hack, and perhaps in different configurations
> we have other caches on the swapout path which are ill-merged.
> 
> Another alternative is to revert the separate sense_slab, using
> cache-line-aligned sense_buffer allocated beyond scsi_cmnd from
> the one kmem_cache; but that might waste more memory, and is
> only a way of diverting around the known problem.
> 
> While I don't like seeing the allocation failures, and hate the
> idea of all those bios piled up above a scsi host working one by
> one, it does seem to emerge fairly soon with the livelock fix.
> So lacking better ideas, stick with that one clear fix for now.
> 
> Signed-off-by: Hugh Dickins <hugh@veritas.com>

This was sort of accidentally fixed in scsi-misc by commit 

commit c5f73260b289cb974928eac05f2d84e58ddfc020
Author: James Bottomley <James.Bottomley@HansenPartnership.com>
Date:   Thu Mar 13 11:16:33 2008 -0500

    [SCSI] consolidate command allocation in a single place

Could you check that:

master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git

and see if it alleviates the problem? ... if so, we can work out which
pieces to backport.

Thanks,

James




* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-06 23:35 ` James Bottomley
@ 2008-04-07  1:01   ` Hugh Dickins
  2008-04-07 17:51     ` Hugh Dickins
  0 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2008-04-07  1:01 UTC (permalink / raw)
  To: James Bottomley
  Cc: Linus Torvalds, Andrew Morton, FUJITA Tomonori, Jens Axboe,
	Christoph Lameter, Pekka Enberg, Peter Zijlstra,
	Rafael J. Wysocki, linux-kernel

On Sun, 6 Apr 2008, James Bottomley wrote:
> 
> This was sort of accidentally fixed in scsi-misc by commit 
> 
> commit c5f73260b289cb974928eac05f2d84e58ddfc020
> Author: James Bottomley <James.Bottomley@HansenPartnership.com>
> Date:   Thu Mar 13 11:16:33 2008 -0500
> 
>     [SCSI] consolidate command allocation in a single place

Thanks, yes, that looks a good substitute to me.

> Could you check that:
> 
> master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git
> 
> and see if it alleviates the problem? ... if so, we can work out which
> pieces to backport.

Precisely that patch seems appropriate to 2.6.25-rc8-git, so I'm now
running the test with just that applied to 2.6.25-rc8 (plus cfq rcu
fix).  Not quite what you asked, but...

Strictly speaking, it'd take a couple of days to be reasonably sure
that the livelock is gone (it appeared to reproduce quicker once I
moved to -rc8 plus cfq rcu fix; but I'm not entirely convinced that
wasn't just coincidence).  But if nothing bad appears overnight,
let's assume your patch is the one to push: I'll report tomorrow.

Hugh



* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-06 22:56 [PATCH] scsi: fix sense_slab/bio swapping livelock Hugh Dickins
  2008-04-06 23:35 ` James Bottomley
@ 2008-04-07  2:48 ` FUJITA Tomonori
  2008-04-07 18:07   ` Hugh Dickins
  2008-04-07  5:26 ` Christoph Lameter
  2 siblings, 1 reply; 30+ messages in thread
From: FUJITA Tomonori @ 2008-04-07  2:48 UTC (permalink / raw)
  To: hugh
  Cc: James.Bottomley, torvalds, akpm, fujita.tomonori, jens.axboe,
	clameter, penberg, a.p.zijlstra, rjw, linux-kernel

On Sun, 6 Apr 2008 23:56:57 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:

> Since 2.6.25-rc7, I've been seeing an occasional livelock on one
> x86_64 machine, copying kernel trees to tmpfs, paging out to swap.
> 
> Signature: 6000 pages under writeback but never getting written;
> most tasks of interest trying to reclaim, but each get_swap_bio
> waiting for a bio in mempool_alloc's io_schedule_timeout(5*HZ);
> every five seconds an atomic page allocation failure report from
> kblockd failing to allocate a sense_buffer in __scsi_get_command.
> 
> __scsi_get_command has a (one item) free_list to protect against
> this, but rc1's [SCSI] use dynamically allocated sense buffer
> de25deb18016f66dcdede165d07654559bb332bc upset that slightly.
> When it fails to allocate from the separate sense_slab, instead
> of giving up, it must fall back to the command free_list, which
> is sure to have a sense_buffer attached.

Really sorry about the bug.


> Another alternative is to revert the separate sense_slab, using
> cache-line-aligned sense_buffer allocated beyond scsi_cmnd from
> the one kmem_cache; but that might waste more memory, and is
> only a way of diverting around the known problem.

Reverting the separate sense_slab is fine for now, but we need the
separation shortly anyway: we need to support a larger sense buffer (260
bytes). The current 96-byte sense buffer works for the majority of us,
so we don't want to embed a 260-byte sense buffer in the scsi_cmnd struct.


> While I don't like seeing the allocation failures, and hate the
> idea of all those bios piled up above a scsi host working one by
> one, it does seem to emerge fairly soon with the livelock fix.
> So lacking better ideas, stick with that one clear fix for now.

As you and James agreed, the patch in scsi-misc looks good to me.


Thanks for finding this bug.


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-06 22:56 [PATCH] scsi: fix sense_slab/bio swapping livelock Hugh Dickins
  2008-04-06 23:35 ` James Bottomley
  2008-04-07  2:48 ` FUJITA Tomonori
@ 2008-04-07  5:26 ` Christoph Lameter
  2008-04-07 19:40   ` Hugh Dickins
  2 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2008-04-07  5:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: James Bottomley, Linus Torvalds, Andrew Morton, FUJITA Tomonori,
	Jens Axboe, Pekka Enberg, Peter Zijlstra, Rafael J. Wysocki,
	linux-kernel

On Sun, 6 Apr 2008, Hugh Dickins wrote:

> __scsi_get_command has a (one item) free_list to protect against
> this, but rc1's [SCSI] use dynamically allocated sense buffer
> de25deb18016f66dcdede165d07654559bb332bc upset that slightly.
> When it fails to allocate from the separate sense_slab, instead
> of giving up, it must fall back to the command free_list, which
> is sure to have a sense_buffer attached.
> 
> Either my earlier -rc testing missed this, or there's some recent
> contributory factor.  One very significant factor is SLUB, which
> merges slab caches when it can, and on 64-bit happens to merge
> both bio cache and sense_slab cache into kmalloc's 128-byte cache:
> so that under this swapping load, bios above are liable to gobble
> up all the slots needed for scsi_cmnd sense_buffers below.

A reliance on free slots that the slab allocator may provide? That is a 
rather bad dependency since it is up to the slab allocator to implement
the storage layout for the objects and thus the availability of slots may
vary depending on the layout for the objects chosen by the allocator.

Looking at mempool_alloc: Mempools may be used to do atomic allocations 
until they fail, thereby exhausting reserves and available objects in the 
partial lists of slab caches?

In order to make this a significant factor we need to have already 
exhausted reserves right? Thus we are already operating at the boundary of 
what memory there is. Any non atomic alloc will then allocate a new page 
with N elements in order to get one object. The mempool_allocs from the 
atomic context will then gobble up the N-1 remaining objects? So the
nonatomic alloc will then have to hit the page allocator again...



* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07  1:01   ` Hugh Dickins
@ 2008-04-07 17:51     ` Hugh Dickins
  2008-04-07 18:04       ` James Bottomley
  0 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2008-04-07 17:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Linus Torvalds, Andrew Morton, FUJITA Tomonori, Jens Axboe,
	Christoph Lameter, Pekka Enberg, Peter Zijlstra,
	Rafael J. Wysocki, linux-kernel

On Mon, 7 Apr 2008, Hugh Dickins wrote:
> On Sun, 6 Apr 2008, James Bottomley wrote:
> > 
> > This was sort of accidentally fixed in scsi-misc by commit 
> > 
> > commit c5f73260b289cb974928eac05f2d84e58ddfc020
> > Author: James Bottomley <James.Bottomley@HansenPartnership.com>
> > Date:   Thu Mar 13 11:16:33 2008 -0500
> > 
> >     [SCSI] consolidate command allocation in a single place
> 
> Thanks, yes, that looks a good substitute to me.
> 
> > Could you check that:
> > 
> > master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git
> > 
> > and see if it alleviates the problem? ... if so, we can work out which
> > pieces to backport.
> 
> Precisely that patch seems appropriate to 2.6.25-rc8-git, so I'm now
> running the test with just that applied to 2.6.25-rc8 (plus cfq rcu
> fix).  Not quite what you asked, but...
> 
> Strictly speaking, it'd take a couple of days to be reasonably sure
> that the livelock is gone (it appeared to reproduce quicker once I
> moved to -rc8 plus cfq rcu fix; but I'm not entirely convinced that
> wasn't just coincidence).  But if nothing bad appears overnight,
> let's assume your patch is the one to push: I'll report tomorrow.

Right, as expected, that seems to run fine: I've seen no problem with
it in 17 hours - beyond, of course, all the page allocation failures
which will plague such tests until something else is changed.

Somewhat irrelevant since Linus put my patch into 2.6.25-rc8-git
(probably worth it for the various issues noted in the comment);
but reassuring for 2.6.26.

I'll stop that test now and put it to work on something else.

Hugh


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 17:51     ` Hugh Dickins
@ 2008-04-07 18:04       ` James Bottomley
  2008-04-07 18:26         ` Hugh Dickins
  0 siblings, 1 reply; 30+ messages in thread
From: James Bottomley @ 2008-04-07 18:04 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Andrew Morton, FUJITA Tomonori, Jens Axboe,
	Christoph Lameter, Pekka Enberg, Peter Zijlstra,
	Rafael J. Wysocki, linux-kernel

On Mon, 2008-04-07 at 18:51 +0100, Hugh Dickins wrote:
> On Mon, 7 Apr 2008, Hugh Dickins wrote:
> > On Sun, 6 Apr 2008, James Bottomley wrote:
> > > 
> > > This was sort of accidentally fixed in scsi-misc by commit 
> > > 
> > > commit c5f73260b289cb974928eac05f2d84e58ddfc020
> > > Author: James Bottomley <James.Bottomley@HansenPartnership.com>
> > > Date:   Thu Mar 13 11:16:33 2008 -0500
> > > 
> > >     [SCSI] consolidate command allocation in a single place
> > 
> > Thanks, yes, that looks a good substitute to me.
> > 
> > > Could you check that:
> > > 
> > > master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git
> > > 
> > > and see if it alleviates the problem? ... if so, we can work out which
> > > pieces to backport.
> > 
> > Precisely that patch seems appropriate to 2.6.25-rc8-git, so I'm now
> > running the test with just that applied to 2.6.25-rc8 (plus cfq rcu
> > fix).  Not quite what you asked, but...
> > 
> > Strictly speaking, it'd take a couple of days to be reasonably sure
> > that the livelock is gone (it appeared to reproduce quicker once I
> > moved to -rc8 plus cfq rcu fix; but I'm not entirely convinced that
> > wasn't just coincidence).  But if nothing bad appears overnight,
> > let's assume your patch is the one to push: I'll report tomorrow.
> 
> Right, as expected, that seems to run fine: I've seen no problem with
> it in 17 hours - beyond, of course, all the page allocation failures
> which will plague such tests until something else is changed.
> 
> Somewhat irrelevant since Linus put my patch into 2.6.25-rc8-git
> (probably worth it for the various issues noted in the comment);
> but reassuring for 2.6.26.
> 
> I'll stop that test now and put it to work on something else.

Well ... just remember that to merge scsi-misc with what Linus has now
done, I'm effectively reversing your patch, so testing current scsi-misc
is very valuable if you want the problem not to recur in 2.6.26 ...

James




* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07  2:48 ` FUJITA Tomonori
@ 2008-04-07 18:07   ` Hugh Dickins
  2008-04-08 14:04     ` FUJITA Tomonori
  0 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2008-04-07 18:07 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: James.Bottomley, torvalds, akpm, jens.axboe, clameter, penberg,
	a.p.zijlstra, rjw, linux-kernel

On Mon, 7 Apr 2008, FUJITA Tomonori wrote:
> On Sun, 6 Apr 2008 23:56:57 +0100 (BST)
> Hugh Dickins <hugh@veritas.com> wrote:
> 
> Really sorry about the bug.

No, it's brought attention to this interesting slab merge issue;
even if in the end we decide that's a non-issue.

> > Another alternative is to revert the separate sense_slab, using
> > cache-line-aligned sense_buffer allocated beyond scsi_cmnd from
> > the one kmem_cache; but that might waste more memory, and is
> > only a way of diverting around the known problem.
> 
> Reverting the separate sense_slab is fine for now, but we need the
> separation shortly anyway: we need to support a larger sense buffer (260
> bytes). The current 96-byte sense buffer works for the majority of us,
> so we don't want to embed a 260-byte sense buffer in the scsi_cmnd struct.

I don't believe you _need_ a separate sense_slab even for that:
what I meant was that you just need something like
	pool->cmd_slab = kmem_cache_create(pool->cmd_name,
					   cache_line_align(
					   sizeof(struct scsi_cmnd)) +
					   max_scsi_sense_buffersize,
					   0, pool->slab_flags, NULL);
then point cmd->sense_buffer to (unsigned char *) cmd +
			cache_line_align(sizeof(struct scsi_cmnd));
where cache_line_align and max_scsi_sense_buffersize are preferably
determined at runtime.

Now, it may well be that over the different configurations, at least
some would waste significant memory by putting it all in the one big
buffer, and you're better off with the separate slabs: so I didn't
want to interfere with your direction on that.

Hugh


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 18:04       ` James Bottomley
@ 2008-04-07 18:26         ` Hugh Dickins
  0 siblings, 0 replies; 30+ messages in thread
From: Hugh Dickins @ 2008-04-07 18:26 UTC (permalink / raw)
  To: James Bottomley
  Cc: Linus Torvalds, Andrew Morton, FUJITA Tomonori, Jens Axboe,
	Christoph Lameter, Pekka Enberg, Peter Zijlstra,
	Rafael J. Wysocki, linux-kernel

On Mon, 7 Apr 2008, James Bottomley wrote:
> On Mon, 2008-04-07 at 18:51 +0100, Hugh Dickins wrote:
> > 
> > Somewhat irrelevant since Linus put my patch into 2.6.25-rc8-git
> > (probably worth it for the various issues noted in the comment);
> > but reassuring for 2.6.26.
> 
> Well ... just remember that to merge scsi-misc with what Linus has now
> done, I'm effectively reversing your patch, so testing current scsi-misc
> is very valuable if you want the problem not to recur in 2.6.26 ...

Exactly.

Hugh


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07  5:26 ` Christoph Lameter
@ 2008-04-07 19:40   ` Hugh Dickins
  2008-04-07 19:55     ` Peter Zijlstra
  2008-04-08 20:43     ` Christoph Lameter
  0 siblings, 2 replies; 30+ messages in thread
From: Hugh Dickins @ 2008-04-07 19:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: James Bottomley, Linus Torvalds, Andrew Morton, FUJITA Tomonori,
	Jens Axboe, Pekka Enberg, Peter Zijlstra, Rafael J. Wysocki,
	linux-kernel

On Sun, 6 Apr 2008, Christoph Lameter wrote:
> On Sun, 6 Apr 2008, Hugh Dickins wrote:
> > 
> > One very significant factor is SLUB, which
> > merges slab caches when it can, and on 64-bit happens to merge
> > both bio cache and sense_slab cache into kmalloc's 128-byte cache:
> > so that under this swapping load, bios above are liable to gobble
> > up all the slots needed for scsi_cmnd sense_buffers below.
> 
> A reliance on free slots that the slab allocator may provide? That is a 
> rather bad dependency since it is up to the slab allocator to implement
> the storage layout for the objects and thus the availability of slots may
> vary depending on the layout for the objects chosen by the allocator.

I'm not sure that I understand you.  Yes, different slab allocators
may lay out slots differently.  But a significant departure from
existing behaviour may be a bad idea in some circumstances.
(Hmm, maybe I've written a content-free sentence there!).

> 
> Looking at mempool_alloc: Mempools may be used to do atomic allocations 
> until they fail, thereby exhausting reserves and available objects in the 
> partial lists of slab caches?

Mempools may be used for atomic allocations, but I think that's not
the case here.  swap_writepage's get_swap_bio says GFP_NOIO, which
allows (indeed is) __GFP_WAIT, and does not give access to __GFP_HIGH
reserves.

Whereas at the __scsi_get_command end, there are GFP_ATOMIC sense_slab
allocations, which do give access to __GFP_HIGH reserves.

My supposition is that once a page has been allocated from __GFP_HIGH
reserves to a scsi sense_slab, swap_writepages are liable to gobble up
the rest of the page with bio allocations which they wouldn't have had
access to traditionally (i.e. under SLAB).

So an unexpected behaviour emerges from SLUB's slab merging.

Though of course the same might happen in other circumstances, even
without slab merging: if some kmem_cache allocations are made with
GFP_ATOMIC, those can give access to reserves to non-__GFP_HIGH
allocations from the same kmem_cache.

Maybe PF_MEMALLOC and __GFP_NOMEMALLOC complicate the situation:
I've given little thought to mempool_alloc's fiddling with the
gfp_mask (beyond repeatedly misreading it).

> 
> In order to make this a significant factor we need to have already 
> exhausted reserves right? Thus we are already operating at the boundary of 
> what memory there is. Any non atomic alloc will then allocate a new page 
> with N elements in order to get one object. The mempool_allocs from the 
> > atomic context will then gobble up the N-1 remaining objects? So the
> nonatomic alloc will then have to hit the page allocator again...

We need to have already exhausted reserves, yes: so this isn't an
issue hitting everyone all the time, and it may be nothing worse
than a surprising anomaly; but I'm pretty sure it's not how bio
and scsi command allocation is expected to interact.

What do you think of a SLAB_NOMERGE flag?  The last time I suggested
something like that (but I was thinking of debug), your comment
was "Ohh..", which left me in some doubt ;)

If we had a SLAB_NOMERGE flag, would we want to apply it to the
bio cache or to the scsi_sense_cache or to both?  My difficulty
in answering that makes me wonder whether such a flag is right.

Hugh


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 19:40   ` Hugh Dickins
@ 2008-04-07 19:55     ` Peter Zijlstra
  2008-04-07 20:31       ` Hugh Dickins
  2008-04-08 20:43     ` Christoph Lameter
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2008-04-07 19:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Christoph Lameter, James Bottomley, Linus Torvalds,
	Andrew Morton, FUJITA Tomonori, Jens Axboe, Pekka Enberg,
	Rafael J. Wysocki, linux-kernel

On Mon, 2008-04-07 at 20:40 +0100, Hugh Dickins wrote:
> On Sun, 6 Apr 2008, Christoph Lameter wrote:
> > On Sun, 6 Apr 2008, Hugh Dickins wrote:
> > > 
> > > One very significant factor is SLUB, which
> > > merges slab caches when it can, and on 64-bit happens to merge
> > > both bio cache and sense_slab cache into kmalloc's 128-byte cache:
> > > so that under this swapping load, bios above are liable to gobble
> > > up all the slots needed for scsi_cmnd sense_buffers below.
> > 
> > A reliance on free slots that the slab allocator may provide? That is a 
> > rather bad dependency since it is up to the slab allocator to implement
> > the storage layout for the objects and thus the availability of slots may
> > vary depending on the layout for the objects chosen by the allocator.
> 
> I'm not sure that I understand you.  Yes, different slab allocators
> may lay out slots differently.  But a significant departure from
> existing behaviour may be a bad idea in some circumstances.
> (Hmm, maybe I've written a content-free sentence there!).
> 
> > 
> > Looking at mempool_alloc: Mempools may be used to do atomic allocations 
> > until they fail, thereby exhausting reserves and available objects in the 
> > partial lists of slab caches?
> 
> Mempools may be used for atomic allocations, but I think that's not
> the case here.  swap_writepage's get_swap_bio says GFP_NOIO, which
> allows (indeed is) __GFP_WAIT, and does not give access to __GFP_HIGH
> reserves.
> 
> Whereas at the __scsi_get_command end, there are GFP_ATOMIC sense_slab
> allocations, which do give access to __GFP_HIGH reserves.
> 
> My supposition is that once a page has been allocated from __GFP_HIGH
> reserves to a scsi sense_slab, swap_writepages are liable to gobble up
> the rest of the page with bio allocations which they wouldn't have had
> access to traditionally (i.e. under SLAB).
> 
> So an unexpected behaviour emerges from SLUB's slab merging.

Somewhere along the line of my swap over network patches I made
'robustified' SLAB to ensure these sorts of things could not happen - it
came at a cost though.

It would basically fail[*] allocations that had a higher low watermark
than what was used to allocate the current slab.

[*] - well, it would attempt to allocate a new slab to raise the current
watermark, but failing that it would fail the allocation.

> Though of course the same might happen in other circumstances, even
> without slab merging: if some kmem_cache allocations are made with
> GFP_ATOMIC, those can give access to reserves to non-__GFP_HIGH
> allocations from the same kmem_cache.
> 
> Maybe PF_MEMALLOC and __GFP_NOMEMALLOC complicate the situation:
> I've given little thought to mempool_alloc's fiddling with the
> gfp_mask (beyond repeatedly misreading it).

My latest series ensures that SLABs allocated using PF_MEMALLOC will not
distribute objects to allocation contexts that are not entitled to them,
for as long as the memory shortage lasts.

I'm not sure how applicable this is to the problem at hand, just letting
you know whats there.

> > In order to make this a significant factor we need to have already 
> > exhausted reserves right? Thus we are already operating at the boundary of 
> > what memory there is. Any non atomic alloc will then allocate a new page 
> > with N elements in order to get one object. The mempool_allocs from the 
> > atomic context will then gobble up the N-1 remaining objects? So the
> > nonatomic alloc will then have to hit the page allocator again...

Relying on this is highly dubious: who is to say the first __GFP_HIGH
alloc came from the SCSI layer (there could be another merged slab)?

Also, when one of these users is PF_MEMALLOC the other users will gobble
up our emergency memory - not as intended.

> We need to have already exhausted reserves, yes: so this isn't an
> issue hitting everyone all the time, and it may be nothing worse
> than a surprising anomaly; but I'm pretty sure it's not how bio
> and scsi command allocation is expected to interact.
> 
> What do you think of a SLAB_NOMERGE flag?  The last time I suggested
> something like that (but I was thinking of debug), your comment
> was "Ohh..", which left me in some doubt ;)
> 
> If we had a SLAB_NOMERGE flag, would we want to apply it to the
> bio cache or to the scsi_sense_cache or to both?  My difficulty
> in answering that makes me wonder whether such a flag is right.

If this is critical to avoid memory deadlocks, I would suggest using
mempools (or my reserve framework).



* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 19:55     ` Peter Zijlstra
@ 2008-04-07 20:31       ` Hugh Dickins
  2008-04-07 20:47         ` Peter Zijlstra
  2008-04-07 21:00         ` Pekka Enberg
  0 siblings, 2 replies; 30+ messages in thread
From: Hugh Dickins @ 2008-04-07 20:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, James Bottomley, Linus Torvalds,
	Andrew Morton, FUJITA Tomonori, Jens Axboe, Pekka Enberg,
	Rafael J. Wysocki, linux-kernel

On Mon, 7 Apr 2008, Peter Zijlstra wrote:
> On Mon, 2008-04-07 at 20:40 +0100, Hugh Dickins wrote:
> > 
> > My supposition is that once a page has been allocated from __GFP_HIGH
> > reserves to a scsi sense_slab, swap_writepages are liable to gobble up
> > the rest of the page with bio allocations which they wouldn't have had
> > access to traditionally (i.e. under SLAB).
> > 
> > So an unexpected behaviour emerges from SLUB's slab merging.
> 
> Somewhere along the line of my swap over network patches I made
> 'robustified' SLAB to ensure these sorts of things could not happen - it
> came at a cost though.
> 
> It would basically fail[*] allocations that had a higher low watermark
> than what was used to allocate the current slab.
> 
> [*] - well, it would attempt to allocate a new slab to raise the current
> watermark, but failing that it would fail the allocation.

Thanks, Peter: that sounds just right to me; but a larger change than
we'd want to jump into for this one particular issue - it might have
its own unexpected consequences.

> > If we had a SLAB_NOMERGE flag, would we want to apply it to the
> > bio cache or to the scsi_sense_cache or to both?  My difficulty
> > in answering that makes me wonder whether such a flag is right.
> 
> If this is critical to avoid memory deadlocks, I would suggest using
> mempools (or my reserve framework).

No, the critical part of it has been dealt with (small fix to scsi
free_list handling: which resembles a mempool, but done its own way).
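The scsi free_list fallback mentioned above can be modelled in a few lines of userspace C (toy names, not the real scsi API): keep one reserved command around, sense buffer already attached, and hand it out only when the regular allocation fails.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy model of the free_list fix (illustrative names only): each host
 * keeps one reserved command; __scsi_get_command-style code falls back
 * to it when the slab allocation fails instead of giving up.
 */
struct toy_cmnd {
	unsigned char sense[96];
};

static int slab_exhausted;		/* knob simulating allocation failure */
static struct toy_cmnd reserved_cmd;
static struct toy_cmnd *free_list = &reserved_cmd;	/* one-item list */

static struct toy_cmnd *toy_get_command(void)
{
	struct toy_cmnd *cmd = NULL;

	if (!slab_exhausted)
		cmd = malloc(sizeof(*cmd));
	if (!cmd) {
		cmd = free_list;	/* fall back to the reserved command */
		free_list = NULL;
	}
	return cmd;	/* NULL only if the reserve is already in use */
}
```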

What remains is about "unsightly" behaviour, the system having a
tendency to collapse briefly into far-from-efficient operation
when out of memory.

Hugh


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 20:31       ` Hugh Dickins
@ 2008-04-07 20:47         ` Peter Zijlstra
  2008-04-07 21:00         ` Pekka Enberg
  1 sibling, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2008-04-07 20:47 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Christoph Lameter, James Bottomley, Linus Torvalds,
	Andrew Morton, FUJITA Tomonori, Jens Axboe, Pekka Enberg,
	Rafael J. Wysocki, linux-kernel

On Mon, 2008-04-07 at 21:31 +0100, Hugh Dickins wrote:
> On Mon, 7 Apr 2008, Peter Zijlstra wrote:
> > On Mon, 2008-04-07 at 20:40 +0100, Hugh Dickins wrote:
> > > 
> > > My supposition is that once a page has been allocated from __GFP_HIGH
> > > reserves to a scsi sense_slab, swap_writepages are liable to gobble up
> > > the rest of the page with bio allocations which they wouldn't have had
> > > access to traditionally (i.e. under SLAB).
> > > 
> > > So an unexpected behaviour emerges from SLUB's slab merging.
> > 
> > Somewhere along the line of my swap over network patches I
> > 'robustified' SLAB to ensure these sorts of things could not happen -
> > it came at a cost though.
> > 
> > It would basically fail[*] allocations that had a higher low watermark
> > than what was used to allocate the current slab.
> > 
> > [*] - well, it would attempt to allocate a new slab to raise the current
> > watermark, but failing that it would fail the allocation.
> 
> Thanks, Peter: that sounds just right to me; but a larger change than
> we'd want to jump into for this one particular issue - it might have
> its own unexpected consequences.

Right, but I doubt we'd ever get something like that merged -
esp. as it would basically destroy the SLUB fast-path.

SLAB allocation fairness:
  http://lkml.org/lkml/2007/1/16/61

I abandoned this approach because it was too expensive; it was reduced
to the ALLOC_NO_WATERMARKS state transition, which is much less likely
to happen, and it's generally accepted that we're in a slow path once
we dive that deep into the reserves.

The latest posting:
  http://lkml.org/lkml/2008/3/20/214




* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 20:31       ` Hugh Dickins
  2008-04-07 20:47         ` Peter Zijlstra
@ 2008-04-07 21:00         ` Pekka Enberg
  2008-04-07 21:05           ` Pekka Enberg
  1 sibling, 1 reply; 30+ messages in thread
From: Pekka Enberg @ 2008-04-07 21:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Christoph Lameter, James Bottomley,
	Linus Torvalds, Andrew Morton, FUJITA Tomonori, Jens Axboe,
	Rafael J. Wysocki, linux-kernel

Hi Hugh,

On Mon, 2008-04-07 at 20:40 +0100, Hugh Dickins wrote:
> > > If we had a SLAB_NOMERGE flag, would we want to apply it to the
> > > bio cache or to the scsi_sense_cache or to both?  My difficulty
> > > in answering that makes me wonder whether such a flag is right.

On Mon, 7 Apr 2008, Peter Zijlstra wrote:
> > If this is critical to avoid memory deadlocks, I would suggest using
> > mempools (or my reserve framework).

Hugh Dickins wrote:
> No, the critical part of it has been dealt with (small fix to scsi
> free_list handling: which resembles a mempool, but done its own way).
> 
> What remains is about "unsightly" behaviour, the system having a
> tendency to collapse briefly into far-from-efficient operation
> when out of memory.

Although you weren't convinced by my arguments, I still have 
difficulties understanding why this kind of bad behavior would be 
acceptable in an embedded environment and why we don't need to fix it 
for the SLOB case as well.

But you do bring up a good point about SLUB changing the behavior in
OOM situations, for which SLAB_NOMERGE sounds like a good-enough
stop-gap measure for the short term. I would prefer some other fix even
if it means getting rid of slab merging completely (which would suck as
it's very nice for making the memory footprint smaller).

		Pekka


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 21:00         ` Pekka Enberg
@ 2008-04-07 21:05           ` Pekka Enberg
  2008-04-07 21:15             ` Linus Torvalds
  2008-04-07 21:30             ` Hugh Dickins
  0 siblings, 2 replies; 30+ messages in thread
From: Pekka Enberg @ 2008-04-07 21:05 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Christoph Lameter, James Bottomley,
	Linus Torvalds, Andrew Morton, FUJITA Tomonori, Jens Axboe,
	Rafael J. Wysocki, linux-kernel

Pekka Enberg wrote:
> Although you weren't convinced by my arguments, I still have 
> difficulties understanding why this kind of bad behavior would be 
> acceptable in an embedded environment and why we don't need to fix it 
> for the SLOB case as well.
> 
> But you do bring up a good point about SLUB changing the behavior in
> OOM situations, for which SLAB_NOMERGE sounds like a good-enough
> stop-gap measure for the short term. I would prefer some other fix
> even if it means getting rid of slab merging completely (which would
> suck as it's very nice for making the memory footprint smaller).

I wonder if we can get away with a SLAB_IO flag that you can use to
annotate caches that participate in writeout; the allocator could then
keep some spare pages around to be handed out to them in case of
OOM...

		Pekka


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 21:05           ` Pekka Enberg
@ 2008-04-07 21:15             ` Linus Torvalds
  2008-04-07 21:34               ` Pekka Enberg
  2008-04-07 21:30             ` Hugh Dickins
  1 sibling, 1 reply; 30+ messages in thread
From: Linus Torvalds @ 2008-04-07 21:15 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Hugh Dickins, Peter Zijlstra, Christoph Lameter, James Bottomley,
	Andrew Morton, FUJITA Tomonori, Jens Axboe, Rafael J. Wysocki,
	linux-kernel



On Tue, 8 Apr 2008, Pekka Enberg wrote:
> 
> I wonder if we can get away with a SLAB_IO flag that you can use to annotate
> caches that participate in writeout and the allocator could keep some spare
> pages around that can be handed out for them in case of OOM...

Actually, I don't think this is SLAB-specific, since there are potentially 
the same issues for pages.

I suspect the right thing to do is not to mark them for "IO", but mark 
them for "short-lived", and allow short-lived allocations that don't have 
extended lifetimes to succeed even when a "real" allocation wouldn't.

When we are under memory pressure, a normal allocation generally needs to 
clear up enough memory from elsewhere to succeed. But a short-lived 
allocation without any long-term lifetimes would be known to release its 
memory back to the pool, so it can be allowed to go ahead when a normal 
memory allocation would not.

Examples of short-lived allocations:
 - IO requests
 - temporary network packets that don't get queued up (eg "ACK" packet) as 
   opposed to those that do get queued (incoming _or_ outgoing queues)
 - things like the temporary buffers for "getname()/putname()" etc.

That said, even a lot of temporary allocations can at times have issues. 
If your IO layer is dead, your IO request queues that _should_ have been 
very temporary may end up staying around for a long time. But I do think 
that it makes sense to prioritize allocations that are known to be short- 
lived.

(It might also allow us to allocate them from different pools and just
generally have other heuristics for cache behaviour - short-lived
allocations simply have different behaviour.)
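A minimal sketch of such a gate, with an invented flag name (GFP_TEMPORARY-like semantics; the numbers are arbitrary): when the free pool is down to the reserve, only allocations marked short-lived may proceed, on the grounds that they are known to give their memory back promptly.

```c
#include <assert.h>

#define TOY_GFP_SHORT_LIVED	0x1u	/* hypothetical flag */

static int free_pages = 3;
static const int reserve = 2;	/* pages held back for short-lived users */

static int toy_alloc_page(unsigned int flags)
{
	/* short-lived callers may drain the pool; others stop at the reserve */
	int min = (flags & TOY_GFP_SHORT_LIVED) ? 0 : reserve;

	if (free_pages <= min)
		return 0;	/* would eat into the reserve: refuse */
	free_pages--;
	return 1;
}
```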

			Linus


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 21:05           ` Pekka Enberg
  2008-04-07 21:15             ` Linus Torvalds
@ 2008-04-07 21:30             ` Hugh Dickins
  2008-04-07 21:36               ` Pekka Enberg
  1 sibling, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2008-04-07 21:30 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Peter Zijlstra, Christoph Lameter, James Bottomley,
	Linus Torvalds, Andrew Morton, FUJITA Tomonori, Jens Axboe,
	Rafael J. Wysocki, linux-kernel

On Tue, 8 Apr 2008, Pekka Enberg wrote:
> Pekka Enberg wrote:
> > Although you weren't convinced by my arguments, I still have difficulties
> > understanding why this kind of bad behavior would be acceptable in an
> > embedded environment and why we don't need to fix it for the SLOB case as
> > well.

If we do decide that it's seriously bad behaviour, then of course
it should be fixed in SLOB also.  I just meant that the fact that
SLOB's been merging slabs too doesn't give me great confidence that
that's the right and safe thing to do: SLOB isn't our gold standard.

> > 
> > But you do bring up a good point about SLUB changing the behavior in
> > OOM situations, for which SLAB_NOMERGE sounds like a good-enough
> > stop-gap measure for the short term. I would prefer some other fix
> > even if it means getting rid of slab merging completely (which would
> > suck as it's very nice for making the memory footprint smaller).
> 
> I wonder if we can get away with a SLAB_IO flag that you can use to annotate
> caches that participate in writeout and the allocator could keep some spare
> pages around that can be handed out for them in case of OOM...

Well, the work of keeping spares around to be handed out in case of
OOM was done in 2.5 with mempools.  They seem to be working well
(until you try swapping over network as Peter is pursuing),
I don't think we need to duplicate that.

I've the uncomfortable feeling that I'm making a big fuss over this
behaviour on the one hand, but utterly failing to justify that fuss
on the other.  It seems to come down to me wanting fewer backtraces
in my logs: I'd better find stronger reasons than that if we're to
pursue this.

Hugh


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 21:15             ` Linus Torvalds
@ 2008-04-07 21:34               ` Pekka Enberg
  2008-04-07 21:39                 ` Pekka Enberg
  0 siblings, 1 reply; 30+ messages in thread
From: Pekka Enberg @ 2008-04-07 21:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Peter Zijlstra, Christoph Lameter, James Bottomley,
	Andrew Morton, FUJITA Tomonori, Jens Axboe, Rafael J. Wysocki,
	linux-kernel

Hi Linus,

Linus Torvalds wrote:
> I suspect the right thing to do is not to mark them for "IO", but mark 
> them for "short-lived", and allow short-lived allocations that don't have 
> extended lifetimes to succeed even when a "real" allocation wouldn't.

Yeah, makes sense. We do have GFP_TEMPORARY so we could associate these
new semantics with it. But the real problem here is how to do the
"allocate harder" part which, btw, sounds very similar to what Peter's
kmalloc reserve patches try to do...

		Pekka


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 21:30             ` Hugh Dickins
@ 2008-04-07 21:36               ` Pekka Enberg
  0 siblings, 0 replies; 30+ messages in thread
From: Pekka Enberg @ 2008-04-07 21:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Christoph Lameter, James Bottomley,
	Linus Torvalds, Andrew Morton, FUJITA Tomonori, Jens Axboe,
	Rafael J. Wysocki, linux-kernel

Hugh Dickins wrote:
> I've the uncomfortable feeling that I'm making a big fuss over this
> behaviour on the one hand, but utterly failing to justify that fuss
> on the other.  It seems to come down to me wanting fewer backtraces
> in my logs: I'd better find stronger reasons than that if we're to
> pursue this.

No, I do agree that it's a problem. I just don't like the proposed 
no-merge fix ;-).

		Pekka


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 21:34               ` Pekka Enberg
@ 2008-04-07 21:39                 ` Pekka Enberg
  2008-04-07 22:05                   ` Pekka J Enberg
  0 siblings, 1 reply; 30+ messages in thread
From: Pekka Enberg @ 2008-04-07 21:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Peter Zijlstra, Christoph Lameter, James Bottomley,
	Andrew Morton, FUJITA Tomonori, Jens Axboe, Rafael J. Wysocki,
	linux-kernel

Linus Torvalds wrote:
>> I suspect the right thing to do is not to mark them for "IO", but mark 
>> them for "short-lived", and allow short-lived allocations that don't 
>> have extended lifetimes to succeed even when a "real" allocation 
>> wouldn't.

Pekka Enberg wrote:
> Yeah, makes sense. We do have GFP_TEMPORARY so we could associate
> these new semantics with it. But the real problem here is how to do
> the "allocate harder" part which, btw, sounds very similar to what
> Peter's kmalloc reserve patches try to do...

Actually, a trivial way to implement that is to have a few "emergency
kmalloc" caches, say for sizes 64, 128, 256, and 512, that have some
pre-allocated pages that these GFP_TEMPORARY allocations are allowed
to dip into on OOM, and OOM only.

		Pekka


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 21:39                 ` Pekka Enberg
@ 2008-04-07 22:05                   ` Pekka J Enberg
  2008-04-07 22:17                     ` Linus Torvalds
  0 siblings, 1 reply; 30+ messages in thread
From: Pekka J Enberg @ 2008-04-07 22:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Peter Zijlstra, Christoph Lameter, James Bottomley,
	Andrew Morton, FUJITA Tomonori, Jens Axboe, Rafael J. Wysocki,
	linux-kernel

On Tue, 8 Apr 2008, Pekka Enberg wrote:
> Actually, a trivial way to implement that is to have a few "emergency
> kmalloc" caches, say for sizes 64, 128, 256, and 512, that have some
> pre-allocated pages that these GFP_TEMPORARY allocations are allowed
> to dip into on OOM, and OOM only.

So something like the following (totally untested) patch modulo the 
pre-allocation bits.

		Pekka

diff --git a/mm/slub.c b/mm/slub.c
index acc975f..e8f7cf3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -264,6 +264,67 @@ static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
 #endif
 }
 
+/*
+ * Emergency caches are for GFP_TEMPORARY allocations to dip into when we're
+ * OOM to make sure we can make some progress for writeback.
+ */
+
+#define MIN_EMERGENCY_SIZE 32
+#define NR_EMERGENCY_CACHES 4
+
+struct kmem_cache *emergency_caches[NR_EMERGENCY_CACHES];
+
+static void init_emergency_caches(void)
+{
+	unsigned long size = MIN_EMERGENCY_SIZE;
+	int i;
+
+	for (i = 0; i < NR_EMERGENCY_CACHES; i++) {
+		struct kmem_cache *cache;
+		char *name;
+
+		name = kasprintf(GFP_KERNEL, "kmalloc-%lu", size);
+		BUG_ON(!name);
+
+		cache = kmem_cache_create(name, size, 0, 0, NULL);
+		BUG_ON(!cache);
+		kfree(name);
+
+		emergency_caches[i] = cache;
+
+		size *= 2;
+	}
+}
+
+static struct kmem_cache *lookup_emergency_cache(unsigned long size)
+{
+	unsigned long cache_size = MIN_EMERGENCY_SIZE;
+	int i;
+
+	for (i = 0; i < NR_EMERGENCY_CACHES; i++) {
+		if (size < cache_size)
+			return emergency_caches[i];
+
+		cache_size *= 2;
+	}
+	return NULL;
+}
+
+static void *emergency_alloc(struct kmem_cache *cache, gfp_t gfp)
+{
+	struct kmem_cache *emergency_cache;
+	void *p;
+
+	emergency_cache = lookup_emergency_cache(cache->objsize);
+	if (!emergency_cache)
+		return NULL;
+
+	p = kmem_cache_alloc(emergency_cache, gfp);
+	if (p && cache->ctor)
+		cache->ctor(cache, p);
+	return p;
+}
+
 /********************************************************************
  * 			Core slab cache functions
  *******************************************************************/
@@ -1528,6 +1589,9 @@ new_slab:
 		goto load_freelist;
 	}
 
+	if ((gfpflags & GFP_TEMPORARY) == GFP_TEMPORARY)
+		return emergency_alloc(s, gfpflags);
+
 	/*
 	 * No memory available.
 	 *
@@ -2970,6 +3034,7 @@ void __init kmem_cache_init(void)
 #else
 	kmem_size = sizeof(struct kmem_cache);
 #endif
+	init_emergency_caches();
 
 	printk(KERN_INFO
 		"SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d,"


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 22:05                   ` Pekka J Enberg
@ 2008-04-07 22:17                     ` Linus Torvalds
  2008-04-07 22:42                       ` Pekka Enberg
  2008-04-08 20:42                       ` Pekka J Enberg
  0 siblings, 2 replies; 30+ messages in thread
From: Linus Torvalds @ 2008-04-07 22:17 UTC (permalink / raw)
  To: Pekka J Enberg
  Cc: Hugh Dickins, Peter Zijlstra, Christoph Lameter, James Bottomley,
	Andrew Morton, FUJITA Tomonori, Jens Axboe, Rafael J. Wysocki,
	linux-kernel



On Tue, 8 Apr 2008, Pekka J Enberg wrote:
> 
> So something like the following (totally untested) patch modulo the 
> pre-allocation bits.

Hmm. I didn't check, but won't this cause problems on the freeing path 
when we call kmem_cache_free() on the result but with the wrong "struct 
kmem_cache" pointer?

		Linus


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 22:17                     ` Linus Torvalds
@ 2008-04-07 22:42                       ` Pekka Enberg
  2008-04-08 20:42                       ` Pekka J Enberg
  1 sibling, 0 replies; 30+ messages in thread
From: Pekka Enberg @ 2008-04-07 22:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Peter Zijlstra, Christoph Lameter, James Bottomley,
	Andrew Morton, FUJITA Tomonori, Jens Axboe, Rafael J. Wysocki,
	linux-kernel

On Tue, 8 Apr 2008, Pekka J Enberg wrote:
>> So something like the following (totally untested) patch modulo the 
>> pre-allocation bits.

Linus Torvalds wrote:
> Hmm. I didn't check, but won't this cause problems on the freeing path 
> when we call kmem_cache_free() on the result but with the wrong "struct 
> kmem_cache" pointer?

Aah, yes. I was thinking of kmalloc() here for which it works as 
expected because kfree() will return the page to the proper cache. But 
we can relax the rules of kmem_cache_free() a bit to make this work (but 
perhaps add a WARN_ON() there if cache doesn't match page->slab).

			Pekka


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 18:07   ` Hugh Dickins
@ 2008-04-08 14:04     ` FUJITA Tomonori
  0 siblings, 0 replies; 30+ messages in thread
From: FUJITA Tomonori @ 2008-04-08 14:04 UTC (permalink / raw)
  To: hugh
  Cc: fujita.tomonori, James.Bottomley, torvalds, akpm, jens.axboe,
	clameter, penberg, a.p.zijlstra, rjw, linux-kernel

On Mon, 7 Apr 2008 19:07:56 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:

> On Mon, 7 Apr 2008, FUJITA Tomonori wrote:
> > On Sun, 6 Apr 2008 23:56:57 +0100 (BST)
> > Hugh Dickins <hugh@veritas.com> wrote:
> > 
> > Really sorry about the bug.
> 
> No, it's brought attention to this interesting slab merge issue;
> even if in the end we decide that's a non-issue.

Yeah, it seems to have led to an interesting discussion (treating cache
allocations as ephemeral sounds useful, I think), though surely this is
a bug.


> > > Another alternative is to revert the separate sense_slab, using
> > > cache-line-aligned sense_buffer allocated beyond scsi_cmnd from
> > > the one kmem_cache; but that might waste more memory, and is
> > > only a way of diverting around the known problem.
> > 
> > Reverting the separate sense_slab is fine for now, but we need the
> > separation shortly anyway. We need to support larger sense buffers
> > (260 bytes). The current 96-byte sense buffer works for the majority
> > of us, so we don't want to embed a 260-byte sense buffer in the
> > scsi_cmnd struct.
> 
> I don't believe you _need_ a separate sense_slab even for that:
> what I meant was that you just need something like
> 	pool->cmd_slab = kmem_cache_create(pool->cmd_name,
> 					   cache_line_align(
> 					   sizeof(struct scsi_cmnd)) +
> 					   max_scsi_sense_buffersize,
> 					   0, pool->slab_flags, NULL);
> then point cmd->sense_buffer to (unsigned char *) cmd +
> 			cache_line_align(sizeof(struct scsi_cmnd));
> where cache_line_align and max_scsi_sense_buffersize are preferably
> determined at runtime.
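The layout Hugh describes - one object holding both the command and a cache-line-aligned sense buffer - can be checked with a small standalone C model (CACHE_LINE, SENSE_SIZE, and the struct are assumptions made up for illustration; the kernel would derive the real values at runtime):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_LINE	64	/* assumed; real kernels detect this */
#define SENSE_SIZE	96

/* stand-in for struct scsi_cmnd; the real layout doesn't matter here */
struct toy_cmnd {
	char payload[200];
	unsigned char *sense_buffer;
};

static size_t cache_line_align(size_t n)
{
	return (n + CACHE_LINE - 1) & ~((size_t)CACHE_LINE - 1);
}

static struct toy_cmnd *toy_get_command(void)
{
	struct toy_cmnd *cmd;

	/* one allocation covering the command plus an aligned sense buffer;
	 * alignment is relative to the object start, which a slab cache
	 * would itself cache-line align via its slab_flags */
	cmd = calloc(1, cache_line_align(sizeof(*cmd)) + SENSE_SIZE);
	if (!cmd)
		return NULL;
	cmd->sense_buffer = (unsigned char *)cmd +
			    cache_line_align(sizeof(*cmd));
	return cmd;
}
```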

Yes, if we have only 96- and 260-byte sense buffers, that would be a
solution. Well, even if we have variable-length sense buffers, we can
have a pool per driver (or device, scsi_host, etc).

Another reason why we separated them is that we could allocate a sense
buffer only when it's necessary (though I'm not sure we will actually
do it).


> Now, it may well be that over the different configurations, at least
> some would waste significant memory by putting it all in the one big
> buffer, and you're better off with the separate slabs: so I didn't
> want to interfere with your direction on that.

Yes, this was about wasting memory (with the one big buffer) vs. the
overhead of allocating two buffers. After some performance tests, we
chose the latter, but we might change this again in the future.


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 22:17                     ` Linus Torvalds
  2008-04-07 22:42                       ` Pekka Enberg
@ 2008-04-08 20:42                       ` Pekka J Enberg
  2008-04-08 20:44                         ` Pekka Enberg
  2008-04-08 20:45                         ` Christoph Lameter
  1 sibling, 2 replies; 30+ messages in thread
From: Pekka J Enberg @ 2008-04-08 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Peter Zijlstra, Christoph Lameter, James Bottomley,
	Andrew Morton, FUJITA Tomonori, Jens Axboe, Rafael J. Wysocki,
	linux-kernel

On Tue, 8 Apr 2008, Pekka J Enberg wrote:
> > 
> > So something like the following (totally untested) patch modulo the 
> > pre-allocation bits.

On Mon, 7 Apr 2008, Linus Torvalds wrote:
> Hmm. I didn't check, but won't this cause problems on the freeing path 
> when we call kmem_cache_free() on the result but with the wrong "struct 
> kmem_cache" pointer?

So something like this fugly patch on top of the SLUB variable order
patches that let the allocator fall back to smaller page orders. It
survives OOM in my testing, and the box seems to be a bit more
responsive even under X with this.

		Pekka

---
 mm/slub.c |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 92 insertions(+), 1 deletion(-)

Index: slab-2.6/mm/slub.c
===================================================================
--- slab-2.6.orig/mm/slub.c	2008-04-08 22:52:02.000000000 +0300
+++ slab-2.6/mm/slub.c	2008-04-08 23:27:27.000000000 +0300
@@ -1549,6 +1549,88 @@
 }
 
 /*
+ * Emergency caches are reserved for GFP_TEMPORARY allocations on OOM.
+ */
+
+#define MIN_EMERGENCY_SIZE 32
+#define NR_EMERGENCY_CACHES 4
+
+struct kmem_cache *emergency_caches[NR_EMERGENCY_CACHES];
+
+static void pre_alloc_cpu_slabs(struct kmem_cache *s)
+{
+	int cpu;
+
+	/*
+	 * FIXME: CPU hot-plug, stats, debug?
+	 */
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+		struct page *new;
+		void **object;
+		int node;
+
+		c = get_cpu_slab(s, cpu);
+		node = cpu_to_node(cpu);
+
+		new = new_slab(s, GFP_KERNEL, node);
+		BUG_ON(!new);
+
+		slab_lock(new);
+		SetSlabFrozen(new);
+		c->page = new;
+		object = c->page->freelist;
+		c->freelist = object[c->offset];
+		c->page->inuse = c->page->objects;
+		c->page->freelist = NULL;
+		c->node = page_to_nid(c->page);
+		slab_unlock(c->page);
+	}
+}
+
+static void init_emergency_caches(void)
+{
+	unsigned long size = MIN_EMERGENCY_SIZE;
+	int i;
+
+	for (i = 0; i < NR_EMERGENCY_CACHES; i++) {
+		struct kmem_cache *cache;
+		char *name;
+
+		name = kasprintf(GFP_KERNEL, "kmalloc-%lu", size);
+		BUG_ON(!name);
+
+		cache = kmem_cache_create(name, size, 0, 0, NULL);
+		BUG_ON(!cache);
+		kfree(name);
+
+		pre_alloc_cpu_slabs(cache);
+		emergency_caches[i] = cache;
+
+		size *= 2;
+	}
+}
+
+static void *emergency_alloc(struct kmem_cache *cache, gfp_t gfp)
+{
+	unsigned long cache_size = MIN_EMERGENCY_SIZE;
+	void *p = NULL;
+	int i;
+
+	for (i = 0; i < NR_EMERGENCY_CACHES; i++) {
+		if (cache->objsize < cache_size) {
+			p = kmem_cache_alloc(emergency_caches[i], gfp);
+			if (p)
+				break;
+		}
+		cache_size *= 2;
+	}
+	if (p && cache->ctor)
+		cache->ctor(cache, p);
+	return p;
+}
+
+/*
  * Slow path. The lockless freelist is empty or we need to perform
  * debugging duties.
  *
@@ -1626,6 +1708,14 @@
 		c->page = new;
 		goto load_freelist;
 	}
+
+	/*
+	 * We are really OOM. Let short-lived allocations dip into the reserves
+	 * to ensure writeback makes progress.
+	 */
+	if ((gfpflags & GFP_TEMPORARY) == GFP_TEMPORARY)
+		return emergency_alloc(s, gfpflags);
+
 	return NULL;
 debug:
 	if (!alloc_debug_processing(s, c->page, object, addr))
@@ -1791,7 +1881,7 @@
 
 	page = virt_to_head_page(x);
 
-	slab_free(s, page, x, __builtin_return_address(0));
+	slab_free(page->slab, page, x, __builtin_return_address(0));
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
@@ -3225,6 +3315,7 @@
 #else
 	kmem_size = sizeof(struct kmem_cache);
 #endif
+	init_emergency_caches();
 
 	printk(KERN_INFO
 		"SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d,"


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-07 19:40   ` Hugh Dickins
  2008-04-07 19:55     ` Peter Zijlstra
@ 2008-04-08 20:43     ` Christoph Lameter
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2008-04-08 20:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: James Bottomley, Linus Torvalds, Andrew Morton, FUJITA Tomonori,
	Jens Axboe, Pekka Enberg, Peter Zijlstra, Rafael J. Wysocki,
	linux-kernel

On Mon, 7 Apr 2008, Hugh Dickins wrote:

> > Looking at mempool_alloc: Mempools may be used to do atomic allocations 
> > until they fail thereby exhausting reserves and available object in the 
> > partial lists of slab caches?
> 
> Mempools may be used for atomic allocations, but I think that's not
> the case here.  swap_writepage's get_swap_bio says GFP_NOIO, which
> allows (indeed is) __GFP_WAIT, and does not give access to __GFP_HIGH
> reserves.

It looks like one of the issues here is that swap_writepage() does not
perform enough reclaim? If it freed more pages, then
__scsi_get_command would still have pages to allocate and would not
drain the reserves.

> Maybe PF_MEMALLOC and __GFP_NOMEMALLOC complicate the situation:
> I've given little thought to mempool_alloc's fiddling with the
> gfp_mask (beyond repeatedly misreading it).

mempool_alloc()'s use of the gfp_mask here suggests that it can
potentially drain all reserves and exhaust all available "slots"
(partial slabs). Thus it may regularly force any other user of the
slab to hit the slow path and potentially trigger reclaim. Could be a
bit unfair.


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-08 20:42                       ` Pekka J Enberg
@ 2008-04-08 20:44                         ` Pekka Enberg
  2008-04-08 20:45                         ` Christoph Lameter
  1 sibling, 0 replies; 30+ messages in thread
From: Pekka Enberg @ 2008-04-08 20:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Peter Zijlstra, Christoph Lameter, James Bottomley,
	Andrew Morton, FUJITA Tomonori, Jens Axboe, Rafael J. Wysocki,
	linux-kernel

Pekka J Enberg wrote:
> So something like this fugly patch on top of the SLUB variable order
> patches that let the allocator fall back to smaller page orders.
> Survives OOM in my testing and the box seems to be a bit more
> responsive even under X with this.

...uh-oh, hand over the paper bag. I have no GFP_TEMPORARY allocations
going on...


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-08 20:42                       ` Pekka J Enberg
  2008-04-08 20:44                         ` Pekka Enberg
@ 2008-04-08 20:45                         ` Christoph Lameter
  2008-04-08 21:11                           ` Pekka Enberg
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2008-04-08 20:45 UTC (permalink / raw)
  To: Pekka J Enberg
  Cc: Linus Torvalds, Hugh Dickins, Peter Zijlstra, James Bottomley,
	Andrew Morton, FUJITA Tomonori, Jens Axboe, Rafael J. Wysocki,
	linux-kernel

On Tue, 8 Apr 2008, Pekka J Enberg wrote:

> So something like this fugly patch on top of the SLUB variable order
> patches that let the allocator fall back to smaller page orders.
> Survives OOM in my testing and the box seems to be a bit more
> responsive even under X with this.

Hmmmm... Peter has the most experience with these issues. Maybe the
best approach would be to have this sort of logic in a more general
way in the page allocator? Similar issues surely exist with the page
allocator, and a fix there would fix it for all users.



* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-08 20:45                         ` Christoph Lameter
@ 2008-04-08 21:11                           ` Pekka Enberg
  2008-04-08 21:40                             ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Pekka Enberg @ 2008-04-08 21:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Hugh Dickins, Peter Zijlstra, James Bottomley,
	Andrew Morton, FUJITA Tomonori, Jens Axboe, Rafael J. Wysocki,
	linux-kernel

Christoph Lameter wrote:
> Hmmmm... Peter has the most experience with these issues. Maybe the best 
> would be to have this sort of logic in a more general way in the page 
> allocator? Similar issues surely exist with the page allocator and a fix 
> there would fix it for all users.

This needs some support in the slab allocator anyway. Keep in mind
that the patch specifically addresses writeback in OOM conditions, so
we must (1) prioritize GFP_TEMPORARY allocations over everyone else
(who just get NULL) and (2) use the remaining available memory as
efficiently as possible for _all_ GFP_TEMPORARY allocations.

Peter is, however, bringing up a good point that my patch doesn't
actually _guarantee_ anything, so I'm still wondering if this approach
makes any sense... But I sure do like Linus's idea of marking
short-lived allocations and trying harder for them in OOM.

			Pekka


* Re: [PATCH] scsi: fix sense_slab/bio swapping livelock
  2008-04-08 21:11                           ` Pekka Enberg
@ 2008-04-08 21:40                             ` Peter Zijlstra
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2008-04-08 21:40 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Linus Torvalds, Hugh Dickins, James Bottomley,
	Andrew Morton, FUJITA Tomonori, Jens Axboe, Rafael J. Wysocki,
	linux-kernel

On Wed, 2008-04-09 at 00:11 +0300, Pekka Enberg wrote:
> Christoph Lameter wrote:
> > Hmmmm... Peter has the most experience with these issues. Maybe the best 
> > would be to have this sort of logic in a more general way in the page 
> > allocator? Similar issues surely exist with the page allocator and a fix 
> > there would fix it for all users.
> 
> This needs some support in the slab allocator anyway. Keep in mind that 
> the patch is specifically addressing writeback in OOM conditions so we 
> must (1) prioritize GFP_TEMPORARY allocations over everyone else (which 
> just get NULL) and (2) use the remaining available memory as efficiently 
> as possible for _all_ GFP_TEMPORARY allocations.
> 
> Peter is, however, bringing up a good point that my patch doesn't 
> actually _guarantee_ anything so I'm still wondering if this approach 
> makes any sense... But I sure do like Linus' ideas of marking 
> short-lived allocations and trying harder for them in OOM.

Also, this scheme so far does not provide a means to detect the end
of a pressure situation.

I need both triggers, enter pressure and exit pressure. Enter pressure
is easy enough: that's when normal allocations start failing. Exit
pressure, however, is more difficult: that is basically when allocations
start succeeding again. You'll see that my patches basically always
attempt a regular allocation as long as we're in the emergency state.
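
Both triggers fall out of one rule, as this toy sketch shows (the pools, the `regular_alloc_works` switch, and the function names are all hypothetical, standing in for the real allocator paths): always try the regular allocation first; failure enters the emergency state, success exits it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: whether the "regular" allocator works is driven by a flag. */
static bool regular_alloc_works = true;
static bool emergency;               /* are we in the pressure state? */
static unsigned char regular_pool[64], emergency_pool[64];

static void *try_regular(size_t size)
{
	return (regular_alloc_works && size <= sizeof(regular_pool))
		? regular_pool : NULL;
}

static void *alloc_tracked(size_t size)
{
	void *p = try_regular(size);
	if (p) {
		emergency = false;   /* exit pressure: regular allocs work again */
		return p;
	}
	emergency = true;            /* enter pressure: fall back to the reserve */
	return size <= sizeof(emergency_pool) ? emergency_pool : NULL;
}
```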

Also, the requirement for using emergency memory (GFP_MEMALLOC,
PF_MEMALLOC) is that it will be freed without depending on external
conditions. So while freeing might be delayed (the memory may sit in
the fragment assembly cache for a while), it doesn't need any external
input to get freed again:
  - it will either get reclaimed from this cache;
  - it will exit the cache as a full packet and:
    - get dropped, or
    - get processed.
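
A minimal sketch of that invariant (names and counters here are invented for illustration, not real networking code): every fragment parked in the cache reaches a free through one of the three internal paths, so the count of outstanding emergency objects always returns to zero without outside help.

```c
#include <assert.h>

/* The three internal paths by which a cached fragment is released. */
enum frag_fate { FRAG_RECLAIMED, FRAG_DROPPED, FRAG_PROCESSED };

static int outstanding;              /* emergency objects not yet freed */

static void frag_store(void)
{
	outstanding++;               /* fragment parked in the assembly cache */
}

static void frag_finish(enum frag_fate fate)
{
	(void)fate;                  /* every fate ends in a free */
	outstanding--;
}
```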





end of thread, other threads:[~2008-04-08 21:41 UTC | newest]

Thread overview: 30+ messages
2008-04-06 22:56 [PATCH] scsi: fix sense_slab/bio swapping livelock Hugh Dickins
2008-04-06 23:35 ` James Bottomley
2008-04-07  1:01   ` Hugh Dickins
2008-04-07 17:51     ` Hugh Dickins
2008-04-07 18:04       ` James Bottomley
2008-04-07 18:26         ` Hugh Dickins
2008-04-07  2:48 ` FUJITA Tomonori
2008-04-07 18:07   ` Hugh Dickins
2008-04-08 14:04     ` FUJITA Tomonori
2008-04-07  5:26 ` Christoph Lameter
2008-04-07 19:40   ` Hugh Dickins
2008-04-07 19:55     ` Peter Zijlstra
2008-04-07 20:31       ` Hugh Dickins
2008-04-07 20:47         ` Peter Zijlstra
2008-04-07 21:00         ` Pekka Enberg
2008-04-07 21:05           ` Pekka Enberg
2008-04-07 21:15             ` Linus Torvalds
2008-04-07 21:34               ` Pekka Enberg
2008-04-07 21:39                 ` Pekka Enberg
2008-04-07 22:05                   ` Pekka J Enberg
2008-04-07 22:17                     ` Linus Torvalds
2008-04-07 22:42                       ` Pekka Enberg
2008-04-08 20:42                       ` Pekka J Enberg
2008-04-08 20:44                         ` Pekka Enberg
2008-04-08 20:45                         ` Christoph Lameter
2008-04-08 21:11                           ` Pekka Enberg
2008-04-08 21:40                             ` Peter Zijlstra
2008-04-07 21:30             ` Hugh Dickins
2008-04-07 21:36               ` Pekka Enberg
2008-04-08 20:43     ` Christoph Lameter
