[BLOCK] bh: Ensure bh fits within a page
diff mbox series

Message ID 20060801030443.GA2221@gondor.apana.org.au
State New, archived
Headers show
Series
  • [BLOCK] bh: Ensure bh fits within a page
Related show

Commit Message

Herbert Xu Aug. 1, 2006, 3:04 a.m. UTC
Hi Andrew:

[BLOCK] bh: Ensure bh fits within a page

There is a bug in jbd with slab debugging enabled where it was submitting
a bh obtained via jbd_rep_kmalloc which crossed a page boundary.  A lot
of time was spent on tracking this down because the symptoms were far off
from where the problem was.

This patch adds a sanity check to submit_bh so we can immediately spot
anyone doing similar things in future.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

While you're at it, could you fix that jbd bug for us :)

Cheers,

Comments

Andrew Morton Aug. 1, 2006, 4:04 a.m. UTC | #1
On Tue, 1 Aug 2006 13:04:43 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> 
> There is a bug in jbd with slab debugging enabled where it was submitting
> a bh obtained via jbd_rep_kmalloc which crossed a page boundary.  A lot
> of time was spent on tracking this down because the symptoms were far off
> from where the problem was.
> 
> This patch adds a sanity check to submit_bh so we can immediately spot
> anyone doing similar things in future.

Seems sane.

> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> While you're at it, could you fix that jbd bug for us :)

Could we have a more detailed description?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Herbert Xu Aug. 1, 2006, 5:02 a.m. UTC | #2
On Mon, Jul 31, 2006 at 09:04:18PM -0700, Andrew Morton wrote:
> 
> Could we have a more detailed description?

Sure, this particular instance is in journal_write_metadata_buffer
where the bh may be constructed from kmalloc memory (search for the
call to jbd_rep_kmalloc).  Because the memory returned by kmalloc
may straddle a page (when slab debugging is enabled that is), this
causes a broken bh to be injected into submit_bh.

Cheers,
Andrew Morton Aug. 1, 2006, 5:54 a.m. UTC | #3
On Tue, 1 Aug 2006 15:02:59 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Mon, Jul 31, 2006 at 09:04:18PM -0700, Andrew Morton wrote:
> > 
> > Could we have a more detailed description?
> 
> Sure, this particular instance is in journal_write_metadata_buffer
> where the bh may be constructed from kmalloc memory (search for the
> call to jbd_rep_kmalloc).  Because the memory returned by kmalloc
> may straddle a page (when slab debugging is enabled that is), this
> causes a broken bh to be injected into submit_bh.
> 

Crap, that's hard to fix.   Am I allowed to blame submit_bh()? ;)

uhm, we don't want to lose kmalloc redzoning, so I guess we need to create
on-demand ext3-private slab caches for 1024, 2048, and 4096 bytes.  With
the appropriate slab flags to defeat the redzoning.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Jens Axboe Aug. 1, 2006, 7:23 a.m. UTC | #4
On Tue, Aug 01 2006, Herbert Xu wrote:
> Hi Andrew:
> 
> [BLOCK] bh: Ensure bh fits within a page
> 
> There is a bug in jbd with slab debugging enabled where it was submitting
> a bh obtained via jbd_rep_kmalloc which crossed a page boundary.  A lot
> of time was spent on tracking this down because the symptoms were far off
> from where the problem was.
> 
> This patch adds a sanity check to submit_bh so we can immediately spot
> anyone doing similar things in future.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> While you're at it, could you fix that jbd bug for us :)
> 
> Cheers,
> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> --
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 71649ef..b998f08 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2790,6 +2790,7 @@ int submit_bh(int rw, struct buffer_head
>  	BUG_ON(!buffer_locked(bh));
>  	BUG_ON(!buffer_mapped(bh));
>  	BUG_ON(!bh->b_end_io);
> +	WARN_ON(bh_offset(bh) + bh->b_size > PAGE_SIZE);

That looks really dangerous, I'd prefer that to be a BUG_ON() as well to
prevent nastiness further down.
Herbert Xu Aug. 1, 2006, 11:06 a.m. UTC | #5
On Mon, Jul 31, 2006 at 10:54:54PM -0700, Andrew Morton wrote:
> 
> Crap, that's hard to fix.   Am I allowed to blame submit_bh()? ;)

Sure you're allowed to do anything :)

> uhm, we don't want to lose kmalloc redzoning, so I guess we need to create
> on-demand ext3-private slab caches for 1024, 2048, and 4096 bytes.  With
> the appropriate slab flags to defeat the redzoning.

Either that or we should fix redzoning so that it actually preserves
alignment, rather than turning off debugging whenever we want alignment.
Basically it means that we need to use twice the amount of memory for
each object (so 2K for each 1K object).  Is this acceptable for debugging?

Cheers,
Andrew Morton Aug. 1, 2006, 2:45 p.m. UTC | #6
On Tue, 1 Aug 2006 21:06:58 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Mon, Jul 31, 2006 at 10:54:54PM -0700, Andrew Morton wrote:
> > 
> > Crap, that's hard to fix.   Am I allowed to blame submit_bh()? ;)
> 
> Sure you're allowed to do anything :)
> 
> > uhm, we don't want to lose kmalloc redzoning, so I guess we need to create
> > on-demand ext3-private slab caches for 1024, 2048, and 4096 bytes.  With
> > the appropriate slab flags to defeat the redzoning.
> 
> Either that or we should fix redzoning so that it actually preserves
> alignment, rather than turning off debugging whenever we want alignment.
> Basically it means that we need to use twice the amount of memory for
> each object (so 2K for each 1K object).  Is this acceptable for debugging?

It doesn't sound like a good idea.  At present CONFIG_DEBUG_SLAB is
acceptable for production use (although it makes kernel profiles look
pretty sad).  Doubling the size of the slabs would rather alter that.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Christoph Lameter Aug. 1, 2006, 5:36 p.m. UTC | #7
On Mon, 31 Jul 2006, Andrew Morton wrote:

> > Sure, this particular instance is in journal_write_metadata_buffer
> > where the bh may be constructed from kmalloc memory (search for the
> > call to jbd_rep_kmalloc).  Because the memory returned by kmalloc
> > may straddle a page (when slab debugging is enabled that is), this
> > causes a broken bh to be injected into submit_bh.
> > 
> 
> Crap, that's hard to fix.   Am I allowed to blame submit_bh()? ;)
> 
> uhm, we don't want to lose kmalloc redzoning, so I guess we need to create
> on-demand ext3-private slab caches for 1024, 2048, and 4096 bytes.  With
> the appropriate slab flags to defeat the redzoning.

The slab allocator gives no guarantee that a structure is not straddling a 
page boundary regardless of debug or not. It may just happen that the 
objects are arranged if kmem_cache_cretae() is called with certain 
parameters. Another arch with other cacheline alignment and another page 
size may arrange the objects differently.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Pekka Enberg Aug. 1, 2006, 7:08 p.m. UTC | #8
Hi,

On 8/1/06, Christoph Lameter <clameter@sgi.com> wrote:
> The slab allocator gives no guarantee that a structure is not straddling a
> page boundary regardless of debug or not. It may just happen that the
> objects are arranged if kmem_cache_cretae() is called with certain
> parameters. Another arch with other cacheline alignment and another page
> size may arrange the objects differently.

Indeed. You could try to force zero-order page allocations for a
cache, but we are probably better of using the page allocator directly
or crafting a special purpose allocator for the block layer.

                                 Pekka
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Steven Rostedt Aug. 1, 2006, 7:08 p.m. UTC | #9
On Tue, 2006-08-01 at 10:36 -0700, Christoph Lameter wrote:
> On Mon, 31 Jul 2006, Andrew Morton wrote:
> 
> > > Sure, this particular instance is in journal_write_metadata_buffer
> > > where the bh may be constructed from kmalloc memory (search for the
> > > call to jbd_rep_kmalloc).  Because the memory returned by kmalloc
> > > may straddle a page (when slab debugging is enabled that is), this
> > > causes a broken bh to be injected into submit_bh.
> > > 
> > 
> > Crap, that's hard to fix.   Am I allowed to blame submit_bh()? ;)
> > 
> > uhm, we don't want to lose kmalloc redzoning, so I guess we need to create
> > on-demand ext3-private slab caches for 1024, 2048, and 4096 bytes.  With
> > the appropriate slab flags to defeat the redzoning.
> 
> The slab allocator gives no guarantee that a structure is not straddling a 
> page boundary regardless of debug or not. It may just happen that the 
> objects are arranged if kmem_cache_cretae() is called with certain 
> parameters. Another arch with other cacheline alignment and another page 
> size may arrange the objects differently.

If you set the alignment for ext3 the same as the size (ie 1024, 2048,
4096 for the above respectively) then wouldn't that guarantee not
straddling a page?

-- Steve


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Christoph Lameter Aug. 1, 2006, 7:10 p.m. UTC | #10
On Tue, 1 Aug 2006, Steven Rostedt wrote:

> If you set the alignment for ext3 the same as the size (ie 1024, 2048,
> 4096 for the above respectively) then wouldn't that guarantee not
> straddling a page?

Yes. But then that number must always be a fraction of pagesize.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Andi Kleen Aug. 1, 2006, 7:33 p.m. UTC | #11
Herbert Xu <herbert@gondor.apana.org.au> writes:
> --
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 71649ef..b998f08 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2790,6 +2790,7 @@ int submit_bh(int rw, struct buffer_head
>  	BUG_ON(!buffer_locked(bh));
>  	BUG_ON(!buffer_mapped(bh));
>  	BUG_ON(!bh->b_end_io);
> +	WARN_ON(bh_offset(bh) + bh->b_size > PAGE_SIZE);

What happens when someone implements direct large page IO?

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Steven Rostedt Aug. 1, 2006, 7:35 p.m. UTC | #12
On Tue, 2006-08-01 at 12:10 -0700, Christoph Lameter wrote:
> On Tue, 1 Aug 2006, Steven Rostedt wrote:
> 
> > If you set the alignment for ext3 the same as the size (ie 1024, 2048,
> > 4096 for the above respectively) then wouldn't that guarantee not
> > straddling a page?
> 
> Yes. But then that number must always be a fraction of pagesize.
> 

understood, as is 1024, 2048, and 4096 are.  Well, if pagesize is 4096
is 4096 really a fraction of 4096? :)

Also, isn't all sizes for kmalloc that are under pagesize a fraction of
the page size? Or more correctly, a power of 2?

-- Steve


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Christoph Lameter Aug. 1, 2006, 7:55 p.m. UTC | #13
On Tue, 1 Aug 2006, Steven Rostedt wrote:

> > Yes. But then that number must always be a fraction of pagesize.
> > 
> 
> understood, as is 1024, 2048, and 4096 are.  Well, if pagesize is 4096
> is 4096 really a fraction of 4096? :)
> 
> Also, isn't all sizes for kmalloc that are under pagesize a fraction of
> the page size? Or more correctly, a power of 2?

Well the size of the object and the alignment must be a fraction of the 
pagesize. If you get page aligned pages then I wonder why use the slab 
allocator? The page allocator will be much better suited.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Herbert Xu Aug. 1, 2006, 10:09 p.m. UTC | #14
On Tue, Aug 01, 2006 at 09:33:54PM +0200, Andi Kleen wrote:
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 71649ef..b998f08 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -2790,6 +2790,7 @@ int submit_bh(int rw, struct buffer_head
> >  	BUG_ON(!buffer_locked(bh));
> >  	BUG_ON(!buffer_mapped(bh));
> >  	BUG_ON(!bh->b_end_io);
> > +	WARN_ON(bh_offset(bh) + bh->b_size > PAGE_SIZE);
> 
> What happens when someone implements direct large page IO?

Then they'll need to change submit_bh to generate more than one bvec.
At that time they can remove this warning :)

Cheers,
Herbert Xu Aug. 1, 2006, 11:30 p.m. UTC | #15
Jens Axboe <axboe@suse.de> wrote:
>
> That looks really dangerous, I'd prefer that to be a BUG_ON() as well to
> prevent nastiness further down.

OK, I used a WARN_ON mainly because ext3 has been doing this for years
without killing anyone until now :)

[BLOCK] bh: Ensure bh fits within a page

There is a bug in jbd with slab debugging enabled where it was submitting
a bh obtained via jbd_rep_kmalloc which crossed a page boundary.  A lot
of time was spent on tracking this down because the symptoms were far off
from where the problem was.

This patch adds a sanity check to submit_bh so we can immediately spot
anyone doing similar things in future.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
Herbert Xu Aug. 1, 2006, 11:32 p.m. UTC | #16
On Wed, Aug 02, 2006 at 09:30:51AM +1000, Herbert Xu wrote:
> 
> OK, I used a WARN_ON mainly because ext3 has been doing this for years
> without killing anyone until now :)
> 
> [BLOCK] bh: Ensure bh fits within a page

Actually, the other reason is that we can't BUG_ON until ext3 is fixed
to not do this.  So I suppose we should keep the WARN_ON until that
happens.

Cheers,
Andrew Morton Aug. 1, 2006, 11:44 p.m. UTC | #17
On Wed, 02 Aug 2006 09:30:51 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> Jens Axboe <axboe@suse.de> wrote:
> >
> > That looks really dangerous, I'd prefer that to be a BUG_ON() as well to
> > prevent nastiness further down.
> 
> OK, I used a WARN_ON mainly because ext3 has been doing this for years
> without killing anyone until now :)

I can't apply either of these patches, because we _know_ it'll trigger.

Once the JBD fix is in hand, we can go with a WARN_ON_ONCE and if that is
all clear, then we can make it a BUG_ON().

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Jens Axboe Aug. 2, 2006, 6:23 a.m. UTC | #18
On Wed, Aug 02 2006, Herbert Xu wrote:
> On Wed, Aug 02, 2006 at 09:30:51AM +1000, Herbert Xu wrote:
> > 
> > OK, I used a WARN_ON mainly because ext3 has been doing this for years
> > without killing anyone until now :)
> > 
> > [BLOCK] bh: Ensure bh fits within a page
> 
> Actually, the other reason is that we can't BUG_ON until ext3 is fixed
> to not do this.  So I suppose we should keep the WARN_ON until that
> happens.

Yep, the ext3 bits need to go in first.
Jens Axboe Aug. 2, 2006, 11:06 a.m. UTC | #19
On Wed, Aug 02 2006, Herbert Xu wrote:
> On Tue, Aug 01, 2006 at 09:33:54PM +0200, Andi Kleen wrote:
> > > diff --git a/fs/buffer.c b/fs/buffer.c
> > > index 71649ef..b998f08 100644
> > > --- a/fs/buffer.c
> > > +++ b/fs/buffer.c
> > > @@ -2790,6 +2790,7 @@ int submit_bh(int rw, struct buffer_head
> > >  	BUG_ON(!buffer_locked(bh));
> > >  	BUG_ON(!buffer_mapped(bh));
> > >  	BUG_ON(!bh->b_end_io);
> > > +	WARN_ON(bh_offset(bh) + bh->b_size > PAGE_SIZE);
> > 
> > What happens when someone implements direct large page IO?
> 
> Then they'll need to change submit_bh to generate more than one bvec.
> At that time they can remove this warning :)

Or just use a proper interface, no new code should touch submit_bh().
Badari Pulavarty Aug. 17, 2006, 11:14 p.m. UTC | #20
On Wed, 2006-08-02 at 09:30 +1000, Herbert Xu wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > That looks really dangerous, I'd prefer that to be a BUG_ON() as well to
> > prevent nastiness further down.
> 
> OK, I used a WARN_ON mainly because ext3 has been doing this for years
> without killing anyone until now :)
> 
> [BLOCK] bh: Ensure bh fits within a page
> 
> There is a bug in jbd with slab debugging enabled where it was submitting
> a bh obtained via jbd_rep_kmalloc which crossed a page boundary.  A lot
> of time was spent on tracking this down because the symptoms were far off
> from where the problem was.
> 
> This patch adds a sanity check to submit_bh so we can immediately spot
> anyone doing similar things in future.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> Cheers,

Hi,

I am working on a fix to address this problem. Do you have any testcases
to reproduce the problem ? I have been running bunch of tests with
CONFIG_DEBUG_SLAB with WARN_ON() and I haven't seen it yet :(

Wondering, if there is any special thing I need to do to reproduce
the problem ? Please let me know.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Herbert Xu Aug. 18, 2006, 2:53 a.m. UTC | #21
On Thu, Aug 17, 2006 at 04:14:16PM -0700, Badari Pulavarty wrote:
> 
> Wondering, if there is any special thing I need to do to reproduce
> the problem ? Please let me know.

I must say that I didn't try to reproduce the problem.  However,
here is the bugzilla entry for it.  It apparently shows up during
installation:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=200127

Cheers,

Patch
diff mbox series

diff --git a/fs/buffer.c b/fs/buffer.c
index 71649ef..b998f08 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2790,6 +2790,7 @@  int submit_bh(int rw, struct buffer_head
 	BUG_ON(!buffer_locked(bh));
 	BUG_ON(!buffer_mapped(bh));
 	BUG_ON(!bh->b_end_io);
+	WARN_ON(bh_offset(bh) + bh->b_size > PAGE_SIZE);
 
 	if (buffer_ordered(bh) && (rw == WRITE))
 		rw = WRITE_BARRIER;