* [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
@ 2013-03-08 12:37 ` Shuge
  0 siblings, 0 replies; 74+ messages in thread
From: Shuge @ 2013-03-08 12:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-ext4
  Cc: Kevin, Jan Kara, Theodore Ts'o, Jens Axboe

The bounce code accepts slab pages from jbd2 and flushes the dcache on them.
When VM_DEBUG is enabled, this triggers the VM_BUG_ON in page_mapping().
So, check PageSlab in __blk_queue_bounce() to avoid it.

Bug URL: http://lkml.org/lkml/2013/3/7/56

Signed-off-by: shuge <shuge@allwinnertech.com>
---
  mm/bounce.c |    3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/bounce.c b/mm/bounce.c
index 4e9ae72..f352c03 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -214,7 +214,8 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 		if (rw == WRITE) {
 			char *vto, *vfrom;
-			flush_dcache_page(from->bv_page);
+			if (unlikely(!PageSlab(from->bv_page)))
+				flush_dcache_page(from->bv_page);
 			vto = page_address(to->bv_page) + to->bv_offset;
 			vfrom = kmap(from->bv_page) + from->bv_offset;
 			memcpy(vto, vfrom, to->bv_len);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-08 12:37 ` Shuge
@ 2013-03-12 22:32   ` Andrew Morton
  -1 siblings, 0 replies; 74+ messages in thread
From: Andrew Morton @ 2013-03-12 22:32 UTC (permalink / raw)
  To: Shuge
  Cc: linux-kernel, linux-mm, linux-ext4, Kevin, Jan Kara,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel, Darrick J. Wong

On Fri, 08 Mar 2013 20:37:36 +0800 Shuge <shugelinux@gmail.com> wrote:

> The bounce code accepts slab pages from jbd2 and flushes the dcache on them.
> When VM_DEBUG is enabled, this triggers the VM_BUG_ON in page_mapping().
> So, check PageSlab in __blk_queue_bounce() to avoid it.
> 
> Bug URL: http://lkml.org/lkml/2013/3/7/56
> 
> ...
>
> --- a/mm/bounce.c
> +++ b/mm/bounce.c
> @@ -214,7 +214,8 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
>  		if (rw == WRITE) {
>  			char *vto, *vfrom;
> -			flush_dcache_page(from->bv_page);
> +			if (unlikely(!PageSlab(from->bv_page)))
> +				flush_dcache_page(from->bv_page);
>  			vto = page_address(to->bv_page) + to->bv_offset;
>  			vfrom = kmap(from->bv_page) + from->bv_offset;
>  			memcpy(vto, vfrom, to->bv_len);

I guess this is triggered by Catalin's f1a0c4aa0937975b ("arm64: Cache
maintenance routines"), which added a page_mapping() call to arm64's
arch/arm64/mm/flush.c:flush_dcache_page().

What's happening is that jbd2 is using kmalloc() to allocate buffer_head
data.  That gets submitted down the BIO layer and __blk_queue_bounce()
calls flush_dcache_page() which in the arm64 case calls page_mapping()
and page_mapping() does VM_BUG_ON(PageSlab) and splat.
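For reference, page_mapping() of that era looks roughly like this (paraphrased
and abridged; exact details vary by kernel version), and the VM_BUG_ON on the
slab check is the assertion that fires:

static inline struct address_space *page_mapping(struct page *page)
{
	struct address_space *mapping = page->mapping;

	VM_BUG_ON(PageSlab(page));	/* the splat reported above */
	if (unlikely(PageSwapCache(page)))
		mapping = &swapper_space;
	else if ((unsigned long)mapping & PAGE_MAPPING_ANON)
		mapping = NULL;		/* anon pages have no file mapping */
	return mapping;
}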

The unusual thing about all of this is that the payload for some disk
IO is coming from kmalloc, rather than being a user page.  It's oddball
but we've done this for ages and should continue to support it.


Now, the page from kmalloc() cannot be in highmem, so why did the
bounce code decide to bounce it?

__blk_queue_bounce() does

		/*
		 * is destination page below bounce pfn?
		 */
		if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
			continue;

and `force' comes from must_snapshot_stable_pages().  But
must_snapshot_stable_pages() must have returned false, because if it
had returned true then it would have been must_snapshot_stable_pages()
which went BUG, because must_snapshot_stable_pages() calls page_mapping().

So my tentative diagnosis is that arm64 is fishy.  A page which was
allocated via jbd2_alloc(GFP_NOFS)->kmem_cache_alloc() ended up being
above arm64's queue_bounce_pfn().  Can you please do a bit of
investigation to work out if this is what is happening?  Find out why
__blk_queue_bounce() decided to bounce a page which shouldn't have been
bounced?

This is all terribly fragile :( afaict if someone sets
bdi_cap_stable_pages_required() against that jbd2 queue, we're going to
hit that BUG_ON() again, via must_snapshot_stable_pages()'s
page_mapping() call.  (Darrick, this means you ;))

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-12 22:32   ` Andrew Morton
  (?)
@ 2013-03-13  1:10     ` Darrick J. Wong
  -1 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-13  1:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shuge, linux-kernel, linux-mm, linux-ext4, Kevin, Jan Kara,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel

On Tue, Mar 12, 2013 at 03:32:21PM -0700, Andrew Morton wrote:
> On Fri, 08 Mar 2013 20:37:36 +0800 Shuge <shugelinux@gmail.com> wrote:
> 
> > The bounce code accepts slab pages from jbd2 and flushes the dcache on them.
> > When VM_DEBUG is enabled, this triggers the VM_BUG_ON in page_mapping().
> > So, check PageSlab in __blk_queue_bounce() to avoid it.
> > 
> > Bug URL: http://lkml.org/lkml/2013/3/7/56
> > 
> > ...
> >
> > --- a/mm/bounce.c
> > +++ b/mm/bounce.c
> > @@ -214,7 +214,8 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
> >  		if (rw == WRITE) {
> >  			char *vto, *vfrom;
> > -			flush_dcache_page(from->bv_page);
> > +			if (unlikely(!PageSlab(from->bv_page)))
> > +				flush_dcache_page(from->bv_page);
> >  			vto = page_address(to->bv_page) + to->bv_offset;
> >  			vfrom = kmap(from->bv_page) + from->bv_offset;
> >  			memcpy(vto, vfrom, to->bv_len);
> 
> I guess this is triggered by Catalin's f1a0c4aa0937975b ("arm64: Cache
> maintenance routines"), which added a page_mapping() call to arm64's
> arch/arm64/mm/flush.c:flush_dcache_page().
> 
> What's happening is that jbd2 is using kmalloc() to allocate buffer_head
> data.  That gets submitted down the BIO layer and __blk_queue_bounce()
> calls flush_dcache_page() which in the arm64 case calls page_mapping()
> and page_mapping() does VM_BUG_ON(PageSlab) and splat.
> 
> The unusual thing about all of this is that the payload for some disk
> IO is coming from kmalloc, rather than being a user page.  It's oddball
> but we've done this for ages and should continue to support it.
> 
> 
> Now, the page from kmalloc() cannot be in highmem, so why did the
> bounce code decide to bounce it?
> 
> __blk_queue_bounce() does
> 
> 		/*
> 		 * is destination page below bounce pfn?
> 		 */
> 		if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
> 			continue;
> 
> and `force' comes from must_snapshot_stable_pages().  But
> must_snapshot_stable_pages() must have returned false, because if it
> had returned true then it would have been must_snapshot_stable_pages()
> which went BUG, because must_snapshot_stable_pages() calls page_mapping().
> 
> So my tentative diagnosis is that arm64 is fishy.  A page which was
> allocated via jbd2_alloc(GFP_NOFS)->kmem_cache_alloc() ended up being
> above arm64's queue_bounce_pfn().  Can you please do a bit of
> investigation to work out if this is what is happening?  Find out why
> __blk_queue_bounce() decided to bounce a page which shouldn't have been
> bounced?

That sure is strange.  I didn't see any obvious reasons why we'd end up with a
kmalloc above queue_bounce_pfn().  But then I don't have any arm64s either.

> This is all terribly fragile :( afaict if someone sets
> bdi_cap_stable_pages_required() against that jbd2 queue, we're going to
> hit that BUG_ON() again, via must_snapshot_stable_pages()'s
> page_mapping() call.  (Darrick, this means you ;))

Wheeee.  You're right, we shouldn't be calling page_mapping on slab pages.
We can keep walking the bio segments to find a non-slab page that can tell us
MS_SNAP_STABLE is set, since we probably won't need the bounce buffer anyway.

How does something like this look?  (+ the patch above)

--D

Subject: [PATCH] mm: Don't blow up on slab pages being written to disk

Don't assume that all pages attached to a bio are non-slab pages.  This happens
if (for example) jbd2 allocates a buffer out of the slab to hold frozen data.
If we encounter a slab page, just ignore the page and keep searching.
Hopefully filesystems are smart enough to guarantee that slab pages won't
be dirtied while they're also being written to disk.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 mm/bounce.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/bounce.c b/mm/bounce.c
index 5f89017..af34855 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -199,6 +199,8 @@ static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
 	 */
 	bio_for_each_segment(from, bio, i) {
 		page = from->bv_page;
+		if (PageSlab(page))
+			continue;
 		mapping = page_mapping(page);
 		if (!mapping)
 			continue;

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-13  1:10     ` Darrick J. Wong
  (?)
@ 2013-03-13  3:35       ` Shuge
  -1 siblings, 0 replies; 74+ messages in thread
From: Shuge @ 2013-03-13  3:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrew Morton, linux-kernel, linux-mm, linux-ext4, Kevin,
	Jan Kara, Theodore Ts'o, Jens Axboe, Catalin Marinas,
	Will Deacon, linux-arm-kernel

Hi all
>>> The bounce code accepts slab pages from jbd2 and flushes the dcache on them.
>>> When VM_DEBUG is enabled, this triggers the VM_BUG_ON in page_mapping().
>>> So, check PageSlab in __blk_queue_bounce() to avoid it.
>>>
>>> Bug URL: http://lkml.org/lkml/2013/3/7/56
>>>
>>> ...
>>>
>> ......
>>
> That sure is strange.  I didn't see any obvious reasons why we'd end up with a
>
......

     Well, this problem appears not only on arm64 but also on arm32. My
kernel version is 3.3.0 and the arch is arm32; going by the newest kernel,
the problem should still exist.
     I agree with Darrick's modification. However, if CONFIG_NEED_BOUNCE_POOL
is not set, we still flush the dcache on the pages of b_frozen_data, some of
which are allocated by kmem_cache_alloc().
     As we know, jbd2_alloc() allocates a buffer from a jbd2_xk slab pool
when the size is smaller than PAGE_SIZE. The b_frozen_data buffer is not
mapped to userspace and has no aliasing concerns, so it can be flushed
lazily or handled some other way. Is that right?
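(For illustration, a paraphrased sketch of the jbd2_alloc() behaviour being
described; jbd2_slab_for() is a stand-in name for the real per-size
slab-cache lookup, and exact details vary by kernel version.)

void *jbd2_alloc(size_t size, gfp_t flags)
{
	/* Sub-page buffers (e.g. 1k/2k block sizes) come from per-size jbd2
	 * slab caches, so the backing page is a PageSlab page. */
	if (size < PAGE_SIZE)
		return kmem_cache_alloc(jbd2_slab_for(size), flags);

	/* Page-sized and larger buffers come from the page allocator. */
	return (void *)__get_free_pages(flags, get_order(size));
}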


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-13  3:35       ` Shuge
  (?)
@ 2013-03-13  4:11         ` Andrew Morton
  -1 siblings, 0 replies; 74+ messages in thread
From: Andrew Morton @ 2013-03-13  4:11 UTC (permalink / raw)
  To: Shuge
  Cc: Darrick J. Wong, linux-kernel, linux-mm, linux-ext4, Kevin,
	Jan Kara, Theodore Ts'o, Jens Axboe, Catalin Marinas,
	Will Deacon, linux-arm-kernel

On Wed, 13 Mar 2013 11:35:15 +0800 Shuge <shugelinux@gmail.com> wrote:

> Hi all
> >>> The bounce code accepts slab pages from jbd2 and flushes the dcache on them.
> >>> When VM_DEBUG is enabled, this triggers the VM_BUG_ON in page_mapping().
> >>> So, check PageSlab in __blk_queue_bounce() to avoid it.
> >>>
> >>> Bug URL: http://lkml.org/lkml/2013/3/7/56
> >>>
> >>> ...
> >>>
> >> ......
> >>
> > That sure is strange.  I didn't see any obvious reasons why we'd end up with a
> >
> ......
> 
>      Well, this problem appears not only on arm64 but also on arm32. My
> kernel version is 3.3.0 and the arch is arm32; going by the newest kernel,
> the problem should still exist.
>      I agree with Darrick's modification. However, if CONFIG_NEED_BOUNCE_POOL
> is not set, we still flush the dcache on the pages of b_frozen_data, some of
> which are allocated by kmem_cache_alloc().
>      As we know, jbd2_alloc() allocates a buffer from a jbd2_xk slab pool
> when the size is smaller than PAGE_SIZE. The b_frozen_data buffer is not
> mapped to userspace and has no aliasing concerns, so it can be flushed
> lazily or handled some other way. Is that right?

Please reread my email.  The page at b_frozen_data was allocated with
GFP_NOFS.  Hence it should not need bounce treatment (if arm is
anything like x86).

And yet it *did* receive bounce treatment.  Why?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-13  1:10     ` Darrick J. Wong
  (?)
@ 2013-03-13  8:50       ` Jan Kara
  -1 siblings, 0 replies; 74+ messages in thread
From: Jan Kara @ 2013-03-13  8:50 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrew Morton, Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Jan Kara, Theodore Ts'o, Jens Axboe, Catalin Marinas,
	Will Deacon, linux-arm-kernel

On Tue 12-03-13 18:10:20, Darrick J. Wong wrote:
> On Tue, Mar 12, 2013 at 03:32:21PM -0700, Andrew Morton wrote:
> > On Fri, 08 Mar 2013 20:37:36 +0800 Shuge <shugelinux@gmail.com> wrote:
> > 
> > > The bounce code accepts slab pages from jbd2 and flushes the dcache on them.
> > > When VM_DEBUG is enabled, this triggers the VM_BUG_ON in page_mapping().
> > > So, check PageSlab in __blk_queue_bounce() to avoid it.
> > > 
> > > Bug URL: http://lkml.org/lkml/2013/3/7/56
> > > 
> > > ...
> > >
> > > --- a/mm/bounce.c
> > > +++ b/mm/bounce.c
> > > @@ -214,7 +214,8 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
> > >  		if (rw == WRITE) {
> > >  			char *vto, *vfrom;
> > > -			flush_dcache_page(from->bv_page);
> > > +			if (unlikely(!PageSlab(from->bv_page)))
> > > +				flush_dcache_page(from->bv_page);
> > >  			vto = page_address(to->bv_page) + to->bv_offset;
> > >  			vfrom = kmap(from->bv_page) + from->bv_offset;
> > >  			memcpy(vto, vfrom, to->bv_len);
> > 
> > I guess this is triggered by Catalin's f1a0c4aa0937975b ("arm64: Cache
> > maintenance routines"), which added a page_mapping() call to arm64's
> > arch/arm64/mm/flush.c:flush_dcache_page().
> > 
> > What's happening is that jbd2 is using kmalloc() to allocate buffer_head
> > data.  That gets submitted down the BIO layer and __blk_queue_bounce()
> > calls flush_dcache_page() which in the arm64 case calls page_mapping()
> > and page_mapping() does VM_BUG_ON(PageSlab) and splat.
> > 
> > The unusual thing about all of this is that the payload for some disk
> > IO is coming from kmalloc, rather than being a user page.  It's oddball
> > but we've done this for ages and should continue to support it.
> > 
> > 
> > Now, the page from kmalloc() cannot be in highmem, so why did the
> > bounce code decide to bounce it?
> > 
> > __blk_queue_bounce() does
> > 
> > 		/*
> > 		 * is destination page below bounce pfn?
> > 		 */
> > 		if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
> > 			continue;
> > 
> > and `force' comes from must_snapshot_stable_pages().  But
> > must_snapshot_stable_pages() must have returned false, because if it
> > had returned true then it would have been must_snapshot_stable_pages()
> > which went BUG, because must_snapshot_stable_pages() calls page_mapping().
> > 
> > So my tentative diagnosis is that arm64 is fishy.  A page which was
> > allocated via jbd2_alloc(GFP_NOFS)->kmem_cache_alloc() ended up being
> > above arm64's queue_bounce_pfn().  Can you please do a bit of
> > investigation to work out if this is what is happening?  Find out why
> > __blk_queue_bounce() decided to bounce a page which shouldn't have been
> > bounced?
> 
> That sure is strange.  I didn't see any obvious reasons why we'd end up with a
> kmalloc above queue_bounce_pfn().  But then I don't have any arm64s either.
> 
> > This is all terribly fragile :( afaict if someone sets
> > bdi_cap_stable_pages_required() against that jbd2 queue, we're going to
> > hit that BUG_ON() again, via must_snapshot_stable_pages()'s
> > page_mapping() call.  (Darrick, this means you ;))
> 
> Wheeee.  You're right, we shouldn't be calling page_mapping on slab pages.
> We can keep walking the bio segments to find a non-slab page that can tell us
> MS_SNAP_STABLE is set, since we probably won't need the bounce buffer anyway.
> 
> How does something like this look?  (+ the patch above)
  Umm, this won't quite work. We can have a bio which has just a PageSlab
page attached, and so you won't be able to get to the superblock. Heh, isn't
the whole page_mapping() thing in must_snapshot_stable_pages() wrong? When we
do direct IO, these pages come directly from userspace and hell knows where
they come from. Definitely their page_mapping() doesn't give us anything
useful... Sorry for not realizing this earlier when reviewing the patch.

... remembering why we need to get to sb and why ext3 needs this ... So
maybe a better solution would be to have a bio flag meaning that pages need
bouncing? And we would set it from filesystems that need it - in case of
ext3 only writeback of data from kjournald actually needs to bounce the
pages. Thoughts?
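(A minimal sketch of that idea, assuming a hypothetical BIO_SNAP_STABLE-style
bio flag; the flag name and the exact helpers here are illustrative, not an
existing API.)

/* Filesystem side, e.g. kjournald/jbd2 submitting journalled data: */
	bio->bi_flags |= 1UL << BIO_SNAP_STABLE;
	submit_bio(WRITE, bio);

/* Bounce side: trust the flag instead of chasing page_mapping(): */
static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
{
	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
		return 0;

	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
}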

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-13  4:11         ` Andrew Morton
  (?)
@ 2013-03-13  9:42           ` Russell King - ARM Linux
  -1 siblings, 0 replies; 74+ messages in thread
From: Russell King - ARM Linux @ 2013-03-13  9:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shuge, Jens Axboe, Jan Kara, Darrick J. Wong, Catalin Marinas,
	Will Deacon, linux-kernel, linux-mm, Theodore Ts'o,
	linux-ext4, Kevin, linux-arm-kernel

On Tue, Mar 12, 2013 at 09:11:38PM -0700, Andrew Morton wrote:
> Please reread my email.  The page at b_frozen_data was allocated with
> GFP_NOFS.  Hence it should not need bounce treatment (if arm is
> anything like x86).
> 
> And yet it *did* receive bounce treatment.  Why?

If I had to guess, it's because you've uncovered a bug in the utter crap
that we call a "dma mask".

When is a mask not a mask?  When it is used as a numerical limit.  When
is a mask really a mask?  When it indicates which bits are significant in
a DMA address.

The problem here is that there's a duality in the way the mask is used,
and that is caused by memory on x86 always starting at physical address
zero.  The problem is this:

On ARM, we have some platforms which offset the start of physical memory.
This offset can be significant - maybe 3GB.  However, only a limited
amount of that memory may be DMA-able.  So, we may end up with the
maximum physical address of DMA-able memory being 3GB + 64MB for example,
or 0xc4000000, because the DMA controller only has 26 address lines.  So,
this brings up the problem of whether we set the DMA mask to 0xc3ffffff
or 0x03ffffff.

There are places in the kernel which assume that DMA masks are a set of
zero bits followed by a set of one bits, and nothing else does...

Now, max_low_pfn is initialized this way:

/**
 * init_bootmem - register boot memory
 * @start: pfn where the bitmap is to be placed
 * @pages: number of available physical pages
 *
 * Returns the number of bytes needed to hold the bitmap.
 */
unsigned long __init init_bootmem(unsigned long start, unsigned long pages)
{
        max_low_pfn = pages;
        min_low_pfn = start;
        return init_bootmem_core(NODE_DATA(0)->bdata, start, 0, pages);
}

So, min_low_pfn is the PFN offset of the start of physical memory (so
3GB >> PAGE_SHIFT) and max_low_pfn ends up being the number of pages,
_not_ the maximum PFN value - if it were to be the maximum PFN value,
then we end up with a _huge_ bootmem bitmap which may not even fit in
the available memory we have.

However, other places in the kernel treat max_low_pfn entirely
differently:

        blk_max_low_pfn = max_low_pfn - 1;

void blk_queue_bounce_limit(struct request_queue *q, u64 dma_mask)
{
        unsigned long b_pfn = dma_mask >> PAGE_SHIFT;

        if (b_pfn < blk_max_low_pfn)
                dma = 1;
        q->limits.bounce_pfn = b_pfn;

And then we have stuff doing this:

	page_to_pfn(bv->bv_page) > queue_bounce_pfn(q);
                if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
                if (queue_bounce_pfn(q) >= blk_max_pfn && !must_bounce)

So, "max_low_pfn" is totally and utterly confused in the kernel as to
what it is, and it only really works on x86 (and other architectures)
that start their memory at physical address 0 (because then it doesn't
matter how you interpret it.)
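To make the arithmetic concrete, here is a small user-space sketch using the
hypothetical platform above (RAM starting at 3GB, a 26-address-line DMA
controller, 4K pages); the numbers are purely illustrative:

#include <stdio.h>

#define PAGE_SHIFT 12UL

int main(void)
{
	unsigned long first_ram_pfn = 0xc0000000UL >> PAGE_SHIFT;	/* RAM starts at 3GB */

	/* "mask of significant bits" reading vs "maximum DMA address" reading */
	unsigned long bounce_pfn_mask  = 0x03ffffffUL >> PAGE_SHIFT;
	unsigned long bounce_pfn_limit = 0xc3ffffffUL >> PAGE_SHIFT;

	printf("first RAM pfn:              0x%lx\n", first_ram_pfn);	/* 0xc0000 */
	printf("bounce_pfn (mask reading):  0x%lx\n", bounce_pfn_mask);	/* 0x3fff  */
	printf("bounce_pfn (limit reading): 0x%lx\n", bounce_pfn_limit);	/* 0xc3fff */

	/* With the "mask" reading, every RAM pfn is above bounce_pfn, so
	 * __blk_queue_bounce() bounces every page, including jbd2's
	 * kmalloc()ed buffer that tripped the VM_BUG_ON. */
	return 0;
}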

So the whole thing about "is a DMA mask a mask or a maximum address"
is totally confused in the kernel in such a way that platforms like ARM
get a very hard time, and what we now have in place has worked 100%
fine for all the platforms we've had for the last 10+ years.

It's a very longstanding bug in the kernel, going all the way back to
2.2 days or so.

What to do about it, I have no idea - changing to satisfy the "DMA mask
is a maximum address" is likely to break things.  What we need is a
proper fix, and a consistent way to interpret DMA masks which works not
only on x86, but also on platforms which have limited DMA to memory
which has huge physical offsets.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-13  8:50       ` Jan Kara
  (?)
@ 2013-03-13 19:44         ` Darrick J. Wong
  -1 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-13 19:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel

On Wed, Mar 13, 2013 at 09:50:21AM +0100, Jan Kara wrote:
> On Tue 12-03-13 18:10:20, Darrick J. Wong wrote:
> > On Tue, Mar 12, 2013 at 03:32:21PM -0700, Andrew Morton wrote:
> > > On Fri, 08 Mar 2013 20:37:36 +0800 Shuge <shugelinux@gmail.com> wrote:
> > > 
> > > > The bounce accept slab pages from jbd2, and flush dcache on them.
> > > > When enabling VM_DEBUG, it will tigger VM_BUG_ON in page_mapping().
> > > > So, check PageSlab to avoid it in __blk_queue_bounce().
> > > > 
> > > > Bug URL: http://lkml.org/lkml/2013/3/7/56
> > > > 
> > > > ...
> > > >
> > > > --- a/mm/bounce.c
> > > > +++ b/mm/bounce.c
> > > > @@ -214,7 +214,8 @@ static void __blk_queue_bounce(struct request_queue 
> > > > *q, struct bio **bio_orig,
> > > >   		if (rw == WRITE) {
> > > >   			char *vto, *vfrom;
> > > >   -			flush_dcache_page(from->bv_page);
> > > > +			if (unlikely(!PageSlab(from->bv_page)))
> > > > +				flush_dcache_page(from->bv_page);
> > > >   			vto = page_address(to->bv_page) + to->bv_offset;
> > > >   			vfrom = kmap(from->bv_page) + from->bv_offset;
> > > >   			memcpy(vto, vfrom, to->bv_len);
> > > 
> > > I guess this is triggered by Catalin's f1a0c4aa0937975b ("arm64: Cache
> > > maintenance routines"), which added a page_mapping() call to arm64's
> > > arch/arm64/mm/flush.c:flush_dcache_page().
> > > 
> > > What's happening is that jbd2 is using kmalloc() to allocate buffer_head
> > > data.  That gets submitted down the BIO layer and __blk_queue_bounce()
> > > calls flush_dcache_page() which in the arm64 case calls page_mapping()
> > > and page_mapping() does VM_BUG_ON(PageSlab) and splat.
> > > 
> > > The unusual thing about all of this is that the payload for some disk
> > > IO is coming from kmalloc, rather than being a user page.  It's oddball
> > > but we've done this for ages and should continue to support it.
> > > 
> > > 
> > > Now, the page from kmalloc() cannot be in highmem, so why did the
> > > bounce code decide to bounce it?
> > > 
> > > __blk_queue_bounce() does
> > > 
> > > 		/*
> > > 		 * is destination page below bounce pfn?
> > > 		 */
> > > 		if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
> > > 			continue;
> > > 
> > > and `force' comes from must_snapshot_stable_pages().  But
> > > must_snapshot_stable_pages() must have returned false, because if it
> > > had returned true then it would have been must_snapshot_stable_pages()
> > > which went BUG, because must_snapshot_stable_pages() calls page_mapping().
> > > 
> > > So my tentative diagosis is that arm64 is fishy.  A page which was
> > > allocated via jbd2_alloc(GFP_NOFS)->kmem_cache_alloc() ended up being
> > > above arm64's queue_bounce_pfn().  Can you please do a bit of
> > > investigation to work out if this is what is happening?  Find out why
> > > __blk_queue_bounce() decided to bounce a page which shouldn't have been
> > > bounced?
> > 
> > That sure is strange.  I didn't see any obvious reasons why we'd end up with a
> > kmalloc above queue_bounce_pfn().  But then I don't have any arm64s either.
> > 
> > > This is all terribly fragile :( afaict if someone sets
> > > bdi_cap_stable_pages_required() against that jbd2 queue, we're going to
> > > hit that BUG_ON() again, via must_snapshot_stable_pages()'s
> > > page_mapping() call.  (Darrick, this means you ;))
> > 
> > Wheeee.  You're right, we shouldn't be calling page_mapping on slab pages.
> > We can keep walking the bio segments to find a non-slab page that can tell us
> > MS_SNAP_STABLE is set, since we probably won't need the bounce buffer anyway.
> > 
> > How does something like this look?  (+ the patch above)
>   Umm, this won't quite work. We can have a bio which has just PageSlab
> page attached and so you won't be able to get to the superblock. Heh, isn't
> the whole page_mapping() thing in must_snapshot_stable_pages() wrong? When we
> do direct IO, these pages come directly from userspace and hell knows where
> they come from. Definitely their page_mapping() doesn't give us anything
> useful... Sorry for not realizing this earlier when reviewing the patch.
> 
> ... remembering why we need to get to sb and why ext3 needs this ... So
> maybe a better solution would be to have a bio flag meaning that pages need
> bouncing? And we would set it from filesystems that need it - in case of
> ext3 only writeback of data from kjournald actually needs to bounce the
> pages. Thoughts?

What about dirty pages that don't result in journal transactions?  I think
ext3_sync_file() eventually calls ext3_ordered_writepage, which then calls
__block_write_full_page, which in turn calls submit_bh().

--D
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-13 19:44         ` Darrick J. Wong
  (?)
@ 2013-03-13 21:02           ` Jan Kara
  -1 siblings, 0 replies; 74+ messages in thread
From: Jan Kara @ 2013-03-13 21:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, Andrew Morton, Shuge, linux-kernel, linux-mm,
	linux-ext4, Kevin, Theodore Ts'o, Jens Axboe,
	Catalin Marinas, Will Deacon, linux-arm-kernel

On Wed 13-03-13 12:44:29, Darrick J. Wong wrote:
> On Wed, Mar 13, 2013 at 09:50:21AM +0100, Jan Kara wrote:
> > On Tue 12-03-13 18:10:20, Darrick J. Wong wrote:
> > > On Tue, Mar 12, 2013 at 03:32:21PM -0700, Andrew Morton wrote:
> > > > On Fri, 08 Mar 2013 20:37:36 +0800 Shuge <shugelinux@gmail.com> wrote:
> > > > 
> > > > > The bounce accept slab pages from jbd2, and flush dcache on them.
> > > > > When enabling VM_DEBUG, it will tigger VM_BUG_ON in page_mapping().
> > > > > So, check PageSlab to avoid it in __blk_queue_bounce().
> > > > > 
> > > > > Bug URL: http://lkml.org/lkml/2013/3/7/56
> > > > > 
> > > > > ...
> > > > >
> > > > > --- a/mm/bounce.c
> > > > > +++ b/mm/bounce.c
> > > > > @@ -214,7 +214,8 @@ static void __blk_queue_bounce(struct request_queue 
> > > > > *q, struct bio **bio_orig,
> > > > >   		if (rw == WRITE) {
> > > > >   			char *vto, *vfrom;
> > > > >   -			flush_dcache_page(from->bv_page);
> > > > > +			if (unlikely(!PageSlab(from->bv_page)))
> > > > > +				flush_dcache_page(from->bv_page);
> > > > >   			vto = page_address(to->bv_page) + to->bv_offset;
> > > > >   			vfrom = kmap(from->bv_page) + from->bv_offset;
> > > > >   			memcpy(vto, vfrom, to->bv_len);
> > > > 
> > > > I guess this is triggered by Catalin's f1a0c4aa0937975b ("arm64: Cache
> > > > maintenance routines"), which added a page_mapping() call to arm64's
> > > > arch/arm64/mm/flush.c:flush_dcache_page().
> > > > 
> > > > What's happening is that jbd2 is using kmalloc() to allocate buffer_head
> > > > data.  That gets submitted down the BIO layer and __blk_queue_bounce()
> > > > calls flush_dcache_page() which in the arm64 case calls page_mapping()
> > > > and page_mapping() does VM_BUG_ON(PageSlab) and splat.
> > > > 
> > > > The unusual thing about all of this is that the payload for some disk
> > > > IO is coming from kmalloc, rather than being a user page.  It's oddball
> > > > but we've done this for ages and should continue to support it.
> > > > 
> > > > 
> > > > Now, the page from kmalloc() cannot be in highmem, so why did the
> > > > bounce code decide to bounce it?
> > > > 
> > > > __blk_queue_bounce() does
> > > > 
> > > > 		/*
> > > > 		 * is destination page below bounce pfn?
> > > > 		 */
> > > > 		if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
> > > > 			continue;
> > > > 
> > > > and `force' comes from must_snapshot_stable_pages().  But
> > > > must_snapshot_stable_pages() must have returned false, because if it
> > > > had returned true then it would have been must_snapshot_stable_pages()
> > > > which went BUG, because must_snapshot_stable_pages() calls page_mapping().
> > > > 
> > > > So my tentative diagosis is that arm64 is fishy.  A page which was
> > > > allocated via jbd2_alloc(GFP_NOFS)->kmem_cache_alloc() ended up being
> > > > above arm64's queue_bounce_pfn().  Can you please do a bit of
> > > > investigation to work out if this is what is happening?  Find out why
> > > > __blk_queue_bounce() decided to bounce a page which shouldn't have been
> > > > bounced?
> > > 
> > > That sure is strange.  I didn't see any obvious reasons why we'd end up with a
> > > kmalloc above queue_bounce_pfn().  But then I don't have any arm64s either.
> > > 
> > > > This is all terribly fragile :( afaict if someone sets
> > > > bdi_cap_stable_pages_required() against that jbd2 queue, we're going to
> > > > hit that BUG_ON() again, via must_snapshot_stable_pages()'s
> > > > page_mapping() call.  (Darrick, this means you ;))
> > > 
> > > Wheeee.  You're right, we shouldn't be calling page_mapping on slab pages.
> > > We can keep walking the bio segments to find a non-slab page that can tell us
> > > MS_SNAP_STABLE is set, since we probably won't need the bounce buffer anyway.
> > > 
> > > How does something like this look?  (+ the patch above)
> >   Umm, this won't quite work. We can have a bio which has just PageSlab
> > page attached and so you won't be able to get to the superblock. Heh, isn't
> > the whole page_mapping() thing in must_snapshot_stable_pages() wrong? When we
> > do direct IO, these pages come directly from userspace and hell knows where
> > they come from. Definitely their page_mapping() doesn't give us anything
> > useful... Sorry for not realizing this earlier when reviewing the patch.
> > 
> > ... remembering why we need to get to sb and why ext3 needs this ... So
> > maybe a better solution would be to have a bio flag meaning that pages need
> > bouncing? And we would set it from filesystems that need it - in case of
> > ext3 only writeback of data from kjournald actually needs to bounce the
> > pages. Thoughts?
> 
> What about dirty pages that don't result in journal transactions?  I think
> ext3_sync_file() eventually calls ext3_ordered_writepage, which then calls
> __block_write_full_page, which in turn calls submit_bh().
  So here we have two options:
Either we let ext3 wait the same way as other filesystems when stable pages
are required. Then only data IO from kjournald needs to be bounced (all
other IO is properly protected by the PageWriteback bit).

Or we don't let ext3 wait (as it is now), keep the superblock flag saying the
fs needs bouncing, and set the bio flag in __block_write_full_page() and
kjournald based on the sb flag.

I think the first option is slightly better but I don't feel strongly
about that.
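
To sketch what the second option would look like in code (illustration
only - nothing below is from an actual patch, the helper name is made
up, and BIO_SNAP_STABLE / _submit_bh() are the names introduced by the
patch later in this thread):

/* Keep MS_SNAP_STABLE on the superblock; __block_write_full_page() and
 * kjournald would then pass stable_write_flags(sb) as the extra flags
 * argument of _submit_bh(). */
static unsigned long stable_write_flags(struct super_block *sb)
{
	return (sb->s_flags & MS_SNAP_STABLE) ? (1UL << BIO_SNAP_STABLE) : 0;
}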

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-13 21:02           ` Jan Kara
  (?)
@ 2013-03-14 22:42             ` Darrick J. Wong
  -1 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-14 22:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel

On Wed, Mar 13, 2013 at 10:02:16PM +0100, Jan Kara wrote:
> On Wed 13-03-13 12:44:29, Darrick J. Wong wrote:
> > On Wed, Mar 13, 2013 at 09:50:21AM +0100, Jan Kara wrote:
> > > On Tue 12-03-13 18:10:20, Darrick J. Wong wrote:
> > > > On Tue, Mar 12, 2013 at 03:32:21PM -0700, Andrew Morton wrote:
> > > > > On Fri, 08 Mar 2013 20:37:36 +0800 Shuge <shugelinux@gmail.com> wrote:
> > > > > 
> > > > > > The bounce accept slab pages from jbd2, and flush dcache on them.
> > > > > > When enabling VM_DEBUG, it will tigger VM_BUG_ON in page_mapping().
> > > > > > So, check PageSlab to avoid it in __blk_queue_bounce().
> > > > > > 
> > > > > > Bug URL: http://lkml.org/lkml/2013/3/7/56
> > > > > > 
> > > > > > ...
> > > > > >
> > > > > > --- a/mm/bounce.c
> > > > > > +++ b/mm/bounce.c
> > > > > > @@ -214,7 +214,8 @@ static void __blk_queue_bounce(struct request_queue 
> > > > > > *q, struct bio **bio_orig,
> > > > > >   		if (rw == WRITE) {
> > > > > >   			char *vto, *vfrom;
> > > > > >   -			flush_dcache_page(from->bv_page);
> > > > > > +			if (unlikely(!PageSlab(from->bv_page)))
> > > > > > +				flush_dcache_page(from->bv_page);
> > > > > >   			vto = page_address(to->bv_page) + to->bv_offset;
> > > > > >   			vfrom = kmap(from->bv_page) + from->bv_offset;
> > > > > >   			memcpy(vto, vfrom, to->bv_len);
> > > > > 
> > > > > I guess this is triggered by Catalin's f1a0c4aa0937975b ("arm64: Cache
> > > > > maintenance routines"), which added a page_mapping() call to arm64's
> > > > > arch/arm64/mm/flush.c:flush_dcache_page().
> > > > > 
> > > > > What's happening is that jbd2 is using kmalloc() to allocate buffer_head
> > > > > data.  That gets submitted down the BIO layer and __blk_queue_bounce()
> > > > > calls flush_dcache_page() which in the arm64 case calls page_mapping()
> > > > > and page_mapping() does VM_BUG_ON(PageSlab) and splat.
> > > > > 
> > > > > The unusual thing about all of this is that the payload for some disk
> > > > > IO is coming from kmalloc, rather than being a user page.  It's oddball
> > > > > but we've done this for ages and should continue to support it.
> > > > > 
> > > > > 
> > > > > Now, the page from kmalloc() cannot be in highmem, so why did the
> > > > > bounce code decide to bounce it?
> > > > > 
> > > > > __blk_queue_bounce() does
> > > > > 
> > > > > 		/*
> > > > > 		 * is destination page below bounce pfn?
> > > > > 		 */
> > > > > 		if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
> > > > > 			continue;
> > > > > 
> > > > > and `force' comes from must_snapshot_stable_pages().  But
> > > > > must_snapshot_stable_pages() must have returned false, because if it
> > > > > had returned true then it would have been must_snapshot_stable_pages()
> > > > > which went BUG, because must_snapshot_stable_pages() calls page_mapping().
> > > > > 
> > > > > So my tentative diagosis is that arm64 is fishy.  A page which was
> > > > > allocated via jbd2_alloc(GFP_NOFS)->kmem_cache_alloc() ended up being
> > > > > above arm64's queue_bounce_pfn().  Can you please do a bit of
> > > > > investigation to work out if this is what is happening?  Find out why
> > > > > __blk_queue_bounce() decided to bounce a page which shouldn't have been
> > > > > bounced?
> > > > 
> > > > That sure is strange.  I didn't see any obvious reasons why we'd end up with a
> > > > kmalloc above queue_bounce_pfn().  But then I don't have any arm64s either.
> > > > 
> > > > > This is all terribly fragile :( afaict if someone sets
> > > > > bdi_cap_stable_pages_required() against that jbd2 queue, we're going to
> > > > > hit that BUG_ON() again, via must_snapshot_stable_pages()'s
> > > > > page_mapping() call.  (Darrick, this means you ;))
> > > > 
> > > > Wheeee.  You're right, we shouldn't be calling page_mapping on slab pages.
> > > > We can keep walking the bio segments to find a non-slab page that can tell us
> > > > MS_SNAP_STABLE is set, since we probably won't need the bounce buffer anyway.
> > > > 
> > > > How does something like this look?  (+ the patch above)
> > >   Umm, this won't quite work. We can have a bio which has just PageSlab
> > > page attached and so you won't be able to get to the superblock. Heh, isn't
> > > the whole page_mapping() thing in must_snapshot_stable_pages() wrong? When we
> > > do direct IO, these pages come directly from userspace and hell knows where
> > > they come from. Definitely their page_mapping() doesn't give us anything
> > > useful... Sorry for not realizing this earlier when reviewing the patch.
> > > 
> > > ... remembering why we need to get to sb and why ext3 needs this ... So
> > > maybe a better solution would be to have a bio flag meaning that pages need
> > > bouncing? And we would set it from filesystems that need it - in case of
> > > ext3 only writeback of data from kjournald actually needs to bounce the
> > > pages. Thoughts?
> > 
> > What about dirty pages that don't result in journal transactions?  I think
> > ext3_sync_file() eventually calls ext3_ordered_writepage, which then calls
> > __block_write_full_page, which in turn calls submit_bh().
>   So here we have two options:
> Either we let ext3 wait the same way as other filesystems when stable pages
> are required. Then only data IO from kjournald needs to be bounced (all
> other IO is properly protected by PageWriteback bit).
> 
> Or we won't let ext3 wait (as it is now), keep the superblock flag that fs
> needs bouncing, and set the bio flag in __block_write_full_page() and
> kjournald based on the sb flag.
> 
> I think the first option is slightly better but I don't feel strongly
> about that.

I like that first option -- it confines the kludgery to jbd instead of
spreading it around.  Here's a patch that passes a quick smoke test on ext[34],
xfs, and vfat.  What do you think of this one?  Should I create a
submit_snapshot_bh() instead of letting callers stuff in arbitrary dangerous
bio flags?
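
(Purely to illustrate that last question - a wrapper along these lines,
name hypothetical and not part of the patch below, would be enough:)

/* Hypothetical helper: callers ask for snapshotting explicitly instead
 * of passing raw bio flags into _submit_bh(). */
int submit_snapshot_bh(int rw, struct buffer_head *bh)
{
	return _submit_bh(rw, bh, 1 << BIO_SNAP_STABLE);
}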

--D
---
From: Darrick J. Wong <darrick.wong@oracle.com>
Subject: [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation

Walking a bio's page mappings has proved problematic, so create a new bio flag
to indicate that a bio's data needs to be snapshotted in order to guarantee
stable pages during writeback.  Next, for the one user (ext3/jbd) of
snapshotting, hook all the places where writes can be initiated without
PG_writeback set, and set BIO_SNAP_STABLE there.  Finally, the MS_SNAP_STABLE
mount flag (only used by ext3) is now superfluous, so get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/buffer.c                 |    9 ++++++++-
 fs/ext3/super.c             |    1 -
 fs/jbd/commit.c             |    4 ++--
 include/linux/blk_types.h   |    3 ++-
 include/linux/buffer_head.h |    1 +
 include/uapi/linux/fs.h     |    1 -
 mm/bounce.c                 |   21 +--------------------
 mm/page-writeback.c         |    4 ----
 8 files changed, 14 insertions(+), 30 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b4dcb34..8c1c21a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
 	}
 }
 
-int submit_bh(int rw, struct buffer_head * bh)
+int _submit_bh(int rw, struct buffer_head * bh, unsigned long flags)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
 
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
+	bio->bi_flags |= flags;
 
 	/* Take care of bh's that straddle the end of the device */
 	guard_bh_eod(rw, bio, bh);
@@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
 	bio_put(bio);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(_submit_bh);
+
+int submit_bh(int rw, struct buffer_head * bh)
+{
+	return _submit_bh(rw, bh, 0);
+}
 EXPORT_SYMBOL(submit_bh);
 
 /**
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 1d6e2ed..4fff1b7 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2067,7 +2067,6 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
 		"writeback");
-	sb->s_flags |= MS_SNAP_STABLE;
 
 	return 0;
 
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 86b39b1..b91b688 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -163,7 +163,7 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
 	for (i = 0; i < bufs; i++) {
 		wbuf[i]->b_end_io = end_buffer_write_sync;
 		/* We use-up our safety reference in submit_bh() */
-		submit_bh(write_op, wbuf[i]);
+		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
 	}
 }
 
@@ -667,7 +667,7 @@ start_journal_io:
 				clear_buffer_dirty(bh);
 				set_buffer_uptodate(bh);
 				bh->b_end_io = journal_end_buffer_io_sync;
-				submit_bh(write_op, bh);
+				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
 			}
 			cond_resched();
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cdf1119..22990cf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -111,12 +111,13 @@ struct bio {
 #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
 #define BIO_QUIET	10	/* Make BIO Quiet */
 #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
 
 /*
  * Flags starting here get preserved by bio_reset() - this includes
  * BIO_POOL_IDX()
  */
-#define BIO_RESET_BITS	12
+#define BIO_RESET_BITS	13
 
 #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
 
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 5afc4f9..714d5d9 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
 int sync_dirty_buffer(struct buffer_head *bh);
 int __sync_dirty_buffer(struct buffer_head *bh, int rw);
 void write_dirty_buffer(struct buffer_head *bh, int rw);
+int _submit_bh(int, struct buffer_head *, unsigned long);
 int submit_bh(int, struct buffer_head *);
 void write_boundary_block(struct block_device *bdev,
 			sector_t bblock, unsigned blocksize);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index c7fc1e6..a4ed56c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -88,7 +88,6 @@ struct inodes_stat_t {
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
 
 /* These sb flags are internal to the kernel */
-#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
 #define MS_NOSEC	(1<<28)
 #define MS_BORN		(1<<29)
 #define MS_ACTIVE	(1<<30)
diff --git a/mm/bounce.c b/mm/bounce.c
index 5f89017..a5c2ec3 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
 #ifdef CONFIG_NEED_BOUNCE_POOL
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
 {
-	struct page *page;
-	struct backing_dev_info *bdi;
-	struct address_space *mapping;
-	struct bio_vec *from;
-	int i;
-
 	if (bio_data_dir(bio) != WRITE)
 		return 0;
 
 	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
 		return 0;
 
-	/*
-	 * Based on the first page that has a valid mapping, decide whether or
-	 * not we have to employ bounce buffering to guarantee stable pages.
-	 */
-	bio_for_each_segment(from, bio, i) {
-		page = from->bv_page;
-		mapping = page_mapping(page);
-		if (!mapping)
-			continue;
-		bdi = mapping->backing_dev_info;
-		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
-	}
-
-	return 0;
+	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
 }
 #else
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index efe6814..4514ad7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
 
 	if (!bdi_cap_stable_pages_required(bdi))
 		return;
-#ifdef CONFIG_NEED_BOUNCE_POOL
-	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
-		return;
-#endif /* CONFIG_NEED_BOUNCE_POOL */
 
 	wait_on_page_writeback(page);
 }

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
@ 2013-03-14 22:42             ` Darrick J. Wong
  0 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-14 22:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel

On Wed, Mar 13, 2013 at 10:02:16PM +0100, Jan Kara wrote:
> On Wed 13-03-13 12:44:29, Darrick J. Wong wrote:
> > On Wed, Mar 13, 2013 at 09:50:21AM +0100, Jan Kara wrote:
> > > On Tue 12-03-13 18:10:20, Darrick J. Wong wrote:
> > > > On Tue, Mar 12, 2013 at 03:32:21PM -0700, Andrew Morton wrote:
> > > > > On Fri, 08 Mar 2013 20:37:36 +0800 Shuge <shugelinux@gmail.com> wrote:
> > > > > 
> > > > > > The bounce accept slab pages from jbd2, and flush dcache on them.
> > > > > > When enabling VM_DEBUG, it will tigger VM_BUG_ON in page_mapping().
> > > > > > So, check PageSlab to avoid it in __blk_queue_bounce().
> > > > > > 
> > > > > > Bug URL: http://lkml.org/lkml/2013/3/7/56
> > > > > > 
> > > > > > ...
> > > > > >
> > > > > > --- a/mm/bounce.c
> > > > > > +++ b/mm/bounce.c
> > > > > > @@ -214,7 +214,8 @@ static void __blk_queue_bounce(struct request_queue 
> > > > > > *q, struct bio **bio_orig,
> > > > > >   		if (rw == WRITE) {
> > > > > >   			char *vto, *vfrom;
> > > > > >   -			flush_dcache_page(from->bv_page);
> > > > > > +			if (unlikely(!PageSlab(from->bv_page)))
> > > > > > +				flush_dcache_page(from->bv_page);
> > > > > >   			vto = page_address(to->bv_page) + to->bv_offset;
> > > > > >   			vfrom = kmap(from->bv_page) + from->bv_offset;
> > > > > >   			memcpy(vto, vfrom, to->bv_len);
> > > > > 
> > > > > I guess this is triggered by Catalin's f1a0c4aa0937975b ("arm64: Cache
> > > > > maintenance routines"), which added a page_mapping() call to arm64's
> > > > > arch/arm64/mm/flush.c:flush_dcache_page().
> > > > > 
> > > > > What's happening is that jbd2 is using kmalloc() to allocate buffer_head
> > > > > data.  That gets submitted down the BIO layer and __blk_queue_bounce()
> > > > > calls flush_dcache_page() which in the arm64 case calls page_mapping()
> > > > > and page_mapping() does VM_BUG_ON(PageSlab) and splat.
> > > > > 
> > > > > The unusual thing about all of this is that the payload for some disk
> > > > > IO is coming from kmalloc, rather than being a user page.  It's oddball
> > > > > but we've done this for ages and should continue to support it.
> > > > > 
> > > > > 
> > > > > Now, the page from kmalloc() cannot be in highmem, so why did the
> > > > > bounce code decide to bounce it?
> > > > > 
> > > > > __blk_queue_bounce() does
> > > > > 
> > > > > 		/*
> > > > > 		 * is destination page below bounce pfn?
> > > > > 		 */
> > > > > 		if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
> > > > > 			continue;
> > > > > 
> > > > > and `force' comes from must_snapshot_stable_pages().  But
> > > > > must_snapshot_stable_pages() must have returned false, because if it
> > > > > had returned true then it would have been must_snapshot_stable_pages()
> > > > > which went BUG, because must_snapshot_stable_pages() calls page_mapping().
> > > > > 
> > > > > So my tentative diagosis is that arm64 is fishy.  A page which was
> > > > > allocated via jbd2_alloc(GFP_NOFS)->kmem_cache_alloc() ended up being
> > > > > above arm64's queue_bounce_pfn().  Can you please do a bit of
> > > > > investigation to work out if this is what is happening?  Find out why
> > > > > __blk_queue_bounce() decided to bounce a page which shouldn't have been
> > > > > bounced?
> > > > 
> > > > That sure is strange.  I didn't see any obvious reasons why we'd end up with a
> > > > kmalloc above queue_bounce_pfn().  But then I don't have any arm64s either.
> > > > 
> > > > > This is all terribly fragile :( afaict if someone sets
> > > > > bdi_cap_stable_pages_required() against that jbd2 queue, we're going to
> > > > > hit that BUG_ON() again, via must_snapshot_stable_pages()'s
> > > > > page_mapping() call.  (Darrick, this means you ;))
> > > > 
> > > > Wheeee.  You're right, we shouldn't be calling page_mapping on slab pages.
> > > > We can keep walking the bio segments to find a non-slab page that can tell us
> > > > MS_SNAP_STABLE is set, since we probably won't need the bounce buffer anyway.
> > > > 
> > > > How does something like this look?  (+ the patch above)
> > >   Umm, this won't quite work. We can have a bio which has just PageSlab
> > > page attached and so you won't be able to get to the superblock. Heh, isn't
> > > the whole page_mapping() thing in must_snapshot_stable_pages() wrong? When we
> > > do direct IO, these pages come directly from userspace and hell knows where
> > > they come from. Definitely their page_mapping() doesn't give us anything
> > > useful... Sorry for not realizing this earlier when reviewing the patch.
> > > 
> > > ... remembering why we need to get to sb and why ext3 needs this ... So
> > > maybe a better solution would be to have a bio flag meaning that pages need
> > > bouncing? And we would set it from filesystems that need it - in case of
> > > ext3 only writeback of data from kjournald actually needs to bounce the
> > > pages. Thoughts?
> > 
> > What about dirty pages that don't result in journal transactions?  I think
> > ext3_sync_file() eventually calls ext3_ordered_writepage, which then calls
> > __block_write_full_page, which in turn calls submit_bh().
>   So here we have two options:
> Either we let ext3 wait the same way as other filesystems when stable pages
> are required. Then only data IO from kjournald needs to be bounced (all
> other IO is properly protected by PageWriteback bit).
> 
> Or we won't let ext3 wait (as it is now), keep the superblock flag that fs
> needs bouncing, and set the bio flag in __block_write_full_page() and
> kjournald based on the sb flag.
> 
> I think the first option is slightly better but I don't feel strongly
> about that.

I like that first option -- it contains the kludgery to jbd instead of
spreading it around.  Here's a patch that passes a quick smoke test on ext[34],
xfs, and vfat.  What do you think of this one?  Should I create a
submit_snapshot_bh() instead of letting callers stuff in arbitrary dangerous
BH_ flags?

--D
---
From: Darrick J. Wong <darrick.wong@oracle.com>
Subject: [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation

Walking a bio's page mappings has proved problematic, so create a new bio flag
to indicate that a bio's data needs to be snapshotted in order to guarantee
stable pages during writeback.  Next, for the one user (ext3/jbd) of
snapshotting, hook all the places where writes can be initiated without
PG_writeback set, and set BIO_SNAP_STABLE there.  Finally, the MS_SNAP_STABLE
mount flag (only used by ext3) is now superfluous, so get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/buffer.c                 |    9 ++++++++-
 fs/ext3/super.c             |    1 -
 fs/jbd/commit.c             |    4 ++--
 include/linux/blk_types.h   |    3 ++-
 include/linux/buffer_head.h |    1 +
 include/uapi/linux/fs.h     |    1 -
 mm/bounce.c                 |   21 +--------------------
 mm/page-writeback.c         |    4 ----
 8 files changed, 14 insertions(+), 30 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b4dcb34..8c1c21a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
 	}
 }
 
-int submit_bh(int rw, struct buffer_head * bh)
+int _submit_bh(int rw, struct buffer_head * bh, unsigned long flags)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
 
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
+	bio->bi_flags |= flags;
 
 	/* Take care of bh's that straddle the end of the device */
 	guard_bh_eod(rw, bio, bh);
@@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
 	bio_put(bio);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(_submit_bh);
+
+int submit_bh(int rw, struct buffer_head * bh)
+{
+	return _submit_bh(rw, bh, 0);
+}
 EXPORT_SYMBOL(submit_bh);
 
 /**
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 1d6e2ed..4fff1b7 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2067,7 +2067,6 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
 		"writeback");
-	sb->s_flags |= MS_SNAP_STABLE;
 
 	return 0;
 
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 86b39b1..b91b688 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -163,7 +163,7 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
 	for (i = 0; i < bufs; i++) {
 		wbuf[i]->b_end_io = end_buffer_write_sync;
 		/* We use-up our safety reference in submit_bh() */
-		submit_bh(write_op, wbuf[i]);
+		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
 	}
 }
 
@@ -667,7 +667,7 @@ start_journal_io:
 				clear_buffer_dirty(bh);
 				set_buffer_uptodate(bh);
 				bh->b_end_io = journal_end_buffer_io_sync;
-				submit_bh(write_op, bh);
+				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
 			}
 			cond_resched();
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cdf1119..22990cf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -111,12 +111,13 @@ struct bio {
 #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
 #define BIO_QUIET	10	/* Make BIO Quiet */
 #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
 
 /*
  * Flags starting here get preserved by bio_reset() - this includes
  * BIO_POOL_IDX()
  */
-#define BIO_RESET_BITS	12
+#define BIO_RESET_BITS	13
 
 #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
 
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 5afc4f9..714d5d9 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
 int sync_dirty_buffer(struct buffer_head *bh);
 int __sync_dirty_buffer(struct buffer_head *bh, int rw);
 void write_dirty_buffer(struct buffer_head *bh, int rw);
+int _submit_bh(int, struct buffer_head *, unsigned long);
 int submit_bh(int, struct buffer_head *);
 void write_boundary_block(struct block_device *bdev,
 			sector_t bblock, unsigned blocksize);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index c7fc1e6..a4ed56c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -88,7 +88,6 @@ struct inodes_stat_t {
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
 
 /* These sb flags are internal to the kernel */
-#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
 #define MS_NOSEC	(1<<28)
 #define MS_BORN		(1<<29)
 #define MS_ACTIVE	(1<<30)
diff --git a/mm/bounce.c b/mm/bounce.c
index 5f89017..a5c2ec3 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
 #ifdef CONFIG_NEED_BOUNCE_POOL
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
 {
-	struct page *page;
-	struct backing_dev_info *bdi;
-	struct address_space *mapping;
-	struct bio_vec *from;
-	int i;
-
 	if (bio_data_dir(bio) != WRITE)
 		return 0;
 
 	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
 		return 0;
 
-	/*
-	 * Based on the first page that has a valid mapping, decide whether or
-	 * not we have to employ bounce buffering to guarantee stable pages.
-	 */
-	bio_for_each_segment(from, bio, i) {
-		page = from->bv_page;
-		mapping = page_mapping(page);
-		if (!mapping)
-			continue;
-		bdi = mapping->backing_dev_info;
-		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
-	}
-
-	return 0;
+	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
 }
 #else
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index efe6814..4514ad7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
 
 	if (!bdi_cap_stable_pages_required(bdi))
 		return;
-#ifdef CONFIG_NEED_BOUNCE_POOL
-	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
-		return;
-#endif /* CONFIG_NEED_BOUNCE_POOL */
 
 	wait_on_page_writeback(page);
 }

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-13 21:02           ` Jan Kara
  (?)
@ 2013-03-14 22:46             ` Andrew Morton
  -1 siblings, 0 replies; 74+ messages in thread
From: Andrew Morton @ 2013-03-14 22:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Darrick J. Wong, Shuge, linux-kernel, linux-mm, linux-ext4,
	Kevin, Theodore Ts'o, Jens Axboe, Catalin Marinas,
	Will Deacon, linux-arm-kernel

On Wed, 13 Mar 2013 22:02:16 +0100 Jan Kara <jack@suse.cz> wrote:

> > > ... remembering why we need to get to sb and why ext3 needs this ... So
> > > maybe a better solution would be to have a bio flag meaning that pages need
> > > bouncing? And we would set it from filesystems that need it - in case of
> > > ext3 only writeback of data from kjournald actually needs to bounce the
> > > pages. Thoughts?
> > 
> > What about dirty pages that don't result in journal transactions?  I think
> > ext3_sync_file() eventually calls ext3_ordered_writepage, which then calls
> > __block_write_full_page, which in turn calls submit_bh().
>   So here we have two options:
> Either we let ext3 wait the same way as other filesystems when stable pages
> are required. Then only data IO from kjournald needs to be bounced (all
> other IO is properly protected by PageWriteback bit).
> 
> Or we won't let ext3 wait (as it is now), keep the superblock flag that fs
> needs bouncing, and set the bio flag in __block_write_full_page() and
> kjournald based on the sb flag.
> 
> I think the first option is slightly better but I don't feel strongly
> about that.

It seems Just Wrong that we're dicking around with filesystem
superblocks at this level.  It's the bounce code, for heaven's sake!


What the heck's going on here and why wasn't I able to work that out
from reading the code :( The need to stabilise these pages is driven by
the characteristics of the underlying device and driver stack, isn't
it?  Things like checksumming?  What else drives this requirement? 
</rant>

Because I *think* it should be sufficient to maintain this boolean in
the backing_dev.  My *guess* is that this is all here because we want
to enable stable-snapshotting on a per-fs basis rather than on a
per-device basis?  If so, why?  If not, what?
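
Roughly, a per-device-only policy would reduce the bounce-side check to the
following sketch (illustration only, not what the patch does; it just reuses
the bdi_cap_stable_pages_required() test the patch already performs):

static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
{
	if (bio_data_dir(bio) != WRITE)
		return 0;
	/* Bounce every write whenever the backing device wants stable pages,
	 * regardless of which filesystem issued it. */
	return bdi_cap_stable_pages_required(&q->backing_dev_info);
}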



btw, local variable `bdi' in must_snapshot_stable_pages() doesn't do
anything.


None of this will stop Shuge's kernel from going splat either.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-14 22:42             ` Darrick J. Wong
  (?)
@ 2013-03-14 23:01               ` Andrew Morton
  -1 siblings, 0 replies; 74+ messages in thread
From: Andrew Morton @ 2013-03-14 23:01 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel

On Thu, 14 Mar 2013 15:42:43 -0700 "Darrick J. Wong" <darrick.wong@oracle.com> wrote:

> Subject: [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation
> 
> Walking a bio's page mappings has proved problematic, so create a new bio flag
> to indicate that a bio's data needs to be snapshotted in order to guarantee
> stable pages during writeback.  Next, for the one user (ext3/jbd) of
> snapshotting, hook all the places where writes can be initiated without
> PG_writeback set, and set BIO_SNAP_STABLE there.  Finally, the MS_SNAP_STABLE
> mount flag (only used by ext3) is now superfluous, so get rid of it.

whoa, that looks way better.

Must do this though:

From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-make-snapshotting-pages-for-stable-writes-a-per-bio-operation-fix

rename _submit_bh()'s `flags' to `bio_flags', delobotomize the _submit_bh declaration

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/buffer.c                 |    4 ++--
 include/linux/buffer_head.h |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff -puN fs/buffer.c~mm-make-snapshotting-pages-for-stable-writes-a-per-bio-operation-fix fs/buffer.c
--- a/fs/buffer.c~mm-make-snapshotting-pages-for-stable-writes-a-per-bio-operation-fix
+++ a/fs/buffer.c
@@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct
 	}
 }
 
-int _submit_bh(int rw, struct buffer_head * bh, unsigned long flags)
+int _submit_bh(int rw, struct buffer_head * bh, unsigned long bio_flags)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -2984,7 +2984,7 @@ int _submit_bh(int rw, struct buffer_hea
 
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
-	bio->bi_flags |= flags;
+	bio->bi_flags |= bio_flags;
 
 	/* Take care of bh's that straddle the end of the device */
 	guard_bh_eod(rw, bio, bh);
diff -puN include/linux/buffer_head.h~mm-make-snapshotting-pages-for-stable-writes-a-per-bio-operation-fix include/linux/buffer_head.h
--- a/include/linux/buffer_head.h~mm-make-snapshotting-pages-for-stable-writes-a-per-bio-operation-fix
+++ a/include/linux/buffer_head.h
@@ -181,7 +181,7 @@ void ll_rw_block(int, int, struct buffer
 int sync_dirty_buffer(struct buffer_head *bh);
 int __sync_dirty_buffer(struct buffer_head *bh, int rw);
 void write_dirty_buffer(struct buffer_head *bh, int rw);
-int _submit_bh(int, struct buffer_head *, unsigned long);
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags);
 int submit_bh(int, struct buffer_head *);
 void write_boundary_block(struct block_device *bdev,
 			sector_t bblock, unsigned blocksize);
_


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-14 22:46             ` Andrew Morton
  (?)
@ 2013-03-14 23:27               ` Darrick J. Wong
  -1 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-14 23:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel

On Thu, Mar 14, 2013 at 03:46:51PM -0700, Andrew Morton wrote:
> On Wed, 13 Mar 2013 22:02:16 +0100 Jan Kara <jack@suse.cz> wrote:
> 
> > > > ... remembering why we need to get to sb and why ext3 needs this ... So
> > > > maybe a better solution would be to have a bio flag meaning that pages need
> > > > bouncing? And we would set it from filesystems that need it - in case of
> > > > ext3 only writeback of data from kjournald actually needs to bounce the
> > > > pages. Thoughts?
> > > 
> > > What about dirty pages that don't result in journal transactions?  I think
> > > ext3_sync_file() eventually calls ext3_ordered_writepage, which then calls
> > > __block_write_full_page, which in turn calls submit_bh().
> >   So here we have two options:
> > Either we let ext3 wait the same way as other filesystems when stable pages
> > are required. Then only data IO from kjournald needs to be bounced (all
> > other IO is properly protected by PageWriteback bit).
> > 
> > Or we won't let ext3 wait (as it is now), keep the superblock flag that fs
> > needs bouncing, and set the bio flag in __block_write_full_page() and
> > kjournald based on the sb flag.
> > 
> > I think the first option is slightly better but I don't feel strongly
> > about that.
> 
> It seems Just Wrong that we're dicking around with filesystem
> superblocks at this level.  It's the bounce code, for heavens sake!
> 
> 
> What the heck's going on here and why wasn't I able to work that out
> from reading the code :( The need to stabilise these pages is driven by
> the characteristics of the underlying device and driver stack, isn't
> it?  Things like checksumming?  What else drives this requirement? 
> </rant>

Right now, checksumming for weird DIF/DIX devices is the only requirement for
this behavior.  In theory we could also hook checksumming iSCSI and other things
up to this, but for now they have their own solutions for keeping writeback
page contents stable.

> Because I *think* it should be sufficient to maintain this boolean in
> the backing_dev.  My *guess* is that this is all here because we want
> to enable stable-snapshotting on a per-fs basis rather than on a
> per-device basis?  If so, why?  If not, what?

Yes, we do want to enable stable-snapshotting on a per-fs basis.  Here's why:

The first time I tried to solve this problem, I simply had everything use the
bounce buffer.  That was shot down because bounce buffers add memory pressure,
there might not be free pages available when we're doing writeback, etc.

The second attempt was to simply make everything wait for writeback to finish
before dirtying pages.  That's what everything (except ext3) does now.  jbd
initiates writeback on pages without setting PG_writeback, which means that our
convenient wait_on_stable_pages is broken in this case.  Hence ext3/jbd need to
be able to stable-snapshot.  However, it's the /only/ filesystem in the kernel
that needs this.  Everything else is either ok with waiting (ext4, xfs) or
implements their own tricks (tux3, btrfs) to make stable pages work correctly.
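
(The "wait" approach is roughly the following sketch; example_prepare_page_for_write()
is a made-up stand-in for a filesystem's page_mkwrite()/write_begin() path.  jbd
can't do the same because the page lock ranks above transaction start.)

static void example_prepare_page_for_write(struct page *page)
{
	lock_page(page);
	/* No-op unless the backing device requires stable pages. */
	wait_for_stable_page(page);
	/* ...the page contents may now be modified safely... */
}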

Fixing jbd to set PG_writeback has been discussed and rejected, because it's a
lot of work and you'd end up with something rather jbd2-like.  However,
bouncing the outgoing buffers is a fairly small change to jbd.  Jan (at least a
few months ago) was ok with band-aiding ext3.

I could rip out ext3 entirely, but people seem uncomfortable with that, and it
hasn't (yet) been proven that ext4 can provide a perfect imitation of ext3.

I could also just fix up Kconfig so that you can't use a BLK_DEV_INTEGRITY
device with JBD, but that was also shot down as ridiculous.

Given that a backing_dev covers a whole disk, which could contain several
different filesystems and an ext3, I don't want to make /all/ of them use
bounce buffering just because jbd is broken.  We've already established that
bounce pages should be used only when necessary, and (as it turns out), ext3
can initiate writeout of certain dirty user data pages without needing to go
through jbd, which means that those pages don't need to be bounced either.

Therefore, this really is a per-fs thing.

> btw, local variable `bdi' in must_snapshot_stable_pages() doesn't do
> anything.
>
> None of this will stop Shuge's kernel from going splat either.

I'm not trying to fix that in this patch; his splat resulted from stuff going
on in ext4/jbd2.

--D

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-14 22:42             ` Darrick J. Wong
  (?)
@ 2013-03-15 10:01               ` Jan Kara
  -1 siblings, 0 replies; 74+ messages in thread
From: Jan Kara @ 2013-03-15 10:01 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, Andrew Morton, Shuge, linux-kernel, linux-mm,
	linux-ext4, Kevin, Theodore Ts'o, Jens Axboe,
	Catalin Marinas, Will Deacon, linux-arm-kernel

On Thu 14-03-13 15:42:43, Darrick J. Wong wrote:
> On Wed, Mar 13, 2013 at 10:02:16PM +0100, Jan Kara wrote:
> > On Wed 13-03-13 12:44:29, Darrick J. Wong wrote:
> > > On Wed, Mar 13, 2013 at 09:50:21AM +0100, Jan Kara wrote:
> > > > On Tue 12-03-13 18:10:20, Darrick J. Wong wrote:
> > > > > On Tue, Mar 12, 2013 at 03:32:21PM -0700, Andrew Morton wrote:
> > > > > > On Fri, 08 Mar 2013 20:37:36 +0800 Shuge <shugelinux@gmail.com> wrote:
> > > > > > 
> > > > > > > The bounce code accepts slab pages from jbd2, and flushes dcache on them.
> > > > > > > When enabling VM_DEBUG, it will trigger VM_BUG_ON in page_mapping().
> > > > > > > So, check PageSlab to avoid it in __blk_queue_bounce().
> > > > > > > 
> > > > > > > Bug URL: http://lkml.org/lkml/2013/3/7/56
> > > > > > > 
> > > > > > > ...
> > > > > > >
> > > > > > > --- a/mm/bounce.c
> > > > > > > +++ b/mm/bounce.c
> > > > > > > @@ -214,7 +214,8 @@ static void __blk_queue_bounce(struct request_queue 
> > > > > > > *q, struct bio **bio_orig,
> > > > > > >   		if (rw == WRITE) {
> > > > > > >   			char *vto, *vfrom;
> > > > > > >   -			flush_dcache_page(from->bv_page);
> > > > > > > +			if (unlikely(!PageSlab(from->bv_page)))
> > > > > > > +				flush_dcache_page(from->bv_page);
> > > > > > >   			vto = page_address(to->bv_page) + to->bv_offset;
> > > > > > >   			vfrom = kmap(from->bv_page) + from->bv_offset;
> > > > > > >   			memcpy(vto, vfrom, to->bv_len);
> > > > > > 
> > > > > > I guess this is triggered by Catalin's f1a0c4aa0937975b ("arm64: Cache
> > > > > > maintenance routines"), which added a page_mapping() call to arm64's
> > > > > > arch/arm64/mm/flush.c:flush_dcache_page().
> > > > > > 
> > > > > > What's happening is that jbd2 is using kmalloc() to allocate buffer_head
> > > > > > data.  That gets submitted down the BIO layer and __blk_queue_bounce()
> > > > > > calls flush_dcache_page() which in the arm64 case calls page_mapping()
> > > > > > and page_mapping() does VM_BUG_ON(PageSlab) and splat.
> > > > > > 
> > > > > > The unusual thing about all of this is that the payload for some disk
> > > > > > IO is coming from kmalloc, rather than being a user page.  It's oddball
> > > > > > but we've done this for ages and should continue to support it.
> > > > > > 
> > > > > > 
> > > > > > Now, the page from kmalloc() cannot be in highmem, so why did the
> > > > > > bounce code decide to bounce it?
> > > > > > 
> > > > > > __blk_queue_bounce() does
> > > > > > 
> > > > > > 		/*
> > > > > > 		 * is destination page below bounce pfn?
> > > > > > 		 */
> > > > > > 		if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
> > > > > > 			continue;
> > > > > > 
> > > > > > and `force' comes from must_snapshot_stable_pages().  But
> > > > > > must_snapshot_stable_pages() must have returned false, because if it
> > > > > > had returned true then it would have been must_snapshot_stable_pages()
> > > > > > which went BUG, because must_snapshot_stable_pages() calls page_mapping().
> > > > > > 
> > > > > > So my tentative diagnosis is that arm64 is fishy.  A page which was
> > > > > > allocated via jbd2_alloc(GFP_NOFS)->kmem_cache_alloc() ended up being
> > > > > > above arm64's queue_bounce_pfn().  Can you please do a bit of
> > > > > > investigation to work out if this is what is happening?  Find out why
> > > > > > __blk_queue_bounce() decided to bounce a page which shouldn't have been
> > > > > > bounced?
> > > > > 
> > > > > That sure is strange.  I didn't see any obvious reasons why we'd end up with a
> > > > > kmalloc above queue_bounce_pfn().  But then I don't have any arm64s either.
> > > > > 
> > > > > > This is all terribly fragile :( afaict if someone sets
> > > > > > bdi_cap_stable_pages_required() against that jbd2 queue, we're going to
> > > > > > hit that BUG_ON() again, via must_snapshot_stable_pages()'s
> > > > > > page_mapping() call.  (Darrick, this means you ;))
> > > > > 
> > > > > Wheeee.  You're right, we shouldn't be calling page_mapping on slab pages.
> > > > > We can keep walking the bio segments to find a non-slab page that can tell us
> > > > > MS_SNAP_STABLE is set, since we probably won't need the bounce buffer anyway.
> > > > > 
> > > > > How does something like this look?  (+ the patch above)
> > > >   Umm, this won't quite work. We can have a bio which has just PageSlab
> > > > page attached and so you won't be able to get to the superblock. Heh, isn't
> > > > the whole page_mapping() thing in must_snapshot_stable_pages() wrong? When we
> > > > do direct IO, these pages come directly from userspace and hell knows where
> > > > they come from. Definitely their page_mapping() doesn't give us anything
> > > > useful... Sorry for not realizing this earlier when reviewing the patch.
> > > > 
> > > > ... remembering why we need to get to sb and why ext3 needs this ... So
> > > > maybe a better solution would be to have a bio flag meaning that pages need
> > > > bouncing? And we would set it from filesystems that need it - in case of
> > > > ext3 only writeback of data from kjournald actually needs to bounce the
> > > > pages. Thoughts?
> > > 
> > > What about dirty pages that don't result in journal transactions?  I think
> > > ext3_sync_file() eventually calls ext3_ordered_writepage, which then calls
> > > __block_write_full_page, which in turn calls submit_bh().
> >   So here we have two options:
> > Either we let ext3 wait the same way as other filesystems when stable pages
> > are required. Then only data IO from kjournald needs to be bounced (all
> > other IO is properly protected by PageWriteback bit).
> > 
> > Or we won't let ext3 wait (as it is now), keep the superblock flag that fs
> > needs bouncing, and set the bio flag in __block_write_full_page() and
> > kjournald based on the sb flag.
> > 
> > I think the first option is slightly better but I don't feel strongly
> > about that.
> 
> I like that first option -- it contains the kludgery to jbd instead of
> spreading it around.  Here's a patch that passes a quick smoke test on ext[34],
> xfs, and vfat.  What do you think of this one?  Should I create a
> submit_snapshot_bh() instead of letting callers stuff in arbitrary dangerous
> BH_ flags?
  Thanks for writing the patch. I think _submit_bh() is OK as you did it. I
have just two comments below.

> ---
> From: Darrick J. Wong <darrick.wong@oracle.com>
> Subject: [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation
> 
> Walking a bio's page mappings has proved problematic, so create a new bio flag
> to indicate that a bio's data needs to be snapshotted in order to guarantee
> stable pages during writeback.  Next, for the one user (ext3/jbd) of
> snapshotting, hook all the places where writes can be initiated without
> PG_writeback set, and set BIO_SNAP_STABLE there.  Finally, the MS_SNAP_STABLE
> mount flag (only used by ext3) is now superfluous, so get rid of it.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/buffer.c                 |    9 ++++++++-
>  fs/ext3/super.c             |    1 -
>  fs/jbd/commit.c             |    4 ++--
>  include/linux/blk_types.h   |    3 ++-
>  include/linux/buffer_head.h |    1 +
>  include/uapi/linux/fs.h     |    1 -
>  mm/bounce.c                 |   21 +--------------------
>  mm/page-writeback.c         |    4 ----
>  8 files changed, 14 insertions(+), 30 deletions(-)
> 
...
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index 86b39b1..b91b688 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -163,7 +163,7 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
>  	for (i = 0; i < bufs; i++) {
>  		wbuf[i]->b_end_io = end_buffer_write_sync;
>  		/* We use-up our safety reference in submit_bh() */
> -		submit_bh(write_op, wbuf[i]);
> +		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
  Please add a comment here explaining why we need BIO_SNAP_STABLE. Something like:
/*
 * Here we write back pagecache data that may be mmapped. Since we cannot
 * afford to clean the page and set PageWriteback here due to lock ordering
 * (page lock ranks above transaction start), the data can change while the IO
 * is in flight. Tell the block layer to bounce the bio's pages if stable
 * data during the write is required.
 */

>  	}
>  }
>  
> @@ -667,7 +667,7 @@ start_journal_io:
>  				clear_buffer_dirty(bh);
>  				set_buffer_uptodate(bh);
>  				bh->b_end_io = journal_end_buffer_io_sync;
> -				submit_bh(write_op, bh);
> +				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
  And this one isn't needed. Here we write out only metadata, and for metadata
JBD already handles the copying / waiting for IO in flight.
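
  (Very roughly, and eliding locking and kmap details: when someone wants to
modify a metadata buffer that still belongs to the committing transaction,
do_get_write_access() freezes a private copy first, so the data under the
journal IO cannot change. A simplified sketch, not the real jbd code:

	if (jh->b_transaction == journal->j_committing_transaction) {
		/* snapshot the buffer before the caller can redirty it */
		jh->b_frozen_data = jbd_alloc(bh->b_size, GFP_NOFS);
		memcpy(jh->b_frozen_data, bh->b_data, bh->b_size);
	}

The commit code then writes out the frozen copy instead of the live buffer.)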

The rest of the patch looks OK and I like it much more than the previous
version :)

									Honza
>  			}
>  			cond_resched();
>  
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index cdf1119..22990cf 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -111,12 +111,13 @@ struct bio {
>  #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
>  #define BIO_QUIET	10	/* Make BIO Quiet */
>  #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
> +#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
>  
>  /*
>   * Flags starting here get preserved by bio_reset() - this includes
>   * BIO_POOL_IDX()
>   */
> -#define BIO_RESET_BITS	12
> +#define BIO_RESET_BITS	13
>  
>  #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
>  
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 5afc4f9..714d5d9 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
>  int sync_dirty_buffer(struct buffer_head *bh);
>  int __sync_dirty_buffer(struct buffer_head *bh, int rw);
>  void write_dirty_buffer(struct buffer_head *bh, int rw);
> +int _submit_bh(int, struct buffer_head *, unsigned long);
>  int submit_bh(int, struct buffer_head *);
>  void write_boundary_block(struct block_device *bdev,
>  			sector_t bblock, unsigned blocksize);
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index c7fc1e6..a4ed56c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -88,7 +88,6 @@ struct inodes_stat_t {
>  #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
>  
>  /* These sb flags are internal to the kernel */
> -#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
>  #define MS_NOSEC	(1<<28)
>  #define MS_BORN		(1<<29)
>  #define MS_ACTIVE	(1<<30)
> diff --git a/mm/bounce.c b/mm/bounce.c
> index 5f89017..a5c2ec3 100644
> --- a/mm/bounce.c
> +++ b/mm/bounce.c
> @@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
>  #ifdef CONFIG_NEED_BOUNCE_POOL
>  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
>  {
> -	struct page *page;
> -	struct backing_dev_info *bdi;
> -	struct address_space *mapping;
> -	struct bio_vec *from;
> -	int i;
> -
>  	if (bio_data_dir(bio) != WRITE)
>  		return 0;
>  
>  	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
>  		return 0;
>  
> -	/*
> -	 * Based on the first page that has a valid mapping, decide whether or
> -	 * not we have to employ bounce buffering to guarantee stable pages.
> -	 */
> -	bio_for_each_segment(from, bio, i) {
> -		page = from->bv_page;
> -		mapping = page_mapping(page);
> -		if (!mapping)
> -			continue;
> -		bdi = mapping->backing_dev_info;
> -		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
> -	}
> -
> -	return 0;
> +	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
>  }
>  #else
>  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index efe6814..4514ad7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
>  
>  	if (!bdi_cap_stable_pages_required(bdi))
>  		return;
> -#ifdef CONFIG_NEED_BOUNCE_POOL
> -	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
> -		return;
> -#endif /* CONFIG_NEED_BOUNCE_POOL */
>  
>  	wait_on_page_writeback(page);
>  }
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-15 10:01               ` Jan Kara
  (?)
@ 2013-03-15 17:54                 ` Darrick J. Wong
  -1 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-15 17:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel

On Fri, Mar 15, 2013 at 11:01:05AM +0100, Jan Kara wrote:
> On Thu 14-03-13 15:42:43, Darrick J. Wong wrote:
> > On Wed, Mar 13, 2013 at 10:02:16PM +0100, Jan Kara wrote:
> > > On Wed 13-03-13 12:44:29, Darrick J. Wong wrote:
> > > > On Wed, Mar 13, 2013 at 09:50:21AM +0100, Jan Kara wrote:
> > > > > On Tue 12-03-13 18:10:20, Darrick J. Wong wrote:
> > > > > > On Tue, Mar 12, 2013 at 03:32:21PM -0700, Andrew Morton wrote:
> > > > > > > On Fri, 08 Mar 2013 20:37:36 +0800 Shuge <shugelinux@gmail.com> wrote:
> > > > > > > 
> > > > > > > > The bounce code accepts slab pages from jbd2, and flushes the dcache on them.
> > > > > > > > When VM_DEBUG is enabled, it will trigger the VM_BUG_ON in page_mapping().
> > > > > > > > So, check PageSlab to avoid it in __blk_queue_bounce().
> > > > > > > > 
> > > > > > > > Bug URL: http://lkml.org/lkml/2013/3/7/56
> > > > > > > > 
> > > > > > > > ...
> > > > > > > >
> > > > > > > > --- a/mm/bounce.c
> > > > > > > > +++ b/mm/bounce.c
> > > > > > > > @@ -214,7 +214,8 @@ static void __blk_queue_bounce(struct request_queue 
> > > > > > > > *q, struct bio **bio_orig,
> > > > > > > >   		if (rw == WRITE) {
> > > > > > > >   			char *vto, *vfrom;
> > > > > > > >   -			flush_dcache_page(from->bv_page);
> > > > > > > > +			if (unlikely(!PageSlab(from->bv_page)))
> > > > > > > > +				flush_dcache_page(from->bv_page);
> > > > > > > >   			vto = page_address(to->bv_page) + to->bv_offset;
> > > > > > > >   			vfrom = kmap(from->bv_page) + from->bv_offset;
> > > > > > > >   			memcpy(vto, vfrom, to->bv_len);
> > > > > > > 
> > > > > > > I guess this is triggered by Catalin's f1a0c4aa0937975b ("arm64: Cache
> > > > > > > maintenance routines"), which added a page_mapping() call to arm64's
> > > > > > > arch/arm64/mm/flush.c:flush_dcache_page().
> > > > > > > 
> > > > > > > What's happening is that jbd2 is using kmalloc() to allocate buffer_head
> > > > > > > data.  That gets submitted down the BIO layer and __blk_queue_bounce()
> > > > > > > calls flush_dcache_page() which in the arm64 case calls page_mapping()
> > > > > > > and page_mapping() does VM_BUG_ON(PageSlab) and splat.
> > > > > > > 
> > > > > > > The unusual thing about all of this is that the payload for some disk
> > > > > > > IO is coming from kmalloc, rather than being a user page.  It's oddball
> > > > > > > but we've done this for ages and should continue to support it.
> > > > > > > 
> > > > > > > 
> > > > > > > Now, the page from kmalloc() cannot be in highmem, so why did the
> > > > > > > bounce code decide to bounce it?
> > > > > > > 
> > > > > > > __blk_queue_bounce() does
> > > > > > > 
> > > > > > > 		/*
> > > > > > > 		 * is destination page below bounce pfn?
> > > > > > > 		 */
> > > > > > > 		if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
> > > > > > > 			continue;
> > > > > > > 
> > > > > > > and `force' comes from must_snapshot_stable_pages().  But
> > > > > > > must_snapshot_stable_pages() must have returned false, because if it
> > > > > > > had returned true then it would have been must_snapshot_stable_pages()
> > > > > > > which went BUG, because must_snapshot_stable_pages() calls page_mapping().
> > > > > > > 
> > > > > > > > So my tentative diagnosis is that arm64 is fishy.  A page which was
> > > > > > > allocated via jbd2_alloc(GFP_NOFS)->kmem_cache_alloc() ended up being
> > > > > > > above arm64's queue_bounce_pfn().  Can you please do a bit of
> > > > > > > investigation to work out if this is what is happening?  Find out why
> > > > > > > __blk_queue_bounce() decided to bounce a page which shouldn't have been
> > > > > > > bounced?
> > > > > > 
> > > > > > That sure is strange.  I didn't see any obvious reasons why we'd end up with a
> > > > > > kmalloc above queue_bounce_pfn().  But then I don't have any arm64s either.
> > > > > > 
> > > > > > > This is all terribly fragile :( afaict if someone sets
> > > > > > > bdi_cap_stable_pages_required() against that jbd2 queue, we're going to
> > > > > > > hit that BUG_ON() again, via must_snapshot_stable_pages()'s
> > > > > > > page_mapping() call.  (Darrick, this means you ;))
> > > > > > 
> > > > > > Wheeee.  You're right, we shouldn't be calling page_mapping on slab pages.
> > > > > > We can keep walking the bio segments to find a non-slab page that can tell us
> > > > > > MS_SNAP_STABLE is set, since we probably won't need the bounce buffer anyway.
> > > > > > 
> > > > > > How does something like this look?  (+ the patch above)
> > > > >   Umm, this won't quite work. We can have a bio which has just PageSlab
> > > > > page attached and so you won't be able to get to the superblock. Heh, isn't
> > > > > the whole page_mapping() thing in must_snapshot_stable_pages() wrong? When we
> > > > > do direct IO, these pages come directly from userspace and hell knows where
> > > > > they come from. Definitely their page_mapping() doesn't give us anything
> > > > > useful... Sorry for not realizing this earlier when reviewing the patch.
> > > > > 
> > > > > ... remembering why we need to get to sb and why ext3 needs this ... So
> > > > > maybe a better solution would be to have a bio flag meaning that pages need
> > > > > bouncing? And we would set it from filesystems that need it - in case of
> > > > > ext3 only writeback of data from kjournald actually needs to bounce the
> > > > > pages. Thoughts?
> > > > 
> > > > What about dirty pages that don't result in journal transactions?  I think
> > > > ext3_sync_file() eventually calls ext3_ordered_writepage, which then calls
> > > > __block_write_full_page, which in turn calls submit_bh().
> > >   So here we have two options:
> > > Either we let ext3 wait the same way as other filesystems when stable pages
> > > are required. Then only data IO from kjournald needs to be bounced (all
> > > other IO is properly protected by PageWriteback bit).
> > > 
> > > Or we won't let ext3 wait (as it is now), keep the superblock flag that fs
> > > needs bouncing, and set the bio flag in __block_write_full_page() and
> > > kjournald based on the sb flag.
> > > 
> > > I think the first option is slightly better but I don't feel strongly
> > > about that.
> > 
> > I like that first option -- it contains the kludgery to jbd instead of
> > spreading it around.  Here's a patch that passes a quick smoke test on ext[34],
> > xfs, and vfat.  What do you think of this one?  Should I create a
> > submit_snapshot_bh() instead of letting callers stuff in arbitrary dangerous
> > BH_ flags?
>   Thanks for writing the patch. I think _submit_bh() is OK as you did it. I
> have just two comments below.
> 
> > ---
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > Subject: [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation
> > 
> > Walking a bio's page mappings has proved problematic, so create a new bio flag
> > to indicate that a bio's data needs to be snapshotted in order to guarantee
> > stable pages during writeback.  Next, for the one user (ext3/jbd) of
> > snapshotting, hook all the places where writes can be initiated without
> > PG_writeback set, and set BIO_SNAP_STABLE there.  Finally, the MS_SNAP_STABLE
> > mount flag (only used by ext3) is now superfluous, so get rid of it.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/buffer.c                 |    9 ++++++++-
> >  fs/ext3/super.c             |    1 -
> >  fs/jbd/commit.c             |    4 ++--
> >  include/linux/blk_types.h   |    3 ++-
> >  include/linux/buffer_head.h |    1 +
> >  include/uapi/linux/fs.h     |    1 -
> >  mm/bounce.c                 |   21 +--------------------
> >  mm/page-writeback.c         |    4 ----
> >  8 files changed, 14 insertions(+), 30 deletions(-)
> > 
> ...
> > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> > index 86b39b1..b91b688 100644
> > --- a/fs/jbd/commit.c
> > +++ b/fs/jbd/commit.c
> > @@ -163,7 +163,7 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
> >  	for (i = 0; i < bufs; i++) {
> >  		wbuf[i]->b_end_io = end_buffer_write_sync;
> >  		/* We use-up our safety reference in submit_bh() */
> > -		submit_bh(write_op, wbuf[i]);
> > +		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
>   Please add a comment here why we need BIO_SNAP_STABLE. Something like:
> /*
>  * Here we write back pagecache data that may be mmaped. Since we cannot
>  * afford to clean the page and set PageWriteback here due to lock ordering
>  * (page lock ranks above transaction start), the data can change while IO is
>  * in flight. Tell the block layer it should bounce the bio pages if stable
>  * data during write is required.
>  */
> 
> >  	}
> >  }
> >  
> > @@ -667,7 +667,7 @@ start_journal_io:
> >  				clear_buffer_dirty(bh);
> >  				set_buffer_uptodate(bh);
> >  				bh->b_end_io = journal_end_buffer_io_sync;
> > -				submit_bh(write_op, bh);
> > +				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
>   And this isn't needed. Here we write out only metadata and JBD already
> handles copying those / waiting for IO in flight for metadata.

I think it only copies the page if the buffer is also part of the current
transaction or someone has called do_get_undo_access().  Unfortunately,
if we're in data=journal mode, dirty data pages get pushed through jbd as if
they were fs metadata, but in the meantime other processes can still write to
those pages.  So I guess we need the journal to freeze those pages as soon as
they come in.

(Or we could retain that little piece, but I suppose it's a larger hammer than
necessary.)
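
(If we did keep that piece, I imagine it could at least be made conditional.
Something like the sketch below, where the helper name is made up and stands
in for whatever test jbd could use to tell journalled data buffers apart from
real metadata:

	if (jbd_buffer_holds_journalled_data(bh))	/* hypothetical helper */
		_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
	else
		submit_bh(write_op, bh);

That way only data=journal pages get bounced, not every metadata buffer.)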

--D
> 
> The rest of the patch looks OK and I like it much more than the previous
> version :)
> 
> 									Honza
> >  			}
> >  			cond_resched();
> >  
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index cdf1119..22990cf 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -111,12 +111,13 @@ struct bio {
> >  #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
> >  #define BIO_QUIET	10	/* Make BIO Quiet */
> >  #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
> > +#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
> >  
> >  /*
> >   * Flags starting here get preserved by bio_reset() - this includes
> >   * BIO_POOL_IDX()
> >   */
> > -#define BIO_RESET_BITS	12
> > +#define BIO_RESET_BITS	13
> >  
> >  #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
> >  
> > diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> > index 5afc4f9..714d5d9 100644
> > --- a/include/linux/buffer_head.h
> > +++ b/include/linux/buffer_head.h
> > @@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
> >  int sync_dirty_buffer(struct buffer_head *bh);
> >  int __sync_dirty_buffer(struct buffer_head *bh, int rw);
> >  void write_dirty_buffer(struct buffer_head *bh, int rw);
> > +int _submit_bh(int, struct buffer_head *, unsigned long);
> >  int submit_bh(int, struct buffer_head *);
> >  void write_boundary_block(struct block_device *bdev,
> >  			sector_t bblock, unsigned blocksize);
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index c7fc1e6..a4ed56c 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -88,7 +88,6 @@ struct inodes_stat_t {
> >  #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
> >  
> >  /* These sb flags are internal to the kernel */
> > -#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
> >  #define MS_NOSEC	(1<<28)
> >  #define MS_BORN		(1<<29)
> >  #define MS_ACTIVE	(1<<30)
> > diff --git a/mm/bounce.c b/mm/bounce.c
> > index 5f89017..a5c2ec3 100644
> > --- a/mm/bounce.c
> > +++ b/mm/bounce.c
> > @@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
> >  #ifdef CONFIG_NEED_BOUNCE_POOL
> >  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
> >  {
> > -	struct page *page;
> > -	struct backing_dev_info *bdi;
> > -	struct address_space *mapping;
> > -	struct bio_vec *from;
> > -	int i;
> > -
> >  	if (bio_data_dir(bio) != WRITE)
> >  		return 0;
> >  
> >  	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
> >  		return 0;
> >  
> > -	/*
> > -	 * Based on the first page that has a valid mapping, decide whether or
> > -	 * not we have to employ bounce buffering to guarantee stable pages.
> > -	 */
> > -	bio_for_each_segment(from, bio, i) {
> > -		page = from->bv_page;
> > -		mapping = page_mapping(page);
> > -		if (!mapping)
> > -			continue;
> > -		bdi = mapping->backing_dev_info;
> > -		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
> > -	}
> > -
> > -	return 0;
> > +	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
> >  }
> >  #else
> >  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index efe6814..4514ad7 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
> >  
> >  	if (!bdi_cap_stable_pages_required(bdi))
> >  		return;
> > -#ifdef CONFIG_NEED_BOUNCE_POOL
> > -	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
> > -		return;
> > -#endif /* CONFIG_NEED_BOUNCE_POOL */
> >  
> >  	wait_on_page_writeback(page);
> >  }
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
@ 2013-03-15 17:54                 ` Darrick J. Wong
  0 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-15 17:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel

On Fri, Mar 15, 2013 at 11:01:05AM +0100, Jan Kara wrote:
> On Thu 14-03-13 15:42:43, Darrick J. Wong wrote:
> > On Wed, Mar 13, 2013 at 10:02:16PM +0100, Jan Kara wrote:
> > > On Wed 13-03-13 12:44:29, Darrick J. Wong wrote:
> > > > On Wed, Mar 13, 2013 at 09:50:21AM +0100, Jan Kara wrote:
> > > > > On Tue 12-03-13 18:10:20, Darrick J. Wong wrote:
> > > > > > On Tue, Mar 12, 2013 at 03:32:21PM -0700, Andrew Morton wrote:
> > > > > > > On Fri, 08 Mar 2013 20:37:36 +0800 Shuge <shugelinux@gmail.com> wrote:
> > > > > > > 
> > > > > > > > The bounce accept slab pages from jbd2, and flush dcache on them.
> > > > > > > > When enabling VM_DEBUG, it will tigger VM_BUG_ON in page_mapping().
> > > > > > > > So, check PageSlab to avoid it in __blk_queue_bounce().
> > > > > > > > 
> > > > > > > > Bug URL: http://lkml.org/lkml/2013/3/7/56
> > > > > > > > 
> > > > > > > > ...
> > > > > > > >
> > > > > > > > --- a/mm/bounce.c
> > > > > > > > +++ b/mm/bounce.c
> > > > > > > > @@ -214,7 +214,8 @@ static void __blk_queue_bounce(struct request_queue 
> > > > > > > > *q, struct bio **bio_orig,
> > > > > > > >   		if (rw == WRITE) {
> > > > > > > >   			char *vto, *vfrom;
> > > > > > > >   -			flush_dcache_page(from->bv_page);
> > > > > > > > +			if (unlikely(!PageSlab(from->bv_page)))
> > > > > > > > +				flush_dcache_page(from->bv_page);
> > > > > > > >   			vto = page_address(to->bv_page) + to->bv_offset;
> > > > > > > >   			vfrom = kmap(from->bv_page) + from->bv_offset;
> > > > > > > >   			memcpy(vto, vfrom, to->bv_len);
> > > > > > > 
> > > > > > > I guess this is triggered by Catalin's f1a0c4aa0937975b ("arm64: Cache
> > > > > > > maintenance routines"), which added a page_mapping() call to arm64's
> > > > > > > arch/arm64/mm/flush.c:flush_dcache_page().
> > > > > > > 
> > > > > > > What's happening is that jbd2 is using kmalloc() to allocate buffer_head
> > > > > > > data.  That gets submitted down the BIO layer and __blk_queue_bounce()
> > > > > > > calls flush_dcache_page() which in the arm64 case calls page_mapping()
> > > > > > > and page_mapping() does VM_BUG_ON(PageSlab) and splat.
> > > > > > > 
> > > > > > > The unusual thing about all of this is that the payload for some disk
> > > > > > > IO is coming from kmalloc, rather than being a user page.  It's oddball
> > > > > > > but we've done this for ages and should continue to support it.
> > > > > > > 
> > > > > > > 
> > > > > > > Now, the page from kmalloc() cannot be in highmem, so why did the
> > > > > > > bounce code decide to bounce it?
> > > > > > > 
> > > > > > > __blk_queue_bounce() does
> > > > > > > 
> > > > > > > 		/*
> > > > > > > 		 * is destination page below bounce pfn?
> > > > > > > 		 */
> > > > > > > 		if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
> > > > > > > 			continue;
> > > > > > > 
> > > > > > > and `force' comes from must_snapshot_stable_pages().  But
> > > > > > > must_snapshot_stable_pages() must have returned false, because if it
> > > > > > > had returned true then it would have been must_snapshot_stable_pages()
> > > > > > > which went BUG, because must_snapshot_stable_pages() calls page_mapping().
> > > > > > > 
> > > > > > > So my tentative diagosis is that arm64 is fishy.  A page which was
> > > > > > > allocated via jbd2_alloc(GFP_NOFS)->kmem_cache_alloc() ended up being
> > > > > > > above arm64's queue_bounce_pfn().  Can you please do a bit of
> > > > > > > investigation to work out if this is what is happening?  Find out why
> > > > > > > __blk_queue_bounce() decided to bounce a page which shouldn't have been
> > > > > > > bounced?
> > > > > > 
> > > > > > That sure is strange.  I didn't see any obvious reasons why we'd end up with a
> > > > > > kmalloc above queue_bounce_pfn().  But then I don't have any arm64s either.
> > > > > > 
> > > > > > > This is all terribly fragile :( afaict if someone sets
> > > > > > > bdi_cap_stable_pages_required() against that jbd2 queue, we're going to
> > > > > > > hit that BUG_ON() again, via must_snapshot_stable_pages()'s
> > > > > > > page_mapping() call.  (Darrick, this means you ;))
> > > > > > 
> > > > > > Wheeee.  You're right, we shouldn't be calling page_mapping on slab pages.
> > > > > > We can keep walking the bio segments to find a non-slab page that can tell us
> > > > > > MS_SNAP_STABLE is set, since we probably won't need the bounce buffer anyway.
> > > > > > 
> > > > > > How does something like this look?  (+ the patch above)
> > > > >   Umm, this won't quite work. We can have a bio which has just PageSlab
> > > > > page attached and so you won't be able to get to the superblock. Heh, isn't
> > > > > the whole page_mapping() thing in must_snapshot_stable_pages() wrong? When we
> > > > > do direct IO, these pages come directly from userspace and hell knows where
> > > > > they come from. Definitely their page_mapping() doesn't give us anything
> > > > > useful... Sorry for not realizing this earlier when reviewing the patch.
> > > > > 
> > > > > ... remembering why we need to get to sb and why ext3 needs this ... So
> > > > > maybe a better solution would be to have a bio flag meaning that pages need
> > > > > bouncing? And we would set it from filesystems that need it - in case of
> > > > > ext3 only writeback of data from kjournald actually needs to bounce the
> > > > > pages. Thoughts?
> > > > 
> > > > What about dirty pages that don't result in journal transactions?  I think
> > > > ext3_sync_file() eventually calls ext3_ordered_writepage, which then calls
> > > > __block_write_full_page, which in turn calls submit_bh().
> > >   So here we have two options:
> > > Either we let ext3 wait the same way as other filesystems when stable pages
> > > are required. Then only data IO from kjournald needs to be bounced (all
> > > other IO is properly protected by PageWriteback bit).
> > > 
> > > Or we won't let ext3 wait (as it is now), keep the superblock flag that fs
> > > needs bouncing, and set the bio flag in __block_write_full_page() and
> > > kjournald based on the sb flag.
> > > 
> > > I think the first option is slightly better but I don't feel strongly
> > > about that.
> > 
> > I like that first option -- it contains the kludgery to jbd instead of
> > spreading it around.  Here's a patch that passes a quick smoke test on ext[34],
> > xfs, and vfat.  What do you think of this one?  Should I create a
> > submit_snapshot_bh() instead of letting callers stuff in arbitrary dangerous
> > BH_ flags?
>   Thanks for writing the patch. I think _submit_bh() is OK as you did it. I
> have just two comments below.
> 
> > ---
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > Subject: [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation
> > 
> > Walking a bio's page mappings has proved problematic, so create a new bio flag
> > to indicate that a bio's data needs to be snapshotted in order to guarantee
> > stable pages during writeback.  Next, for the one user (ext3/jbd) of
> > snapshotting, hook all the places where writes can be initiated without
> > PG_writeback set, and set BIO_SNAP_STABLE there.  Finally, the MS_SNAP_STABLE
> > mount flag (only used by ext3) is now superfluous, so get rid of it.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/buffer.c                 |    9 ++++++++-
> >  fs/ext3/super.c             |    1 -
> >  fs/jbd/commit.c             |    4 ++--
> >  include/linux/blk_types.h   |    3 ++-
> >  include/linux/buffer_head.h |    1 +
> >  include/uapi/linux/fs.h     |    1 -
> >  mm/bounce.c                 |   21 +--------------------
> >  mm/page-writeback.c         |    4 ----
> >  8 files changed, 14 insertions(+), 30 deletions(-)
> > 
> ...
> > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> > index 86b39b1..b91b688 100644
> > --- a/fs/jbd/commit.c
> > +++ b/fs/jbd/commit.c
> > @@ -163,7 +163,7 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
> >  	for (i = 0; i < bufs; i++) {
> >  		wbuf[i]->b_end_io = end_buffer_write_sync;
> >  		/* We use-up our safety reference in submit_bh() */
> > -		submit_bh(write_op, wbuf[i]);
> > +		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
>   Please add a comment here why we need BIO_SNAP_STABLE. Something like:
> /*
>  * Here we write back pagecache data that may be mmaped. Since we cannot
>  * afford to clean the page and set PageWriteback here due to lock ordering
>  * (page lock ranks above transaction start), the data can change while IO is
>  * in flight. Tell the block layer it should bounce the bio pages if stable
>  * data during write is required.
>  */
> 
> >  	}
> >  }
> >  
> > @@ -667,7 +667,7 @@ start_journal_io:
> >  				clear_buffer_dirty(bh);
> >  				set_buffer_uptodate(bh);
> >  				bh->b_end_io = journal_end_buffer_io_sync;
> > -				submit_bh(write_op, bh);
> > +				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
>   And this isn't needed. Here we write out only metadata and JBD already
> handles copying those / waiting for IO in flight for metadata.

I think it only copies the page if either the buffer is also a part of the
current transaction (or someone called do_get_undo_access()).  Unfortunately,
if we're in data=journal mode, dirty data pages get pushed through jbd as if
they were fs metadata, but in the meantime other processes can still write to
those pages.  So I guess we need the journal to freeze those pages as soon as
they come in.

(Or we could retain that little piece, but I suppose it's a larger hammer than
necessary.)

--D
> 
> The rest of the patch looks OK and I like it much more than the previous
> version :)
> 
> 									Honza
> >  			}
> >  			cond_resched();
> >  
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index cdf1119..22990cf 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -111,12 +111,13 @@ struct bio {
> >  #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
> >  #define BIO_QUIET	10	/* Make BIO Quiet */
> >  #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
> > +#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
> >  
> >  /*
> >   * Flags starting here get preserved by bio_reset() - this includes
> >   * BIO_POOL_IDX()
> >   */
> > -#define BIO_RESET_BITS	12
> > +#define BIO_RESET_BITS	13
> >  
> >  #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
> >  
> > diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> > index 5afc4f9..714d5d9 100644
> > --- a/include/linux/buffer_head.h
> > +++ b/include/linux/buffer_head.h
> > @@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
> >  int sync_dirty_buffer(struct buffer_head *bh);
> >  int __sync_dirty_buffer(struct buffer_head *bh, int rw);
> >  void write_dirty_buffer(struct buffer_head *bh, int rw);
> > +int _submit_bh(int, struct buffer_head *, unsigned long);
> >  int submit_bh(int, struct buffer_head *);
> >  void write_boundary_block(struct block_device *bdev,
> >  			sector_t bblock, unsigned blocksize);
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index c7fc1e6..a4ed56c 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -88,7 +88,6 @@ struct inodes_stat_t {
> >  #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
> >  
> >  /* These sb flags are internal to the kernel */
> > -#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
> >  #define MS_NOSEC	(1<<28)
> >  #define MS_BORN		(1<<29)
> >  #define MS_ACTIVE	(1<<30)
> > diff --git a/mm/bounce.c b/mm/bounce.c
> > index 5f89017..a5c2ec3 100644
> > --- a/mm/bounce.c
> > +++ b/mm/bounce.c
> > @@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
> >  #ifdef CONFIG_NEED_BOUNCE_POOL
> >  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
> >  {
> > -	struct page *page;
> > -	struct backing_dev_info *bdi;
> > -	struct address_space *mapping;
> > -	struct bio_vec *from;
> > -	int i;
> > -
> >  	if (bio_data_dir(bio) != WRITE)
> >  		return 0;
> >  
> >  	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
> >  		return 0;
> >  
> > -	/*
> > -	 * Based on the first page that has a valid mapping, decide whether or
> > -	 * not we have to employ bounce buffering to guarantee stable pages.
> > -	 */
> > -	bio_for_each_segment(from, bio, i) {
> > -		page = from->bv_page;
> > -		mapping = page_mapping(page);
> > -		if (!mapping)
> > -			continue;
> > -		bdi = mapping->backing_dev_info;
> > -		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
> > -	}
> > -
> > -	return 0;
> > +	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
> >  }
> >  #else
> >  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index efe6814..4514ad7 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
> >  
> >  	if (!bdi_cap_stable_pages_required(bdi))
> >  		return;
> > -#ifdef CONFIG_NEED_BOUNCE_POOL
> > -	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
> > -		return;
> > -#endif /* CONFIG_NEED_BOUNCE_POOL */
> >  
> >  	wait_on_page_writeback(page);
> >  }
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
@ 2013-03-15 17:54                 ` Darrick J. Wong
  0 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-15 17:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 15, 2013 at 11:01:05AM +0100, Jan Kara wrote:
> On Thu 14-03-13 15:42:43, Darrick J. Wong wrote:
> > On Wed, Mar 13, 2013 at 10:02:16PM +0100, Jan Kara wrote:
> > > On Wed 13-03-13 12:44:29, Darrick J. Wong wrote:
> > > > On Wed, Mar 13, 2013 at 09:50:21AM +0100, Jan Kara wrote:
> > > > > On Tue 12-03-13 18:10:20, Darrick J. Wong wrote:
> > > > > > On Tue, Mar 12, 2013 at 03:32:21PM -0700, Andrew Morton wrote:
> > > > > > > On Fri, 08 Mar 2013 20:37:36 +0800 Shuge <shugelinux@gmail.com> wrote:
> > > > > > > 
> > > > > > > > The bounce accept slab pages from jbd2, and flush dcache on them.
> > > > > > > > When enabling VM_DEBUG, it will tigger VM_BUG_ON in page_mapping().
> > > > > > > > So, check PageSlab to avoid it in __blk_queue_bounce().
> > > > > > > > 
> > > > > > > > Bug URL: http://lkml.org/lkml/2013/3/7/56
> > > > > > > > 
> > > > > > > > ...
> > > > > > > >
> > > > > > > > --- a/mm/bounce.c
> > > > > > > > +++ b/mm/bounce.c
> > > > > > > > @@ -214,7 +214,8 @@ static void __blk_queue_bounce(struct request_queue 
> > > > > > > > *q, struct bio **bio_orig,
> > > > > > > >   		if (rw == WRITE) {
> > > > > > > >   			char *vto, *vfrom;
> > > > > > > >   -			flush_dcache_page(from->bv_page);
> > > > > > > > +			if (unlikely(!PageSlab(from->bv_page)))
> > > > > > > > +				flush_dcache_page(from->bv_page);
> > > > > > > >   			vto = page_address(to->bv_page) + to->bv_offset;
> > > > > > > >   			vfrom = kmap(from->bv_page) + from->bv_offset;
> > > > > > > >   			memcpy(vto, vfrom, to->bv_len);
> > > > > > > 
> > > > > > > I guess this is triggered by Catalin's f1a0c4aa0937975b ("arm64: Cache
> > > > > > > maintenance routines"), which added a page_mapping() call to arm64's
> > > > > > > arch/arm64/mm/flush.c:flush_dcache_page().
> > > > > > > 
> > > > > > > What's happening is that jbd2 is using kmalloc() to allocate buffer_head
> > > > > > > data.  That gets submitted down the BIO layer and __blk_queue_bounce()
> > > > > > > calls flush_dcache_page() which in the arm64 case calls page_mapping()
> > > > > > > and page_mapping() does VM_BUG_ON(PageSlab) and splat.
> > > > > > > 
> > > > > > > The unusual thing about all of this is that the payload for some disk
> > > > > > > IO is coming from kmalloc, rather than being a user page.  It's oddball
> > > > > > > but we've done this for ages and should continue to support it.
> > > > > > > 
> > > > > > > 
> > > > > > > Now, the page from kmalloc() cannot be in highmem, so why did the
> > > > > > > bounce code decide to bounce it?
> > > > > > > 
> > > > > > > __blk_queue_bounce() does
> > > > > > > 
> > > > > > > 		/*
> > > > > > > 		 * is destination page below bounce pfn?
> > > > > > > 		 */
> > > > > > > 		if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
> > > > > > > 			continue;
> > > > > > > 
> > > > > > > and `force' comes from must_snapshot_stable_pages().  But
> > > > > > > must_snapshot_stable_pages() must have returned false, because if it
> > > > > > > had returned true then it would have been must_snapshot_stable_pages()
> > > > > > > which went BUG, because must_snapshot_stable_pages() calls page_mapping().
> > > > > > > 
> > > > > > > So my tentative diagosis is that arm64 is fishy.  A page which was
> > > > > > > allocated via jbd2_alloc(GFP_NOFS)->kmem_cache_alloc() ended up being
> > > > > > > above arm64's queue_bounce_pfn().  Can you please do a bit of
> > > > > > > investigation to work out if this is what is happening?  Find out why
> > > > > > > __blk_queue_bounce() decided to bounce a page which shouldn't have been
> > > > > > > bounced?
> > > > > > 
> > > > > > That sure is strange.  I didn't see any obvious reasons why we'd end up with a
> > > > > > kmalloc above queue_bounce_pfn().  But then I don't have any arm64s either.
> > > > > > 
> > > > > > > This is all terribly fragile :( afaict if someone sets
> > > > > > > bdi_cap_stable_pages_required() against that jbd2 queue, we're going to
> > > > > > > hit that BUG_ON() again, via must_snapshot_stable_pages()'s
> > > > > > > page_mapping() call.  (Darrick, this means you ;))
> > > > > > 
> > > > > > Wheeee.  You're right, we shouldn't be calling page_mapping on slab pages.
> > > > > > We can keep walking the bio segments to find a non-slab page that can tell us
> > > > > > MS_SNAP_STABLE is set, since we probably won't need the bounce buffer anyway.
> > > > > > 
> > > > > > How does something like this look?  (+ the patch above)
> > > > >   Umm, this won't quite work. We can have a bio which has just PageSlab
> > > > > page attached and so you won't be able to get to the superblock. Heh, isn't
> > > > > the whole page_mapping() thing in must_snapshot_stable_pages() wrong? When we
> > > > > do direct IO, these pages come directly from userspace and hell knows where
> > > > > they come from. Definitely their page_mapping() doesn't give us anything
> > > > > useful... Sorry for not realizing this earlier when reviewing the patch.
> > > > > 
> > > > > ... remembering why we need to get to sb and why ext3 needs this ... So
> > > > > maybe a better solution would be to have a bio flag meaning that pages need
> > > > > bouncing? And we would set it from filesystems that need it - in case of
> > > > > ext3 only writeback of data from kjournald actually needs to bounce the
> > > > > pages. Thoughts?
> > > > 
> > > > What about dirty pages that don't result in journal transactions?  I think
> > > > ext3_sync_file() eventually calls ext3_ordered_writepage, which then calls
> > > > __block_write_full_page, which in turn calls submit_bh().
> > >   So here we have two options:
> > > Either we let ext3 wait the same way as other filesystems when stable pages
> > > are required. Then only data IO from kjournald needs to be bounced (all
> > > other IO is properly protected by PageWriteback bit).
> > > 
> > > Or we won't let ext3 wait (as it is now), keep the superblock flag that fs
> > > needs bouncing, and set the bio flag in __block_write_full_page() and
> > > kjournald based on the sb flag.
> > > 
> > > I think the first option is slightly better but I don't feel strongly
> > > about that.
> > 
> > I like that first option -- it contains the kludgery to jbd instead of
> > spreading it around.  Here's a patch that passes a quick smoke test on ext[34],
> > xfs, and vfat.  What do you think of this one?  Should I create a
> > submit_snapshot_bh() instead of letting callers stuff in arbitrary dangerous
> > BH_ flags?
>   Thanks for writing the patch. I think _submit_bh() is OK as you did it. I
> have just two comments below.
> 
> > ---
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > Subject: [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation
> > 
> > Walking a bio's page mappings has proved problematic, so create a new bio flag
> > to indicate that a bio's data needs to be snapshotted in order to guarantee
> > stable pages during writeback.  Next, for the one user (ext3/jbd) of
> > snapshotting, hook all the places where writes can be initiated without
> > PG_writeback set, and set BIO_SNAP_STABLE there.  Finally, the MS_SNAP_STABLE
> > mount flag (only used by ext3) is now superfluous, so get rid of it.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/buffer.c                 |    9 ++++++++-
> >  fs/ext3/super.c             |    1 -
> >  fs/jbd/commit.c             |    4 ++--
> >  include/linux/blk_types.h   |    3 ++-
> >  include/linux/buffer_head.h |    1 +
> >  include/uapi/linux/fs.h     |    1 -
> >  mm/bounce.c                 |   21 +--------------------
> >  mm/page-writeback.c         |    4 ----
> >  8 files changed, 14 insertions(+), 30 deletions(-)
> > 
> ...
> > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> > index 86b39b1..b91b688 100644
> > --- a/fs/jbd/commit.c
> > +++ b/fs/jbd/commit.c
> > @@ -163,7 +163,7 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
> >  	for (i = 0; i < bufs; i++) {
> >  		wbuf[i]->b_end_io = end_buffer_write_sync;
> >  		/* We use-up our safety reference in submit_bh() */
> > -		submit_bh(write_op, wbuf[i]);
> > +		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
>   Please add a comment here why we need BIO_SNAP_STABLE. Something like:
> /*
>  * Here we write back pagecache data that may be mmaped. Since we cannot
>  * afford to clean the page and set PageWriteback here due to lock ordering
>  * (page lock ranks above transaction start), the data can change while IO is
>  * in flight. Tell the block layer it should bounce the bio pages if stable
>  * data during write is required.
>  */
> 
> >  	}
> >  }
> >  
> > @@ -667,7 +667,7 @@ start_journal_io:
> >  				clear_buffer_dirty(bh);
> >  				set_buffer_uptodate(bh);
> >  				bh->b_end_io = journal_end_buffer_io_sync;
> > -				submit_bh(write_op, bh);
> > +				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
>   And this isn't needed. Here we write out only metadata and JBD already
> handles copying those / waiting for IO in flight for metadata.

I think it only copies the page if the buffer is also part of the current
transaction (or someone called do_get_undo_access()).  Unfortunately,
if we're in data=journal mode, dirty data pages get pushed through jbd as if
they were fs metadata, but in the meantime other processes can still write to
those pages.  So I guess we need the journal to freeze those pages as soon as
they come in.

(Or we could retain that little piece, but I suppose it's a larger hammer than
necessary.)

--D
> 
> The rest of the patch looks OK and I like it much more than the previous
> version :)
> 
> 									Honza
> >  			}
> >  			cond_resched();
> >  
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index cdf1119..22990cf 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -111,12 +111,13 @@ struct bio {
> >  #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
> >  #define BIO_QUIET	10	/* Make BIO Quiet */
> >  #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
> > +#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
> >  
> >  /*
> >   * Flags starting here get preserved by bio_reset() - this includes
> >   * BIO_POOL_IDX()
> >   */
> > -#define BIO_RESET_BITS	12
> > +#define BIO_RESET_BITS	13
> >  
> >  #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
> >  
> > diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> > index 5afc4f9..714d5d9 100644
> > --- a/include/linux/buffer_head.h
> > +++ b/include/linux/buffer_head.h
> > @@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
> >  int sync_dirty_buffer(struct buffer_head *bh);
> >  int __sync_dirty_buffer(struct buffer_head *bh, int rw);
> >  void write_dirty_buffer(struct buffer_head *bh, int rw);
> > +int _submit_bh(int, struct buffer_head *, unsigned long);
> >  int submit_bh(int, struct buffer_head *);
> >  void write_boundary_block(struct block_device *bdev,
> >  			sector_t bblock, unsigned blocksize);
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index c7fc1e6..a4ed56c 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -88,7 +88,6 @@ struct inodes_stat_t {
> >  #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
> >  
> >  /* These sb flags are internal to the kernel */
> > -#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
> >  #define MS_NOSEC	(1<<28)
> >  #define MS_BORN		(1<<29)
> >  #define MS_ACTIVE	(1<<30)
> > diff --git a/mm/bounce.c b/mm/bounce.c
> > index 5f89017..a5c2ec3 100644
> > --- a/mm/bounce.c
> > +++ b/mm/bounce.c
> > @@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
> >  #ifdef CONFIG_NEED_BOUNCE_POOL
> >  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
> >  {
> > -	struct page *page;
> > -	struct backing_dev_info *bdi;
> > -	struct address_space *mapping;
> > -	struct bio_vec *from;
> > -	int i;
> > -
> >  	if (bio_data_dir(bio) != WRITE)
> >  		return 0;
> >  
> >  	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
> >  		return 0;
> >  
> > -	/*
> > -	 * Based on the first page that has a valid mapping, decide whether or
> > -	 * not we have to employ bounce buffering to guarantee stable pages.
> > -	 */
> > -	bio_for_each_segment(from, bio, i) {
> > -		page = from->bv_page;
> > -		mapping = page_mapping(page);
> > -		if (!mapping)
> > -			continue;
> > -		bdi = mapping->backing_dev_info;
> > -		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
> > -	}
> > -
> > -	return 0;
> > +	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
> >  }
> >  #else
> >  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index efe6814..4514ad7 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
> >  
> >  	if (!bdi_cap_stable_pages_required(bdi))
> >  		return;
> > -#ifdef CONFIG_NEED_BOUNCE_POOL
> > -	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
> > -		return;
> > -#endif /* CONFIG_NEED_BOUNCE_POOL */
> >  
> >  	wait_on_page_writeback(page);
> >  }
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation
  2013-03-15 10:01               ` Jan Kara
  (?)
@ 2013-03-15 23:28                 ` Darrick J. Wong
  -1 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-15 23:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel, Jan Kara

Walking a bio's page mappings has proved problematic, so create a new bio flag
to indicate that a bio's data needs to be snapshotted in order to guarantee
stable pages during writeback.  Next, for the one user (ext3/jbd) of
snapshotting, hook all the places where writes can be initiated without
PG_writeback set, and set BIO_SNAP_STABLE there.  We must also flag journal
"metadata" bios for stable writeout if data=journal, since file data is written
through the journal.  Finally, the MS_SNAP_STABLE mount flag (only used by
ext3) is now superfluous, so get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

[darrick.wong@oracle.com: Fold in a couple of small cleanups from akpm]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 fs/buffer.c                 |    9 ++++++++-
 fs/ext3/super.c             |    3 ++-
 fs/jbd/commit.c             |   28 +++++++++++++++++++++++++---
 include/linux/blk_types.h   |    3 ++-
 include/linux/buffer_head.h |    1 +
 include/linux/jbd.h         |    1 +
 include/uapi/linux/fs.h     |    1 -
 mm/bounce.c                 |   21 +--------------------
 mm/page-writeback.c         |    4 ----
 9 files changed, 40 insertions(+), 31 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b4dcb34..71578d6 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
 	}
 }
 
-int submit_bh(int rw, struct buffer_head * bh)
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
 
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
+	bio->bi_flags |= bio_flags;
 
 	/* Take care of bh's that straddle the end of the device */
 	guard_bh_eod(rw, bio, bh);
@@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
 	bio_put(bio);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(_submit_bh);
+
+int submit_bh(int rw, struct buffer_head *bh)
+{
+	return _submit_bh(rw, bh, 0);
+}
 EXPORT_SYMBOL(submit_bh);
 
 /**
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 1d6e2ed..e845b6de 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2063,11 +2063,12 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 		ext3_mark_recovery_complete(sb, es);
 		ext3_msg(sb, KERN_INFO, "recovery complete");
 	}
+	if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA)
+		EXT3_SB(sb)->s_journal->j_flags |= JFS_JOURNALS_DATA;
 	ext3_msg(sb, KERN_INFO, "mounted filesystem with %s data mode",
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
 		"writeback");
-	sb->s_flags |= MS_SNAP_STABLE;
 
 	return 0;
 
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 86b39b1..37a60dd 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -162,8 +162,17 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
 
 	for (i = 0; i < bufs; i++) {
 		wbuf[i]->b_end_io = end_buffer_write_sync;
-		/* We use-up our safety reference in submit_bh() */
-		submit_bh(write_op, wbuf[i]);
+		/*
+		 * Here we write back pagecache data that may be mmaped. Since
+		 * we cannot afford to clean the page and set PageWriteback
+		 * here due to lock ordering (page lock ranks above transaction
+		 * start), the data can change while IO is in flight. Tell the
+		 * block layer it should bounce the bio pages if stable data
+		 * during write is required.
+		 *
+		 * We use up our safety reference in submit_bh().
+		 */
+		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
 	}
 }
 
@@ -663,11 +672,24 @@ void journal_commit_transaction(journal_t *journal)
 start_journal_io:
 			for (i = 0; i < bufs; i++) {
 				struct buffer_head *bh = wbuf[i];
+				unsigned long bio_flags = 0;
 				lock_buffer(bh);
 				clear_buffer_dirty(bh);
 				set_buffer_uptodate(bh);
 				bh->b_end_io = journal_end_buffer_io_sync;
-				submit_bh(write_op, bh);
+				/*
+				 * In data=journal mode, here we can end up
+				 * writing pagecache data that might be
+				 * mmapped. Since we can't afford to clean the
+				 * page and set PageWriteback (see the comment
+				 * near the other use of _submit_bh()), the
+				 * data can change while the write is in
+				 * flight.  Tell the block layer to bounce the
+				 * bio pages if stable pages are required.
+				 */
+				if (journal->j_flags & JFS_JOURNALS_DATA)
+					bio_flags = 1 << BIO_SNAP_STABLE;
+				_submit_bh(write_op, bh, bio_flags);
 			}
 			cond_resched();
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cdf1119..22990cf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -111,12 +111,13 @@ struct bio {
 #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
 #define BIO_QUIET	10	/* Make BIO Quiet */
 #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
 
 /*
  * Flags starting here get preserved by bio_reset() - this includes
  * BIO_POOL_IDX()
  */
-#define BIO_RESET_BITS	12
+#define BIO_RESET_BITS	13
 
 #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
 
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 5afc4f9..4c16c4a 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
 int sync_dirty_buffer(struct buffer_head *bh);
 int __sync_dirty_buffer(struct buffer_head *bh, int rw);
 void write_dirty_buffer(struct buffer_head *bh, int rw);
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags);
 int submit_bh(int, struct buffer_head *);
 void write_boundary_block(struct block_device *bdev,
 			sector_t bblock, unsigned blocksize);
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index c8f3297..2bfd613 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -768,6 +768,7 @@ struct journal_s
 #define JFS_ABORT_ON_SYNCDATA_ERR	0x040  /* Abort the journal on file
 						* data write error in ordered
 						* mode */
+#define JFS_JOURNALS_DATA	0x080		/* data=journal mode */
 
 /*
  * Function declarations for the journaling transaction and buffer
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index c7fc1e6..a4ed56c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -88,7 +88,6 @@ struct inodes_stat_t {
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
 
 /* These sb flags are internal to the kernel */
-#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
 #define MS_NOSEC	(1<<28)
 #define MS_BORN		(1<<29)
 #define MS_ACTIVE	(1<<30)
diff --git a/mm/bounce.c b/mm/bounce.c
index 5f89017..a5c2ec3 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
 #ifdef CONFIG_NEED_BOUNCE_POOL
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
 {
-	struct page *page;
-	struct backing_dev_info *bdi;
-	struct address_space *mapping;
-	struct bio_vec *from;
-	int i;
-
 	if (bio_data_dir(bio) != WRITE)
 		return 0;
 
 	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
 		return 0;
 
-	/*
-	 * Based on the first page that has a valid mapping, decide whether or
-	 * not we have to employ bounce buffering to guarantee stable pages.
-	 */
-	bio_for_each_segment(from, bio, i) {
-		page = from->bv_page;
-		mapping = page_mapping(page);
-		if (!mapping)
-			continue;
-		bdi = mapping->backing_dev_info;
-		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
-	}
-
-	return 0;
+	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
 }
 #else
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index efe6814..4514ad7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
 
 	if (!bdi_cap_stable_pages_required(bdi))
 		return;
-#ifdef CONFIG_NEED_BOUNCE_POOL
-	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
-		return;
-#endif /* CONFIG_NEED_BOUNCE_POOL */
 
 	wait_on_page_writeback(page);
 }
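
(Not part of the patch — a rough sketch of how the flag is intended to flow
once this is applied.  The caller shown is illustrative; only _submit_bh(),
BIO_SNAP_STABLE, bio_flagged() and must_snapshot_stable_pages() come from
the patch above.)

	/* Filesystem side: this buffer's page may still be redirtied while
	 * the write is in flight (no PageWriteback protection), so request
	 * a snapshot if the underlying queue needs stable pages.
	 */
	bh->b_end_io = end_buffer_write_sync;
	_submit_bh(WRITE, bh, 1 << BIO_SNAP_STABLE);

	/* Block layer side: must_snapshot_stable_pages() now reduces to a
	 * flag test on queues that advertise stable-page writes.
	 */
	int force = bio_data_dir(bio) == WRITE &&
		    bdi_cap_stable_pages_required(&q->backing_dev_info) &&
		    bio_flagged(bio, BIO_SNAP_STABLE);
	/* force == 1 makes __blk_queue_bounce() copy the payload into
	 * bounce pages before the bio reaches the device */

A thin submit_snapshot_bh(rw, bh) wrapper around _submit_bh(rw, bh,
1 << BIO_SNAP_STABLE), as floated earlier in the thread, would keep arbitrary
flags out of callers' hands; it is not part of this version.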

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation
@ 2013-03-15 23:28                 ` Darrick J. Wong
  0 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-15 23:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel, Jan Kara

Walking a bio's page mappings has proved problematic, so create a new bio flag
to indicate that a bio's data needs to be snapshotted in order to guarantee
stable pages during writeback.  Next, for the one user (ext3/jbd) of
snapshotting, hook all the places where writes can be initiated without
PG_writeback set, and set BIO_SNAP_STABLE there.  We must also flag journal
"metadata" bios for stable writeout if data=journal, since file data is written
through the journal.  Finally, the MS_SNAP_STABLE mount flag (only used by
ext3) is now superfluous, so get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

[darrick.wong@oracle.com: Fold in a couple of small cleanups from akpm]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 fs/buffer.c                 |    9 ++++++++-
 fs/ext3/super.c             |    3 ++-
 fs/jbd/commit.c             |   28 +++++++++++++++++++++++++---
 include/linux/blk_types.h   |    3 ++-
 include/linux/buffer_head.h |    1 +
 include/linux/jbd.h         |    1 +
 include/uapi/linux/fs.h     |    1 -
 mm/bounce.c                 |   21 +--------------------
 mm/page-writeback.c         |    4 ----
 9 files changed, 40 insertions(+), 31 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b4dcb34..71578d6 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
 	}
 }
 
-int submit_bh(int rw, struct buffer_head * bh)
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
 
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
+	bio->bi_flags |= bio_flags;
 
 	/* Take care of bh's that straddle the end of the device */
 	guard_bh_eod(rw, bio, bh);
@@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
 	bio_put(bio);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(_submit_bh);
+
+int submit_bh(int rw, struct buffer_head *bh)
+{
+	return _submit_bh(rw, bh, 0);
+}
 EXPORT_SYMBOL(submit_bh);
 
 /**
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 1d6e2ed..e845b6de 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2063,11 +2063,12 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 		ext3_mark_recovery_complete(sb, es);
 		ext3_msg(sb, KERN_INFO, "recovery complete");
 	}
+	if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA)
+		EXT3_SB(sb)->s_journal->j_flags |= JFS_JOURNALS_DATA;
 	ext3_msg(sb, KERN_INFO, "mounted filesystem with %s data mode",
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
 		"writeback");
-	sb->s_flags |= MS_SNAP_STABLE;
 
 	return 0;
 
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 86b39b1..37a60dd 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -162,8 +162,17 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
 
 	for (i = 0; i < bufs; i++) {
 		wbuf[i]->b_end_io = end_buffer_write_sync;
-		/* We use-up our safety reference in submit_bh() */
-		submit_bh(write_op, wbuf[i]);
+		/*
+		 * Here we write back pagecache data that may be mmaped. Since
+		 * we cannot afford to clean the page and set PageWriteback
+		 * here due to lock ordering (page lock ranks above transaction
+		 * start), the data can change while IO is in flight. Tell the
+		 * block layer it should bounce the bio pages if stable data
+		 * during write is required.
+		 *
+		 * We use up our safety reference in submit_bh().
+		 */
+		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
 	}
 }
 
@@ -663,11 +672,24 @@ void journal_commit_transaction(journal_t *journal)
 start_journal_io:
 			for (i = 0; i < bufs; i++) {
 				struct buffer_head *bh = wbuf[i];
+				unsigned long bio_flags = 0;
 				lock_buffer(bh);
 				clear_buffer_dirty(bh);
 				set_buffer_uptodate(bh);
 				bh->b_end_io = journal_end_buffer_io_sync;
-				submit_bh(write_op, bh);
+				/*
+				 * In data=journal mode, here we can end up
+				 * writing pagecache data that might be
+				 * mmapped. Since we can't afford to clean the
+				 * page and set PageWriteback (see the comment
+				 * near the other use of _submit_bh()), the
+				 * data can change while the write is in
+				 * flight.  Tell the block layer to bounce the
+				 * bio pages if stable pages are required.
+				 */
+				if (journal->j_flags & JFS_JOURNALS_DATA)
+					bio_flags = 1 << BIO_SNAP_STABLE;
+				_submit_bh(write_op, bh, bio_flags);
 			}
 			cond_resched();
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cdf1119..22990cf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -111,12 +111,13 @@ struct bio {
 #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
 #define BIO_QUIET	10	/* Make BIO Quiet */
 #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
 
 /*
  * Flags starting here get preserved by bio_reset() - this includes
  * BIO_POOL_IDX()
  */
-#define BIO_RESET_BITS	12
+#define BIO_RESET_BITS	13
 
 #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
 
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 5afc4f9..4c16c4a 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
 int sync_dirty_buffer(struct buffer_head *bh);
 int __sync_dirty_buffer(struct buffer_head *bh, int rw);
 void write_dirty_buffer(struct buffer_head *bh, int rw);
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags);
 int submit_bh(int, struct buffer_head *);
 void write_boundary_block(struct block_device *bdev,
 			sector_t bblock, unsigned blocksize);
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index c8f3297..2bfd613 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -768,6 +768,7 @@ struct journal_s
 #define JFS_ABORT_ON_SYNCDATA_ERR	0x040  /* Abort the journal on file
 						* data write error in ordered
 						* mode */
+#define JFS_JOURNALS_DATA	0x080		/* data=journal mode */
 
 /*
  * Function declarations for the journaling transaction and buffer
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index c7fc1e6..a4ed56c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -88,7 +88,6 @@ struct inodes_stat_t {
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
 
 /* These sb flags are internal to the kernel */
-#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
 #define MS_NOSEC	(1<<28)
 #define MS_BORN		(1<<29)
 #define MS_ACTIVE	(1<<30)
diff --git a/mm/bounce.c b/mm/bounce.c
index 5f89017..a5c2ec3 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
 #ifdef CONFIG_NEED_BOUNCE_POOL
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
 {
-	struct page *page;
-	struct backing_dev_info *bdi;
-	struct address_space *mapping;
-	struct bio_vec *from;
-	int i;
-
 	if (bio_data_dir(bio) != WRITE)
 		return 0;
 
 	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
 		return 0;
 
-	/*
-	 * Based on the first page that has a valid mapping, decide whether or
-	 * not we have to employ bounce buffering to guarantee stable pages.
-	 */
-	bio_for_each_segment(from, bio, i) {
-		page = from->bv_page;
-		mapping = page_mapping(page);
-		if (!mapping)
-			continue;
-		bdi = mapping->backing_dev_info;
-		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
-	}
-
-	return 0;
+	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
 }
 #else
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index efe6814..4514ad7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
 
 	if (!bdi_cap_stable_pages_required(bdi))
 		return;
-#ifdef CONFIG_NEED_BOUNCE_POOL
-	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
-		return;
-#endif /* CONFIG_NEED_BOUNCE_POOL */
 
 	wait_on_page_writeback(page);
 }


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation
@ 2013-03-15 23:28                 ` Darrick J. Wong
  0 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-15 23:28 UTC (permalink / raw)
  To: linux-arm-kernel

Walking a bio's page mappings has proved problematic, so create a new bio flag
to indicate that a bio's data needs to be snapshotted in order to guarantee
stable pages during writeback.  Next, for the one user (ext3/jbd) of
snapshotting, hook all the places where writes can be initiated without
PG_writeback set, and set BIO_SNAP_STABLE there.  We must also flag journal
"metadata" bios for stable writeout if data=journal, since file data is written
through the journal.  Finally, the MS_SNAP_STABLE mount flag (only used by
ext3) is now superfluous, so get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

[darrick.wong at oracle.com: Fold in a couple of small cleanups from akpm]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 fs/buffer.c                 |    9 ++++++++-
 fs/ext3/super.c             |    3 ++-
 fs/jbd/commit.c             |   28 +++++++++++++++++++++++++---
 include/linux/blk_types.h   |    3 ++-
 include/linux/buffer_head.h |    1 +
 include/linux/jbd.h         |    1 +
 include/uapi/linux/fs.h     |    1 -
 mm/bounce.c                 |   21 +--------------------
 mm/page-writeback.c         |    4 ----
 9 files changed, 40 insertions(+), 31 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b4dcb34..71578d6 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
 	}
 }
 
-int submit_bh(int rw, struct buffer_head * bh)
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
 
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
+	bio->bi_flags |= bio_flags;
 
 	/* Take care of bh's that straddle the end of the device */
 	guard_bh_eod(rw, bio, bh);
@@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
 	bio_put(bio);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(_submit_bh);
+
+int submit_bh(int rw, struct buffer_head *bh)
+{
+	return _submit_bh(rw, bh, 0);
+}
 EXPORT_SYMBOL(submit_bh);
 
 /**
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 1d6e2ed..e845b6de 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2063,11 +2063,12 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 		ext3_mark_recovery_complete(sb, es);
 		ext3_msg(sb, KERN_INFO, "recovery complete");
 	}
+	if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA)
+		EXT3_SB(sb)->s_journal->j_flags |= JFS_JOURNALS_DATA;
 	ext3_msg(sb, KERN_INFO, "mounted filesystem with %s data mode",
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
 		"writeback");
-	sb->s_flags |= MS_SNAP_STABLE;
 
 	return 0;
 
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 86b39b1..37a60dd 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -162,8 +162,17 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
 
 	for (i = 0; i < bufs; i++) {
 		wbuf[i]->b_end_io = end_buffer_write_sync;
-		/* We use-up our safety reference in submit_bh() */
-		submit_bh(write_op, wbuf[i]);
+		/*
+		 * Here we write back pagecache data that may be mmaped. Since
+		 * we cannot afford to clean the page and set PageWriteback
+		 * here due to lock ordering (page lock ranks above transaction
+		 * start), the data can change while IO is in flight. Tell the
+		 * block layer it should bounce the bio pages if stable data
+		 * during write is required.
+		 *
+		 * We use up our safety reference in submit_bh().
+		 */
+		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
 	}
 }
 
@@ -663,11 +672,24 @@ void journal_commit_transaction(journal_t *journal)
 start_journal_io:
 			for (i = 0; i < bufs; i++) {
 				struct buffer_head *bh = wbuf[i];
+				unsigned long bio_flags = 0;
 				lock_buffer(bh);
 				clear_buffer_dirty(bh);
 				set_buffer_uptodate(bh);
 				bh->b_end_io = journal_end_buffer_io_sync;
-				submit_bh(write_op, bh);
+				/*
+				 * In data=journal mode, here we can end up
+				 * writing pagecache data that might be
+				 * mmapped. Since we can't afford to clean the
+				 * page and set PageWriteback (see the comment
+				 * near the other use of _submit_bh()), the
+				 * data can change while the write is in
+				 * flight.  Tell the block layer to bounce the
+				 * bio pages if stable pages are required.
+				 */
+				if (journal->j_flags & JFS_JOURNALS_DATA)
+					bio_flags = 1 << BIO_SNAP_STABLE;
+				_submit_bh(write_op, bh, bio_flags);
 			}
 			cond_resched();
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cdf1119..22990cf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -111,12 +111,13 @@ struct bio {
 #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
 #define BIO_QUIET	10	/* Make BIO Quiet */
 #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
 
 /*
  * Flags starting here get preserved by bio_reset() - this includes
  * BIO_POOL_IDX()
  */
-#define BIO_RESET_BITS	12
+#define BIO_RESET_BITS	13
 
 #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
 
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 5afc4f9..4c16c4a 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
 int sync_dirty_buffer(struct buffer_head *bh);
 int __sync_dirty_buffer(struct buffer_head *bh, int rw);
 void write_dirty_buffer(struct buffer_head *bh, int rw);
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags);
 int submit_bh(int, struct buffer_head *);
 void write_boundary_block(struct block_device *bdev,
 			sector_t bblock, unsigned blocksize);
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index c8f3297..2bfd613 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -768,6 +768,7 @@ struct journal_s
 #define JFS_ABORT_ON_SYNCDATA_ERR	0x040  /* Abort the journal on file
 						* data write error in ordered
 						* mode */
+#define JFS_JOURNALS_DATA	0x080		/* data=journal mode */
 
 /*
  * Function declarations for the journaling transaction and buffer
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index c7fc1e6..a4ed56c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -88,7 +88,6 @@ struct inodes_stat_t {
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
 
 /* These sb flags are internal to the kernel */
-#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
 #define MS_NOSEC	(1<<28)
 #define MS_BORN		(1<<29)
 #define MS_ACTIVE	(1<<30)
diff --git a/mm/bounce.c b/mm/bounce.c
index 5f89017..a5c2ec3 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
 #ifdef CONFIG_NEED_BOUNCE_POOL
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
 {
-	struct page *page;
-	struct backing_dev_info *bdi;
-	struct address_space *mapping;
-	struct bio_vec *from;
-	int i;
-
 	if (bio_data_dir(bio) != WRITE)
 		return 0;
 
 	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
 		return 0;
 
-	/*
-	 * Based on the first page that has a valid mapping, decide whether or
-	 * not we have to employ bounce buffering to guarantee stable pages.
-	 */
-	bio_for_each_segment(from, bio, i) {
-		page = from->bv_page;
-		mapping = page_mapping(page);
-		if (!mapping)
-			continue;
-		bdi = mapping->backing_dev_info;
-		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
-	}
-
-	return 0;
+	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
 }
 #else
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index efe6814..4514ad7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
 
 	if (!bdi_cap_stable_pages_required(bdi))
 		return;
-#ifdef CONFIG_NEED_BOUNCE_POOL
-	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
-		return;
-#endif /* CONFIG_NEED_BOUNCE_POOL */
 
 	wait_on_page_writeback(page);
 }

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
  2013-03-15 17:54                 ` Darrick J. Wong
  (?)
@ 2013-03-18 17:32                   ` Jan Kara
  -1 siblings, 0 replies; 74+ messages in thread
From: Jan Kara @ 2013-03-18 17:32 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, Andrew Morton, Shuge, linux-kernel, linux-mm,
	linux-ext4, Kevin, Theodore Ts'o, Jens Axboe,
	Catalin Marinas, Will Deacon, linux-arm-kernel

On Fri 15-03-13 10:54:41, Darrick J. Wong wrote:
> On Fri, Mar 15, 2013 at 11:01:05AM +0100, Jan Kara wrote:
> > On Thu 14-03-13 15:42:43, Darrick J. Wong wrote:
> > > On Wed, Mar 13, 2013 at 10:02:16PM +0100, Jan Kara wrote:
> > > > On Wed 13-03-13 12:44:29, Darrick J. Wong wrote:
> > > > > On Wed, Mar 13, 2013 at 09:50:21AM +0100, Jan Kara wrote:
> > > > > > On Tue 12-03-13 18:10:20, Darrick J. Wong wrote:
> > > > > > > On Tue, Mar 12, 2013 at 03:32:21PM -0700, Andrew Morton wrote:
> > > > > > > > On Fri, 08 Mar 2013 20:37:36 +0800 Shuge <shugelinux@gmail.com> wrote:
> > > > > > > > 
> > > > > > > > > The bounce accept slab pages from jbd2, and flush dcache on them.
> > > > > > > > > When enabling VM_DEBUG, it will tigger VM_BUG_ON in page_mapping().
> > > > > > > > > So, check PageSlab to avoid it in __blk_queue_bounce().
> > > > > > > > > 
> > > > > > > > > Bug URL: http://lkml.org/lkml/2013/3/7/56
> > > > > > > > > 
> > > > > > > > > ...
> > > > > > > > >
> > > > > > > > > --- a/mm/bounce.c
> > > > > > > > > +++ b/mm/bounce.c
> > > > > > > > > @@ -214,7 +214,8 @@ static void __blk_queue_bounce(struct request_queue 
> > > > > > > > > *q, struct bio **bio_orig,
> > > > > > > > >   		if (rw == WRITE) {
> > > > > > > > >   			char *vto, *vfrom;
> > > > > > > > >   -			flush_dcache_page(from->bv_page);
> > > > > > > > > +			if (unlikely(!PageSlab(from->bv_page)))
> > > > > > > > > +				flush_dcache_page(from->bv_page);
> > > > > > > > >   			vto = page_address(to->bv_page) + to->bv_offset;
> > > > > > > > >   			vfrom = kmap(from->bv_page) + from->bv_offset;
> > > > > > > > >   			memcpy(vto, vfrom, to->bv_len);
> > > > > > > > 
> > > > > > > > I guess this is triggered by Catalin's f1a0c4aa0937975b ("arm64: Cache
> > > > > > > > maintenance routines"), which added a page_mapping() call to arm64's
> > > > > > > > arch/arm64/mm/flush.c:flush_dcache_page().
> > > > > > > > 
> > > > > > > > What's happening is that jbd2 is using kmalloc() to allocate buffer_head
> > > > > > > > data.  That gets submitted down the BIO layer and __blk_queue_bounce()
> > > > > > > > calls flush_dcache_page() which in the arm64 case calls page_mapping()
> > > > > > > > and page_mapping() does VM_BUG_ON(PageSlab) and splat.
> > > > > > > > 
> > > > > > > > The unusual thing about all of this is that the payload for some disk
> > > > > > > > IO is coming from kmalloc, rather than being a user page.  It's oddball
> > > > > > > > but we've done this for ages and should continue to support it.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Now, the page from kmalloc() cannot be in highmem, so why did the
> > > > > > > > bounce code decide to bounce it?
> > > > > > > > 
> > > > > > > > __blk_queue_bounce() does
> > > > > > > > 
> > > > > > > > 		/*
> > > > > > > > 		 * is destination page below bounce pfn?
> > > > > > > > 		 */
> > > > > > > > 		if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
> > > > > > > > 			continue;
> > > > > > > > 
> > > > > > > > and `force' comes from must_snapshot_stable_pages().  But
> > > > > > > > must_snapshot_stable_pages() must have returned false, because if it
> > > > > > > > had returned true then it would have been must_snapshot_stable_pages()
> > > > > > > > which went BUG, because must_snapshot_stable_pages() calls page_mapping().
> > > > > > > > 
> > > > > > > > So my tentative diagosis is that arm64 is fishy.  A page which was
> > > > > > > > allocated via jbd2_alloc(GFP_NOFS)->kmem_cache_alloc() ended up being
> > > > > > > > above arm64's queue_bounce_pfn().  Can you please do a bit of
> > > > > > > > investigation to work out if this is what is happening?  Find out why
> > > > > > > > __blk_queue_bounce() decided to bounce a page which shouldn't have been
> > > > > > > > bounced?
> > > > > > > 
> > > > > > > That sure is strange.  I didn't see any obvious reasons why we'd end up with a
> > > > > > > kmalloc above queue_bounce_pfn().  But then I don't have any arm64s either.
> > > > > > > 
> > > > > > > > This is all terribly fragile :( afaict if someone sets
> > > > > > > > bdi_cap_stable_pages_required() against that jbd2 queue, we're going to
> > > > > > > > hit that BUG_ON() again, via must_snapshot_stable_pages()'s
> > > > > > > > page_mapping() call.  (Darrick, this means you ;))
> > > > > > > 
> > > > > > > Wheeee.  You're right, we shouldn't be calling page_mapping on slab pages.
> > > > > > > We can keep walking the bio segments to find a non-slab page that can tell us
> > > > > > > MS_SNAP_STABLE is set, since we probably won't need the bounce buffer anyway.
> > > > > > > 
> > > > > > > How does something like this look?  (+ the patch above)
> > > > > >   Umm, this won't quite work. We can have a bio which has just PageSlab
> > > > > > page attached and so you won't be able to get to the superblock. Heh, isn't
> > > > > > the whole page_mapping() thing in must_snapshot_stable_pages() wrong? When we
> > > > > > do direct IO, these pages come directly from userspace and hell knows where
> > > > > > they come from. Definitely their page_mapping() doesn't give us anything
> > > > > > useful... Sorry for not realizing this earlier when reviewing the patch.
> > > > > > 
> > > > > > ... remembering why we need to get to sb and why ext3 needs this ... So
> > > > > > maybe a better solution would be to have a bio flag meaning that pages need
> > > > > > bouncing? And we would set it from filesystems that need it - in case of
> > > > > > ext3 only writeback of data from kjournald actually needs to bounce the
> > > > > > pages. Thoughts?
> > > > > 
> > > > > What about dirty pages that don't result in journal transactions?  I think
> > > > > ext3_sync_file() eventually calls ext3_ordered_writepage, which then calls
> > > > > __block_write_full_page, which in turn calls submit_bh().
> > > >   So here we have two options:
> > > > Either we let ext3 wait the same way as other filesystems when stable pages
> > > > are required. Then only data IO from kjournald needs to be bounced (all
> > > > other IO is properly protected by PageWriteback bit).
> > > > 
> > > > Or we won't let ext3 wait (as it is now), keep the superblock flag that fs
> > > > needs bouncing, and set the bio flag in __block_write_full_page() and
> > > > kjournald based on the sb flag.
> > > > 
> > > > I think the first option is slightly better but I don't feel strongly
> > > > about that.
> > > 
> > > I like that first option -- it contains the kludgery to jbd instead of
> > > spreading it around.  Here's a patch that passes a quick smoke test on ext[34],
> > > xfs, and vfat.  What do you think of this one?  Should I create a
> > > submit_snapshot_bh() instead of letting callers stuff in arbitrary dangerous
> > > BH_ flags?
> >   Thanks for writing the patch. I think _submit_bh() is OK as you did it. I
> > have just two comments below.
> > 
> > > ---
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > Subject: [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation
> > > 
> > > Walking a bio's page mappings has proved problematic, so create a new bio flag
> > > to indicate that a bio's data needs to be snapshotted in order to guarantee
> > > stable pages during writeback.  Next, for the one user (ext3/jbd) of
> > > snapshotting, hook all the places where writes can be initiated without
> > > PG_writeback set, and set BIO_SNAP_STABLE there.  Finally, the MS_SNAP_STABLE
> > > mount flag (only used by ext3) is now superfluous, so get rid of it.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/buffer.c                 |    9 ++++++++-
> > >  fs/ext3/super.c             |    1 -
> > >  fs/jbd/commit.c             |    4 ++--
> > >  include/linux/blk_types.h   |    3 ++-
> > >  include/linux/buffer_head.h |    1 +
> > >  include/uapi/linux/fs.h     |    1 -
> > >  mm/bounce.c                 |   21 +--------------------
> > >  mm/page-writeback.c         |    4 ----
> > >  8 files changed, 14 insertions(+), 30 deletions(-)
> > > 
> > ...
> > > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> > > index 86b39b1..b91b688 100644
> > > --- a/fs/jbd/commit.c
> > > +++ b/fs/jbd/commit.c
> > > @@ -163,7 +163,7 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
> > >  	for (i = 0; i < bufs; i++) {
> > >  		wbuf[i]->b_end_io = end_buffer_write_sync;
> > >  		/* We use-up our safety reference in submit_bh() */
> > > -		submit_bh(write_op, wbuf[i]);
> > > +		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
> >   Please add a comment here why we need BIO_SNAP_STABLE. Something like:
> > /*
> >  * Here we write back pagecache data that may be mmaped. Since we cannot
> >  * afford to clean the page and set PageWriteback here due to lock ordering
> >  * (page lock ranks above transaction start), the data can change while IO is
> >  * in flight. Tell the block layer it should bounce the bio pages if stable
> >  * data during write is required.
> >  */
> > 
> > >  	}
> > >  }
> > >  
> > > @@ -667,7 +667,7 @@ start_journal_io:
> > >  				clear_buffer_dirty(bh);
> > >  				set_buffer_uptodate(bh);
> > >  				bh->b_end_io = journal_end_buffer_io_sync;
> > > -				submit_bh(write_op, bh);
> > > +				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
> >   And this isn't needed. Here we write out only metadata and JBD already
> > handles copying those / waiting for IO in flight for metadata.
> 
> I think it only copies the page if either the buffer is also a part of the
> current transaction (or someone called do_get_undo_access()).  Unfortunately,
> if we're in data=journal mode, dirty data pages get pushed through jbd as if
> they were fs metadata, but in the meantime other processes can still write to
> those pages.  So I guess we need the journal to freeze those pages as soon as
> they come in.
  So you miss the part that do_get_write_access() actually waits for the
buffer if it is undergoing commit (kjournald is writing it). But you are
right that in data=journal mode, if the page is mmapped, a user can change
it while kjournald is doing the writeout. Actually the same problem exists
with ext4 in data=journal mode. So for now I'd stay with your simple patch,
and later we can optimize it so that we don't pay the penalty when the
buffer is not journalled data.
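
(A purely hypothetical sketch of that later optimization — buffer_jbd_data()
does not exist in jbd; some per-buffer marking of "this is journalled file
data" would have to be invented before anything like this could work:)

	unsigned long bio_flags = 0;
	if ((journal->j_flags & JFS_JOURNALS_DATA) &&
	    buffer_jbd_data(bh))	/* hypothetical predicate */
		bio_flags = 1 << BIO_SNAP_STABLE;
	_submit_bh(write_op, bh, bio_flags);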

								Honza

---
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2.
@ 2013-03-18 17:32                   ` Jan Kara
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Kara @ 2013-03-18 17:32 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, Andrew Morton, Shuge, linux-kernel, linux-mm,
	linux-ext4, Kevin, Theodore Ts'o, Jens Axboe,
	Catalin Marinas, Will Deacon, linux-arm-kernel

On Fri 15-03-13 10:54:41, Darrick J. Wong wrote:
> On Fri, Mar 15, 2013 at 11:01:05AM +0100, Jan Kara wrote:
> > On Thu 14-03-13 15:42:43, Darrick J. Wong wrote:
> > > On Wed, Mar 13, 2013 at 10:02:16PM +0100, Jan Kara wrote:
> > > > On Wed 13-03-13 12:44:29, Darrick J. Wong wrote:
> > > > > On Wed, Mar 13, 2013 at 09:50:21AM +0100, Jan Kara wrote:
> > > > > > On Tue 12-03-13 18:10:20, Darrick J. Wong wrote:
> > > > > > > On Tue, Mar 12, 2013 at 03:32:21PM -0700, Andrew Morton wrote:
> > > > > > > > On Fri, 08 Mar 2013 20:37:36 +0800 Shuge <shugelinux@gmail.com> wrote:
> > > > > > > > 
> > > > > > > > > The bounce accept slab pages from jbd2, and flush dcache on them.
> > > > > > > > > When enabling VM_DEBUG, it will tigger VM_BUG_ON in page_mapping().
> > > > > > > > > So, check PageSlab to avoid it in __blk_queue_bounce().
> > > > > > > > > 
> > > > > > > > > Bug URL: http://lkml.org/lkml/2013/3/7/56
> > > > > > > > > 
> > > > > > > > > ...
> > > > > > > > >
> > > > > > > > > --- a/mm/bounce.c
> > > > > > > > > +++ b/mm/bounce.c
> > > > > > > > > @@ -214,7 +214,8 @@ static void __blk_queue_bounce(struct request_queue 
> > > > > > > > > *q, struct bio **bio_orig,
> > > > > > > > >   		if (rw == WRITE) {
> > > > > > > > >   			char *vto, *vfrom;
> > > > > > > > >   -			flush_dcache_page(from->bv_page);
> > > > > > > > > +			if (unlikely(!PageSlab(from->bv_page)))
> > > > > > > > > +				flush_dcache_page(from->bv_page);
> > > > > > > > >   			vto = page_address(to->bv_page) + to->bv_offset;
> > > > > > > > >   			vfrom = kmap(from->bv_page) + from->bv_offset;
> > > > > > > > >   			memcpy(vto, vfrom, to->bv_len);
> > > > > > > > 
> > > > > > > > I guess this is triggered by Catalin's f1a0c4aa0937975b ("arm64: Cache
> > > > > > > > maintenance routines"), which added a page_mapping() call to arm64's
> > > > > > > > arch/arm64/mm/flush.c:flush_dcache_page().
> > > > > > > > 
> > > > > > > > What's happening is that jbd2 is using kmalloc() to allocate buffer_head
> > > > > > > > data.  That gets submitted down the BIO layer and __blk_queue_bounce()
> > > > > > > > calls flush_dcache_page() which in the arm64 case calls page_mapping()
> > > > > > > > and page_mapping() does VM_BUG_ON(PageSlab) and splat.
> > > > > > > > 
> > > > > > > > The unusual thing about all of this is that the payload for some disk
> > > > > > > > IO is coming from kmalloc, rather than being a user page.  It's oddball
> > > > > > > > but we've done this for ages and should continue to support it.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Now, the page from kmalloc() cannot be in highmem, so why did the
> > > > > > > > bounce code decide to bounce it?
> > > > > > > > 
> > > > > > > > __blk_queue_bounce() does
> > > > > > > > 
> > > > > > > > 		/*
> > > > > > > > 		 * is destination page below bounce pfn?
> > > > > > > > 		 */
> > > > > > > > 		if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
> > > > > > > > 			continue;
> > > > > > > > 
> > > > > > > > and `force' comes from must_snapshot_stable_pages().  But
> > > > > > > > must_snapshot_stable_pages() must have returned false, because if it
> > > > > > > > had returned true then it would have been must_snapshot_stable_pages()
> > > > > > > > which went BUG, because must_snapshot_stable_pages() calls page_mapping().
> > > > > > > > 
> > > > > > > > So my tentative diagosis is that arm64 is fishy.  A page which was
> > > > > > > > allocated via jbd2_alloc(GFP_NOFS)->kmem_cache_alloc() ended up being
> > > > > > > > above arm64's queue_bounce_pfn().  Can you please do a bit of
> > > > > > > > investigation to work out if this is what is happening?  Find out why
> > > > > > > > __blk_queue_bounce() decided to bounce a page which shouldn't have been
> > > > > > > > bounced?
> > > > > > > 
> > > > > > > That sure is strange.  I didn't see any obvious reasons why we'd end up with a
> > > > > > > kmalloc above queue_bounce_pfn().  But then I don't have any arm64s either.
> > > > > > > 
> > > > > > > > This is all terribly fragile :( afaict if someone sets
> > > > > > > > bdi_cap_stable_pages_required() against that jbd2 queue, we're going to
> > > > > > > > hit that BUG_ON() again, via must_snapshot_stable_pages()'s
> > > > > > > > page_mapping() call.  (Darrick, this means you ;))
> > > > > > > 
> > > > > > > Wheeee.  You're right, we shouldn't be calling page_mapping on slab pages.
> > > > > > > We can keep walking the bio segments to find a non-slab page that can tell us
> > > > > > > MS_SNAP_STABLE is set, since we probably won't need the bounce buffer anyway.
> > > > > > > 
> > > > > > > How does something like this look?  (+ the patch above)
> > > > > >   Umm, this won't quite work. We can have a bio which has just PageSlab
> > > > > > page attached and so you won't be able to get to the superblock. Heh, isn't
> > > > > > the whole page_mapping() thing in must_snapshot_stable_pages() wrong? When we
> > > > > > do direct IO, these pages come directly from userspace and hell knows where
> > > > > > they come from. Definitely their page_mapping() doesn't give us anything
> > > > > > useful... Sorry for not realizing this earlier when reviewing the patch.
> > > > > > 
> > > > > > ... remembering why we need to get to sb and why ext3 needs this ... So
> > > > > > maybe a better solution would be to have a bio flag meaning that pages need
> > > > > > bouncing? And we would set it from filesystems that need it - in case of
> > > > > > ext3 only writeback of data from kjournald actually needs to bounce the
> > > > > > pages. Thoughts?
> > > > > 
> > > > > What about dirty pages that don't result in journal transactions?  I think
> > > > > ext3_sync_file() eventually calls ext3_ordered_writepage, which then calls
> > > > > __block_write_full_page, which in turn calls submit_bh().
> > > >   So here we have two options:
> > > > Either we let ext3 wait the same way as other filesystems when stable pages
> > > > are required. Then only data IO from kjournald needs to be bounced (all
> > > > other IO is properly protected by PageWriteback bit).
> > > > 
> > > > Or we won't let ext3 wait (as it is now), keep the superblock flag that fs
> > > > needs bouncing, and set the bio flag in __block_write_full_page() and
> > > > kjournald based on the sb flag.
> > > > 
> > > > I think the first option is slightly better but I don't feel strongly
> > > > about that.
> > > 
> > > I like that first option -- it contains the kludgery to jbd instead of
> > > spreading it around.  Here's a patch that passes a quick smoke test on ext[34],
> > > xfs, and vfat.  What do you think of this one?  Should I create a
> > > submit_snapshot_bh() instead of letting callers stuff in arbitrary dangerous
> > > BH_ flags?
> >   Thanks for writing the patch. I think _submit_bh() is OK as you did it. I
> > have just two comments below.
> > 
> > > ---
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > Subject: [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation
> > > 
> > > Walking a bio's page mappings has proved problematic, so create a new bio flag
> > > to indicate that a bio's data needs to be snapshotted in order to guarantee
> > > stable pages during writeback.  Next, for the one user (ext3/jbd) of
> > > snapshotting, hook all the places where writes can be initiated without
> > > PG_writeback set, and set BIO_SNAP_STABLE there.  Finally, the MS_SNAP_STABLE
> > > mount flag (only used by ext3) is now superfluous, so get rid of it.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/buffer.c                 |    9 ++++++++-
> > >  fs/ext3/super.c             |    1 -
> > >  fs/jbd/commit.c             |    4 ++--
> > >  include/linux/blk_types.h   |    3 ++-
> > >  include/linux/buffer_head.h |    1 +
> > >  include/uapi/linux/fs.h     |    1 -
> > >  mm/bounce.c                 |   21 +--------------------
> > >  mm/page-writeback.c         |    4 ----
> > >  8 files changed, 14 insertions(+), 30 deletions(-)
> > > 
> > ...
> > > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> > > index 86b39b1..b91b688 100644
> > > --- a/fs/jbd/commit.c
> > > +++ b/fs/jbd/commit.c
> > > @@ -163,7 +163,7 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
> > >  	for (i = 0; i < bufs; i++) {
> > >  		wbuf[i]->b_end_io = end_buffer_write_sync;
> > >  		/* We use-up our safety reference in submit_bh() */
> > > -		submit_bh(write_op, wbuf[i]);
> > > +		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
> >   Please add a comment here why we need BIO_SNAP_STABLE. Something like:
> > /*
> >  * Here we write back pagecache data that may be mmaped. Since we cannot
> >  * afford to clean the page and set PageWriteback here due to lock ordering
> >  * (page lock ranks above transaction start), the data can change while IO is
> >  * in flight. Tell the block layer it should bounce the bio pages if stable
> >  * data during write is required.
> >  */
> > 
> > >  	}
> > >  }
> > >  
> > > @@ -667,7 +667,7 @@ start_journal_io:
> > >  				clear_buffer_dirty(bh);
> > >  				set_buffer_uptodate(bh);
> > >  				bh->b_end_io = journal_end_buffer_io_sync;
> > > -				submit_bh(write_op, bh);
> > > +				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
> >   And this isn't needed. Here we write out only metadata and JBD already
> > handles copying those / waiting for IO in flight for metadata.
> 
> I think it only copies the page if either the buffer is also a part of the
> current transaction (or someone called do_get_undo_access()).  Unfortunately,
> if we're in data=journal mode, dirty data pages get pushed through jbd as if
> they were fs metadata, but in the meantime other processes can still write to
> those pages.  So I guess we need the journal to freeze those pages as soon as
> they come in.
  So you missed the part that do_get_write_access() actually waits for the
buffer if it is undergoing commit (kjournald is writing it). But you are
right that in data=journal mode, if the page is mmapped, the user can change
it while kjournald is doing writeout. Actually the same problem exists for
ext4 in data=journal mode. So for now I'd stay with your simple patch, and
later we can optimize it so that we don't have to pay the penalty when the
buffer is not journalled data.

								Honza

---
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 74+ messages in thread
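
For reference, the assertion that fires is the PageSlab() check at the top of
page_mapping(); a paraphrased sketch of the helper as it looked around that
time (from memory, not copied from any particular tree):

/*
 * Paraphrased sketch, not an exact copy of any tree: page_mapping()
 * refuses slab pages before interpreting page->mapping, which is why
 * handing a kmalloc()ed jbd2 buffer to flush_dcache_page() on arm64
 * (whose implementation calls page_mapping()) goes BUG.
 */
struct address_space *page_mapping(struct page *page)
{
	struct address_space *mapping = page->mapping;

	VM_BUG_ON(PageSlab(page));	/* <-- the splat discussed above */
	/* swap-cache handling elided */
	if ((unsigned long)mapping & PAGE_MAPPING_ANON)
		mapping = NULL;		/* anonymous pages have no mapping */
	return mapping;
}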


* Re: [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation
  2013-03-15 23:28                 ` Darrick J. Wong
@ 2013-03-18 17:41                   ` Jan Kara
  -1 siblings, 0 replies; 74+ messages in thread
From: Jan Kara @ 2013-03-18 17:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrew Morton, Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel, Jan Kara

On Fri 15-03-13 16:28:16, Darrick J. Wong wrote:
> Walking a bio's page mappings has proved problematic, so create a new bio flag
> to indicate that a bio's data needs to be snapshotted in order to guarantee
> stable pages during writeback.  Next, for the one user (ext3/jbd) of
> snapshotting, hook all the places where writes can be initiated without
> PG_writeback set, and set BIO_SNAP_STABLE there.  We must also flag journal
> "metadata" bios for stable writeout if data=journal, since file data is written
> through the journal.  Finally, the MS_SNAP_STABLE mount flag (only used by
> ext3) is now superfluous, so get rid of it.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> [darrick.wong@oracle.com: Fold in a couple of small cleanups from akpm]
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>  fs/buffer.c                 |    9 ++++++++-
>  fs/ext3/super.c             |    3 ++-
>  fs/jbd/commit.c             |   28 +++++++++++++++++++++++++---
>  include/linux/blk_types.h   |    3 ++-
>  include/linux/buffer_head.h |    1 +
>  include/linux/jbd.h         |    1 +
>  include/uapi/linux/fs.h     |    1 -
>  mm/bounce.c                 |   21 +--------------------
>  mm/page-writeback.c         |    4 ----
>  9 files changed, 40 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b4dcb34..71578d6 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
>  	}
>  }
>  
> -int submit_bh(int rw, struct buffer_head * bh)
> +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
>  {
>  	struct bio *bio;
>  	int ret = 0;
> @@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
>  
>  	bio->bi_end_io = end_bio_bh_io_sync;
>  	bio->bi_private = bh;
> +	bio->bi_flags |= bio_flags;
>  
>  	/* Take care of bh's that straddle the end of the device */
>  	guard_bh_eod(rw, bio, bh);
> @@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
>  	bio_put(bio);
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(_submit_bh);
> +
> +int submit_bh(int rw, struct buffer_head *bh)
> +{
> +	return _submit_bh(rw, bh, 0);
> +}
>  EXPORT_SYMBOL(submit_bh);
>  
>  /**
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 1d6e2ed..e845b6de 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2063,11 +2063,12 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  		ext3_mark_recovery_complete(sb, es);
>  		ext3_msg(sb, KERN_INFO, "recovery complete");
>  	}
> +	if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA)
> +		EXT3_SB(sb)->s_journal->j_flags |= JFS_JOURNALS_DATA;
  Sadly this isn't enough. You can have inodes which journal data (there's
an inode flag for this) even in data=ordered mode. So what you have to do is
flag journal_heads (or buffer_heads) as containing journalled data. Or you
can actually use the PageChecked flag for this (it is going to be set on all
write-enabled pages with journalled data). But it definitely also requires
some playing with ->page_mkwrite() (calling ext3_journal_get_write_access()
from there), and generally I'd rather postpone that to a separate commit. So
just keep it simple and always set the bio flag as you did in the previous
version. I'll write an optimization (mostly because ext4 needs it as well)
and send it to you for testing.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread
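
The per-inode case mentioned above is the EXT3_JOURNAL_DATA_FL inode flag;
ext3 already has a helper that captures the ways an inode's data ends up
journalled, roughly as follows (paraphrased from memory, so check the actual
tree before relying on it):

/* Paraphrased sketch of the existing ext3 helper, not an exact copy. */
static inline int ext3_should_journal_data(struct inode *inode)
{
	if (!S_ISREG(inode->i_mode))
		return 1;	/* non-regular files are always journalled */
	if (test_opt(inode->i_sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA)
		return 1;	/* data=journal mount option */
	if (EXT3_I(inode)->i_flags & EXT3_JOURNAL_DATA_FL)
		return 1;	/* per-inode journalling flag */
	return 0;
}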


* Re: [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation
  2013-03-18 17:41                   ` Jan Kara
@ 2013-03-18 23:01                     ` Darrick J. Wong
  -1 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-18 23:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel

On Mon, Mar 18, 2013 at 06:41:34PM +0100, Jan Kara wrote:
> On Fri 15-03-13 16:28:16, Darrick J. Wong wrote:
> > Walking a bio's page mappings has proved problematic, so create a new bio flag
> > to indicate that a bio's data needs to be snapshotted in order to guarantee
> > stable pages during writeback.  Next, for the one user (ext3/jbd) of
> > snapshotting, hook all the places where writes can be initiated without
> > PG_writeback set, and set BIO_SNAP_STABLE there.  We must also flag journal
> > "metadata" bios for stable writeout if data=journal, since file data is written
> > through the journal.  Finally, the MS_SNAP_STABLE mount flag (only used by
> > ext3) is now superfluous, so get rid of it.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > [darrick.wong@oracle.com: Fold in a couple of small cleanups from akpm]
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > ---
> >  fs/buffer.c                 |    9 ++++++++-
> >  fs/ext3/super.c             |    3 ++-
> >  fs/jbd/commit.c             |   28 +++++++++++++++++++++++++---
> >  include/linux/blk_types.h   |    3 ++-
> >  include/linux/buffer_head.h |    1 +
> >  include/linux/jbd.h         |    1 +
> >  include/uapi/linux/fs.h     |    1 -
> >  mm/bounce.c                 |   21 +--------------------
> >  mm/page-writeback.c         |    4 ----
> >  9 files changed, 40 insertions(+), 31 deletions(-)
> > 
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index b4dcb34..71578d6 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
> >  	}
> >  }
> >  
> > -int submit_bh(int rw, struct buffer_head * bh)
> > +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
> >  {
> >  	struct bio *bio;
> >  	int ret = 0;
> > @@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
> >  
> >  	bio->bi_end_io = end_bio_bh_io_sync;
> >  	bio->bi_private = bh;
> > +	bio->bi_flags |= bio_flags;
> >  
> >  	/* Take care of bh's that straddle the end of the device */
> >  	guard_bh_eod(rw, bio, bh);
> > @@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
> >  	bio_put(bio);
> >  	return ret;
> >  }
> > +EXPORT_SYMBOL_GPL(_submit_bh);
> > +
> > +int submit_bh(int rw, struct buffer_head *bh)
> > +{
> > +	return _submit_bh(rw, bh, 0);
> > +}
> >  EXPORT_SYMBOL(submit_bh);
> >  
> >  /**
> > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > index 1d6e2ed..e845b6de 100644
> > --- a/fs/ext3/super.c
> > +++ b/fs/ext3/super.c
> > @@ -2063,11 +2063,12 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
> >  		ext3_mark_recovery_complete(sb, es);
> >  		ext3_msg(sb, KERN_INFO, "recovery complete");
> >  	}
> > +	if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA)
> > +		EXT3_SB(sb)->s_journal->j_flags |= JFS_JOURNALS_DATA;
>   Sadly this isn't enough. You can have inodes which journal data (there's
> an inode flag for this) in data=ordered mode. So what you have to do is to

Arrrgh, I forgot that you can do that per-inode. :(

> flag journal_heads (or buffer_heads) as containing journalled data. Or you
> can actually use PageChecked flag for this (it is going to be set on all
> write-enabled pages with journalled data). But it definitely requires also
> some playing with ->page_mkwrite() (calling ext3_journal_get_write_access()
> from there) and generally I'd rather postpone that to a separate commit. So
> just keep it simple and always set the bio flag as you did in the previous
> version. I'll write an optimization (mostly because ext4 needs it as well)
> and send it to you for testing.

Yeah, this is getting a bit complicated for a single patch.

--D
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread
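
A rough sketch of the kind of per-buffer test the deferred optimisation might
use in kjournald, building on the observation that PageChecked is set on
write-enabled pages carrying journalled data -- hypothetical only, not part
of any posted patch:

	/*
	 * Hypothetical sketch, NOT from the posted patches: decide per
	 * buffer whether its page can still be dirtied behind kjournald's
	 * back, and only then ask the block layer to snapshot it.
	 */
	unsigned long bio_flags = 0;

	if (bh->b_page && PageChecked(bh->b_page))	/* journalled file data */
		bio_flags = 1 << BIO_SNAP_STABLE;

	bh->b_end_io = journal_end_buffer_io_sync;
	_submit_bh(write_op, bh, bio_flags);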


* [PATCH v3] mm: Make snapshotting pages for stable writes a per-bio operation
  2013-03-18 17:41                   ` Jan Kara
@ 2013-03-18 23:02                     ` Darrick J. Wong
  -1 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-18 23:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel

Walking a bio's page mappings has proved problematic, so create a new bio flag
to indicate that a bio's data needs to be snapshotted in order to guarantee
stable pages during writeback.  Next, for the one user (ext3/jbd) of
snapshotting, hook all the places where writes can be initiated without
PG_writeback set, and set BIO_SNAP_STABLE there.  We must also flag journal
"metadata" bios for stable writeout, since file data can be written through the
journal.  Finally, the MS_SNAP_STABLE mount flag (only used by ext3) is now
superfluous, so get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

[darrick.wong@oracle.com: Fold in a couple of small cleanups from akpm]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 fs/buffer.c                 |    9 ++++++++-
 fs/ext3/super.c             |    1 -
 fs/jbd/commit.c             |   25 ++++++++++++++++++++++---
 include/linux/blk_types.h   |    3 ++-
 include/linux/buffer_head.h |    1 +
 include/uapi/linux/fs.h     |    1 -
 mm/bounce.c                 |   21 +--------------------
 mm/page-writeback.c         |    4 ----
 8 files changed, 34 insertions(+), 31 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b4dcb34..71578d6 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
 	}
 }
 
-int submit_bh(int rw, struct buffer_head * bh)
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
 
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
+	bio->bi_flags |= bio_flags;
 
 	/* Take care of bh's that straddle the end of the device */
 	guard_bh_eod(rw, bio, bh);
@@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
 	bio_put(bio);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(_submit_bh);
+
+int submit_bh(int rw, struct buffer_head *bh)
+{
+	return _submit_bh(rw, bh, 0);
+}
 EXPORT_SYMBOL(submit_bh);
 
 /**
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index fb5120a..3dc48cc 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2067,7 +2067,6 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
 		"writeback");
-	sb->s_flags |= MS_SNAP_STABLE;
 
 	return 0;
 
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 86b39b1..11bb11f 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -162,8 +162,17 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
 
 	for (i = 0; i < bufs; i++) {
 		wbuf[i]->b_end_io = end_buffer_write_sync;
-		/* We use-up our safety reference in submit_bh() */
-		submit_bh(write_op, wbuf[i]);
+		/*
+		 * Here we write back pagecache data that may be mmaped. Since
+		 * we cannot afford to clean the page and set PageWriteback
+		 * here due to lock ordering (page lock ranks above transaction
+		 * start), the data can change while IO is in flight. Tell the
+		 * block layer it should bounce the bio pages if stable data
+		 * during write is required.
+		 *
+		 * We use up our safety reference in submit_bh().
+		 */
+		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
 	}
 }
 
@@ -667,7 +676,17 @@ start_journal_io:
 				clear_buffer_dirty(bh);
 				set_buffer_uptodate(bh);
 				bh->b_end_io = journal_end_buffer_io_sync;
-				submit_bh(write_op, bh);
+				/*
+				 * In data=journal mode, here we can end up
+				 * writing pagecache data that might be
+				 * mmapped. Since we can't afford to clean the
+				 * page and set PageWriteback (see the comment
+				 * near the other use of _submit_bh()), the
+				 * data can change while the write is in
+				 * flight.  Tell the block layer to bounce the
+				 * bio pages if stable pages are required.
+				 */
+				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
 			}
 			cond_resched();
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cdf1119..22990cf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -111,12 +111,13 @@ struct bio {
 #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
 #define BIO_QUIET	10	/* Make BIO Quiet */
 #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
 
 /*
  * Flags starting here get preserved by bio_reset() - this includes
  * BIO_POOL_IDX()
  */
-#define BIO_RESET_BITS	12
+#define BIO_RESET_BITS	13
 
 #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
 
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 5afc4f9..4c16c4a 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
 int sync_dirty_buffer(struct buffer_head *bh);
 int __sync_dirty_buffer(struct buffer_head *bh, int rw);
 void write_dirty_buffer(struct buffer_head *bh, int rw);
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags);
 int submit_bh(int, struct buffer_head *);
 void write_boundary_block(struct block_device *bdev,
 			sector_t bblock, unsigned blocksize);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index c7fc1e6..a4ed56c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -88,7 +88,6 @@ struct inodes_stat_t {
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
 
 /* These sb flags are internal to the kernel */
-#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
 #define MS_NOSEC	(1<<28)
 #define MS_BORN		(1<<29)
 #define MS_ACTIVE	(1<<30)
diff --git a/mm/bounce.c b/mm/bounce.c
index 5f89017..a5c2ec3 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
 #ifdef CONFIG_NEED_BOUNCE_POOL
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
 {
-	struct page *page;
-	struct backing_dev_info *bdi;
-	struct address_space *mapping;
-	struct bio_vec *from;
-	int i;
-
 	if (bio_data_dir(bio) != WRITE)
 		return 0;
 
 	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
 		return 0;
 
-	/*
-	 * Based on the first page that has a valid mapping, decide whether or
-	 * not we have to employ bounce buffering to guarantee stable pages.
-	 */
-	bio_for_each_segment(from, bio, i) {
-		page = from->bv_page;
-		mapping = page_mapping(page);
-		if (!mapping)
-			continue;
-		bdi = mapping->backing_dev_info;
-		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
-	}
-
-	return 0;
+	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
 }
 #else
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index efe6814..4514ad7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
 
 	if (!bdi_cap_stable_pages_required(bdi))
 		return;
-#ifdef CONFIG_NEED_BOUNCE_POOL
-	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
-		return;
-#endif /* CONFIG_NEED_BOUNCE_POOL */
 
 	wait_on_page_writeback(page);
 }

^ permalink raw reply related	[flat|nested] 74+ messages in thread
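
For callers outside jbd the new interface is used the same way; a minimal
sketch, assuming the v3 _submit_bh() above, of submitting a buffer whose page
is not protected by PG_writeback:

	/*
	 * Illustrative only, assuming the v3 interface above: submit a
	 * buffer whose page may still be redirtied (no PG_writeback
	 * protection), asking the block layer to snapshot it if the
	 * backing device requires stable pages.
	 */
	lock_buffer(bh);
	if (test_clear_buffer_dirty(bh)) {
		get_bh(bh);		/* dropped by end_buffer_write_sync() */
		bh->b_end_io = end_buffer_write_sync;
		_submit_bh(WRITE, bh, 1 << BIO_SNAP_STABLE);
	} else {
		unlock_buffer(bh);
	}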

* [PATCH v3] mm: Make snapshotting pages for stable writes a per-bio operation
@ 2013-03-18 23:02                     ` Darrick J. Wong
  0 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-18 23:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel

Walking a bio's page mappings has proved problematic, so create a new bio flag
to indicate that a bio's data needs to be snapshotted in order to guarantee
stable pages during writeback.  Next, for the one user (ext3/jbd) of
snapshotting, hook all the places where writes can be initiated without
PG_writeback set, and set BIO_SNAP_STABLE there.  We must also flag journal
"metadata" bios for stable writeout, since file data can be written through the
journal.  Finally, the MS_SNAP_STABLE mount flag (only used by ext3) is now
superfluous, so get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

[darrick.wong@oracle.com: Fold in a couple of small cleanups from akpm]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 fs/buffer.c                 |    9 ++++++++-
 fs/ext3/super.c             |    1 -
 fs/jbd/commit.c             |   25 ++++++++++++++++++++++---
 include/linux/blk_types.h   |    3 ++-
 include/linux/buffer_head.h |    1 +
 include/uapi/linux/fs.h     |    1 -
 mm/bounce.c                 |   21 +--------------------
 mm/page-writeback.c         |    4 ----
 8 files changed, 34 insertions(+), 31 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b4dcb34..71578d6 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
 	}
 }
 
-int submit_bh(int rw, struct buffer_head * bh)
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
 
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
+	bio->bi_flags |= bio_flags;
 
 	/* Take care of bh's that straddle the end of the device */
 	guard_bh_eod(rw, bio, bh);
@@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
 	bio_put(bio);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(_submit_bh);
+
+int submit_bh(int rw, struct buffer_head *bh)
+{
+	return _submit_bh(rw, bh, 0);
+}
 EXPORT_SYMBOL(submit_bh);
 
 /**
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index fb5120a..3dc48cc 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2067,7 +2067,6 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
 		"writeback");
-	sb->s_flags |= MS_SNAP_STABLE;
 
 	return 0;
 
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 86b39b1..11bb11f 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -162,8 +162,17 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
 
 	for (i = 0; i < bufs; i++) {
 		wbuf[i]->b_end_io = end_buffer_write_sync;
-		/* We use-up our safety reference in submit_bh() */
-		submit_bh(write_op, wbuf[i]);
+		/*
+		 * Here we write back pagecache data that may be mmaped. Since
+		 * we cannot afford to clean the page and set PageWriteback
+		 * here due to lock ordering (page lock ranks above transaction
+		 * start), the data can change while IO is in flight. Tell the
+		 * block layer it should bounce the bio pages if stable data
+		 * during write is required.
+		 *
+		 * We use up our safety reference in submit_bh().
+		 */
+		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
 	}
 }
 
@@ -667,7 +676,17 @@ start_journal_io:
 				clear_buffer_dirty(bh);
 				set_buffer_uptodate(bh);
 				bh->b_end_io = journal_end_buffer_io_sync;
-				submit_bh(write_op, bh);
+				/*
+				 * In data=journal mode, here we can end up
+				 * writing pagecache data that might be
+				 * mmapped. Since we can't afford to clean the
+				 * page and set PageWriteback (see the comment
+				 * near the other use of _submit_bh()), the
+				 * data can change while the write is in
+				 * flight.  Tell the block layer to bounce the
+				 * bio pages if stable pages are required.
+				 */
+				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
 			}
 			cond_resched();
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cdf1119..22990cf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -111,12 +111,13 @@ struct bio {
 #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
 #define BIO_QUIET	10	/* Make BIO Quiet */
 #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
 
 /*
  * Flags starting here get preserved by bio_reset() - this includes
  * BIO_POOL_IDX()
  */
-#define BIO_RESET_BITS	12
+#define BIO_RESET_BITS	13
 
 #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
 
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 5afc4f9..4c16c4a 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
 int sync_dirty_buffer(struct buffer_head *bh);
 int __sync_dirty_buffer(struct buffer_head *bh, int rw);
 void write_dirty_buffer(struct buffer_head *bh, int rw);
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags);
 int submit_bh(int, struct buffer_head *);
 void write_boundary_block(struct block_device *bdev,
 			sector_t bblock, unsigned blocksize);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index c7fc1e6..a4ed56c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -88,7 +88,6 @@ struct inodes_stat_t {
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
 
 /* These sb flags are internal to the kernel */
-#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
 #define MS_NOSEC	(1<<28)
 #define MS_BORN		(1<<29)
 #define MS_ACTIVE	(1<<30)
diff --git a/mm/bounce.c b/mm/bounce.c
index 5f89017..a5c2ec3 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
 #ifdef CONFIG_NEED_BOUNCE_POOL
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
 {
-	struct page *page;
-	struct backing_dev_info *bdi;
-	struct address_space *mapping;
-	struct bio_vec *from;
-	int i;
-
 	if (bio_data_dir(bio) != WRITE)
 		return 0;
 
 	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
 		return 0;
 
-	/*
-	 * Based on the first page that has a valid mapping, decide whether or
-	 * not we have to employ bounce buffering to guarantee stable pages.
-	 */
-	bio_for_each_segment(from, bio, i) {
-		page = from->bv_page;
-		mapping = page_mapping(page);
-		if (!mapping)
-			continue;
-		bdi = mapping->backing_dev_info;
-		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
-	}
-
-	return 0;
+	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
 }
 #else
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index efe6814..4514ad7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
 
 	if (!bdi_cap_stable_pages_required(bdi))
 		return;
-#ifdef CONFIG_NEED_BOUNCE_POOL
-	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
-		return;
-#endif /* CONFIG_NEED_BOUNCE_POOL */
 
 	wait_on_page_writeback(page);
 }
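
A stand-alone illustration, not part of the posted patch: per the comment in
the blk_types.h hunk above, only flag bits at or above BIO_RESET_BITS survive a
bio reset, so bumping BIO_RESET_BITS from 12 to 13 leaves the new
BIO_SNAP_STABLE bit (12) in the cleared range along with the other
per-submission flags.  The sketch below models that with stand-in names (plain
userspace C, not the kernel's struct bio or bio_reset()):

	/*
	 * Illustrative only: stand-in names, not kernel code.  A reset that
	 * preserves only bits at or above BIO_RESET_BITS clears the new
	 * BIO_SNAP_STABLE bit (12) like the other per-submission flags.
	 */
	#include <stdio.h>

	#define BIO_SNAP_STABLE	12
	#define BIO_RESET_BITS	13

	static unsigned long model_reset(unsigned long bi_flags)
	{
		/* keep only the bits described as preserved across a reset */
		return bi_flags & ~((1UL << BIO_RESET_BITS) - 1);
	}

	int main(void)
	{
		unsigned long bi_flags = 1UL << BIO_SNAP_STABLE;

		printf("before reset: %#lx\n", bi_flags);		/* 0x1000 */
		printf("after  reset: %#lx\n", model_reset(bi_flags));	/* 0 */
		return 0;
	}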


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v3] mm: Make snapshotting pages for stable writes a per-bio operation
@ 2013-03-18 23:02                     ` Darrick J. Wong
  0 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-03-18 23:02 UTC (permalink / raw)
  To: linux-arm-kernel

Walking a bio's page mappings has proved problematic, so create a new bio flag
to indicate that a bio's data needs to be snapshotted in order to guarantee
stable pages during writeback.  Next, for the one user (ext3/jbd) of
snapshotting, hook all the places where writes can be initiated without
PG_writeback set, and set BIO_SNAP_STABLE there.  We must also flag journal
"metadata" bios for stable writeout, since file data can be written through the
journal.  Finally, the MS_SNAP_STABLE mount flag (only used by ext3) is now
superfluous, so get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

[darrick.wong@oracle.com: Fold in a couple of small cleanups from akpm]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 fs/buffer.c                 |    9 ++++++++-
 fs/ext3/super.c             |    1 -
 fs/jbd/commit.c             |   25 ++++++++++++++++++++++---
 include/linux/blk_types.h   |    3 ++-
 include/linux/buffer_head.h |    1 +
 include/uapi/linux/fs.h     |    1 -
 mm/bounce.c                 |   21 +--------------------
 mm/page-writeback.c         |    4 ----
 8 files changed, 34 insertions(+), 31 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b4dcb34..71578d6 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
 	}
 }
 
-int submit_bh(int rw, struct buffer_head * bh)
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
 
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
+	bio->bi_flags |= bio_flags;
 
 	/* Take care of bh's that straddle the end of the device */
 	guard_bh_eod(rw, bio, bh);
@@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
 	bio_put(bio);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(_submit_bh);
+
+int submit_bh(int rw, struct buffer_head *bh)
+{
+	return _submit_bh(rw, bh, 0);
+}
 EXPORT_SYMBOL(submit_bh);
 
 /**
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index fb5120a..3dc48cc 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2067,7 +2067,6 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
 		"writeback");
-	sb->s_flags |= MS_SNAP_STABLE;
 
 	return 0;
 
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 86b39b1..11bb11f 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -162,8 +162,17 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
 
 	for (i = 0; i < bufs; i++) {
 		wbuf[i]->b_end_io = end_buffer_write_sync;
-		/* We use-up our safety reference in submit_bh() */
-		submit_bh(write_op, wbuf[i]);
+		/*
+		 * Here we write back pagecache data that may be mmaped. Since
+		 * we cannot afford to clean the page and set PageWriteback
+		 * here due to lock ordering (page lock ranks above transaction
+		 * start), the data can change while IO is in flight. Tell the
+		 * block layer it should bounce the bio pages if stable data
+		 * during write is required.
+		 *
+		 * We use up our safety reference in submit_bh().
+		 */
+		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
 	}
 }
 
@@ -667,7 +676,17 @@ start_journal_io:
 				clear_buffer_dirty(bh);
 				set_buffer_uptodate(bh);
 				bh->b_end_io = journal_end_buffer_io_sync;
-				submit_bh(write_op, bh);
+				/*
+				 * In data=journal mode, here we can end up
+				 * writing pagecache data that might be
+				 * mmapped. Since we can't afford to clean the
+				 * page and set PageWriteback (see the comment
+				 * near the other use of _submit_bh()), the
+				 * data can change while the write is in
+				 * flight.  Tell the block layer to bounce the
+				 * bio pages if stable pages are required.
+				 */
+				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
 			}
 			cond_resched();
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cdf1119..22990cf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -111,12 +111,13 @@ struct bio {
 #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
 #define BIO_QUIET	10	/* Make BIO Quiet */
 #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
 
 /*
  * Flags starting here get preserved by bio_reset() - this includes
  * BIO_POOL_IDX()
  */
-#define BIO_RESET_BITS	12
+#define BIO_RESET_BITS	13
 
 #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
 
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 5afc4f9..4c16c4a 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
 int sync_dirty_buffer(struct buffer_head *bh);
 int __sync_dirty_buffer(struct buffer_head *bh, int rw);
 void write_dirty_buffer(struct buffer_head *bh, int rw);
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags);
 int submit_bh(int, struct buffer_head *);
 void write_boundary_block(struct block_device *bdev,
 			sector_t bblock, unsigned blocksize);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index c7fc1e6..a4ed56c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -88,7 +88,6 @@ struct inodes_stat_t {
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
 
 /* These sb flags are internal to the kernel */
-#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
 #define MS_NOSEC	(1<<28)
 #define MS_BORN		(1<<29)
 #define MS_ACTIVE	(1<<30)
diff --git a/mm/bounce.c b/mm/bounce.c
index 5f89017..a5c2ec3 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
 #ifdef CONFIG_NEED_BOUNCE_POOL
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
 {
-	struct page *page;
-	struct backing_dev_info *bdi;
-	struct address_space *mapping;
-	struct bio_vec *from;
-	int i;
-
 	if (bio_data_dir(bio) != WRITE)
 		return 0;
 
 	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
 		return 0;
 
-	/*
-	 * Based on the first page that has a valid mapping, decide whether or
-	 * not we have to employ bounce buffering to guarantee stable pages.
-	 */
-	bio_for_each_segment(from, bio, i) {
-		page = from->bv_page;
-		mapping = page_mapping(page);
-		if (!mapping)
-			continue;
-		bdi = mapping->backing_dev_info;
-		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
-	}
-
-	return 0;
+	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
 }
 #else
 static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index efe6814..4514ad7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
 
 	if (!bdi_cap_stable_pages_required(bdi))
 		return;
-#ifdef CONFIG_NEED_BOUNCE_POOL
-	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
-		return;
-#endif /* CONFIG_NEED_BOUNCE_POOL */
 
 	wait_on_page_writeback(page);
 }
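
A stand-alone illustration, not part of the posted patch: with the rework
above, whether a WRITE gets bounced no longer depends on walking the bio's page
mappings; it depends only on the backing device requiring stable pages and on
the submitter having set BIO_SNAP_STABLE (which jbd now does via
_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE)).  The sketch below models that
decision with stand-in types (plain userspace C, not the kernel's):

	/*
	 * Illustrative model of must_snapshot_stable_pages() after the patch:
	 * WRITE direction + stable pages required + BIO_SNAP_STABLE set.
	 * Stand-in types only, not the kernel's struct bio.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	#define BIO_SNAP_STABLE	12

	enum model_dir { MODEL_READ, MODEL_WRITE };

	struct model_bio {
		enum model_dir	dir;
		unsigned long	bi_flags;
	};

	static bool model_must_snapshot(const struct model_bio *bio,
					bool stable_required)
	{
		if (bio->dir != MODEL_WRITE)
			return false;
		if (!stable_required)
			return false;
		return (bio->bi_flags >> BIO_SNAP_STABLE) & 1UL;
	}

	int main(void)
	{
		/* a journal write flagged for snapshotting by the submitter */
		struct model_bio journal_write = {
			.dir		= MODEL_WRITE,
			.bi_flags	= 1UL << BIO_SNAP_STABLE,
		};
		/* an ordinary write that never asked for snapshotting */
		struct model_bio plain_write = { .dir = MODEL_WRITE, .bi_flags = 0 };

		printf("journal write bounced: %d\n",
		       model_must_snapshot(&journal_write, true));
		printf("plain write bounced:   %d\n",
		       model_must_snapshot(&plain_write, true));
		return 0;
	}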

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v3] mm: Make snapshotting pages for stable writes a per-bio operation
  2013-03-18 23:02                     ` Darrick J. Wong
  (?)
@ 2013-03-19  8:54                       ` Jan Kara
  -1 siblings, 0 replies; 74+ messages in thread
From: Jan Kara @ 2013-03-19  8:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, Andrew Morton, Shuge, linux-kernel, linux-mm,
	linux-ext4, Kevin, Theodore Ts'o, Jens Axboe,
	Catalin Marinas, Will Deacon, linux-arm-kernel

On Mon 18-03-13 16:02:59, Darrick J. Wong wrote:
> Walking a bio's page mappings has proved problematic, so create a new bio flag
> to indicate that a bio's data needs to be snapshotted in order to guarantee
> stable pages during writeback.  Next, for the one user (ext3/jbd) of
> snapshotting, hook all the places where writes can be initiated without
> PG_writeback set, and set BIO_SNAP_STABLE there.  We must also flag journal
> "metadata" bios for stable writeout, since file data can be written through the
> journal.  Finally, the MS_SNAP_STABLE mount flag (only used by ext3) is now
> superfluous, so get rid of it.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> [darrick.wong@oracle.com: Fold in a couple of small cleanups from akpm]
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  OK, now I'm happy with the patch :) You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/buffer.c                 |    9 ++++++++-
>  fs/ext3/super.c             |    1 -
>  fs/jbd/commit.c             |   25 ++++++++++++++++++++++---
>  include/linux/blk_types.h   |    3 ++-
>  include/linux/buffer_head.h |    1 +
>  include/uapi/linux/fs.h     |    1 -
>  mm/bounce.c                 |   21 +--------------------
>  mm/page-writeback.c         |    4 ----
>  8 files changed, 34 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b4dcb34..71578d6 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
>  	}
>  }
>  
> -int submit_bh(int rw, struct buffer_head * bh)
> +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
>  {
>  	struct bio *bio;
>  	int ret = 0;
> @@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
>  
>  	bio->bi_end_io = end_bio_bh_io_sync;
>  	bio->bi_private = bh;
> +	bio->bi_flags |= bio_flags;
>  
>  	/* Take care of bh's that straddle the end of the device */
>  	guard_bh_eod(rw, bio, bh);
> @@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
>  	bio_put(bio);
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(_submit_bh);
> +
> +int submit_bh(int rw, struct buffer_head *bh)
> +{
> +	return _submit_bh(rw, bh, 0);
> +}
>  EXPORT_SYMBOL(submit_bh);
>  
>  /**
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index fb5120a..3dc48cc 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2067,7 +2067,6 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
>  		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
>  		"writeback");
> -	sb->s_flags |= MS_SNAP_STABLE;
>  
>  	return 0;
>  
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index 86b39b1..11bb11f 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -162,8 +162,17 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
>  
>  	for (i = 0; i < bufs; i++) {
>  		wbuf[i]->b_end_io = end_buffer_write_sync;
> -		/* We use-up our safety reference in submit_bh() */
> -		submit_bh(write_op, wbuf[i]);
> +		/*
> +		 * Here we write back pagecache data that may be mmaped. Since
> +		 * we cannot afford to clean the page and set PageWriteback
> +		 * here due to lock ordering (page lock ranks above transaction
> +		 * start), the data can change while IO is in flight. Tell the
> +		 * block layer it should bounce the bio pages if stable data
> +		 * during write is required.
> +		 *
> +		 * We use up our safety reference in submit_bh().
> +		 */
> +		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
>  	}
>  }
>  
> @@ -667,7 +676,17 @@ start_journal_io:
>  				clear_buffer_dirty(bh);
>  				set_buffer_uptodate(bh);
>  				bh->b_end_io = journal_end_buffer_io_sync;
> -				submit_bh(write_op, bh);
> +				/*
> +				 * In data=journal mode, here we can end up
> +				 * writing pagecache data that might be
> +				 * mmapped. Since we can't afford to clean the
> +				 * page and set PageWriteback (see the comment
> +				 * near the other use of _submit_bh()), the
> +				 * data can change while the write is in
> +				 * flight.  Tell the block layer to bounce the
> +				 * bio pages if stable pages are required.
> +				 */
> +				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
>  			}
>  			cond_resched();
>  
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index cdf1119..22990cf 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -111,12 +111,13 @@ struct bio {
>  #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
>  #define BIO_QUIET	10	/* Make BIO Quiet */
>  #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
> +#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
>  
>  /*
>   * Flags starting here get preserved by bio_reset() - this includes
>   * BIO_POOL_IDX()
>   */
> -#define BIO_RESET_BITS	12
> +#define BIO_RESET_BITS	13
>  
>  #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
>  
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 5afc4f9..4c16c4a 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
>  int sync_dirty_buffer(struct buffer_head *bh);
>  int __sync_dirty_buffer(struct buffer_head *bh, int rw);
>  void write_dirty_buffer(struct buffer_head *bh, int rw);
> +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags);
>  int submit_bh(int, struct buffer_head *);
>  void write_boundary_block(struct block_device *bdev,
>  			sector_t bblock, unsigned blocksize);
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index c7fc1e6..a4ed56c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -88,7 +88,6 @@ struct inodes_stat_t {
>  #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
>  
>  /* These sb flags are internal to the kernel */
> -#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
>  #define MS_NOSEC	(1<<28)
>  #define MS_BORN		(1<<29)
>  #define MS_ACTIVE	(1<<30)
> diff --git a/mm/bounce.c b/mm/bounce.c
> index 5f89017..a5c2ec3 100644
> --- a/mm/bounce.c
> +++ b/mm/bounce.c
> @@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
>  #ifdef CONFIG_NEED_BOUNCE_POOL
>  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
>  {
> -	struct page *page;
> -	struct backing_dev_info *bdi;
> -	struct address_space *mapping;
> -	struct bio_vec *from;
> -	int i;
> -
>  	if (bio_data_dir(bio) != WRITE)
>  		return 0;
>  
>  	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
>  		return 0;
>  
> -	/*
> -	 * Based on the first page that has a valid mapping, decide whether or
> -	 * not we have to employ bounce buffering to guarantee stable pages.
> -	 */
> -	bio_for_each_segment(from, bio, i) {
> -		page = from->bv_page;
> -		mapping = page_mapping(page);
> -		if (!mapping)
> -			continue;
> -		bdi = mapping->backing_dev_info;
> -		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
> -	}
> -
> -	return 0;
> +	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
>  }
>  #else
>  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index efe6814..4514ad7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
>  
>  	if (!bdi_cap_stable_pages_required(bdi))
>  		return;
> -#ifdef CONFIG_NEED_BOUNCE_POOL
> -	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
> -		return;
> -#endif /* CONFIG_NEED_BOUNCE_POOL */
>  
>  	wait_on_page_writeback(page);
>  }
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3] mm: Make snapshotting pages for stable writes a per-bio operation
@ 2013-03-19  8:54                       ` Jan Kara
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Kara @ 2013-03-19  8:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, Andrew Morton, Shuge, linux-kernel, linux-mm,
	linux-ext4, Kevin, Theodore Ts'o, Jens Axboe,
	Catalin Marinas, Will Deacon, linux-arm-kernel

On Mon 18-03-13 16:02:59, Darrick J. Wong wrote:
> Walking a bio's page mappings has proved problematic, so create a new bio flag
> to indicate that a bio's data needs to be snapshotted in order to guarantee
> stable pages during writeback.  Next, for the one user (ext3/jbd) of
> snapshotting, hook all the places where writes can be initiated without
> PG_writeback set, and set BIO_SNAP_STABLE there.  We must also flag journal
> "metadata" bios for stable writeout, since file data can be written through the
> journal.  Finally, the MS_SNAP_STABLE mount flag (only used by ext3) is now
> superfluous, so get rid of it.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> [darrick.wong@oracle.com: Fold in a couple of small cleanups from akpm]
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  OK, now I'm happy with the patch :) You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/buffer.c                 |    9 ++++++++-
>  fs/ext3/super.c             |    1 -
>  fs/jbd/commit.c             |   25 ++++++++++++++++++++++---
>  include/linux/blk_types.h   |    3 ++-
>  include/linux/buffer_head.h |    1 +
>  include/uapi/linux/fs.h     |    1 -
>  mm/bounce.c                 |   21 +--------------------
>  mm/page-writeback.c         |    4 ----
>  8 files changed, 34 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b4dcb34..71578d6 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
>  	}
>  }
>  
> -int submit_bh(int rw, struct buffer_head * bh)
> +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
>  {
>  	struct bio *bio;
>  	int ret = 0;
> @@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
>  
>  	bio->bi_end_io = end_bio_bh_io_sync;
>  	bio->bi_private = bh;
> +	bio->bi_flags |= bio_flags;
>  
>  	/* Take care of bh's that straddle the end of the device */
>  	guard_bh_eod(rw, bio, bh);
> @@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
>  	bio_put(bio);
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(_submit_bh);
> +
> +int submit_bh(int rw, struct buffer_head *bh)
> +{
> +	return _submit_bh(rw, bh, 0);
> +}
>  EXPORT_SYMBOL(submit_bh);
>  
>  /**
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index fb5120a..3dc48cc 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2067,7 +2067,6 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
>  		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
>  		"writeback");
> -	sb->s_flags |= MS_SNAP_STABLE;
>  
>  	return 0;
>  
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index 86b39b1..11bb11f 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -162,8 +162,17 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
>  
>  	for (i = 0; i < bufs; i++) {
>  		wbuf[i]->b_end_io = end_buffer_write_sync;
> -		/* We use-up our safety reference in submit_bh() */
> -		submit_bh(write_op, wbuf[i]);
> +		/*
> +		 * Here we write back pagecache data that may be mmaped. Since
> +		 * we cannot afford to clean the page and set PageWriteback
> +		 * here due to lock ordering (page lock ranks above transaction
> +		 * start), the data can change while IO is in flight. Tell the
> +		 * block layer it should bounce the bio pages if stable data
> +		 * during write is required.
> +		 *
> +		 * We use up our safety reference in submit_bh().
> +		 */
> +		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
>  	}
>  }
>  
> @@ -667,7 +676,17 @@ start_journal_io:
>  				clear_buffer_dirty(bh);
>  				set_buffer_uptodate(bh);
>  				bh->b_end_io = journal_end_buffer_io_sync;
> -				submit_bh(write_op, bh);
> +				/*
> +				 * In data=journal mode, here we can end up
> +				 * writing pagecache data that might be
> +				 * mmapped. Since we can't afford to clean the
> +				 * page and set PageWriteback (see the comment
> +				 * near the other use of _submit_bh()), the
> +				 * data can change while the write is in
> +				 * flight.  Tell the block layer to bounce the
> +				 * bio pages if stable pages are required.
> +				 */
> +				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
>  			}
>  			cond_resched();
>  
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index cdf1119..22990cf 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -111,12 +111,13 @@ struct bio {
>  #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
>  #define BIO_QUIET	10	/* Make BIO Quiet */
>  #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
> +#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
>  
>  /*
>   * Flags starting here get preserved by bio_reset() - this includes
>   * BIO_POOL_IDX()
>   */
> -#define BIO_RESET_BITS	12
> +#define BIO_RESET_BITS	13
>  
>  #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
>  
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 5afc4f9..4c16c4a 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
>  int sync_dirty_buffer(struct buffer_head *bh);
>  int __sync_dirty_buffer(struct buffer_head *bh, int rw);
>  void write_dirty_buffer(struct buffer_head *bh, int rw);
> +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags);
>  int submit_bh(int, struct buffer_head *);
>  void write_boundary_block(struct block_device *bdev,
>  			sector_t bblock, unsigned blocksize);
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index c7fc1e6..a4ed56c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -88,7 +88,6 @@ struct inodes_stat_t {
>  #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
>  
>  /* These sb flags are internal to the kernel */
> -#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
>  #define MS_NOSEC	(1<<28)
>  #define MS_BORN		(1<<29)
>  #define MS_ACTIVE	(1<<30)
> diff --git a/mm/bounce.c b/mm/bounce.c
> index 5f89017..a5c2ec3 100644
> --- a/mm/bounce.c
> +++ b/mm/bounce.c
> @@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
>  #ifdef CONFIG_NEED_BOUNCE_POOL
>  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
>  {
> -	struct page *page;
> -	struct backing_dev_info *bdi;
> -	struct address_space *mapping;
> -	struct bio_vec *from;
> -	int i;
> -
>  	if (bio_data_dir(bio) != WRITE)
>  		return 0;
>  
>  	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
>  		return 0;
>  
> -	/*
> -	 * Based on the first page that has a valid mapping, decide whether or
> -	 * not we have to employ bounce buffering to guarantee stable pages.
> -	 */
> -	bio_for_each_segment(from, bio, i) {
> -		page = from->bv_page;
> -		mapping = page_mapping(page);
> -		if (!mapping)
> -			continue;
> -		bdi = mapping->backing_dev_info;
> -		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
> -	}
> -
> -	return 0;
> +	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
>  }
>  #else
>  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index efe6814..4514ad7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
>  
>  	if (!bdi_cap_stable_pages_required(bdi))
>  		return;
> -#ifdef CONFIG_NEED_BOUNCE_POOL
> -	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
> -		return;
> -#endif /* CONFIG_NEED_BOUNCE_POOL */
>  
>  	wait_on_page_writeback(page);
>  }
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v3] mm: Make snapshotting pages for stable writes a per-bio operation
@ 2013-03-19  8:54                       ` Jan Kara
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Kara @ 2013-03-19  8:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon 18-03-13 16:02:59, Darrick J. Wong wrote:
> Walking a bio's page mappings has proved problematic, so create a new bio flag
> to indicate that a bio's data needs to be snapshotted in order to guarantee
> stable pages during writeback.  Next, for the one user (ext3/jbd) of
> snapshotting, hook all the places where writes can be initiated without
> PG_writeback set, and set BIO_SNAP_STABLE there.  We must also flag journal
> "metadata" bios for stable writeout, since file data can be written through the
> journal.  Finally, the MS_SNAP_STABLE mount flag (only used by ext3) is now
> superfluous, so get rid of it.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> [darrick.wong@oracle.com: Fold in a couple of small cleanups from akpm]
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  OK, now I'm happy with the patch :) You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/buffer.c                 |    9 ++++++++-
>  fs/ext3/super.c             |    1 -
>  fs/jbd/commit.c             |   25 ++++++++++++++++++++++---
>  include/linux/blk_types.h   |    3 ++-
>  include/linux/buffer_head.h |    1 +
>  include/uapi/linux/fs.h     |    1 -
>  mm/bounce.c                 |   21 +--------------------
>  mm/page-writeback.c         |    4 ----
>  8 files changed, 34 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b4dcb34..71578d6 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
>  	}
>  }
>  
> -int submit_bh(int rw, struct buffer_head * bh)
> +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
>  {
>  	struct bio *bio;
>  	int ret = 0;
> @@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
>  
>  	bio->bi_end_io = end_bio_bh_io_sync;
>  	bio->bi_private = bh;
> +	bio->bi_flags |= bio_flags;
>  
>  	/* Take care of bh's that straddle the end of the device */
>  	guard_bh_eod(rw, bio, bh);
> @@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
>  	bio_put(bio);
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(_submit_bh);
> +
> +int submit_bh(int rw, struct buffer_head *bh)
> +{
> +	return _submit_bh(rw, bh, 0);
> +}
>  EXPORT_SYMBOL(submit_bh);
>  
>  /**
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index fb5120a..3dc48cc 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2067,7 +2067,6 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
>  		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
>  		"writeback");
> -	sb->s_flags |= MS_SNAP_STABLE;
>  
>  	return 0;
>  
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index 86b39b1..11bb11f 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -162,8 +162,17 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
>  
>  	for (i = 0; i < bufs; i++) {
>  		wbuf[i]->b_end_io = end_buffer_write_sync;
> -		/* We use-up our safety reference in submit_bh() */
> -		submit_bh(write_op, wbuf[i]);
> +		/*
> +		 * Here we write back pagecache data that may be mmaped. Since
> +		 * we cannot afford to clean the page and set PageWriteback
> +		 * here due to lock ordering (page lock ranks above transaction
> +		 * start), the data can change while IO is in flight. Tell the
> +		 * block layer it should bounce the bio pages if stable data
> +		 * during write is required.
> +		 *
> +		 * We use up our safety reference in submit_bh().
> +		 */
> +		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
>  	}
>  }
>  
> @@ -667,7 +676,17 @@ start_journal_io:
>  				clear_buffer_dirty(bh);
>  				set_buffer_uptodate(bh);
>  				bh->b_end_io = journal_end_buffer_io_sync;
> -				submit_bh(write_op, bh);
> +				/*
> +				 * In data=journal mode, here we can end up
> +				 * writing pagecache data that might be
> +				 * mmapped. Since we can't afford to clean the
> +				 * page and set PageWriteback (see the comment
> +				 * near the other use of _submit_bh()), the
> +				 * data can change while the write is in
> +				 * flight.  Tell the block layer to bounce the
> +				 * bio pages if stable pages are required.
> +				 */
> +				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
>  			}
>  			cond_resched();
>  
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index cdf1119..22990cf 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -111,12 +111,13 @@ struct bio {
>  #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
>  #define BIO_QUIET	10	/* Make BIO Quiet */
>  #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
> +#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
>  
>  /*
>   * Flags starting here get preserved by bio_reset() - this includes
>   * BIO_POOL_IDX()
>   */
> -#define BIO_RESET_BITS	12
> +#define BIO_RESET_BITS	13
>  
>  #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
>  
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 5afc4f9..4c16c4a 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
>  int sync_dirty_buffer(struct buffer_head *bh);
>  int __sync_dirty_buffer(struct buffer_head *bh, int rw);
>  void write_dirty_buffer(struct buffer_head *bh, int rw);
> +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags);
>  int submit_bh(int, struct buffer_head *);
>  void write_boundary_block(struct block_device *bdev,
>  			sector_t bblock, unsigned blocksize);
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index c7fc1e6..a4ed56c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -88,7 +88,6 @@ struct inodes_stat_t {
>  #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
>  
>  /* These sb flags are internal to the kernel */
> -#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
>  #define MS_NOSEC	(1<<28)
>  #define MS_BORN		(1<<29)
>  #define MS_ACTIVE	(1<<30)
> diff --git a/mm/bounce.c b/mm/bounce.c
> index 5f89017..a5c2ec3 100644
> --- a/mm/bounce.c
> +++ b/mm/bounce.c
> @@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
>  #ifdef CONFIG_NEED_BOUNCE_POOL
>  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
>  {
> -	struct page *page;
> -	struct backing_dev_info *bdi;
> -	struct address_space *mapping;
> -	struct bio_vec *from;
> -	int i;
> -
>  	if (bio_data_dir(bio) != WRITE)
>  		return 0;
>  
>  	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
>  		return 0;
>  
> -	/*
> -	 * Based on the first page that has a valid mapping, decide whether or
> -	 * not we have to employ bounce buffering to guarantee stable pages.
> -	 */
> -	bio_for_each_segment(from, bio, i) {
> -		page = from->bv_page;
> -		mapping = page_mapping(page);
> -		if (!mapping)
> -			continue;
> -		bdi = mapping->backing_dev_info;
> -		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
> -	}
> -
> -	return 0;
> +	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
>  }
>  #else
>  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index efe6814..4514ad7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
>  
>  	if (!bdi_cap_stable_pages_required(bdi))
>  		return;
> -#ifdef CONFIG_NEED_BOUNCE_POOL
> -	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
> -		return;
> -#endif /* CONFIG_NEED_BOUNCE_POOL */
>  
>  	wait_on_page_writeback(page);
>  }
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3] mm: Make snapshotting pages for stable writes a per-bio operation
  2013-03-18 23:02                     ` Darrick J. Wong
  (?)
  (?)
@ 2013-04-02 17:01                       ` Darrick J. Wong
  -1 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-04-02 17:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Jan Kara, Andrew Morton, Shuge, linux-kernel,
	linux-mm, linux-ext4, Kevin, Theodore Ts'o, Jens Axboe,
	Catalin Marinas, Will Deacon, linux-arm-kernel

Hi,

A couple of weeks have gone by without further comments about this patch.

Are you interested in the minor cleanups and added comments, or is the v2 patch
in -next good enough?

Apparently Mel Gorman's interested in this patchset too.  Mel: most of the
stable pages part 2 work is already upstream for 3.9... except this piece.  Are
you interested in having this piece in 3.9 also?  Or is 3.10 good enough for
everyone?

--D

On Mon, Mar 18, 2013 at 04:02:59PM -0700, Darrick J. Wong wrote:
> Walking a bio's page mappings has proved problematic, so create a new bio flag
> to indicate that a bio's data needs to be snapshotted in order to guarantee
> stable pages during writeback.  Next, for the one user (ext3/jbd) of
> snapshotting, hook all the places where writes can be initiated without
> PG_writeback set, and set BIO_SNAP_STABLE there.  We must also flag journal
> "metadata" bios for stable writeout, since file data can be written through the
> journal.  Finally, the MS_SNAP_STABLE mount flag (only used by ext3) is now
> superfluous, so get rid of it.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> [darrick.wong@oracle.com: Fold in a couple of small cleanups from akpm]
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>  fs/buffer.c                 |    9 ++++++++-
>  fs/ext3/super.c             |    1 -
>  fs/jbd/commit.c             |   25 ++++++++++++++++++++++---
>  include/linux/blk_types.h   |    3 ++-
>  include/linux/buffer_head.h |    1 +
>  include/uapi/linux/fs.h     |    1 -
>  mm/bounce.c                 |   21 +--------------------
>  mm/page-writeback.c         |    4 ----
>  8 files changed, 34 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b4dcb34..71578d6 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
>  	}
>  }
>  
> -int submit_bh(int rw, struct buffer_head * bh)
> +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
>  {
>  	struct bio *bio;
>  	int ret = 0;
> @@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
>  
>  	bio->bi_end_io = end_bio_bh_io_sync;
>  	bio->bi_private = bh;
> +	bio->bi_flags |= bio_flags;
>  
>  	/* Take care of bh's that straddle the end of the device */
>  	guard_bh_eod(rw, bio, bh);
> @@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
>  	bio_put(bio);
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(_submit_bh);
> +
> +int submit_bh(int rw, struct buffer_head *bh)
> +{
> +	return _submit_bh(rw, bh, 0);
> +}
>  EXPORT_SYMBOL(submit_bh);
>  
>  /**
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index fb5120a..3dc48cc 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2067,7 +2067,6 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
>  		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
>  		"writeback");
> -	sb->s_flags |= MS_SNAP_STABLE;
>  
>  	return 0;
>  
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index 86b39b1..11bb11f 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -162,8 +162,17 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
>  
>  	for (i = 0; i < bufs; i++) {
>  		wbuf[i]->b_end_io = end_buffer_write_sync;
> -		/* We use-up our safety reference in submit_bh() */
> -		submit_bh(write_op, wbuf[i]);
> +		/*
> +		 * Here we write back pagecache data that may be mmaped. Since
> +		 * we cannot afford to clean the page and set PageWriteback
> +		 * here due to lock ordering (page lock ranks above transaction
> +		 * start), the data can change while IO is in flight. Tell the
> +		 * block layer it should bounce the bio pages if stable data
> +		 * during write is required.
> +		 *
> +		 * We use up our safety reference in submit_bh().
> +		 */
> +		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
>  	}
>  }
>  
> @@ -667,7 +676,17 @@ start_journal_io:
>  				clear_buffer_dirty(bh);
>  				set_buffer_uptodate(bh);
>  				bh->b_end_io = journal_end_buffer_io_sync;
> -				submit_bh(write_op, bh);
> +				/*
> +				 * In data=journal mode, here we can end up
> +				 * writing pagecache data that might be
> +				 * mmapped. Since we can't afford to clean the
> +				 * page and set PageWriteback (see the comment
> +				 * near the other use of _submit_bh()), the
> +				 * data can change while the write is in
> +				 * flight.  Tell the block layer to bounce the
> +				 * bio pages if stable pages are required.
> +				 */
> +				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
>  			}
>  			cond_resched();
>  
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index cdf1119..22990cf 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -111,12 +111,13 @@ struct bio {
>  #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
>  #define BIO_QUIET	10	/* Make BIO Quiet */
>  #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
> +#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
>  
>  /*
>   * Flags starting here get preserved by bio_reset() - this includes
>   * BIO_POOL_IDX()
>   */
> -#define BIO_RESET_BITS	12
> +#define BIO_RESET_BITS	13
>  
>  #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
>  
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 5afc4f9..4c16c4a 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
>  int sync_dirty_buffer(struct buffer_head *bh);
>  int __sync_dirty_buffer(struct buffer_head *bh, int rw);
>  void write_dirty_buffer(struct buffer_head *bh, int rw);
> +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags);
>  int submit_bh(int, struct buffer_head *);
>  void write_boundary_block(struct block_device *bdev,
>  			sector_t bblock, unsigned blocksize);
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index c7fc1e6..a4ed56c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -88,7 +88,6 @@ struct inodes_stat_t {
>  #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
>  
>  /* These sb flags are internal to the kernel */
> -#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
>  #define MS_NOSEC	(1<<28)
>  #define MS_BORN		(1<<29)
>  #define MS_ACTIVE	(1<<30)
> diff --git a/mm/bounce.c b/mm/bounce.c
> index 5f89017..a5c2ec3 100644
> --- a/mm/bounce.c
> +++ b/mm/bounce.c
> @@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
>  #ifdef CONFIG_NEED_BOUNCE_POOL
>  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
>  {
> -	struct page *page;
> -	struct backing_dev_info *bdi;
> -	struct address_space *mapping;
> -	struct bio_vec *from;
> -	int i;
> -
>  	if (bio_data_dir(bio) != WRITE)
>  		return 0;
>  
>  	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
>  		return 0;
>  
> -	/*
> -	 * Based on the first page that has a valid mapping, decide whether or
> -	 * not we have to employ bounce buffering to guarantee stable pages.
> -	 */
> -	bio_for_each_segment(from, bio, i) {
> -		page = from->bv_page;
> -		mapping = page_mapping(page);
> -		if (!mapping)
> -			continue;
> -		bdi = mapping->backing_dev_info;
> -		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
> -	}
> -
> -	return 0;
> +	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
>  }
>  #else
>  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index efe6814..4514ad7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
>  
>  	if (!bdi_cap_stable_pages_required(bdi))
>  		return;
> -#ifdef CONFIG_NEED_BOUNCE_POOL
> -	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
> -		return;
> -#endif /* CONFIG_NEED_BOUNCE_POOL */
>  
>  	wait_on_page_writeback(page);
>  }

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3] mm: Make snapshotting pages for stable writes a per-bio operation
@ 2013-04-02 17:01                       ` Darrick J. Wong
  0 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-04-02 17:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Jan Kara, Andrew Morton, Shuge, linux-kernel,
	linux-mm, linux-ext4, Kevin, Theodore Ts'o, Jens Axboe,
	Catalin Marinas, Will Deacon, linux-arm-kernel

Hi,

A couple of weeks have gone by without further comments about this patch.

Are you interested in the minor cleanups and added comments, or is the v2 patch
in -next good enough?

Apparently Mel Gorman's interested in this patchset too.  Mel: most of the
stable pages part 2 work is already upstream for 3.9... except this piece.  Are
you interested in having this piece in 3.9 also?  Or is 3.10 good enough for
everyone?

--D

On Mon, Mar 18, 2013 at 04:02:59PM -0700, Darrick J. Wong wrote:
> Walking a bio's page mappings has proved problematic, so create a new bio flag
> to indicate that a bio's data needs to be snapshotted in order to guarantee
> stable pages during writeback.  Next, for the one user (ext3/jbd) of
> snapshotting, hook all the places where writes can be initiated without
> PG_writeback set, and set BIO_SNAP_STABLE there.  We must also flag journal
> "metadata" bios for stable writeout, since file data can be written through the
> journal.  Finally, the MS_SNAP_STABLE mount flag (only used by ext3) is now
> superfluous, so get rid of it.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> [darrick.wong@oracle.com: Fold in a couple of small cleanups from akpm]
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>  fs/buffer.c                 |    9 ++++++++-
>  fs/ext3/super.c             |    1 -
>  fs/jbd/commit.c             |   25 ++++++++++++++++++++++---
>  include/linux/blk_types.h   |    3 ++-
>  include/linux/buffer_head.h |    1 +
>  include/uapi/linux/fs.h     |    1 -
>  mm/bounce.c                 |   21 +--------------------
>  mm/page-writeback.c         |    4 ----
>  8 files changed, 34 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b4dcb34..71578d6 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
>  	}
>  }
>  
> -int submit_bh(int rw, struct buffer_head * bh)
> +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
>  {
>  	struct bio *bio;
>  	int ret = 0;
> @@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
>  
>  	bio->bi_end_io = end_bio_bh_io_sync;
>  	bio->bi_private = bh;
> +	bio->bi_flags |= bio_flags;
>  
>  	/* Take care of bh's that straddle the end of the device */
>  	guard_bh_eod(rw, bio, bh);
> @@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
>  	bio_put(bio);
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(_submit_bh);
> +
> +int submit_bh(int rw, struct buffer_head *bh)
> +{
> +	return _submit_bh(rw, bh, 0);
> +}
>  EXPORT_SYMBOL(submit_bh);
>  
>  /**
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index fb5120a..3dc48cc 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2067,7 +2067,6 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
>  		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
>  		"writeback");
> -	sb->s_flags |= MS_SNAP_STABLE;
>  
>  	return 0;
>  
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index 86b39b1..11bb11f 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -162,8 +162,17 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
>  
>  	for (i = 0; i < bufs; i++) {
>  		wbuf[i]->b_end_io = end_buffer_write_sync;
> -		/* We use-up our safety reference in submit_bh() */
> -		submit_bh(write_op, wbuf[i]);
> +		/*
> +		 * Here we write back pagecache data that may be mmaped. Since
> +		 * we cannot afford to clean the page and set PageWriteback
> +		 * here due to lock ordering (page lock ranks above transaction
> +		 * start), the data can change while IO is in flight. Tell the
> +		 * block layer it should bounce the bio pages if stable data
> +		 * during write is required.
> +		 *
> +		 * We use up our safety reference in submit_bh().
> +		 */
> +		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
>  	}
>  }
>  
> @@ -667,7 +676,17 @@ start_journal_io:
>  				clear_buffer_dirty(bh);
>  				set_buffer_uptodate(bh);
>  				bh->b_end_io = journal_end_buffer_io_sync;
> -				submit_bh(write_op, bh);
> +				/*
> +				 * In data=journal mode, here we can end up
> +				 * writing pagecache data that might be
> +				 * mmapped. Since we can't afford to clean the
> +				 * page and set PageWriteback (see the comment
> +				 * near the other use of _submit_bh()), the
> +				 * data can change while the write is in
> +				 * flight.  Tell the block layer to bounce the
> +				 * bio pages if stable pages are required.
> +				 */
> +				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
>  			}
>  			cond_resched();
>  
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index cdf1119..22990cf 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -111,12 +111,13 @@ struct bio {
>  #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
>  #define BIO_QUIET	10	/* Make BIO Quiet */
>  #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
> +#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
>  
>  /*
>   * Flags starting here get preserved by bio_reset() - this includes
>   * BIO_POOL_IDX()
>   */
> -#define BIO_RESET_BITS	12
> +#define BIO_RESET_BITS	13
>  
>  #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
>  
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 5afc4f9..4c16c4a 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
>  int sync_dirty_buffer(struct buffer_head *bh);
>  int __sync_dirty_buffer(struct buffer_head *bh, int rw);
>  void write_dirty_buffer(struct buffer_head *bh, int rw);
> +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags);
>  int submit_bh(int, struct buffer_head *);
>  void write_boundary_block(struct block_device *bdev,
>  			sector_t bblock, unsigned blocksize);
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index c7fc1e6..a4ed56c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -88,7 +88,6 @@ struct inodes_stat_t {
>  #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
>  
>  /* These sb flags are internal to the kernel */
> -#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
>  #define MS_NOSEC	(1<<28)
>  #define MS_BORN		(1<<29)
>  #define MS_ACTIVE	(1<<30)
> diff --git a/mm/bounce.c b/mm/bounce.c
> index 5f89017..a5c2ec3 100644
> --- a/mm/bounce.c
> +++ b/mm/bounce.c
> @@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
>  #ifdef CONFIG_NEED_BOUNCE_POOL
>  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
>  {
> -	struct page *page;
> -	struct backing_dev_info *bdi;
> -	struct address_space *mapping;
> -	struct bio_vec *from;
> -	int i;
> -
>  	if (bio_data_dir(bio) != WRITE)
>  		return 0;
>  
>  	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
>  		return 0;
>  
> -	/*
> -	 * Based on the first page that has a valid mapping, decide whether or
> -	 * not we have to employ bounce buffering to guarantee stable pages.
> -	 */
> -	bio_for_each_segment(from, bio, i) {
> -		page = from->bv_page;
> -		mapping = page_mapping(page);
> -		if (!mapping)
> -			continue;
> -		bdi = mapping->backing_dev_info;
> -		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
> -	}
> -
> -	return 0;
> +	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
>  }
>  #else
>  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index efe6814..4514ad7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
>  
>  	if (!bdi_cap_stable_pages_required(bdi))
>  		return;
> -#ifdef CONFIG_NEED_BOUNCE_POOL
> -	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
> -		return;
> -#endif /* CONFIG_NEED_BOUNCE_POOL */
>  
>  	wait_on_page_writeback(page);
>  }


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3] mm: Make snapshotting pages for stable writes a per-bio operation
@ 2013-04-02 17:01                       ` Darrick J. Wong
  0 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-04-02 17:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Jan Kara, Shuge, linux-kernel, linux-mm, linux-ext4,
	Kevin, Theodore Ts'o, Jens Axboe, Catalin Marinas,
	Will Deacon, linux-arm-kernel

Hi,

A couple of weeks have gone by without further comments about this patch.

Are you interested in the minor cleanups and added comments, or is the v2 patch
in -next good enough?

Apparently Mel Gorman's interested in this patchset too.  Mel: most of the
stable pages part 2 work is already upstream for 3.9... except this piece.  Are
you interested in having this piece in 3.9 also?  Or is 3.10 good enough for
everyone?

--D

On Mon, Mar 18, 2013 at 04:02:59PM -0700, Darrick J. Wong wrote:
> Walking a bio's page mappings has proved problematic, so create a new bio flag
> to indicate that a bio's data needs to be snapshotted in order to guarantee
> stable pages during writeback.  Next, for the one user (ext3/jbd) of
> snapshotting, hook all the places where writes can be initiated without
> PG_writeback set, and set BIO_SNAP_STABLE there.  We must also flag journal
> "metadata" bios for stable writeout, since file data can be written through the
> journal.  Finally, the MS_SNAP_STABLE mount flag (only used by ext3) is now
> superfluous, so get rid of it.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> [darrick.wong@oracle.com: Fold in a couple of small cleanups from akpm]
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>  fs/buffer.c                 |    9 ++++++++-
>  fs/ext3/super.c             |    1 -
>  fs/jbd/commit.c             |   25 ++++++++++++++++++++++---
>  include/linux/blk_types.h   |    3 ++-
>  include/linux/buffer_head.h |    1 +
>  include/uapi/linux/fs.h     |    1 -
>  mm/bounce.c                 |   21 +--------------------
>  mm/page-writeback.c         |    4 ----
>  8 files changed, 34 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b4dcb34..71578d6 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2949,7 +2949,7 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
>  	}
>  }
>  
> -int submit_bh(int rw, struct buffer_head * bh)
> +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
>  {
>  	struct bio *bio;
>  	int ret = 0;
> @@ -2984,6 +2984,7 @@ int submit_bh(int rw, struct buffer_head * bh)
>  
>  	bio->bi_end_io = end_bio_bh_io_sync;
>  	bio->bi_private = bh;
> +	bio->bi_flags |= bio_flags;
>  
>  	/* Take care of bh's that straddle the end of the device */
>  	guard_bh_eod(rw, bio, bh);
> @@ -2997,6 +2998,12 @@ int submit_bh(int rw, struct buffer_head * bh)
>  	bio_put(bio);
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(_submit_bh);
> +
> +int submit_bh(int rw, struct buffer_head *bh)
> +{
> +	return _submit_bh(rw, bh, 0);
> +}
>  EXPORT_SYMBOL(submit_bh);
>  
>  /**
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index fb5120a..3dc48cc 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2067,7 +2067,6 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
>  		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
>  		"writeback");
> -	sb->s_flags |= MS_SNAP_STABLE;
>  
>  	return 0;
>  
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index 86b39b1..11bb11f 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -162,8 +162,17 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
>  
>  	for (i = 0; i < bufs; i++) {
>  		wbuf[i]->b_end_io = end_buffer_write_sync;
> -		/* We use-up our safety reference in submit_bh() */
> -		submit_bh(write_op, wbuf[i]);
> +		/*
> +		 * Here we write back pagecache data that may be mmaped. Since
> +		 * we cannot afford to clean the page and set PageWriteback
> +		 * here due to lock ordering (page lock ranks above transaction
> +		 * start), the data can change while IO is in flight. Tell the
> +		 * block layer it should bounce the bio pages if stable data
> +		 * during write is required.
> +		 *
> +		 * We use up our safety reference in submit_bh().
> +		 */
> +		_submit_bh(write_op, wbuf[i], 1 << BIO_SNAP_STABLE);
>  	}
>  }
>  
> @@ -667,7 +676,17 @@ start_journal_io:
>  				clear_buffer_dirty(bh);
>  				set_buffer_uptodate(bh);
>  				bh->b_end_io = journal_end_buffer_io_sync;
> -				submit_bh(write_op, bh);
> +				/*
> +				 * In data=journal mode, here we can end up
> +				 * writing pagecache data that might be
> +				 * mmapped. Since we can't afford to clean the
> +				 * page and set PageWriteback (see the comment
> +				 * near the other use of _submit_bh()), the
> +				 * data can change while the write is in
> +				 * flight.  Tell the block layer to bounce the
> +				 * bio pages if stable pages are required.
> +				 */
> +				_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
>  			}
>  			cond_resched();
>  
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index cdf1119..22990cf 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -111,12 +111,13 @@ struct bio {
>  #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
>  #define BIO_QUIET	10	/* Make BIO Quiet */
>  #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
> +#define BIO_SNAP_STABLE	12	/* bio data must be snapshotted during write */
>  
>  /*
>   * Flags starting here get preserved by bio_reset() - this includes
>   * BIO_POOL_IDX()
>   */
> -#define BIO_RESET_BITS	12
> +#define BIO_RESET_BITS	13
>  
>  #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
>  
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 5afc4f9..4c16c4a 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -181,6 +181,7 @@ void ll_rw_block(int, int, struct buffer_head * bh[]);
>  int sync_dirty_buffer(struct buffer_head *bh);
>  int __sync_dirty_buffer(struct buffer_head *bh, int rw);
>  void write_dirty_buffer(struct buffer_head *bh, int rw);
> +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags);
>  int submit_bh(int, struct buffer_head *);
>  void write_boundary_block(struct block_device *bdev,
>  			sector_t bblock, unsigned blocksize);
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index c7fc1e6..a4ed56c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -88,7 +88,6 @@ struct inodes_stat_t {
>  #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
>  
>  /* These sb flags are internal to the kernel */
> -#define MS_SNAP_STABLE	(1<<27) /* Snapshot pages during writeback, if needed */
>  #define MS_NOSEC	(1<<28)
>  #define MS_BORN		(1<<29)
>  #define MS_ACTIVE	(1<<30)
> diff --git a/mm/bounce.c b/mm/bounce.c
> index 5f89017..a5c2ec3 100644
> --- a/mm/bounce.c
> +++ b/mm/bounce.c
> @@ -181,32 +181,13 @@ static void bounce_end_io_read_isa(struct bio *bio, int err)
>  #ifdef CONFIG_NEED_BOUNCE_POOL
>  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
>  {
> -	struct page *page;
> -	struct backing_dev_info *bdi;
> -	struct address_space *mapping;
> -	struct bio_vec *from;
> -	int i;
> -
>  	if (bio_data_dir(bio) != WRITE)
>  		return 0;
>  
>  	if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
>  		return 0;
>  
> -	/*
> -	 * Based on the first page that has a valid mapping, decide whether or
> -	 * not we have to employ bounce buffering to guarantee stable pages.
> -	 */
> -	bio_for_each_segment(from, bio, i) {
> -		page = from->bv_page;
> -		mapping = page_mapping(page);
> -		if (!mapping)
> -			continue;
> -		bdi = mapping->backing_dev_info;
> -		return mapping->host->i_sb->s_flags & MS_SNAP_STABLE;
> -	}
> -
> -	return 0;
> +	return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
>  }
>  #else
>  static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index efe6814..4514ad7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2311,10 +2311,6 @@ void wait_for_stable_page(struct page *page)
>  
>  	if (!bdi_cap_stable_pages_required(bdi))
>  		return;
> -#ifdef CONFIG_NEED_BOUNCE_POOL
> -	if (mapping->host->i_sb->s_flags & MS_SNAP_STABLE)
> -		return;
> -#endif /* CONFIG_NEED_BOUNCE_POOL */
>  
>  	wait_on_page_writeback(page);
>  }
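
For readers skimming the diff, the intended flow condenses to the short
sketch below.  This is an illustration only, not code from the patch; the
"raw bio" caller is hypothetical, while the other calls mirror interfaces
visible in the diff above.

	/* buffer_head path -- what the jbd commit code above does */
	_submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);

	/* a caller building its own bio could set the bit directly
	 * (hypothetical example, not taken from the patch) */
	bio->bi_flags |= 1 << BIO_SNAP_STABLE;
	submit_bio(WRITE, bio);

	/* bounce layer: snapshot the data only when the device demands
	 * stable pages AND the submitter asked for it */
	if (bdi_cap_stable_pages_required(&q->backing_dev_info) &&
	    test_bit(BIO_SNAP_STABLE, &bio->bi_flags))
		/* bounce-copy the bio's pages before issuing the write */;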


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3] mm: Make snapshotting pages for stable writes a per-bio operation
  2013-04-02 17:01                       ` Darrick J. Wong
  (?)
@ 2013-04-03 14:20                         ` Mel Gorman
  -1 siblings, 0 replies; 74+ messages in thread
From: Mel Gorman @ 2013-04-03 14:20 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrew Morton, Jan Kara, Shuge, linux-kernel, linux-mm,
	linux-ext4, Kevin, Theodore Ts'o, Jens Axboe,
	Catalin Marinas, Will Deacon, linux-arm-kernel

On Tue, Apr 02, 2013 at 10:01:43AM -0700, Darrick J. Wong wrote:
> Hi,
> 
> A couple of weeks have gone by without further comments about this patch.
> 
> Are you interested in the minor cleanups and added comments, or is the v2 patch
> in -next good enough?
> 
> Apparently Mel Gorman's interested in this patchset too.  Mel: Most of stable
> pages part 2 are already in upstream for 3.9... except this piece.  Are you
> interested in having this piece in 3.9 also?  Or is 3.10 good enough for
> everyone?
> 

My understanding is that it only affects ARM and DEBUG_VM so there is a
relatively small chance of this generating spurious bug reports.  However,
3.9 is still far enough away that I see no good reason to delay this patch
until 3.10 either.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3] mm: Make snapshotting pages for stable writes a per-bio operation
  2013-04-03 14:20                         ` Mel Gorman
  (?)
@ 2013-04-03 14:42                           ` Jan Kara
  -1 siblings, 0 replies; 74+ messages in thread
From: Jan Kara @ 2013-04-03 14:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Darrick J. Wong, Andrew Morton, Jan Kara, Shuge, linux-kernel,
	linux-mm, linux-ext4, Kevin, Theodore Ts'o, Jens Axboe,
	Catalin Marinas, Will Deacon, linux-arm-kernel

On Wed 03-04-13 15:20:19, Mel Gorman wrote:
> On Tue, Apr 02, 2013 at 10:01:43AM -0700, Darrick J. Wong wrote:
> > Hi,
> > 
> > A couple of weeks have gone by without further comments about this patch.
> > 
> > Are you interested in the minor cleanups and added comments, or is the v2 patch
> > in -next good enough?
> > 
> > Apparently Mel Gorman's interested in this patchset too.  Mel: Most of stable
> > pages part 2 are already in upstream for 3.9... except this piece.  Are you
> > interested in having this piece in 3.9 also?  Or is 3.10 good enough for
> > everyone?
> > 
> 
> My understanding is that it only affects ARM and DEBUG_VM so there is a
> relatively small chance of this generating spurious bug reports.  However,
> 3.9 is still far enough away that I see no good reason to delay this patch
> until 3.10 either.
  No, actually with direct IO, anything that needs stable pages is going to
blow up quickly, because the pages attached to the bio needn't come from the
page cache. So I think it had better make it into 3.9.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
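
Jan's point, made concrete: a direct-IO write builds its bio from pinned user
pages rather than page cache pages, so the old page_mapping()-based heuristic
removed by the patch has nothing to inspect.  A rough sketch of that
situation, not taken from the kernel, with invented function and variable
names:

	/* Pin the caller's buffer and attach it to a bio -- roughly what the
	 * direct-IO path does.  The pages are typically anonymous, so
	 * page_mapping() returns NULL and no superblock flag can be found;
	 * the per-bio BIO_SNAP_STABLE flag avoids inspecting pages at all. */
	static void attach_user_buffer(struct bio *bio, unsigned long user_addr)
	{
		struct page *pages[16];
		int i, n;

		n = get_user_pages_fast(user_addr, 16, 0, pages);
		for (i = 0; i < n; i++)
			bio_add_page(bio, pages[i], PAGE_SIZE, 0);
	}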

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3] mm: Make snapshotting pages for stable writes a per-bio operation
  2013-04-03 14:42                           ` Jan Kara
  (?)
@ 2013-04-09 18:03                             ` Darrick J. Wong
  -1 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2013-04-09 18:03 UTC (permalink / raw)
  To: Jan Kara, Andrew Morton
  Cc: Mel Gorman, Shuge, linux-kernel, linux-mm, linux-ext4, Kevin,
	Theodore Ts'o, Jens Axboe, Catalin Marinas, Will Deacon,
	linux-arm-kernel

On Wed, Apr 03, 2013 at 04:42:44PM +0200, Jan Kara wrote:
> On Wed 03-04-13 15:20:19, Mel Gorman wrote:
> > On Tue, Apr 02, 2013 at 10:01:43AM -0700, Darrick J. Wong wrote:
> > > Hi,
> > > 
> > > A couple of weeks have gone by without further comments about this patch.
> > > 
> > > Are you interested in the minor cleanups and added comments, or is the v2 patch
> > > in -next good enough?
> > > 
> > > Apparently Mel Gorman's interested in this patchset too.  Mel: Most of stable
> > > pages part 2 are already in upstream for 3.9... except this piece.  Are you
> > > interested in having this piece in 3.9 also?  Or is 3.10 good enough for
> > > everyone?
> > > 
> > 
> > My understanding is that it only affects ARM and DEBUG_VM so there is a
> > relatively small chance of this generating spurious bug reports.  However,
> > 3.9 is still far enough away that I see no good reason to delay this patch
> > until 3.10 either.
>   No, actually with direct IO, anything that needs stable pages is going to
> blow up quickly because pages attached to bio needn't be from page cache. So
> I think it should better make it into 3.9.

Hmm.  The previous version of this patch has been hanging around in -next for a
few weeks without problems (afaik).  With just a raw 3.9-rc[56] I haven't been
able to produce a failed checksum or kernel crash when running with O_DIRECT,
either with the write-after-checksum reproducer or even a simple dd
oflag=direct.  But maybe I've gotten lucky on x86?

So... Andrew: Would you like to pick up the patch with more descriptive
comments?  And, is it too late to push it for 3.9?  Jan seems to think we might
have a bug (though I haven't encountered it).

I'll resend the patch just in case it got eaten.

--D
> 
> 									Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
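
The "simple dd oflag=direct" check Darrick mentions can be repeated with a
few lines of userspace C; this is only a rough equivalent, and the file name,
block size and alignment are arbitrary assumptions:

	/* Roughly "dd if=/dev/zero of=testfile bs=4096 count=1 oflag=direct" */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		void *buf;
		int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

		if (fd < 0)
			return 1;
		if (posix_memalign(&buf, 4096, 4096))	/* O_DIRECT needs an aligned buffer */
			return 1;
		memset(buf, 0, 4096);
		if (write(fd, buf, 4096) != 4096)
			return 1;
		close(fd);
		free(buf);
		return 0;
	}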

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2013-04-09 18:05 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-08 12:37 [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2 Shuge
2013-03-08 12:37 ` Shuge
2013-03-12 22:32 ` Andrew Morton
2013-03-12 22:32   ` Andrew Morton
2013-03-13  1:10   ` Darrick J. Wong
2013-03-13  1:10     ` Darrick J. Wong
2013-03-13  1:10     ` Darrick J. Wong
2013-03-13  3:35     ` Shuge
2013-03-13  3:35       ` Shuge
2013-03-13  3:35       ` Shuge
2013-03-13  4:11       ` Andrew Morton
2013-03-13  4:11         ` Andrew Morton
2013-03-13  4:11         ` Andrew Morton
2013-03-13  9:42         ` Russell King - ARM Linux
2013-03-13  9:42           ` Russell King - ARM Linux
2013-03-13  9:42           ` Russell King - ARM Linux
2013-03-13  8:50     ` Jan Kara
2013-03-13  8:50       ` Jan Kara
2013-03-13  8:50       ` Jan Kara
2013-03-13 19:44       ` Darrick J. Wong
2013-03-13 19:44         ` Darrick J. Wong
2013-03-13 19:44         ` Darrick J. Wong
2013-03-13 21:02         ` Jan Kara
2013-03-13 21:02           ` Jan Kara
2013-03-13 21:02           ` Jan Kara
2013-03-14 22:42           ` Darrick J. Wong
2013-03-14 22:42             ` Darrick J. Wong
2013-03-14 22:42             ` Darrick J. Wong
2013-03-14 23:01             ` Andrew Morton
2013-03-14 23:01               ` Andrew Morton
2013-03-14 23:01               ` Andrew Morton
2013-03-15 10:01             ` Jan Kara
2013-03-15 10:01               ` Jan Kara
2013-03-15 10:01               ` Jan Kara
2013-03-15 17:54               ` Darrick J. Wong
2013-03-15 17:54                 ` Darrick J. Wong
2013-03-15 17:54                 ` Darrick J. Wong
2013-03-18 17:32                 ` Jan Kara
2013-03-18 17:32                   ` Jan Kara
2013-03-18 17:32                   ` Jan Kara
2013-03-15 23:28               ` [PATCH] mm: Make snapshotting pages for stable writes a per-bio operation Darrick J. Wong
2013-03-15 23:28                 ` Darrick J. Wong
2013-03-15 23:28                 ` Darrick J. Wong
2013-03-18 17:41                 ` Jan Kara
2013-03-18 17:41                   ` Jan Kara
2013-03-18 17:41                   ` Jan Kara
2013-03-18 23:01                   ` Darrick J. Wong
2013-03-18 23:01                     ` Darrick J. Wong
2013-03-18 23:01                     ` Darrick J. Wong
2013-03-18 23:02                   ` [PATCH v3] " Darrick J. Wong
2013-03-18 23:02                     ` Darrick J. Wong
2013-03-18 23:02                     ` Darrick J. Wong
2013-03-19  8:54                     ` Jan Kara
2013-03-19  8:54                       ` Jan Kara
2013-03-19  8:54                       ` Jan Kara
2013-04-02 17:01                     ` Darrick J. Wong
2013-04-02 17:01                       ` Darrick J. Wong
2013-04-02 17:01                       ` Darrick J. Wong
2013-04-02 17:01                       ` Darrick J. Wong
2013-04-03 14:20                       ` Mel Gorman
2013-04-03 14:20                         ` Mel Gorman
2013-04-03 14:20                         ` Mel Gorman
2013-04-03 14:42                         ` Jan Kara
2013-04-03 14:42                           ` Jan Kara
2013-04-03 14:42                           ` Jan Kara
2013-04-09 18:03                           ` Darrick J. Wong
2013-04-09 18:03                             ` Darrick J. Wong
2013-04-09 18:03                             ` Darrick J. Wong
2013-03-14 22:46           ` [PATCH] bounce:fix bug, avoid to flush dcache on slab page from jbd2 Andrew Morton
2013-03-14 22:46             ` Andrew Morton
2013-03-14 22:46             ` Andrew Morton
2013-03-14 23:27             ` Darrick J. Wong
2013-03-14 23:27               ` Darrick J. Wong
2013-03-14 23:27               ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.