* [PATCH] ext4: fix data integrity sync in ordered mode
@ 2014-04-30 10:02 Namjae Jeon
  2014-04-30 16:01 ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Namjae Jeon @ 2014-04-30 10:02 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Ashish Sangwan, 'Jan kara'

When we perform a data integrity sync, we tag all the dirty pages with
PAGECACHE_TAG_TOWRITE at the start of ext4_da_writepages.
Later we check for this tag in write_cache_pages_da, create a
struct mpage_da_data containing contiguously indexed pages that carry this
tag, and sync these pages with a call to mpage_da_map_and_submit.
This is done in a while loop until all the PAGECACHE_TAG_TOWRITE pages
are synced. We also do a journal start and stop in each iteration.
journal_stop could initiate a journal commit, which would call
ext4_writepage, which in turn will call ext4_bio_write_page even for
delayed or unwritten buffers. When ext4_bio_write_page is called for such
buffers, it does not sync them, but it still clears PAGECACHE_TAG_TOWRITE
on the corresponding page, so these pages are not synced by the currently
running data integrity sync either. We end up with dirty pages although
the sync has completed.

This can cause data loss when the sync call is followed by a
truncate_pagecache call, which is exactly the case in collapse_range.
(It causes xfstests generic/127 to fail.)

Cc: stable@vger.kernel.org
Cc: Jan kara <jack@suse.de>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
---
 fs/ext4/inode.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b1dc334..bd85712 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1865,12 +1865,19 @@ static int ext4_writepage(struct page *page,
 	if (ext4_walk_page_buffers(NULL, page_bufs, 0, len, NULL,
 				   ext4_bh_delay_or_unwritten)) {
 		redirty_page_for_writepage(wbc, page);
-		if (current->flags & PF_MEMALLOC) {
+		if ((current->flags & PF_MEMALLOC) ||
+		     radix_tree_tag_get(&page->mapping->page_tree,
+					page->index, PAGECACHE_TAG_TOWRITE)) {
 			/*
 			 * For memory cleaning there's no point in writing only
 			 * some buffers. So just bail out. Warn if we came here
 			 * from direct reclaim.
-			 */
+			 * We should also bail out when a journal commit
+			 * happens during an integrity sync operation, because
+			 * calling ext4_bio_write_page in this case will clear
+			 * PAGECACHE_TAG_TOWRITE and we could end up with
+			 * dirty pages even after completion of a sync call.
+			 */
 			WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD))
 							== PF_MEMALLOC);
 			unlock_page(page);
-- 
1.7.11-rc0
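
The lost-tag sequence described in the commit message above can be
modelled in user space. The following is only a toy sketch, not kernel
code: struct toy_page, the TOY_TAG_* bits and every toy_* function are
hypothetical stand-ins invented for illustration, and the model assumes
that a page written by the journal commit path keeps its delayed or
unwritten buffers dirty while losing its TOWRITE tag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the page cache tags (hypothetical, not kernel code). */
#define TOY_TAG_DIRTY   0x1u
#define TOY_TAG_TOWRITE 0x2u

struct toy_page {
	unsigned int tags;
	bool has_delalloc;	/* dirty delayed/unwritten buffers remain */
};

/* Start of a data integrity sync: tag every dirty page TOWRITE. */
static void toy_tag_pages_for_writeback(struct toy_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (pages[i].tags & TOY_TAG_DIRTY)
			pages[i].tags |= TOY_TAG_TOWRITE;
}

/*
 * Journal commit writing a page: starting writeback clears TOWRITE, but
 * delayed/unwritten buffers are skipped, so the page stays dirty.
 */
static void toy_commit_writepage(struct toy_page *page)
{
	assert(page->tags & TOY_TAG_DIRTY);
	page->tags &= ~TOY_TAG_TOWRITE;
	if (!page->has_delalloc)
		page->tags &= ~TOY_TAG_DIRTY;
}

/* The sync loop only visits pages that still carry TOWRITE. */
static void toy_sync_tagged(struct toy_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!(pages[i].tags & TOY_TAG_TOWRITE))
			continue;
		pages[i].has_delalloc = false;	/* blocks allocated, written */
		pages[i].tags = 0;
	}
}

/* Returns how many pages are still dirty once the "sync" has finished. */
static size_t toy_dirty_after_sync(bool commit_races)
{
	struct toy_page pages[2] = {
		{ TOY_TAG_DIRTY, true },
		{ TOY_TAG_DIRTY, true },
	};
	size_t dirty = 0;

	toy_tag_pages_for_writeback(pages, 2);
	if (commit_races)
		toy_commit_writepage(&pages[1]);
	toy_sync_tagged(pages, 2);

	for (size_t i = 0; i < 2; i++)
		if (pages[i].tags & TOY_TAG_DIRTY)
			dirty++;
	return dirty;
}
```

In this model, toy_dirty_after_sync(false) leaves no dirty page behind,
while toy_dirty_after_sync(true) leaves one, which is the situation a
following truncate_pagecache call would turn into data loss.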



* Re: [PATCH] ext4: fix data integrity sync in ordered mode
  2014-04-30 10:02 [PATCH] ext4: fix data integrity sync in ordered mode Namjae Jeon
@ 2014-04-30 16:01 ` Jan Kara
  2014-05-02 11:35   ` Namjae Jeon
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2014-04-30 16:01 UTC (permalink / raw)
  To: Namjae Jeon
  Cc: Theodore Ts'o, linux-ext4, Ashish Sangwan, 'Jan kara'

  Hello,

On Wed 30-04-14 19:02:14, Namjae Jeon wrote:
> When we perform a data integrity sync, we tag all the dirty pages with
> PAGECACHE_TAG_TOWRITE at the start of ext4_da_writepages.
> Later we check for this tag in write_cache_pages_da, create a
> struct mpage_da_data containing contiguously indexed pages that carry this
> tag, and sync these pages with a call to mpage_da_map_and_submit.
> This is done in a while loop until all the PAGECACHE_TAG_TOWRITE pages
> are synced. We also do a journal start and stop in each iteration.
> journal_stop could initiate a journal commit, which would call
> ext4_writepage, which in turn will call ext4_bio_write_page even for
> delayed or unwritten buffers. When ext4_bio_write_page is called for such
> buffers, it does not sync them, but it still clears PAGECACHE_TAG_TOWRITE
> on the corresponding page, so these pages are not synced by the currently
> running data integrity sync either. We end up with dirty pages although
> the sync has completed.
>
> This can cause data loss when the sync call is followed by a
> truncate_pagecache call, which is exactly the case in collapse_range.
> (It causes xfstests generic/127 to fail.)
  This is well spotted. Thanks for finding this bug. See my comment below
regarding the fix.

> Cc: stable@vger.kernel.org
> Cc: Jan kara <jack@suse.de>
> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
> ---
>  fs/ext4/inode.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b1dc334..bd85712 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1865,12 +1865,19 @@ static int ext4_writepage(struct page *page,
>  	if (ext4_walk_page_buffers(NULL, page_bufs, 0, len, NULL,
>  				   ext4_bh_delay_or_unwritten)) {
>  		redirty_page_for_writepage(wbc, page);
> -		if (current->flags & PF_MEMALLOC) {
> +		if ((current->flags & PF_MEMALLOC) || 
> +		     radix_tree_tag_get(&page->mapping->page_tree,
> +					page->index, PAGECACHE_TAG_TOWRITE)) {
  I don't think your fix is correct. journal_submit_inode_data_buffers()
uses WB_SYNC_ALL mode to write the pages and thus all the pages you'll see
in ext4_writepage() are going to have the TOWRITE tag set. And even if that
wasn't the case, you'd have problems when blocksize < pagesize, because in
data=ordered mode we want to write out allocated (mapped) blocks in the page
to avoid exposure of uninitialized data after a crash (e.g. in case we have
allocated some blocks in the current transaction but not yet finished
writing them out and there are other blocks underlying the page which
aren't allocated yet). Fixing this isn't easy, I'm afraid.

What we could do is to create a variant of set_page_writeback() which
doesn't clear the TOWRITE tag and use that in ext4_bio_write_page() if we
are writing out just some buffers in a page and leaving other dirty buffers
behind. The downside is that we would leave TOWRITE-tagged pages behind
even when we don't actually race with other writeback, but I don't see
that causing any real problems.
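
The shape of such a variant might look like the sketch below. It is a
user-space illustration under stated assumptions, not the kernel
implementation: struct toy_page, the TOY_TAG_* bits and all toy_* names
(including toy_set_page_writeback_keep_tag) are hypothetical, and the
model only tracks the tag bits that matter for this discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tag bits standing in for the page cache tags. */
#define TOY_TAG_DIRTY     0x1u
#define TOY_TAG_WRITEBACK 0x2u
#define TOY_TAG_TOWRITE   0x4u

struct toy_page {
	unsigned int tags;
};

/* Common helper: mark writeback, optionally preserving TOWRITE. */
static void toy_start_writeback(struct toy_page *page, bool keep_towrite)
{
	assert(page != NULL);
	page->tags |= TOY_TAG_WRITEBACK;
	if (!keep_towrite)
		page->tags &= ~TOY_TAG_TOWRITE;
}

/* Models the existing behaviour: TOWRITE is dropped. */
static void toy_set_page_writeback(struct toy_page *page)
{
	toy_start_writeback(page, false);
}

/* Models the proposed variant: TOWRITE survives. */
static void toy_set_page_writeback_keep_tag(struct toy_page *page)
{
	toy_start_writeback(page, true);
}

/*
 * Caller side: keep the tag only when some dirty buffers are left
 * behind, so that a running integrity sync still finds the page.
 */
static void toy_bio_write_page(struct toy_page *page, bool leaves_dirty_buffers)
{
	if (leaves_dirty_buffers)
		toy_set_page_writeback_keep_tag(page);
	else
		toy_set_page_writeback(page);
}
```

Routing both entry points through one helper with a flag keeps the
existing callers untouched, which is the point of adding a variant
instead of changing set_page_writeback() itself.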

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* RE: [PATCH] ext4: fix data integrity sync in ordered mode
  2014-04-30 16:01 ` Jan Kara
@ 2014-05-02 11:35   ` Namjae Jeon
  2014-05-05 17:16     ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Namjae Jeon @ 2014-05-02 11:35 UTC (permalink / raw)
  To: 'Jan Kara'
  Cc: 'Theodore Ts'o', 'linux-ext4',
	'Ashish Sangwan'

> What we could do is to create a variant of set_page_writeback() which
> doesn't clear the TOWRITE tag and use that in ext4_bio_write_page() if we
> are writing out just some buffers in a page and leaving other dirty buffers
> behind. The downside is that we would leave TOWRITE-tagged pages behind
> even when we don't actually race with other writeback, but I don't see
> that causing any real problems.

Hi Jan,
Thanks for your reply.

I agree with you, but set_page_writeback is used in many places,
so I think modifying set_page_writeback would change too much.

How about a change like this?

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 4acf1f7..680f12f 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -373,14 +373,14 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 	unsigned block_start, blocksize;
 	struct buffer_head *bh, *head;
 	int ret = 0;
-	int nr_submitted = 0;
+	int nr_submitted = 0, dirty_buffers = 0, unmapped_dirty_buffers = 0;
+	bool needs_tag_towrite = false;
 
 	blocksize = 1 << inode->i_blkbits;
 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(PageWriteback(page));
 
-	set_page_writeback(page);
 	ClearPageError(page);
 
 	/*
@@ -418,6 +418,8 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 				clear_buffer_dirty(bh);
 			if (io->io_bio)
 				ext4_io_submit(io);
+			if ((buffer_delay(bh) || buffer_unwritten(bh)) && buffer_dirty(bh))
+				unmapped_dirty_buffers++;
 			continue;
 		}
 		if (buffer_new(bh)) {
@@ -425,8 +427,21 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 			unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
 		}
 		set_buffer_async_write(bh);
+		dirty_buffers++;
 	} while ((bh = bh->b_this_page) != head);
 
+	if (!dirty_buffers) {
+		unlock_page(page);
+		return ret;
+	}
+
+	if (unmapped_dirty_buffers &&
+	    radix_tree_tag_get(&page->mapping->page_tree, page->index,
+			       PAGECACHE_TAG_TOWRITE))
+		needs_tag_towrite = 1;
+
+	set_page_writeback(page);
+
 	/* Now submit buffers to write */
 	bh = head = page_buffers(page);
 	do {
@@ -457,5 +472,10 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 	/* Nothing submitted - we have to end page writeback */
 	if (!nr_submitted)
 		end_page_writeback(page);
+
+	if (needs_tag_towrite)
+		tag_pages_for_writeback(page->mapping, page->index,
+					page->index);
+
 	return ret;
 }

Thanks!



* Re: [PATCH] ext4: fix data integrity sync in ordered mode
  2014-05-02 11:35   ` Namjae Jeon
@ 2014-05-05 17:16     ` Jan Kara
  2014-05-06  5:19       ` Namjae Jeon
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2014-05-05 17:16 UTC (permalink / raw)
  To: Namjae Jeon
  Cc: 'Jan Kara', 'Theodore Ts'o',
	'linux-ext4', 'Ashish Sangwan'

  Hello,

On Fri 02-05-14 20:35:56, Namjae Jeon wrote:
> I agree with you, but set_page_writeback is used in many places,
> so I think modifying set_page_writeback would change too much.
  I meant we would create a new variant of set_page_writeback() which would
not clear TOWRITE tag (something like set_page_writeback_keepwrite()) and
then use this variant from ext4_writepage() during writeback from JBD2.

Regarding your patch:
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 4acf1f7..680f12f 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
...
> @@ -425,8 +427,21 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
>  			unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
>  		}
>  		set_buffer_async_write(bh);
> +		dirty_buffers++;
>  	} while ((bh = bh->b_this_page) != head);
>  
> +	if (!dirty_buffers) {
> +		unlock_page(page);
> +		return ret;
> +	}
> +
> +	if (unmapped_dirty_buffers &&
> +	    radix_tree_tag_get(&page->mapping->page_tree, page->index,
> +			       PAGECACHE_TAG_TOWRITE))
> +		needs_tag_towrite = 1;
> +
> +	set_page_writeback(page);
  You cannot call set_page_writeback() here. There might be bios against
this page already in flight at this moment and so IO completion could race
with set_page_writeback(). 
  
>  	/* Now submit buffers to write */
>  	bh = head = page_buffers(page);
>  	do {
> @@ -457,5 +472,10 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
>  	/* Nothing submitted - we have to end page writeback */
>  	if (!nr_submitted)
>  		end_page_writeback(page);
> +
> +	if (needs_tag_towrite)
> +		tag_pages_for_writeback(page->mapping, page->index,
> +					page->index);
> +	
  And this is racy. Data integrity sync can do tagged lookup just after
set_page_writeback() cleared the tag and so it won't find the dirty page.
Really the only race free way is not to clear the tag in set_page_writeback().
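
The window can be made concrete with a deterministic user-space model.
This is a sketch under stated assumptions, not kernel code: the TOY_*
bits, enum toy_fix and toy_sync_finds_page() are hypothetical, and the
interleaving is fixed by hand so that the sync's tagged lookup always
runs between the start of writeback and the attempt to re-tag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tag bits standing in for the page cache tags. */
#define TOY_DIRTY     0x1u
#define TOY_WRITEBACK 0x2u
#define TOY_TOWRITE   0x4u

enum toy_fix {
	TOY_RETAG_AFTER,	/* clear the tag, then re-tag later */
	TOY_KEEP_TAG,		/* never clear the tag at all */
};

/*
 * Replays one interleaving: the commit path starts writeback, the data
 * integrity sync does its tagged lookup, and only then does the commit
 * path get a chance to re-tag. Returns whether the sync saw the page.
 */
static bool toy_sync_finds_page(enum toy_fix fix)
{
	unsigned int tags = TOY_DIRTY | TOY_TOWRITE;
	bool found;

	/* Commit path: start writeback on the page. */
	tags |= TOY_WRITEBACK;
	if (fix == TOY_RETAG_AFTER)
		tags &= ~TOY_TOWRITE;

	/* Sync path: tagged lookup runs inside the window. */
	found = (tags & TOY_TOWRITE) != 0;

	/* Commit path: re-tagging now is too late for that lookup. */
	if (fix == TOY_RETAG_AFTER)
		tags |= TOY_TOWRITE;

	assert(tags & TOY_TOWRITE);	/* both variants end up tagged */
	return found;
}
```

Both variants leave the page tagged at the end, but only TOY_KEEP_TAG
guarantees the lookup sees it; with TOY_RETAG_AFTER the sync has already
moved on by the time the tag comes back.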

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* RE: [PATCH] ext4: fix data integrity sync in ordered mode
  2014-05-05 17:16     ` Jan Kara
@ 2014-05-06  5:19       ` Namjae Jeon
  0 siblings, 0 replies; 5+ messages in thread
From: Namjae Jeon @ 2014-05-06  5:19 UTC (permalink / raw)
  To: 'Jan Kara'
  Cc: 'Theodore Ts'o', 'linux-ext4',
	'Ashish Sangwan'

>   And this is racy. Data integrity sync can do tagged lookup just after
> set_page_writeback() cleared the tag and so it won't find the dirty page.
> Really the only race free way is not to clear the tag in set_page_writeback().
Okay, I will send v2 patch as you suggested.

Thanks for review!


