All of lore.kernel.org
 help / color / mirror / Atom feed
From: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
To: Jan Kara <jack@suse.cz>
Cc: "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"Li, Michael" <huayil@qti.qualcomm.com>
Subject: RE: ext4 out of order when use cfq scheduler
Date: Mon, 11 Jan 2016 09:05:20 +0000	[thread overview]
Message-ID: <97be88d3acae4317ba1b6443f859ada0@SGPMBX1004.APAC.bosch.com> (raw)
In-Reply-To: 20160107114736.GC8380@quack.suse.cz

[-- Attachment #1: Type: text/plain, Size: 2201 bytes --]


> -----Original Message-----
> From: HUANG Weller (CM/ESW12-CN)
> Sent: Friday, January 08, 2016 8:47 AM
> To: 'Jan Kara' <jack@suse.cz>
> Cc: linux-ext4@vger.kernel.org; Li, Michael <huayil@qti.qualcomm.com>
> Subject: RE: ext4 out of order when use cfq scheduler
> 
> > >
> > > >
> > > > OK, so I was looking into the code and indeed, reality is correct
> > > > and my mental model was wrong! ;) I thought that inode gets added
> > > > to the list of inodes for which we need to wait for data IO
> > > > completion during transaction commit during block allocation. And I was
> wrong.
> > > > It used to happen in
> > > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove
> > > > calls to
> > > > ext4_jbd2_file_inode() from delalloc write path) where it got
> > > > removed. And that was wrong because although we submit data writes
> > > > before dropping handle for allocating transaction and updating
> > > > i_size, nobody guarantees that data IO is not delayed in the block
> > > > layer until
> > transaction commit.
> > > > Which seems to happen in your case. I'll send a fix. Thanks for
> > > > your report and persistence!
> > > >
> > >
> > > Thanks a lot for your feedback :-)
> > > Because I am not familiar with the detail of the ext4 internal code.
> > > I will try to
> > understand your explanation which you describe above.  And have a look
> > on related funcations.
> > > Could you send the fix in this mail ?
> > > And whether the kernel 3.14 also have such issue, right ?
> >
> > The problem is in all kernels starting with 3.8. Attached is a patch
> > which should fix the issue. Can you test whether it fixes the problem for you?
> >
> > 								Honza
> > --
> 
> Yes, of course I will redo the test with the patch. And also give you feedback.

Test result:
Test on 2 targets with the kernel applied your patch. Both of them are OK after 5000 power failure tests. Our target test cycle is 10,000.
By the way, since your original patch can't be applied on 3.x kernel, The attached one  is based on yours and can applied on old kernel(mine is 3.10) directly.

> 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR

[-- Attachment #2: 0001-ext4-Fix-data-exposure-after-a-crash.patch --]
[-- Type: application/octet-stream, Size: 3185 bytes --]

From 98cf821e270748486132079ac8d1c9557c9adfb9 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 7 Jan 2016 12:21:25 +0100
Subject: [PATCH] ext4: Fix data exposure after a crash

Huang has reported that in his powerfail testing he is seeing stale
block contents in some of recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a69d removed the logic with a
flawed argument that it is not needed.

The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.

The only exception to the above rule are extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().

CC: stable@vger.kernel.org
Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d
Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
fix conflict with 3.10 kernel.
Signed-off-by: weller huang <weller.huang@cn.bosch.com>
---
 fs/ext4/inode.c |   20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e48bd5a..d5d6d6c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -751,6 +751,17 @@ has_zeroout:
 		int ret = check_block_validity(inode, map);
 		if (ret != 0)
 			return ret;
+
+		/*
+		 * Inodes with freshly allocated blocks where contents will be
+		 * visible after transaction commit must be on transaction's
+		 * ordered data list.
+		 */
+		if(ext4_should_order_data(inode)) {
+			ret = ext4_jbd2_file_inode(handle, inode);
+			if (ret)
+				return ret;
+		}
 	}
 	return retval;
 }
@@ -1110,15 +1121,6 @@ static int ext4_write_end(struct file *file,
 	int i_size_changed = 0;
 
 	trace_ext4_write_end(inode, pos, len, copied);
-	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
-		ret = ext4_jbd2_file_inode(handle, inode);
-		if (ret) {
-			unlock_page(page);
-			page_cache_release(page);
-			goto errout;
-		}
-	}
-
 	if (ext4_has_inline_data(inode)) {
 		ret = ext4_write_inline_data_end(inode, pos, len,
 						 copied, page);
-- 
1.7.9.5


  parent reply	other threads:[~2016-01-11  9:05 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-12-22  6:24 ext4 out of order when use cfq scheduler HUANG Weller (CM/EPF1-CN)
2015-12-22 15:00 ` Jan Kara
     [not found]   ` <c67f356b63d94d35ad010a6e987b68f0@SGPMBX1004.APAC.bosch.com>
2016-01-05 15:30     ` Jan Kara
2016-01-06  2:39       ` HUANG Weller (CM/ESW12-CN)
2016-01-06 19:17         ` Andreas Dilger
2016-01-07  6:51           ` HUANG Weller (CM/ESW12-CN)
     [not found]         ` <20160106100621.GA24046@quack.suse.cz>
     [not found]           ` <3ab48fa47e434455b101251730e69bd2@SGPMBX1004.APAC.bosch.com>
2016-01-07 10:24             ` Jan Kara
2016-01-07 11:02               ` HUANG Weller (CM/ESW12-CN)
2016-01-07 11:47                 ` Jan Kara
2016-01-07 12:19                   ` Jan Kara
2016-01-08  2:18                     ` HUANG Weller (CM/ESW12-CN)
2016-01-08  0:46                   ` HUANG Weller (CM/ESW12-CN)
2016-01-11  9:05                   ` HUANG Weller (CM/ESW12-CN) [this message]
2016-01-11 10:21                     ` Jan Kara
2016-03-13  4:27                   ` Theodore Ts'o
2016-03-14  2:43                     ` HUANG Weller (CM/ESW12-CN)
2016-03-14  7:39                     ` Jan Kara
2016-03-14 14:36                       ` Theodore Ts'o
2016-03-15 10:46                         ` Jan Kara
2016-03-15 14:46                           ` Jan Kara
2016-03-15 20:09                             ` Jan Kara
2016-03-16  2:30                               ` HUANG Weller (CM/ESW12-CN)
2016-03-18  9:20                                 ` Jan Kara
2016-06-22 11:55                               ` FW: " HUANG Weller (CM/ESW12-CN)
2016-06-22 13:09                                 ` Jan Kara
2016-03-16  0:41                             ` HUANG Weller (CM/ESW12-CN)
2016-03-24 10:16                             ` HUANG Weller (CM/ESW12-CN)
2016-03-24 12:17                               ` Jan Kara
2016-01-28  8:02 ` Xiong Zhou
2016-02-03  6:08   ` HUANG Weller (CM/ESW12-CN)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=97be88d3acae4317ba1b6443f859ada0@SGPMBX1004.APAC.bosch.com \
    --to=weller.huang@cn.bosch.com \
    --cc=huayil@qti.qualcomm.com \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.