All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] ext4: Fix data corruption in inodes with journalled data
@ 2011-07-23  0:39 Jan Kara
  2011-07-23 13:21 ` Amir Goldstein
  2011-07-27  1:27 ` Ted Ts'o
  0 siblings, 2 replies; 6+ messages in thread
From: Jan Kara @ 2011-07-23  0:39 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

When journalling data for an inode (either because it is a symlink or
because the filesystem is mounted in data=journal mode), ext4_evict_inode()
can discard unwritten data by calling truncate_inode_pages(). This is
because we don't mark the buffer / page dirty when journalling data but only
add the buffer to the running transaction and thus mm does not know there
are still unwritten data.

Fix the problem by carefully tracking transaction containing inode's data,
committing this transaction, and writing uncheckpointed buffers when inode
should be reaped.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |   29 +++++++++++++++++++++++++++++
 1 files changed, 29 insertions(+), 0 deletions(-)

  This is ext4 version of an ext3 fix I sent a while ago. It received only
light testing but I figured you might want get the patch earlier rather than
later given the merge window is open.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e3126c0..019995b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -190,6 +190,33 @@ void ext4_evict_inode(struct inode *inode)
 
 	trace_ext4_evict_inode(inode);
 	if (inode->i_nlink) {
+		/*
+		 * When journalling data dirty buffers are tracked only in the
+		 * journal. So although mm thinks everything is clean and
+		 * ready for reaping the inode might still have some pages to
+		 * write in the running transaction or waiting to be
+		 * checkpointed. Thus calling jbd2_journal_invalidatepage()
+		 * (via truncate_inode_pages()) to discard these buffers can
+		 * cause data loss. Also even if we did not discard these
+		 * buffers, we would have no way to find them after the inode
+		 * is reaped and thus user could see stale data if he tries to
+		 * read them before the transaction is checkpointed. So be
+		 * careful and force everything to disk here... We use
+		 * ei->i_datasync_tid to store the newest transaction
+		 * containing inode's data.
+		 *
+		 * Note that directories do not have this problem because they
+		 * don't use page cache.
+		 */
+		if (ext4_should_journal_data(inode) &&
+		    (S_ISLNK(inode->i_mode) || S_ISREG(inode->i_mode))) {
+			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
+
+			jbd2_log_start_commit(journal, commit_tid);
+			jbd2_log_wait_commit(journal, commit_tid);
+			filemap_write_and_wait(&inode->i_data);
+		}
 		truncate_inode_pages(&inode->i_data, 0);
 		goto no_delete;
 	}
@@ -1863,6 +1890,7 @@ static int ext4_journalled_write_end(struct file *file,
 	if (new_i_size > inode->i_size)
 		i_size_write(inode, pos+copied);
 	ext4_set_inode_state(inode, EXT4_STATE_JDATA);
+	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
 	if (new_i_size > EXT4_I(inode)->i_disksize) {
 		ext4_update_i_disksize(inode, new_i_size);
 		ret2 = ext4_mark_inode_dirty(handle, inode);
@@ -2571,6 +2599,7 @@ static int __ext4_journalled_writepage(struct page *page,
 				write_end_fn);
 	if (ret == 0)
 		ret = err;
+	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
 	err = ext4_journal_stop(handle);
 	if (!ret)
 		ret = err;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] ext4: Fix data corruption in inodes with journalled data
  2011-07-23  0:39 [PATCH] ext4: Fix data corruption in inodes with journalled data Jan Kara
@ 2011-07-23 13:21 ` Amir Goldstein
  2011-07-25 14:26   ` Jan Kara
  2011-07-27  1:27 ` Ted Ts'o
  1 sibling, 1 reply; 6+ messages in thread
From: Amir Goldstein @ 2011-07-23 13:21 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Sat, Jul 23, 2011 at 3:39 AM, Jan Kara <jack@suse.cz> wrote:
> When journalling data for an inode (either because it is a symlink or
> because the filesystem is mounted in data=journal mode), ext4_evict_inode()
> can discard unwritten data by calling truncate_inode_pages(). This is
> because we don't mark the buffer / page dirty when journalling data but only
> add the buffer to the running transaction and thus mm does not know there
> are still unwritten data.
>
> Fix the problem by carefully tracking transaction containing inode's data,
> committing this transaction, and writing uncheckpointed buffers when inode
> should be reaped.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/inode.c |   29 +++++++++++++++++++++++++++++
>  1 files changed, 29 insertions(+), 0 deletions(-)
>
>  This is ext4 version of an ext3 fix I sent a while ago. It received only
> light testing but I figured you might want get the patch earlier rather than
> later given the merge window is open.
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e3126c0..019995b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -190,6 +190,33 @@ void ext4_evict_inode(struct inode *inode)
>
>        trace_ext4_evict_inode(inode);
>        if (inode->i_nlink) {
> +               /*
> +                * When journalling data dirty buffers are tracked only in the
> +                * journal. So although mm thinks everything is clean and
> +                * ready for reaping the inode might still have some pages to
> +                * write in the running transaction or waiting to be
> +                * checkpointed. Thus calling jbd2_journal_invalidatepage()
> +                * (via truncate_inode_pages()) to discard these buffers can
> +                * cause data loss. Also even if we did not discard these
> +                * buffers, we would have no way to find them after the inode
> +                * is reaped and thus user could see stale data if he tries to
> +                * read them before the transaction is checkpointed. So be
> +                * careful and force everything to disk here... We use
> +                * ei->i_datasync_tid to store the newest transaction
> +                * containing inode's data.
> +                *
> +                * Note that directories do not have this problem because they
> +                * don't use page cache.
> +                */
> +               if (ext4_should_journal_data(inode) &&
> +                   (S_ISLNK(inode->i_mode) || S_ISREG(inode->i_mode))) {
> +                       journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> +                       tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
> +
> +                       jbd2_log_start_commit(journal, commit_tid);
> +                       jbd2_log_wait_commit(journal, commit_tid);
> +                       filemap_write_and_wait(&inode->i_data);
> +               }
>                truncate_inode_pages(&inode->i_data, 0);
>                goto no_delete;
>        }
> @@ -1863,6 +1890,7 @@ static int ext4_journalled_write_end(struct file *file,
>        if (new_i_size > inode->i_size)
>                i_size_write(inode, pos+copied);
>        ext4_set_inode_state(inode, EXT4_STATE_JDATA);
> +       EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
>        if (new_i_size > EXT4_I(inode)->i_disksize) {
>                ext4_update_i_disksize(inode, new_i_size);
>                ret2 = ext4_mark_inode_dirty(handle, inode);
> @@ -2571,6 +2599,7 @@ static int __ext4_journalled_writepage(struct page *page,
>                                write_end_fn);
>        if (ret == 0)
>                ret = err;
> +       EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
>        err = ext4_journal_stop(handle);
>        if (!ret)
>                ret = err;
> --
> 1.7.1
>

Hi Jan,

Patch looks correct to me, but I am uncomfortable with i_datasync_tid
being treated differently
in journalled write - that is, being set on different places in the write paths.

How about setting i_datasync_tid in a more generic place like
ext4_{,da_}write_begin()?
I know it's a bit redundant to setting dirty pages, but at least this
way i_datasync_tid can be
checked in all journal modes and have a consistent meaning.

Perhaps we can even use i_datasync_tid to optimize away things like
fiemap checks for dirty pages.
Just a though.

Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] ext4: Fix data corruption in inodes with journalled data
  2011-07-23 13:21 ` Amir Goldstein
@ 2011-07-25 14:26   ` Jan Kara
  2011-07-25 14:58     ` Amir Goldstein
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2011-07-25 14:26 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Ted Tso, linux-ext4

  Hello Amir,

On Sat 23-07-11 16:21:55, Amir Goldstein wrote:
> On Sat, Jul 23, 2011 at 3:39 AM, Jan Kara <jack@suse.cz> wrote:
> > When journalling data for an inode (either because it is a symlink or
> > because the filesystem is mounted in data=journal mode),
> > ext4_evict_inode() can discard unwritten data by calling
> > truncate_inode_pages(). This is because we don't mark the buffer / page
> > dirty when journalling data but only add the buffer to the running
> > transaction and thus mm does not know there are still unwritten data.
> >
> > Fix the problem by carefully tracking transaction containing inode's
> > data, committing this transaction, and writing uncheckpointed buffers
> > when inode should be reaped.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz> ---  fs/ext4/inode.c |   29
> > +++++++++++++++++++++++++++++  1 files changed, 29 insertions(+), 0
> > deletions(-)
> >
> >  This is ext4 version of an ext3 fix I sent a while ago. It received
> > only light testing but I figured you might want get the patch earlier
> > rather than later given the merge window is open.
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index e3126c0..019995b
> > 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -190,6 +190,33 @@
> > void ext4_evict_inode(struct inode *inode)
> >
> >        trace_ext4_evict_inode(inode);        if (inode->i_nlink) { +  
> >             /* +                * When journalling data dirty buffers
> > are tracked only in the +                * journal. So although mm
> > thinks everything is clean and +                * ready for reaping the
> > inode might still have some pages to +                * write in the
> > running transaction or waiting to be +                * checkpointed.
> > Thus calling jbd2_journal_invalidatepage() +                * (via
> > truncate_inode_pages()) to discard these buffers can +                *
> > cause data loss. Also even if we did not discard these +              
> >  * buffers, we would have no way to find them after the inode +        
> >        * is reaped and thus user could see stale data if he tries to +
> >                * read them before the transaction is checkpointed. So
> > be +                * careful and force everything to disk here... We
> > use +                * ei->i_datasync_tid to store the newest
> > transaction +                * containing inode's data.  +            
> >    * +                * Note that directories do not have this problem
> > because they +                * don't use page cache.  +              
> >  */ +               if (ext4_should_journal_data(inode) && +          
> >         (S_ISLNK(inode->i_mode) || S_ISREG(inode->i_mode))) { +        
> >               journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; +  
> >                     tid_t commit_tid = EXT4_I(inode)->i_datasync_tid; +
> > +                       jbd2_log_start_commit(journal, commit_tid); +  
> >                     jbd2_log_wait_commit(journal, commit_tid); +      
> >                 filemap_write_and_wait(&inode->i_data); +              
> > }                truncate_inode_pages(&inode->i_data, 0);              
> >  goto no_delete;        } @@ -1863,6 +1890,7 @@ static int
> > ext4_journalled_write_end(struct file *file,        if (new_i_size >
> > inode->i_size)                i_size_write(inode, pos+copied);      
> >  ext4_set_inode_state(inode, EXT4_STATE_JDATA); +      
> > EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;        if
> > (new_i_size > EXT4_I(inode)->i_disksize) {              
> >  ext4_update_i_disksize(inode, new_i_size);                ret2 =
> > ext4_mark_inode_dirty(handle, inode); @@ -2571,6 +2599,7 @@ static int
> > __ext4_journalled_writepage(struct page *page,                        
> >        write_end_fn);        if (ret == 0)                ret = err; +
> >       EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;    
> >    err = ext4_journal_stop(handle);        if (!ret)                ret
> > = err; -- 1.7.1
> >
> Patch looks correct to me, but I am uncomfortable with i_datasync_tid
> being treated differently in journalled write - that is, being set on
> different places in the write paths.
> 
> How about setting i_datasync_tid in a more generic place like
> ext4_{,da_}write_begin()?  I know it's a bit redundant to setting dirty
> pages, but at least this way i_datasync_tid can be checked in all journal
> modes and have a consistent meaning.
  Well, I kept the meaning that i_datasync_tid is ID of a transaction that
must be committed for a data of an inode to be safely on disk. It is true
that in data=journal mode, we need to update this number differently than
in other journaling modes but that's not important I think. Currently, we
just force commit in data=journal mode in every case and thus we do not
really care about the value of i_datasync_tid for fsync. In future we could
be more clever and avoid transaction commits for fsync in data=journal mode
in some cases.  So in fact I'd say the code is now *more* consistent than
it used to be.  The only thing that isn't quite consistent is that I didn't
bother with updating i_sync_tid because we currently do not use it. If
people want, that might be a useful cleanup which I can do.

> Perhaps we can even use i_datasync_tid to optimize away things like
> fiemap checks for dirty pages.
  Umm, I'm not sure which checks do you mean...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] ext4: Fix data corruption in inodes with journalled data
  2011-07-25 14:26   ` Jan Kara
@ 2011-07-25 14:58     ` Amir Goldstein
  2011-07-25 15:47       ` Jan Kara
  0 siblings, 1 reply; 6+ messages in thread
From: Amir Goldstein @ 2011-07-25 14:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Mon, Jul 25, 2011 at 5:26 PM, Jan Kara <jack@suse.cz> wrote:
>  Hello Amir,
>
> On Sat 23-07-11 16:21:55, Amir Goldstein wrote:
>> On Sat, Jul 23, 2011 at 3:39 AM, Jan Kara <jack@suse.cz> wrote:
>> > When journalling data for an inode (either because it is a symlink or
>> > because the filesystem is mounted in data=journal mode),
>> > ext4_evict_inode() can discard unwritten data by calling
>> > truncate_inode_pages(). This is because we don't mark the buffer / page
>> > dirty when journalling data but only add the buffer to the running
>> > transaction and thus mm does not know there are still unwritten data.
>> >
>> > Fix the problem by carefully tracking transaction containing inode's
>> > data, committing this transaction, and writing uncheckpointed buffers
>> > when inode should be reaped.
>> >
>> > Signed-off-by: Jan Kara <jack@suse.cz> ---  fs/ext4/inode.c |   29
>> > +++++++++++++++++++++++++++++  1 files changed, 29 insertions(+), 0
>> > deletions(-)
>> >
>> >  This is ext4 version of an ext3 fix I sent a while ago. It received
>> > only light testing but I figured you might want get the patch earlier
>> > rather than later given the merge window is open.
>> >
>> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index e3126c0..019995b
>> > 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -190,6 +190,33 @@
>> > void ext4_evict_inode(struct inode *inode)
>> >
>> >        trace_ext4_evict_inode(inode);        if (inode->i_nlink) { +
>> >             /* +                * When journalling data dirty buffers
>> > are tracked only in the +                * journal. So although mm
>> > thinks everything is clean and +                * ready for reaping the
>> > inode might still have some pages to +                * write in the
>> > running transaction or waiting to be +                * checkpointed.
>> > Thus calling jbd2_journal_invalidatepage() +                * (via
>> > truncate_inode_pages()) to discard these buffers can +                *
>> > cause data loss. Also even if we did not discard these +
>> >  * buffers, we would have no way to find them after the inode +
>> >        * is reaped and thus user could see stale data if he tries to +
>> >                * read them before the transaction is checkpointed. So
>> > be +                * careful and force everything to disk here... We
>> > use +                * ei->i_datasync_tid to store the newest
>> > transaction +                * containing inode's data.  +
>> >    * +                * Note that directories do not have this problem
>> > because they +                * don't use page cache.  +
>> >  */ +               if (ext4_should_journal_data(inode) && +
>> >         (S_ISLNK(inode->i_mode) || S_ISREG(inode->i_mode))) { +
>> >               journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; +
>> >                     tid_t commit_tid = EXT4_I(inode)->i_datasync_tid; +
>> > +                       jbd2_log_start_commit(journal, commit_tid); +
>> >                     jbd2_log_wait_commit(journal, commit_tid); +
>> >                 filemap_write_and_wait(&inode->i_data); +
>> > }                truncate_inode_pages(&inode->i_data, 0);
>> >  goto no_delete;        } @@ -1863,6 +1890,7 @@ static int
>> > ext4_journalled_write_end(struct file *file,        if (new_i_size >
>> > inode->i_size)                i_size_write(inode, pos+copied);
>> >  ext4_set_inode_state(inode, EXT4_STATE_JDATA); +
>> > EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;        if
>> > (new_i_size > EXT4_I(inode)->i_disksize) {
>> >  ext4_update_i_disksize(inode, new_i_size);                ret2 =
>> > ext4_mark_inode_dirty(handle, inode); @@ -2571,6 +2599,7 @@ static int
>> > __ext4_journalled_writepage(struct page *page,
>> >        write_end_fn);        if (ret == 0)                ret = err; +
>> >       EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
>> >    err = ext4_journal_stop(handle);        if (!ret)                ret
>> > = err; -- 1.7.1
>> >
>> Patch looks correct to me, but I am uncomfortable with i_datasync_tid
>> being treated differently in journalled write - that is, being set on
>> different places in the write paths.
>>
>> How about setting i_datasync_tid in a more generic place like
>> ext4_{,da_}write_begin()?  I know it's a bit redundant to setting dirty
>> pages, but at least this way i_datasync_tid can be checked in all journal
>> modes and have a consistent meaning.
>  Well, I kept the meaning that i_datasync_tid is ID of a transaction that
> must be committed for a data of an inode to be safely on disk. It is true
> that in data=journal mode, we need to update this number differently than
> in other journaling modes but that's not important I think. Currently, we
> just force commit in data=journal mode in every case and thus we do not
> really care about the value of i_datasync_tid for fsync. In future we could
> be more clever and avoid transaction commits for fsync in data=journal mode
> in some cases.  So in fact I'd say the code is now *more* consistent than
> it used to be.  The only thing that isn't quite consistent is that I didn't
> bother with updating i_sync_tid because we currently do not use it. If
> people want, that might be a useful cleanup which I can do.
>
>> Perhaps we can even use i_datasync_tid to optimize away things like
>> fiemap checks for dirty pages.
>  Umm, I'm not sure which checks do you mean...

I thought that ext4_ext_fiemap_cb() looks for dirty pages to display as delayed
allocation extents and that this lookup can be avoided if we know that the inode
data is not dirty, but I could have been wrong.

Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] ext4: Fix data corruption in inodes with journalled data
  2011-07-25 14:58     ` Amir Goldstein
@ 2011-07-25 15:47       ` Jan Kara
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2011-07-25 15:47 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Ted Tso, linux-ext4

On Mon 25-07-11 17:58:03, Amir Goldstein wrote:
> On Mon, Jul 25, 2011 at 5:26 PM, Jan Kara <jack@suse.cz> wrote:
> >  Hello Amir,
> >
> > On Sat 23-07-11 16:21:55, Amir Goldstein wrote:
> >> On Sat, Jul 23, 2011 at 3:39 AM, Jan Kara <jack@suse.cz> wrote:
> >> > When journalling data for an inode (either because it is a symlink or
> >> > because the filesystem is mounted in data=journal mode),
> >> > ext4_evict_inode() can discard unwritten data by calling
> >> > truncate_inode_pages(). This is because we don't mark the buffer / page
> >> > dirty when journalling data but only add the buffer to the running
> >> > transaction and thus mm does not know there are still unwritten data.
> >> >
> >> > Fix the problem by carefully tracking transaction containing inode's
> >> > data, committing this transaction, and writing uncheckpointed buffers
> >> > when inode should be reaped.
> >> >
> >> > Signed-off-by: Jan Kara <jack@suse.cz> ---  fs/ext4/inode.c |   29
> >> > +++++++++++++++++++++++++++++  1 files changed, 29 insertions(+), 0
> >> > deletions(-)
> >> >
> >> >  This is ext4 version of an ext3 fix I sent a while ago. It received
> >> > only light testing but I figured you might want get the patch earlier
> >> > rather than later given the merge window is open.
> >> >
> >> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index e3126c0..019995b
> >> > 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -190,6 +190,33 @@
> >> > void ext4_evict_inode(struct inode *inode)
> >> >
> >> >        trace_ext4_evict_inode(inode);        if (inode->i_nlink) { +
> >> >             /* +                * When journalling data dirty buffers
> >> > are tracked only in the +                * journal. So although mm
> >> > thinks everything is clean and +                * ready for reaping the
> >> > inode might still have some pages to +                * write in the
> >> > running transaction or waiting to be +                * checkpointed.
> >> > Thus calling jbd2_journal_invalidatepage() +                * (via
> >> > truncate_inode_pages()) to discard these buffers can +                *
> >> > cause data loss. Also even if we did not discard these +
> >> >  * buffers, we would have no way to find them after the inode +
> >> >        * is reaped and thus user could see stale data if he tries to +
> >> >                * read them before the transaction is checkpointed. So
> >> > be +                * careful and force everything to disk here... We
> >> > use +                * ei->i_datasync_tid to store the newest
> >> > transaction +                * containing inode's data.  +
> >> >    * +                * Note that directories do not have this problem
> >> > because they +                * don't use page cache.  +
> >> >  */ +               if (ext4_should_journal_data(inode) && +
> >> >         (S_ISLNK(inode->i_mode) || S_ISREG(inode->i_mode))) { +
> >> >               journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; +
> >> >                     tid_t commit_tid = EXT4_I(inode)->i_datasync_tid; +
> >> > +                       jbd2_log_start_commit(journal, commit_tid); +
> >> >                     jbd2_log_wait_commit(journal, commit_tid); +
> >> >                 filemap_write_and_wait(&inode->i_data); +
> >> > }                truncate_inode_pages(&inode->i_data, 0);
> >> >  goto no_delete;        } @@ -1863,6 +1890,7 @@ static int
> >> > ext4_journalled_write_end(struct file *file,        if (new_i_size >
> >> > inode->i_size)                i_size_write(inode, pos+copied);
> >> >  ext4_set_inode_state(inode, EXT4_STATE_JDATA); +
> >> > EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;        if
> >> > (new_i_size > EXT4_I(inode)->i_disksize) {
> >> >  ext4_update_i_disksize(inode, new_i_size);                ret2 =
> >> > ext4_mark_inode_dirty(handle, inode); @@ -2571,6 +2599,7 @@ static int
> >> > __ext4_journalled_writepage(struct page *page,
> >> >        write_end_fn);        if (ret == 0)                ret = err; +
> >> >       EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
> >> >    err = ext4_journal_stop(handle);        if (!ret)                ret
> >> > = err; -- 1.7.1
> >> >
> >> Patch looks correct to me, but I am uncomfortable with i_datasync_tid
> >> being treated differently in journalled write - that is, being set on
> >> different places in the write paths.
> >>
> >> How about setting i_datasync_tid in a more generic place like
> >> ext4_{,da_}write_begin()?  I know it's a bit redundant to setting dirty
> >> pages, but at least this way i_datasync_tid can be checked in all journal
> >> modes and have a consistent meaning.
> >  Well, I kept the meaning that i_datasync_tid is ID of a transaction that
> > must be committed for a data of an inode to be safely on disk. It is true
> > that in data=journal mode, we need to update this number differently than
> > in other journaling modes but that's not important I think. Currently, we
> > just force commit in data=journal mode in every case and thus we do not
> > really care about the value of i_datasync_tid for fsync. In future we could
> > be more clever and avoid transaction commits for fsync in data=journal mode
> > in some cases.  So in fact I'd say the code is now *more* consistent than
> > it used to be.  The only thing that isn't quite consistent is that I didn't
> > bother with updating i_sync_tid because we currently do not use it. If
> > people want, that might be a useful cleanup which I can do.
> >
> >> Perhaps we can even use i_datasync_tid to optimize away things like
> >> fiemap checks for dirty pages.
> >  Umm, I'm not sure which checks do you mean...
> 
> I thought that ext4_ext_fiemap_cb() looks for dirty pages to display as delayed
> allocation extents and that this lookup can be avoided if we know that the inode
> data is not dirty, but I could have been wrong.
  No, that won't really work. We don't want to update i_datasync_tid for
delayed allocation write. We don't really have a valid tid to store in that
call path.

BTW, you made me look at the checks in ext4_ext_fiemap_cb() and that code
is just broken. It can oops in a number of ways when it races with page
reclaim. I'll write to Yongqiang who seems to an author of this...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] ext4: Fix data corruption in inodes with journalled data
  2011-07-23  0:39 [PATCH] ext4: Fix data corruption in inodes with journalled data Jan Kara
  2011-07-23 13:21 ` Amir Goldstein
@ 2011-07-27  1:27 ` Ted Ts'o
  1 sibling, 0 replies; 6+ messages in thread
From: Ted Ts'o @ 2011-07-27  1:27 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Sat, Jul 23, 2011 at 02:39:37AM +0200, Jan Kara wrote:
> When journalling data for an inode (either because it is a symlink or
> because the filesystem is mounted in data=journal mode), ext4_evict_inode()
> can discard unwritten data by calling truncate_inode_pages(). This is
> because we don't mark the buffer / page dirty when journalling data but only
> add the buffer to the running transaction and thus mm does not know there
> are still unwritten data.
> 
> Fix the problem by carefully tracking transaction containing inode's data,
> committing this transaction, and writing uncheckpointed buffers when inode
> should be reaped.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied to the ext4 tree.

					- Ted

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2011-07-27  1:27 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-23  0:39 [PATCH] ext4: Fix data corruption in inodes with journalled data Jan Kara
2011-07-23 13:21 ` Amir Goldstein
2011-07-25 14:26   ` Jan Kara
2011-07-25 14:58     ` Amir Goldstein
2011-07-25 15:47       ` Jan Kara
2011-07-27  1:27 ` Ted Ts'o

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.