All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] ext4: Remove journal_checksum mount option and enable it by default
@ 2009-09-05 22:32 Theodore Ts'o
  2009-09-05 22:32 ` [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems Theodore Ts'o
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2009-09-05 22:32 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Theodore Ts'o

There's no real cost for the journal checksum feature, and we should
make sure it is enabled all the time.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 Documentation/filesystems/ext4.txt |    8 +-------
 fs/ext4/ext4.h                     |    1 -
 fs/ext4/super.c                    |   20 ++++++--------------
 3 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 7be02ac..3e329db 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -134,15 +134,9 @@ ro                   	Mount filesystem read only. Note that ext4 will
                      	mount options "ro,noload" can be used to prevent
 		     	writes to the filesystem.
 
-journal_checksum	Enable checksumming of the journal transactions.
-			This will allow the recovery code in e2fsck and the
-			kernel to detect corruption in the kernel.  It is a
-			compatible change and will be ignored by older kernels.
-
 journal_async_commit	Commit block can be written to disk without waiting
 			for descriptor blocks. If enabled older kernels cannot
-			mount the device. This will enable 'journal_checksum'
-			internally.
+			mount the device.
 
 journal=update		Update the ext4 file system's journal to the current
 			format.
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 815a804..c187b08 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -711,7 +711,6 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_QUOTA		0x80000 /* Some quota option set */
 #define EXT4_MOUNT_USRQUOTA		0x100000 /* "old" user quota */
 #define EXT4_MOUNT_GRPQUOTA		0x200000 /* "old" group quota */
-#define EXT4_MOUNT_JOURNAL_CHECKSUM	0x800000 /* Journal checksums */
 #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
 #define EXT4_MOUNT_I_VERSION            0x2000000 /* i_version support */
 #define EXT4_MOUNT_DELALLOC		0x8000000 /* Delalloc support */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4037fe0..f1815d3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1280,11 +1280,9 @@ static int parse_options(char *options, struct super_block *sb,
 			*journal_devnum = option;
 			break;
 		case Opt_journal_checksum:
-			set_opt(sbi->s_mount_opt, JOURNAL_CHECKSUM);
-			break;
+			break;	/* Kept for backwards compatibility */
 		case Opt_journal_async_commit:
 			set_opt(sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
-			set_opt(sbi->s_mount_opt, JOURNAL_CHECKSUM);
 			break;
 		case Opt_noload:
 			set_opt(sbi->s_mount_opt, NOLOAD);
@@ -2751,20 +2749,14 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		goto failed_mount4;
 	}
 
-	if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
-		jbd2_journal_set_features(sbi->s_journal,
-				JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+	jbd2_journal_set_features(sbi->s_journal,
+				  JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
+	if (test_opt(sb, JOURNAL_ASYNC_COMMIT))
+		jbd2_journal_set_features(sbi->s_journal, 0, 0,
 				JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
-	} else if (test_opt(sb, JOURNAL_CHECKSUM)) {
-		jbd2_journal_set_features(sbi->s_journal,
-				JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
+	else
 		jbd2_journal_clear_features(sbi->s_journal, 0, 0,
 				JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
-	} else {
-		jbd2_journal_clear_features(sbi->s_journal,
-				JBD2_FEATURE_COMPAT_CHECKSUM, 0,
-				JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
-	}
 
 	/* We have now updated the journal if required, so we can
 	 * validate the data journaling mode. */
-- 
1.6.3.2.1.gb9f7d.dirty


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems
  2009-09-05 22:32 [PATCH 1/2] ext4: Remove journal_checksum mount option and enable it by default Theodore Ts'o
@ 2009-09-05 22:32 ` Theodore Ts'o
  2009-09-05 22:57   ` Andreas Dilger
                     ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Theodore Ts'o @ 2009-09-05 22:32 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Theodore Ts'o

Now that we have cleaned up journal_async_commit, it's safe to enable
it all the time.  But we only want to do so if ext4-specific INCOMPAT
features are enabled, since otherwise we will prevent the filesystem
from being mounted using ext3.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 Documentation/filesystems/ext4.txt |    8 ++++++--
 fs/ext4/super.c                    |   26 ++++++++++++++++++++++++--
 2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 3e329db..4f8d73a 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -135,8 +135,12 @@ ro                   	Mount filesystem read only. Note that ext4 will
 		     	writes to the filesystem.
 
 journal_async_commit	Commit block can be written to disk without waiting
-			for descriptor blocks. If enabled older kernels cannot
-			mount the device.
+			for descriptor blocks.  This mount option will be 
+			automatically enabled if ext4-specific INCOMPAT
+			features are present in the file system.
+
+nojournal_async_commit	Disable the journal_async_commit option, even
+			for ext4 filesystems.
 
 journal=update		Update the ext4 file system's journal to the current
 			format.
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f1815d3..f644a5c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -75,6 +75,15 @@ static int ext4_unfreeze(struct super_block *sb);
 static void ext4_write_super(struct super_block *sb);
 static int ext4_freeze(struct super_block *sb);
 
+/*
+ * If ext4 filesystem features are enabled, then enable async_commits
+ * by default.
+ */
+#define ASYNC_COMMIT_DEFAULT(sb) (EXT4_HAS_INCOMPAT_FEATURE(sb, \
+					(EXT4_FEATURE_INCOMPAT_EXTENTS| \
+					 EXT4_FEATURE_INCOMPAT_64BIT| \
+					 EXT4_FEATURE_INCOMPAT_FLEX_BG)))
+
 
 ext4_fsblk_t ext4_block_bitmap(struct super_block *sb,
 			       struct ext4_group_desc *bg)
@@ -845,8 +854,13 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	 */
 	seq_puts(seq, ",barrier=");
 	seq_puts(seq, test_opt(sb, BARRIER) ? "1" : "0");
-	if (test_opt(sb, JOURNAL_ASYNC_COMMIT))
-		seq_puts(seq, ",journal_async_commit");
+	if (ASYNC_COMMIT_DEFAULT(sb)) {
+		if (!test_opt(sb, JOURNAL_ASYNC_COMMIT))
+			seq_puts(seq, ",nojournal_async_commit");
+	} else {
+		if (test_opt(sb, JOURNAL_ASYNC_COMMIT))
+			seq_puts(seq, ",journal_async_commit");
+	}
 	if (test_opt(sb, NOBH))
 		seq_puts(seq, ",nobh");
 	if (test_opt(sb, I_VERSION))
@@ -1050,6 +1064,7 @@ enum {
 	Opt_commit, Opt_min_batch_time, Opt_max_batch_time,
 	Opt_journal_update, Opt_journal_dev,
 	Opt_journal_checksum, Opt_journal_async_commit,
+	Opt_nojournal_async_commit,
 	Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
 	Opt_data_err_abort, Opt_data_err_ignore, Opt_mb_history_length,
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
@@ -1092,6 +1107,7 @@ static const match_table_t tokens = {
 	{Opt_journal_dev, "journal_dev=%u"},
 	{Opt_journal_checksum, "journal_checksum"},
 	{Opt_journal_async_commit, "journal_async_commit"},
+	{Opt_nojournal_async_commit, "nojournal_async_commit"},
 	{Opt_abort, "abort"},
 	{Opt_data_journal, "data=journal"},
 	{Opt_data_ordered, "data=ordered"},
@@ -1284,6 +1300,9 @@ static int parse_options(char *options, struct super_block *sb,
 		case Opt_journal_async_commit:
 			set_opt(sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
 			break;
+		case Opt_nojournal_async_commit:
+			clear_opt(sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
+			break;
 		case Opt_noload:
 			set_opt(sbi->s_mount_opt, NOLOAD);
 			break;
@@ -2422,6 +2441,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	 */
 	set_opt(sbi->s_mount_opt, DELALLOC);
 
+	if (ASYNC_COMMIT_DEFAULT(sb))
+		set_opt(sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
+
 	if (!parse_options((char *) data, sb, &journal_devnum,
 			   &journal_ioprio, NULL, 0))
 		goto failed_mount;
-- 
1.6.3.2.1.gb9f7d.dirty


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems
  2009-09-05 22:32 ` [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems Theodore Ts'o
@ 2009-09-05 22:57   ` Andreas Dilger
  2009-09-06  1:32     ` Theodore Tso
  2009-09-06  2:57   ` Eric Sandeen
  2009-09-07 23:42   ` Ric Wheeler
  2 siblings, 1 reply; 13+ messages in thread
From: Andreas Dilger @ 2009-09-05 22:57 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List

On Sep 05, 2009  18:32 -0400, Theodore Ts'o wrote:
> Now that we have cleaned up journal_async_commit, it's safe to enable
> it all the time.  But we only want to do so if ext4-specific INCOMPAT
> features are enabled, since otherwise we will prevent the filesystem
> from being mounted using ext3.

So, the big question is what to do if not-the-last transaction in the
journal has a bad block in it?  This is fairly unlikely, and IMHO the
harm of aborting journal replay too early is likely far outweighed by
the benefit of not "recovering" garbage directly over the filesystem
metadata.

I had thought that you had rejected the e2fsck side of this patch for
that reason, but maybe my memory is faulty...  We still have some
test images for bad journal checksums that you can have if you want.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems
  2009-09-05 22:57   ` Andreas Dilger
@ 2009-09-06  1:32     ` Theodore Tso
  0 siblings, 0 replies; 13+ messages in thread
From: Theodore Tso @ 2009-09-06  1:32 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Ext4 Developers List

On Sun, Sep 06, 2009 at 12:57:47AM +0200, Andreas Dilger wrote:
> On Sep 05, 2009  18:32 -0400, Theodore Ts'o wrote:
> > Now that we have cleaned up journal_async_commit, it's safe to enable
> > it all the time.  But we only want to do so if ext4-specific INCOMPAT
> > features are enabled, since otherwise we will prevent the filesystem
> > from being mounted using ext3.
> 
> So, the big question is what to do if not-the-last transaction in the
> journal has a bad block in it?  This is fairly unlikely, and IMHO the
> harm of aborting journal replay too early is likely far outweighed by
> the benefit of not "recovering" garbage directly over the filesystem
> metadata.
> 
> I had thought that you had rejected the e2fsck side of this patch for
> that reason, but maybe my memory is faulty...  We still have some
> test images for bad journal checksums that you can have if you want.

No, it's in e2fsck.  Right now, if we have a checksum failure, we
abort the journal replay dead in its tracks.  Whether or not that's
the right thing is actually highly questionable.  Yes, there's the
chance that we can recover garbage directly over the file system
metadata.  But the flip side is that if we abort the journal replay
too early, we can end up leaving the filesystem horribly corrupted.
In addition, if the it's a block which has been journalled multiple
time (which will is highly likely for block allocation blocks or inode
allocation blocks), an error in the middle of the journal is not a
disaster.

The one thing I have to check is to make sure that e2fsck forces a
filesystem check if it aborts a journal replay due to a checksum
error.  I'm pretty sure I did add that, but I need to make sure it's
there.

The other thing we might want to do is to add some code in ext4 is to
call jbd2_cleanup_journal_tail() a bit more aggressively.  If all of
the blocks in the transaction has been pushed out, then updating the
journal superblock frequently will reduce the number of transactions
that need to be replayed.  Right now, we often replay more transaction
that we strictly need to, out of a desire to reduce the need to update
the journal superblock.  But we are replaying transactions 23..30, but
we really only need to replay transactions 28 29 and 30 in order to
bring the filesystem into consistency, and we have a checksum failure
while reading some of the data blocks found in transaction 25, we'll
end up never replaying transactions 28--30, and we may end up losing
data, especially if we already started writing some (but not all) of
the blocks involved with transactions 28 and 29 to their final
location on disk.

					- Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems
  2009-09-05 22:32 ` [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems Theodore Ts'o
  2009-09-05 22:57   ` Andreas Dilger
@ 2009-09-06  2:57   ` Eric Sandeen
  2009-09-07 23:48     ` Ric Wheeler
  2009-09-07 23:42   ` Ric Wheeler
  2 siblings, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2009-09-06  2:57 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List

Theodore Ts'o wrote:
> Now that we have cleaned up journal_async_commit, it's safe to enable
> it all the time.  But we only want to do so if ext4-specific INCOMPAT
> features are enabled, since otherwise we will prevent the filesystem
> from being mounted using ext3.

Has anyone done some serious power-fail/crash (real or simulated) 
testing with this?  If not, I think we should before it's default...

-Eric

> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  Documentation/filesystems/ext4.txt |    8 ++++++--
>  fs/ext4/super.c                    |   26 ++++++++++++++++++++++++--
>  2 files changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
> index 3e329db..4f8d73a 100644
> --- a/Documentation/filesystems/ext4.txt
> +++ b/Documentation/filesystems/ext4.txt
> @@ -135,8 +135,12 @@ ro                   	Mount filesystem read only. Note that ext4 will
>  		     	writes to the filesystem.
>  
>  journal_async_commit	Commit block can be written to disk without waiting
> -			for descriptor blocks. If enabled older kernels cannot
> -			mount the device.
> +			for descriptor blocks.  This mount option will be 
> +			automatically enabled if ext4-specific INCOMPAT
> +			features are present in the file system.
> +
> +nojournal_async_commit	Disable the journal_async_commit option, even
> +			for ext4 filesystems.
>  
>  journal=update		Update the ext4 file system's journal to the current
>  			format.
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index f1815d3..f644a5c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -75,6 +75,15 @@ static int ext4_unfreeze(struct super_block *sb);
>  static void ext4_write_super(struct super_block *sb);
>  static int ext4_freeze(struct super_block *sb);
>  
> +/*
> + * If ext4 filesystem features are enabled, then enable async_commits
> + * by default.
> + */
> +#define ASYNC_COMMIT_DEFAULT(sb) (EXT4_HAS_INCOMPAT_FEATURE(sb, \
> +					(EXT4_FEATURE_INCOMPAT_EXTENTS| \
> +					 EXT4_FEATURE_INCOMPAT_64BIT| \
> +					 EXT4_FEATURE_INCOMPAT_FLEX_BG)))
> +
>  
>  ext4_fsblk_t ext4_block_bitmap(struct super_block *sb,
>  			       struct ext4_group_desc *bg)
> @@ -845,8 +854,13 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
>  	 */
>  	seq_puts(seq, ",barrier=");
>  	seq_puts(seq, test_opt(sb, BARRIER) ? "1" : "0");
> -	if (test_opt(sb, JOURNAL_ASYNC_COMMIT))
> -		seq_puts(seq, ",journal_async_commit");
> +	if (ASYNC_COMMIT_DEFAULT(sb)) {
> +		if (!test_opt(sb, JOURNAL_ASYNC_COMMIT))
> +			seq_puts(seq, ",nojournal_async_commit");
> +	} else {
> +		if (test_opt(sb, JOURNAL_ASYNC_COMMIT))
> +			seq_puts(seq, ",journal_async_commit");
> +	}
>  	if (test_opt(sb, NOBH))
>  		seq_puts(seq, ",nobh");
>  	if (test_opt(sb, I_VERSION))
> @@ -1050,6 +1064,7 @@ enum {
>  	Opt_commit, Opt_min_batch_time, Opt_max_batch_time,
>  	Opt_journal_update, Opt_journal_dev,
>  	Opt_journal_checksum, Opt_journal_async_commit,
> +	Opt_nojournal_async_commit,
>  	Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
>  	Opt_data_err_abort, Opt_data_err_ignore, Opt_mb_history_length,
>  	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
> @@ -1092,6 +1107,7 @@ static const match_table_t tokens = {
>  	{Opt_journal_dev, "journal_dev=%u"},
>  	{Opt_journal_checksum, "journal_checksum"},
>  	{Opt_journal_async_commit, "journal_async_commit"},
> +	{Opt_nojournal_async_commit, "nojournal_async_commit"},
>  	{Opt_abort, "abort"},
>  	{Opt_data_journal, "data=journal"},
>  	{Opt_data_ordered, "data=ordered"},
> @@ -1284,6 +1300,9 @@ static int parse_options(char *options, struct super_block *sb,
>  		case Opt_journal_async_commit:
>  			set_opt(sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
>  			break;
> +		case Opt_nojournal_async_commit:
> +			clear_opt(sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
> +			break;
>  		case Opt_noload:
>  			set_opt(sbi->s_mount_opt, NOLOAD);
>  			break;
> @@ -2422,6 +2441,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  	 */
>  	set_opt(sbi->s_mount_opt, DELALLOC);
>  
> +	if (ASYNC_COMMIT_DEFAULT(sb))
> +		set_opt(sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
> +
>  	if (!parse_options((char *) data, sb, &journal_devnum,
>  			   &journal_ioprio, NULL, 0))
>  		goto failed_mount;


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems
  2009-09-05 22:32 ` [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems Theodore Ts'o
  2009-09-05 22:57   ` Andreas Dilger
  2009-09-06  2:57   ` Eric Sandeen
@ 2009-09-07 23:42   ` Ric Wheeler
  2009-09-08  4:45     ` Theodore Tso
  2 siblings, 1 reply; 13+ messages in thread
From: Ric Wheeler @ 2009-09-07 23:42 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List

Theodore Ts'o wrote:
> Now that we have cleaned up journal_async_commit, it's safe to enable
> it all the time.  But we only want to do so if ext4-specific INCOMPAT
> features are enabled, since otherwise we will prevent the filesystem
> from being mounted using ext3.
>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>   

Hi Ted,

I am not sure that we are really good with ASYNC commit being on all of 
the time - I really worry that we will see lots of issues.

Can we leave it off by default until we get some run time on it?

Ric


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems
  2009-09-06  2:57   ` Eric Sandeen
@ 2009-09-07 23:48     ` Ric Wheeler
  0 siblings, 0 replies; 13+ messages in thread
From: Ric Wheeler @ 2009-09-07 23:48 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Theodore Ts'o, Ext4 Developers List

Eric Sandeen wrote:
> Theodore Ts'o wrote:
>> Now that we have cleaned up journal_async_commit, it's safe to enable
>> it all the time.  But we only want to do so if ext4-specific INCOMPAT
>> features are enabled, since otherwise we will prevent the filesystem
>> from being mounted using ext3.
>
> Has anyone done some serious power-fail/crash (real or simulated) 
> testing with this?  If not, I think we should before it's default...
>
> -Eric

I think that we also need to see some very compelling performance 
numbers to justify the complexity. Journaling is complex enough, so if 
this is not a win, we should (in my opinion) go with the simplest 
technique :-)

ric

>
>> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>> ---
>>  Documentation/filesystems/ext4.txt |    8 ++++++--
>>  fs/ext4/super.c                    |   26 ++++++++++++++++++++++++--
>>  2 files changed, 30 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/filesystems/ext4.txt 
>> b/Documentation/filesystems/ext4.txt
>> index 3e329db..4f8d73a 100644
>> --- a/Documentation/filesystems/ext4.txt
>> +++ b/Documentation/filesystems/ext4.txt
>> @@ -135,8 +135,12 @@ ro                       Mount filesystem read 
>> only. Note that ext4 will
>>                   writes to the filesystem.
>>  
>>  journal_async_commit    Commit block can be written to disk without 
>> waiting
>> -            for descriptor blocks. If enabled older kernels cannot
>> -            mount the device.
>> +            for descriptor blocks.  This mount option will be 
>> +            automatically enabled if ext4-specific INCOMPAT
>> +            features are present in the file system.
>> +
>> +nojournal_async_commit    Disable the journal_async_commit option, even
>> +            for ext4 filesystems.
>>  
>>  journal=update        Update the ext4 file system's journal to the 
>> current
>>              format.
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index f1815d3..f644a5c 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -75,6 +75,15 @@ static int ext4_unfreeze(struct super_block *sb);
>>  static void ext4_write_super(struct super_block *sb);
>>  static int ext4_freeze(struct super_block *sb);
>>  
>> +/*
>> + * If ext4 filesystem features are enabled, then enable async_commits
>> + * by default.
>> + */
>> +#define ASYNC_COMMIT_DEFAULT(sb) (EXT4_HAS_INCOMPAT_FEATURE(sb, \
>> +                    (EXT4_FEATURE_INCOMPAT_EXTENTS| \
>> +                     EXT4_FEATURE_INCOMPAT_64BIT| \
>> +                     EXT4_FEATURE_INCOMPAT_FLEX_BG)))
>> +
>>  
>>  ext4_fsblk_t ext4_block_bitmap(struct super_block *sb,
>>                     struct ext4_group_desc *bg)
>> @@ -845,8 +854,13 @@ static int ext4_show_options(struct seq_file 
>> *seq, struct vfsmount *vfs)
>>       */
>>      seq_puts(seq, ",barrier=");
>>      seq_puts(seq, test_opt(sb, BARRIER) ? "1" : "0");
>> -    if (test_opt(sb, JOURNAL_ASYNC_COMMIT))
>> -        seq_puts(seq, ",journal_async_commit");
>> +    if (ASYNC_COMMIT_DEFAULT(sb)) {
>> +        if (!test_opt(sb, JOURNAL_ASYNC_COMMIT))
>> +            seq_puts(seq, ",nojournal_async_commit");
>> +    } else {
>> +        if (test_opt(sb, JOURNAL_ASYNC_COMMIT))
>> +            seq_puts(seq, ",journal_async_commit");
>> +    }
>>      if (test_opt(sb, NOBH))
>>          seq_puts(seq, ",nobh");
>>      if (test_opt(sb, I_VERSION))
>> @@ -1050,6 +1064,7 @@ enum {
>>      Opt_commit, Opt_min_batch_time, Opt_max_batch_time,
>>      Opt_journal_update, Opt_journal_dev,
>>      Opt_journal_checksum, Opt_journal_async_commit,
>> +    Opt_nojournal_async_commit,
>>      Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
>>      Opt_data_err_abort, Opt_data_err_ignore, Opt_mb_history_length,
>>      Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
>> @@ -1092,6 +1107,7 @@ static const match_table_t tokens = {
>>      {Opt_journal_dev, "journal_dev=%u"},
>>      {Opt_journal_checksum, "journal_checksum"},
>>      {Opt_journal_async_commit, "journal_async_commit"},
>> +    {Opt_nojournal_async_commit, "nojournal_async_commit"},
>>      {Opt_abort, "abort"},
>>      {Opt_data_journal, "data=journal"},
>>      {Opt_data_ordered, "data=ordered"},
>> @@ -1284,6 +1300,9 @@ static int parse_options(char *options, struct 
>> super_block *sb,
>>          case Opt_journal_async_commit:
>>              set_opt(sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
>>              break;
>> +        case Opt_nojournal_async_commit:
>> +            clear_opt(sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
>> +            break;
>>          case Opt_noload:
>>              set_opt(sbi->s_mount_opt, NOLOAD);
>>              break;
>> @@ -2422,6 +2441,9 @@ static int ext4_fill_super(struct super_block 
>> *sb, void *data, int silent)
>>       */
>>      set_opt(sbi->s_mount_opt, DELALLOC);
>>  
>> +    if (ASYNC_COMMIT_DEFAULT(sb))
>> +        set_opt(sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
>> +
>>      if (!parse_options((char *) data, sb, &journal_devnum,
>>                 &journal_ioprio, NULL, 0))
>>          goto failed_mount;
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems
  2009-09-07 23:42   ` Ric Wheeler
@ 2009-09-08  4:45     ` Theodore Tso
  2009-09-08 11:50       ` Ric Wheeler
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Tso @ 2009-09-08  4:45 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Ext4 Developers List

On Mon, Sep 07, 2009 at 07:42:58PM -0400, Ric Wheeler wrote:
>
> I am not sure that we are really good with ASYNC commit being on all of  
> the time - I really worry that we will see lots of issues.

There really isn't much difference between async commit and non-async
commit.  In fact, the name is really a bit of a misnomer at this
point.

So here's what we do on a non-async commit:

1)  Write the journal data, revoke, and descriptor blocks
2)  Wait for the block I/O layer to signal that all of these blocks 
    have been written out --- *without* a barrier
3)  Write the commit block with a barrier
4)  Wait for the I/O to commit block to be done

This is what we do with an async commit:

1)  Write the journal data, revoke, and descriptor blocks
2)  Write the commit block (with a checksum) with a barrier
3)  Wait for the I/O to in steps (1) and (2) to be done

That's the only difference at this point.  The fatal flaw with async
commit from before was this that we weren't writing the commit block
in step (2) with a barrier --- and that *was* disastrous, since it
meant the equivalent of mounting with barrier=0.

But now that it is fixed, this code path does make sense, and given
that we weren't inserting a barrier between steps 2 and 3, we were in
fact (theoretically) vulnerable to the commit block and the journal
blocks getting reordered in 2.6.30 and older kernels.  Turning on the
journal checksum (in the prior commit) helps solve that issue, but at
that point, we might as well write the commit block before we start
waiting on all of the journal blocks.

As far as the code complexity issue concern, it really wasn't that
complicated, and in fact we're not really changing the existing code
path that we've been using for over a year now by very much.  The only
difference in fact is where we call the function to write the commit
record.

						- Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems
  2009-09-08  4:45     ` Theodore Tso
@ 2009-09-08 11:50       ` Ric Wheeler
  2009-09-11  2:45         ` Theodore Tso
  0 siblings, 1 reply; 13+ messages in thread
From: Ric Wheeler @ 2009-09-08 11:50 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Ext4 Developers List

On 09/08/2009 12:45 AM, Theodore Tso wrote:
> On Mon, Sep 07, 2009 at 07:42:58PM -0400, Ric Wheeler wrote:
>    
>> I am not sure that we are really good with ASYNC commit being on all of
>> the time - I really worry that we will see lots of issues.
>>      
> There really isn't much difference between async commit and non-async
> commit.  In fact, the name is really a bit of a misnomer at this
> point.
>
> So here's what we do on a non-async commit:
>
> 1)  Write the journal data, revoke, and descriptor blocks
> 2)  Wait for the block I/O layer to signal that all of these blocks
>      have been written out --- *without* a barrier
> 3)  Write the commit block with a barrier
> 4)  Wait for the I/O to commit block to be done
>
> This is what we do with an async commit:
>
> 1)  Write the journal data, revoke, and descriptor blocks
> 2)  Write the commit block (with a checksum) with a barrier
> 3)  Wait for the I/O to in steps (1) and (2) to be done
>
> That's the only difference at this point.  The fatal flaw with async
> commit from before was this that we weren't writing the commit block
> in step (2) with a barrier --- and that *was* disastrous, since it
> meant the equivalent of mounting with barrier=0.
>    

I think that the difference is basically that in the original mode, 
waiting for stage (2) to finish means that our commit block will never 
hit the storage before the dependent data is committed. Remember that 
barriers are actually 2 CACHE_FLUSH_EXT commands - one before the 
flagged barrier IO is issued and one afterwards.

In effect, this means that we have little to no window where our commit 
block could be on persistent storage while we have the commit block on 
platter.

In the second scenario, it sounds like that data that would still be in 
flight is not going to get flushed by those barrier ops?

In any case, I think that we are opening a window here. The checksum 
should flag how often we end up with an invalid state, but I would still 
prefer to see a clear advantage in performance as well as testing (power 
fail, etc) to make sure that we are safe :-)

ric


> But now that it is fixed, this code path does make sense, and given
> that we weren't inserting a barrier between steps 2 and 3, we were in
> fact (theoretically) vulnerable to the commit block and the journal
> blocks getting reordered in 2.6.30 and older kernels.  Turning on the
> journal checksum (in the prior commit) helps solve that issue, but at
> that point, we might as well write the commit block before we start
> waiting on all of the journal blocks.
>
> As far as the code complexity issue concern, it really wasn't that
> complicated, and in fact we're not really changing the existing code
> path that we've been using for over a year now by very much.  The only
> difference in fact is where we call the function to write the commit
> record.
>
> 						- Ted
>    


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems
  2009-09-08 11:50       ` Ric Wheeler
@ 2009-09-11  2:45         ` Theodore Tso
  2009-09-11 11:07           ` Ric Wheeler
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Tso @ 2009-09-11  2:45 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Ext4 Developers List

On Tue, Sep 08, 2009 at 07:50:35AM -0400, Ric Wheeler wrote:
>>
>> So here's what we do on a non-async commit:
>>
>> This is what we do with an async commit:
>>
>> That's the only difference at this point.  The fatal flaw with async
>> commit from before was this that we weren't writing the commit block
>> in step (2) with a barrier --- and that *was* disastrous, since it
>> meant the equivalent of mounting with barrier=0.
>
> I think that the difference is basically that in the original mode,  
> waiting for stage (2) to finish means that our commit block will never  
> hit the storage before the dependent data is committed. Remember that  
> barriers are actually 2 CACHE_FLUSH_EXT commands - one before the  
> flagged barrier IO is issued and one afterwards.

I didn't realize that doing an ordered write meant that we had a
barrier *before* and *after* the commit block; I didn't realiuze it
was quite that strong.  I thought an ordered write only put a barrier
*after* the commit block.  Looking more closely, you're right, and
that actually explains why I wasn't see that much of a difference with
and without journal_async_write.

The fact that an ordered write puts barriers before and after the
commit means that right now the two scenarios above are in fact
*identical*.

So here's a respin of the fix-async-journal patch that changes what we
do from:

1)  Write the journal data, revoke, and descriptor blocks
2)  Wait for the block I/O layer to signal that all of these blocks
     have been written out --- *without* a barrier
3)  Write the commit block in ordered mode
4)  Wait for the I/O to commit block to be done

To this (in journal_async_commit):

1)  Write the journal data, revoke, and descriptor blocks
2)  Write the commit block (with a checksum) without setting ordered mode
3)  Send an empty barrier bio (so we only send a *single* CACHE_FLUSH_EXT)
4)  Wait for the I/O to in steps (1) and (2) to be done

This *does* show significant improvements:

Using ./fs_mark  -d  /mnt  -s  10240  -n  1000 

W/o journal_async_commit:

FSUse%        Count         Size    Files/sec     App Overhead
     8         1000        10240         30.5            28242

w/ journal_async_commit:

     8         1000        10240         45.8            28620

w/ barrier=0

     8         1000        10240        320.0            27699


Since this patch is a bit more complicated, I'll hold off on making it
be the default for now, but if the testing goes well, I plan to make
it default in the next kernel release, since an increase of 50% of
fs_mark is something I think we all would agree counts as a "clear
performance advantage".  :-)

							- Ted

commit fd67d1cfd73f554bae6c37745222eac2723983c8
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Sep 10 22:34:27 2009 -0400

    ext4: Fix async commit mode to be safe by using a barrier
    
    Previously the journal_async_commit mount option was equivalent to
    using barrier=0 (and just as unsafe).  This patch fixes it so that we
    eliminate the barrier before the commit block (by not using ordered
    mode), and explicitly issuing an empty barrier bio after writing the
    commit block.  Because of the journal checksum, it is safe to do this;
    if the journal blocks are not all written before a power failure, the
    checksum in the commit block will prevent the last transaction from
    being replayed.
    
    Using the fs_mark benchmark, using journal_async_commit shows a 50%
    improvement:
    
    FSUse%        Count         Size    Files/sec     App Overhead
         8         1000        10240         30.5            28242
    
    vs.
    
    FSUse%        Count         Size    Files/sec     App Overhead
         8         1000        10240         45.8            28620
    
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 7b4088b..d6f4763 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -25,6 +25,7 @@
 #include <linux/writeback.h>
 #include <linux/backing-dev.h>
 #include <linux/bio.h>
+#include <linux/blkdev.h>
 #include <trace/events/jbd2.h>
 
 /*
@@ -83,6 +84,34 @@ nope:
 	__brelse(bh);
 }
 
+static void end_empty_barrier(struct bio *bio, int err)
+{
+	if (err) {
+		if (err == -EOPNOTSUPP)
+			set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+		clear_bit(BIO_UPTODATE, &bio->bi_flags);
+	}
+	complete(bio->bi_private);
+}
+
+struct bio *issue_flush(struct block_device *bdev, struct completion *wait)
+{
+
+	struct bio *bio;
+
+	if (!bdev->bd_disk || !bdev->bd_disk->queue)
+		return NULL;
+
+	bio = bio_alloc(GFP_KERNEL, 0);
+	if (!bio)
+		return NULL;
+	bio->bi_end_io = end_empty_barrier;
+	bio->bi_private = wait;
+	bio->bi_bdev = bdev;
+	submit_bio(WRITE_BARRIER, bio);
+	return bio;
+}
+
 /*
  * Done it all: now submit the commit record.  We should have
  * cleaned up our previous buffers by now, so if we are in abort
@@ -133,8 +162,8 @@ static int journal_submit_commit_record(journal_t *journal,
 	bh->b_end_io = journal_end_buffer_io_sync;
 
 	if (journal->j_flags & JBD2_BARRIER &&
-		!JBD2_HAS_INCOMPAT_FEATURE(journal,
-					 JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+	    !JBD2_HAS_INCOMPAT_FEATURE(journal,
+				       JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
 		set_buffer_ordered(bh);
 		barrier_done = 1;
 	}
@@ -352,6 +381,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	transaction_t *commit_transaction;
 	struct journal_head *jh, *new_jh, *descriptor;
 	struct buffer_head **wbuf = journal->j_wbuf;
+	struct bio *bio_flush = NULL;
+	DECLARE_COMPLETION_ONSTACK(wait_flush);
 	int bufs;
 	int flags;
 	int err;
@@ -707,11 +738,13 @@ start_journal_io:
 	/* Done it all: now write the commit record asynchronously. */
 
 	if (JBD2_HAS_INCOMPAT_FEATURE(journal,
-		JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+				      JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
 		err = journal_submit_commit_record(journal, commit_transaction,
 						 &cbh, crc32_sum);
 		if (err)
 			__jbd2_journal_abort_hard(journal);
+		if (journal->j_flags & JBD2_BARRIER)
+			bio_flush = issue_flush(journal->j_dev, &wait_flush);
 	}
 
 	/*
@@ -833,8 +866,13 @@ wait_for_iobuf:
 
 	jbd_debug(3, "JBD: commit phase 5\n");
 
-	if (!JBD2_HAS_INCOMPAT_FEATURE(journal,
-		JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+	if (JBD2_HAS_INCOMPAT_FEATURE(journal,
+				      JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+		if (bio_flush) {
+			wait_for_completion(&wait_flush);
+			bio_put(bio_flush);
+		}
+	} else {
 		err = journal_submit_commit_record(journal, commit_transaction,
 						&cbh, crc32_sum);
 		if (err)

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems
  2009-09-11  2:45         ` Theodore Tso
@ 2009-09-11 11:07           ` Ric Wheeler
  2009-09-11 13:13             ` Theodore Tso
  0 siblings, 1 reply; 13+ messages in thread
From: Ric Wheeler @ 2009-09-11 11:07 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Ext4 Developers List

On 09/10/2009 10:45 PM, Theodore Tso wrote:
> On Tue, Sep 08, 2009 at 07:50:35AM -0400, Ric Wheeler wrote:
>>>
>>> So here's what we do on a non-async commit:
>>>
>>> This is what we do with an async commit:
>>>
>>> That's the only difference at this point.  The fatal flaw with async
>>> commit from before was this that we weren't writing the commit block
>>> in step (2) with a barrier --- and that *was* disastrous, since it
>>> meant the equivalent of mounting with barrier=0.
>>
>> I think that the difference is basically that in the original mode,
>> waiting for stage (2) to finish means that our commit block will never
>> hit the storage before the dependent data is committed. Remember that
>> barriers are actually 2 CACHE_FLUSH_EXT commands - one before the
>> flagged barrier IO is issued and one afterwards.
>
> I didn't realize that doing an ordered write meant that we had a
> barrier *before* and *after* the commit block; I didn't realiuze it
> was quite that strong.  I thought an ordered write only put a barrier
> *after* the commit block.  Looking more closely, you're right, and
> that actually explains why I wasn't see that much of a difference with
> and without journal_async_write.
>
> The fact that an ordered write puts barriers before and after the
> commit means that right now the two scenarios above are in fact
> *identical*.
>
> So here's a respin of the fix-async-journal patch that changes what we
> do from:
>
> 1)  Write the journal data, revoke, and descriptor blocks
> 2)  Wait for the block I/O layer to signal that all of these blocks
>       have been written out --- *without* a barrier
> 3)  Write the commit block in ordered mode
> 4)  Wait for the I/O to commit block to be done
>
> To this (in journal_async_commit):
>
> 1)  Write the journal data, revoke, and descriptor blocks
> 2)  Write the commit block (with a checksum) without setting ordered mode
> 3)  Send an empty barrier bio (so we only send a *single* CACHE_FLUSH_EXT)
> 4)  Wait for the I/O to in steps (1) and (2) to be done


I still think that we changing from a situation in which the drive state with 
regards to our transactions is almost always consistent to one in which we will 
often not be consistent.

More or less, moving from tight control of the persistent state on the platter 
to a situation in which, after power failure, we will more often see a bad 
transaction.  The checksum will catch those conditions, but catching and 
repairing is not the same as avoiding the need to repair in the first place :)

The key is really how can we measure the impact of this in a realistic way. How 
many fsck's are needed after a power fail? Chris's directory corruption test?

I have no objections to making this a non-default mount option, but think that 
we will need significant power fail testing before it would be a candidate for 
default use.

It certainly does have a significant performance bump!

ric


>
> This *does* show significant improvements:
>
> Using ./fs_mark  -d  /mnt  -s  10240  -n  1000
>
> W/o journal_async_commit:
>
> FSUse%        Count         Size    Files/sec     App Overhead
>       8         1000        10240         30.5            28242
>
> w/ journal_async_commit:
>
>       8         1000        10240         45.8            28620
>
> w/ barrier=0
>
>       8         1000        10240        320.0            27699
>
>
> Since this patch is a bit more complicated, I'll hold off on making it
> be the default for now, but if the testing goes well, I plan to make
> it default in the next kernel release, since an increase of 50% of
> fs_mark is something I think we all would agree counts as a "clear
> performance advantage".  :-)
>
> 							- Ted
>
> commit fd67d1cfd73f554bae6c37745222eac2723983c8
> Author: Theodore Ts'o<tytso@mit.edu>
> Date:   Thu Sep 10 22:34:27 2009 -0400
>
>      ext4: Fix async commit mode to be safe by using a barrier
>
>      Previously the journal_async_commit mount option was equivalent to
>      using barrier=0 (and just as unsafe).  This patch fixes it so that we
>      eliminate the barrier before the commit block (by not using ordered
>      mode), and explicitly issuing an empty barrier bio after writing the
>      commit block.  Because of the journal checksum, it is safe to do this;
>      if the journal blocks are not all written before a power failure, the
>      checksum in the commit block will prevent the last transaction from
>      being replayed.
>
>      Using the fs_mark benchmark, using journal_async_commit shows a 50%
>      improvement:
>
>      FSUse%        Count         Size    Files/sec     App Overhead
>           8         1000        10240         30.5            28242
>
>      vs.
>
>      FSUse%        Count         Size    Files/sec     App Overhead
>           8         1000        10240         45.8            28620
>
>
>      Signed-off-by: "Theodore Ts'o"<tytso@mit.edu>
>
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 7b4088b..d6f4763 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -25,6 +25,7 @@
>   #include<linux/writeback.h>
>   #include<linux/backing-dev.h>
>   #include<linux/bio.h>
> +#include<linux/blkdev.h>
>   #include<trace/events/jbd2.h>
>
>   /*
> @@ -83,6 +84,34 @@ nope:
>   	__brelse(bh);
>   }
>
> +static void end_empty_barrier(struct bio *bio, int err)
> +{
> +	if (err) {
> +		if (err == -EOPNOTSUPP)
> +			set_bit(BIO_EOPNOTSUPP,&bio->bi_flags);
> +		clear_bit(BIO_UPTODATE,&bio->bi_flags);
> +	}
> +	complete(bio->bi_private);
> +}
> +
> +struct bio *issue_flush(struct block_device *bdev, struct completion *wait)
> +{
> +
> +	struct bio *bio;
> +
> +	if (!bdev->bd_disk || !bdev->bd_disk->queue)
> +		return NULL;
> +
> +	bio = bio_alloc(GFP_KERNEL, 0);
> +	if (!bio)
> +		return NULL;
> +	bio->bi_end_io = end_empty_barrier;
> +	bio->bi_private = wait;
> +	bio->bi_bdev = bdev;
> +	submit_bio(WRITE_BARRIER, bio);
> +	return bio;
> +}
> +
>   /*
>    * Done it all: now submit the commit record.  We should have
>    * cleaned up our previous buffers by now, so if we are in abort
> @@ -133,8 +162,8 @@ static int journal_submit_commit_record(journal_t *journal,
>   	bh->b_end_io = journal_end_buffer_io_sync;
>
>   	if (journal->j_flags&  JBD2_BARRIER&&
> -		!JBD2_HAS_INCOMPAT_FEATURE(journal,
> -					 JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
> +	    !JBD2_HAS_INCOMPAT_FEATURE(journal,
> +				       JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
>   		set_buffer_ordered(bh);
>   		barrier_done = 1;
>   	}
> @@ -352,6 +381,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>   	transaction_t *commit_transaction;
>   	struct journal_head *jh, *new_jh, *descriptor;
>   	struct buffer_head **wbuf = journal->j_wbuf;
> +	struct bio *bio_flush = NULL;
> +	DECLARE_COMPLETION_ONSTACK(wait_flush);
>   	int bufs;
>   	int flags;
>   	int err;
> @@ -707,11 +738,13 @@ start_journal_io:
>   	/* Done it all: now write the commit record asynchronously. */
>
>   	if (JBD2_HAS_INCOMPAT_FEATURE(journal,
> -		JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
> +				      JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
>   		err = journal_submit_commit_record(journal, commit_transaction,
>   						&cbh, crc32_sum);
>   		if (err)
>   			__jbd2_journal_abort_hard(journal);
> +		if (journal->j_flags&  JBD2_BARRIER)
> +			bio_flush = issue_flush(journal->j_dev,&wait_flush);
>   	}
>
>   	/*
> @@ -833,8 +866,13 @@ wait_for_iobuf:
>
>   	jbd_debug(3, "JBD: commit phase 5\n");
>
> -	if (!JBD2_HAS_INCOMPAT_FEATURE(journal,
> -		JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
> +	if (JBD2_HAS_INCOMPAT_FEATURE(journal,
> +				      JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
> +		if (bio_flush) {
> +			wait_for_completion(&wait_flush);
> +			bio_put(bio_flush);
> +		}
> +	} else {
>   		err = journal_submit_commit_record(journal, commit_transaction,
>   						&cbh, crc32_sum);
>   		if (err)


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems
  2009-09-11 11:07           ` Ric Wheeler
@ 2009-09-11 13:13             ` Theodore Tso
  2009-09-11 14:39               ` Ric Wheeler
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Tso @ 2009-09-11 13:13 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Ext4 Developers List

On Fri, Sep 11, 2009 at 07:07:27AM -0400, Ric Wheeler wrote:
> I still think that we changing from a situation in which the drive state 
> with regards to our transactions is almost always consistent to one in 
> which we will often not be consistent.
>
> More or less, moving from tight control of the persistent state on the 
> platter to a situation in which, after power failure, we will more often 
> see a bad transaction.  The checksum will catch those conditions, but 
> catching and repairing is not the same as avoiding the need to repair in 
> the first place :)

We won't need to repair anything.  We still have a barrier before we
allow the filesystem to proceed with writing back buffers or
allocating blocks that aren't safe to be be written back or allocated
until after the commit.

So if the checksum doesn't match, we simply discard the last commit,
and the filesystem will be in a consistent state.  This case is
analogous to what happens if we didn't have enough time to write the
journal blocks plus the commit blocks before the crash.  By removing
the barrier before the commit block, it's possible for the commit
block to be written before the rest of the journal blocks, but we can
treat this case the same way that we treat a missing commit block ---
we simply throw away the last transaction.


The problems that I've worried about in the past is what happens if we
have a checksum failure on some commit block *other* than the last
commit block in the journal.  In that case, we *will* need to do a
full file system check and repair, and it is a toss up whether we are
better off ignoring the checksum failure, and replaying all of the
journal transaction, and hope that the checksum failure is caused by a
corrupted data block that will be later overwritten by a later
transaction, or whether we abort the journal replay immediately and
not replay the later transactions.  Currently we do the latter, but
the problem is that since we have already started reusing blocks that
might have been deleted in previous transactions, and some of the
buffes pinned by previous transactions have already been written out,
the file system will be in trouble.  This is where adding per-block
checksums into the journal descriptor blocks might allow us to do a
better job of recovering from failures in the journal.

*However*, this is problem is totally orthogonal to the async commit.
In the case of the last transaction, where some journal blocks were
written out before the commit block was written out, it is safe to
throw away the last transaction and consider it simply a "not
committed transaction".

> The key is really how can we measure the impact of this in a realistic 
> way. How many fsck's are needed after a power fail? Chris's directory 
> corruption test?

So the test should be that there should be *zero* file system
corruptions caused by a power failure.  (Unless the power fail induces
a hardware error, of course; if the stress caused by the power drop
causes a head crash, nothing we can do about that in software!)  The
async commit patch should be that safe.  If we can confirm that, then
the case for making it be the default mount option should be a
no-brainer.

       	      	     	     	       - Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems
  2009-09-11 13:13             ` Theodore Tso
@ 2009-09-11 14:39               ` Ric Wheeler
  0 siblings, 0 replies; 13+ messages in thread
From: Ric Wheeler @ 2009-09-11 14:39 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Ext4 Developers List

On 09/11/2009 09:13 AM, Theodore Tso wrote:
> On Fri, Sep 11, 2009 at 07:07:27AM -0400, Ric Wheeler wrote:
>    
>> I still think that we changing from a situation in which the drive state
>> with regards to our transactions is almost always consistent to one in
>> which we will often not be consistent.
>>
>> More or less, moving from tight control of the persistent state on the
>> platter to a situation in which, after power failure, we will more often
>> see a bad transaction.  The checksum will catch those conditions, but
>> catching and repairing is not the same as avoiding the need to repair in
>> the first place :)
>>      
> We won't need to repair anything.  We still have a barrier before we
> allow the filesystem to proceed with writing back buffers or
> allocating blocks that aren't safe to be be written back or allocated
> until after the commit.
>
> So if the checksum doesn't match, we simply discard the last commit,
> and the filesystem will be in a consistent state.  This case is
> analogous to what happens if we didn't have enough time to write the
> journal blocks plus the commit blocks before the crash.  By removing
> the barrier before the commit block, it's possible for the commit
> block to be written before the rest of the journal blocks, but we can
> treat this case the same way that we treat a missing commit block ---
> we simply throw away the last transaction.
>
>
> The problems that I've worried about in the past is what happens if we
> have a checksum failure on some commit block *other* than the last
> commit block in the journal.  In that case, we *will* need to do a
> full file system check and repair, and it is a toss up whether we are
> better off ignoring the checksum failure, and replaying all of the
> journal transaction, and hope that the checksum failure is caused by a
> corrupted data block that will be later overwritten by a later
> transaction, or whether we abort the journal replay immediately and
> not replay the later transactions.  Currently we do the latter, but
> the problem is that since we have already started reusing blocks that
> might have been deleted in previous transactions, and some of the
> buffes pinned by previous transactions have already been written out,
> the file system will be in trouble.  This is where adding per-block
> checksums into the journal descriptor blocks might allow us to do a
> better job of recovering from failures in the journal.
>
> *However*, this is problem is totally orthogonal to the async commit.
> In the case of the last transaction, where some journal blocks were
> written out before the commit block was written out, it is safe to
> throw away the last transaction and consider it simply a "not
> committed transaction".
>
>    
>> The key is really how can we measure the impact of this in a realistic
>> way. How many fsck's are needed after a power fail? Chris's directory
>> corruption test?
>>      
> So the test should be that there should be *zero* file system
> corruptions caused by a power failure.  (Unless the power fail induces
> a hardware error, of course; if the stress caused by the power drop
> causes a head crash, nothing we can do about that in software!)  The
> async commit patch should be that safe.  If we can confirm that, then
> the case for making it be the default mount option should be a
> no-brainer.
>
>         	      	     	     	       - Ted
>    

The above makes sense to me. Now we just need to figure out how to test 
properly and verify :-(

ric


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2009-09-11 14:39 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-05 22:32 [PATCH 1/2] ext4: Remove journal_checksum mount option and enable it by default Theodore Ts'o
2009-09-05 22:32 ` [PATCH 2/2] ext4: Automatically enable journal_async_commit on ext4 file systems Theodore Ts'o
2009-09-05 22:57   ` Andreas Dilger
2009-09-06  1:32     ` Theodore Tso
2009-09-06  2:57   ` Eric Sandeen
2009-09-07 23:48     ` Ric Wheeler
2009-09-07 23:42   ` Ric Wheeler
2009-09-08  4:45     ` Theodore Tso
2009-09-08 11:50       ` Ric Wheeler
2009-09-11  2:45         ` Theodore Tso
2009-09-11 11:07           ` Ric Wheeler
2009-09-11 13:13             ` Theodore Tso
2009-09-11 14:39               ` Ric Wheeler

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.