* [PATCH 0/5 v2] Lazy itable initialization for Ext4
@ 2010-09-08 16:59 Lukas Czerner
  2010-09-08 16:59 ` [PATCH 1/5] Add helper function for blkdev_issue_zeroout Lukas Czerner
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Lukas Czerner @ 2010-09-08 16:59 UTC (permalink / raw)
  To: linux-ext4; +Cc: rwheeler, sandeen, tytso, adilger, lczerner

Hi all,

I am presenting the second version of my lazy inode table initialization
code for Ext4. The patch set consists of five patches. The first one adds
a helper function for blkdev_issue_zeroout() called sb_issue_zeroout(),
which I use to zero out the inode table. The second patch adds a new pair
of mount options (inititable/noinititable), so you can enable or disable
this feature. It is off by default (noinititable), so in order to try the
new code you should mount the fs like this:

  mount -o inititable /dev/sda /mnt/

The third patch adds the inode table initialization code itself. The
thread initialization was heavily inspired by the nilfs2 segctord. The
last two patches make use of sb_issue_zeroout() and sb_issue_discard() in
other places in Ext4 where a part of the disk needs to be zeroed out.

To Andreas:
You suggested the approach of reading the table first to determine
whether the device is sparse, thinly provisioned, or a trimmed SSD. In
that case reading would be much more efficient than writing, so it would
be a win. But I wonder: if we do believe the device that, when it returns
zeroes, it is safe not to zero the inode table, why not do it at mkfs
time instead of in the kernel?

To Ted:
You suggested that it would be nice if the thread did not run, or just
quit, when the system runs on battery power. I agree that in that case we
probably should not do this, to save some battery life. But is it
necessary, or wise, to do this in the kernel? What should we do when the
system runs on battery and the user still wants to run the lazy
initialization? I would rather let userspace handle it, for example by
simply remounting the filesystem with -o noinititable.
___________
DESCRIPTION
___________

When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because the inode tables are
not zeroed out and thus contain some old data. When such a filesystem is
mounted, the filesystem code should initialize (zero out) the inode
tables. So far this code was missing for ext4; this patch adds the
feature.

For the purpose of zeroing inode tables it introduces a new kernel thread
called ext4lazyinit, which is created on demand and destroyed when it is
no longer needed. There is only one such thread for all ext4 filesystems
in the system. When the first filesystem with the inititable mount option
is mounted, the ext4lazyinit thread is created and the filesystem can
register its request in the request list.

The thread then walks through the list, picking up scheduled requests and
invoking ext4_init_inode_table(). The next schedule time for a request is
determined from the time it took to zero out the inode table, so that we
do not consume the whole I/O bandwidth. When the thread is no longer
needed (the request list is empty), it frees the appropriate structures
and exits (it can be started again later by another filesystem).
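
To illustrate the scheduling, here is a minimal userspace sketch (names
hypothetical, not the kernel code) of how the thread can pick its next
wakeup time as the minimum of the per-request schedule times:

```c
#include <limits.h>

/* Each request carries the jiffies value at which it should next run;
 * the thread sleeps until the earliest one. ULONG_MAX means "no work". */
static unsigned long pick_next_wakeup(const unsigned long *next_sched, int n)
{
	unsigned long next = ULONG_MAX;
	int i;

	for (i = 0; i < n; i++)
		if (next_sched[i] < next)
			next = next_sched[i];
	return next;
}
```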

We do not disturb regular inode allocations in any way; they simply do
not care whether the inode table is zeroed or not. But when zeroing we
obviously have to skip used inodes. We also have to prevent new inode
allocations from a group while its zeroing is under way. For that we take
alloc_sem for writing in ext4_init_inode_table() and for reading in
ext4_claim_inode(), so when we are unlucky and the allocator hits a group
which is currently being zeroed, it just has to wait.
_________________
BENCHMARK RESULTS
_________________

We try to avoid performance loss while the ext4lazyinit thread is
working. This is done very simply: we measure the time it takes to zero
out the inode table in one group and determine the next schedule time
from that number. For example, to use approximately 10% of the I/O
bandwidth we should wait for 9 times the zeroout time (1 time slice
working and 9 time slices sleeping). This multiplier (9 in our example)
thus defines how much I/O bandwidth the thread uses. It is a very simple
method, but I think it serves our needs.
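
In code the throttle is just a couple of multiplications; here is a
userspace sketch of the arithmetic (helper names are illustrative):

```c
/* After spending `elapsed` jiffies zeroing one group, schedule the
 * next run `mult` times that far in the future. */
static unsigned long next_sched(unsigned long now, unsigned long elapsed,
				unsigned int mult)
{
	return now + elapsed * mult;
}

/* The thread is busy 1 slice out of every (mult + 1): mult = 9 -> ~10%. */
static double io_share(unsigned int mult)
{
	return 1.0 / (mult + 1);
}
```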

In my benchmark I set different values of the multiplier
(EXT4_LI_WAIT_MULT) to see how it affects performance. As the measurement
tool I used postmark (see the parameters below). I averaged five postmark
runs to get more stable results. In each run I created an ext4 filesystem
on the device (with lazy_itable_init set appropriately), mounted it with
the inititable/noinititable mount option, and ran postmark, measuring the
running time and the number of groups the ext4lazyinit thread initialized
in one run. Here are the results.

All tests were done on a 2.6.35 kernel with and without my patches. In
the tables below you can see a comparison between the performance of the
kernel without my patches and several different settings (see the 3rd
column).

A graph is attached.

Type                              |NOPATCH      NOITABLEINIT DIFF    |
==================================+==================================+
Total_duration                    |130.00       130.00       -0.00%  |
Duration_of_transactions          |77.80        77.40        -0.51%  |
Transactions/s                    |642.73       646.15       0.53%   |
Files_created/s                   |575.15       575.15       -0.00%  |
Creation_alone/s                  |1024.83      1020.58      -0.41%  |
Creation_mixed_with_transaction/s |318.29       319.99       0.53%   |
Read/s                            |321.03       322.74       0.53%   |
Append/s                          |321.69       323.40       0.53%   |
Deleted/s                         |575.15       575.15       -0.00%  |
Deletion_alone/s                  |1015.03      1010.82      -0.41%  |
Deletion_mixed_with_transaction/s |324.44       326.16       0.53%   |
Read_B/s                          |21179620.40  21179620.40  -0.00%  |
Write_B/s                         |66279880.00  66279880.00  -0.00%  |
==================================+==================================+
RUNTIME:	2m10	GROUPS ZEROED: 0

Type                              |NOPATCH      MULT=10      DIFF    |
==================================+==================================+
Total_duration                    |130.00       132.40       1.85%   |
Duration_of_transactions          |77.80        80.80        3.86%   |
Transactions/s                    |642.73       618.82       -3.72%  |
Files_created/s                   |575.15       564.67       -1.82%  |
Creation_alone/s                  |1024.83      1033.17      0.81%   |
Creation_mixed_with_transaction/s |318.29       306.45       -3.72%  |
Read/s                            |321.03       309.09       -3.72%  |
Append/s                          |321.69       309.72       -3.72%  |
Deleted/s                         |575.15       564.67       -1.82%  |
Deletion_alone/s                  |1015.03      1023.29      0.81%   |
Deletion_mixed_with_transaction/s |324.44       312.37       -3.72%  |
Read_B/s                          |21179620.40  20793522.40  -1.82%  |
Write_B/s                         |66279880.00  65071617.60  -1.82%  |
==================================+==================================+
RUNTIME:	2m13	GROUPS ZEROED: 156

Type                              |NOPATCH      MULT=5       DIFF    |
==================================+==================================+
Total_duration                    |130.00       137.20       5.54%   |
Duration_of_transactions          |77.80        84.60        8.74%   |
Transactions/s                    |642.73       591.04       -8.04%  |
Files_created/s                   |575.15       544.96       -5.25%  |
Creation_alone/s                  |1024.83      1021.09      -0.36%  |
Creation_mixed_with_transaction/s |318.29       292.69       -8.04%  |
Read/s                            |321.03       295.21       -8.04%  |
Append/s                          |321.69       295.81       -8.05%  |
Deleted/s                         |575.15       544.96       -5.25%  |
Deletion_alone/s                  |1015.03      1011.33      -0.36%  |
Deletion_mixed_with_transaction/s |324.44       298.34       -8.04%  |
Read_B/s                          |21179620.40  20067661.60  -5.25%  |
Write_B/s                         |66279880.00  62800096.00  -5.25%  |
==================================+==================================+
RUNTIME:	2m16	GROUPS ZEROED: 324

Type                              |NOPATCH      MULT=2       DIFF    |
==================================+==================================+
Total_duration                    |130.00       148.40       14.15%  |
Duration_of_transactions          |77.80        95.00        22.11%  |
Transactions/s                    |642.73       526.38       -18.10% |
Files_created/s                   |575.15       503.78       -12.41% |
Creation_alone/s                  |1024.83      1004.24      -2.01%  |
Creation_mixed_with_transaction/s |318.29       260.67       -18.10% |
Read/s                            |321.03       262.92       -18.10% |
Append/s                          |321.69       263.45       -18.10% |
Deleted/s                         |575.15       503.78       -12.41% |
Deletion_alone/s                  |1015.03      994.64       -2.01%  |
Deletion_mixed_with_transaction/s |324.44       265.71       -18.10% |
Read_B/s                          |21179620.40  18551581.20  -12.41% |
Write_B/s                         |66279880.00  58055650.40  -12.41% |
==================================+==================================+
RUNTIME:	2m28	GROUPS ZEROED: 748

The benchmark showed that the patch itself does not introduce any
performance loss (at least for postmark) when the ext4lazyinit thread is
not activated. When the thread is activated there is a measurable
performance loss due to the inode table zeroing, but with
EXT4_LI_WAIT_MULT=10 it is just about 1.8%, which may or may not be much,
so thinking about it now, we should probably make this settable via
sysfs. What do you think?
___________________
POSTMARK PARAMETERS
___________________

set number 50000
set transactions 50000
set read 4096
set write 4096
set bias read 5
set bias create 5
set report terse
set size 1000 200000
set buffering false


Any comments are welcome.

Thanks!
-Lukas

---
[PATCH 1/5] Add helper function for blkdev_issue_zeroout
[PATCH 2/5] Add inititable/noinititable mount options for ext4
[PATCH 3/5] Add inode table initialization code for Ext4
[PATCH 4/5] Use sb_issue_zeroout in setup_new_group_blocks
[PATCH 5/5] Use sb_issue_discard in ext4_ext_zeroout

 fs/ext4/ext4.h         |   37 +++++
 fs/ext4/extents.c      |   68 +--------
 fs/ext4/ialloc.c       |  108 +++++++++++++
 fs/ext4/resize.c       |   44 ++----
 fs/ext4/super.c        |  405 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h |    8 +
 6 files changed, 575 insertions(+), 95 deletions(-)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/5] Add helper function for blkdev_issue_zeroout
  2010-09-08 16:59 [PATCH 0/5 v2] Lazy itable initialization for Ext4 Lukas Czerner
@ 2010-09-08 16:59 ` Lukas Czerner
  2010-09-08 20:36   ` Andreas Dilger
  2010-09-08 16:59 ` [PATCH 2/5] Add inititable/noinititable mount options for ext4 Lukas Czerner
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Lukas Czerner @ 2010-09-08 16:59 UTC (permalink / raw)
  To: linux-ext4; +Cc: rwheeler, sandeen, tytso, adilger, lczerner

This is done the same way as the sb_issue_discard() helper for
blkdev_issue_discard().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 include/linux/blkdev.h |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 09a8402..a22939d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1023,6 +1023,14 @@ static inline int sb_issue_discard(struct super_block *sb,
 	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL,
 				   BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
 }
+static inline int sb_issue_zeroout(struct super_block *sb,
+				   sector_t block, sector_t nr_blocks)
+{
+	block <<= (sb->s_blocksize_bits - 9);
+	nr_blocks <<= (sb->s_blocksize_bits - 9);
+	return blkdev_issue_zeroout(sb->s_bdev, block, nr_blocks, GFP_KERNEL,
+				   BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+}
 
 extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
 
-- 
1.7.2.2
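
The only subtlety in the helper above is the unit conversion:
blkdev_issue_zeroout() takes 512-byte sectors while the filesystem counts
in fs blocks, hence the shift by (s_blocksize_bits - 9). A userspace
sketch of that conversion (values illustrative):

```c
/* Convert filesystem blocks to 512-byte sectors, as sb_issue_zeroout()
 * does with sb->s_blocksize_bits. A 4 KiB block (blocksize_bits = 12)
 * spans 2^(12 - 9) = 8 sectors. */
static unsigned long long blocks_to_sectors(unsigned long long block,
					    unsigned int blocksize_bits)
{
	return block << (blocksize_bits - 9);
}
```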



* [PATCH 2/5] Add inititable/noinititable mount options for ext4
  2010-09-08 16:59 [PATCH 0/5 v2] Lazy itable initialization for Ext4 Lukas Czerner
  2010-09-08 16:59 ` [PATCH 1/5] Add helper function for blkdev_issue_zeroout Lukas Czerner
@ 2010-09-08 16:59 ` Lukas Czerner
  2010-09-08 16:59 ` [PATCH 3/5] Add inode table initialization code for Ext4 Lukas Czerner
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Lukas Czerner @ 2010-09-08 16:59 UTC (permalink / raw)
  To: linux-ext4; +Cc: rwheeler, sandeen, tytso, adilger, lczerner

Add a new mount flag, EXT4_MOUNT_INIT_INODE_TABLE, and a new pair of
mount options (inititable/noinititable). When mounted with inititable,
the file system should try to initialize uninitialized inode tables;
otherwise it should not initialize them.
For now, the default is noinititable.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/ext4.h  |    1 +
 fs/ext4/super.c |   12 ++++++++++++
 2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..dbd6760 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -885,6 +885,7 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_DATA_ERR_ABORT	0x10000000 /* Abort on file data write */
 #define EXT4_MOUNT_BLOCK_VALIDITY	0x20000000 /* Block validity checking */
 #define EXT4_MOUNT_DISCARD		0x40000000 /* Issue DISCARD requests */
+#define EXT4_MOUNT_INIT_INODE_TABLE	0x80000000 /* Initialize uninitialized itables */
 
 #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt
 #define set_opt(o, opt)			o |= EXT4_MOUNT_##opt
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4e8983a..515e306 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -986,6 +986,9 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	if (test_opt(sb, DIOREAD_NOLOCK))
 		seq_puts(seq, ",dioread_nolock");
 
+	if (test_opt(sb, INIT_INODE_TABLE))
+		seq_puts(seq, ",init_inode_table");
+
 	ext4_show_quota_options(seq, sb);
 
 	return 0;
@@ -1161,6 +1164,7 @@ enum {
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard,
+	Opt_init_inode_table, Opt_noinit_inode_table,
 };
 
 static const match_table_t tokens = {
@@ -1231,6 +1235,8 @@ static const match_table_t tokens = {
 	{Opt_dioread_lock, "dioread_lock"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
+	{Opt_init_inode_table, "inititable"},
+	{Opt_noinit_inode_table, "noinititable"},
 	{Opt_err, NULL},
 };
 
@@ -1699,6 +1705,12 @@ set_qf_format:
 		case Opt_dioread_lock:
 			clear_opt(sbi->s_mount_opt, DIOREAD_NOLOCK);
 			break;
+		case Opt_init_inode_table:
+			set_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
+			break;
+		case Opt_noinit_inode_table:
+			clear_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
+			break;
 		default:
 			ext4_msg(sb, KERN_ERR,
 			       "Unrecognized mount option \"%s\" "
-- 
1.7.2.2



* [PATCH 3/5] Add inode table initialization code for Ext4
  2010-09-08 16:59 [PATCH 0/5 v2] Lazy itable initialization for Ext4 Lukas Czerner
  2010-09-08 16:59 ` [PATCH 1/5] Add helper function for blkdev_issue_zeroout Lukas Czerner
  2010-09-08 16:59 ` [PATCH 2/5] Add inititable/noinititable mount options for ext4 Lukas Czerner
@ 2010-09-08 16:59 ` Lukas Czerner
  2010-09-08 21:19   ` Andreas Dilger
  2010-09-08 16:59 ` [PATCH 4/5] Use sb_issue_zeroout in setup_new_group_blocks Lukas Czerner
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Lukas Czerner @ 2010-09-08 16:59 UTC (permalink / raw)
  To: linux-ext4; +Cc: rwheeler, sandeen, tytso, adilger, lczerner

When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because the inode tables are
not zeroed out and thus contain some old data. When such a filesystem is
mounted, the filesystem code should initialize (zero out) the inode
tables. So far this code was missing for ext4; this patch adds the
feature.

For the purpose of zeroing inode tables it introduces a new kernel thread
called ext4lazyinit, which is created on demand and destroyed when it is
no longer needed. There is only one such thread for all ext4 filesystems
in the system. When the first filesystem with the inititable mount option
is mounted, the ext4lazyinit thread is created and the filesystem can
register its request in the request list.

The thread then walks through the list, picking up scheduled requests and
invoking ext4_init_inode_table(). The next schedule time for a request is
determined from the time it took to zero out the inode table, so that we
do not consume the whole I/O bandwidth. When the thread is no longer
needed (the request list is empty), it frees the appropriate structures
and exits (it can be started again later by another filesystem).

We do not disturb regular inode allocations in any way; they simply do
not care whether the inode table is zeroed or not. But when zeroing we
obviously have to skip used inodes. We also have to prevent new inode
allocations from a group while its zeroing is under way. For that we take
alloc_sem for writing in ext4_init_inode_table() and for reading in
ext4_claim_inode(), so when we are unlucky and the allocator hits a group
which is currently being zeroed, it just has to wait.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/ext4.h   |   36 +++++
 fs/ext4/ialloc.c |  108 +++++++++++++++
 fs/ext4/super.c  |  393 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 537 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index dbd6760..e8afb02 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1144,6 +1144,9 @@ struct ext4_sb_info {
 
 	/* workqueue for dio unwritten */
 	struct workqueue_struct *dio_unwritten_wq;
+
+	/* Lazy inode table initialization info */
+	struct ext4_li_request *s_li_request;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
@@ -1467,6 +1470,37 @@ void ext4_get_group_no_and_offset(struct super_block *sb, ext4_fsblk_t blocknr,
 extern struct proc_dir_entry *ext4_proc_root;
 
 /*
+ * Timeout and state flag for lazy initialization inode thread.
+ */
+#define EXT4_LI_WAIT_MULT		10
+#define EXT4_LI_MAX_START_DELAY		5
+#define EXT4_LAZYINIT_QUIT		0x0001
+#define EXT4_LAZYINIT_RUNNING		0x0002
+
+/*
+ * Lazy inode table initialization info
+ */
+struct ext4_lazy_init {
+	unsigned long		li_state;
+
+	wait_queue_head_t	li_wait_daemon;
+	wait_queue_head_t	li_wait_task;
+	struct timer_list	li_timer;
+	struct task_struct	*li_task;
+
+	struct list_head	li_request_list;
+	struct mutex		li_list_mtx;
+};
+
+struct ext4_li_request {
+	struct super_block	*lr_super;
+	struct ext4_sb_info	*lr_sbi;
+	ext4_group_t		lr_next_group;
+	struct list_head	lr_request;
+	unsigned long		lr_next_sched;
+};
+
+/*
  * Function prototypes
  */
 
@@ -1539,6 +1573,8 @@ extern unsigned ext4_init_inode_bitmap(struct super_block *sb,
 				       ext4_group_t group,
 				       struct ext4_group_desc *desc);
 extern void mark_bitmap_end(int start_bit, int end_bit, char *bitmap);
+extern int ext4_init_inode_table(struct super_block *sb,
+				 ext4_group_t group);
 
 /* mballoc.c */
 extern long ext4_mb_stats;
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 25c4b31..c545bac 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -107,6 +107,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
 	desc = ext4_get_group_desc(sb, block_group, NULL);
 	if (!desc)
 		return NULL;
+
 	bitmap_blk = ext4_inode_bitmap(sb, desc);
 	bh = sb_getblk(sb, bitmap_blk);
 	if (unlikely(!bh)) {
@@ -123,6 +124,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
 		unlock_buffer(bh);
 		return bh;
 	}
+
 	ext4_lock_group(sb, block_group);
 	if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
 		ext4_init_inode_bitmap(sb, bh, block_group, desc);
@@ -133,6 +135,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
 		return bh;
 	}
 	ext4_unlock_group(sb, block_group);
+
 	if (buffer_uptodate(bh)) {
 		/*
 		 * if not uninit if bh is uptodate,
@@ -712,8 +715,17 @@ static int ext4_claim_inode(struct super_block *sb,
 {
 	int free = 0, retval = 0, count;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
 	struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group, NULL);
 
+	/*
+	 * We have to be sure that a new inode allocation does not race
+	 * with inode table initialization; otherwise we may end up
+	 * allocating and writing a new inode right before sb_issue_zeroout
+	 * takes place, overwriting our new inode with zeroes. So we take
+	 * alloc_sem to prevent that.
+	 */
+	down_read(&grp->alloc_sem);
 	ext4_lock_group(sb, group);
 	if (ext4_set_bit(ino, inode_bitmap_bh->b_data)) {
 		/* not a free inode */
@@ -724,6 +736,7 @@ static int ext4_claim_inode(struct super_block *sb,
 	if ((group == 0 && ino < EXT4_FIRST_INO(sb)) ||
 			ino > EXT4_INODES_PER_GROUP(sb)) {
 		ext4_unlock_group(sb, group);
+		up_read(&grp->alloc_sem);
 		ext4_error(sb, "reserved inode or inode > inodes count - "
 			   "block_group = %u, inode=%lu", group,
 			   ino + group * EXT4_INODES_PER_GROUP(sb));
@@ -772,6 +785,7 @@ static int ext4_claim_inode(struct super_block *sb,
 	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
 err_ret:
 	ext4_unlock_group(sb, group);
+	up_read(&grp->alloc_sem);
 	return retval;
 }
 
@@ -1205,3 +1219,97 @@ unsigned long ext4_count_dirs(struct super_block * sb)
 	}
 	return count;
 }
+
+/*
+ * Zeroes a not yet zeroed inode table - just writes zeroes through the
+ * whole inode table. Must be called without any spinlock held. The only
+ * place it is called from on an active part of the filesystem is the
+ * ext4lazyinit thread, so we do not need any special locks; however we
+ * have to prevent inode allocation from the current group, so we take
+ * the alloc_sem lock to block ext4_claim_inode() until we are finished.
+ */
+extern int ext4_init_inode_table(struct super_block *sb, ext4_group_t group)
+{
+	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_group_desc *gdp = NULL;
+	struct buffer_head *group_desc_bh;
+	handle_t *handle;
+	ext4_fsblk_t blk;
+	int num, ret = 0, used_blks = 0;
+
+	if (sb->s_flags & MS_RDONLY) {
+		ext4_warning(sb, "Filesystem mounted read only. "
+				 "Won't zeroout anything!");
+		ret = 1;
+		goto out;
+	}
+
+	gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
+	if (!gdp)
+		goto out;
+
+	/*
+	 * We do not need to lock this, because we are the only one
+	 * handling this flag.
+	 */
+	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED))
+		goto out;
+
+	handle = ext4_journal_start_sb(sb, 1);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		goto out;
+	}
+
+	down_write(&grp->alloc_sem);
+	/*
+	 * If inode bitmap was already initialized there may be some
+	 * used inodes so we need to skip blocks with used inodes in
+	 * inode table.
+	 */
+	if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)))
+		used_blks = DIV_ROUND_UP((EXT4_INODES_PER_GROUP(sb) -
+			    ext4_itable_unused_count(sb, gdp)),
+			    sbi->s_inodes_per_block);
+
+	blk = ext4_inode_table(sb, gdp) + used_blks;
+	num = sbi->s_itb_per_group - used_blks;
+
+	BUFFER_TRACE(group_desc_bh, "get_write_access");
+	ret = ext4_journal_get_write_access(handle,
+					    group_desc_bh);
+	if (ret)
+		goto err_out;
+
+	/*
+	 * Skip zeroout if the inode table is full. But we set the ZEROED
+	 * flag anyway, because obviously, when it is full it does not need
+	 * further zeroing.
+	 */
+	if (unlikely(num == 0))
+		goto skip_zeroout;
+
+	ext4_debug("going to zero out inode table in group %d\n",
+		group);
+	ret = sb_issue_zeroout(sb, blk, num);
+	if (ret < 0)
+		goto err_out;
+
+skip_zeroout:
+	ext4_lock_group(sb, group);
+	gdp->bg_flags |= cpu_to_le16(EXT4_BG_INODE_ZEROED);
+	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
+	ext4_unlock_group(sb, group);
+
+	BUFFER_TRACE(group_desc_bh,
+		     "call ext4_handle_dirty_metadata");
+	ret = ext4_handle_dirty_metadata(handle, NULL,
+					 group_desc_bh);
+
+err_out:
+	up_write(&grp->alloc_sem);
+	ext4_journal_stop(handle);
+out:
+	return ret;
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 515e306..ce93fa3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -41,6 +41,9 @@
 #include <linux/crc16.h>
 #include <asm/uaccess.h>
 
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+
 #include "ext4.h"
 #include "ext4_jbd2.h"
 #include "xattr.h"
@@ -52,6 +55,8 @@
 
 struct proc_dir_entry *ext4_proc_root;
 static struct kset *ext4_kset;
+struct ext4_lazy_init *ext4_li_info;
+struct mutex ext4_li_mtx;
 
 static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
 			     unsigned long journal_devnum);
@@ -70,6 +75,8 @@ static void ext4_write_super(struct super_block *sb);
 static int ext4_freeze(struct super_block *sb);
 static int ext4_get_sb(struct file_system_type *fs_type, int flags,
 		       const char *dev_name, void *data, struct vfsmount *mnt);
+static void ext4_destroy_lazyinit_thread(void);
+static void ext4_unregister_li_request(struct super_block *sb);
 
 #if !defined(CONFIG_EXT3_FS) && !defined(CONFIG_EXT3_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT23)
 static struct file_system_type ext3_fs_type = {
@@ -664,6 +671,7 @@ static void ext4_put_super(struct super_block *sb)
 				   "Couldn't clean up the journal");
 	}
 
+	ext4_unregister_li_request(sb);
 	ext4_release_system_zone(sb);
 	ext4_mb_release(sb);
 	ext4_ext_release(sb);
@@ -2443,6 +2451,373 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
 	return 1;
 }
 
+static void ext4_lazyinode_timeout(unsigned long data)
+{
+	struct task_struct *p = (struct task_struct *)data;
+	wake_up_process(p);
+}
+
+/* Find next suitable group and run ext4_init_inode_table() */
+static int ext4_run_li_request(struct ext4_li_request *elr)
+{
+	struct ext4_group_desc *gdp = NULL;
+	ext4_group_t group, ngroups;
+	struct super_block *sb;
+	int ret = 0;
+
+	sb = elr->lr_super;
+	ngroups = EXT4_SB(sb)->s_groups_count;
+
+	for (group = elr->lr_next_group; group < ngroups; group++) {
+		gdp = ext4_get_group_desc(sb, group, NULL);
+		if (!gdp) {
+			ret = 1;
+			break;
+		}
+
+		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
+			break;
+	}
+
+	if (group == ngroups)
+		ret = 1;
+
+	if (!ret)
+		ret = ext4_init_inode_table(sb, group);
+
+	return ret;
+}
+
+/*
+ * Remove lr_request from the request list and free the request
+ * structure. Should be called with li_list_mtx held.
+ */
+static void ext4_remove_li_request(struct ext4_li_request *elr)
+{
+	struct ext4_sb_info *sbi;
+
+	if (!elr)
+		return;
+
+	sbi = elr->lr_sbi;
+
+	list_del(&elr->lr_request);
+	sbi->s_li_request = NULL;
+	kfree(elr);
+}
+
+static void ext4_unregister_li_request(struct super_block *sb)
+{
+	struct ext4_li_request *elr = EXT4_SB(sb)->s_li_request;
+
+	if (!ext4_li_info)
+		return;
+
+	mutex_lock(&ext4_li_info->li_list_mtx);
+	ext4_remove_li_request(elr);
+	mutex_unlock(&ext4_li_info->li_list_mtx);
+}
+
+/*
+ * This is the function where the ext4lazyinit thread lives. It walks
+ * through the request list searching for the next scheduled filesystem.
+ * When such a fs is found, it runs the lazy initialization request
+ * (ext4_run_li_request) and keeps track of the time spent in this
+ * function. Based on that time we compute the next schedule time of
+ * the request. When the walk through the list is complete, it computes
+ * the next wakeup time and puts itself to sleep.
+ */
+static int ext4_lazyinit_thread(void *arg)
+{
+	struct ext4_lazy_init *eli = (struct ext4_lazy_init *)arg;
+	struct list_head *pos, *n;
+	struct ext4_li_request *elr;
+	struct ext4_sb_info *sbi;
+	unsigned long next_wakeup;
+	unsigned long timeout = 0;
+	int ret;
+
+	BUG_ON(NULL == eli);
+
+	eli->li_timer.data = (unsigned long)current;
+	eli->li_timer.function = ext4_lazyinode_timeout;
+
+	eli->li_task = current;
+	wake_up(&eli->li_wait_task);
+
+cont_thread:
+	while (true) {
+		next_wakeup = ULONG_MAX;
+		mutex_lock(&eli->li_list_mtx);
+		if (list_empty(&eli->li_request_list)) {
+			mutex_unlock(&eli->li_list_mtx);
+			goto exit_thread;
+		}
+
+		list_for_each_safe(pos, n, &eli->li_request_list) {
+			elr = list_entry(pos, struct ext4_li_request,
+					 lr_request);
+
+			if (time_before_eq(jiffies, elr->lr_next_sched))
+				continue;
+			sbi = elr->lr_sbi;
+
+			timeout = jiffies;
+			ret = ext4_run_li_request(elr);
+
+			/* Handle jiffies overflow - is it even possible? */
+			if (timeout > jiffies)
+				timeout = (ULONG_MAX - timeout) + jiffies;
+			else
+				timeout = jiffies - timeout;
+
+			timeout *= EXT4_LI_WAIT_MULT;
+
+			if (ret) {
+				ext4_remove_li_request(elr);
+				continue;
+			}
+
+			elr->lr_next_sched = jiffies + timeout;
+			if (elr->lr_next_sched < next_wakeup)
+				next_wakeup = elr->lr_next_sched;
+		}
+
+		mutex_unlock(&eli->li_list_mtx);
+
+		eli->li_timer.expires = next_wakeup;
+		add_timer(&eli->li_timer);
+
+		if (freezing(current)) {
+			refrigerator();
+		} else {
+			DEFINE_WAIT(wait);
+			prepare_to_wait(&eli->li_wait_daemon, &wait,
+					TASK_INTERRUPTIBLE);
+			schedule();
+			finish_wait(&eli->li_wait_daemon, &wait);
+		}
+	}
+
+exit_thread:
+	/*
+	 * It looks like the request list is empty, but we need
+	 * to check it under the li_list_mtx lock, to prevent any
+	 * additions into it, and of course we should lock ext4_li_mtx
+	 * to atomically free the list and ext4_li_info, because at
+	 * this point another ext4 filesystem could be registering
+	 * new one.
+	 */
+	mutex_lock(&ext4_li_mtx);
+	mutex_lock(&eli->li_list_mtx);
+	if (!list_empty(&eli->li_request_list)) {
+		mutex_unlock(&eli->li_list_mtx);
+		mutex_unlock(&ext4_li_mtx);
+		goto cont_thread;
+	}
+	mutex_unlock(&eli->li_list_mtx);
+	del_timer_sync(&ext4_li_info->li_timer);
+	eli->li_task = NULL;
+	wake_up(&eli->li_wait_task);
+
+	kfree(ext4_li_info);
+	ext4_li_info = NULL;
+	mutex_unlock(&ext4_li_mtx);
+	return 0;
+}
+
+static void ext4_clear_request_list(void)
+{
+	struct list_head *pos, *n;
+	struct ext4_li_request *elr;
+
+	mutex_lock(&ext4_li_info->li_list_mtx);
+	if (list_empty(&ext4_li_info->li_request_list))
+		return;
+
+	list_for_each_safe(pos, n, &ext4_li_info->li_request_list) {
+		elr = list_entry(pos, struct ext4_li_request,
+				 lr_request);
+		ext4_remove_li_request(elr);
+	}
+	mutex_unlock(&ext4_li_info->li_list_mtx);
+}
+
+static int ext4_run_lazyinit_thread(void)
+{
+	struct task_struct *t;
+
+	t = kthread_run(ext4_lazyinit_thread, ext4_li_info, "ext4lazyinit");
+	if (IS_ERR(t)) {
+		int err = PTR_ERR(t);
+		ext4_clear_request_list();
+		del_timer_sync(&ext4_li_info->li_timer);
+		kfree(ext4_li_info);
+		ext4_li_info = NULL;
+		printk(KERN_CRIT "EXT4: error %d creating inode table "
+				 "initialization thread\n",
+				 err);
+		return err;
+	}
+	ext4_li_info->li_state |= EXT4_LAZYINIT_RUNNING;
+
+	wait_event(ext4_li_info->li_wait_task, ext4_li_info->li_task != NULL);
+	return 0;
+}
+
+/*
+ * Check whether it makes sense to run the itable init. thread.
+ * If there is at least one group with an uninitialized inode
+ * table, return its group number; otherwise the loop goes
+ * through all groups and the total number of groups is
+ * returned.
+ */
+static ext4_group_t ext4_has_uninit_itable(struct super_block *sb)
+{
+	ext4_group_t group, ngroups = EXT4_SB(sb)->s_groups_count;
+	struct ext4_group_desc *gdp = NULL;
+
+	for (group = 0; group < ngroups; group++) {
+		gdp = ext4_get_group_desc(sb, group, NULL);
+		if (!gdp)
+			continue;
+
+		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
+			break;
+	}
+
+	return group;
+}
+
+static int ext4_li_info_new(void)
+{
+	struct ext4_lazy_init *eli = NULL;
+
+	eli = kzalloc(sizeof(*eli), GFP_KERNEL);
+	if (!eli)
+		return -ENOMEM;
+
+	eli->li_task = NULL;
+	INIT_LIST_HEAD(&eli->li_request_list);
+	mutex_init(&eli->li_list_mtx);
+
+	init_waitqueue_head(&eli->li_wait_daemon);
+	init_waitqueue_head(&eli->li_wait_task);
+	init_timer(&eli->li_timer);
+	eli->li_state |= EXT4_LAZYINIT_QUIT;
+
+	ext4_li_info = eli;
+
+	return 0;
+}
+
+static struct ext4_li_request *ext4_li_request_new(struct super_block *sb,
+					    ext4_group_t start)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_li_request *elr;
+	unsigned long rnd;
+
+	elr = kzalloc(sizeof(*elr), GFP_KERNEL);
+	if (!elr)
+		return NULL;
+
+	elr->lr_super = sb;
+	elr->lr_sbi = sbi;
+	elr->lr_next_group = start;
+
+	/*
+	 * Randomize the first schedule time of the request to
+	 * spread the inode table initialization requests
+	 * better.
+	 */
+	get_random_bytes(&rnd, sizeof(rnd));
+	elr->lr_next_sched = jiffies + (unsigned long)rnd %
+			     (EXT4_LI_MAX_START_DELAY * HZ);
+
+	return elr;
+}
+
+static int ext4_register_li_request(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_li_request *elr;
+	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
+	ext4_group_t start;
+	int ret = 0;
+
+	if (sbi->s_li_request != NULL)
+		goto out;
+
+	if ((sb->s_flags & MS_RDONLY) ||
+	     !test_opt(sb, INIT_INODE_TABLE)) {
+		sbi->s_li_request = NULL;
+		goto out;
+	}
+
+	start = ext4_has_uninit_itable(sb);
+	if (start == ngroups) {
+		sbi->s_li_request = NULL;
+		goto out;
+	}
+
+	elr = ext4_li_request_new(sb, start);
+	if (!elr)
+		return -ENOMEM;
+
+
+	mutex_lock(&ext4_li_mtx);
+
+	if (NULL == ext4_li_info) {
+		ret = ext4_li_info_new();
+		if (ret)
+			goto out;
+	}
+
+	mutex_lock(&ext4_li_info->li_list_mtx);
+	list_add(&elr->lr_request, &ext4_li_info->li_request_list);
+	mutex_unlock(&ext4_li_info->li_list_mtx);
+
+	sbi->s_li_request = elr;
+
+	if (!(ext4_li_info->li_state & EXT4_LAZYINIT_RUNNING)) {
+		ret = ext4_run_lazyinit_thread();
+		if (ret)
+			goto out;
+	}
+
+	mutex_unlock(&ext4_li_mtx);
+
+out:
+	if (ret) {
+		mutex_unlock(&ext4_li_mtx);
+		kfree(elr);
+	}
+	return ret;
+}
+
+/*
+ * We do not need to lock anything since this is called on
+ * module unload.
+ */
+static void ext4_destroy_lazyinit_thread(void)
+{
+	/*
+	 * If the thread exited earlier
+	 * there's nothing to be done.
+	 */
+	if (!ext4_li_info)
+		return;
+
+	ext4_clear_request_list();
+
+	while (ext4_li_info->li_task) {
+		wake_up(&ext4_li_info->li_wait_daemon);
+		wait_event(ext4_li_info->li_wait_task,
+			   ext4_li_info->li_task == NULL);
+	}
+}
+
 static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 				__releases(kernel_lock)
 				__acquires(kernel_lock)
@@ -3027,6 +3402,10 @@ no_journal:
 		goto failed_mount4;
 	}
 
+	err = ext4_register_li_request(sb);
+	if (err)
+		goto failed_mount4;
+
 	sbi->s_kobj.kset = ext4_kset;
 	init_completion(&sbi->s_kobj_unregister);
 	err = kobject_init_and_add(&sbi->s_kobj, &ext4_ktype, NULL,
@@ -3723,6 +4102,16 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 			enable_quota = 1;
 		}
 	}
+
+	/*
+	 * Reinitialize lazy itable initialization thread based on
+	 * current settings
+	 */
+	if ((sb->s_flags & MS_RDONLY) || !test_opt(sb, INIT_INODE_TABLE))
+		ext4_unregister_li_request(sb);
+	else
+		ext4_register_li_request(sb);
+
 	ext4_setup_system_zone(sb);
 	if (sbi->s_journal == NULL)
 		ext4_commit_super(sb, 1);
@@ -4190,6 +4579,9 @@ static int __init init_ext4_fs(void)
 	err = register_filesystem(&ext4_fs_type);
 	if (err)
 		goto out;
+
+	ext4_li_info = NULL;
+	mutex_init(&ext4_li_mtx);
 	return 0;
 out:
 	unregister_as_ext2();
@@ -4209,6 +4601,7 @@ out4:
 
 static void __exit exit_ext4_fs(void)
 {
+	ext4_destroy_lazyinit_thread();
 	unregister_as_ext2();
 	unregister_as_ext3();
 	unregister_filesystem(&ext4_fs_type);
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/5] Use sb_issue_zeroout in setup_new_group_blocks
  2010-09-08 16:59 [PATCH 0/5 v2] Lazy itable initialization for Ext4 Lukas Czerner
                   ` (2 preceding siblings ...)
  2010-09-08 16:59 ` [PATCH 3/5] Add inode table initialization code for Ext4 Lukas Czerner
@ 2010-09-08 16:59 ` Lukas Czerner
  2010-09-08 16:59 ` [PATCH 5/5] Use sb_issue_discard in ext4_ext_zeroout Lukas Czerner
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Lukas Czerner @ 2010-09-08 16:59 UTC (permalink / raw)
  To: linux-ext4; +Cc: rwheeler, sandeen, tytso, adilger, lczerner

Use sb_issue_zeroout to zero out the inode table and reserved group
descriptor table blocks instead of the old approach, which involves
journaling.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/resize.c |   44 +++++++++++---------------------------------
 1 files changed, 11 insertions(+), 33 deletions(-)

diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 6df797e..60161bc 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -226,23 +226,12 @@ static int setup_new_group_blocks(struct super_block *sb,
 	}
 
 	/* Zero out all of the reserved backup group descriptor table blocks */
-	for (i = 0, bit = gdblocks + 1, block = start + bit;
-	     i < reserved_gdb; i++, block++, bit++) {
-		struct buffer_head *gdb;
-
-		ext4_debug("clear reserved block %#04llx (+%d)\n", block, bit);
-
-		if ((err = extend_or_restart_transaction(handle, 1, bh)))
-			goto exit_bh;
+	ext4_debug("clear reserved descriptor blocks %#04llx -> %#04llx\n",
+			start + gdblocks + 1, start + gdblocks + reserved_gdb);
+	err = sb_issue_zeroout(sb, gdblocks + start + 1, reserved_gdb);
+	if (err)
+		goto exit_journal;
 
-		if (IS_ERR(gdb = bclean(handle, sb, block))) {
-			err = PTR_ERR(gdb);
-			goto exit_bh;
-		}
-		ext4_handle_dirty_metadata(handle, NULL, gdb);
-		ext4_set_bit(bit, bh->b_data);
-		brelse(gdb);
-	}
 	ext4_debug("mark block bitmap %#04llx (+%llu)\n", input->block_bitmap,
 		   input->block_bitmap - start);
 	ext4_set_bit(input->block_bitmap - start, bh->b_data);
@@ -251,23 +240,12 @@ static int setup_new_group_blocks(struct super_block *sb,
 	ext4_set_bit(input->inode_bitmap - start, bh->b_data);
 
 	/* Zero out all of the inode table blocks */
-	for (i = 0, block = input->inode_table, bit = block - start;
-	     i < sbi->s_itb_per_group; i++, bit++, block++) {
-		struct buffer_head *it;
-
-		ext4_debug("clear inode block %#04llx (+%d)\n", block, bit);
-
-		if ((err = extend_or_restart_transaction(handle, 1, bh)))
-			goto exit_bh;

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/5] Use sb_issue_discard in ext4_ext_zeroout
  2010-09-08 16:59 [PATCH 0/5 v2] Lazy itable initialization for Ext4 Lukas Czerner
                   ` (3 preceding siblings ...)
  2010-09-08 16:59 ` [PATCH 4/5] Use sb_issue_zeroout in setup_new_group_blocks Lukas Czerner
@ 2010-09-08 16:59 ` Lukas Czerner
  2010-09-08 21:21   ` Andreas Dilger
  2010-09-08 17:49 ` [PATCH 0/5 v2] Lazy itable initialization for Ext4 Lukas Czerner
  2010-09-08 20:34 ` Andreas Dilger
  6 siblings, 1 reply; 16+ messages in thread
From: Lukas Czerner @ 2010-09-08 16:59 UTC (permalink / raw)
  To: linux-ext4; +Cc: rwheeler, sandeen, tytso, adilger, lczerner

Change ext4_ext4_zeroout to use sb_issue_discard instead of its
own approach to zero out extents.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/extents.c |   68 ++++------------------------------------------------
 1 files changed, 6 insertions(+), 62 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 377309c..c426d9e 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2536,77 +2536,21 @@ void ext4_ext_release(struct super_block *sb)
 #endif
 }
 
-static void bi_complete(struct bio *bio, int error)
-{
-	complete((struct completion *)bio->bi_private);
-}
-
 /* FIXME!! we need to try to merge to left or right after zero-out  */
 static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex)
 {
+	ext4_fsblk_t ee_pblock;
+	unsigned int ee_len;
 	int ret;
-	struct bio *bio;
-	int blkbits, blocksize;
-	sector_t ee_pblock;
-	struct completion event;
-	unsigned int ee_len, len, done, offset;
 
-
-	blkbits   = inode->i_blkbits;
-	blocksize = inode->i_sb->s_blocksize;
 	ee_len    = ext4_ext_get_actual_len(ex);
 	ee_pblock = ext_pblock(ex);
 
-	/* convert ee_pblock to 512 byte sectors */
-	ee_pblock = ee_pblock << (blkbits - 9);
-
-	while (ee_len > 0) {
-
-		if (ee_len > BIO_MAX_PAGES)
-			len = BIO_MAX_PAGES;
-		else
-			len = ee_len;
-
-		bio = bio_alloc(GFP_NOIO, len);
-		if (!bio)
-			return -ENOMEM;
-
-		bio->bi_sector = ee_pblock;
-		bio->bi_bdev   = inode->i_sb->s_bdev;
-
-		done = 0;
-		offset = 0;
-		while (done < len) {
-			ret = bio_add_page(bio, ZERO_PAGE(0),
-							blocksize, offset);
-			if (ret != blocksize) {
-				/*
-				 * We can't add any more pages because of
-				 * hardware limitations.  Start a new bio.
-				 */
-				break;
-			}
-			done++;
-			offset += blocksize;
-			if (offset >= PAGE_CACHE_SIZE)
-				offset = 0;
-		}
-
-		init_completion(&event);
-		bio->bi_private = &event;
-		bio->bi_end_io = bi_complete;
-		submit_bio(WRITE, bio);
-		wait_for_completion(&event);
+	ret = sb_issue_zeroout(inode->i_sb, ee_pblock, ee_len);
+	if (ret > 0)
+		ret = 0;
 
-		if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) {
-			bio_put(bio);
-			return -EIO;
-		}
-		bio_put(bio);
-		ee_len    -= done;
-		ee_pblock += done  << (blkbits - 9);
-	}
-	return 0;
+	return ret;
 }
 
 #define EXT4_EXT_ZERO_LEN 7
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/5 v2] Lazy itable initialization for Ext4
  2010-09-08 16:59 [PATCH 0/5 v2] Lazy itable initialization for Ext4 Lukas Czerner
                   ` (4 preceding siblings ...)
  2010-09-08 16:59 ` [PATCH 5/5] Use sb_issue_discard in ext4_ext_zeroout Lukas Czerner
@ 2010-09-08 17:49 ` Lukas Czerner
  2010-09-08 21:23   ` Andreas Dilger
  2010-09-08 20:34 ` Andreas Dilger
  6 siblings, 1 reply; 16+ messages in thread
From: Lukas Czerner @ 2010-09-08 17:49 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, rwheeler, sandeen, tytso, adilger

[-- Attachment #1: Type: TEXT/PLAIN, Size: 23 bytes --]

Graph attached.

-Lukas

[-- Attachment #2: Type: IMAGE/png, Size: 5161 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/5 v2] Lazy itable initialization for Ext4
  2010-09-08 16:59 [PATCH 0/5 v2] Lazy itable initialization for Ext4 Lukas Czerner
                   ` (5 preceding siblings ...)
  2010-09-08 17:49 ` [PATCH 0/5 v2] Lazy itable initialization for Ext4 Lukas Czerner
@ 2010-09-08 20:34 ` Andreas Dilger
  2010-09-09 10:30   ` Lukas Czerner
  6 siblings, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2010-09-08 20:34 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, rwheeler, sandeen, tytso

On 2010-09-08, at 10:59, Lukas Czerner wrote:
>  Second patch adds new pair of mount
> options (inititable/noinititable), so you can enable or disable this
> feature. In default it is off (noinititable), so in order to try the new
> code you should moutn the fs like this:
typo             ^^^^^^

>  mount -o noinititable /dev/sda /mnt/
typo       ^^^

It should use "inititable" if you want to try the new code.

> To Andreas:
> You suggested the approach with reading the table first to
> determine if the device is sparse, or thinly provisioned, or trimmed SSD.
> In this case the reading would be much more efficient than writing, so it
> would be a win. But I just wonder, if we de believe the device, that
> when returning zeroes it is safe to no zero the inode table, why not do it
> at mkfs time instead of kernel ?

Good question, but I think the answer is that reading the full itable at
mke2fs time, just like writing it at mke2fs time, is _serialized_ time
spent waiting for the filesystem to become useful.  Doing it in the
background in the kernel can happen in parallel with other operations
(e.g. formatting other disks, waiting for user input from the installer,
downloading updates, etc).

> To Ted:
> You were suggesting that it would be nice if the thread will not run, or
> just quits when the system runs on the battery power. I agree that in that
> case we probably should not do this to save some battery life. But is it
> necessary, or wise to do this in kernel ? What we should do when the
> system runs on battery and user still want to run the lazy initialization
> ? I would rather let the userspace handle it. For example just remount the
> filesystem with -o noinititable.

I would tend to agree with Ted.  There will be _some_ time that the system
is plugged in to charge the battery, and this is very normal when installing
the system initially, so delaying the zeroing will not affect most users.
For the case where the user IS on battery power for some reason, I think it
is better to avoid consuming the battery in that case.

Maybe a good way to compromise is to just put the thread to sleep for 5- or
10-minute intervals while on battery power, and only start zeroing once
plugged in.  That solves the situation where (like me) the laptop stays on
for months at a time with only suspend/resume, and is only rarely rebooted,
but it is plugged in to recharge often.

Since we don't expect to need the itable zeroing unless there is corruption
of the on-disk group descriptor data, I don't think that it is urgent to do
this immediately after install.  If there is corruption within hours of
installing a system, there are more serious problems with the system that
we cannot fix.

> In my benchmark I have set different values of multipliers
> (EXT4_LI_WAIT_MULT) to see how it affects performance. As a tool for
> performance measuring I have used postmark (see parameters bellow). I have
> created average from five postmark runs to gen more stable results. In
> each run I have created ext4 filesystem on the device (with
> lazy_itable_init set properly), mounted with inititable/noinititable mount
> option and run the postmark measuring the running time and number of
> groups the ext4lazyinit thread initializes in one run. 
> 
> Type                              |NOPATCH      MULT=10      DIFF    |
> ==================================+==================================+
> Total_duration                    |130.00       132.40       1.85%   |
> Duration_of_transactions          |77.80        80.80        3.86%   |
> Transactions/s                    |642.73       618.82       -3.72%  |
> [snip]
> Read_B/s                          |21179620.40  20793522.40  -1.82%  |
> Write_B/s                         |66279880.00  65071617.60  -1.82%  |
> ==================================+==================================+
> RUNTIME:	2m13	GROUPS ZEROED: 156

This is a relatively minor overhead, and when one considers that this is
a very metadata-heavy benchmark being run immediately after reformatting
the filesystem, it is not a very realistic real-world situation.

The good (expected) news is that there is no performance impact when the
thread is not active, so this is a one-time hit.  In fairness, the
"NOPATCH" test times should include the full mke2fs time as well, if one
wants to consider the real-world impact of a much faster mke2fs run and
a slightly-slower runtime for a few minutes.

Do you have any idea of how long the zeroing takes to complete in
the case of MULT=10 without any load, as a function of the filesystem
size?  That would tell us the minimum time after startup during which
the system might be slowed down by the zeroing thread.

> The benchmark showed, that patch itself does not introduce any performance
> loss (at least for postmark), when ext4lazyinit thread is not activated.
> However, when it is activated, there is explicit performance loss due to
> inode table zeroing, but with EXT4_LI_WAIT_MULT=10 it is just about 1.8%,
> which may, or may not be much, so when I think about it now we should
> probably make this settable via sysfs. What do you think ?

I don't think it is necessary to have a sysfs parameter for this.  Instead
I would suggest making the "inititable" mount option take an optional
numeric parameter that specifies the MULT factor.  The ideal solution is
to make the zeroing happen with a MULT=100 under IO load, but run
full-out (i.e. MULT=0?) while there is no IO load.  That said, I don't
think it is critical enough to delay this patch from landing to
implement that.

Cheers, Andreas






^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] Add helper function for blkdev_issue_zeroout
  2010-09-08 16:59 ` [PATCH 1/5] Add helper function for blkdev_issue_zeroout Lukas Czerner
@ 2010-09-08 20:36   ` Andreas Dilger
  2010-09-09 11:11     ` Lukas Czerner
  0 siblings, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2010-09-08 20:36 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, rwheeler, sandeen, tytso

On 2010-09-08, at 10:59, Lukas Czerner wrote:
> +static inline int sb_issue_zeroout(struct super_block *sb,
> +				   sector_t block, sector_t nr_blocks)
> +{
> +	block <<= (sb->s_blocksize_bits - 9);
> +	nr_blocks <<= (sb->s_blocksize_bits - 9);
> +	return blkdev_issue_zeroout(sb->s_bdev, block, nr_blocks, GFP_KERNEL,
> +				   BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
> +}

While I can understand that we might need a barrier for this (to avoid
it being reordered with later writes that are using these blocks), I'm
not sure it needs to wait for previous IO to complete.

Cheers, Andreas






^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/5] Add inode table initialization code for Ext4
  2010-09-08 16:59 ` [PATCH 3/5] Add inode table initialization code for Ext4 Lukas Czerner
@ 2010-09-08 21:19   ` Andreas Dilger
  2010-09-09  9:08     ` Lukas Czerner
  0 siblings, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2010-09-08 21:19 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, rwheeler, sandeen, tytso

On 2010-09-08, at 10:59, Lukas Czerner wrote:
> +extern int ext4_init_inode_table(struct super_block *sb, ext4_group_t
> +{
> +	if (sb->s_flags & MS_RDONLY) {
> +		ext4_warning(sb, "Filesystem mounted read only. "
> +				 "Won't zeroout anything!");
> +		ret = 1;
> +		goto out;
> +	}

Is it actually needed to print this error message?  This thread shouldn't
be run if the filesystem was originally mounted read-only (that is checked
in ext4_register_li_request()), and if it is later remounted read-only
(due to user action or filesystem error) I don't think it is critical to
print a message here.

The ext4lazyinit thread will be restarted when the filesystem is remounted
read-write in the future.

> +	/*
> +	 * Skip zeroout if the inode table is full. But we set the ZEROED
> +	 * flag anyway, because obviously, when it is full it does not need
> +	 * further zeroing.
> +	 */
> +	if (unlikely(num == 0))
> +		goto skip_zeroout;

In fact, this is still safe as long as num < EXT4_INODES_PER_GROUP(sb).

> +	ext4_debug("going to zero out inode table in group %d\n",
> +		group);

the "group" should align with the '(' on the previous line.

> +/* Find next suitable group adn run ext4_init_inode_table */
> +static int ext4_run_li_request(struct ext4_li_request *elr)
> +{
> +	for (group = elr->lr_next_group; group < ngroups; group++) {
> +		gdp = ext4_get_group_desc(sb, group, NULL);
> +		if (!gdp) {
> +			ret = 1;
> +			break;
> +		}
> +
> +		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
> +			break;
> +	}

There is nowhere in the code that lr_next_group is increased.  That should
probably be done in this function.  Otherwise, for a large filesystem with
256k groups in it there will be an O(n^2) search for the next group to zero.

> +static int ext4_lazyinit_thread(void *arg)
> +{
> +	while (true) {
> +		list_for_each_safe(pos, n, &eli->li_request_list) {
> +			/* Handle jiffies overflow - is it even possible?*/
> +			if (timeout > jiffies)
> +				timeout = (ULONG_MAX - timeout) + jiffies;
> +			else
> +				timeout = jiffies - timeout;

I'm not sure this is needed.  When you compute "jiffies - timeout", it
doesn't matter if "jiffies" has suddenly wrapped, the result will be the
same.

> +static ext4_group_t ext4_has_uninit_itable(struct super_block *sb)
> +{
> +	for (group = 0; group < ngroups; group++) {
> +		gdp = ext4_get_group_desc(sb, group, NULL);
> +		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
> +			break;

It would be much more efficient to do this in ext4_check_descriptors(),
which is already walking through all of the groups checking the validity
of the group descriptors at mount time.

An alternative that would speed up mounting of large filesystems is to
do the ext4_check_descriptors() checks from the ext4lazyinit thread.
If any groups are found to be in error (which should be very rare)
they can be marked in the group descriptor with a new per-group
EXT4_BG_ERROR flag and skipped for any future allocations.  This would
need an additional in-memory flag that tracked if the group descriptor's
checksum and block pointers had been validated yet or not, so that they
are not used by anything before they are checked.  That is a change that
could be made at some later stage, after the initial patch is landed.

Cheers, Andreas






^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/5] Use sb_issue_discard in ext4_ext_zeroout
  2010-09-08 16:59 ` [PATCH 5/5] Use sb_issue_discard in ext4_ext_zeroout Lukas Czerner
@ 2010-09-08 21:21   ` Andreas Dilger
  0 siblings, 0 replies; 16+ messages in thread
From: Andreas Dilger @ 2010-09-08 21:21 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, rwheeler, sandeen, tytso

On 2010-09-08, at 10:59, Lukas Czerner wrote:
> Change ext4_ext4_zeroout to use sb_issue_discard instead of its
> own approach to zero out extents.

typo - should be ext4_ext_zeroout().


Cheers, Andreas






^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/5 v2] Lazy itable initialization for Ext4
  2010-09-08 17:49 ` [PATCH 0/5 v2] Lazy itable initialization for Ext4 Lukas Czerner
@ 2010-09-08 21:23   ` Andreas Dilger
  2010-09-09  8:29     ` Lukas Czerner
  0 siblings, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2010-09-08 21:23 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, rwheeler, sandeen, tytso

On 2010-09-08, at 11:49, Lukas Czerner wrote:
> Graph attached.

I assume the groups are "unpatched", "noinititable", "MULT=10, "MULT=5", and "MULT=2"?

Cheers, Andreas






^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/5 v2] Lazy itable initialization for Ext4
  2010-09-08 21:23   ` Andreas Dilger
@ 2010-09-09  8:29     ` Lukas Czerner
  0 siblings, 0 replies; 16+ messages in thread
From: Lukas Czerner @ 2010-09-09  8:29 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Lukas Czerner, linux-ext4, rwheeler, sandeen, tytso

[-- Attachment #1: Type: TEXT/PLAIN, Size: 286 bytes --]

On Wed, 8 Sep 2010, Andreas Dilger wrote:

> On 2010-09-08, at 11:49, Lukas Czerner wrote:
> > Graph attached.
> 
> I assume the groups are "unpatched", "noinititable", "MULT=10, "MULT=5", and "MULT=2"?
> 
> Cheers, Andreas
> 

Of course, sorry about that. Fixed graph attached.

-Lukas

[-- Attachment #2: Type: IMAGE/png, Size: 5664 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/5] Add inode table initialization code for Ext4
  2010-09-08 21:19   ` Andreas Dilger
@ 2010-09-09  9:08     ` Lukas Czerner
  0 siblings, 0 replies; 16+ messages in thread
From: Lukas Czerner @ 2010-09-09  9:08 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Lukas Czerner, linux-ext4, rwheeler, sandeen, tytso

On Wed, 8 Sep 2010, Andreas Dilger wrote:

> On 2010-09-08, at 10:59, Lukas Czerner wrote:
> > +extern int ext4_init_inode_table(struct super_block *sb, ext4_group_t
> > +{
> > +	if (sb->s_flags & MS_RDONLY) {
> > +		ext4_warning(sb, "Filesystem mounted read only. "
> > +				 "Won't zeroout anything!");
> > +		ret = 1;
> > +		goto out;
> > +	}
> 
> Is it actually needed to print this error message?  This thread shouldn't
> be run if the filesystem was originally mounted read-only (that is checked
> in ext4_register_li_request()), and if it is later remounted read-only
> (due to user action or filesystem error) I don't think it is critical to
> print a message here.
> 
> The ext4lazyinit thread will be restarted when the filesystem is remounted
> read-write in the future.

Yeah, it is not needed, because when the filesystem is mounted (or
remounted) read-only the thread either does not start or is destroyed.
I'll get rid of the error message, but I would rather keep the check
anyway, just to be sure.

> 
> > +	/*
> > +	 * Skip zeroout if the inode table is full. But we set the ZEROED
> > +	 * flag anyway, because obviously, when it is full it does not need
> > +	 * further zeroing.
> > +	 */
> > +	if (unlikely(num == 0))
> > +		goto skip_zeroout;
> 
> In fact, this is still safe as long as num < EXT4_INODES_PER_GROUP(sb).

Yes, it is safe when num is zero, but in that case the barrier would be
sent anyway, so I would rather skip it. Alternatively, I can fix
blkdev_issue_zeroout() to check whether nr_sects is zero and exit before
issuing the barrier; that might be a better solution.

> 
> > +	ext4_debug("going to zero out inode table in group %d\n",
> > +		group);
> 
> the "group" should align with the '(' on the previous line.
> 
> > +/* Find next suitable group adn run ext4_init_inode_table */
> > +static int ext4_run_li_request(struct ext4_li_request *elr)
> > +{
> > +	for (group = elr->lr_next_group; group < ngroups; group++) {
> > +		gdp = ext4_get_group_desc(sb, group, NULL);
> > +		if (!gdp) {
> > +			ret = 1;
> > +			break;
> > +		}
> > +
> > +		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
> > +			break;
> > +	}
> 
> There is nowhere in the code that lr_next_group is increased.  That should
> probably be done in this function.  Otherwise, for a large filesystem with
> 256k groups in it there will be an O(n^2) search for the next group to zero.

I can't believe I forgot to advance lr_next_group, but obviously I
did :). I'll fix that.

> 
> > +static int ext4_lazyinit_thread(void *arg)
> > +{
> > +	while (true) {
> > +		list_for_each_safe(pos, n, &eli->li_request_list) {
> > +			/* Handle jiffies overflow - is it even possible?*/
> > +			if (timeout > jiffies)
> > +				timeout = (ULONG_MAX - timeout) + jiffies;
> > +			else
> > +				timeout = jiffies - timeout;
> 
> I'm not sure this is needed.  When you compute "jiffies - timeout", it
> doesn't matter if "jiffies" has suddenly wrapped, the result will be the
> same.

You're right, what was I thinking...

> 
> > +static ext4_group_t ext4_has_uninit_itable(struct super_block *sb)
> > +{
> > +	for (group = 0; group < ngroups; group++) {
> > +		gdp = ext4_get_group_desc(sb, group, NULL);
> > +		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
> > +			break;
> 
> It would be much more efficient to do this in ext4_check_descriptors(),
> which is already walking through all of the groups checking the validity
> of the group descriptors at mount time.

Sounds good.

> 
> An alternative that would speed up mounting of large filesystems is to
> do the ext4_check_descriptors() checks from the ext4lazyinit thread.
> If any groups are found to be in error (which should be very rare)
> they can be marked in the group descriptor with a new per-group
> EXT4_BG_ERROR flag and skipped for any future allocations.  This would
> need an additional in-memory flag that tracked if the group descriptor's
> checksum and block pointers had been validated yet or not, so that they
> are not used by anything before they are checked.  That is a change that
> could be made at some later stage, after the initial patch is landed.

Right now, when descriptors are bad the filesystem is not mounted, which
is a good thing because we do not want to work with a corrupted
filesystem, and we force the user to run fsck to fix it (if that is
possible). But in your scenario we would actually be using a filesystem
with errors. Of course we would not be using the corrupted groups, but
the user will probably ignore the error and try to use the filesystem as
before. Is that a good thing? We won't be able to use data from the
corrupted groups, so what do we tell the user when he wants to work with
them?

> 
> Cheers, Andreas
> 

Thanks a lot for your comments and suggestions, Andreas, I really
appreciate it.

Regards.
-Lukas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/5 v2] Lazy itable initialization for Ext4
  2010-09-08 20:34 ` Andreas Dilger
@ 2010-09-09 10:30   ` Lukas Czerner
  0 siblings, 0 replies; 16+ messages in thread
From: Lukas Czerner @ 2010-09-09 10:30 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Lukas Czerner, linux-ext4, rwheeler, sandeen, tytso

On Wed, 8 Sep 2010, Andreas Dilger wrote:

> On 2010-09-08, at 10:59, Lukas Czerner wrote:
> >  Second patch adds new pair of mount
> > options (inititable/noinititable), so you can enable or disable this
> > feature. In default it is off (noinititable), so in order to try the new
> > code you should moutn the fs like this:
> typo             ^^^^^^
> 
> >  mount -o noinititable /dev/sda /mnt/
> typo       ^^^
> 
> It should use "inititable" if you want to try the new code.

Of course, thanks.

> 
> > To Andreas:
> > You suggested the approach with reading the table first to
> > determine if the device is sparse, or thinly provisioned, or trimmed SSD.
> > In this case the reading would be much more efficient than writing, so it
> > would be a win. But I just wonder, if we de believe the device, that
> > when returning zeroes it is safe to no zero the inode table, why not do it
> > at mkfs time instead of kernel ?
> 
> Good question, but I think the answer is that reading the full itable at
> mke2fs time, just like writing it at mke2fs time, is _serialized_ time
> spent waiting for the filesystem to become useful.  Doing it in the
> background in the kernel can happen in parallel with other operations
> (e.g. formatting other disks, waiting for user input from the installer,
> downloading updates, etc).

I think the important thing is how long it will take to verify whether
the device is sparse. Obviously (in almost all cases) we won't end up
reading all the inode tables on an ordinary physical disk, because there
will be some garbage early on, so we will just mark everything as not
zeroed.

In the case of an SSD we can just do the trim and verify that it really
returns zeroes. If there are devices that return zeroes after a trim but
return garbage after a power cycle, we do not care much about them,
because obviously if we cannot trust the device at mkfs time, we cannot
trust it in the kernel either.

In the case of sparse and thinly provisioned devices reads will be
fairly quick, so it should not take long to verify that the device
really returns zeroes for all inode tables. The question is how long
exactly it will take.

> 
> > To Ted:
> > You were suggesting that it would be nice if the thread will not run, or
> > just quits when the system runs on the battery power. I agree that in that
> > case we probably should not do this to save some battery life. But is it
> > necessary, or wise to do this in kernel ? What we should do when the
> > system runs on battery and user still want to run the lazy initialization
> > ? I would rather let the userspace handle it. For example just remount the
> > filesystem with -o noinititable.
> 
> I would tend to agree with Ted.  There will be _some_ time that the system
> is plugged in to charge the battery, and this is very normal when installing
> the system initially, so delaying the zeroing will not affect most users.
> For the case where the user IS on battery power for some reason, I think it
> is better to avoid consuming the battery in that case.
> 
> Maybe a good way to compromise is to just put the thread to sleep for 5- or
> 10-minute intervals while on battery power, and only start zeroing once
> plugged in.  That solves the situation where (like me) the laptop stays on
> for months at a time with only suspend/resume, and is only rarely rebooted,
> but it is plugged in to recharge often.
> 
> Since we don't expect to need the itable zeroing unless there is corruption
> of the on-disk group descriptor data, I don't think that it is urgent to do
> this immediately after install.  If there is corruption within hours of
> installing a system, there are more serious problems with the system that
> we cannot fix.

I still do not see why we cannot simply do

 mount -o remount,noinititable <dir>

I believe there are already daemons that adjust system settings when the
system is running on battery. I really do not like a kernel-side solution
for this, because once the policy is hardcoded in the kernel I cannot do
anything about it, even if I want to run ext4lazyinit no matter what.
We should not hard-code this in the kernel without giving the user any
way to decide on their own.

> 
> > In my benchmark I have set different values of multipliers
> > (EXT4_LI_WAIT_MULT) to see how it affects performance. As a tool for
> > performance measuring I have used postmark (see parameters bellow). I have
> > created average from five postmark runs to gen more stable results. In
> > each run I have created ext4 filesystem on the device (with
> > lazy_itable_init set properly), mounted with inititable/noinititable mount
> > option and run the postmark measuring the running time and number of
> > groups the ext4lazyinit thread initializes in one run. 
> > 
> > Type                              |NOPATCH      MULT=10      DIFF    |
> > ==================================+==================================+
> > Total_duration                    |130.00       132.40       1.85%   |
> > Duration_of_transactions          |77.80        80.80        3.86%   |
> > Transactions/s                    |642.73       618.82       -3.72%  |
> > [snip]
> > Read_B/s                          |21179620.40  20793522.40  -1.82%  |
> > Write_B/s                         |66279880.00  65071617.60  -1.82%  |
> > ==================================+==================================+
> > RUNTIME:	2m13	GROUPS ZEROED: 156
> 
> This is a relatively minor overhead, and when one considers that this is
> a very metadata-heavy benchmark being run immediately after reformatting
> the filesystem, it is not a very realistic real-world situation.
> 
> The good (expected) news is that there is no performance impact when the
> thread is not active, so this is a one-time hit.  In fairness, the
> "NOPATCH" test times should include the full mke2fs time as well, if one
> wants to consider the real-world impact of a much faster mke2fs run and
> a slightly-slower runtime for a few minutes.
> 
> Do you have any idea of how long the zeroing takes to complete in
> the case of MULT=10 without any load, as a function of the filesystem
> size?  That would tell us what the minimum time after startup that the
> system might be slowed down by the zeroing thread.

Well, that depends on how fast the device is. In this case zeroing one
single group takes approx. 28ms without any load. So we can do a little
math to figure out the average time to complete the task (assuming 4k
block size).

149GB filesystem - the one I was using in the test.
1192 groups -> 1192 inode tables
1 inode table takes (28ms zeroing + 28*10ms waiting) = 308ms
1192 inode tables take 367136ms = 367.136s = 6m7.136s

In the real test it took 6m22s, which is pretty close to my calculation.
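The arithmetic above generalizes to a quick estimate as a function of the
number of groups; a small sketch (28ms per group and MULT=10 are just the
values measured on this particular device):

```python
def lazyinit_runtime_s(groups, zero_ms_per_group=28.0, mult=10):
    """Estimate total ext4lazyinit runtime in seconds.

    Each group costs the zeroing time itself plus mult times that
    long sleeping, so the per-group cost is zero_ms * (1 + mult).
    """
    per_group_ms = zero_ms_per_group * (1 + mult)
    return groups * per_group_ms / 1000.0

# 149GB filesystem from the test above: 1192 groups
est = lazyinit_runtime_s(1192)  # -> 367.136 s, about 6m7s
```

So for a fixed device speed the estimate scales linearly with filesystem
size (number of groups) and with the MULT factor.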

> 
> > The benchmark showed, that patch itself does not introduce any performance
> > loss (at least for postmark), when ext4lazyinit thread is not activated.
> > However, when it is activated, there is explicit performance loss due to
> > inode table zeroing, but with EXT4_LI_WAIT_MULT=10 it is just about 1.8%,
> > which may, or may not be much, so when I think about it now we should
> > probably make this settable via sysfs. What do you think ?
> 
> I don't think it is necessary to have a sysfs parameter for this.  Instead
> I would suggest making the "inititable" mount option take an optional
> numeric parameter that specifies the MULT factor.  The ideal solution is
> to make the zeroing happen with a MULT=100 under IO load, but run full-out
> (i.e. MULT=0?) while there is no IO load.  That said, I don't think it is
> critical enough to delay this patch from landing to implement that.

Right, a mount option parameter would be even better.

> 
> Cheers, Andreas
> 

Thanks!
-Lukas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] Add helper function for blkdev_issue_zeroout
  2010-09-08 20:36   ` Andreas Dilger
@ 2010-09-09 11:11     ` Lukas Czerner
  0 siblings, 0 replies; 16+ messages in thread
From: Lukas Czerner @ 2010-09-09 11:11 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Lukas Czerner, linux-ext4, rwheeler, sandeen, tytso

On Wed, 8 Sep 2010, Andreas Dilger wrote:

> On 2010-09-08, at 10:59, Lukas Czerner wrote:
> > +static inline int sb_issue_zeroout(struct super_block *sb,
> > +				   sector_t block, sector_t nr_blocks)
> > +{
> > +	block <<= (sb->s_blocksize_bits - 9);
> > +	nr_blocks <<= (sb->s_blocksize_bits - 9);
> > +	return blkdev_issue_zeroout(sb->s_bdev, block, nr_blocks, GFP_KERNEL,
> > +				   BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
> > +}
> 
> While I can understand that we might need a barrier for this (to avoid
> it being reordered with later writes that are using these blocks), I'm
> not sure it needs to wait for previous IO to complete.

We are waiting for the submitted bios to finish. In my patch this is
needed, because I need to know how long it takes to get this IO onto the
disk in order to determine the next schedule time, depending on the IO
load. I am not sure about this, but if it did not wait it would complete
fairly quickly and my "IO throttling" code would be useless.
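To make the dependency on waiting explicit, the throttling idea (greatly
simplified from the actual ext4lazyinit thread code) looks like this sketch:

```python
EXT4_LI_WAIT_MULT = 10  # multiplier between measured IO time and sleep time

def next_sleep_ms(io_elapsed_ms, mult=EXT4_LI_WAIT_MULT):
    # The longer the synchronous zeroout took (i.e. the busier the
    # disk is), the longer the thread backs off before zeroing the
    # next group's inode table.
    return io_elapsed_ms * mult

# If sb_issue_zeroout() returned without waiting for completion, the
# measured elapsed time would always be near zero and the throttle
# would never back off under load.
```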

And if I am wrong, and you are really convinced that it should not wait,
what about sb_issue_discard()? It also waits for completion; is that OK?

Regards.
-Lukas

> 
> Cheers, Andreas
> 


end of thread, other threads:[~2010-09-09 11:11 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-08 16:59 [PATCH 0/5 v2] Lazy itable initialization for Ext4 Lukas Czerner
2010-09-08 16:59 ` [PATCH 1/5] Add helper function for blkdev_issue_zeroout Lukas Czerner
2010-09-08 20:36   ` Andreas Dilger
2010-09-09 11:11     ` Lukas Czerner
2010-09-08 16:59 ` [PATCH 2/5] Add inititable/noinititable mount options for ext4 Lukas Czerner
2010-09-08 16:59 ` [PATCH 3/5] Add inode table initialization code for Ext4 Lukas Czerner
2010-09-08 21:19   ` Andreas Dilger
2010-09-09  9:08     ` Lukas Czerner
2010-09-08 16:59 ` [PATCH 4/5] Use sb_issue_zeroout in setup_new_group_blocks Lukas Czerner
2010-09-08 16:59 ` [PATCH 5/5] Use sb_issue_discard in ext4_ext_zeroout Lukas Czerner
2010-09-08 21:21   ` Andreas Dilger
2010-09-08 17:49 ` [PATCH 0/5 v2] Lazy itable initialization for Ext4 Lukas Czerner
2010-09-08 21:23   ` Andreas Dilger
2010-09-09  8:29     ` Lukas Czerner
2010-09-08 20:34 ` Andreas Dilger
2010-09-09 10:30   ` Lukas Czerner
