* [PATCH 0/6 v4] Lazy itable initialization for Ext4
@ 2010-09-16 12:47 Lukas Czerner
  2010-09-16 12:47 ` [PATCH 1/6] Add helper function for blkdev_issue_zeroout Lukas Czerner
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Lukas Czerner @ 2010-09-16 12:47 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, rwheeler, sandeen, adilger, lczerner, snitzer

Hi,

as Mike suggested, I have rebased patch #1 against Jens'
linux-2.6-block.git 'for-next' branch, changed sb_issue_zeroout()
to cope with the new blkdev_issue_zeroout(), and switched every
sb_issue_zeroout() caller in this series to the new syntax.
Some typos have been fixed as well.

-Lukas




* [PATCH 1/6] Add helper function for blkdev_issue_zeroout
  2010-09-16 12:47 [PATCH 0/6 v4] Lazy itable initialization for Ext4 Lukas Czerner
@ 2010-09-16 12:47 ` Lukas Czerner
  2010-09-16 12:47 ` [PATCH 2/6] Add inititable/noinititable mount options for ext4 Lukas Czerner
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Lukas Czerner @ 2010-09-16 12:47 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, rwheeler, sandeen, adilger, lczerner, snitzer

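Add sb_issue_zeroout() as a helper for blkdev_issue_zeroout(), the same
way sb_issue_discard() is a helper for blkdev_issue_discard(). It converts
filesystem block numbers to 512-byte sectors before calling down.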
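
For readers unfamiliar with the shift used in this helper, here is a minimal
user-space sketch of the block-to-sector conversion. The names
fs_blocks_to_sectors and SECTOR_SHIFT are hypothetical stand-ins chosen for
this illustration and are not part of the patch; blocksize_bits plays the
role of sb->s_blocksize_bits.

```c
#include <assert.h>
#include <stdint.h>

/* 512-byte sectors are 2^9 bytes each (assumed name, illustration only). */
#define SECTOR_SHIFT 9

/*
 * Hypothetical stand-in for the arithmetic in sb_issue_zeroout():
 * scale a filesystem block number (or block count) to 512-byte sectors.
 */
uint64_t fs_blocks_to_sectors(uint64_t blocks, unsigned int blocksize_bits)
{
	/* A filesystem block is never smaller than one sector. */
	assert(blocksize_bits >= SECTOR_SHIFT);
	return blocks << (blocksize_bits - SECTOR_SHIFT);
}
```

With 4 KiB blocks (blocksize_bits = 12) each filesystem block spans 8
sectors, so block 100 starts at sector 800.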

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 include/linux/blkdev.h |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e7ddd6b..e37e82d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -895,6 +895,13 @@ static inline int sb_issue_discard(struct super_block *sb, sector_t block,
 				    nr_blocks << (sb->s_blocksize_bits - 9),
 				    gfp_mask, flags);
 }
+static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
+		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
+{
+	return blkdev_issue_zeroout(sb->s_bdev, block << (sb->s_blocksize_bits - 9),
+				    nr_blocks << (sb->s_blocksize_bits - 9),
+				    gfp_mask, flags);
+}
 
 extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
 
-- 
1.7.2.2



* [PATCH 2/6] Add inititable/noinititable mount options for ext4
  2010-09-16 12:47 [PATCH 0/6 v4] Lazy itable initialization for Ext4 Lukas Czerner
  2010-09-16 12:47 ` [PATCH 1/6] Add helper function for blkdev_issue_zeroout Lukas Czerner
@ 2010-09-16 12:47 ` Lukas Czerner
  2010-09-27 18:35   ` Ted Ts'o
  2010-09-16 12:47 ` [PATCH 3/6] Add inode table initialization code for Ext4 Lukas Czerner
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Lukas Czerner @ 2010-09-16 12:47 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, rwheeler, sandeen, adilger, lczerner, snitzer

Add a new mount flag, EXT4_MOUNT_INIT_INODE_TABLE, and a new pair of mount
options (inititable/noinititable). When mounted with inititable, the file
system should try to initialize uninitialized inode tables; otherwise it
should not initialize them. For now, the default is noinititable.

One can also specify inititable=n, where n is a number that will be used
as the wait multiplier (see the "Add inode table initialization code for
Ext4" patch for more info). A bigger number means slower inode table
initialization and thus less impact on performance, but a longer
initialization (default is 10).
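
The option semantics described above can be sketched in user space as
follows. This is only an illustration under stated assumptions: the patch
itself uses the kernel's match_table_t and match_int(), while the function
parse_inititable and the macro DEF_LI_WAIT_MULT below are hypothetical names
that use strtoul() as a stand-in.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Mirrors the default of 10 from the patch (hypothetical name). */
#define DEF_LI_WAIT_MULT 10

/*
 * Hypothetical parser for "inititable" and "inititable=n".
 * Returns 0 and stores the wait multiplier on success, -1 otherwise.
 */
int parse_inititable(const char *opt, unsigned int *mult)
{
	static const char name[] = "inititable";
	const size_t len = sizeof(name) - 1;
	unsigned long val;
	char *end;

	assert(opt != NULL && mult != NULL);

	if (strncmp(opt, name, len) != 0)
		return -1;
	if (opt[len] == '\0') {
		/* Bare option: fall back to the default multiplier. */
		*mult = DEF_LI_WAIT_MULT;
		return 0;
	}
	/* Require "=<digits>"; this also rejects signs and empty values. */
	if (opt[len] != '=' || opt[len + 1] < '0' || opt[len + 1] > '9')
		return -1;

	errno = 0;
	val = strtoul(opt + len + 1, &end, 10);
	if (errno == ERANGE || *end != '\0' || val > UINT_MAX)
		return -1;

	*mult = (unsigned int)val;
	return 0;
}
```

In this sketch "inititable" yields 10 and "inititable=25" yields 25, while
"noinititable" is not matched and would be handled as a separate option.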

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/ext4.h  |    1 +
 fs/ext4/super.c |   22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 889ec9d..9600897 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -889,6 +889,7 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_DATA_ERR_ABORT	0x10000000 /* Abort on file data write */
 #define EXT4_MOUNT_BLOCK_VALIDITY	0x20000000 /* Block validity checking */
 #define EXT4_MOUNT_DISCARD		0x40000000 /* Issue DISCARD requests */
+#define EXT4_MOUNT_INIT_INODE_TABLE	0x80000000 /* Initialize uninitialized itables */
 
 #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt
 #define set_opt(o, opt)			o |= EXT4_MOUNT_##opt
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2614774..c15e84d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1045,6 +1045,10 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	    !(def_mount_opts & EXT4_DEFM_BLOCK_VALIDITY))
 		seq_puts(seq, ",block_validity");
 
+	if (test_opt(sb, INIT_INODE_TABLE))
+		seq_printf(seq, ",inititable=%u",
+			   (unsigned) sbi->s_li_wait_mult);
+
 	ext4_show_quota_options(seq, sb);
 
 	return 0;
@@ -1219,6 +1223,7 @@ enum {
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard,
+	Opt_init_inode_table, Opt_noinit_inode_table,
 };
 
 static const match_table_t tokens = {
@@ -1289,6 +1294,9 @@ static const match_table_t tokens = {
 	{Opt_dioread_lock, "dioread_lock"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
+	{Opt_init_inode_table, "inititable=%u"},
+	{Opt_init_inode_table, "inititable"},
+	{Opt_noinit_inode_table, "noinititable"},
 	{Opt_err, NULL},
 };
 
@@ -1759,6 +1767,20 @@ set_qf_format:
 		case Opt_dioread_lock:
 			clear_opt(sbi->s_mount_opt, DIOREAD_NOLOCK);
 			break;
+		case Opt_init_inode_table:
+			set_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
+			if (args[0].from) {
+				if (match_int(&args[0], &option))
+					return 0;
+			} else
+				option = EXT4_DEF_LI_WAIT_MULT;
+			if (option < 0)
+				return 0;
+			sbi->s_li_wait_mult = option;
+			break;
+		case Opt_noinit_inode_table:
+			clear_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
+			break;
 		default:
 			ext4_msg(sb, KERN_ERR,
 			       "Unrecognized mount option \"%s\" "
-- 
1.7.2.2



* [PATCH 3/6] Add inode table initialization code for Ext4
  2010-09-16 12:47 [PATCH 0/6 v4] Lazy itable initialization for Ext4 Lukas Czerner
  2010-09-16 12:47 ` [PATCH 1/6] Add helper function for blkdev_issue_zeroout Lukas Czerner
  2010-09-16 12:47 ` [PATCH 2/6] Add inititable/noinititable mount options for ext4 Lukas Czerner
@ 2010-09-16 12:47 ` Lukas Czerner
  2010-09-16 12:47 ` [PATCH 4/6] Use sb_issue_zeroout in setup_new_group_blocks Lukas Czerner
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Lukas Czerner @ 2010-09-16 12:47 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, rwheeler, sandeen, adilger, lczerner, snitzer

When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out and thus contain some old data. When such a filesystem is
mounted, the filesystem code should initialize (zero out) the inode tables.
So far this code was missing for ext4; this patch adds it.

For the purpose of zeroing inode tables it introduces a new kernel thread
called ext4lazyinit, which is created on demand and destroyed when it
is no longer needed. There is only one thread for all ext4
filesystems in the system. When the first filesystem with the inititable
mount option is mounted, the ext4lazyinit thread is created; the
filesystem can then register its request in the request list.

This thread then walks through the list of requests, picking up scheduled
requests and invoking ext4_init_inode_table(). The next schedule time for
a request is computed by multiplying the time it took to zero out the last
inode table by the wait multiplier, which can be set with the
inititable=n mount option (default is 10). We do this so that we do
not take the whole I/O bandwidth. When the thread is no longer
necessary (the request list is empty), it frees the appropriate structures
and exits (and can be created later by another filesystem).
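
The scheduling rule described above can be sketched as a small user-space
function. The name next_sched_time and its parameters are hypothetical;
start and now stand in for the jiffies values read before and after one
inode table has been zeroed.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical sketch: the next run is scheduled "elapsed * wait_mult"
 * ticks after the zeroing of one inode table finished.
 */
unsigned long next_sched_time(unsigned long start, unsigned long now,
			      unsigned int wait_mult)
{
	/* Unsigned subtraction stays correct if the tick counter wrapped. */
	unsigned long elapsed = now - start;

	/* Guard the multiplication against overflow in this sketch. */
	assert(wait_mult == 0 || elapsed <= ULONG_MAX / wait_mult);
	return now + elapsed * wait_mult;
}
```

Assuming a single registered filesystem, a zeroing pass that takes t ticks
is followed by a pause of about t * wait_mult ticks, so the thread is busy
for roughly 1 / (1 + wait_mult) of the time, about 9% with the default of 10.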

We do not disturb regular inode allocations in any way; the allocator does
not care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. We also have to prevent new inode
allocations from the group while zeroing is under way. For that we
take alloc_sem for writing in ext4_init_inode_table() and for reading
in ext4_claim_inode(), so when we are unlucky and the allocator hits the
group which is currently being zeroed, it just has to wait.
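
As an illustration of how used inodes are skipped, the following user-space
sketch reproduces the range arithmetic from ext4_init_inode_table(). The
struct itable_range and the function itable_zero_range are hypothetical
names for this example. It assumes, as the patch does, that used inodes sit
at the front of the table, so only the blocks after them are zeroed.

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Blocks to zero, as an offset from the start of the group's inode table. */
struct itable_range {
	unsigned int first;	/* first block to zero */
	unsigned int count;	/* number of blocks to zero */
};

/*
 * Hypothetical sketch: skip the blocks that hold used inodes unless the
 * inode bitmap is still uninitialized (then nothing is in use yet).
 */
struct itable_range itable_zero_range(unsigned int inodes_per_group,
				      unsigned int itable_unused,
				      unsigned int inodes_per_block,
				      unsigned int itb_per_group,
				      int bitmap_uninit)
{
	struct itable_range r;
	unsigned int used_blks = 0;

	assert(inodes_per_block > 0 && itable_unused <= inodes_per_group);

	if (!bitmap_uninit)
		used_blks = DIV_ROUND_UP(inodes_per_group - itable_unused,
					 inodes_per_block);

	/* A sane group never uses more blocks than its table has. */
	assert(used_blks <= itb_per_group);
	r.first = used_blks;
	r.count = itb_per_group - used_blks;
	return r;
}
```

For example, with 8192 inodes per group, 16 inodes per block and 8000 unused
inodes, the 192 used inodes occupy 12 blocks, so zeroing starts at block 12
of the table and covers the remaining 500 of its 512 blocks.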

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/ext4.h   |   38 +++++
 fs/ext4/ialloc.c |  116 +++++++++++++++
 fs/ext4/super.c  |  415 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 566 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9600897..96884c5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1173,6 +1173,11 @@ struct ext4_sb_info {
 
 	/* timer for periodic error stats printing */
 	struct timer_list s_err_report;
+
+	/* Lazy inode table initialization info */
+	struct ext4_li_request *s_li_request;
+	/* Wait multiplier for lazy initialization thread */
+	unsigned int s_li_wait_mult;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
@@ -1537,6 +1542,37 @@ void ext4_get_group_no_and_offset(struct super_block *sb, ext4_fsblk_t blocknr,
 extern struct proc_dir_entry *ext4_proc_root;
 
 /*
+ * Timeout and state flag for lazy initialization inode thread.
+ */
+#define EXT4_DEF_LI_WAIT_MULT			10
+#define EXT4_DEF_LI_MAX_START_DELAY		5
+#define EXT4_LAZYINIT_QUIT			0x0001
+#define EXT4_LAZYINIT_RUNNING			0x0002
+
+/*
+ * Lazy inode table initialization info
+ */
+struct ext4_lazy_init {
+	unsigned long		li_state;
+
+	wait_queue_head_t	li_wait_daemon;
+	wait_queue_head_t	li_wait_task;
+	struct timer_list	li_timer;
+	struct task_struct	*li_task;
+
+	struct list_head	li_request_list;
+	struct mutex		li_list_mtx;
+};
+
+struct ext4_li_request {
+	struct super_block	*lr_super;
+	struct ext4_sb_info	*lr_sbi;
+	ext4_group_t		lr_next_group;
+	struct list_head	lr_request;
+	unsigned long		lr_next_sched;
+};
+
+/*
  * Function prototypes
  */
 
@@ -1611,6 +1647,8 @@ extern unsigned ext4_init_inode_bitmap(struct super_block *sb,
 				       ext4_group_t group,
 				       struct ext4_group_desc *desc);
 extern void mark_bitmap_end(int start_bit, int end_bit, char *bitmap);
+extern int ext4_init_inode_table(struct super_block *sb,
+				 ext4_group_t group);
 
 /* mballoc.c */
 extern long ext4_mb_stats;
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 45853e0..ea3ba70 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -107,6 +107,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
 	desc = ext4_get_group_desc(sb, block_group, NULL);
 	if (!desc)
 		return NULL;
+
 	bitmap_blk = ext4_inode_bitmap(sb, desc);
 	bh = sb_getblk(sb, bitmap_blk);
 	if (unlikely(!bh)) {
@@ -123,6 +124,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
 		unlock_buffer(bh);
 		return bh;
 	}
+
 	ext4_lock_group(sb, block_group);
 	if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
 		ext4_init_inode_bitmap(sb, bh, block_group, desc);
@@ -133,6 +135,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
 		return bh;
 	}
 	ext4_unlock_group(sb, block_group);
+
 	if (buffer_uptodate(bh)) {
 		/*
 		 * if not uninit if bh is uptodate,
@@ -712,8 +715,17 @@ static int ext4_claim_inode(struct super_block *sb,
 {
 	int free = 0, retval = 0, count;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
 	struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group, NULL);
 
+	/*
+	 * We have to be sure that new inode allocation does not race with
+	 * inode table initialization, because otherwise we may end up
+	 * allocating and writing new inode right before sb_issue_zeroout
+	 * takes place and overwriting our new inode with zeroes. So we
+	 * take alloc_sem to prevent it.
+	 */
+	down_read(&grp->alloc_sem);
 	ext4_lock_group(sb, group);
 	if (ext4_set_bit(ino, inode_bitmap_bh->b_data)) {
 		/* not a free inode */
@@ -724,6 +736,7 @@ static int ext4_claim_inode(struct super_block *sb,
 	if ((group == 0 && ino < EXT4_FIRST_INO(sb)) ||
 			ino > EXT4_INODES_PER_GROUP(sb)) {
 		ext4_unlock_group(sb, group);
+		up_read(&grp->alloc_sem);
 		ext4_error(sb, "reserved inode or inode > inodes count - "
 			   "block_group = %u, inode=%lu", group,
 			   ino + group * EXT4_INODES_PER_GROUP(sb));
@@ -772,6 +785,7 @@ static int ext4_claim_inode(struct super_block *sb,
 	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
 err_ret:
 	ext4_unlock_group(sb, group);
+	up_read(&grp->alloc_sem);
 	return retval;
 }
 
@@ -1205,3 +1219,105 @@ unsigned long ext4_count_dirs(struct super_block * sb)
 	}
 	return count;
 }
+
+/*
+ * Zeroes not yet zeroed inode table - just write zeroes through the whole
+ * inode table. Must be called without any spinlock held. The only place
+ * where it is called from on active part of filesystem is ext4lazyinit
+ * thread, so we do not need any special locks, however we have to prevent
+ * inode allocation from the current group, so we take alloc_sem lock, to
+ * block ext4_claim_inode until we are finished.
+ */
+extern int ext4_init_inode_table(struct super_block *sb, ext4_group_t group)
+{
+	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_group_desc *gdp = NULL;
+	struct buffer_head *group_desc_bh;
+	handle_t *handle;
+	ext4_fsblk_t blk;
+	int num, ret = 0, used_blks = 0;
+
+	/* This should not happen, but just to be sure check this */
+	if (sb->s_flags & MS_RDONLY) {
+		ret = 1;
+		goto out;
+	}
+
+	gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
+	if (!gdp)
+		goto out;
+
+	/*
+	 * We do not need to lock this, because we are the only one
+	 * handling this flag.
+	 */
+	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED))
+		goto out;
+
+	handle = ext4_journal_start_sb(sb, 1);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		goto out;
+	}
+
+	down_write(&grp->alloc_sem);
+	/*
+	 * If inode bitmap was already initialized there may be some
+	 * used inodes so we need to skip blocks with used inodes in
+	 * inode table.
+	 */
+	if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)))
+		used_blks = DIV_ROUND_UP((EXT4_INODES_PER_GROUP(sb) -
+			    ext4_itable_unused_count(sb, gdp)),
+			    sbi->s_inodes_per_block);
+
+	blk = ext4_inode_table(sb, gdp) + used_blks;
+	num = sbi->s_itb_per_group - used_blks;
+
+	BUFFER_TRACE(group_desc_bh, "get_write_access");
+	ret = ext4_journal_get_write_access(handle,
+					    group_desc_bh);
+	if (ret)
+		goto err_out;
+
+	if (unlikely(num > EXT4_INODES_PER_GROUP(sb))) {
+		ext4_error(sb, "Something is wrong with group %u\n"
+			   "Used itable blocks: %d"
+			   "Itable blocks per group: %lu\n",
+			   group, used_blks, sbi->s_itb_per_group);
+		ret = 1;
+		goto err_out;
+	}
+
+	/*
+	 * Skip zeroout if the inode table is full. But we set the ZEROED
+	 * flag anyway, because obviously, when it is full it does not need
+	 * further zeroing.
+	 */
+	if (unlikely(num == 0))
+		goto skip_zeroout;
+
+	ext4_debug("going to zero out inode table in group %d\n",
+		   group);
+	ret = sb_issue_zeroout(sb, blk, num, GFP_NOFS, BLKDEV_IFL_WAIT);
+	if (ret < 0)
+		goto err_out;
+
+skip_zeroout:
+	ext4_lock_group(sb, group);
+	gdp->bg_flags |= cpu_to_le16(EXT4_BG_INODE_ZEROED);
+	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
+	ext4_unlock_group(sb, group);
+
+	BUFFER_TRACE(group_desc_bh,
+		     "call ext4_handle_dirty_metadata");
+	ret = ext4_handle_dirty_metadata(handle, NULL,
+					 group_desc_bh);
+
+err_out:
+	up_write(&grp->alloc_sem);
+	ext4_journal_stop(handle);
+out:
+	return ret;
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c15e84d..2b53a48 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -41,6 +41,9 @@
 #include <linux/crc16.h>
 #include <asm/uaccess.h>
 
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+
 #include "ext4.h"
 #include "ext4_jbd2.h"
 #include "xattr.h"
@@ -52,6 +55,8 @@
 
 struct proc_dir_entry *ext4_proc_root;
 static struct kset *ext4_kset;
+struct ext4_lazy_init *ext4_li_info;
+struct mutex ext4_li_mtx;
 
 static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
 			     unsigned long journal_devnum);
@@ -70,6 +75,8 @@ static void ext4_write_super(struct super_block *sb);
 static int ext4_freeze(struct super_block *sb);
 static int ext4_get_sb(struct file_system_type *fs_type, int flags,
 		       const char *dev_name, void *data, struct vfsmount *mnt);
+static void ext4_destroy_lazyinit_thread(void);
+static void ext4_unregister_li_request(struct super_block *sb);
 
 #if !defined(CONFIG_EXT3_FS) && !defined(CONFIG_EXT3_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT23)
 static struct file_system_type ext3_fs_type = {
@@ -719,6 +726,7 @@ static void ext4_put_super(struct super_block *sb)
 			ext4_abort(sb, "Couldn't clean up the journal");
 	}
 
+	ext4_unregister_li_request(sb);
 	ext4_release_system_zone(sb);
 	ext4_mb_release(sb);
 	ext4_ext_release(sb);
@@ -1964,7 +1972,8 @@ int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 block_group,
 }
 
 /* Called at mount-time, super-block is locked */
-static int ext4_check_descriptors(struct super_block *sb)
+static int ext4_check_descriptors(struct super_block *sb,
+				  ext4_group_t *first_not_zeroed)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	ext4_fsblk_t first_block = le32_to_cpu(sbi->s_es->s_first_data_block);
@@ -1973,7 +1982,7 @@ static int ext4_check_descriptors(struct super_block *sb)
 	ext4_fsblk_t inode_bitmap;
 	ext4_fsblk_t inode_table;
 	int flexbg_flag = 0;
-	ext4_group_t i;
+	ext4_group_t i, grp = sbi->s_groups_count;
 
 	if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG))
 		flexbg_flag = 1;
@@ -1989,6 +1998,10 @@ static int ext4_check_descriptors(struct super_block *sb)
 			last_block = first_block +
 				(EXT4_BLOCKS_PER_GROUP(sb) - 1);
 
+		if ((grp == sbi->s_groups_count) &&
+		   !(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
+			grp = i;
+
 		block_bitmap = ext4_block_bitmap(sb, gdp);
 		if (block_bitmap < first_block || block_bitmap > last_block) {
 			ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
@@ -2026,6 +2039,8 @@ static int ext4_check_descriptors(struct super_block *sb)
 		if (!flexbg_flag)
 			first_block += EXT4_BLOCKS_PER_GROUP(sb);
 	}
+	if (NULL != first_not_zeroed)
+		*first_not_zeroed = grp;
 
 	ext4_free_blocks_count_set(sbi->s_es, ext4_count_free_blocks(sb));
 	sbi->s_es->s_free_inodes_count =cpu_to_le32(ext4_count_free_inodes(sb));
@@ -2564,6 +2579,378 @@ static void print_daily_error_info(unsigned long arg)
 	mod_timer(&sbi->s_err_report, jiffies + 24*60*60*HZ);  /* Once a day */
 }
 
+static void ext4_lazyinode_timeout(unsigned long data)
+{
+	struct task_struct *p = (struct task_struct *)data;
+	wake_up_process(p);
+}
+
+/* Find next suitable group and run ext4_init_inode_table */
+static int ext4_run_li_request(struct ext4_li_request *elr)
+{
+	struct ext4_group_desc *gdp = NULL;
+	ext4_group_t group, ngroups;
+	struct super_block *sb;
+	int ret = 0;
+
+	sb = elr->lr_super;
+	ngroups = EXT4_SB(sb)->s_groups_count;
+
+	for (group = elr->lr_next_group; group < ngroups; group++) {
+		gdp = ext4_get_group_desc(sb, group, NULL);
+		if (!gdp) {
+			ret = 1;
+			break;
+		}
+
+		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
+			break;
+	}
+
+	if (group == ngroups)
+		ret = 1;
+
+	if (!ret) {
+		ret = ext4_init_inode_table(sb, group);
+		elr->lr_next_group = group + 1;
+	}
+
+	return ret;
+}
+
+/*
+ * Remove lr_request from the list_request and free the
+ * request structure. Should be called with li_list_mtx held
+ */
+static void ext4_remove_li_request(struct ext4_li_request *elr)
+{
+	struct ext4_sb_info *sbi;
+
+	if (!elr)
+		return;
+
+	sbi = elr->lr_sbi;
+
+	list_del(&elr->lr_request);
+	sbi->s_li_request = NULL;
+	kfree(elr);
+}
+
+static void ext4_unregister_li_request(struct super_block *sb)
+{
+	struct ext4_li_request *elr = EXT4_SB(sb)->s_li_request;
+
+	if (!ext4_li_info)
+		return;
+
+	mutex_lock(&ext4_li_info->li_list_mtx);
+	ext4_remove_li_request(elr);
+	mutex_unlock(&ext4_li_info->li_list_mtx);
+}
+
+/*
+ * This is the function where ext4lazyinit thread lives. It walks
+ * through the request list searching for next scheduled filesystem.
+ * When such a fs is found, run the lazy initialization request
+ * (ext4_run_li_request) and keep track of the time spent in this
+ * function. Based on that time we compute next schedule time of
+ * the request. When walking through the list is complete, compute
+ * next waking time and put itself into sleep.
+ */
+static int ext4_lazyinit_thread(void *arg)
+{
+	struct ext4_lazy_init *eli = (struct ext4_lazy_init *)arg;
+	struct list_head *pos, *n;
+	struct ext4_li_request *elr;
+	struct ext4_sb_info *sbi;
+	unsigned long next_wakeup;
+	unsigned long timeout = 0;
+	int ret;
+
+	BUG_ON(NULL == eli);
+
+	eli->li_timer.data = (unsigned long)current;
+	eli->li_timer.function = ext4_lazyinode_timeout;
+
+	eli->li_task = current;
+	wake_up(&eli->li_wait_task);
+
+cont_thread:
+	while (true) {
+		next_wakeup = ULONG_MAX;
+
+		mutex_lock(&eli->li_list_mtx);
+		if (list_empty(&eli->li_request_list)) {
+			mutex_unlock(&eli->li_list_mtx);
+			goto exit_thread;
+		}
+
+		list_for_each_safe(pos, n, &eli->li_request_list) {
+			elr = list_entry(pos, struct ext4_li_request,
+					 lr_request);
+
+			if (time_before_eq(jiffies, elr->lr_next_sched))
+				continue;
+			sbi = elr->lr_sbi;
+
+			timeout = jiffies;
+			ret = ext4_run_li_request(elr);
+			timeout = (jiffies - timeout) * sbi->s_li_wait_mult;
+
+			if (ret) {
+				ext4_remove_li_request(elr);
+				continue;
+			}
+
+			elr->lr_next_sched = jiffies + timeout;
+			if (elr->lr_next_sched < next_wakeup)
+				next_wakeup = elr->lr_next_sched;
+		}
+		mutex_unlock(&eli->li_list_mtx);
+
+		/*
+		 * We need to check this otherwise we may end up sleeping
+		 * for very long time.
+		 */
+		if (jiffies >= next_wakeup) {
+			cond_resched();
+			continue;
+		}
+
+		eli->li_timer.expires = next_wakeup;
+		add_timer(&eli->li_timer);
+
+		if (freezing(current)) {
+			refrigerator();
+		} else {
+			DEFINE_WAIT(wait);
+			prepare_to_wait(&eli->li_wait_daemon, &wait,
+					TASK_INTERRUPTIBLE);
+			schedule();
+			finish_wait(&eli->li_wait_daemon, &wait);
+		}
+	}
+
+exit_thread:
+	/*
+	 * It looks like the request list is empty, but we need
+	 * to check it under the li_list_mtx lock, to prevent any
+	 * additions into it, and of course we should lock ext4_li_mtx
+	 * to atomically free the list and ext4_li_info, because at
+	 * this point another ext4 filesystem could be registering
+	 * new one.
+	 */
+	mutex_lock(&ext4_li_mtx);
+	mutex_lock(&eli->li_list_mtx);
+	if (!list_empty(&eli->li_request_list)) {
+		mutex_unlock(&eli->li_list_mtx);
+		mutex_unlock(&ext4_li_mtx);
+		goto cont_thread;
+	}
+	mutex_unlock(&eli->li_list_mtx);
+	del_timer_sync(&ext4_li_info->li_timer);
+	eli->li_task = NULL;
+	wake_up(&eli->li_wait_task);
+
+	kfree(ext4_li_info);
+	ext4_li_info = NULL;
+	mutex_unlock(&ext4_li_mtx);
+
+	return 0;
+}
+
+static void ext4_clear_request_list(void)
+{
+	struct list_head *pos, *n;
+	struct ext4_li_request *elr;
+
+	mutex_lock(&ext4_li_info->li_list_mtx);
+	/* No early return on an empty list: it would leave li_list_mtx */
+	/* locked, and list_for_each_safe() below is then a no-op anyway. */
+
+	list_for_each_safe(pos, n, &ext4_li_info->li_request_list) {
+		elr = list_entry(pos, struct ext4_li_request,
+				 lr_request);
+		ext4_remove_li_request(elr);
+	}
+	mutex_unlock(&ext4_li_info->li_list_mtx);
+}
+
+static int ext4_run_lazyinit_thread(void)
+{
+	struct task_struct *t;
+
+	t = kthread_run(ext4_lazyinit_thread, ext4_li_info, "ext4lazyinit");
+	if (IS_ERR(t)) {
+		int err = PTR_ERR(t);
+		ext4_clear_request_list();
+		del_timer_sync(&ext4_li_info->li_timer);
+		kfree(ext4_li_info);
+		ext4_li_info = NULL;
+		printk(KERN_CRIT "EXT4: error %d creating inode table "
+				 "initialization thread\n",
+				 err);
+		return err;
+	}
+	ext4_li_info->li_state |= EXT4_LAZYINIT_RUNNING;
+
+	wait_event(ext4_li_info->li_wait_task, ext4_li_info->li_task != NULL);
+	return 0;
+}
+
+/*
+ * Check whether it makes sense to run itable init. thread or not.
+ * If there is at least one uninitialized inode table, return
+ * corresponding group number, else the loop goes through all
+ * groups and return total number of groups.
+ */
+static ext4_group_t ext4_has_uninit_itable(struct super_block *sb)
+{
+	ext4_group_t group, ngroups = EXT4_SB(sb)->s_groups_count;
+	struct ext4_group_desc *gdp = NULL;
+
+	for (group = 0; group < ngroups; group++) {
+		gdp = ext4_get_group_desc(sb, group, NULL);
+		if (!gdp)
+			continue;
+
+		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
+			break;
+	}
+
+	return group;
+}
+
+static int ext4_li_info_new(void)
+{
+	struct ext4_lazy_init *eli = NULL;
+
+	eli = kzalloc(sizeof(*eli), GFP_KERNEL);
+	if (!eli)
+		return -ENOMEM;
+
+	eli->li_task = NULL;
+	INIT_LIST_HEAD(&eli->li_request_list);
+	mutex_init(&eli->li_list_mtx);
+
+	init_waitqueue_head(&eli->li_wait_daemon);
+	init_waitqueue_head(&eli->li_wait_task);
+	init_timer(&eli->li_timer);
+	eli->li_state |= EXT4_LAZYINIT_QUIT;
+
+	ext4_li_info = eli;
+
+	return 0;
+}
+
+static struct ext4_li_request *ext4_li_request_new(struct super_block *sb,
+					    ext4_group_t start)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_li_request *elr;
+	unsigned long rnd;
+
+	elr = kzalloc(sizeof(*elr), GFP_KERNEL);
+	if (!elr)
+		return NULL;
+
+	elr->lr_super = sb;
+	elr->lr_sbi = sbi;
+	elr->lr_next_group = start;
+
+	/*
+	 * Randomize first schedule time of the request to
+	 * spread the inode table initialization requests
+	 * better.
+	 */
+	get_random_bytes(&rnd, sizeof(rnd));
+	elr->lr_next_sched = jiffies + (unsigned long)rnd %
+			     (EXT4_DEF_LI_MAX_START_DELAY * HZ);
+
+	return elr;
+}
+
+static int ext4_register_li_request(struct super_block *sb,
+				    ext4_group_t first_not_zeroed)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_li_request *elr;
+	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
+	int ret = 0;
+
+	if (sbi->s_li_request != NULL)
+		goto out;
+
+	if (first_not_zeroed == ngroups ||
+	    (sb->s_flags & MS_RDONLY) ||
+	    !test_opt(sb, INIT_INODE_TABLE)) {
+		sbi->s_li_request = NULL;
+		goto out;
+	}
+
+	if (first_not_zeroed == ngroups) {
+		sbi->s_li_request = NULL;
+		goto out;
+	}
+
+	elr = ext4_li_request_new(sb, first_not_zeroed);
+	if (!elr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	mutex_lock(&ext4_li_mtx);
+
+	if (NULL == ext4_li_info) {
+		ret = ext4_li_info_new();
+		if (ret)
+			goto out;
+	}
+
+	mutex_lock(&ext4_li_info->li_list_mtx);
+	list_add(&elr->lr_request, &ext4_li_info->li_request_list);
+	mutex_unlock(&ext4_li_info->li_list_mtx);
+
+	sbi->s_li_request = elr;
+
+	if (!(ext4_li_info->li_state & EXT4_LAZYINIT_RUNNING)) {
+		ret = ext4_run_lazyinit_thread();
+		if (ret)
+			goto out;
+	}
+
+	mutex_unlock(&ext4_li_mtx);
+
+out:
+	if (ret) {
+		mutex_unlock(&ext4_li_mtx);
+		kfree(elr);
+	}
+	return ret;
+}
+
+/*
+ * We do not need to lock anything since this is called on
+ * module unload.
+ */
+static void ext4_destroy_lazyinit_thread(void)
+{
+	/*
+	 * If thread exited earlier
+	 * there's nothing to be done.
+	 */
+	if (!ext4_li_info)
+		return;
+
+	ext4_clear_request_list();
+
+	while (ext4_li_info->li_task) {
+		wake_up(&ext4_li_info->li_wait_daemon);
+		wait_event(ext4_li_info->li_wait_task,
+			   ext4_li_info->li_task == NULL);
+	}
+}
+
 static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 				__releases(kernel_lock)
 				__acquires(kernel_lock)
@@ -2589,6 +2976,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	__u64 blocks_count;
 	int err;
 	unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
+	ext4_group_t first_not_zeroed;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -2930,7 +3318,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 			goto failed_mount2;
 		}
 	}
-	if (!ext4_check_descriptors(sb)) {
+	if (!ext4_check_descriptors(sb, &first_not_zeroed)) {
 		ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
 		goto failed_mount2;
 	}
@@ -3151,6 +3539,10 @@ no_journal:
 		goto failed_mount4;
 	}
 
+	err = ext4_register_li_request(sb, first_not_zeroed);
+	if (err)
+		goto failed_mount4;
+
 	sbi->s_kobj.kset = ext4_kset;
 	init_completion(&sbi->s_kobj_unregister);
 	err = kobject_init_and_add(&sbi->s_kobj, &ext4_ktype, NULL,
@@ -3868,6 +4260,19 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 			enable_quota = 1;
 		}
 	}
+
+	/*
+	 * Reinitialize lazy itable initialization thread based on
+	 * current settings
+	 */
+	if ((sb->s_flags & MS_RDONLY) || !test_opt(sb, INIT_INODE_TABLE))
+		ext4_unregister_li_request(sb);
+	else {
+		ext4_group_t first_not_zeroed;
+		first_not_zeroed = ext4_has_uninit_itable(sb);
+		ext4_register_li_request(sb, first_not_zeroed);
+	}
+
 	ext4_setup_system_zone(sb);
 	if (sbi->s_journal == NULL)
 		ext4_commit_super(sb, 1);
@@ -4338,6 +4743,9 @@ static int __init init_ext4_fs(void)
 	err = register_filesystem(&ext4_fs_type);
 	if (err)
 		goto out;
+
+	ext4_li_info = NULL;
+	mutex_init(&ext4_li_mtx);
 	return 0;
 out:
 	unregister_as_ext2();
@@ -4357,6 +4765,7 @@ out4:
 
 static void __exit exit_ext4_fs(void)
 {
+	ext4_destroy_lazyinit_thread();
 	unregister_as_ext2();
 	unregister_as_ext3();
 	unregister_filesystem(&ext4_fs_type);
-- 
1.7.2.2



* [PATCH 4/6] Use sb_issue_zeroout in setup_new_group_blocks
  2010-09-16 12:47 [PATCH 0/6 v4] Lazy itable initialization for Ext4 Lukas Czerner
                   ` (2 preceding siblings ...)
  2010-09-16 12:47 ` [PATCH 3/6] Add inode table initialization code for Ext4 Lukas Czerner
@ 2010-09-16 12:47 ` Lukas Czerner
  2010-09-29 14:12   ` Lukas Czerner
  2010-09-16 12:47 ` [PATCH 5/6] Use sb_issue_zeroout in ext4_ext_zeroout Lukas Czerner
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Lukas Czerner @ 2010-09-16 12:47 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, rwheeler, sandeen, adilger, lczerner, snitzer

Use sb_issue_zeroout to zero out inode table and descriptor table
blocks instead of the old approach, which involves journaling.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/resize.c |   46 +++++++++++++---------------------------------
 1 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index ca5c8aa..afba286 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -226,23 +226,13 @@ static int setup_new_group_blocks(struct super_block *sb,
 	}
 
 	/* Zero out all of the reserved backup group descriptor table blocks */
-	for (i = 0, bit = gdblocks + 1, block = start + bit;
-	     i < reserved_gdb; i++, block++, bit++) {
-		struct buffer_head *gdb;
-
-		ext4_debug("clear reserved block %#04llx (+%d)\n", block, bit);
-
-		if ((err = extend_or_restart_transaction(handle, 1, bh)))
-			goto exit_bh;
+	ext4_debug("clear inode table blocks %#04llx -> %#04llx\n",
+			block, sbi->s_itb_per_group);
+	err = sb_issue_zeroout(sb, gdblocks + start + 1, reserved_gdb,
+			       GFP_NOFS, BLKDEV_IFL_WAIT);
+	if (err)
+		goto exit_journal;
 
-		if (IS_ERR(gdb = bclean(handle, sb, block))) {
-			err = PTR_ERR(gdb);
-			goto exit_bh;
-		}
-		ext4_handle_dirty_metadata(handle, NULL, gdb);
-		ext4_set_bit(bit, bh->b_data);
-		brelse(gdb);
-	}
 	ext4_debug("mark block bitmap %#04llx (+%llu)\n", input->block_bitmap,
 		   input->block_bitmap - start);
 	ext4_set_bit(input->block_bitmap - start, bh->b_data);
@@ -251,23 +241,13 @@ static int setup_new_group_blocks(struct super_block *sb,
 	ext4_set_bit(input->inode_bitmap - start, bh->b_data);
 
 	/* Zero out all of the inode table blocks */
-	for (i = 0, block = input->inode_table, bit = block - start;
-	     i < sbi->s_itb_per_group; i++, bit++, block++) {
-		struct buffer_head *it;
-
-		ext4_debug("clear inode block %#04llx (+%d)\n", block, bit);
-
-		if ((err = extend_or_restart_transaction(handle, 1, bh)))
-			goto exit_bh;


* [PATCH 5/6] Use sb_issue_zeroout in ext4_ext_zeroout
  2010-09-16 12:47 [PATCH 0/6 v4] Lazy itable initialization for Ext4 Lukas Czerner
                   ` (3 preceding siblings ...)
  2010-09-16 12:47 ` [PATCH 4/6] Use sb_issue_zeroout in setup_new_group_blocks Lukas Czerner
@ 2010-09-16 12:47 ` Lukas Czerner
  2010-09-16 12:47 ` [PATCH 6/6] Add interface to advertise ext4 features in sysfs Lukas Czerner
  2010-09-28  4:01 ` [PATCH 0/6 v4] Lazy itable initialization for Ext4 Ted Ts'o
  6 siblings, 0 replies; 22+ messages in thread
From: Lukas Czerner @ 2010-09-16 12:47 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, rwheeler, sandeen, adilger, lczerner, snitzer

Change ext4_ext_zeroout to use sb_issue_zeroout instead of its
own approach to zero out extents.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/extents.c |   69 +++++-----------------------------------------------
 1 files changed, 7 insertions(+), 62 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 06328d3..519cd2b 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2535,77 +2535,22 @@ void ext4_ext_release(struct super_block *sb)
 #endif
 }
 
-static void bi_complete(struct bio *bio, int error)
-{
-	complete((struct completion *)bio->bi_private);
-}
-
 /* FIXME!! we need to try to merge to left or right after zero-out  */
 static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex)
 {
+	ext4_fsblk_t ee_pblock;
+	unsigned int ee_len;
 	int ret;
-	struct bio *bio;
-	int blkbits, blocksize;
-	sector_t ee_pblock;
-	struct completion event;
-	unsigned int ee_len, len, done, offset;
 
-
-	blkbits   = inode->i_blkbits;
-	blocksize = inode->i_sb->s_blocksize;
 	ee_len    = ext4_ext_get_actual_len(ex);
 	ee_pblock = ext_pblock(ex);
 
-	/* convert ee_pblock to 512 byte sectors */
-	ee_pblock = ee_pblock << (blkbits - 9);
-
-	while (ee_len > 0) {
-
-		if (ee_len > BIO_MAX_PAGES)
-			len = BIO_MAX_PAGES;
-		else
-			len = ee_len;
-
-		bio = bio_alloc(GFP_NOIO, len);
-		if (!bio)
-			return -ENOMEM;
-
-		bio->bi_sector = ee_pblock;
-		bio->bi_bdev   = inode->i_sb->s_bdev;
-
-		done = 0;
-		offset = 0;
-		while (done < len) {
-			ret = bio_add_page(bio, ZERO_PAGE(0),
-							blocksize, offset);
-			if (ret != blocksize) {
-				/*
-				 * We can't add any more pages because of
-				 * hardware limitations.  Start a new bio.
-				 */
-				break;
-			}
-			done++;
-			offset += blocksize;
-			if (offset >= PAGE_CACHE_SIZE)
-				offset = 0;
-		}
-
-		init_completion(&event);
-		bio->bi_private = &event;
-		bio->bi_end_io = bi_complete;
-		submit_bio(WRITE, bio);
-		wait_for_completion(&event);
+	ret = sb_issue_zeroout(inode->i_sb, ee_pblock, ee_len,
+			       GFP_NOFS, BLKDEV_IFL_WAIT);
+	if (ret > 0)
+		ret = 0;
 
-		if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) {
-			bio_put(bio);
-			return -EIO;
-		}
-		bio_put(bio);
-		ee_len    -= done;
-		ee_pblock += done  << (blkbits - 9);
-	}
-	return 0;
+	return ret;
 }
 
 #define EXT4_EXT_ZERO_LEN 7
-- 
1.7.2.2
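
The hand-rolled loop removed above converted the extent's starting block into 512-byte sectors before building each bio. A minimal user-space sketch of that conversion follows, assuming the same shift that sb_issue_discard() applies in the context of patch 1. The typedef and the name fs_blocks_to_sectors are hypothetical and exist only for illustration; they are not kernel API.

```c
#include <stdint.h>

/* Hypothetical user-space stand-in for the kernel's sector_t. */
typedef uint64_t sector_t;

/*
 * Convert a filesystem block number (or a block count) into 512-byte
 * sectors. blkbits is log2 of the filesystem block size, e.g. 12 for
 * 4096-byte blocks; the "- 9" is there because 512 == 1 << 9.
 */
static inline sector_t fs_blocks_to_sectors(sector_t blocks,
					    unsigned int blkbits)
{
	return blocks << (blkbits - 9);
}
```

With 4096-byte blocks one filesystem block covers eight sectors, so callers that pass filesystem blocks to a superblock-level helper no longer need to do this arithmetic themselves.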



* [PATCH 6/6] Add interface to advertise ext4 features in sysfs
  2010-09-16 12:47 [PATCH 0/6 v4] Lazy itable initialization for Ext4 Lukas Czerner
                   ` (4 preceding siblings ...)
  2010-09-16 12:47 ` [PATCH 5/6] Use sb_issue_zeroout in ext4_ext_zeroout Lukas Czerner
@ 2010-09-16 12:47 ` Lukas Czerner
  2010-09-28  4:01 ` [PATCH 0/6 v4] Lazy itable initialization for Ext4 Ted Ts'o
  6 siblings, 0 replies; 22+ messages in thread
From: Lukas Czerner @ 2010-09-16 12:47 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, rwheeler, sandeen, adilger, lczerner, snitzer

User-space should have the opportunity to check which features ext4
supports in each particular copy. This adds an easy interface by creating
a new "features" directory in /sys/fs/ext4/. Files advertising feature
names can be created in that directory.

Add lazy_itable_init to the feature list.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/ext4.h  |    5 +++++
 fs/ext4/super.c |   50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 54 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 96884c5..74ec1fc 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1572,6 +1572,11 @@ struct ext4_li_request {
 	unsigned long		lr_next_sched;
 };
 
+struct ext4_features {
+	struct kobject f_kobj;
+	struct completion f_kobj_unregister;
+};
+
 /*
  * Function prototypes
  */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2b53a48..bb84c27 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -57,6 +57,7 @@ struct proc_dir_entry *ext4_proc_root;
 static struct kset *ext4_kset;
 struct ext4_lazy_init *ext4_li_info;
 struct mutex ext4_li_mtx;
+struct ext4_features *ext4_feat;
 
 static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
 			     unsigned long journal_devnum);
@@ -2413,6 +2414,7 @@ static struct ext4_attr ext4_attr_##_name = {			\
 #define EXT4_ATTR(name, mode, show, store) \
 static struct ext4_attr ext4_attr_##name = __ATTR(name, mode, show, store)
 
+#define EXT4_INFO_ATTR(name) EXT4_ATTR(name, 0444, NULL, NULL)
 #define EXT4_RO_ATTR(name) EXT4_ATTR(name, 0444, name##_show, NULL)
 #define EXT4_RW_ATTR(name) EXT4_ATTR(name, 0644, name##_show, name##_store)
 #define EXT4_RW_ATTR_SBI_UI(name, elname)	\
@@ -2449,6 +2451,14 @@ static struct attribute *ext4_attrs[] = {
 	NULL,
 };
 
+/* Features this copy of ext4 supports */
+EXT4_INFO_ATTR(lazy_itable_init);
+
+static struct attribute *ext4_feat_attrs[] = {
+	ATTR_LIST(lazy_itable_init),
+	NULL,
+};
+
 static ssize_t ext4_attr_show(struct kobject *kobj,
 			      struct attribute *attr, char *buf)
 {
@@ -2477,7 +2487,6 @@ static void ext4_sb_release(struct kobject *kobj)
 	complete(&sbi->s_kobj_unregister);
 }
 
-
 static const struct sysfs_ops ext4_attr_ops = {
 	.show	= ext4_attr_show,
 	.store	= ext4_attr_store,
@@ -2489,6 +2498,17 @@ static struct kobj_type ext4_ktype = {
 	.release	= ext4_sb_release,
 };
 
+static void ext4_feat_release(struct kobject *kobj)
+{
+	complete(&ext4_feat->f_kobj_unregister);
+}
+
+static struct kobj_type ext4_feat_ktype = {
+	.default_attrs	= ext4_feat_attrs,
+	.sysfs_ops	= &ext4_attr_ops,
+	.release	= ext4_feat_release,
+};
+
 /*
  * Check whether this filesystem can be mounted based on
  * the features present and the RDONLY/RDWR mount requested.
@@ -4716,6 +4736,30 @@ static struct file_system_type ext4_fs_type = {
 	.fs_flags	= FS_REQUIRES_DEV,
 };
 
+int __init ext4_init_feat_adverts(void)
+{
+	struct ext4_features *ef;
+	int ret = -ENOMEM;
+
+	ef = kzalloc(sizeof(struct ext4_features), GFP_KERNEL);
+	if (!ef)
+		goto out;
+
+	ef->f_kobj.kset = ext4_kset;
+	init_completion(&ef->f_kobj_unregister);
+	ret = kobject_init_and_add(&ef->f_kobj, &ext4_feat_ktype, NULL,
+				   "features");
+	if (ret) {
+		kfree(ef);
+		goto out;
+	}
+
+	ext4_feat = ef;
+	ret = 0;
+out:
+	return ret;
+}
+
 static int __init init_ext4_fs(void)
 {
 	int err;
@@ -4728,6 +4772,9 @@ static int __init init_ext4_fs(void)
 	if (!ext4_kset)
 		goto out4;
 	ext4_proc_root = proc_mkdir("fs/ext4", NULL);
+
+	err = ext4_init_feat_adverts();
+
 	err = init_ext4_mballoc();
 	if (err)
 		goto out3;
@@ -4756,6 +4803,7 @@ out1:
 out2:
 	exit_ext4_mballoc();
 out3:
+	kfree(ext4_feat);
 	remove_proc_entry("fs/ext4", NULL);
 	kset_unregister(ext4_kset);
 out4:
-- 
1.7.2.2
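
As an illustration of how user space might consume this interface, here is a minimal sketch that only tests whether a feature file exists. The helper name feature_advertised is hypothetical; the directory and file names come from the commit message above. Only existence is checked, because the attribute is declared with NULL show and store handlers.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Return 1 if a file called `name` exists under `dir`, 0 otherwise.
 * Hypothetical helper: nothing is read from the file, its presence
 * alone is taken as the advertisement.
 */
static int feature_advertised(const char *dir, const char *name)
{
	char path[4096];
	int n = snprintf(path, sizeof(path), "%s/%s", dir, name);

	if (n < 0 || (size_t)n >= sizeof(path))
		return 0;	/* path would not fit; treat as absent */
	return access(path, F_OK) == 0;
}
```

A tool such as mke2fs could call feature_advertised("/sys/fs/ext4/features", "lazy_itable_init") and fall back to eager inode table initialization when it returns 0, which is also what would happen on kernels that predate this patch.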



* Re: [PATCH 2/6] Add inititable/noinititable mount options for ext4
  2010-09-16 12:47 ` [PATCH 2/6] Add inititable/noinititable mount options for ext4 Lukas Czerner
@ 2010-09-27 18:35   ` Ted Ts'o
  0 siblings, 0 replies; 22+ messages in thread
From: Ted Ts'o @ 2010-09-27 18:35 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, rwheeler, sandeen, adilger, snitzer

On Thu, Sep 16, 2010 at 02:47:27PM +0200, Lukas Czerner wrote:
> Add new mount flag EXT4_MOUNT_INIT_INODE_TABLE and a new pair of mount
> options (inititable/noinititable). When mounted with inititable, the file
> system should try to initialize uninitialized inode tables; otherwise it
> should not initialize inode tables. For now, the default is noinititable.
> 
> One can also specify inititable=n where n is a number that will be used
> as the wait multiplier (see "Add inode table initialization code into
> Ext4" patch for more info). A bigger number means slower inode table
> initialization and thus less impact on performance, but a longer
> initialization (default is 10).

Note: this patch doesn't compile on its own, since it uses s_li_wait
before it is defined (in the next patch).

This is a problem if someone tries to do a "git bisect" and lands
between these two patches.

I'll probably fix this just by merging these two patches together...

     	      	       	       	       - Ted


* Re: [PATCH 0/6 v4] Lazy itable initialization for Ext4
  2010-09-16 12:47 [PATCH 0/6 v4] Lazy itable initialization for Ext4 Lukas Czerner
                   ` (5 preceding siblings ...)
  2010-09-16 12:47 ` [PATCH 6/6] Add interface to advertise ext4 features in sysfs Lukas Czerner
@ 2010-09-28  4:01 ` Ted Ts'o
  2010-09-28 15:05   ` Ted Ts'o
  2010-09-29 13:37   ` Lukas Czerner
  6 siblings, 2 replies; 22+ messages in thread
From: Ted Ts'o @ 2010-09-28  4:01 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, rwheeler, sandeen, adilger, snitzer

On Thu, Sep 16, 2010 at 02:47:25PM +0200, Lukas Czerner wrote:
> 
> as Mike suggested I have rebased the patch #1 against Jens'
> linux-2.6-block.git 'for-next' branch and changed sb_issue_zeroout()
> to cope with the new blkdev_issue_zeroout(), and changed
> sb_issue_zeroout() to the new syntax everywhere I am using it.
> Also some typos gets fixed.

We may have a problem with the lazy_itable patches.  I've tried
running xfstests three times now.  This was with a system where
mke2fs was set up (via /etc/mke2fs.conf) to always format the file
system using lazy_itable_init.  This meant that any of the xfstests
which reformatted the scratch partition and then started a stress test
would stress the newly added itable initialization code.
Unfortunately the results weren't good.

The first time, I got the following soft lockup warning:

[ 2520.528745] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2520.531445]  ef2b8e44 00000046 00000007 e29c1500 e29c1500 e29c1760 e29c175c c0b55500
[ 2520.534983]  c0b55500 e29c175c c0b55500 c0b55500 c0b55500 32423426 00000224 00000000
[ 2520.538270]  00000224 e29c1500 00000001 ef205000 00000005 ef2b8e74 ef2b8e80 c026eb2c
[ 2520.541743] Call Trace:
[ 2520.542742]  [<c026eb2c>] jbd2_log_wait_commit+0x103/0x14f
[ 2520.544291]  [<c01711dc>] ? autoremove_wake_function+0x0/0x34
[ 2520.545816]  [<c026bf95>] jbd2_log_do_checkpoint+0x1a8/0x458
[ 2520.547431]  [<c026f4ed>] jbd2_journal_destroy+0x107/0x1d3
[ 2520.549602]  [<c01711dc>] ? autoremove_wake_function+0x0/0x34
[ 2520.551100]  [<c0252bef>] ext4_put_super+0x78/0x2f7
[ 2520.552798]  [<c01f3c3c>] generic_shutdown_super+0x47/0xb8
[ 2520.554692]  [<c01f3ccf>] kill_block_super+0x22/0x36
[ 2520.556470]  [<c01f3816>] deactivate_locked_super+0x22/0x3e
[ 2520.558372]  [<c01f3bf1>] deactivate_super+0x3d/0x41
[ 2520.560138]  [<c02057a9>] mntput_no_expire+0xb5/0xd8
[ 2520.561880]  [<c0206609>] sys_umount+0x273/0x298
[ 2520.563358]  [<c0206640>] sys_oldumount+0x12/0x14
[ 2520.564952]  [<c0646715>] syscall_call+0x7/0xb
[ 2520.566596] 3 locks held by umount/15126:
[ 2520.568121]  #0:  (&type->s_umount_key#20){++++..}, at: [<c01f3bea>] deactivate_super+0x36/0x41
[ 2520.571819]  #1:  (&type->s_lock_key#2){+.+...}, at: [<c01f3096>] lock_super+0x20/0x22
[ 2520.574788]  #2:  (&journal->j_checkpoint_mutex){+.+...}, at: [<c026f4e6>] jbd2_journal_destroy+0x100/0x1d3

In addition, there were these mysterious error messages:

[ 2542.026996] ata1: lost interrupt (Status 0x50)
[ 2542.029750] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 2542.032656] ata1.00: failed command: WRITE DMA
[ 2542.034312] ata1.00: cmd ca/00:10:00:00:00/00:00:00:00:00/e0 tag 0 dma 8192 out
[ 2542.034313]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 2542.039892] ata1.00: status: { DRDY }

Why are they strange?  Because this was running under KVM, and there
were no underlying hardware problems in the host OS.

The other two times I got a hard hang at xfstests 219 and 83, and the
system was caught in such a tight loop that magic-sysrq wasn't working
correctly.

I've run xfstests in this setup before applying these patches, and things
worked fine.  I'm currently rolling back the patches and trying
another xfstests run just to make sure the problem wasn't introduced
by some other patch, but for now, it looks like there might be a problem
somewhere.  And unfortunately, since it's not happening in a regular
location or test, and the system is so badly locked up that sysrq doesn't
work, finding it may be interesting....

					- Ted


* Re: [PATCH 0/6 v4] Lazy itable initialization for Ext4
  2010-09-28  4:01 ` [PATCH 0/6 v4] Lazy itable initialization for Ext4 Ted Ts'o
@ 2010-09-28 15:05   ` Ted Ts'o
  2010-09-29 13:37   ` Lukas Czerner
  1 sibling, 0 replies; 22+ messages in thread
From: Ted Ts'o @ 2010-09-28 15:05 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, rwheeler, sandeen, adilger, snitzer

On Tue, Sep 28, 2010 at 12:01:42AM -0400, Ted Ts'o wrote:
> Why are they strange?  Because this was running under KVM, and there
> were no underlying hardware problems in the host OS.
> 
> The other two times I got a hard hang at xfstests 219 and 83, and the
> system was caught in such a tight loop that magic-sysrq wasn't working
> correctly.
> 
> I've run xfstests in this setup before applying these patches, and things
> worked fine.  I'm currently rolling back the patches and trying
> another xfstests run just to make sure the problem wasn't introduced
> by some other patch, but for now, it looks like there might be a problem
> somewhere.  And unfortunately, since it's not happening in a regular
> location or test, and the system is so badly locked up that sysrq doesn't
> work, finding it may be interesting....

I've just tried bisecting the patches, and tried applying the first
three (well, two since I combined patches #2 and #3).  Simply enabling
the init_itables code wasn't enough to trigger the problem.  It looks
like the problem is in the last three patches (probably in one of the
patches where we convert ext4 to use sb_issue_zeroout, either the
extent or the resize code).

What I'll probably do (unless we find the problem very quickly) is to
reorder things so that we take the init_itable patch and the sysfs
feature patch, and put the rest into the unstable portion of the patch
queue.  That way I can work on the rest of the ext4 patches for the
merge window, without getting blocked on this patch series.  And if we
don't manage to figure out what went wrong, while it would be nice to
simplify the code for 2.6.36, it won't be the end of the world if they
need to wait until the next cycle.

				- Ted


* Re: [PATCH 0/6 v4] Lazy itable initialization for Ext4
  2010-09-28  4:01 ` [PATCH 0/6 v4] Lazy itable initialization for Ext4 Ted Ts'o
  2010-09-28 15:05   ` Ted Ts'o
@ 2010-09-29 13:37   ` Lukas Czerner
  2010-10-01 15:58     ` Lukas Czerner
  1 sibling, 1 reply; 22+ messages in thread
From: Lukas Czerner @ 2010-09-29 13:37 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Lukas Czerner, linux-ext4, rwheeler, sandeen, adilger, snitzer

On Tue, 28 Sep 2010, Ted Ts'o wrote:

> On Thu, Sep 16, 2010 at 02:47:25PM +0200, Lukas Czerner wrote:
> > 
> > as Mike suggested I have rebased the patch #1 against Jens'
> > linux-2.6-block.git 'for-next' branch and changed sb_issue_zeroout()
> > to cope with the new blkdev_issue_zeroout(), and changed
> > sb_issue_zeroout() to the new syntax everywhere I am using it.
> > Also some typos gets fixed.
> 
> We may have a problem with the lazy_itable patches.  I've tried
> running xfstests three times now.  This was with a system where
> mke2fs was set up (via /etc/mke2fs.conf) to always format the file
> system using lazy_itable_init.  This meant that any of the xfstests
> which reformatted the scratch partition and then started a stress test
> would stress the newly added itable initialization code.
> Unfortunately the results weren't good.
> 
> The first time, I got the following soft lockup warning:
> 
> [ 2520.528745] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2520.531445]  ef2b8e44 00000046 00000007 e29c1500 e29c1500 e29c1760 e29c175c c0b55500
> [ 2520.534983]  c0b55500 e29c175c c0b55500 c0b55500 c0b55500 32423426 00000224 00000000
> [ 2520.538270]  00000224 e29c1500 00000001 ef205000 00000005 ef2b8e74 ef2b8e80 c026eb2c
> [ 2520.541743] Call Trace:
> [ 2520.542742]  [<c026eb2c>] jbd2_log_wait_commit+0x103/0x14f
> [ 2520.544291]  [<c01711dc>] ? autoremove_wake_function+0x0/0x34
> [ 2520.545816]  [<c026bf95>] jbd2_log_do_checkpoint+0x1a8/0x458
> [ 2520.547431]  [<c026f4ed>] jbd2_journal_destroy+0x107/0x1d3
> [ 2520.549602]  [<c01711dc>] ? autoremove_wake_function+0x0/0x34
> [ 2520.551100]  [<c0252bef>] ext4_put_super+0x78/0x2f7
> [ 2520.552798]  [<c01f3c3c>] generic_shutdown_super+0x47/0xb8
> [ 2520.554692]  [<c01f3ccf>] kill_block_super+0x22/0x36
> [ 2520.556470]  [<c01f3816>] deactivate_locked_super+0x22/0x3e
> [ 2520.558372]  [<c01f3bf1>] deactivate_super+0x3d/0x41
> [ 2520.560138]  [<c02057a9>] mntput_no_expire+0xb5/0xd8
> [ 2520.561880]  [<c0206609>] sys_umount+0x273/0x298
> [ 2520.563358]  [<c0206640>] sys_oldumount+0x12/0x14
> [ 2520.564952]  [<c0646715>] syscall_call+0x7/0xb
> [ 2520.566596] 3 locks held by umount/15126:
> [ 2520.568121]  #0:  (&type->s_umount_key#20){++++..}, at: [<c01f3bea>] deactivate_super+0x36/0x41
> [ 2520.571819]  #1:  (&type->s_lock_key#2){+.+...}, at: [<c01f3096>] lock_super+0x20/0x22
> [ 2520.574788]  #2:  (&journal->j_checkpoint_mutex){+.+...}, at: [<c026f4e6>] jbd2_journal_destroy+0x100/0x1d3
> 
> In addition, there were these mysterious error messages:
> 
> [ 2542.026996] ata1: lost interrupt (Status 0x50)
> [ 2542.029750] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> [ 2542.032656] ata1.00: failed command: WRITE DMA
> [ 2542.034312] ata1.00: cmd ca/00:10:00:00:00/00:00:00:00:00/e0 tag 0 dma 8192 out
> [ 2542.034313]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [ 2542.039892] ata1.00: status: { DRDY }
> 
> Why are they strange?  Because this was running under KVM, and there
> were no underlying hardware problems in the host OS.

Hi Ted,

this is really strange. I have never seen anything like this, and I
ran xfstests on the patchset several times while I was creating it.
Unfortunately I am not able to reproduce those errors even now. I am
running 2.6.36-rc6 with a real SSD device.

Maybe the one difference is that I am using 2.6.36-rc6, which still has the
old sb_issue_discard() interface (no flags or gfp_mask in the function
definition), and which predates Christoph's "remove BLKDEV_IFL_WAIT" patch
(dd3932eddf428571762596e17b65f5dc92ca361b in Jens' for-next branch).

I'll search further.

> 
> The other two times I got a hard hang at xfstests 219 and 83, and the
> system was caught in such a tight loop that magic-sysrq wasn't working
> correctly.

Are you sure about the test numbers? 083 does not even run on ext4; it
is xfs specific.

> 
> I've run xfstests in this setup before applying these patches, and things
> worked fine.  I'm currently rolling back the patches and trying
> another xfstests run just to make sure the problem wasn't introduced
> by some other patch, but for now, it looks like there might be a problem
> somewhere.  And unfortunately, since it's not happening in a regular
> location or test, and the system is so badly locked up that sysrq doesn't
> work, finding it may be interesting....
> 
> 					- Ted
> 


* Re: [PATCH 4/6] Use sb_issue_zeroout in setup_new_group_blocks
  2010-09-16 12:47 ` [PATCH 4/6] Use sb_issue_zeroout in setup_new_group_blocks Lukas Czerner
@ 2010-09-29 14:12   ` Lukas Czerner
  2010-09-29 14:14     ` Lukas Czerner
  0 siblings, 1 reply; 22+ messages in thread
From: Lukas Czerner @ 2010-09-29 14:12 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, tytso, rwheeler, sandeen, adilger, snitzer

On Thu, 16 Sep 2010, Lukas Czerner wrote:

> Use sb_issue_zeroout to zero out inode table and descriptor table
> blocks instead of old approach which involves journaling.
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> ---
>  fs/ext4/resize.c |   46 +++++++++++++---------------------------------
>  1 files changed, 13 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
> index ca5c8aa..afba286 100644
> --- a/fs/ext4/resize.c
> +++ b/fs/ext4/resize.c
> @@ -226,23 +226,13 @@ static int setup_new_group_blocks(struct super_block *sb,
>  	}
>  
>  	/* Zero out all of the reserved backup group descriptor table blocks */
> -	for (i = 0, bit = gdblocks + 1, block = start + bit;
> -	     i < reserved_gdb; i++, block++, bit++) {
> -		struct buffer_head *gdb;
> -
> -		ext4_debug("clear reserved block %#04llx (+%d)\n", block, bit);
> -
> -		if ((err = extend_or_restart_transaction(handle, 1, bh)))
> -			goto exit_bh;
> +	ext4_debug("clear inode table blocks %#04llx -> %#04llx\n",
> +			block, sbi->s_itb_per_group);
> +	err = sb_issue_zeroout(sb, gdblocks + start + 1, reserved_gdb,
> +			       GFP_NOFS, BLKDEV_IFL_WAIT);
> +	if (err)
> +		goto exit_journal;

When I look at this now, it seems wrong: if sb_issue_zeroout() returns
an error for some reason, we jump to exit_journal and end up with an
unreleased buffer_head. Since I am still not able to reproduce Ted's
errors I can't say whether fixing this will help, but I doubt it will.

>  
> -		if (IS_ERR(gdb = bclean(handle, sb, block))) {
> -			err = PTR_ERR(gdb);
> -			goto exit_bh;
> -		}
> -		ext4_handle_dirty_metadata(handle, NULL, gdb);
> -		ext4_set_bit(bit, bh->b_data);
> -		brelse(gdb);
> -	}
>  	ext4_debug("mark block bitmap %#04llx (+%llu)\n", input->block_bitmap,
>  		   input->block_bitmap - start);
>  	ext4_set_bit(input->block_bitmap - start, bh->b_data);
> @@ -251,23 +241,13 @@ static int setup_new_group_blocks(struct super_block *sb,
>  	ext4_set_bit(input->inode_bitmap - start, bh->b_data);
>  
>  	/* Zero out all of the inode table blocks */
> -	for (i = 0, block = input->inode_table, bit = block - start;
> -	     i < sbi->s_itb_per_group; i++, bit++, block++) {
> -		struct buffer_head *it;
> -
> -		ext4_debug("clear inode block %#04llx (+%d)\n", block, bit);
> -
> -		if ((err = extend_or_restart_transaction(handle, 1, bh)))
> -			goto exit_bh;
> -
> -		if (IS_ERR(it = bclean(handle, sb, block))) {
> -			err = PTR_ERR(it);
> -			goto exit_bh;
> -		}
> -		ext4_handle_dirty_metadata(handle, NULL, it);
> -		brelse(it);
> -		ext4_set_bit(bit, bh->b_data);
> -	}
> +	block = input->inode_table;
> +	ext4_debug("clear inode table blocks %#04llx -> %#04llx\n",
> +			block, sbi->s_itb_per_group);
> +	err = sb_issue_zeroout(sb, block, sbi->s_itb_per_group,
> +			       GFP_NOFS, BLKDEV_IFL_WAIT);
> +	if (err)
> +		goto exit_journal;

here as well.

>  
>  	if ((err = extend_or_restart_transaction(handle, 2, bh)))
>  		goto exit_bh;
> 

I'll resend the patch shortly.

Thanks!
-Lukas
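
To make the leak described above concrete, here is a small user-space model of the stacked cleanup labels. It is only a sketch: the counters and the function setup_blocks_model are hypothetical stand-ins for the journal handle and the buffer_head reference, and the label names merely mirror the ones in the diff.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical counters standing in for a journal handle and a
 * buffer_head reference. */
static int open_handles;
static int held_buffers;

/* zeroout_err models the return value of the zeroing call. */
static int setup_blocks_model(int zeroout_err)
{
	void *bh;
	int err;

	open_handles++;			/* models starting the journal handle */

	bh = malloc(1);			/* models getting the bitmap buffer */
	if (!bh) {
		err = -ENOMEM;
		goto exit_journal;	/* bh was never acquired */
	}
	held_buffers++;

	err = zeroout_err;
	if (err)
		goto exit_bh;		/* bh is held: exit_journal would leak it */

	/* ... the remaining setup steps would go here ... */

exit_bh:
	free(bh);
	held_buffers--;
exit_journal:
	open_handles--;
	return err;
}
```

Because the labels fall through in reverse order of acquisition, jumping to exit_bh releases both resources, while jumping straight to exit_journal after the buffer has been taken would skip its release.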


* Re: [PATCH 4/6] Use sb_issue_zeroout in setup_new_group_blocks
  2010-09-29 14:12   ` Lukas Czerner
@ 2010-09-29 14:14     ` Lukas Czerner
  2010-10-01 16:00       ` [PATCH 4/6 fixed] " Lukas Czerner
  0 siblings, 1 reply; 22+ messages in thread
From: Lukas Czerner @ 2010-09-29 14:14 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, tytso, rwheeler, sandeen, adilger, snitzer

Use sb_issue_zeroout to zero out inode table and descriptor table
blocks instead of old approach which involves journaling.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/resize.c |   46 +++++++++++++---------------------------------
 1 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index ca5c8aa..2f5e347 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -226,23 +226,13 @@ static int setup_new_group_blocks(struct super_block *sb,
 	}
 
 	/* Zero out all of the reserved backup group descriptor table blocks */
-	for (i = 0, bit = gdblocks + 1, block = start + bit;
-	     i < reserved_gdb; i++, block++, bit++) {
-		struct buffer_head *gdb;
-
-		ext4_debug("clear reserved block %#04llx (+%d)\n", block, bit);
-
-		if ((err = extend_or_restart_transaction(handle, 1, bh)))
-			goto exit_bh;
+	ext4_debug("clear inode table blocks %#04llx -> %#04llx\n",
+			block, sbi->s_itb_per_group);
+	err = sb_issue_zeroout(sb, gdblocks + start + 1, reserved_gdb,
+			       GFP_NOFS, BLKDEV_IFL_WAIT);
+	if (err)
+		goto exit_bh;
 
-		if (IS_ERR(gdb = bclean(handle, sb, block))) {
-			err = PTR_ERR(gdb);
-			goto exit_bh;
-		}
-		ext4_handle_dirty_metadata(handle, NULL, gdb);
-		ext4_set_bit(bit, bh->b_data);
-		brelse(gdb);
-	}
 	ext4_debug("mark block bitmap %#04llx (+%llu)\n", input->block_bitmap,
 		   input->block_bitmap - start);
 	ext4_set_bit(input->block_bitmap - start, bh->b_data);
@@ -251,23 +241,13 @@ static int setup_new_group_blocks(struct super_block *sb,
 	ext4_set_bit(input->inode_bitmap - start, bh->b_data);
 
 	/* Zero out all of the inode table blocks */
-	for (i = 0, block = input->inode_table, bit = block - start;
-	     i < sbi->s_itb_per_group; i++, bit++, block++) {
-		struct buffer_head *it;
-
-		ext4_debug("clear inode block %#04llx (+%d)\n", block, bit);
-
-		if ((err = extend_or_restart_transaction(handle, 1, bh)))
-			goto exit_bh;


* Re: [PATCH 0/6 v4] Lazy itable initialization for Ext4
  2010-09-29 13:37   ` Lukas Czerner
@ 2010-10-01 15:58     ` Lukas Czerner
  2010-10-02 19:55       ` Ted Ts'o
  0 siblings, 1 reply; 22+ messages in thread
From: Lukas Czerner @ 2010-10-01 15:58 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: Ted Ts'o, linux-ext4, rwheeler, sandeen, adilger, snitzer

On Wed, 29 Sep 2010, Lukas Czerner wrote:

> On Tue, 28 Sep 2010, Ted Ts'o wrote:
> 
> > On Thu, Sep 16, 2010 at 02:47:25PM +0200, Lukas Czerner wrote:
> > > 
> > > as Mike suggested I have rebased the patch #1 against Jens'
> > > linux-2.6-block.git 'for-next' branch and changed sb_issue_zeroout()
> > > to cope with the new blkdev_issue_zeroout(), and changed
> > > sb_issue_zeroout() to the new syntax everywhere I am using it.
> > > Also some typos gets fixed.
> > 
> > We may have a problem with the lazy_itable patches.  I've tried
> > running xfstests three times now.  This was with a system where
> > mke2fs was set up (via /etc/mke2fs.conf) to always format the file
> > system using lazy_itable_init.  This meant that any of the xfstests
> > which reformatted the scratch partition and then started a stress test
> > would stress the newly added itable initialization code.
> > Unfortunately the results weren't good.
> > 
> > The first time, I got the following soft lockup warning:
> > 
> > [ 2520.528745] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 2520.531445]  ef2b8e44 00000046 00000007 e29c1500 e29c1500 e29c1760 e29c175c c0b55500
> > [ 2520.534983]  c0b55500 e29c175c c0b55500 c0b55500 c0b55500 32423426 00000224 00000000
> > [ 2520.538270]  00000224 e29c1500 00000001 ef205000 00000005 ef2b8e74 ef2b8e80 c026eb2c
> > [ 2520.541743] Call Trace:
> > [ 2520.542742]  [<c026eb2c>] jbd2_log_wait_commit+0x103/0x14f
> > [ 2520.544291]  [<c01711dc>] ? autoremove_wake_function+0x0/0x34
> > [ 2520.545816]  [<c026bf95>] jbd2_log_do_checkpoint+0x1a8/0x458
> > [ 2520.547431]  [<c026f4ed>] jbd2_journal_destroy+0x107/0x1d3
> > [ 2520.549602]  [<c01711dc>] ? autoremove_wake_function+0x0/0x34
> > [ 2520.551100]  [<c0252bef>] ext4_put_super+0x78/0x2f7
> > [ 2520.552798]  [<c01f3c3c>] generic_shutdown_super+0x47/0xb8
> > [ 2520.554692]  [<c01f3ccf>] kill_block_super+0x22/0x36
> > [ 2520.556470]  [<c01f3816>] deactivate_locked_super+0x22/0x3e
> > [ 2520.558372]  [<c01f3bf1>] deactivate_super+0x3d/0x41
> > [ 2520.560138]  [<c02057a9>] mntput_no_expire+0xb5/0xd8
> > [ 2520.561880]  [<c0206609>] sys_umount+0x273/0x298
> > [ 2520.563358]  [<c0206640>] sys_oldumount+0x12/0x14
> > [ 2520.564952]  [<c0646715>] syscall_call+0x7/0xb
> > [ 2520.566596] 3 locks held by umount/15126:
> > [ 2520.568121]  #0:  (&type->s_umount_key#20){++++..}, at: [<c01f3bea>] deactivate_super+0x36/0x41
> > [ 2520.571819]  #1:  (&type->s_lock_key#2){+.+...}, at: [<c01f3096>] lock_super+0x20/0x22
> > [ 2520.574788]  #2:  (&journal->j_checkpoint_mutex){+.+...}, at: [<c026f4e6>] jbd2_journal_destroy+0x100/0x1d3
> > 
> > In addition, there were these mysterious error messages:
> > 
> > [ 2542.026996] ata1: lost interrupt (Status 0x50)
> > [ 2542.029750] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> > [ 2542.032656] ata1.00: failed command: WRITE DMA
> > [ 2542.034312] ata1.00: cmd ca/00:10:00:00:00/00:00:00:00:00/e0 tag 0 dma 8192 out
> > [ 2542.034313]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> > [ 2542.039892] ata1.00: status: { DRDY }
> > 
> > Why are they strange?  Because this was running under KVM, and there
> > were no underlying hardware problems in the host OS.
> 
> Hi Ted,
> 
> this is really strange. I have never seen anything like this, and I
> ran xfstests on the patchset several times while I was creating it.
> Unfortunately I am not able to reproduce those errors even now. I am
> running 2.6.36-rc6 with a real SSD device.
> 
> Maybe the one difference is that I am using 2.6.36-rc6, which still has the
> old sb_issue_discard() interface (no flags or gfp_mask in the function
> definition), and which predates Christoph's "remove BLKDEV_IFL_WAIT" patch
> (dd3932eddf428571762596e17b65f5dc92ca361b in Jens' for-next branch).
> 
> I'll search further.
> 

After extensive xfstests runs I have not been able to reproduce it.
However, after hammering it for a while with another stress test (the one
I proposed for testing the batched discard implementation) I got a panic
caused by a buffer_head that was not up to date in submit_bh():
kernel BUG at fs/buffer.c:2910! I have been able to reproduce it
every time (sometimes on a different BUG_ON).

The patch responsible for this bug is [PATCH 4/6] Use sb_issue_zeroout
in setup_new_group_blocks. Without this patch I was not able to hit
that panic again. I have also managed to find and fix the problem in
this patch. I'll send a fixed version of [PATCH 4/6] shortly.

Importantly, with that fix I was not able to hit that panic again,
but since I was not able to reproduce what you have seen with
xfstests it may be a different issue. It would be great if you had
time to try it with the fixed patch, to see if it makes a
difference.

Thanks!
-Lukas


* [PATCH 4/6 fixed] Use sb_issue_zeroout in setup_new_group_blocks
  2010-09-29 14:14     ` Lukas Czerner
@ 2010-10-01 16:00       ` Lukas Czerner
  0 siblings, 0 replies; 22+ messages in thread
From: Lukas Czerner @ 2010-10-01 16:00 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, tytso, rwheeler, sandeen, adilger, snitzer

Use sb_issue_zeroout to zero out inode table and descriptor table
blocks instead of the old approach, which involves journaling.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/resize.c |   47 +++++++++++++++--------------------------------
 1 files changed, 15 insertions(+), 32 deletions(-)

diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index ca5c8aa..49c8aff 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -226,23 +226,16 @@ static int setup_new_group_blocks(struct super_block *sb,
 	}
 
 	/* Zero out all of the reserved backup group descriptor table blocks */
-	for (i = 0, bit = gdblocks + 1, block = start + bit;
-	     i < reserved_gdb; i++, block++, bit++) {
-		struct buffer_head *gdb;
-
-		ext4_debug("clear reserved block %#04llx (+%d)\n", block, bit);
+	ext4_debug("clear inode table blocks %#04llx -> %#04llx\n",
+			block, sbi->s_itb_per_group);
+	err = sb_issue_zeroout(sb, gdblocks + start + 1, reserved_gdb,
+			       GFP_NOFS, BLKDEV_IFL_WAIT);
+	if (err)
+		goto exit_bh;
 
-		if ((err = extend_or_restart_transaction(handle, 1, bh)))
-			goto exit_bh;
+	if ((err = extend_or_restart_transaction(handle, 1, bh)))
+		goto exit_bh;
 
-		if (IS_ERR(gdb = bclean(handle, sb, block))) {
-			err = PTR_ERR(gdb);
-			goto exit_bh;
-		}
-		ext4_handle_dirty_metadata(handle, NULL, gdb);
-		ext4_set_bit(bit, bh->b_data);
-		brelse(gdb);
-	}
 	ext4_debug("mark block bitmap %#04llx (+%llu)\n", input->block_bitmap,
 		   input->block_bitmap - start);
 	ext4_set_bit(input->block_bitmap - start, bh->b_data);
@@ -251,23 +244,13 @@ static int setup_new_group_blocks(struct super_block *sb,
 	ext4_set_bit(input->inode_bitmap - start, bh->b_data);
 
 	/* Zero out all of the inode table blocks */
-	for (i = 0, block = input->inode_table, bit = block - start;
-	     i < sbi->s_itb_per_group; i++, bit++, block++) {
-		struct buffer_head *it;
-
-		ext4_debug("clear inode block %#04llx (+%d)\n", block, bit);
-
-		if ((err = extend_or_restart_transaction(handle, 1, bh)))
-			goto exit_bh;


* Re: [PATCH 0/6 v4] Lazy itable initialization for Ext4
  2010-10-01 15:58     ` Lukas Czerner
@ 2010-10-02 19:55       ` Ted Ts'o
  2010-10-03  2:43         ` Ted Ts'o
  0 siblings, 1 reply; 22+ messages in thread
From: Ted Ts'o @ 2010-10-02 19:55 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, rwheeler, sandeen, adilger, snitzer

On Fri, Oct 01, 2010 at 05:58:52PM +0200, Lukas Czerner wrote:
> 
> After extensive xfstest-ing I have not been able to reproduce it.
> However, after a while hammering it with other stress test (the one
> I have proposed to test batched discard implementation with) I have
> got a panic due to not up-to-date buffer_head in submit_bh() :
> kernel BUG at fs/buffer.c:2910! - I have been able to reproduce it
> every time (on different BUG_ON sometimes)

I found it --- or at least I found one of the problems.

The call to ext4_unregister_li_request(sb) comes *after* the call to
jbd2_journal_destroy().  If we get unlucky and call
ext4_init_inode_table() while we are destroying the journal, we can
end up creating a handle after the journal thread has been shut down
(during the final call to jbd2_journal_commit_transaction(), but
before jbd2_journal_destroy() calls jbd2_log_do_checkpoint()), and
then we wait forever in jbd2_log_wait_commit().

This shouldn't however lock up the system tight enough that it doesn't
respond to magic sysrq, but I haven't seen that problem since I moved
from 2.6.36-rc3 to 2.6.36-rc6.  I do see this problem, which is
definitely a bug.

I am getting a lot of warnings from fs/writeback.c:76 (Dirtiable inode
bdi block != sb bdi block) which I have been commenting out for now,
since it seems to be noisy but otherwise relatively harmless.

I also found a bug in ext4_init_inode_table(), where you compare
(num > EXT4_INODES_PER_GROUP(sb)), which I'm pretty sure should be
(num > sbi->s_itb_per_group) instead.
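
To make the unit mismatch concrete: num counts inode table blocks, so
it needs to be bounded by the number of itable blocks per group, not
by the number of inodes per group.  A minimal userspace sketch of the
arithmetic follows.  This is not the patch code;
itable_blocks_to_zero() and its parameters are hypothetical stand-ins
for the values ext4_init_inode_table() reads from the superblock and
the group descriptor, and the explicit negative check is an assumption
added for the sketch.

```c
/* Round-up integer division, same idea as the kernel's DIV_ROUND_UP(). */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical stand-in for the block arithmetic in
 * ext4_init_inode_table().  Returns the number of inode table blocks
 * that still need zeroing, or -1 if the result fails the sanity check.
 */
static int itable_blocks_to_zero(unsigned int inodes_per_group,
				 unsigned int itable_unused,
				 unsigned int inodes_per_block,
				 unsigned int itb_per_group)
{
	/* Blocks at the start of the table that may hold used inodes */
	unsigned int used_blks = DIV_ROUND_UP(inodes_per_group - itable_unused,
					      inodes_per_block);
	int num = (int)itb_per_group - (int)used_blks;

	/* num is a block count: compare against blocks, not inodes */
	if (num < 0 || (unsigned int)num > itb_per_group)
		return -1;

	return num;
}
```

With an assumed geometry of 8192 inodes per group, 16 inodes per block
and 512 itable blocks per group, a group with 8000 unused inodes has
12 used blocks and 500 blocks left to zero; a group with no unused
inodes has nothing left to zero.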

Regards,

							- Ted


* Re: [PATCH 0/6 v4] Lazy itable initialization for Ext4
  2010-10-02 19:55       ` Ted Ts'o
@ 2010-10-03  2:43         ` Ted Ts'o
  2010-10-04  2:36           ` Ted Ts'o
  0 siblings, 1 reply; 22+ messages in thread
From: Ted Ts'o @ 2010-10-03  2:43 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, rwheeler, sandeen, adilger, snitzer

OK, so this looks like it's working for me now.  This combines the
old patches 2/6 and 3/6.  I've added documentation for the mount
options, fixed the lazyinit daemon shutdown-on-mount bug, and cleaned
up the sanity check in ext4_init_inode_table().

I've run XFStests with a 2CPU KVM test setup (this may have been why
you couldn't reproduce it --- perhaps you weren't using an SMP
system?), with both 1k and 4k block sizes, and it seems to pass without
problems.

     	       	     	    	   	- Ted

From 7d729b3c1baaed56099378927e433fd90a46a91f Mon Sep 17 00:00:00 2001
From: Lukas Czerner <lczerner@redhat.com>
Date: Sat, 2 Oct 2010 16:43:37 -0400
Subject: [PATCH] ext4: Add support for lazy inode table initialization

When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out.  The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted.  However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.

Hence, it is important for the inode tables to be initialized as soon
as possible.  This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.

This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed.  There
is only one thread for all ext4 filesystems in the system.  When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, and the filesystem can then register
its request in the request list.

This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table().  The next
schedule time for a request is computed by multiplying the time it
took to zero out the last inode table by the wait multiplier, which
can be set with the init_itable=n mount option (default is 10).  We do
this so that we do not take the whole I/O bandwidth.  When the thread
is no longer necessary (the request list is empty), it frees the
appropriate structures and exits (and can be created again later by
another filesystem).

We do not disturb regular inode allocations in any way; the allocator
just does not care whether the inode table is zeroed or not.  But when
zeroing, we have to skip used inodes, obviously.  We should also
prevent new inode allocations from the group while zeroing is under
way.  For that we take alloc_sem for writing in ext4_init_inode_table()
and for reading in ext4_claim_inode(), so when we are unlucky and the
allocator hits the group which is currently being zeroed, it just has
to wait.

This can be suppressed using the mount option noinit_itable.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 Documentation/filesystems/ext4.txt |   14 ++
 fs/ext4/ext4.h                     |   39 ++++
 fs/ext4/ialloc.c                   |  116 ++++++++++
 fs/ext4/super.c                    |  440 +++++++++++++++++++++++++++++++++++-
 4 files changed, 606 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index e1def17..6ab9442 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -353,6 +353,20 @@ noauto_da_alloc		replacing existing files via patterns such as
 			system crashes before the delayed allocation
 			blocks are forced to disk.
 
+noinit_itable		Do not initialize any uninitialized inode table
+			blocks in the background.  This feature may be
+			used by installation CD's so that the install
+			process can complete as quickly as possible; the
+			inode table initialization process would then be
+			deferred until the next time the  file system
+			is unmounted.
+
+init_itable=n		The lazy itable init code will wait n times the
+			number of milliseconds it took to zero out the
+			previous block group's inode table.  This
+			minimizes the impact on the system performance
+			while file system's inode table is being initialized.
+
 discard		Controls whether ext4 should issue discard/TRIM
 nodiscard(*)		commands to the underlying block device when
 			blocks are freed.  This is useful for SSD devices
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b364b9d..dfca73f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -890,6 +890,7 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_DATA_ERR_ABORT	0x10000000 /* Abort on file data write */
 #define EXT4_MOUNT_BLOCK_VALIDITY	0x20000000 /* Block validity checking */
 #define EXT4_MOUNT_DISCARD		0x40000000 /* Issue DISCARD requests */
+#define EXT4_MOUNT_INIT_INODE_TABLE	0x80000000 /* Initialize uninitialized itables */
 
 #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt
 #define set_opt(o, opt)			o |= EXT4_MOUNT_##opt
@@ -1173,6 +1174,11 @@ struct ext4_sb_info {
 
 	/* timer for periodic error stats printing */
 	struct timer_list s_err_report;
+
+	/* Lazy inode table initialization info */
+	struct ext4_li_request *s_li_request;
+	/* Wait multiplier for lazy initialization thread */
+	unsigned int s_li_wait_mult;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
@@ -1537,6 +1543,37 @@ void ext4_get_group_no_and_offset(struct super_block *sb, ext4_fsblk_t blocknr,
 extern struct proc_dir_entry *ext4_proc_root;
 
 /*
+ * Timeout and state flag for lazy initialization inode thread.
+ */
+#define EXT4_DEF_LI_WAIT_MULT			10
+#define EXT4_DEF_LI_MAX_START_DELAY		5
+#define EXT4_LAZYINIT_QUIT			0x0001
+#define EXT4_LAZYINIT_RUNNING			0x0002
+
+/*
+ * Lazy inode table initialization info
+ */
+struct ext4_lazy_init {
+	unsigned long		li_state;
+
+	wait_queue_head_t	li_wait_daemon;
+	wait_queue_head_t	li_wait_task;
+	struct timer_list	li_timer;
+	struct task_struct	*li_task;
+
+	struct list_head	li_request_list;
+	struct mutex		li_list_mtx;
+};
+
+struct ext4_li_request {
+	struct super_block	*lr_super;
+	struct ext4_sb_info	*lr_sbi;
+	ext4_group_t		lr_next_group;
+	struct list_head	lr_request;
+	unsigned long		lr_next_sched;
+};
+
+/*
  * Function prototypes
  */
 
@@ -1611,6 +1648,8 @@ extern unsigned ext4_init_inode_bitmap(struct super_block *sb,
 				       ext4_group_t group,
 				       struct ext4_group_desc *desc);
 extern void mark_bitmap_end(int start_bit, int end_bit, char *bitmap);
+extern int ext4_init_inode_table(struct super_block *sb,
+				 ext4_group_t group);
 
 /* mballoc.c */
 extern long ext4_mb_stats;
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 45853e0..ea3ba70 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -107,6 +107,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
 	desc = ext4_get_group_desc(sb, block_group, NULL);
 	if (!desc)
 		return NULL;
+
 	bitmap_blk = ext4_inode_bitmap(sb, desc);
 	bh = sb_getblk(sb, bitmap_blk);
 	if (unlikely(!bh)) {
@@ -123,6 +124,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
 		unlock_buffer(bh);
 		return bh;
 	}
+
 	ext4_lock_group(sb, block_group);
 	if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
 		ext4_init_inode_bitmap(sb, bh, block_group, desc);
@@ -133,6 +135,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
 		return bh;
 	}
 	ext4_unlock_group(sb, block_group);
+
 	if (buffer_uptodate(bh)) {
 		/*
 		 * if not uninit if bh is uptodate,
@@ -712,8 +715,17 @@ static int ext4_claim_inode(struct super_block *sb,
 {
 	int free = 0, retval = 0, count;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
 	struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group, NULL);
 
+	/*
+	 * We have to be sure that new inode allocation does not race with
+	 * inode table initialization, because otherwise we may end up
+	 * allocating and writing new inode right before sb_issue_zeroout
+	 * takes place and overwriting our new inode with zeroes. So we
+	 * take alloc_sem to prevent it.
+	 */
+	down_read(&grp->alloc_sem);
 	ext4_lock_group(sb, group);
 	if (ext4_set_bit(ino, inode_bitmap_bh->b_data)) {
 		/* not a free inode */
@@ -724,6 +736,7 @@ static int ext4_claim_inode(struct super_block *sb,
 	if ((group == 0 && ino < EXT4_FIRST_INO(sb)) ||
 			ino > EXT4_INODES_PER_GROUP(sb)) {
 		ext4_unlock_group(sb, group);
+		up_read(&grp->alloc_sem);
 		ext4_error(sb, "reserved inode or inode > inodes count - "
 			   "block_group = %u, inode=%lu", group,
 			   ino + group * EXT4_INODES_PER_GROUP(sb));
@@ -772,6 +785,7 @@ static int ext4_claim_inode(struct super_block *sb,
 	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
 err_ret:
 	ext4_unlock_group(sb, group);
+	up_read(&grp->alloc_sem);
 	return retval;
 }
 
@@ -1205,3 +1219,105 @@ unsigned long ext4_count_dirs(struct super_block * sb)
 	}
 	return count;
 }
+
+/*
+ * Zeroes not yet zeroed inode table - just write zeroes through the whole
+ * inode table. Must be called without any spinlock held. The only place
+ * where it is called from on active part of filesystem is ext4lazyinit
+ * thread, so we do not need any special locks, however we have to prevent
+ * inode allocation from the current group, so we take alloc_sem lock, to
+ * block ext4_claim_inode until we are finished.
+ */
+extern int ext4_init_inode_table(struct super_block *sb, ext4_group_t group)
+{
+	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_group_desc *gdp = NULL;
+	struct buffer_head *group_desc_bh;
+	handle_t *handle;
+	ext4_fsblk_t blk;
+	int num, ret = 0, used_blks = 0;
+
+	/* This should not happen, but just to be sure check this */
+	if (sb->s_flags & MS_RDONLY) {
+		ret = 1;
+		goto out;
+	}
+
+	gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
+	if (!gdp)
+		goto out;
+
+	/*
+	 * We do not need to lock this, because we are the only one
+	 * handling this flag.
+	 */
+	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED))
+		goto out;
+
+	handle = ext4_journal_start_sb(sb, 1);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		goto out;
+	}
+
+	down_write(&grp->alloc_sem);
+	/*
+	 * If inode bitmap was already initialized there may be some
+	 * used inodes so we need to skip blocks with used inodes in
+	 * inode table.
+	 */
+	if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)))
+		used_blks = DIV_ROUND_UP((EXT4_INODES_PER_GROUP(sb) -
+			    ext4_itable_unused_count(sb, gdp)),
+			    sbi->s_inodes_per_block);
+
+	blk = ext4_inode_table(sb, gdp) + used_blks;
+	num = sbi->s_itb_per_group - used_blks;
+
+	BUFFER_TRACE(group_desc_bh, "get_write_access");
+	ret = ext4_journal_get_write_access(handle,
+					    group_desc_bh);
+	if (ret)
+		goto err_out;
+
+	if (unlikely(num > EXT4_INODES_PER_GROUP(sb))) {
+		ext4_error(sb, "Something is wrong with group %u\n"
+			   "Used itable blocks: %d"
+			   "Itable blocks per group: %lu\n",
+			   group, used_blks, sbi->s_itb_per_group);
+		ret = 1;
+		goto err_out;
+	}
+
+	/*
+	 * Skip zeroout if the inode table is full. But we set the ZEROED
+	 * flag anyway, because obviously, when it is full it does not need
+	 * further zeroing.
+	 */
+	if (unlikely(num == 0))
+		goto skip_zeroout;
+
+	ext4_debug("going to zero out inode table in group %d\n",
+		   group);
+	ret = sb_issue_zeroout(sb, blk, num, GFP_NOFS, BLKDEV_IFL_WAIT);
+	if (ret < 0)
+		goto err_out;
+
+skip_zeroout:
+	ext4_lock_group(sb, group);
+	gdp->bg_flags |= cpu_to_le16(EXT4_BG_INODE_ZEROED);
+	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
+	ext4_unlock_group(sb, group);
+
+	BUFFER_TRACE(group_desc_bh,
+		     "call ext4_handle_dirty_metadata");
+	ret = ext4_handle_dirty_metadata(handle, NULL,
+					 group_desc_bh);
+
+err_out:
+	up_write(&grp->alloc_sem);
+	ext4_journal_stop(handle);
+out:
+	return ret;
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 751997d..825d847 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -41,6 +41,9 @@
 #include <linux/crc16.h>
 #include <asm/uaccess.h>
 
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+
 #include "ext4.h"
 #include "ext4_jbd2.h"
 #include "xattr.h"
@@ -52,6 +55,8 @@
 
 struct proc_dir_entry *ext4_proc_root;
 static struct kset *ext4_kset;
+struct ext4_lazy_init *ext4_li_info;
+struct mutex ext4_li_mtx;
 
 static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
 			     unsigned long journal_devnum);
@@ -70,6 +75,8 @@ static void ext4_write_super(struct super_block *sb);
 static int ext4_freeze(struct super_block *sb);
 static int ext4_get_sb(struct file_system_type *fs_type, int flags,
 		       const char *dev_name, void *data, struct vfsmount *mnt);
+static void ext4_destroy_lazyinit_thread(void);
+static void ext4_unregister_li_request(struct super_block *sb);
 
 #if !defined(CONFIG_EXT3_FS) && !defined(CONFIG_EXT3_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT23)
 static struct file_system_type ext3_fs_type = {
@@ -720,6 +727,7 @@ static void ext4_put_super(struct super_block *sb)
 	}
 
 	del_timer(&sbi->s_err_report);
+	ext4_unregister_li_request(sb);
 	ext4_release_system_zone(sb);
 	ext4_mb_release(sb);
 	ext4_ext_release(sb);
@@ -1046,6 +1054,12 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	    !(def_mount_opts & EXT4_DEFM_BLOCK_VALIDITY))
 		seq_puts(seq, ",block_validity");
 
+	if (!test_opt(sb, INIT_INODE_TABLE))
+		seq_puts(seq, ",noinit_inode_table");
+	else if (sbi->s_li_wait_mult)
+		seq_printf(seq, ",init_inode_table=%u",
+			   (unsigned) sbi->s_li_wait_mult);
+
 	ext4_show_quota_options(seq, sb);
 
 	return 0;
@@ -1220,6 +1234,7 @@ enum {
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard,
+	Opt_init_inode_table, Opt_noinit_inode_table,
 };
 
 static const match_table_t tokens = {
@@ -1290,6 +1305,9 @@ static const match_table_t tokens = {
 	{Opt_dioread_lock, "dioread_lock"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
+	{Opt_init_inode_table, "init_itable=%u"},
+	{Opt_init_inode_table, "init_itable"},
+	{Opt_noinit_inode_table, "noinit_itable"},
 	{Opt_err, NULL},
 };
 
@@ -1760,6 +1778,20 @@ set_qf_format:
 		case Opt_dioread_lock:
 			clear_opt(sbi->s_mount_opt, DIOREAD_NOLOCK);
 			break;
+		case Opt_init_inode_table:
+			set_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
+			if (args[0].from) {
+				if (match_int(&args[0], &option))
+					return 0;
+			} else
+				option = EXT4_DEF_LI_WAIT_MULT;
+			if (option < 0)
+				return 0;
+			sbi->s_li_wait_mult = option;
+			break;
+		case Opt_noinit_inode_table:
+			clear_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
+			break;
 		default:
 			ext4_msg(sb, KERN_ERR,
 			       "Unrecognized mount option \"%s\" "
@@ -1943,7 +1975,8 @@ int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 block_group,
 }
 
 /* Called at mount-time, super-block is locked */
-static int ext4_check_descriptors(struct super_block *sb)
+static int ext4_check_descriptors(struct super_block *sb,
+				  ext4_group_t *first_not_zeroed)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	ext4_fsblk_t first_block = le32_to_cpu(sbi->s_es->s_first_data_block);
@@ -1952,7 +1985,7 @@ static int ext4_check_descriptors(struct super_block *sb)
 	ext4_fsblk_t inode_bitmap;
 	ext4_fsblk_t inode_table;
 	int flexbg_flag = 0;
-	ext4_group_t i;
+	ext4_group_t i, grp = sbi->s_groups_count;
 
 	if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG))
 		flexbg_flag = 1;
@@ -1968,6 +2001,10 @@ static int ext4_check_descriptors(struct super_block *sb)
 			last_block = first_block +
 				(EXT4_BLOCKS_PER_GROUP(sb) - 1);
 
+		if ((grp == sbi->s_groups_count) &&
+		   !(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
+			grp = i;
+
 		block_bitmap = ext4_block_bitmap(sb, gdp);
 		if (block_bitmap < first_block || block_bitmap > last_block) {
 			ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
@@ -2005,6 +2042,8 @@ static int ext4_check_descriptors(struct super_block *sb)
 		if (!flexbg_flag)
 			first_block += EXT4_BLOCKS_PER_GROUP(sb);
 	}
+	if (NULL != first_not_zeroed)
+		*first_not_zeroed = grp;
 
 	ext4_free_blocks_count_set(sbi->s_es, ext4_count_free_blocks(sb));
 	sbi->s_es->s_free_inodes_count =cpu_to_le32(ext4_count_free_inodes(sb));
@@ -2543,6 +2582,378 @@ static void print_daily_error_info(unsigned long arg)
 	mod_timer(&sbi->s_err_report, jiffies + 24*60*60*HZ);  /* Once a day */
 }
 
+static void ext4_lazyinode_timeout(unsigned long data)
+{
+	struct task_struct *p = (struct task_struct *)data;
+	wake_up_process(p);
+}
+
+/* Find next suitable group and run ext4_init_inode_table */
+static int ext4_run_li_request(struct ext4_li_request *elr)
+{
+	struct ext4_group_desc *gdp = NULL;
+	ext4_group_t group, ngroups;
+	struct super_block *sb;
+	int ret = 0;
+
+	sb = elr->lr_super;
+	ngroups = EXT4_SB(sb)->s_groups_count;
+
+	for (group = elr->lr_next_group; group < ngroups; group++) {
+		gdp = ext4_get_group_desc(sb, group, NULL);
+		if (!gdp) {
+			ret = 1;
+			break;
+		}
+
+		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
+			break;
+	}
+
+	if (group == ngroups)
+		ret = 1;
+
+	if (!ret) {
+		ret = ext4_init_inode_table(sb, group);
+		elr->lr_next_group = group + 1;
+	}
+
+	return ret;
+}
+
+/*
+ * Remove lr_request from the list_request and free the
+ * request structure. Should be called with li_list_mtx held
+ */
+static void ext4_remove_li_request(struct ext4_li_request *elr)
+{
+	struct ext4_sb_info *sbi;
+
+	if (!elr)
+		return;
+
+	sbi = elr->lr_sbi;
+
+	list_del(&elr->lr_request);
+	sbi->s_li_request = NULL;
+	kfree(elr);
+}
+
+static void ext4_unregister_li_request(struct super_block *sb)
+{
+	struct ext4_li_request *elr = EXT4_SB(sb)->s_li_request;
+
+	if (!ext4_li_info)
+		return;
+
+	mutex_lock(&ext4_li_info->li_list_mtx);
+	ext4_remove_li_request(elr);
+	mutex_unlock(&ext4_li_info->li_list_mtx);
+}
+
+/*
+ * This is the function where ext4lazyinit thread lives. It walks
+ * through the request list searching for next scheduled filesystem.
+ * When such a fs is found, run the lazy initialization request
+ * (ext4_run_li_request) and keep track of the time spent in this
+ * function. Based on that time we compute next schedule time of
+ * the request. When walking through the list is complete, compute
+ * next waking time and put itself into sleep.
+ */
+static int ext4_lazyinit_thread(void *arg)
+{
+	struct ext4_lazy_init *eli = (struct ext4_lazy_init *)arg;
+	struct list_head *pos, *n;
+	struct ext4_li_request *elr;
+	struct ext4_sb_info *sbi;
+	unsigned long next_wakeup;
+	unsigned long timeout = 0;
+	int ret;
+
+	BUG_ON(NULL == eli);
+
+	eli->li_timer.data = (unsigned long)current;
+	eli->li_timer.function = ext4_lazyinode_timeout;
+
+	eli->li_task = current;
+	wake_up(&eli->li_wait_task);
+
+cont_thread:
+	while (true) {
+		next_wakeup = ULONG_MAX;
+
+		mutex_lock(&eli->li_list_mtx);
+		if (list_empty(&eli->li_request_list)) {
+			mutex_unlock(&eli->li_list_mtx);
+			goto exit_thread;
+		}
+
+		list_for_each_safe(pos, n, &eli->li_request_list) {
+			elr = list_entry(pos, struct ext4_li_request,
+					 lr_request);
+
+			if (time_before_eq(jiffies, elr->lr_next_sched))
+				continue;
+			sbi = elr->lr_sbi;
+
+			timeout = jiffies;
+			ret = ext4_run_li_request(elr);
+			timeout = (jiffies - timeout) * sbi->s_li_wait_mult;
+
+			if (ret) {
+				ext4_remove_li_request(elr);
+				continue;
+			}
+
+			elr->lr_next_sched = jiffies + timeout;
+			if (elr->lr_next_sched < next_wakeup)
+				next_wakeup = elr->lr_next_sched;
+		}
+		mutex_unlock(&eli->li_list_mtx);
+
+		/*
+		 * We need to check this otherwise we may end up sleeping
+		 * for very long time.
+		 */
+		if (jiffies >= next_wakeup) {
+			cond_resched();
+			continue;
+		}
+
+		eli->li_timer.expires = next_wakeup;
+		add_timer(&eli->li_timer);
+
+		if (freezing(current)) {
+			refrigerator();
+		} else {
+			DEFINE_WAIT(wait);
+			prepare_to_wait(&eli->li_wait_daemon, &wait,
+					TASK_INTERRUPTIBLE);
+			schedule();
+			finish_wait(&eli->li_wait_daemon, &wait);
+		}
+	}
+
+exit_thread:
+	/*
+	 * It looks like the request list is empty, but we need
+	 * to check it under the li_list_mtx lock, to prevent any
+	 * additions into it, and of course we should lock ext4_li_mtx
+	 * to atomically free the list and ext4_li_info, because at
+	 * this point another ext4 filesystem could be registering
+	 * new one.
+	 */
+	mutex_lock(&ext4_li_mtx);
+	mutex_lock(&eli->li_list_mtx);
+	if (!list_empty(&eli->li_request_list)) {
+		mutex_unlock(&eli->li_list_mtx);
+		mutex_unlock(&ext4_li_mtx);
+		goto cont_thread;
+	}
+	mutex_unlock(&eli->li_list_mtx);
+	del_timer_sync(&ext4_li_info->li_timer);
+	eli->li_task = NULL;
+	wake_up(&eli->li_wait_task);
+
+	kfree(ext4_li_info);
+	ext4_li_info = NULL;
+	mutex_unlock(&ext4_li_mtx);
+
+	return 0;
+}
+
+static void ext4_clear_request_list(void)
+{
+	struct list_head *pos, *n;
+	struct ext4_li_request *elr;
+
+	mutex_lock(&ext4_li_info->li_list_mtx);
+	if (list_empty(&ext4_li_info->li_request_list))
+		return;
+
+	list_for_each_safe(pos, n, &ext4_li_info->li_request_list) {
+		elr = list_entry(pos, struct ext4_li_request,
+				 lr_request);
+		ext4_remove_li_request(elr);
+	}
+	mutex_unlock(&ext4_li_info->li_list_mtx);
+}
+
+static int ext4_run_lazyinit_thread(void)
+{
+	struct task_struct *t;
+
+	t = kthread_run(ext4_lazyinit_thread, ext4_li_info, "ext4lazyinit");
+	if (IS_ERR(t)) {
+		int err = PTR_ERR(t);
+		ext4_clear_request_list();
+		del_timer_sync(&ext4_li_info->li_timer);
+		kfree(ext4_li_info);
+		ext4_li_info = NULL;
+		printk(KERN_CRIT "EXT4: error %d creating inode table "
+				 "initialization thread\n",
+				 err);
+		return err;
+	}
+	ext4_li_info->li_state |= EXT4_LAZYINIT_RUNNING;
+
+	wait_event(ext4_li_info->li_wait_task, ext4_li_info->li_task != NULL);
+	return 0;
+}
+
+/*
+ * Check whether it makes sense to run itable init. thread or not.
+ * If there is at least one uninitialized inode table, return
+ * corresponding group number, else the loop goes through all
+ * groups and return total number of groups.
+ */
+static ext4_group_t ext4_has_uninit_itable(struct super_block *sb)
+{
+	ext4_group_t group, ngroups = EXT4_SB(sb)->s_groups_count;
+	struct ext4_group_desc *gdp = NULL;
+
+	for (group = 0; group < ngroups; group++) {
+		gdp = ext4_get_group_desc(sb, group, NULL);
+		if (!gdp)
+			continue;
+
+		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
+			break;
+	}
+
+	return group;
+}
+
+static int ext4_li_info_new(void)
+{
+	struct ext4_lazy_init *eli = NULL;
+
+	eli = kzalloc(sizeof(*eli), GFP_KERNEL);
+	if (!eli)
+		return -ENOMEM;
+
+	eli->li_task = NULL;
+	INIT_LIST_HEAD(&eli->li_request_list);
+	mutex_init(&eli->li_list_mtx);
+
+	init_waitqueue_head(&eli->li_wait_daemon);
+	init_waitqueue_head(&eli->li_wait_task);
+	init_timer(&eli->li_timer);
+	eli->li_state |= EXT4_LAZYINIT_QUIT;
+
+	ext4_li_info = eli;
+
+	return 0;
+}
+
+static struct ext4_li_request *ext4_li_request_new(struct super_block *sb,
+					    ext4_group_t start)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_li_request *elr;
+	unsigned long rnd;
+
+	elr = kzalloc(sizeof(*elr), GFP_KERNEL);
+	if (!elr)
+		return NULL;
+
+	elr->lr_super = sb;
+	elr->lr_sbi = sbi;
+	elr->lr_next_group = start;
+
+	/*
+	 * Randomize first schedule time of the request to
+	 * spread the inode table initialization requests
+	 * better.
+	 */
+	get_random_bytes(&rnd, sizeof(rnd));
+	elr->lr_next_sched = jiffies + (unsigned long)rnd %
+			     (EXT4_DEF_LI_MAX_START_DELAY * HZ);
+
+	return elr;
+}
+
+static int ext4_register_li_request(struct super_block *sb,
+				    ext4_group_t first_not_zeroed)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_li_request *elr;
+	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
+	int ret = 0;
+
+	if (sbi->s_li_request != NULL)
+		goto out;
+
+	if (first_not_zeroed == ngroups ||
+	    (sb->s_flags & MS_RDONLY) ||
+	    !test_opt(sb, INIT_INODE_TABLE)) {
+		sbi->s_li_request = NULL;
+		goto out;
+	}
+
+	if (first_not_zeroed == ngroups) {
+		sbi->s_li_request = NULL;
+		goto out;
+	}
+
+	elr = ext4_li_request_new(sb, first_not_zeroed);
+	if (!elr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	mutex_lock(&ext4_li_mtx);
+
+	if (NULL == ext4_li_info) {
+		ret = ext4_li_info_new();
+		if (ret)
+			goto out;
+	}
+
+	mutex_lock(&ext4_li_info->li_list_mtx);
+	list_add(&elr->lr_request, &ext4_li_info->li_request_list);
+	mutex_unlock(&ext4_li_info->li_list_mtx);
+
+	sbi->s_li_request = elr;
+
+	if (!(ext4_li_info->li_state & EXT4_LAZYINIT_RUNNING)) {
+		ret = ext4_run_lazyinit_thread();
+		if (ret)
+			goto out;
+	}
+
+	mutex_unlock(&ext4_li_mtx);
+
+out:
+	if (ret) {
+		mutex_unlock(&ext4_li_mtx);
+		kfree(elr);
+	}
+	return ret;
+}
+
+/*
+ * We do not need to lock anything since this is called on
+ * module unload.
+ */
+static void ext4_destroy_lazyinit_thread(void)
+{
+	/*
+	 * If thread exited earlier
+	 * there's nothing to be done.
+	 */
+	if (!ext4_li_info)
+		return;
+
+	ext4_clear_request_list();
+
+	while (ext4_li_info->li_task) {
+		wake_up(&ext4_li_info->li_wait_daemon);
+		wait_event(ext4_li_info->li_wait_task,
+			   ext4_li_info->li_task == NULL);
+	}
+}
+
 static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 				__releases(kernel_lock)
 				__acquires(kernel_lock)
@@ -2568,6 +2979,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	__u64 blocks_count;
 	int err;
 	unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
+	ext4_group_t first_not_zeroed;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -2630,6 +3042,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 
 	/* Set defaults before we parse the mount options */
 	def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
+	set_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
 	if (def_mount_opts & EXT4_DEFM_DEBUG)
 		set_opt(sbi->s_mount_opt, DEBUG);
 	if (def_mount_opts & EXT4_DEFM_BSDGROUPS) {
@@ -2909,7 +3322,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 			goto failed_mount2;
 		}
 	}
-	if (!ext4_check_descriptors(sb)) {
+	if (!ext4_check_descriptors(sb, &first_not_zeroed)) {
 		ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
 		goto failed_mount2;
 	}
@@ -3130,6 +3543,10 @@ no_journal:
 		goto failed_mount4;
 	}
 
+	err = ext4_register_li_request(sb, first_not_zeroed);
+	if (err)
+		goto failed_mount4;
+
 	sbi->s_kobj.kset = ext4_kset;
 	init_completion(&sbi->s_kobj_unregister);
 	err = kobject_init_and_add(&sbi->s_kobj, &ext4_ktype, NULL,
@@ -3847,6 +4264,19 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 			enable_quota = 1;
 		}
 	}
+
+	/*
+	 * Reinitialize lazy itable initialization thread based on
+	 * current settings
+	 */
+	if ((sb->s_flags & MS_RDONLY) || !test_opt(sb, INIT_INODE_TABLE))
+		ext4_unregister_li_request(sb);
+	else {
+		ext4_group_t first_not_zeroed;
+		first_not_zeroed = ext4_has_uninit_itable(sb);
+		ext4_register_li_request(sb, first_not_zeroed);
+	}
+
 	ext4_setup_system_zone(sb);
 	if (sbi->s_journal == NULL)
 		ext4_commit_super(sb, 1);
@@ -4317,6 +4747,9 @@ static int __init init_ext4_fs(void)
 	err = register_filesystem(&ext4_fs_type);
 	if (err)
 		goto out;
+
+	ext4_li_info = NULL;
+	mutex_init(&ext4_li_mtx);
 	return 0;
 out:
 	unregister_as_ext2();
@@ -4336,6 +4769,7 @@ out4:
 
 static void __exit exit_ext4_fs(void)
 {
+	ext4_destroy_lazyinit_thread();
 	unregister_as_ext2();
 	unregister_as_ext3();
 	unregister_filesystem(&ext4_fs_type);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6 v4] Lazy itable initialization for Ext4
  2010-10-03  2:43         ` Ted Ts'o
@ 2010-10-04  2:36           ` Ted Ts'o
  2010-10-04  7:31             ` Ted Ts'o
  2010-10-04 13:14             ` Lukas Czerner
  0 siblings, 2 replies; 22+ messages in thread
From: Ted Ts'o @ 2010-10-04  2:36 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, rwheeler, sandeen, adilger, snitzer

I've made some more changes.  This version updates the timing control.
The major changes are:

1) Time how long it takes to clear the inode table with a barrier (once),
and then use that value for the rest of the block groups for that file system.

2) s_li_wait_mult wasn't getting defaulted, so we weren't waiting any
time at all between sb_issue_zeroout calls.

3) Fix the timer arithmetic so it works across jiffies rollover.
(This means using time_before() instead of <)

						- Ted

>From 87fe012bfa04e1ac95a4a96f90b70c2a0983e228 Mon Sep 17 00:00:00 2001
From: Lukas Czerner <lczerner@redhat.com>
Date: Sun, 3 Oct 2010 22:31:15 -0400
Subject: [PATCH] ext4: Add support for lazy inode table initialization

When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out.  The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted.  However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.

Hence, it is important for the inode tables to be initialized as soon
as possible.  This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.

This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed.  There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, then the filesystem can register its
request in the request list.

This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). The next
schedule time for a request is computed by multiplying the time it took
to zero out the last inode table by the wait multiplier, which can be
set with the init_itable=n mount option (default is 10).  We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).

We do not disturb regular inode allocations in any way; the allocator
simply does not care whether the inode table is zeroed or not. But when
zeroing, we have to skip used inodes, obviously. We should also prevent
new inode allocations from the group while zeroing is under way. For
that we take the alloc_sem lock for writing in ext4_init_inode_table()
and for reading in ext4_claim_inode(), so when we are unlucky and the
allocator hits the group which is currently being zeroed, it just has
to wait.

This can be suppressed using the mount option noinit_itable.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 Documentation/filesystems/ext4.txt |   14 ++
 fs/ext4/ext4.h                     |   40 ++++
 fs/ext4/ialloc.c                   |  120 ++++++++++
 fs/ext4/super.c                    |  439 +++++++++++++++++++++++++++++++++++-
 4 files changed, 610 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index e1def17..6ab9442 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -353,6 +353,20 @@ noauto_da_alloc		replacing existing files via patterns such as
 			system crashes before the delayed allocation
 			blocks are forced to disk.
 
+noinit_itable		Do not initialize any uninitialized inode table
+			blocks in the background.  This feature may be
+			used by installation CD's so that the install
+			process can complete as quickly as possible; the
+			inode table initialization process would then be
+			deferred until the next time the file system
+			is unmounted.
+
+init_itable=n		The lazy itable init code will wait n times the
+			number of milliseconds it took to zero out the
+			previous block group's inode table.  This
+			minimizes the impact on system performance
+			while the file system's inode table is being initialized.
+
 discard		Controls whether ext4 should issue discard/TRIM
 nodiscard(*)		commands to the underlying block device when
 			blocks are freed.  This is useful for SSD devices
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b364b9d..0fe078d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -890,6 +890,7 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_DATA_ERR_ABORT	0x10000000 /* Abort on file data write */
 #define EXT4_MOUNT_BLOCK_VALIDITY	0x20000000 /* Block validity checking */
 #define EXT4_MOUNT_DISCARD		0x40000000 /* Issue DISCARD requests */
+#define EXT4_MOUNT_INIT_INODE_TABLE	0x80000000 /* Initialize uninitialized itables */
 
 #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt
 #define set_opt(o, opt)			o |= EXT4_MOUNT_##opt
@@ -1173,6 +1174,11 @@ struct ext4_sb_info {
 
 	/* timer for periodic error stats printing */
 	struct timer_list s_err_report;
+
+	/* Lazy inode table initialization info */
+	struct ext4_li_request *s_li_request;
+	/* Wait multiplier for lazy initialization thread */
+	unsigned int s_li_wait_mult;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
@@ -1537,6 +1543,38 @@ void ext4_get_group_no_and_offset(struct super_block *sb, ext4_fsblk_t blocknr,
 extern struct proc_dir_entry *ext4_proc_root;
 
 /*
+ * Timeout and state flag for lazy initialization inode thread.
+ */
+#define EXT4_DEF_LI_WAIT_MULT			10
+#define EXT4_DEF_LI_MAX_START_DELAY		5
+#define EXT4_LAZYINIT_QUIT			0x0001
+#define EXT4_LAZYINIT_RUNNING			0x0002
+
+/*
+ * Lazy inode table initialization info
+ */
+struct ext4_lazy_init {
+	unsigned long		li_state;
+
+	wait_queue_head_t	li_wait_daemon;
+	wait_queue_head_t	li_wait_task;
+	struct timer_list	li_timer;
+	struct task_struct	*li_task;
+
+	struct list_head	li_request_list;
+	struct mutex		li_list_mtx;
+};
+
+struct ext4_li_request {
+	struct super_block	*lr_super;
+	struct ext4_sb_info	*lr_sbi;
+	ext4_group_t		lr_next_group;
+	struct list_head	lr_request;
+	unsigned long		lr_next_sched;
+	unsigned long		lr_timeout;
+};
+
+/*
  * Function prototypes
  */
 
@@ -1611,6 +1649,8 @@ extern unsigned ext4_init_inode_bitmap(struct super_block *sb,
 				       ext4_group_t group,
 				       struct ext4_group_desc *desc);
 extern void mark_bitmap_end(int start_bit, int end_bit, char *bitmap);
+extern int ext4_init_inode_table(struct super_block *sb,
+				 ext4_group_t group, int barrier);
 
 /* mballoc.c */
 extern long ext4_mb_stats;
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 45853e0..e428f23 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -107,6 +107,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
 	desc = ext4_get_group_desc(sb, block_group, NULL);
 	if (!desc)
 		return NULL;
+
 	bitmap_blk = ext4_inode_bitmap(sb, desc);
 	bh = sb_getblk(sb, bitmap_blk);
 	if (unlikely(!bh)) {
@@ -123,6 +124,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
 		unlock_buffer(bh);
 		return bh;
 	}
+
 	ext4_lock_group(sb, block_group);
 	if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
 		ext4_init_inode_bitmap(sb, bh, block_group, desc);
@@ -133,6 +135,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
 		return bh;
 	}
 	ext4_unlock_group(sb, block_group);
+
 	if (buffer_uptodate(bh)) {
 		/*
 		 * if not uninit if bh is uptodate,
@@ -712,8 +715,17 @@ static int ext4_claim_inode(struct super_block *sb,
 {
 	int free = 0, retval = 0, count;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
 	struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group, NULL);
 
+	/*
+	 * We have to be sure that new inode allocation does not race with
+	 * inode table initialization, because otherwise we may end up
+	 * allocating and writing new inode right before sb_issue_zeroout
+	 * takes place and overwriting our new inode with zeroes. So we
+	 * take alloc_sem to prevent it.
+	 */
+	down_read(&grp->alloc_sem);
 	ext4_lock_group(sb, group);
 	if (ext4_set_bit(ino, inode_bitmap_bh->b_data)) {
 		/* not a free inode */
@@ -724,6 +736,7 @@ static int ext4_claim_inode(struct super_block *sb,
 	if ((group == 0 && ino < EXT4_FIRST_INO(sb)) ||
 			ino > EXT4_INODES_PER_GROUP(sb)) {
 		ext4_unlock_group(sb, group);
+		up_read(&grp->alloc_sem);
 		ext4_error(sb, "reserved inode or inode > inodes count - "
 			   "block_group = %u, inode=%lu", group,
 			   ino + group * EXT4_INODES_PER_GROUP(sb));
@@ -772,6 +785,7 @@ static int ext4_claim_inode(struct super_block *sb,
 	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
 err_ret:
 	ext4_unlock_group(sb, group);
+	up_read(&grp->alloc_sem);
 	return retval;
 }
 
@@ -1205,3 +1219,109 @@ unsigned long ext4_count_dirs(struct super_block * sb)
 	}
 	return count;
 }
+
+/*
+ * Zeroes not yet zeroed inode table - just write zeroes through the whole
+ * inode table. Must be called without any spinlock held. The only place
+ * where it is called from on active part of filesystem is ext4lazyinit
+ * thread, so we do not need any special locks, however we have to prevent
+ * inode allocation from the current group, so we take alloc_sem lock, to
+ * block ext4_claim_inode until we are finished.
+ */
+extern int ext4_init_inode_table(struct super_block *sb, ext4_group_t group,
+				 int barrier)
+{
+	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_group_desc *gdp = NULL;
+	struct buffer_head *group_desc_bh;
+	handle_t *handle;
+	ext4_fsblk_t blk;
+	int num, ret = 0, used_blks = 0;
+	unsigned long flags = BLKDEV_IFL_WAIT;
+
+	/* This should not happen, but just to be sure check this */
+	if (sb->s_flags & MS_RDONLY) {
+		ret = 1;
+		goto out;
+	}
+
+	gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
+	if (!gdp)
+		goto out;
+
+	/*
+	 * We do not need to lock this, because we are the only one
+	 * handling this flag.
+	 */
+	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED))
+		goto out;
+
+	handle = ext4_journal_start_sb(sb, 1);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		goto out;
+	}
+
+	down_write(&grp->alloc_sem);
+	/*
+	 * If inode bitmap was already initialized there may be some
+	 * used inodes so we need to skip blocks with used inodes in
+	 * inode table.
+	 */
+	if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)))
+		used_blks = DIV_ROUND_UP((EXT4_INODES_PER_GROUP(sb) -
+			    ext4_itable_unused_count(sb, gdp)),
+			    sbi->s_inodes_per_block);
+
+	blk = ext4_inode_table(sb, gdp) + used_blks;
+	num = sbi->s_itb_per_group - used_blks;
+
+	BUFFER_TRACE(group_desc_bh, "get_write_access");
+	ret = ext4_journal_get_write_access(handle,
+					    group_desc_bh);
+	if (ret)
+		goto err_out;
+
+	if (unlikely(num > EXT4_INODES_PER_GROUP(sb))) {
+		ext4_error(sb, "Something is wrong with group %u\n"
+			   "Used itable blocks: %d"
+			   "Itable blocks per group: %lu\n",
+			   group, used_blks, sbi->s_itb_per_group);
+		ret = 1;
+		goto err_out;
+	}
+
+	/*
+	 * Skip zeroout if the inode table is full. But we set the ZEROED
+	 * flag anyway, because obviously, when it is full it does not need
+	 * further zeroing.
+	 */
+	if (unlikely(num == 0))
+		goto skip_zeroout;
+
+	ext4_debug("going to zero out inode table in group %d\n",
+		   group);
+	if (barrier)
+		flags |= BLKDEV_IFL_BARRIER;
+	ret = sb_issue_zeroout(sb, blk, num, GFP_NOFS, flags);
+	if (ret < 0)
+		goto err_out;
+
+skip_zeroout:
+	ext4_lock_group(sb, group);
+	gdp->bg_flags |= cpu_to_le16(EXT4_BG_INODE_ZEROED);
+	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
+	ext4_unlock_group(sb, group);
+
+	BUFFER_TRACE(group_desc_bh,
+		     "call ext4_handle_dirty_metadata");
+	ret = ext4_handle_dirty_metadata(handle, NULL,
+					 group_desc_bh);
+
+err_out:
+	up_write(&grp->alloc_sem);
+	ext4_journal_stop(handle);
+out:
+	return ret;
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 751997d..c4b9984 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -41,6 +41,9 @@
 #include <linux/crc16.h>
 #include <asm/uaccess.h>
 
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+
 #include "ext4.h"
 #include "ext4_jbd2.h"
 #include "xattr.h"
@@ -52,6 +55,8 @@
 
 struct proc_dir_entry *ext4_proc_root;
 static struct kset *ext4_kset;
+struct ext4_lazy_init *ext4_li_info;
+struct mutex ext4_li_mtx;
 
 static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
 			     unsigned long journal_devnum);
@@ -70,6 +75,8 @@ static void ext4_write_super(struct super_block *sb);
 static int ext4_freeze(struct super_block *sb);
 static int ext4_get_sb(struct file_system_type *fs_type, int flags,
 		       const char *dev_name, void *data, struct vfsmount *mnt);
+static void ext4_destroy_lazyinit_thread(void);
+static void ext4_unregister_li_request(struct super_block *sb);
 
 #if !defined(CONFIG_EXT3_FS) && !defined(CONFIG_EXT3_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT23)
 static struct file_system_type ext3_fs_type = {
@@ -720,6 +727,7 @@ static void ext4_put_super(struct super_block *sb)
 	}
 
 	del_timer(&sbi->s_err_report);
+	ext4_unregister_li_request(sb);
 	ext4_release_system_zone(sb);
 	ext4_mb_release(sb);
 	ext4_ext_release(sb);
@@ -1046,6 +1054,12 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	    !(def_mount_opts & EXT4_DEFM_BLOCK_VALIDITY))
 		seq_puts(seq, ",block_validity");
 
+	if (!test_opt(sb, INIT_INODE_TABLE))
+		seq_puts(seq, ",noinit_inode_table");
+	else if (sbi->s_li_wait_mult)
+		seq_printf(seq, ",init_inode_table=%u",
+			   (unsigned) sbi->s_li_wait_mult);
+
 	ext4_show_quota_options(seq, sb);
 
 	return 0;
@@ -1220,6 +1234,7 @@ enum {
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard,
+	Opt_init_inode_table, Opt_noinit_inode_table,
 };
 
 static const match_table_t tokens = {
@@ -1290,6 +1305,9 @@ static const match_table_t tokens = {
 	{Opt_dioread_lock, "dioread_lock"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
+	{Opt_init_inode_table, "init_itable=%u"},
+	{Opt_init_inode_table, "init_itable"},
+	{Opt_noinit_inode_table, "noinit_itable"},
 	{Opt_err, NULL},
 };
 
@@ -1760,6 +1778,20 @@ set_qf_format:
 		case Opt_dioread_lock:
 			clear_opt(sbi->s_mount_opt, DIOREAD_NOLOCK);
 			break;
+		case Opt_init_inode_table:
+			set_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
+			if (args[0].from) {
+				if (match_int(&args[0], &option))
+					return 0;
+			} else
+				option = EXT4_DEF_LI_WAIT_MULT;
+			if (option < 0)
+				return 0;
+			sbi->s_li_wait_mult = option;
+			break;
+		case Opt_noinit_inode_table:
+			clear_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
+			break;
 		default:
 			ext4_msg(sb, KERN_ERR,
 			       "Unrecognized mount option \"%s\" "
@@ -1943,7 +1975,8 @@ int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 block_group,
 }
 
 /* Called at mount-time, super-block is locked */
-static int ext4_check_descriptors(struct super_block *sb)
+static int ext4_check_descriptors(struct super_block *sb,
+				  ext4_group_t *first_not_zeroed)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	ext4_fsblk_t first_block = le32_to_cpu(sbi->s_es->s_first_data_block);
@@ -1952,7 +1985,7 @@ static int ext4_check_descriptors(struct super_block *sb)
 	ext4_fsblk_t inode_bitmap;
 	ext4_fsblk_t inode_table;
 	int flexbg_flag = 0;
-	ext4_group_t i;
+	ext4_group_t i, grp = sbi->s_groups_count;
 
 	if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG))
 		flexbg_flag = 1;
@@ -1968,6 +2001,10 @@ static int ext4_check_descriptors(struct super_block *sb)
 			last_block = first_block +
 				(EXT4_BLOCKS_PER_GROUP(sb) - 1);
 
+		if ((grp == sbi->s_groups_count) &&
+		   !(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
+			grp = i;
+
 		block_bitmap = ext4_block_bitmap(sb, gdp);
 		if (block_bitmap < first_block || block_bitmap > last_block) {
 			ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
@@ -2005,6 +2042,8 @@ static int ext4_check_descriptors(struct super_block *sb)
 		if (!flexbg_flag)
 			first_block += EXT4_BLOCKS_PER_GROUP(sb);
 	}
+	if (NULL != first_not_zeroed)
+		*first_not_zeroed = grp;
 
 	ext4_free_blocks_count_set(sbi->s_es, ext4_count_free_blocks(sb));
 	sbi->s_es->s_free_inodes_count =cpu_to_le32(ext4_count_free_inodes(sb));
@@ -2543,6 +2582,377 @@ static void print_daily_error_info(unsigned long arg)
 	mod_timer(&sbi->s_err_report, jiffies + 24*60*60*HZ);  /* Once a day */
 }
 
+static void ext4_lazyinode_timeout(unsigned long data)
+{
+	struct task_struct *p = (struct task_struct *)data;
+	wake_up_process(p);
+}
+
+/* Find next suitable group and run ext4_init_inode_table */
+static int ext4_run_li_request(struct ext4_li_request *elr)
+{
+	struct ext4_group_desc *gdp = NULL;
+	ext4_group_t group, ngroups;
+	struct super_block *sb;
+	unsigned long timeout = 0;
+	int ret = 0;
+
+	sb = elr->lr_super;
+	ngroups = EXT4_SB(sb)->s_groups_count;
+
+	for (group = elr->lr_next_group; group < ngroups; group++) {
+		gdp = ext4_get_group_desc(sb, group, NULL);
+		if (!gdp) {
+			ret = 1;
+			break;
+		}
+
+		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
+			break;
+	}
+
+	if (group == ngroups)
+		ret = 1;
+
+	if (!ret) {
+		timeout = jiffies;
+		ret = ext4_init_inode_table(sb, group,
+					    elr->lr_timeout ? 0 : 1);
+		if (elr->lr_timeout == 0) {
+			timeout = jiffies - timeout;
+			if (elr->lr_sbi->s_li_wait_mult)
+				timeout *= elr->lr_sbi->s_li_wait_mult;
+			else
+				timeout *= 20;
+			elr->lr_timeout = timeout;
+		}
+		elr->lr_next_sched = jiffies + elr->lr_timeout;
+		elr->lr_next_group = group + 1;
+	}
+
+	return ret;
+}
+
+/*
+ * Remove lr_request from the list_request and free the
+ * request structure. Should be called with li_list_mtx held
+ */
+static void ext4_remove_li_request(struct ext4_li_request *elr)
+{
+	struct ext4_sb_info *sbi;
+
+	if (!elr)
+		return;
+
+	sbi = elr->lr_sbi;
+
+	list_del(&elr->lr_request);
+	sbi->s_li_request = NULL;
+	kfree(elr);
+}
+
+static void ext4_unregister_li_request(struct super_block *sb)
+{
+	struct ext4_li_request *elr = EXT4_SB(sb)->s_li_request;
+
+	if (!ext4_li_info)
+		return;
+
+	mutex_lock(&ext4_li_info->li_list_mtx);
+	ext4_remove_li_request(elr);
+	mutex_unlock(&ext4_li_info->li_list_mtx);
+}
+
+/*
+ * This is the function where ext4lazyinit thread lives. It walks
+ * through the request list searching for next scheduled filesystem.
+ * When such a fs is found, run the lazy initialization request
+ * (ext4_run_li_request) and keep track of the time spent in this
+ * function. Based on that time we compute the next schedule time of
+ * the request. When the walk through the list is complete, compute
+ * the next wakeup time and go to sleep.
+ */
+static int ext4_lazyinit_thread(void *arg)
+{
+	struct ext4_lazy_init *eli = (struct ext4_lazy_init *)arg;
+	struct list_head *pos, *n;
+	struct ext4_li_request *elr;
+	unsigned long next_wakeup;
+	DEFINE_WAIT(wait);
+	int ret;
+
+	BUG_ON(NULL == eli);
+
+	eli->li_timer.data = (unsigned long)current;
+	eli->li_timer.function = ext4_lazyinode_timeout;
+
+	eli->li_task = current;
+	wake_up(&eli->li_wait_task);
+
+cont_thread:
+	while (true) {
+		next_wakeup = jiffies-1;
+
+		mutex_lock(&eli->li_list_mtx);
+		if (list_empty(&eli->li_request_list)) {
+			mutex_unlock(&eli->li_list_mtx);
+			goto exit_thread;
+		}
+
+		list_for_each_safe(pos, n, &eli->li_request_list) {
+			elr = list_entry(pos, struct ext4_li_request,
+					 lr_request);
+
+			if (time_before_eq(jiffies, elr->lr_next_sched))
+				continue;
+
+			if ((ret = ext4_run_li_request(elr)) != 0) {
+				ext4_remove_li_request(elr);
+				continue;
+			}
+
+			if (time_before(elr->lr_next_sched, next_wakeup))
+				next_wakeup = elr->lr_next_sched;
+		}
+		mutex_unlock(&eli->li_list_mtx);
+
+		if (freezing(current))
+			refrigerator();
+
+		if (jiffies >= next_wakeup) {
+			cond_resched();
+			continue;
+		}
+
+		eli->li_timer.expires = next_wakeup;
+		add_timer(&eli->li_timer);
+		prepare_to_wait(&eli->li_wait_daemon, &wait,
+				TASK_INTERRUPTIBLE);
+		if (time_before(jiffies, next_wakeup))
+			schedule();
+		finish_wait(&eli->li_wait_daemon, &wait);
+	}
+
+exit_thread:
+	/*
+	 * It looks like the request list is empty, but we need
+	 * to check it under the li_list_mtx lock, to prevent any
+	 * additions into it, and of course we should lock ext4_li_mtx
+	 * to atomically free the list and ext4_li_info, because at
+	 * this point another ext4 filesystem could be registering
+	 * new one.
+	 */
+	mutex_lock(&ext4_li_mtx);
+	mutex_lock(&eli->li_list_mtx);
+	if (!list_empty(&eli->li_request_list)) {
+		mutex_unlock(&eli->li_list_mtx);
+		mutex_unlock(&ext4_li_mtx);
+		goto cont_thread;
+	}
+	mutex_unlock(&eli->li_list_mtx);
+	del_timer_sync(&ext4_li_info->li_timer);
+	eli->li_task = NULL;
+	wake_up(&eli->li_wait_task);
+
+	kfree(ext4_li_info);
+	ext4_li_info = NULL;
+	mutex_unlock(&ext4_li_mtx);
+
+	return 0;
+}
+
+static void ext4_clear_request_list(void)
+{
+	struct list_head *pos, *n;
+	struct ext4_li_request *elr;
+
+	mutex_lock(&ext4_li_info->li_list_mtx);
+	if (list_empty(&ext4_li_info->li_request_list))
+		return;
+
+	list_for_each_safe(pos, n, &ext4_li_info->li_request_list) {
+		elr = list_entry(pos, struct ext4_li_request,
+				 lr_request);
+		ext4_remove_li_request(elr);
+	}
+	mutex_unlock(&ext4_li_info->li_list_mtx);
+}
+
+static int ext4_run_lazyinit_thread(void)
+{
+	struct task_struct *t;
+
+	t = kthread_run(ext4_lazyinit_thread, ext4_li_info, "ext4lazyinit");
+	if (IS_ERR(t)) {
+		int err = PTR_ERR(t);
+		ext4_clear_request_list();
+		del_timer_sync(&ext4_li_info->li_timer);
+		kfree(ext4_li_info);
+		ext4_li_info = NULL;
+		printk(KERN_CRIT "EXT4: error %d creating inode table "
+				 "initialization thread\n",
+				 err);
+		return err;
+	}
+	ext4_li_info->li_state |= EXT4_LAZYINIT_RUNNING;
+
+	wait_event(ext4_li_info->li_wait_task, ext4_li_info->li_task != NULL);
+	return 0;
+}
+
+/*
+ * Check whether it makes sense to run the itable init thread or not.
+ * If there is at least one uninitialized inode table, return
+ * corresponding group number, else the loop goes through all
+ * groups and return total number of groups.
+ */
+static ext4_group_t ext4_has_uninit_itable(struct super_block *sb)
+{
+	ext4_group_t group, ngroups = EXT4_SB(sb)->s_groups_count;
+	struct ext4_group_desc *gdp = NULL;
+
+	for (group = 0; group < ngroups; group++) {
+		gdp = ext4_get_group_desc(sb, group, NULL);
+		if (!gdp)
+			continue;
+
+		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
+			break;
+	}
+
+	return group;
+}
+
+static int ext4_li_info_new(void)
+{
+	struct ext4_lazy_init *eli = NULL;
+
+	eli = kzalloc(sizeof(*eli), GFP_KERNEL);
+	if (!eli)
+		return -ENOMEM;
+
+	eli->li_task = NULL;
+	INIT_LIST_HEAD(&eli->li_request_list);
+	mutex_init(&eli->li_list_mtx);
+
+	init_waitqueue_head(&eli->li_wait_daemon);
+	init_waitqueue_head(&eli->li_wait_task);
+	init_timer(&eli->li_timer);
+	eli->li_state |= EXT4_LAZYINIT_QUIT;
+
+	ext4_li_info = eli;
+
+	return 0;
+}
+
+static struct ext4_li_request *ext4_li_request_new(struct super_block *sb,
+					    ext4_group_t start)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_li_request *elr;
+	unsigned long rnd;
+
+	elr = kzalloc(sizeof(*elr), GFP_KERNEL);
+	if (!elr)
+		return NULL;
+
+	elr->lr_super = sb;
+	elr->lr_sbi = sbi;
+	elr->lr_next_group = start;
+
+	/*
+	 * Randomize first schedule time of the request to
+	 * spread the inode table initialization requests
+	 * better.
+	 */
+	get_random_bytes(&rnd, sizeof(rnd));
+	elr->lr_next_sched = jiffies + (unsigned long)rnd %
+			     (EXT4_DEF_LI_MAX_START_DELAY * HZ);
+
+	return elr;
+}
+
+static int ext4_register_li_request(struct super_block *sb,
+				    ext4_group_t first_not_zeroed)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_li_request *elr;
+	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
+	int ret = 0;
+
+	if (sbi->s_li_request != NULL)
+		goto out;
+
+	if (first_not_zeroed == ngroups ||
+	    (sb->s_flags & MS_RDONLY) ||
+	    !test_opt(sb, INIT_INODE_TABLE)) {
+		sbi->s_li_request = NULL;
+		goto out;
+	}
+
+	if (first_not_zeroed == ngroups) {
+		sbi->s_li_request = NULL;
+		goto out;
+	}
+
+	elr = ext4_li_request_new(sb, first_not_zeroed);
+	if (!elr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	mutex_lock(&ext4_li_mtx);
+
+	if (NULL == ext4_li_info) {
+		ret = ext4_li_info_new();
+		if (ret)
+			goto out;
+	}
+
+	mutex_lock(&ext4_li_info->li_list_mtx);
+	list_add(&elr->lr_request, &ext4_li_info->li_request_list);
+	mutex_unlock(&ext4_li_info->li_list_mtx);
+
+	sbi->s_li_request = elr;
+
+	if (!(ext4_li_info->li_state & EXT4_LAZYINIT_RUNNING)) {
+		ret = ext4_run_lazyinit_thread();
+		if (ret)
+			goto out;
+	}
+
+	mutex_unlock(&ext4_li_mtx);
+
+out:
+	if (ret) {
+		mutex_unlock(&ext4_li_mtx);
+		kfree(elr);
+	}
+	return ret;
+}
+
+/*
+ * We do not need to lock anything since this is called on
+ * module unload.
+ */
+static void ext4_destroy_lazyinit_thread(void)
+{
+	/*
+	 * If thread exited earlier
+	 * there's nothing to be done.
+	 */
+	if (!ext4_li_info)
+		return;
+
+	ext4_clear_request_list();
+
+	while (ext4_li_info->li_task) {
+		wake_up(&ext4_li_info->li_wait_daemon);
+		wait_event(ext4_li_info->li_wait_task,
+			   ext4_li_info->li_task == NULL);
+	}
+}
+
 static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 				__releases(kernel_lock)
 				__acquires(kernel_lock)
@@ -2568,6 +2978,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	__u64 blocks_count;
 	int err;
 	unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
+	ext4_group_t first_not_zeroed;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -2630,6 +3041,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 
 	/* Set defaults before we parse the mount options */
 	def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
+	set_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
 	if (def_mount_opts & EXT4_DEFM_DEBUG)
 		set_opt(sbi->s_mount_opt, DEBUG);
 	if (def_mount_opts & EXT4_DEFM_BSDGROUPS) {
@@ -2909,7 +3321,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 			goto failed_mount2;
 		}
 	}
-	if (!ext4_check_descriptors(sb)) {
+	if (!ext4_check_descriptors(sb, &first_not_zeroed)) {
 		ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
 		goto failed_mount2;
 	}
@@ -3130,6 +3542,10 @@ no_journal:
 		goto failed_mount4;
 	}
 
+	err = ext4_register_li_request(sb, first_not_zeroed);
+	if (err)
+		goto failed_mount4;
+
 	sbi->s_kobj.kset = ext4_kset;
 	init_completion(&sbi->s_kobj_unregister);
 	err = kobject_init_and_add(&sbi->s_kobj, &ext4_ktype, NULL,
@@ -3847,6 +4263,19 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 			enable_quota = 1;
 		}
 	}
+
+	/*
+	 * Reinitialize lazy itable initialization thread based on
+	 * current settings
+	 */
+	if ((sb->s_flags & MS_RDONLY) || !test_opt(sb, INIT_INODE_TABLE))
+		ext4_unregister_li_request(sb);
+	else {
+		ext4_group_t first_not_zeroed;
+		first_not_zeroed = ext4_has_uninit_itable(sb);
+		ext4_register_li_request(sb, first_not_zeroed);
+	}
+
 	ext4_setup_system_zone(sb);
 	if (sbi->s_journal == NULL)
 		ext4_commit_super(sb, 1);
@@ -4317,6 +4746,9 @@ static int __init init_ext4_fs(void)
 	err = register_filesystem(&ext4_fs_type);
 	if (err)
 		goto out;
+
+	ext4_li_info = NULL;
+	mutex_init(&ext4_li_mtx);
 	return 0;
 out:
 	unregister_as_ext2();
@@ -4336,6 +4768,7 @@ out4:
 
 static void __exit exit_ext4_fs(void)
 {
+	ext4_destroy_lazyinit_thread();
 	unregister_as_ext2();
 	unregister_as_ext3();
 	unregister_filesystem(&ext4_fs_type);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6 v4] Lazy itable initialization for Ext4
  2010-10-04  2:36           ` Ted Ts'o
@ 2010-10-04  7:31             ` Ted Ts'o
  2010-10-04 13:14             ` Lukas Czerner
  1 sibling, 0 replies; 22+ messages in thread
From: Ted Ts'o @ 2010-10-04  7:31 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, rwheeler, sandeen, adilger, snitzer

On Sun, Oct 03, 2010 at 10:36:49PM -0400, Ted Ts'o wrote:
> +
> +		if (jiffies >= next_wakeup) {
> +			cond_resched();
> +			continue;
> +		}

This should be a time_after_eq(jiffies, next_wakeup), of course...

					- Ted


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6 v4] Lazy itable initialization for Ext4
  2010-10-04  2:36           ` Ted Ts'o
  2010-10-04  7:31             ` Ted Ts'o
@ 2010-10-04 13:14             ` Lukas Czerner
  2010-10-04 13:19               ` Lukas Czerner
  1 sibling, 1 reply; 22+ messages in thread
From: Lukas Czerner @ 2010-10-04 13:14 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Lukas Czerner, linux-ext4, rwheeler, sandeen, adilger, snitzer

Hi Ted,

first of all, thank you very much for tracking down those issues and for
all the improvements you have made to this. Now, I have some questions
about the changes you have introduced with this patch.

On Sun, 3 Oct 2010, Ted Ts'o wrote:

> I've made some more changes.  This version updates the timing control.
> The major changes are:
> 
> 1) Time how long it takes to clear the inode table with a barrier (once),
> and then use it for the rest of the block groups for that file system.

So if I understand this right, it means that we measure the time it takes
to zero out an inode table just once (setting lr_timeout) and then we use
this value for all the following zeroouts.

Initially I did this "measuring time" thing to adaptively balance
the load it generates and thus not disturb other ongoing I/O very
much. So this change does not really make sense to me, because when we
measure the time right after the mount (just once) and the system is
relatively idle, we end up with a rather small lr_timeout, and then, when
the system is under heavy load, it will keep the same zeroout rate as when
the system was idle - resulting in much more impact on performance than my
previous solution.

Conversely, when the system is under heavy load at the time the
filesystem with the init_itable option is mounted, the zeroing will
proceed very slowly even if the system is relatively idle later on.
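
The difference between the two policies can be sketched in userspace C.
This is only a model under stated assumptions: struct li_sim and both
functions are hypothetical names made up for this example, "elapsed"
stands for the measured duration of one zeroout in ticks, and wait_mult
plays the role of s_li_wait_mult.

```c
/* Hypothetical per-request state; not the kernel's ext4_li_request. */
struct li_sim {
	unsigned long timeout;   /* delay in ticks between two zeroouts */
	unsigned int wait_mult;  /* multiplier, like init_itable=n */
};

/*
 * Measure-once policy, modelled on the patch above: the delay is
 * computed only while timeout is still 0 and is reused afterwards.
 */
static unsigned long delay_measure_once(struct li_sim *s,
					unsigned long elapsed)
{
	if (s->timeout == 0)
		s->timeout = elapsed * s->wait_mult;
	return s->timeout;
}

/*
 * Adaptive policy described in this reply: recompute the delay from
 * every zeroout, so it follows the current I/O load.
 */
static unsigned long delay_measure_each(struct li_sim *s,
					unsigned long elapsed)
{
	s->timeout = elapsed * s->wait_mult;
	return s->timeout;
}
```

For example, with a multiplier of 10, a first zeroout of 2 ticks and a
later one of 50 ticks under load, the measure-once policy keeps waiting
20 ticks, while the adaptive policy backs off to 500 ticks.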

> 
> 2) s_li_wait_mult wasn't getting defaulted, so we weren't waiting any
> time at all between sb_issue_zeroout calls.

Actually it is getting defaulted:

+		case Opt_init_inode_table:
+			set_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
+			if (args[0].from) {
+				if (match_int(&args[0], &option))
+					return 0;
+			} else
+				option = EXT4_DEF_LI_WAIT_MULT;
+			if (option < 0)
+				return 0;
+			sbi->s_li_wait_mult = option;
+			break;

EXT4_DEF_LI_WAIT_MULT is the default value for s_li_wait_mult.
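
A reduced userspace model may help compare the two readings. It rests on
two assumptions taken from the patch as posted: ext4_fill_super() only
sets the INIT_INODE_TABLE flag by default (sbi comes from kzalloc(), so
s_li_wait_mult starts at 0), and the case above only runs when the option
string actually contains init_itable. All names below (struct sb_sim,
sim_set_defaults(), sim_parse(), sim_mount()) are hypothetical.

```c
#include <string.h>

#define DEF_LI_WAIT_MULT 10	/* mirrors EXT4_DEF_LI_WAIT_MULT */

/* Hypothetical, reduced stand-in for the relevant ext4_sb_info fields. */
struct sb_sim {
	int init_itable;	/* models the INIT_INODE_TABLE flag */
	unsigned int wait_mult;	/* models s_li_wait_mult */
};

/* Models the defaults: zeroed structure, flag set, multiplier untouched. */
static void sim_set_defaults(struct sb_sim *s)
{
	memset(s, 0, sizeof(*s));
	s->init_itable = 1;
}

/* Models only the bare "init_itable" token (no "=n" argument). */
static void sim_parse(struct sb_sim *s, const char *opts)
{
	if (opts && strcmp(opts, "init_itable") == 0) {
		s->init_itable = 1;
		s->wait_mult = DEF_LI_WAIT_MULT;
	}
}

/* Returns the multiplier the lazyinit thread would see after mount. */
static unsigned int sim_mount(const char *opts)
{
	struct sb_sim s;

	sim_set_defaults(&s);
	sim_parse(&s, opts);
	return s.wait_mult;
}
```

In this model a mount with "init_itable" yields a multiplier of 10, while
a mount without the option leaves it at 0 even though the flag is set,
which is the situation the "timeout *= 20" fallback in
ext4_run_li_request() appears to cover.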


Some comments in the code below...

> 
> 3) Fix the timer arithmetic so it works across jiffies rollover.
> (This means using time_before() instead of <)
> 
> 						- Ted
> 
> From 87fe012bfa04e1ac95a4a96f90b70c2a0983e228 Mon Sep 17 00:00:00 2001
> From: Lukas Czerner <lczerner@redhat.com>
> Date: Sun, 3 Oct 2010 22:31:15 -0400
> Subject: [PATCH] ext4: Add support for lazy inode table initialization
> 
> When the lazy_itable_init extended option is passed to mke2fs, it
> considerably speeds up filesystem creation because inode tables are
> not zeroed out.  The fact that parts of the inode table are
> uninitialized is not a problem so long as the block group descriptors,
> which contain information regarding how much of the inode table has
> been initialized, have not been corrupted.  However, if the block group
> checksums are not valid, e2fsck must scan the entire inode table, and
> the old, uninitialized data could potentially cause e2fsck to
> report false problems.
> 
> Hence, it is important for the inode tables to be initialized as soon
> as possible.  This commit adds this feature so that mke2fs can safely
> use the lazy inode table initialization feature to speed up formatting
> file systems.
> 
> This is done via a new kernel thread called ext4lazyinit, which is
> created on demand and destroyed when it is no longer needed.  There
> is only one thread for all ext4 filesystems in the system. When the
> first filesystem with the init_itable mount option is mounted, the
> ext4lazyinit thread is created, then the filesystem can register its
> request in the request list.
> 
> This thread then walks through the list of requests picking up
> scheduled requests and invoking ext4_init_inode_table(). Next schedule
> time for the request is computed by multiplying the time it took to
> zero out last inode table with wait multiplier, which can be set with
> the (init_itable=n) mount option (default is 10).  We are doing
> this so we do not take the whole I/O bandwidth. When the thread is no
> longer necessary (request list is empty) it frees the appropriate
> structures and exits (and can be created later later by another
> filesystem).
> 
> We do not disturb regular inode allocations in any way; they just do not
> care whether the inode table is zeroed or not. But when zeroing, we
> have to skip used inodes, obviously. Also we should prevent new inode
> allocations from the group, while zeroing is on the way. For that we
> take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
> in the ext4_claim_inode, so when we are unlucky and allocator hits the
> group which is currently being zeroed, it just has to wait.
> 
> This can be suppressed using the mount option noinit_itable.
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  Documentation/filesystems/ext4.txt |   14 ++
>  fs/ext4/ext4.h                     |   40 ++++
>  fs/ext4/ialloc.c                   |  120 ++++++++++
>  fs/ext4/super.c                    |  439 +++++++++++++++++++++++++++++++++++-
>  4 files changed, 610 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
> index e1def17..6ab9442 100644
> --- a/Documentation/filesystems/ext4.txt
> +++ b/Documentation/filesystems/ext4.txt
> @@ -353,6 +353,20 @@ noauto_da_alloc		replacing existing files via patterns such as
>  			system crashes before the delayed allocation
>  			blocks are forced to disk.
>  
> +noinit_itable		Do not initialize any uninitialized inode table
> +			blocks in the background.  This feature may be
> +			used by installation CD's so that the install
> +			process can complete as quickly as possible; the
> +			inode table initialization process would then be
> +			deferred until the next time the  file system
> +			is unmounted.
> +
> +init_itable=n		The lazy itable init code will wait n times the
> +			number of milliseconds it took to zero out the
> +			previous block group's inode table.  This
> +			minimizes the impact on the system performance
> +			while file system's inode table is being initialized.
> +
>  discard		Controls whether ext4 should issue discard/TRIM
>  nodiscard(*)		commands to the underlying block device when
>  			blocks are freed.  This is useful for SSD devices
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index b364b9d..0fe078d 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -890,6 +890,7 @@ struct ext4_inode_info {
>  #define EXT4_MOUNT_DATA_ERR_ABORT	0x10000000 /* Abort on file data write */
>  #define EXT4_MOUNT_BLOCK_VALIDITY	0x20000000 /* Block validity checking */
>  #define EXT4_MOUNT_DISCARD		0x40000000 /* Issue DISCARD requests */
> +#define EXT4_MOUNT_INIT_INODE_TABLE	0x80000000 /* Initialize uninitialized itables */
>  
>  #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt
>  #define set_opt(o, opt)			o |= EXT4_MOUNT_##opt
> @@ -1173,6 +1174,11 @@ struct ext4_sb_info {
>  
>  	/* timer for periodic error stats printing */
>  	struct timer_list s_err_report;
> +
> +	/* Lazy inode table initialization info */
> +	struct ext4_li_request *s_li_request;
> +	/* Wait multiplier for lazy initialization thread */
> +	unsigned int s_li_wait_mult;
>  };
>  
>  static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> @@ -1537,6 +1543,38 @@ void ext4_get_group_no_and_offset(struct super_block *sb, ext4_fsblk_t blocknr,
>  extern struct proc_dir_entry *ext4_proc_root;
>  
>  /*
> + * Timeout and state flag for lazy initialization inode thread.
> + */
> +#define EXT4_DEF_LI_WAIT_MULT			10
> +#define EXT4_DEF_LI_MAX_START_DELAY		5
> +#define EXT4_LAZYINIT_QUIT			0x0001
> +#define EXT4_LAZYINIT_RUNNING			0x0002
> +
> +/*
> + * Lazy inode table initialization info
> + */
> +struct ext4_lazy_init {
> +	unsigned long		li_state;
> +
> +	wait_queue_head_t	li_wait_daemon;
> +	wait_queue_head_t	li_wait_task;
> +	struct timer_list	li_timer;
> +	struct task_struct	*li_task;
> +
> +	struct list_head	li_request_list;
> +	struct mutex		li_list_mtx;
> +};
> +
> +struct ext4_li_request {
> +	struct super_block	*lr_super;
> +	struct ext4_sb_info	*lr_sbi;
> +	ext4_group_t		lr_next_group;
> +	struct list_head	lr_request;
> +	unsigned long		lr_next_sched;
> +	unsigned long		lr_timeout;
> +};
> +
> +/*
>   * Function prototypes
>   */
>  
> @@ -1611,6 +1649,8 @@ extern unsigned ext4_init_inode_bitmap(struct super_block *sb,
>  				       ext4_group_t group,
>  				       struct ext4_group_desc *desc);
>  extern void mark_bitmap_end(int start_bit, int end_bit, char *bitmap);
> +extern int ext4_init_inode_table(struct super_block *sb,
> +				 ext4_group_t group, int barrier);
>  
>  /* mballoc.c */
>  extern long ext4_mb_stats;
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 45853e0..e428f23 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -107,6 +107,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
>  	desc = ext4_get_group_desc(sb, block_group, NULL);
>  	if (!desc)
>  		return NULL;
> +
>  	bitmap_blk = ext4_inode_bitmap(sb, desc);
>  	bh = sb_getblk(sb, bitmap_blk);
>  	if (unlikely(!bh)) {
> @@ -123,6 +124,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
>  		unlock_buffer(bh);
>  		return bh;
>  	}
> +
>  	ext4_lock_group(sb, block_group);
>  	if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
>  		ext4_init_inode_bitmap(sb, bh, block_group, desc);
> @@ -133,6 +135,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
>  		return bh;
>  	}
>  	ext4_unlock_group(sb, block_group);
> +
>  	if (buffer_uptodate(bh)) {
>  		/*
>  		 * if not uninit if bh is uptodate,
> @@ -712,8 +715,17 @@ static int ext4_claim_inode(struct super_block *sb,
>  {
>  	int free = 0, retval = 0, count;
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
>  	struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group, NULL);
>  
> +	/*
> +	 * We have to be sure that new inode allocation does not race with
> +	 * inode table initialization, because otherwise we may end up
> +	 * allocating and writing new inode right before sb_issue_zeroout
> +	 * takes place and overwriting our new inode with zeroes. So we
> +	 * take alloc_sem to prevent it.
> +	 */
> +	down_read(&grp->alloc_sem);
>  	ext4_lock_group(sb, group);
>  	if (ext4_set_bit(ino, inode_bitmap_bh->b_data)) {
>  		/* not a free inode */
> @@ -724,6 +736,7 @@ static int ext4_claim_inode(struct super_block *sb,
>  	if ((group == 0 && ino < EXT4_FIRST_INO(sb)) ||
>  			ino > EXT4_INODES_PER_GROUP(sb)) {
>  		ext4_unlock_group(sb, group);
> +		up_read(&grp->alloc_sem);
>  		ext4_error(sb, "reserved inode or inode > inodes count - "
>  			   "block_group = %u, inode=%lu", group,
>  			   ino + group * EXT4_INODES_PER_GROUP(sb));
> @@ -772,6 +785,7 @@ static int ext4_claim_inode(struct super_block *sb,
>  	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
>  err_ret:
>  	ext4_unlock_group(sb, group);
> +	up_read(&grp->alloc_sem);
>  	return retval;
>  }
>  
> @@ -1205,3 +1219,109 @@ unsigned long ext4_count_dirs(struct super_block * sb)
>  	}
>  	return count;
>  }
> +
> +/*
> + * Zeroes not yet zeroed inode table - just write zeroes through the whole
> + * inode table. Must be called without any spinlock held. The only place
> + * where it is called from on active part of filesystem is ext4lazyinit
> + * thread, so we do not need any special locks, however we have to prevent
> + * inode allocation from the current group, so we take alloc_sem lock, to
> + * block ext4_claim_inode until we are finished.
> + */
> +extern int ext4_init_inode_table(struct super_block *sb, ext4_group_t group,
> +				 int barrier)
> +{
> +	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_group_desc *gdp = NULL;
> +	struct buffer_head *group_desc_bh;
> +	handle_t *handle;
> +	ext4_fsblk_t blk;
> +	int num, ret = 0, used_blks = 0;
> +	unsigned long flags = BLKDEV_IFL_WAIT;
> +
> +	/* This should not happen, but just to be sure check this */
> +	if (sb->s_flags & MS_RDONLY) {
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
> +	if (!gdp)
> +		goto out;
> +
> +	/*
> +	 * We do not need to lock this, because we are the only one
> +	 * handling this flag.
> +	 */
> +	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED))
> +		goto out;
> +
> +	handle = ext4_journal_start_sb(sb, 1);
> +	if (IS_ERR(handle)) {
> +		ret = PTR_ERR(handle);
> +		goto out;
> +	}
> +
> +	down_write(&grp->alloc_sem);
> +	/*
> +	 * If inode bitmap was already initialized there may be some
> +	 * used inodes so we need to skip blocks with used inodes in
> +	 * inode table.
> +	 */
> +	if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)))
> +		used_blks = DIV_ROUND_UP((EXT4_INODES_PER_GROUP(sb) -
> +			    ext4_itable_unused_count(sb, gdp)),
> +			    sbi->s_inodes_per_block);
> +
> +	blk = ext4_inode_table(sb, gdp) + used_blks;
> +	num = sbi->s_itb_per_group - used_blks;
> +
> +	BUFFER_TRACE(group_desc_bh, "get_write_access");
> +	ret = ext4_journal_get_write_access(handle,
> +					    group_desc_bh);
> +	if (ret)
> +		goto err_out;
> +
> +	if (unlikely(num > EXT4_INODES_PER_GROUP(sb))) {
> +		ext4_error(sb, "Something is wrong with group %u\n"
> +			   "Used itable blocks: %d"
> +			   "Itable blocks per group: %lu\n",
> +			   group, used_blks, sbi->s_itb_per_group);
> +		ret = 1;
> +		goto err_out;
> +	}
> +
> +	/*
> +	 * Skip zeroout if the inode table is full. But we set the ZEROED
> +	 * flag anyway, because obviously, when it is full it does not need
> +	 * further zeroing.
> +	 */
> +	if (unlikely(num == 0))
> +		goto skip_zeroout;
> +
> +	ext4_debug("going to zero out inode table in group %d\n",
> +		   group);
> +	if (barrier)
> +		flags |= BLKDEV_IFL_BARRIER;
> +	ret = sb_issue_zeroout(sb, blk, num, GFP_NOFS, flags);
> +	if (ret < 0)
> +		goto err_out;
> +
> +skip_zeroout:
> +	ext4_lock_group(sb, group);
> +	gdp->bg_flags |= cpu_to_le16(EXT4_BG_INODE_ZEROED);
> +	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
> +	ext4_unlock_group(sb, group);
> +
> +	BUFFER_TRACE(group_desc_bh,
> +		     "call ext4_handle_dirty_metadata");
> +	ret = ext4_handle_dirty_metadata(handle, NULL,
> +					 group_desc_bh);
> +
> +err_out:
> +	up_write(&grp->alloc_sem);
> +	ext4_journal_stop(handle);
> +out:
> +	return ret;
> +}
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 751997d..c4b9984 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -41,6 +41,9 @@
>  #include <linux/crc16.h>
>  #include <asm/uaccess.h>
>  
> +#include <linux/kthread.h>
> +#include <linux/freezer.h>
> +
>  #include "ext4.h"
>  #include "ext4_jbd2.h"
>  #include "xattr.h"
> @@ -52,6 +55,8 @@
>  
>  struct proc_dir_entry *ext4_proc_root;
>  static struct kset *ext4_kset;
> +struct ext4_lazy_init *ext4_li_info;
> +struct mutex ext4_li_mtx;
>  
>  static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
>  			     unsigned long journal_devnum);
> @@ -70,6 +75,8 @@ static void ext4_write_super(struct super_block *sb);
>  static int ext4_freeze(struct super_block *sb);
>  static int ext4_get_sb(struct file_system_type *fs_type, int flags,
>  		       const char *dev_name, void *data, struct vfsmount *mnt);
> +static void ext4_destroy_lazyinit_thread(void);
> +static void ext4_unregister_li_request(struct super_block *sb);
>  
>  #if !defined(CONFIG_EXT3_FS) && !defined(CONFIG_EXT3_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT23)
>  static struct file_system_type ext3_fs_type = {
> @@ -720,6 +727,7 @@ static void ext4_put_super(struct super_block *sb)
>  	}
>  
>  	del_timer(&sbi->s_err_report);
> +	ext4_unregister_li_request(sb);
>  	ext4_release_system_zone(sb);
>  	ext4_mb_release(sb);
>  	ext4_ext_release(sb);
> @@ -1046,6 +1054,12 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
>  	    !(def_mount_opts & EXT4_DEFM_BLOCK_VALIDITY))
>  		seq_puts(seq, ",block_validity");
>  
> +	if (!test_opt(sb, INIT_INODE_TABLE))
> +		seq_puts(seq, ",noinit_inode_table");
> +	else if (sbi->s_li_wait_mult)
> +		seq_printf(seq, ",init_inode_table=%u",
> +			   (unsigned) sbi->s_li_wait_mult);
> +
>  	ext4_show_quota_options(seq, sb);
>  
>  	return 0;
> @@ -1220,6 +1234,7 @@ enum {
>  	Opt_inode_readahead_blks, Opt_journal_ioprio,
>  	Opt_dioread_nolock, Opt_dioread_lock,
>  	Opt_discard, Opt_nodiscard,
> +	Opt_init_inode_table, Opt_noinit_inode_table,
>  };
>  
>  static const match_table_t tokens = {
> @@ -1290,6 +1305,9 @@ static const match_table_t tokens = {
>  	{Opt_dioread_lock, "dioread_lock"},
>  	{Opt_discard, "discard"},
>  	{Opt_nodiscard, "nodiscard"},
> +	{Opt_init_inode_table, "init_itable=%u"},
> +	{Opt_init_inode_table, "init_itable"},
> +	{Opt_noinit_inode_table, "noinit_itable"},
>  	{Opt_err, NULL},
>  };
>  
> @@ -1760,6 +1778,20 @@ set_qf_format:
>  		case Opt_dioread_lock:
>  			clear_opt(sbi->s_mount_opt, DIOREAD_NOLOCK);
>  			break;
> +		case Opt_init_inode_table:
> +			set_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
> +			if (args[0].from) {
> +				if (match_int(&args[0], &option))
> +					return 0;
> +			} else
> +				option = EXT4_DEF_LI_WAIT_MULT;
> +			if (option < 0)
> +				return 0;
> +			sbi->s_li_wait_mult = option;
> +			break;
> +		case Opt_noinit_inode_table:
> +			clear_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
> +			break;
>  		default:
>  			ext4_msg(sb, KERN_ERR,
>  			       "Unrecognized mount option \"%s\" "
> @@ -1943,7 +1975,8 @@ int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 block_group,
>  }
>  
>  /* Called at mount-time, super-block is locked */
> -static int ext4_check_descriptors(struct super_block *sb)
> +static int ext4_check_descriptors(struct super_block *sb,
> +				  ext4_group_t *first_not_zeroed)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
>  	ext4_fsblk_t first_block = le32_to_cpu(sbi->s_es->s_first_data_block);
> @@ -1952,7 +1985,7 @@ static int ext4_check_descriptors(struct super_block *sb)
>  	ext4_fsblk_t inode_bitmap;
>  	ext4_fsblk_t inode_table;
>  	int flexbg_flag = 0;
> -	ext4_group_t i;
> +	ext4_group_t i, grp = sbi->s_groups_count;
>  
>  	if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG))
>  		flexbg_flag = 1;
> @@ -1968,6 +2001,10 @@ static int ext4_check_descriptors(struct super_block *sb)
>  			last_block = first_block +
>  				(EXT4_BLOCKS_PER_GROUP(sb) - 1);
>  
> +		if ((grp == sbi->s_groups_count) &&
> +		   !(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
> +			grp = i;
> +
>  		block_bitmap = ext4_block_bitmap(sb, gdp);
>  		if (block_bitmap < first_block || block_bitmap > last_block) {
>  			ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
> @@ -2005,6 +2042,8 @@ static int ext4_check_descriptors(struct super_block *sb)
>  		if (!flexbg_flag)
>  			first_block += EXT4_BLOCKS_PER_GROUP(sb);
>  	}
> +	if (NULL != first_not_zeroed)
> +		*first_not_zeroed = grp;
>  
>  	ext4_free_blocks_count_set(sbi->s_es, ext4_count_free_blocks(sb));
>  	sbi->s_es->s_free_inodes_count =cpu_to_le32(ext4_count_free_inodes(sb));
> @@ -2543,6 +2582,377 @@ static void print_daily_error_info(unsigned long arg)
>  	mod_timer(&sbi->s_err_report, jiffies + 24*60*60*HZ);  /* Once a day */
>  }
>  
> +static void ext4_lazyinode_timeout(unsigned long data)
> +{
> +	struct task_struct *p = (struct task_struct *)data;
> +	wake_up_process(p);
> +}
> +
> +/* Find next suitable group and run ext4_init_inode_table */
> +static int ext4_run_li_request(struct ext4_li_request *elr)
> +{
> +	struct ext4_group_desc *gdp = NULL;
> +	ext4_group_t group, ngroups;
> +	struct super_block *sb;
> +	unsigned long timeout = 0;
> +	int ret = 0;
> +
> +	sb = elr->lr_super;
> +	ngroups = EXT4_SB(sb)->s_groups_count;
> +
> +	for (group = elr->lr_next_group; group < ngroups; group++) {
> +		gdp = ext4_get_group_desc(sb, group, NULL);
> +		if (!gdp) {
> +			ret = 1;
> +			break;
> +		}
> +
> +		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
> +			break;
> +	}
> +
> +	if (group == ngroups)
> +		ret = 1;
> +
> +	if (!ret) {
> +		timeout = jiffies;
> +		ret = ext4_init_inode_table(sb, group,
> +					    elr->lr_timeout ? 0 : 1);
> +		if (elr->lr_timeout == 0) {
> +			timeout = jiffies - timeout;
> +			if (elr->lr_sbi->s_li_wait_mult)
> +				timeout *= elr->lr_sbi->s_li_wait_mult;
> +			else
> +				timeout *= 20;
> +			elr->lr_timeout = timeout;
> +		}
> +		elr->lr_next_sched = jiffies + elr->lr_timeout;
> +		elr->lr_next_group = group + 1;
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Remove lr_request from the list_request and free the
> + * request structure. Should be called with li_list_mtx held
> + */
> +static void ext4_remove_li_request(struct ext4_li_request *elr)
> +{
> +	struct ext4_sb_info *sbi;
> +
> +	if (!elr)
> +		return;
> +
> +	sbi = elr->lr_sbi;
> +
> +	list_del(&elr->lr_request);
> +	sbi->s_li_request = NULL;
> +	kfree(elr);
> +}
> +
> +static void ext4_unregister_li_request(struct super_block *sb)
> +{
> +	struct ext4_li_request *elr = EXT4_SB(sb)->s_li_request;
> +
> +	if (!ext4_li_info)
> +		return;
> +
> +	mutex_lock(&ext4_li_info->li_list_mtx);
> +	ext4_remove_li_request(elr);
> +	mutex_unlock(&ext4_li_info->li_list_mtx);
> +}
> +
> +/*
> + * This is the function where ext4lazyinit thread lives. It walks
> + * through the request list searching for next scheduled filesystem.
> + * When such a fs is found, run the lazy initialization request
> + * (ext4_run_li_request) and keep track of the time spent in this
> + * function. Based on that time we compute next schedule time of
> + * the request. When walking through the list is complete, compute
> + * next waking time and put itself into sleep.
> + */
> +static int ext4_lazyinit_thread(void *arg)
> +{
> +	struct ext4_lazy_init *eli = (struct ext4_lazy_init *)arg;
> +	struct list_head *pos, *n;
> +	struct ext4_li_request *elr;
> +	unsigned long next_wakeup;
> +	DEFINE_WAIT(wait);
> +	int ret;
> +
> +	BUG_ON(NULL == eli);
> +
> +	eli->li_timer.data = (unsigned long)current;
> +	eli->li_timer.function = ext4_lazyinode_timeout;
> +
> +	eli->li_task = current;
> +	wake_up(&eli->li_wait_task);
> +
> +cont_thread:
> +	while (true) {
> +		next_wakeup = jiffies-1;
> +
> +		mutex_lock(&eli->li_list_mtx);
> +		if (list_empty(&eli->li_request_list)) {
> +			mutex_unlock(&eli->li_list_mtx);
> +			goto exit_thread;
> +		}
> +
> +		list_for_each_safe(pos, n, &eli->li_request_list) {
> +			elr = list_entry(pos, struct ext4_li_request,
> +					 lr_request);
> +
> +			if (time_before_eq(jiffies, elr->lr_next_sched))
> +				continue;
> +
> +			if ((ret = ext4_run_li_request(elr)) != 0) {
> +				ext4_remove_li_request(elr);
> +				continue;
> +			}
> +
> +			if (time_before(elr->lr_next_sched, next_wakeup))
> +				next_wakeup = elr->lr_next_sched;
> +		}
> +		mutex_unlock(&eli->li_list_mtx);
> +
> +		if (freezing(current))
> +			refrigerator();
> +
> +		if (jiffies >= next_wakeup) {
> +			cond_resched();
> +			continue;
> +		}
Since we are using these time functions (with really confusing names), I
think we can use one here as well.

		if (time_after_eq(jiffies, next_wakeup)) {
			cond_resched();
			continue;
		}
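
For reference, a small userspace sketch of why the plain >= comparison
breaks across jiffies rollover. The sketch_* helpers below are stand-ins
written for this example, using the same signed-difference idea as the
kernel's time_after_eq() and time_before(); they are not the kernel
macros themselves, and they assume the usual two's complement conversion
from unsigned long to long:

```c
#include <assert.h>
#include <limits.h>

/* True if a is at or after b, even when the counter has wrapped. */
static int sketch_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* True if a is strictly before b, even when the counter has wrapped. */
static int sketch_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Walk through one rollover case. */
static void sketch_rollover_check(void)
{
	unsigned long now = ULONG_MAX - 5;	/* just before rollover */
	unsigned long wake = now + 10;		/* wraps around to 4 */

	/* The naive test claims the wakeup time has already passed. */
	assert(now >= wake);

	/* The wrap-safe tests see that it is still 10 ticks away. */
	assert(!sketch_time_after_eq(now, wake));
	assert(sketch_time_before(now, wake));
}
```

So with the plain comparison the thread would skip the sleep and spin
through cond_resched() for as long as the wakeup time sits on the other
side of the rollover.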


> +
> +		eli->li_timer.expires = next_wakeup;
> +		add_timer(&eli->li_timer);
> +		prepare_to_wait(&eli->li_wait_daemon, &wait,
> +				TASK_INTERRUPTIBLE);
> +		if (time_before(jiffies, next_wakeup))
> +			schedule();
> +		finish_wait(&eli->li_wait_daemon, &wait);
> +	}
> +
> +exit_thread:
> +	/*
> +	 * It looks like the request list is empty, but we need
> +	 * to check it under the li_list_mtx lock, to prevent any
> +	 * additions into it, and of course we should lock ext4_li_mtx
> +	 * to atomically free the list and ext4_li_info, because at
> +	 * this point another ext4 filesystem could be registering
> +	 * new one.
> +	 */
> +	mutex_lock(&ext4_li_mtx);
> +	mutex_lock(&eli->li_list_mtx);
> +	if (!list_empty(&eli->li_request_list)) {
> +		mutex_unlock(&eli->li_list_mtx);
> +		mutex_unlock(&ext4_li_mtx);
> +		goto cont_thread;
> +	}
> +	mutex_unlock(&eli->li_list_mtx);
> +	del_timer_sync(&ext4_li_info->li_timer);
> +	eli->li_task = NULL;
> +	wake_up(&eli->li_wait_task);
> +
> +	kfree(ext4_li_info);
> +	ext4_li_info = NULL;
> +	mutex_unlock(&ext4_li_mtx);
> +
> +	return 0;
> +}
> +
> +static void ext4_clear_request_list(void)
> +{
> +	struct list_head *pos, *n;
> +	struct ext4_li_request *elr;
> +
> +	mutex_lock(&ext4_li_info->li_list_mtx);
> +	if (list_empty(&ext4_li_info->li_request_list))
> +		return;
> +
> +	list_for_each_safe(pos, n, &ext4_li_info->li_request_list) {
> +		elr = list_entry(pos, struct ext4_li_request,
> +				 lr_request);
> +		ext4_remove_li_request(elr);
> +	}
> +	mutex_unlock(&ext4_li_info->li_list_mtx);
> +}
> +
> +static int ext4_run_lazyinit_thread(void)
> +{
> +	struct task_struct *t;
> +
> +	t = kthread_run(ext4_lazyinit_thread, ext4_li_info, "ext4lazyinit");
> +	if (IS_ERR(t)) {
> +		int err = PTR_ERR(t);
> +		ext4_clear_request_list();
> +		del_timer_sync(&ext4_li_info->li_timer);
> +		kfree(ext4_li_info);
> +		ext4_li_info = NULL;
> +		printk(KERN_CRIT "EXT4: error %d creating inode table "
> +				 "initialization thread\n",
> +				 err);
> +		return err;
> +	}
> +	ext4_li_info->li_state |= EXT4_LAZYINIT_RUNNING;
> +
> +	wait_event(ext4_li_info->li_wait_task, ext4_li_info->li_task != NULL);
> +	return 0;
> +}
> +
> +/*
> + * Check whether it makes sense to run the itable init thread or not.
> + * If there is at least one uninitialized inode table, return the
> + * corresponding group number, else the loop goes through all
> + * groups and returns the total number of groups.
> + */
> +static ext4_group_t ext4_has_uninit_itable(struct super_block *sb)
> +{
> +	ext4_group_t group, ngroups = EXT4_SB(sb)->s_groups_count;
> +	struct ext4_group_desc *gdp = NULL;
> +
> +	for (group = 0; group < ngroups; group++) {
> +		gdp = ext4_get_group_desc(sb, group, NULL);
> +		if (!gdp)
> +			continue;
> +
> +		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
> +			break;
> +	}
> +
> +	return group;
> +}
> +
> +static int ext4_li_info_new(void)
> +{
> +	struct ext4_lazy_init *eli = NULL;
> +
> +	eli = kzalloc(sizeof(*eli), GFP_KERNEL);
> +	if (!eli)
> +		return -ENOMEM;
> +
> +	eli->li_task = NULL;
> +	INIT_LIST_HEAD(&eli->li_request_list);
> +	mutex_init(&eli->li_list_mtx);
> +
> +	init_waitqueue_head(&eli->li_wait_daemon);
> +	init_waitqueue_head(&eli->li_wait_task);
> +	init_timer(&eli->li_timer);
> +	eli->li_state |= EXT4_LAZYINIT_QUIT;
> +
> +	ext4_li_info = eli;
> +
> +	return 0;
> +}
> +
> +static struct ext4_li_request *ext4_li_request_new(struct super_block *sb,
> +					    ext4_group_t start)
> +{
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_li_request *elr;
> +	unsigned long rnd;
> +
> +	elr = kzalloc(sizeof(*elr), GFP_KERNEL);
> +	if (!elr)
> +		return NULL;
> +
> +	elr->lr_super = sb;
> +	elr->lr_sbi = sbi;
> +	elr->lr_next_group = start;
> +
> +	/*
> +	 * Randomize first schedule time of the request to
> +	 * spread the inode table initialization requests
> +	 * better.
> +	 */
> +	get_random_bytes(&rnd, sizeof(rnd));
> +	elr->lr_next_sched = jiffies + (unsigned long)rnd %
> +			     (EXT4_DEF_LI_MAX_START_DELAY * HZ);
> +
> +	return elr;
> +}
> +
> +static int ext4_register_li_request(struct super_block *sb,
> +				    ext4_group_t first_not_zeroed)
> +{
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_li_request *elr;
> +	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
> +	int ret = 0;
> +
> +	if (sbi->s_li_request != NULL)
> +		goto out;
> +
> +	if (first_not_zeroed == ngroups ||
> +	    (sb->s_flags & MS_RDONLY) ||
> +	    !test_opt(sb, INIT_INODE_TABLE)) {
> +		sbi->s_li_request = NULL;
> +		goto out;
> +	}
> +
> +	if (first_not_zeroed == ngroups) {
> +		sbi->s_li_request = NULL;
> +		goto out;
> +	}
I do not know why I did this, but apparently we do not need to test
first_not_zeroed again since we just did.


> +
> +	elr = ext4_li_request_new(sb, first_not_zeroed);
> +	if (!elr) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	mutex_lock(&ext4_li_mtx);
> +
> +	if (NULL == ext4_li_info) {
> +		ret = ext4_li_info_new();
> +		if (ret)
> +			goto out;
> +	}
> +
> +	mutex_lock(&ext4_li_info->li_list_mtx);
> +	list_add(&elr->lr_request, &ext4_li_info->li_request_list);
> +	mutex_unlock(&ext4_li_info->li_list_mtx);
> +
> +	sbi->s_li_request = elr;
> +
> +	if (!(ext4_li_info->li_state & EXT4_LAZYINIT_RUNNING)) {
> +		ret = ext4_run_lazyinit_thread();
> +		if (ret)
> +			goto out;
> +	}
> +
> +	mutex_unlock(&ext4_li_mtx);
> +
> +out:
> +	if (ret) {
> +		mutex_unlock(&ext4_li_mtx);
> +		kfree(elr);
> +	}
> +	return ret;
> +}
> +
> +/*
> + * We do not need to lock anything since this is called on
> + * module unload.
> + */
> +static void ext4_destroy_lazyinit_thread(void)
> +{
> +	/*
> +	 * If thread exited earlier
> +	 * there's nothing to be done.
> +	 */
> +	if (!ext4_li_info)
> +		return;
> +
> +	ext4_clear_request_list();
> +
> +	while (ext4_li_info->li_task) {
> +		wake_up(&ext4_li_info->li_wait_daemon);
> +		wait_event(ext4_li_info->li_wait_task,
> +			   ext4_li_info->li_task == NULL);
> +	}
> +}
> +
>  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  				__releases(kernel_lock)
>  				__acquires(kernel_lock)
> @@ -2568,6 +2978,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  	__u64 blocks_count;
>  	int err;
>  	unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
> +	ext4_group_t first_not_zeroed;
>  
>  	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
>  	if (!sbi)
> @@ -2630,6 +3041,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  
>  	/* Set defaults before we parse the mount options */
>  	def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
> +	set_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
>  	if (def_mount_opts & EXT4_DEFM_DEBUG)
>  		set_opt(sbi->s_mount_opt, DEBUG);
>  	if (def_mount_opts & EXT4_DEFM_BSDGROUPS) {
> @@ -2909,7 +3321,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  			goto failed_mount2;
>  		}
>  	}
> -	if (!ext4_check_descriptors(sb)) {
> +	if (!ext4_check_descriptors(sb, &first_not_zeroed)) {
>  		ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
>  		goto failed_mount2;
>  	}
> @@ -3130,6 +3542,10 @@ no_journal:
>  		goto failed_mount4;
>  	}
>  
> +	err = ext4_register_li_request(sb, first_not_zeroed);
> +	if (err)
> +		goto failed_mount4;
> +
>  	sbi->s_kobj.kset = ext4_kset;
>  	init_completion(&sbi->s_kobj_unregister);
>  	err = kobject_init_and_add(&sbi->s_kobj, &ext4_ktype, NULL,
> @@ -3847,6 +4263,19 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
>  			enable_quota = 1;
>  		}
>  	}
> +
> +	/*
> +	 * Reinitialize lazy itable initialization thread based on
> +	 * current settings
> +	 */
> +	if ((sb->s_flags & MS_RDONLY) || !test_opt(sb, INIT_INODE_TABLE))
> +		ext4_unregister_li_request(sb);
> +	else {
> +		ext4_group_t first_not_zeroed;
> +		first_not_zeroed = ext4_has_uninit_itable(sb);
> +		ext4_register_li_request(sb, first_not_zeroed);
> +	}
> +
>  	ext4_setup_system_zone(sb);
>  	if (sbi->s_journal == NULL)
>  		ext4_commit_super(sb, 1);
> @@ -4317,6 +4746,9 @@ static int __init init_ext4_fs(void)
>  	err = register_filesystem(&ext4_fs_type);
>  	if (err)
>  		goto out;
> +
> +	ext4_li_info = NULL;
> +	mutex_init(&ext4_li_mtx);
>  	return 0;
>  out:
>  	unregister_as_ext2();
> @@ -4336,6 +4768,7 @@ out4:
>  
>  static void __exit exit_ext4_fs(void)
>  {
> +	ext4_destroy_lazyinit_thread();
>  	unregister_as_ext2();
>  	unregister_as_ext3();
>  	unregister_filesystem(&ext4_fs_type);
> 

-Lukas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6 v4] Lazy itable initialization for Ext4
  2010-10-04 13:14             ` Lukas Czerner
@ 2010-10-04 13:19               ` Lukas Czerner
  0 siblings, 0 replies; 22+ messages in thread
From: Lukas Czerner @ 2010-10-04 13:19 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: Ted Ts'o, linux-ext4, rwheeler, sandeen, adilger, snitzer

On Mon, 4 Oct 2010, Lukas Czerner wrote:

> Hi Ted,
> 
> first of all, thank you very much for tracking down those issues and for
> all the improvement you have done on this. Now, I have some questions
> about changes you have introduced with this patch.
> 
> On Sun, 3 Oct 2010, Ted Ts'o wrote:
> 
> > I've made some more changes.  This version updates the timing control.
> > The major changes are:
> > 
> > > 1) Time how long it takes to clear the inode table with a barrier (once),
> > and then use it for the rest of the block groups for that file system.
> 
> So if I understand this right it means that we measure the time it takes
> to zeroout inode table just once (set the lr_timeout) and then we use
> this value for all the following zeroouts.
> 
> Initially I have done this "measuring time" thing to adaptively balance
> the load it generates and thus do not disturb other ongoing I/O very
> much. So this change does not really make sense to me, because when we
> measure the time right after the mount (just once) and the system is
> relatively still we end up with rather small lr_timeout and then, when
> system is under heavy load it will keep the same zeroout rate as when
> system was still - resulting in much more impact on performance than my
> previous solution.
> 
> Conversely, when the system is under heavy load when the filesystem
> with init_itable option is mounted the zeroing will proceed very slowly
> even if the system is relatively still later on.
> 
> > 
> > > 2) s_li_wait_mult wasn't getting defaulted, so we weren't waiting any
> > time at all between sb_issue_zeroout calls.
> 
> Actually it is getting defaulted:
> 
> +		case Opt_init_inode_table:
> +			set_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
> +			if (args[0].from) {
> +				if (match_int(&args[0], &option))
> +					return 0;
> +			} else
> +				option = EXT4_DEF_LI_WAIT_MULT;
> +			if (option < 0)
> +				return 0;
> +			sbi->s_li_wait_mult = option;
> +			break;

OK, this is not right :) and it is probably why you thought that it was
not getting defaulted at all (because it actually was not). It should be:

		case Opt_init_inode_table:
			set_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
			option = EXT4_DEF_LI_WAIT_MULT;
			if (args[0].from) {
				if (match_int(&args[0], &option))
					return 0;
			}
			if (option < 0)
				return 0;
			sbi->s_li_wait_mult = option;
			break;
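
For illustration only (not part of the patch), here is the same
default-first, override-if-given order as a standalone userspace sketch.
parse_wait_mult() and DEF_LI_WAIT_MULT are made-up names, strtol() stands
in for match_int(), and the value 10 is assumed from the default stated
in the patch description.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

#define DEF_LI_WAIT_MULT 10	/* assumed stand-in for EXT4_DEF_LI_WAIT_MULT */

/*
 * Hypothetical userspace analogue of the mount option case.
 * arg is the text after "inititable=", or NULL when the option was
 * given without a value.  Returns 1 on success and 0 on failure,
 * following the convention of the surrounding option parser.
 */
static int parse_wait_mult(const char *arg, unsigned int *mult)
{
	long option = DEF_LI_WAIT_MULT;	/* set the default first */
	char *end;

	if (arg) {
		/* Override the default only when a value was supplied. */
		errno = 0;
		option = strtol(arg, &end, 10);
		if (errno || end == arg || *end != '\0')
			return 0;
	}
	if (option < 0 || option > INT_MAX)
		return 0;
	*mult = (unsigned int)option;
	return 1;
}
```

With this ordering the default cannot be skipped, whichever branch is
taken.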

> 
> EXT4_DEF_LI_WAIT_MULT is the default value for s_li_wait_mult.
> 

-Lukas


* [PATCH 2/6] Add inititable/noinititable mount options for ext4
  2010-09-15 16:36 [PATCH 0/6 v3] " Lukas Czerner
@ 2010-09-15 16:36 ` Lukas Czerner
  0 siblings, 0 replies; 22+ messages in thread
From: Lukas Czerner @ 2010-09-15 16:36 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, rwheeler, sandeen, adilger, lczerner

Add a new mount flag, EXT4_MOUNT_INIT_INODE_TABLE, and a new pair of mount
options (inititable/noinititable). When mounted with inititable, the file
system should try to initialize uninitialized inode tables; otherwise it
should not initialize them. For now, the default is noinititable.

One can also specify inititable=n, where n is a number that will be used
as the wait multiplier (see the "Add inode table initialization code into
Ext4" patch for more info). A bigger number means slower inode table
initialization and thus less impact on performance, but a longer
initialization time (the default is 10).

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/ext4.h  |    1 +
 fs/ext4/super.c |   22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+), 0 deletions(-)
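
Not part of the patch: a rough userspace sketch of how a wait multiplier
of this kind could translate into a delay. It assumes the delay before
the next zeroout is simply the time the previous zeroout took, multiplied
by n; the actual logic lives in the inode table initialization patch and
may differ. li_next_wait() and its parameters are hypothetical names.

```c
#include <stdint.h>

/*
 * Hypothetical helper: given how long the last inode table zeroout took
 * (in arbitrary time units) and the wait multiplier, return how long to
 * wait before issuing the next one.  Saturates instead of overflowing.
 */
static uint64_t li_next_wait(uint64_t elapsed, unsigned int wait_mult)
{
	if (wait_mult != 0 && elapsed > UINT64_MAX / wait_mult)
		return UINT64_MAX;
	return elapsed * wait_mult;
}
```

Under this assumption a multiplier of 0 means no wait at all between
zeroout requests, and larger values throttle the initialization more.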

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..dbd6760 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -885,6 +885,7 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_DATA_ERR_ABORT	0x10000000 /* Abort on file data write */
 #define EXT4_MOUNT_BLOCK_VALIDITY	0x20000000 /* Block validity checking */
 #define EXT4_MOUNT_DISCARD		0x40000000 /* Issue DISCARD requests */
+#define EXT4_MOUNT_INIT_INODE_TABLE	0x80000000 /* Initialize uninitialized itables */
 
 #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt
 #define set_opt(o, opt)			o |= EXT4_MOUNT_##opt
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4e8983a..3dbae36 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -986,6 +986,10 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	if (test_opt(sb, DIOREAD_NOLOCK))
 		seq_puts(seq, ",dioread_nolock");
 
+	if (test_opt(sb, INIT_INODE_TABLE))
+		seq_printf(seq, ",init_inode_table=%u",
+			   (unsigned) sbi->s_li_wait_mult);
+
 	ext4_show_quota_options(seq, sb);
 
 	return 0;
@@ -1161,6 +1165,7 @@ enum {
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard,
+	Opt_init_inode_table, Opt_noinit_inode_table,
 };
 
 static const match_table_t tokens = {
@@ -1231,6 +1236,9 @@ static const match_table_t tokens = {
 	{Opt_dioread_lock, "dioread_lock"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
+	{Opt_init_inode_table, "inititable=%u"},
+	{Opt_init_inode_table, "inititable"},
+	{Opt_noinit_inode_table, "noinititable"},
 	{Opt_err, NULL},
 };
 
@@ -1699,6 +1707,20 @@ set_qf_format:
 		case Opt_dioread_lock:
 			clear_opt(sbi->s_mount_opt, DIOREAD_NOLOCK);
 			break;
+		case Opt_init_inode_table:
+			set_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
+			if (args[0].from) {
+				if (match_int(&args[0], &option))
+					return 0;
+			} else
+				option = EXT4_DEF_LI_WAIT_MULT;
+			if (option < 0)
+				return 0;
+			sbi->s_li_wait_mult = option;
+			break;
+		case Opt_noinit_inode_table:
+			clear_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
+			break;
 		default:
 			ext4_msg(sb, KERN_ERR,
 			       "Unrecognized mount option \"%s\" "
-- 
1.7.2.2



end of thread, other threads:[~2010-10-04 13:19 UTC | newest]

Thread overview: 22+ messages
2010-09-16 12:47 [PATCH 0/6 v4] Lazy itable initialization for Ext4 Lukas Czerner
2010-09-16 12:47 ` [PATCH 1/6] Add helper function for blkdev_issue_zeroout Lukas Czerner
2010-09-16 12:47 ` [PATCH 2/6] Add inititable/noinititable mount options for ext4 Lukas Czerner
2010-09-27 18:35   ` Ted Ts'o
2010-09-16 12:47 ` [PATCH 3/6] Add inode table initialization code for Ext4 Lukas Czerner
2010-09-16 12:47 ` [PATCH 4/6] Use sb_issue_zeroout in setup_new_group_blocks Lukas Czerner
2010-09-29 14:12   ` Lukas Czerner
2010-09-29 14:14     ` Lukas Czerner
2010-10-01 16:00       ` [PATCH 4/6 fixed] " Lukas Czerner
2010-09-16 12:47 ` [PATCH 5/6] Use sb_issue_zeroout in ext4_ext_zeroout Lukas Czerner
2010-09-16 12:47 ` [PATCH 6/6] Add interface to advertise ext4 features in sysfs Lukas Czerner
2010-09-28  4:01 ` [PATCH 0/6 v4] Lazy itable initialization for Ext4 Ted Ts'o
2010-09-28 15:05   ` Ted Ts'o
2010-09-29 13:37   ` Lukas Czerner
2010-10-01 15:58     ` Lukas Czerner
2010-10-02 19:55       ` Ted Ts'o
2010-10-03  2:43         ` Ted Ts'o
2010-10-04  2:36           ` Ted Ts'o
2010-10-04  7:31             ` Ted Ts'o
2010-10-04 13:14             ` Lukas Czerner
2010-10-04 13:19               ` Lukas Czerner
  -- strict thread matches above, loose matches on Subject: below --
2010-09-15 16:36 [PATCH 0/6 v3] " Lukas Czerner
2010-09-15 16:36 ` [PATCH 2/6] Add inititable/noinititable mount options for ext4 Lukas Czerner
