* [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+
@ 2019-07-22  1:23 James Simmons
  2019-07-22  1:23 ` [lustre-devel] [PATCH 01/22] ext4: add i_fs_version James Simmons
                   ` (21 more replies)
  0 siblings, 22 replies; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

From: James Simmons <uja.ornl@yahoo.com>

With the work of Shaun Tancheff from Cray to bring the ldiskfs patches
in sync with the kernel in Neil's kernel source tree, it was an easy
port from OpenSFS to the kernel proper. This is just to start the
discussion on how to move forward with mainstreaming what Lustre has
done with ext4. So let the flames begin!!!!

James Simmons (22):
  ext4: add i_fs_version
  ext4: use d_find_alias() in ext4_lookup
  ext4: prealloc table optimization
  ext4: export inode management
  ext4: various misc changes
  ext4: add extra checks for mballoc
  ext4: update .. for hash indexed directory
  ext4: kill off struct dx_root
  ext4: fix mballoc pa free mismatch
  ext4: add data in dentry feature
  ext4: override current_time
  ext4: add htree lock implementation
  ext4: Add a proc interface for max_dir_size.
  ext4: remove inode_lock handling
  ext4: remove bitmap corruption warnings
  ext4: add warning for directory htree growth
  ext4: optimize ext4_journal_callback_add
  ext4: attach jinode in writepages
  ext4: don't check before replay
  ext4: use GFP_NOFS in ext4_inode_attach_jinode
  ext4: export ext4_orphan_add
  ext4: export mb stream allocator variables

 fs/ext4/Makefile     |   4 +-
 fs/ext4/balloc.c     |  11 +-
 fs/ext4/dir.c        |  14 +-
 fs/ext4/ext4.h       | 225 +++++++++++-
 fs/ext4/ext4_jbd2.c  |   4 +
 fs/ext4/ext4_jbd2.h  |   2 +-
 fs/ext4/hash.c       |   1 +
 fs/ext4/htree_lock.c | 891 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/htree_lock.h | 187 ++++++++++
 fs/ext4/ialloc.c     |  10 +-
 fs/ext4/inline.c     |  16 +-
 fs/ext4/inode.c      |  24 +-
 fs/ext4/mballoc.c    | 479 +++++++++++++++++++------
 fs/ext4/mballoc.h    |   4 +-
 fs/ext4/namei.c      | 978 +++++++++++++++++++++++++++++++++++++++++++--------
 fs/ext4/super.c      |  27 +-
 fs/ext4/sysfs.c      |  20 +-
 17 files changed, 2601 insertions(+), 296 deletions(-)
 create mode 100644 fs/ext4/htree_lock.c
 create mode 100644 fs/ext4/htree_lock.h

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [lustre-devel] [PATCH 01/22] ext4: add i_fs_version
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  4:13   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 02/22] ext4: use d_find_alias() in ext4_lookup James Simmons
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

From: James Simmons <uja.ornl@yahoo.com>

Add inode version field that osd-ldiskfs uses.
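A userspace sketch of the split this patch introduces (the toy struct and
helper names below are hypothetical stand-ins, not the kernel API):
extended-attribute inodes keep using the raw VFS iversion, while regular
inodes now read and write the private i_fs_version field.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel structures the patch touches. */
#define EXT4_EA_INODE_FL 0x00200000 /* inode used to store large xattrs */

struct toy_inode {
	uint64_t i_version_raw; /* stand-in for the VFS i_version counter */
};

struct toy_ext4_inode_info {
	struct toy_inode vfs_inode;
	uint32_t i_flags;
	uint64_t i_fs_version; /* the field the patch adds */
};

/* Mirrors ext4_inode_set_iversion_queried() after the patch. */
static void set_iversion(struct toy_ext4_inode_info *ei, uint64_t val)
{
	if (ei->i_flags & EXT4_EA_INODE_FL)
		ei->vfs_inode.i_version_raw = val;
	else
		ei->i_fs_version = val;
}

/* Mirrors ext4_inode_peek_iversion() after the patch. */
static uint64_t peek_iversion(const struct toy_ext4_inode_info *ei)
{
	if (ei->i_flags & EXT4_EA_INODE_FL)
		return ei->vfs_inode.i_version_raw;
	return ei->i_fs_version;
}
```

The point of the split is that a regular inode's version can now be
bumped by osd-ldiskfs without going through the VFS iversion helpers.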

Signed-off-by: James Simmons <uja.ornl@yahoo.com>
---
 fs/ext4/ext4.h   | 2 ++
 fs/ext4/ialloc.c | 1 +
 fs/ext4/inode.c  | 4 ++--
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1cb6785..8abbcab 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1063,6 +1063,8 @@ struct ext4_inode_info {
 	struct dquot *i_dquot[MAXQUOTAS];
 #endif
 
+	u64 i_fs_version;
+
 	/* Precomputed uuid+inum+igen checksum for seeding inode checksums */
 	__u32 i_csum_seed;
 
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 764ff4c..09ae4a4 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1100,6 +1100,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 	ei->i_dtime = 0;
 	ei->i_block_group = group;
 	ei->i_last_alloc_group = ~0;
+	ei->i_fs_version = 0;
 
 	ext4_set_inode_flags(inode);
 	if (IS_DIRSYNC(inode))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c7f77c6..6e66175 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4806,14 +4806,14 @@ static inline void ext4_inode_set_iversion_queried(struct inode *inode, u64 val)
 	if (unlikely(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
 		inode_set_iversion_raw(inode, val);
 	else
-		inode_set_iversion_queried(inode, val);
+		EXT4_I(inode)->i_fs_version = val;
 }
 static inline u64 ext4_inode_peek_iversion(const struct inode *inode)
 {
 	if (unlikely(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
 		return inode_peek_iversion_raw(inode);
 	else
-		return inode_peek_iversion(inode);
+		return EXT4_I(inode)->i_fs_version;
 }
 
 struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
-- 
1.8.3.1


* [lustre-devel] [PATCH 02/22] ext4: use d_find_alias() in ext4_lookup
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
  2019-07-22  1:23 ` [lustre-devel] [PATCH 01/22] ext4: add i_fs_version James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  4:16   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 03/22] ext4: prealloc table optimization James Simmons
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

From: James Simmons <uja.ornl@yahoo.com>

Keep "." and ".." out of the dcache in ext4_lookup() so the dcache
hierarchy is preserved; otherwise the parent can end up registered as a
child of its own child. If the inode already has an alias, reuse it via
d_find_any_alias() instead of instantiating a new dentry. This works
around a really old bug that might not even be reproducible anymore.
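For reference, the condition this patch adds in ext4_lookup() only fires
for the literal "." and ".." names; a standalone sketch of that predicate
(the helper name is made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the name test added in ext4_lookup(): true exactly for the
 * "." and ".." entries that must not create fresh dcache entries. */
static bool is_dot_or_dotdot(const char *name, size_t len)
{
	return name[0] == '.' &&
	       (len == 1 || (len == 2 && name[1] == '.'));
}
```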

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/namei.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index cd01c4a..a616f58 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1664,6 +1664,35 @@ static struct dentry *ext4_lookup(struct inode *dir, struct dentry *dentry, unsi
 		}
 	}
 
+	/* ".." shouldn't go into dcache to preserve dcache hierarchy
+	 * otherwise we'll get parent being a child of actual child.
+	 * see bug 10458 for details -bzzz
+	 */
+	if (inode && (dentry->d_name.name[0] == '.' &&
+		      (dentry->d_name.len == 1 || (dentry->d_name.len == 2 &&
+						   dentry->d_name.name[1] == '.')))) {
+		struct dentry *goal = NULL;
+
+		/* first, look for an existing dentry - any one is good */
+		goal = d_find_any_alias(inode);
+		if (!goal) {
+			spin_lock(&dentry->d_lock);
+			/* there is no alias, we need to make current dentry:
+			 *  a) inaccessible for __d_lookup()
+			 *  b) inaccessible for iopen
+			 */
+			J_ASSERT(hlist_unhashed(&dentry->d_u.d_alias));
+			dentry->d_flags |= DCACHE_NFSFS_RENAMED;
+			/* this is d_instantiate() ... */
+			hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
+			dentry->d_inode = inode;
+			spin_unlock(&dentry->d_lock);
+		}
+		if (goal)
+			iput(inode);
+		return goal;
+	}
+
 #ifdef CONFIG_UNICODE
 	if (!inode && IS_CASEFOLDED(dir)) {
 		/* Eventually we want to call d_add_ci(dentry, NULL)
-- 
1.8.3.1


* [lustre-devel] [PATCH 03/22] ext4: prealloc table optimization
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
  2019-07-22  1:23 ` [lustre-devel] [PATCH 01/22] ext4: add i_fs_version James Simmons
  2019-07-22  1:23 ` [lustre-devel] [PATCH 02/22] ext4: use d_find_alias() in ext4_lookup James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  4:29   ` NeilBrown
  2019-08-05  7:07   ` Artem Blagodarenko
  2019-07-22  1:23 ` [lustre-devel] [PATCH 04/22] ext4: export inode management James Simmons
                   ` (18 subsequent siblings)
  21 siblings, 2 replies; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Replace the fixed file-size heuristic in ext4_mb_normalize_request()
with a per-superblock preallocation table that is tunable through a new
prealloc_table proc file, and split the single mb_stream_req threshold
into separate mb_small_req and mb_large_req tunables.
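A rough userspace model of how the patched ext4_mb_normalize_request()
picks a window from the zero-terminated, ascending table (simplified:
for files larger than every entry this returns a single aligned window
rather than spanning the aligned start to the aligned end of the
request; the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TABLE 64 /* corresponds to EXT4_MAX_PREALLOC_TABLE */

/* Pick a preallocation window for a request of `size` blocks: the first
 * table entry >= size wins; requests larger than every entry get the
 * biggest configured window, with `start` aligned down to a window
 * boundary of the logical offset. An empty table leaves the request
 * unchanged. */
static uint64_t choose_window(const unsigned long *table, uint64_t size,
			      uint64_t logical, uint64_t *start)
{
	unsigned long last = 0;
	int i;

	*start = 0;
	for (i = 0; i < MAX_TABLE && table[i] != 0; i++) {
		last = table[i];
		if (size <= table[i])
			return table[i]; /* smallest window that fits */
	}
	if (last == 0)
		return size; /* empty table: no normalization */
	*start = (logical / last) * last; /* align to window boundary */
	return last;
}
```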

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/ext4.h    |   7 +-
 fs/ext4/inode.c   |   3 +
 fs/ext4/mballoc.c | 221 +++++++++++++++++++++++++++++++++++++++++-------------
 fs/ext4/namei.c   |   4 +-
 fs/ext4/sysfs.c   |   8 +-
 5 files changed, 186 insertions(+), 57 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8abbcab..423ab4d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1190,6 +1190,8 @@ struct ext4_inode_info {
 /* Metadata checksum algorithm codes */
 #define EXT4_CRC32C_CHKSUM		1
 
+#define EXT4_MAX_PREALLOC_TABLE		64
+
 /*
  * Structure of the super block
  */
@@ -1447,12 +1449,14 @@ struct ext4_sb_info {
 
 	/* tunables */
 	unsigned long s_stripe;
-	unsigned int s_mb_stream_request;
+	unsigned long s_mb_small_req;
+	unsigned long s_mb_large_req;
 	unsigned int s_mb_max_to_scan;
 	unsigned int s_mb_min_to_scan;
 	unsigned int s_mb_stats;
 	unsigned int s_mb_order2_reqs;
 	unsigned int s_mb_group_prealloc;
+	unsigned long *s_mb_prealloc_table;
 	unsigned int s_max_dir_size_kb;
 	/* where last allocation was done - for stream allocation */
 	unsigned long s_mb_last_group;
@@ -2457,6 +2461,7 @@ extern int ext4_init_inode_table(struct super_block *sb,
 extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
 
 /* mballoc.c */
+extern const struct file_operations ext4_seq_prealloc_table_fops;
 extern const struct seq_operations ext4_mb_seq_groups_ops;
 extern long ext4_mb_stats;
 extern long ext4_mb_max_to_scan;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6e66175..c37418a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2796,6 +2796,9 @@ static int ext4_writepages(struct address_space *mapping,
 		ext4_journal_stop(handle);
 	}
 
+	if (wbc->nr_to_write < sbi->s_mb_small_req)
+		wbc->nr_to_write = sbi->s_mb_small_req;
+
 	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 		range_whole = 1;
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 99ba720..3be3bef 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2339,6 +2339,101 @@ static void ext4_mb_seq_groups_stop(struct seq_file *seq, void *v)
 	.show   = ext4_mb_seq_groups_show,
 };
 
+static int ext4_mb_check_and_update_prealloc(struct ext4_sb_info *sbi,
+					     char *str, size_t cnt,
+					     int update)
+{
+	unsigned long value;
+	unsigned long prev = 0;
+	char *cur;
+	char *next;
+	char *end;
+	int num = 0;
+
+	cur = str;
+	end = str + cnt;
+	while (cur < end) {
+		while ((cur < end) && (*cur == ' ')) cur++;
+		/* Yuck - simple_strtol */
+		value = simple_strtol(cur, &next, 0);
+		if (value == 0)
+			break;
+		if (cur == next)
+			return -EINVAL;
+
+		cur = next;
+
+		if (value > (sbi->s_blocks_per_group - 1 - 1 - sbi->s_itb_per_group))
+			return -EINVAL;
+
+		/* they should add values in order */
+		if (value <= prev)
+			return -EINVAL;
+
+		if (update)
+			sbi->s_mb_prealloc_table[num] = value;
+
+		prev = value;
+		num++;
+	}
+
+	if (num > EXT4_MAX_PREALLOC_TABLE - 1)
+		return -EOVERFLOW;
+
+	if (update)
+		sbi->s_mb_prealloc_table[num] = 0;
+
+	return 0;
+}
+
+static ssize_t ext4_mb_prealloc_table_proc_write(struct file *file,
+						 const char __user *buf,
+						 size_t cnt, loff_t *pos)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(PDE_DATA(file_inode(file)));
+	char str[128];
+	int rc;
+
+	if (cnt >= sizeof(str))
+		return -EINVAL;
+	if (copy_from_user(str, buf, cnt))
+		return -EFAULT;
+
+	rc = ext4_mb_check_and_update_prealloc(sbi, str, cnt, 0);
+	if (rc)
+		return rc;
+
+	rc = ext4_mb_check_and_update_prealloc(sbi, str, cnt, 1);
+	return rc ? rc : cnt;
+}
+
+static int mb_prealloc_table_seq_show(struct seq_file *m, void *v)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(m->private);
+	int i;
+
+	for (i = 0; i < EXT4_MAX_PREALLOC_TABLE &&
+			sbi->s_mb_prealloc_table[i] != 0; i++)
+		seq_printf(m, "%ld ", sbi->s_mb_prealloc_table[i]);
+	seq_printf(m, "\n");
+
+	return 0;
+}
+
+static int mb_prealloc_table_seq_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, mb_prealloc_table_seq_show, PDE_DATA(inode));
+}
+
+const struct file_operations ext4_seq_prealloc_table_fops = {
+	.owner	 = THIS_MODULE,
+	.open	 = mb_prealloc_table_seq_open,
+	.read	 = seq_read,
+	.llseek	 = seq_lseek,
+	.release = single_release,
+	.write	 = ext4_mb_prealloc_table_proc_write,
+};
+
 static struct kmem_cache *get_groupinfo_cache(int blocksize_bits)
 {
 	int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
@@ -2567,7 +2662,7 @@ static int ext4_groupinfo_create_slab(size_t size)
 int ext4_mb_init(struct super_block *sb)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
-	unsigned i, j;
+	unsigned i, j, k, l;
 	unsigned offset, offset_incr;
 	unsigned max;
 	int ret;
@@ -2616,7 +2711,6 @@ int ext4_mb_init(struct super_block *sb)
 	sbi->s_mb_max_to_scan = MB_DEFAULT_MAX_TO_SCAN;
 	sbi->s_mb_min_to_scan = MB_DEFAULT_MIN_TO_SCAN;
 	sbi->s_mb_stats = MB_DEFAULT_STATS;
-	sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
 	sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
 	/*
 	 * The default group preallocation is 512, which for 4k block
@@ -2640,9 +2734,28 @@ int ext4_mb_init(struct super_block *sb)
 	 * RAID stripe size so that preallocations don't fragment
 	 * the stripes.
 	 */
-	if (sbi->s_stripe > 1) {
-		sbi->s_mb_group_prealloc = roundup(
-			sbi->s_mb_group_prealloc, sbi->s_stripe);
+	/* Allocate table once */
+	sbi->s_mb_prealloc_table = kzalloc(
+		EXT4_MAX_PREALLOC_TABLE * sizeof(unsigned long), GFP_NOFS);
+	if (!sbi->s_mb_prealloc_table) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (sbi->s_stripe == 0) {
+		for (k = 0, l = 4; k <= 9; ++k, l *= 2)
+			sbi->s_mb_prealloc_table[k] = l;
+
+		sbi->s_mb_small_req = 256;
+		sbi->s_mb_large_req = 1024;
+		sbi->s_mb_group_prealloc = 512;
+	} else {
+		for (k = 0, l = sbi->s_stripe; k <= 2; ++k, l *= 2)
+			sbi->s_mb_prealloc_table[k] = l;
+
+		sbi->s_mb_small_req = sbi->s_stripe;
+		sbi->s_mb_large_req = sbi->s_stripe * 8;
+		sbi->s_mb_group_prealloc = sbi->s_stripe * 4;
 	}
 
 	sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
@@ -2670,6 +2783,7 @@ int ext4_mb_init(struct super_block *sb)
 	free_percpu(sbi->s_locality_groups);
 	sbi->s_locality_groups = NULL;
 out:
+	kfree(sbi->s_mb_prealloc_table);
 	kfree(sbi->s_mb_offsets);
 	sbi->s_mb_offsets = NULL;
 	kfree(sbi->s_mb_maxs);
@@ -2932,7 +3046,6 @@ void ext4_exit_mballoc(void)
 	int err, len;
 
 	BUG_ON(ac->ac_status != AC_STATUS_FOUND);
-	BUG_ON(ac->ac_b_ex.fe_len <= 0);
 
 	sb = ac->ac_sb;
 	sbi = EXT4_SB(sb);
@@ -3062,13 +3175,14 @@ static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
 				struct ext4_allocation_request *ar)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	int bsbits, max;
+	int bsbits, i, wind;
 	ext4_lblk_t end;
-	loff_t size, start_off;
+	loff_t size;
 	loff_t orig_size __maybe_unused;
 	ext4_lblk_t start;
 	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
 	struct ext4_prealloc_space *pa;
+	unsigned long value, last_non_zero;
 
 	/* do normalize only data requests, metadata requests
 	   do not need preallocation */
@@ -3097,51 +3211,47 @@ static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
 	size = size << bsbits;
 	if (size < i_size_read(ac->ac_inode))
 		size = i_size_read(ac->ac_inode);
-	orig_size = size;
+	size = (size + ac->ac_sb->s_blocksize - 1) >> bsbits;
 
-	/* max size of free chunks */
-	max = 2 << bsbits;
-
-#define NRL_CHECK_SIZE(req, size, max, chunk_size)	\
-		(req <= (size) || max <= (chunk_size))
-
-	/* first, try to predict filesize */
-	/* XXX: should this table be tunable? */
-	start_off = 0;
-	if (size <= 16 * 1024) {
-		size = 16 * 1024;
-	} else if (size <= 32 * 1024) {
-		size = 32 * 1024;
-	} else if (size <= 64 * 1024) {
-		size = 64 * 1024;
-	} else if (size <= 128 * 1024) {
-		size = 128 * 1024;
-	} else if (size <= 256 * 1024) {
-		size = 256 * 1024;
-	} else if (size <= 512 * 1024) {
-		size = 512 * 1024;
-	} else if (size <= 1024 * 1024) {
-		size = 1024 * 1024;
-	} else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, 2 * 1024)) {
-		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
-						(21 - bsbits)) << 21;
-		size = 2 * 1024 * 1024;
-	} else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, 4 * 1024)) {
-		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
-							(22 - bsbits)) << 22;
-		size = 4 * 1024 * 1024;
-	} else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
-					(8<<20)>>bsbits, max, 8 * 1024)) {
-		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
-							(23 - bsbits)) << 23;
-		size = 8 * 1024 * 1024;
+	start = wind = 0;
+	value = last_non_zero = 0;
+
+	/* let's choose preallocation window depending on file size */
+	for (i = 0; i < EXT4_MAX_PREALLOC_TABLE; i++) {
+		value = sbi->s_mb_prealloc_table[i];
+		if (value == 0)
+			break;
+		else
+			last_non_zero = value;
+
+		if (size <= value) {
+			wind = value;
+			break;
+		}
+	}
+
+	if (wind == 0) {
+		if (last_non_zero != 0) {
+			u64 tstart, tend;
+
+			/* file is quite large, we now preallocate with
+			 * the biggest configured window with regard to
+			 * logical offset
+			 */
+			wind = last_non_zero;
+			tstart = ac->ac_o_ex.fe_logical;
+			do_div(tstart, wind);
+			start = tstart * wind;
+			tend = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len - 1;
+			do_div(tend, wind);
+			tend = tend * wind + wind;
+			size = tend - start;
+		}
 	} else {
-		start_off = (loff_t) ac->ac_o_ex.fe_logical << bsbits;
-		size	  = (loff_t) EXT4_C2B(EXT4_SB(ac->ac_sb),
-					      ac->ac_o_ex.fe_len) << bsbits;
+		size = wind;
 	}
-	size = size >> bsbits;
-	start = start_off >> bsbits;
+
+	orig_size = size;
 
 	/* don't cover already allocated blocks in selected range */
 	if (ar->pleft && start <= ar->lleft) {
@@ -3223,7 +3333,6 @@ static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
 			 (unsigned long) ac->ac_o_ex.fe_logical);
 		BUG();
 	}
-	BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
 
 	/* now prepare goal request */
 
@@ -4191,11 +4300,19 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 
 	/* don't use group allocation for large files */
 	size = max(size, isize);
-	if (size > sbi->s_mb_stream_request) {
+	if ((ac->ac_o_ex.fe_len >= sbi->s_mb_small_req) ||
+	    (size >= sbi->s_mb_large_req)) {
 		ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
 		return;
 	}
 
+	/*
+	 * request is so large that we don't care about
+	 * streaming - it outweighs any possible seek
+	 */
+	if (ac->ac_o_ex.fe_len >= sbi->s_mb_large_req)
+		return;
+
 	BUG_ON(ac->ac_lg != NULL);
 	/*
 	 * locality group prealloc space are per cpu. The reason for having
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index a616f58..a52b311 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -752,8 +752,8 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
 	if (root->info.hash_version != DX_HASH_TEA &&
 	    root->info.hash_version != DX_HASH_HALF_MD4 &&
 	    root->info.hash_version != DX_HASH_LEGACY) {
-		ext4_warning_inode(dir, "Unrecognised inode hash code %u",
-				   root->info.hash_version);
+		ext4_warning_inode(dir, "Unrecognised inode hash code %u for directory %lu",
+				   root->info.hash_version, dir->i_ino);
 		goto fail;
 	}
 	if (fname)
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 04b4f53..1375815 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -184,7 +184,8 @@ static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
 EXT4_RW_ATTR_SBI_UI(mb_max_to_scan, s_mb_max_to_scan);
 EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
 EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
-EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
+EXT4_RW_ATTR_SBI_UI(mb_small_req, s_mb_small_req);
+EXT4_RW_ATTR_SBI_UI(mb_large_req, s_mb_large_req);
 EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
 EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
 EXT4_ATTR(trigger_fs_error, 0200, trigger_test_error);
@@ -213,7 +214,8 @@ static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
 	ATTR_LIST(mb_max_to_scan),
 	ATTR_LIST(mb_min_to_scan),
 	ATTR_LIST(mb_order2_req),
-	ATTR_LIST(mb_stream_req),
+	ATTR_LIST(mb_small_req),
+	ATTR_LIST(mb_large_req),
 	ATTR_LIST(mb_group_prealloc),
 	ATTR_LIST(max_writeback_mb_bump),
 	ATTR_LIST(extent_max_zeroout_kb),
@@ -413,6 +415,8 @@ int ext4_register_sysfs(struct super_block *sb)
 				sb);
 		proc_create_seq_data("mb_groups", S_IRUGO, sbi->s_proc,
 				&ext4_mb_seq_groups_ops, sb);
+		proc_create_data("prealloc_table", S_IRUGO, sbi->s_proc,
+				 &ext4_seq_prealloc_table_fops, sb);
 	}
 	return 0;
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH 04/22] ext4: export inode management
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (2 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 03/22] ext4: prealloc table optimization James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  4:34   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 05/22] ext4: various misc changes James Simmons
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Make ext4_delete_entry() exportable for osd-ldiskfs, and add an
exported ext4_create_inode() helper.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/ext4.h  |  6 ++++++
 fs/ext4/namei.c | 30 ++++++++++++++++++++++++++----
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 423ab4d..50f0c50 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2579,6 +2579,12 @@ extern int ext4_dirent_csum_verify(struct inode *inode,
 				   struct ext4_dir_entry *dirent);
 extern int ext4_orphan_add(handle_t *, struct inode *);
 extern int ext4_orphan_del(handle_t *, struct inode *);
+extern struct inode *ext4_create_inode(handle_t *handle,
+				       struct inode *dir, int mode,
+				       uid_t *owner);
+extern int ext4_delete_entry(handle_t *handle, struct inode * dir,
+			     struct ext4_dir_entry_2 *de_del,
+			     struct buffer_head *bh);
 extern int ext4_htree_fill_tree(struct file *dir_file, __u32 start_hash,
 				__u32 start_minor_hash, __u32 *next_hash);
 extern int ext4_search_dir(struct buffer_head *bh,
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index a52b311..a42a2db 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2459,10 +2459,9 @@ int ext4_generic_delete_entry(handle_t *handle,
 	return -ENOENT;
 }
 
-static int ext4_delete_entry(handle_t *handle,
-			     struct inode *dir,
-			     struct ext4_dir_entry_2 *de_del,
-			     struct buffer_head *bh)
+int ext4_delete_entry(handle_t *handle, struct inode *dir,
+		      struct ext4_dir_entry_2 *de_del,
+		      struct buffer_head *bh)
 {
 	int err, csum_size = 0;
 
@@ -2499,6 +2498,7 @@ static int ext4_delete_entry(handle_t *handle,
 		ext4_std_error(dir->i_sb, err);
 	return err;
 }
+EXPORT_SYMBOL(ext4_delete_entry);
 
 /*
  * Set directory link count to 1 if nlinks > EXT4_LINK_MAX, or if nlinks == 2
@@ -2545,6 +2545,28 @@ static int ext4_add_nondir(handle_t *handle,
 	return err;
 }
 
+/* Return locked inode, then the caller can modify the inode's states/flags
+ * before others finding it. The caller should unlock the inode by itself.
+ */
+struct inode *ext4_create_inode(handle_t *handle, struct inode *dir, int mode,
+				uid_t *owner)
+{
+	struct inode *inode;
+
+	inode = ext4_new_inode(handle, dir, mode, NULL, 0, owner, 0);
+	if (!IS_ERR(inode)) {
+		if (S_ISCHR(mode) || S_ISBLK(mode) || S_ISFIFO(mode)) {
+			inode->i_op = &ext4_special_inode_operations;
+		} else {
+			inode->i_op = &ext4_file_inode_operations;
+			inode->i_fop = &ext4_file_operations;
+			ext4_set_aops(inode);
+		}
+	}
+	return inode;
+}
+EXPORT_SYMBOL(ext4_create_inode);
+
 /*
  * By the time this is called, we already have created
  * the directory cache entry for the new file, but it
-- 
1.8.3.1


* [lustre-devel] [PATCH 05/22] ext4: various misc changes
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (3 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 04/22] ext4: export inode management James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  1:23 ` [lustre-devel] [PATCH 06/22] ext4: add extra checks for mballoc James Simmons
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Mostly exporting symbols for osd-ldiskfs.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/balloc.c    |  1 +
 fs/ext4/ext4.h      | 26 +++++++++++++++++++++++++-
 fs/ext4/ext4_jbd2.c |  4 ++++
 fs/ext4/hash.c      |  1 +
 fs/ext4/ialloc.c    |  3 ++-
 fs/ext4/inode.c     |  6 ++++++
 fs/ext4/namei.c     | 13 +++++++------
 fs/ext4/super.c     |  9 +++------
 8 files changed, 49 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index e5d6ee6..bca75c1 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -294,6 +294,7 @@ struct ext4_group_desc * ext4_get_group_desc(struct super_block *sb,
 		*bh = sbi->s_group_desc[group_desc];
 	return desc;
 }
+EXPORT_SYMBOL(ext4_get_group_desc);
 
 /*
  * Return the block number which was discovered to be invalid, or 0 if
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 50f0c50..eb2d124 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1617,6 +1617,8 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
 
 #define NEXT_ORPHAN(inode) EXT4_I(inode)->i_dtime
 
+#define JOURNAL_START_HAS_3ARGS	1
+
 /*
  * Codes for operating systems
  */
@@ -1839,7 +1841,24 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
 
 EXTN_FEATURE_FUNCS(2)
 EXTN_FEATURE_FUNCS(3)
-EXTN_FEATURE_FUNCS(4)
+
+static inline bool ext4_has_unknown_ext4_compat_features(struct super_block *sb)
+{
+	return ((EXT4_SB(sb)->s_es->s_feature_compat &
+		cpu_to_le32(~EXT4_FEATURE_COMPAT_SUPP)) != 0);
+}
+
+static inline bool ext4_has_unknown_ext4_ro_compat_features(struct super_block *sb)
+{
+	return ((EXT4_SB(sb)->s_es->s_feature_ro_compat &
+		cpu_to_le32(~EXT4_FEATURE_RO_COMPAT_SUPP)) != 0);
+}
+
+static inline bool ext4_has_unknown_ext4_incompat_features(struct super_block *sb)
+{
+	return ((EXT4_SB(sb)->s_es->s_feature_incompat &
+		cpu_to_le32(~EXT4_FEATURE_INCOMPAT_SUPP)) != 0);
+}
 
 static inline bool ext4_has_compat_features(struct super_block *sb)
 {
@@ -3192,6 +3211,11 @@ extern int ext4_check_blockref(const char *, unsigned int,
 
 extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
 extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
+extern struct buffer_head *ext4_read_inode_bitmap(struct super_block *sb,
+						  ext4_group_t block_group);
+extern struct buffer_head *ext4_append(handle_t *handle,
+				       struct inode *inode,
+				       ext4_lblk_t *block);
 extern int ext4_ext_index_trans_blocks(struct inode *inode, int extents);
 extern int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 			       struct ext4_map_blocks *map, int flags);
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 7c70b08..fcec082 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -81,6 +81,7 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 	return jbd2__journal_start(journal, blocks, rsv_blocks, GFP_NOFS,
 				   type, line);
 }
+EXPORT_SYMBOL(__ext4_journal_start_sb);
 
 int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
 {
@@ -108,6 +109,7 @@ int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
 		__ext4_std_error(sb, where, line, err);
 	return err;
 }
+EXPORT_SYMBOL(__ext4_journal_stop);
 
 handle_t *__ext4_journal_start_reserved(handle_t *handle, unsigned int line,
 					int type)
@@ -173,6 +175,7 @@ int __ext4_journal_get_write_access(const char *where, unsigned int line,
 	}
 	return err;
 }
+EXPORT_SYMBOL(__ext4_journal_get_write_access);
 
 /*
  * The ext4 forget function must perform a revoke if we are freeing data
@@ -313,6 +316,7 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
 	}
 	return err;
 }
+EXPORT_SYMBOL(__ext4_handle_dirty_metadata);
 
 int __ext4_handle_dirty_super(const char *where, unsigned int line,
 			      handle_t *handle, struct super_block *sb)
diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
index d358bfc..4648a8f 100644
--- a/fs/ext4/hash.c
+++ b/fs/ext4/hash.c
@@ -300,3 +300,4 @@ int ext4fs_dirhash(const struct inode *dir, const char *name, int len,
 #endif
 	return __ext4fs_dirhash(name, len, hinfo);
 }
+EXPORT_SYMBOL(ext4fs_dirhash);
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 09ae4a4..68d41e6 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -114,7 +114,7 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
  *
  * Return buffer_head of bitmap on success or NULL.
  */
-static struct buffer_head *
+struct buffer_head *
 ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
 {
 	struct ext4_group_desc *desc;
@@ -211,6 +211,7 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
 	put_bh(bh);
 	return ERR_PTR(err);
 }
+EXPORT_SYMBOL(ext4_read_inode_bitmap);
 
 /*
  * NOTE! When we get the inode, we're the only people
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c37418a..5561351 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -741,6 +741,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	}
 	return retval;
 }
+EXPORT_SYMBOL(ext4_map_blocks);
 
 /*
  * Update EXT4_MAP_FLAGS in bh->b_state. For buffer heads attached to pages
@@ -1027,6 +1028,7 @@ struct buffer_head *ext4_bread(handle_t *handle, struct inode *inode,
 	put_bh(bh);
 	return ERR_PTR(-EIO);
 }
+EXPORT_SYMBOL(ext4_bread);
 
 /* Read a contiguous batch of blocks. */
 int ext4_bread_batch(struct inode *inode, ext4_lblk_t block, int bh_count,
@@ -4559,6 +4561,7 @@ int ext4_truncate(struct inode *inode)
 	trace_ext4_truncate_exit(inode);
 	return err;
 }
+EXPORT_SYMBOL(ext4_truncate);
 
 /*
  * ext4_get_inode_loc returns with an extra refcount against the inode's
@@ -4710,6 +4713,7 @@ int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
 	return __ext4_get_inode_loc(inode, iloc,
 		!ext4_test_inode_state(inode, EXT4_STATE_XATTR));
 }
+EXPORT_SYMBOL(ext4_get_inode_loc);
 
 static bool ext4_should_use_dax(struct inode *inode)
 {
@@ -5113,6 +5117,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 	iget_failed(inode);
 	return ERR_PTR(ret);
 }
+EXPORT_SYMBOL(__ext4_iget);
 
 static int ext4_inode_blocks_set(handle_t *handle,
 				struct ext4_inode *raw_inode,
@@ -6050,6 +6055,7 @@ int ext4_mark_inode_dirty(handle_t *handle, struct inode *inode)
 
 	return ext4_mark_iloc_dirty(handle, inode, &iloc);
 }
+EXPORT_SYMBOL(ext4_mark_inode_dirty);
 
 /*
  * ext4_dirty_inode() is called from __mark_inode_dirty()
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index a42a2db..91fc7fe 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -50,9 +50,9 @@
 #define NAMEI_RA_BLOCKS  4
 #define NAMEI_RA_SIZE	     (NAMEI_RA_CHUNKS * NAMEI_RA_BLOCKS)
 
-static struct buffer_head *ext4_append(handle_t *handle,
-					struct inode *inode,
-					ext4_lblk_t *block)
+struct buffer_head *ext4_append(handle_t *handle,
+				struct inode *inode,
+				ext4_lblk_t *block)
 {
 	struct buffer_head *bh;
 	int err;
@@ -2511,24 +2511,25 @@ int ext4_delete_entry(handle_t *handle, struct inode *dir,
  * for checking S_ISDIR(inode) (since the INODE_INDEX feature will not be set
  * on regular files) and to avoid creating huge/slow non-HTREE directories.
  */
-static void ext4_inc_count(handle_t *handle, struct inode *inode)
+void ext4_inc_count(handle_t *handle, struct inode *inode)
 {
 	inc_nlink(inode);
 	if (is_dx(inode) &&
 	    (inode->i_nlink > EXT4_LINK_MAX || inode->i_nlink == 2))
 		set_nlink(inode, 1);
 }
+EXPORT_SYMBOL(ext4_inc_count);
 
 /*
  * If a directory had nlink == 1, then we should let it be 1. This indicates
  * directory has >EXT4_LINK_MAX subdirs.
  */
-static void ext4_dec_count(handle_t *handle, struct inode *inode)
+void ext4_dec_count(handle_t *handle, struct inode *inode)
 {
 	if (!S_ISDIR(inode->i_mode) || inode->i_nlink > 2)
 		drop_nlink(inode);
 }
-
+EXPORT_SYMBOL(ext4_dec_count);
 
 static int ext4_add_nondir(handle_t *handle,
 		struct dentry *dentry, struct inode *inode)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4079605..d15c26c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -279,6 +279,7 @@ __u32 ext4_itable_unused_count(struct super_block *sb,
 		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
 		 (__u32)le16_to_cpu(bg->bg_itable_unused_hi) << 16 : 0);
 }
+EXPORT_SYMBOL(ext4_itable_unused_count);
 
 void ext4_block_bitmap_set(struct super_block *sb,
 			   struct ext4_group_desc *bg, ext4_fsblk_t blk)
@@ -658,6 +659,7 @@ void __ext4_std_error(struct super_block *sb, const char *function,
 	save_error_info(sb, function, line);
 	ext4_handle_error(sb);
 }
+EXPORT_SYMBOL(__ext4_std_error);
 
 /*
  * ext4_abort is a much stronger failure handler than ext4_error.  The
@@ -5101,6 +5103,7 @@ int ext4_force_commit(struct super_block *sb)
 	journal = EXT4_SB(sb)->s_journal;
 	return ext4_journal_force_commit(journal);
 }
+EXPORT_SYMBOL(ext4_force_commit);
 
 static int ext4_sync_fs(struct super_block *sb, int wait)
 {
@@ -6115,16 +6118,12 @@ static int __init ext4_init_fs(void)
 	err = init_inodecache();
 	if (err)
 		goto out1;
-	register_as_ext3();
-	register_as_ext2();
 	err = register_filesystem(&ext4_fs_type);
 	if (err)
 		goto out;
 
 	return 0;
 out:
-	unregister_as_ext2();
-	unregister_as_ext3();
 	destroy_inodecache();
 out1:
 	ext4_exit_mballoc();
@@ -6145,8 +6144,6 @@ static int __init ext4_init_fs(void)
 static void __exit ext4_exit_fs(void)
 {
 	ext4_destroy_lazyinit_thread();
-	unregister_as_ext2();
-	unregister_as_ext3();
 	unregister_filesystem(&ext4_fs_type);
 	destroy_inodecache();
 	ext4_exit_mballoc();
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [lustre-devel] [PATCH 06/22] ext4: add extra checks for mballoc
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (4 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 05/22] ext4: various misc changes James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  4:37   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 07/22] ext4: update .. for hash indexed directory James Simmons
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Add extra consistency checks to mballoc: verify that the on-disk block
bitmap agrees with the free cluster count in the group descriptor
before applying preallocations, track the number of preallocations per
group in bb_prealloc_nr and cross-check it when regenerating the
in-core bitmap, and propagate errors from buddy/bitmap generation
instead of silently continuing.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/ext4.h    |   1 +
 fs/ext4/mballoc.c | 110 +++++++++++++++++++++++++++++++++++++++++++++---------
 fs/ext4/mballoc.h |   2 +-
 3 files changed, 94 insertions(+), 19 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index eb2d124..e321286 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2957,6 +2957,7 @@ struct ext4_group_info {
 	ext4_grpblk_t	bb_fragments;	/* nr of freespace fragments */
 	ext4_grpblk_t	bb_largest_free_order;/* order of largest frag in BG */
 	struct          list_head bb_prealloc_list;
+	unsigned long	bb_prealloc_nr;
 #ifdef DOUBLE_CHECK
 	void            *bb_bitmap;
 #endif
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 3be3bef..483fc0f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -352,8 +352,8 @@
 	"ext4_groupinfo_64k", "ext4_groupinfo_128k"
 };
 
-static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
-					ext4_group_t group);
+static int ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
+				    ext4_group_t group);
 static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
 						ext4_group_t group);
 
@@ -708,8 +708,8 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
 }
 
 static noinline_for_stack
-void ext4_mb_generate_buddy(struct super_block *sb,
-				void *buddy, void *bitmap, ext4_group_t group)
+int ext4_mb_generate_buddy(struct super_block *sb,
+			   void *buddy, void *bitmap, ext4_group_t group)
 {
 	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -752,6 +752,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
 		grp->bb_free = free;
 		ext4_mark_group_bitmap_corrupted(sb, group,
 					EXT4_GROUP_INFO_BBITMAP_CORRUPT);
+		return -EIO;
 	}
 	mb_set_largest_free_order(sb, grp);
 
@@ -762,6 +763,8 @@ void ext4_mb_generate_buddy(struct super_block *sb,
 	sbi->s_mb_buddies_generated++;
 	sbi->s_mb_generation_time += period;
 	spin_unlock(&sbi->s_bal_lock);
+
+	return 0;
 }
 
 static void mb_regenerate_buddy(struct ext4_buddy *e4b)
@@ -882,7 +885,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore, gfp_t gfp)
 	}
 
 	first_block = page->index * blocks_per_page;
-	for (i = 0; i < blocks_per_page; i++) {
+	for (i = 0; i < blocks_per_page && err == 0; i++) {
 		group = (first_block + i) >> 1;
 		if (group >= ngroups)
 			break;
@@ -926,7 +929,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore, gfp_t gfp)
 			ext4_lock_group(sb, group);
 			/* init the buddy */
 			memset(data, 0xff, blocksize);
-			ext4_mb_generate_buddy(sb, data, incore, group);
+			err = ext4_mb_generate_buddy(sb, data, incore, group);
 			ext4_unlock_group(sb, group);
 			incore = NULL;
 		} else {
@@ -941,7 +944,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore, gfp_t gfp)
 			memcpy(data, bitmap, blocksize);
 
 			/* mark all preallocated blks used in in-core bitmap */
-			ext4_mb_generate_from_pa(sb, data, group);
+			err = ext4_mb_generate_from_pa(sb, data, group);
 			ext4_mb_generate_from_freelist(sb, data, group);
 			ext4_unlock_group(sb, group);
 
@@ -951,8 +954,8 @@ static int ext4_mb_init_cache(struct page *page, char *incore, gfp_t gfp)
 			incore = data;
 		}
 	}
-	SetPageUptodate(page);
-
+	if (likely(err == 0))
+		SetPageUptodate(page);
 out:
 	if (bh) {
 		for (i = 0; i < groups_per_page; i++)
@@ -2281,7 +2284,8 @@ static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v)
 {
 	struct super_block *sb = PDE_DATA(file_inode(seq->file));
 	ext4_group_t group = (ext4_group_t) ((unsigned long) v);
-	int i;
+	struct ext4_group_desc *gdp;
+	int free = 0, i;
 	int err, buddy_loaded = 0;
 	struct ext4_buddy e4b;
 	struct ext4_group_info *grinfo;
@@ -2295,7 +2299,7 @@ static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v)
 
 	group--;
 	if (group == 0)
-		seq_puts(seq, "#group: free  frags first ["
+		seq_puts(seq, "#group: bfree  gfree  free  frags first ["
 			      " 2^0   2^1   2^2   2^3   2^4   2^5   2^6  "
 			      " 2^7   2^8   2^9   2^10  2^11  2^12  2^13  ]\n");
 
@@ -2313,13 +2317,19 @@ static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v)
 		buddy_loaded = 1;
 	}
 
+	gdp = ext4_get_group_desc(sb, group, NULL);
+	if (gdp)
+		free = ext4_free_group_clusters(sb, gdp);
+
 	memcpy(&sg, ext4_get_group_info(sb, group), i);
 
 	if (buddy_loaded)
 		ext4_mb_unload_buddy(&e4b);
 
-	seq_printf(seq, "#%-5u: %-5u %-5u %-5u [", group, sg.info.bb_free,
-			sg.info.bb_fragments, sg.info.bb_first_free);
+	seq_printf(seq, "#%-5lu: %-5u %-5u %-5u %-5u %-5lu [",
+		   (long unsigned int)group, sg.info.bb_free, free,
+		   sg.info.bb_fragments, sg.info.bb_first_free,
+		   sg.info.bb_prealloc_nr);
 	for (i = 0; i <= 13; i++)
 		seq_printf(seq, " %-5u", i <= blocksize_bits + 1 ?
 				sg.info.bb_counters[i] : 0);
@@ -3593,6 +3603,42 @@ static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac,
 }
 
 /*
+ * check free blocks in bitmap match free block in group descriptor
+ * do this before taking preallocated blocks into account to be able
+ * to detect on-disk corruptions. The group lock should be held by the
+ * caller.
+ */
+int ext4_mb_check_ondisk_bitmap(struct super_block *sb, void *bitmap,
+				struct ext4_group_desc *gdp, int group)
+{
+	unsigned short max = EXT4_CLUSTERS_PER_GROUP(sb);
+	unsigned short i, first, free = 0;
+	unsigned short free_in_gdp = ext4_free_group_clusters(sb, gdp);
+
+	if (free_in_gdp == 0 && gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
+		return 0;
+
+	i = mb_find_next_zero_bit(bitmap, max, 0);
+
+	while (i < max) {
+		first = i;
+		i = mb_find_next_bit(bitmap, max, i);
+		if (i > max)
+			i = max;
+		free += i - first;
+		if (i < max)
+			i = mb_find_next_zero_bit(bitmap, max, i);
+	}
+
+	if (free != free_in_gdp) {
+		ext4_error(sb, "on-disk bitmap for group %d corrupted: %u blocks free in bitmap, %u - in gd\n",
+			   group, free, free_in_gdp);
+		return -EIO;
+	}
+	return 0;
+}
+
+/*
  * the function goes through all block freed in the group
  * but not yet committed and marks them used in in-core bitmap.
  * buddy must be generated from this bitmap
@@ -3622,16 +3668,27 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
  * Need to be called with ext4 group lock held
  */
 static noinline_for_stack
-void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
-					ext4_group_t group)
+int ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
+			     ext4_group_t group)
 {
 	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
 	struct ext4_prealloc_space *pa;
+	struct ext4_group_desc *gdp;
 	struct list_head *cur;
 	ext4_group_t groupnr;
 	ext4_grpblk_t start;
 	int preallocated = 0;
-	int len;
+	int skip = 0, count = 0;
+	int err, len;
+
+	gdp = ext4_get_group_desc(sb, group, NULL);
+	if (!gdp)
+		return -EIO;
+
+	/* before applying preallocations, check bitmap consistency */
+	err = ext4_mb_check_ondisk_bitmap(sb, bitmap, gdp, group);
+	if (err)
+		return err;
 
 	/* all form of preallocation discards first load group,
 	 * so the only competing code is preallocation use.
@@ -3648,13 +3705,22 @@ void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
 					     &groupnr, &start);
 		len = pa->pa_len;
 		spin_unlock(&pa->pa_lock);
-		if (unlikely(len == 0))
+		if (unlikely(len == 0)) {
+			skip++;
 			continue;
+		}
 		BUG_ON(groupnr != group);
 		ext4_set_bits(bitmap, start, len);
 		preallocated += len;
+		count++;
+	}
+	if (count + skip != grp->bb_prealloc_nr) {
+		ext4_error(sb, "lost preallocations: count %d, bb_prealloc_nr %lu, skip %d\n",
+			   count, grp->bb_prealloc_nr, skip);
+		return -EIO;
 	}
 	mb_debug(1, "preallocated %u for group %u\n", preallocated, group);
+	return 0;
 }
 
 static void ext4_mb_pa_callback(struct rcu_head *head)
@@ -3718,6 +3784,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
 	 */
 	ext4_lock_group(sb, grp);
 	list_del(&pa->pa_group_list);
+	ext4_get_group_info(sb, grp)->bb_prealloc_nr--;
 	ext4_unlock_group(sb, grp);
 
 	spin_lock(pa->pa_obj_lock);
@@ -3812,6 +3879,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
 
 	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
 	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
+	grp->bb_prealloc_nr++;
 	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
 
 	spin_lock(pa->pa_obj_lock);
@@ -3873,6 +3941,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
 
 	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
 	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
+	grp->bb_prealloc_nr++;
 	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
 
 	/*
@@ -4044,6 +4113,8 @@ static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
 
 		spin_unlock(&pa->pa_lock);
 
+		BUG_ON(grp->bb_prealloc_nr == 0);
+		grp->bb_prealloc_nr--;
 		list_del(&pa->pa_group_list);
 		list_add(&pa->u.pa_tmp_list, &list);
 	}
@@ -4174,7 +4245,7 @@ void ext4_discard_preallocations(struct inode *inode)
 		if (err) {
 			ext4_error(sb, "Error %d loading buddy information for %u",
 				   err, group);
-			continue;
+			return;
 		}
 
 		bitmap_bh = ext4_read_block_bitmap(sb, group);
@@ -4187,6 +4258,8 @@ void ext4_discard_preallocations(struct inode *inode)
 		}
 
 		ext4_lock_group(sb, group);
+		BUG_ON(e4b.bd_info->bb_prealloc_nr == 0);
+		e4b.bd_info->bb_prealloc_nr--;
 		list_del(&pa->pa_group_list);
 		ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
 		ext4_unlock_group(sb, group);
@@ -4448,6 +4521,7 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 		}
 		ext4_lock_group(sb, group);
 		list_del(&pa->pa_group_list);
+		ext4_get_group_info(sb, group)->bb_prealloc_nr--;
 		ext4_mb_release_group_pa(&e4b, pa);
 		ext4_unlock_group(sb, group);
 
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 88c98f1..8325ad9 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -70,7 +70,7 @@
 /*
  * for which requests use 2^N search using buddies
  */
-#define MB_DEFAULT_ORDER2_REQS		2
+#define MB_DEFAULT_ORDER2_REQS		8
 
 /*
  * default group prealloc size 512 blocks
-- 
1.8.3.1


* [lustre-devel] [PATCH 07/22] ext4: update .. for hash indexed directory
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (5 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 06/22] ext4: add extra checks for mballoc James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  4:45   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 08/22] ext4: kill off struct dx_root James Simmons
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Handle the special ".." case: when an entry named ".." is added to a
hash-indexed directory, update the existing ".." entry in block 0 in
place (splitting the "." entry's record if necessary) instead of going
through the htree add path, which cannot represent the dot entries.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/namei.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 91fc7fe..dd64a10 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2133,6 +2133,73 @@ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname,
 	return retval;
 }
 
+/* update ".." for hash-indexed directory, split the item "." if necessary */
+static int ext4_update_dotdot(handle_t *handle, struct dentry *dentry,
+			      struct inode *inode)
+{
+	struct inode *dir = dentry->d_parent->d_inode;
+	struct buffer_head *dir_block;
+	struct ext4_dir_entry_2 *de;
+	int len, journal = 0, err = 0;
+
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	if (IS_DIRSYNC(dir))
+		handle->h_sync = 1;
+
+	dir_block = ext4_bread(handle, dir, 0, 0);
+	if (IS_ERR(dir_block)) {
+		err = PTR_ERR(dir_block);
+		goto out;
+	}
+
+	de = (struct ext4_dir_entry_2 *)dir_block->b_data;
+	/* the first item must be "." */
+	assert(de->name_len == 1 && de->name[0] == '.');
+	len = le16_to_cpu(de->rec_len);
+	assert(len >= EXT4_DIR_REC_LEN(1));
+	if (len > EXT4_DIR_REC_LEN(1)) {
+		BUFFER_TRACE(dir_block, "get_write_access");
+		err = ext4_journal_get_write_access(handle, dir_block);
+		if (err)
+			goto out_journal;
+
+		journal = 1;
+		de->rec_len = cpu_to_le16(EXT4_DIR_REC_LEN(1));
+	}
+
+	len -= EXT4_DIR_REC_LEN(1);
+	assert(len == 0 || len >= EXT4_DIR_REC_LEN(2));
+	de = (struct ext4_dir_entry_2 *)
+		((char *) de + le16_to_cpu(de->rec_len));
+	if (!journal) {
+		BUFFER_TRACE(dir_block, "get_write_access");
+		err = ext4_journal_get_write_access(handle, dir_block);
+		if (err)
+			goto out_journal;
+	}
+
+	de->inode = cpu_to_le32(inode->i_ino);
+	if (len > 0)
+		de->rec_len = cpu_to_le16(len);
+	else
+		assert(le16_to_cpu(de->rec_len) >= EXT4_DIR_REC_LEN(2));
+	de->name_len = 2;
+	strcpy(de->name, "..");
+	ext4_set_de_type(dir->i_sb, de, S_IFDIR);
+
+out_journal:
+	if (journal) {
+		BUFFER_TRACE(dir_block, "call ext4_handle_dirty_metadata");
+		err = ext4_handle_dirty_dirent_node(handle, dir, dir_block);
+		ext4_mark_inode_dirty(handle, dir);
+	}
+	brelse(dir_block);
+out:
+	return err;
+}
+
 /*
  *	ext4_add_entry()
  *
@@ -2189,6 +2256,9 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
 	}
 
 	if (is_dx(dir)) {
+		if (dentry->d_name.len == 2 &&
+		    memcmp(dentry->d_name.name, "..", 2) == 0)
+			return ext4_update_dotdot(handle, dentry, inode);
 		retval = ext4_dx_add_entry(handle, &fname, dir, inode);
 		if (!retval || (retval != ERR_BAD_DX_DIR))
 			goto out;
-- 
1.8.3.1


* [lustre-devel] [PATCH 08/22] ext4: kill off struct dx_root
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (6 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 07/22] ext4: update .. for hash indexed directory James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  4:52   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 09/22] ext4: fix mballoc pa free mismatch James Simmons
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Kill off struct dx_root and use struct dx_root_info directly, locating
it by walking past the "." and ".." entries. This removes the
assumption that the two dot entries have fixed-size records, which no
longer holds once extra data can be stored in directory entries.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/namei.c | 110 +++++++++++++++++++++++++++++---------------------------
 1 file changed, 58 insertions(+), 52 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index dd64a10..4c45570 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -197,23 +197,13 @@ struct dx_entry
  * dirent the two low bits of the hash version will be zero.  Therefore, the
  * hash version mod 4 should never be 0.  Sincerely, the paranoia department.
  */
-
-struct dx_root
+struct dx_root_info
 {
-	struct fake_dirent dot;
-	char dot_name[4];
-	struct fake_dirent dotdot;
-	char dotdot_name[4];
-	struct dx_root_info
-	{
-		__le32 reserved_zero;
-		u8 hash_version;
-		u8 info_length; /* 8 */
-		u8 indirect_levels;
-		u8 unused_flags;
-	}
-	info;
-	struct dx_entry	entries[0];
+	__le32 reserved_zero;
+	u8 hash_version;
+	u8 info_length; /* 8 */
+	u8 indirect_levels;
+	u8 unused_flags;
 };
 
 struct dx_node
@@ -514,6 +504,16 @@ static inline int ext4_handle_dirty_dx_node(handle_t *handle,
  * Future: use high four bits of block for coalesce-on-delete flags
  * Mask them off for now.
  */
+struct dx_root_info *dx_get_dx_info(struct ext4_dir_entry_2 *de)
+{
+	/* get dotdot first */
+	de = (struct ext4_dir_entry_2 *)((char *)de + EXT4_DIR_REC_LEN(1));
+
+	/* dx root info is after dotdot entry */
+	de = (struct ext4_dir_entry_2 *)((char *)de + EXT4_DIR_REC_LEN(2));
+
+	return (struct dx_root_info *)de;
+}
 
 static inline ext4_lblk_t dx_get_block(struct dx_entry *entry)
 {
@@ -738,7 +738,7 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
 {
 	unsigned count, indirect;
 	struct dx_entry *at, *entries, *p, *q, *m;
-	struct dx_root *root;
+	struct dx_root_info *info;
 	struct dx_frame *frame = frame_in;
 	struct dx_frame *ret_err = ERR_PTR(ERR_BAD_DX_DIR);
 	u32 hash;
@@ -748,17 +748,17 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
 	if (IS_ERR(frame->bh))
 		return (struct dx_frame *) frame->bh;
 
-	root = (struct dx_root *) frame->bh->b_data;
-	if (root->info.hash_version != DX_HASH_TEA &&
-	    root->info.hash_version != DX_HASH_HALF_MD4 &&
-	    root->info.hash_version != DX_HASH_LEGACY) {
+	info = dx_get_dx_info((struct ext4_dir_entry_2 *)frame->bh->b_data);
+	if (info->hash_version != DX_HASH_TEA &&
+	    info->hash_version != DX_HASH_HALF_MD4 &&
+	    info->hash_version != DX_HASH_LEGACY) {
 		ext4_warning_inode(dir, "Unrecognised inode hash code %u for directory %lu",
-				   root->info.hash_version, dir->i_ino);
+				   info->hash_version, dir->i_ino);
 		goto fail;
 	}
 	if (fname)
 		hinfo = &fname->hinfo;
-	hinfo->hash_version = root->info.hash_version;
+	hinfo->hash_version = info->hash_version;
 	if (hinfo->hash_version <= DX_HASH_TEA)
 		hinfo->hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned;
 	hinfo->seed = EXT4_SB(dir->i_sb)->s_hash_seed;
@@ -766,13 +766,13 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
 		ext4fs_dirhash(dir, fname_name(fname), fname_len(fname), hinfo);
 	hash = hinfo->hash;
 
-	if (root->info.unused_flags & 1) {
+	if (info->unused_flags & 1) {
 		ext4_warning_inode(dir, "Unimplemented hash flags: %#06x",
-				   root->info.unused_flags);
+				   info->unused_flags);
 		goto fail;
 	}
 
-	indirect = root->info.indirect_levels;
+	indirect = info->indirect_levels;
 	if (indirect >= ext4_dir_htree_level(dir->i_sb)) {
 		ext4_warning(dir->i_sb,
 			     "Directory (ino: %lu) htree depth %#06x exceed"
@@ -785,14 +785,13 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
 		goto fail;
 	}
 
-	entries = (struct dx_entry *)(((char *)&root->info) +
-				      root->info.info_length);
+	entries = (struct dx_entry *)(((char *)info) + info->info_length);
 
 	if (dx_get_limit(entries) != dx_root_limit(dir,
-						   root->info.info_length)) {
+						   info->info_length)) {
 		ext4_warning_inode(dir, "dx entry: limit %u != root limit %u",
 				   dx_get_limit(entries),
-				   dx_root_limit(dir, root->info.info_length));
+				   dx_root_limit(dir, info->info_length));
 		goto fail;
 	}
 
@@ -877,7 +876,7 @@ static void dx_release(struct dx_frame *frames)
 	if (frames[0].bh == NULL)
 		return;
 
-	info = &((struct dx_root *)frames[0].bh->b_data)->info;
+	info = dx_get_dx_info((struct ext4_dir_entry_2 *)frames[0].bh->b_data);
 	/* save local copy, "info" may be freed after brelse() */
 	indirect_levels = info->indirect_levels;
 	for (i = 0; i <= indirect_levels; i++) {
@@ -2020,17 +2019,16 @@ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname,
 			    struct inode *inode, struct buffer_head *bh)
 {
 	struct buffer_head *bh2;
-	struct dx_root	*root;
 	struct dx_frame	frames[EXT4_HTREE_LEVEL], *frame;
 	struct dx_entry *entries;
-	struct ext4_dir_entry_2	*de, *de2;
+	struct ext4_dir_entry_2	*de, *de2, *dot_de, *dotdot_de;
 	struct ext4_dir_entry_tail *t;
 	char		*data1, *top;
 	unsigned	len;
 	int		retval;
 	unsigned	blocksize;
 	ext4_lblk_t  block;
-	struct fake_dirent *fde;
+	struct dx_root_info *dx_info;
 	int csum_size = 0;
 
 	if (ext4_has_metadata_csum(inode->i_sb))
@@ -2045,18 +2043,19 @@ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname,
 		brelse(bh);
 		return retval;
 	}
-	root = (struct dx_root *) bh->b_data;
+
+	dot_de = (struct ext4_dir_entry_2 *)bh->b_data;
+	dotdot_de = ext4_next_entry(dot_de, blocksize);
 
 	/* The 0th block becomes the root, move the dirents out */
-	fde = &root->dotdot;
-	de = (struct ext4_dir_entry_2 *)((char *)fde +
-		ext4_rec_len_from_disk(fde->rec_len, blocksize));
-	if ((char *) de >= (((char *) root) + blocksize)) {
+	de = (struct ext4_dir_entry_2 *)((char *)dotdot_de +
+		ext4_rec_len_from_disk(dotdot_de->rec_len, blocksize));
+	if ((char *)de >= (((char *)dot_de) + blocksize)) {
 		EXT4_ERROR_INODE(dir, "invalid rec_len for '..'");
 		brelse(bh);
 		return -EFSCORRUPTED;
 	}
-	len = ((char *) root) + (blocksize - csum_size) - (char *) de;
+	len = ((char *)dot_de) + (blocksize - csum_size) - (char *) de;
 
 	/* Allocate new block for the 0th block's dirents */
 	bh2 = ext4_append(handle, dir, &block);
@@ -2082,19 +2081,24 @@ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname,
 	}
 
 	/* Initialize the root; the dot dirents already exist */
-	de = (struct ext4_dir_entry_2 *) (&root->dotdot);
-	de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(2),
-					   blocksize);
-	memset (&root->info, 0, sizeof(root->info));
-	root->info.info_length = sizeof(root->info);
-	root->info.hash_version = EXT4_SB(dir->i_sb)->s_def_hash_version;
-	entries = root->entries;
+	dotdot_de->rec_len =
+		ext4_rec_len_to_disk(blocksize - le16_to_cpu(dot_de->rec_len),
+				     blocksize);
+
+	/* initialize hashing info */
+	dx_info = dx_get_dx_info(dot_de);
+	memset(dx_info, 0, sizeof(*dx_info));
+	dx_info->info_length = sizeof(*dx_info);
+	dx_info->hash_version = EXT4_SB(dir->i_sb)->s_def_hash_version;
+
+	entries = (void *)dx_info + sizeof(*dx_info);
+
 	dx_set_block(entries, 1);
 	dx_set_count(entries, 1);
-	dx_set_limit(entries, dx_root_limit(dir, sizeof(root->info)));
+	dx_set_limit(entries, dx_root_limit(dir, sizeof(*dx_info)));
 
 	/* Initialize as for dx_probe */
-	fname->hinfo.hash_version = root->info.hash_version;
+	fname->hinfo.hash_version = dx_info->hash_version;
 	if (fname->hinfo.hash_version <= DX_HASH_TEA)
 		fname->hinfo.hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned;
 	fname->hinfo.seed = EXT4_SB(dir->i_sb)->s_hash_seed;
@@ -2443,7 +2447,8 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
 				goto journal_error;
 			}
 		} else {
-			struct dx_root *dxroot;
+			struct dx_root_info *info;
+
 			memcpy((char *) entries2, (char *) entries,
 			       icount * sizeof(struct dx_entry));
 			dx_set_limit(entries2, dx_node_limit(dir));
@@ -2451,8 +2456,9 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
 			/* Set up root */
 			dx_set_count(entries, 1);
 			dx_set_block(entries + 0, newblock);
-			dxroot = (struct dx_root *)frames[0].bh->b_data;
-			dxroot->info.indirect_levels += 1;
+			info = dx_get_dx_info((struct ext4_dir_entry_2 *)
+					      frames[0].bh->b_data);
+			info->indirect_levels += 1;
 			dxtrace(printk(KERN_DEBUG
 				       "Creating %d level index...\n",
-				       dxroot->info.indirect_levels));
+				       info->indirect_levels));
-- 
1.8.3.1


* [lustre-devel] [PATCH 09/22] ext4: fix mballoc pa free mismatch
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (7 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 08/22] ext4: kill off struct dx_root James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  4:56   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 10/22] ext4: add data in dentry feature James Simmons
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Introduce a pa_error counter in struct ext4_prealloc_space so that
preallocation accounting problems (pa_free mismatches and failed
bitmap updates) can be detected and reported during cleanup.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/mballoc.c | 45 +++++++++++++++++++++++++++++++++++++++------
 fs/ext4/mballoc.h |  2 ++
 2 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 483fc0f..463fba6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3863,6 +3863,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
 	INIT_LIST_HEAD(&pa->pa_group_list);
 	pa->pa_deleted = 0;
 	pa->pa_type = MB_INODE_PA;
+	pa->pa_error = 0;
 
 	mb_debug(1, "new inode pa %p: %llu/%u for %u\n", pa,
 			pa->pa_pstart, pa->pa_len, pa->pa_lstart);
@@ -3924,6 +3925,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
 	INIT_LIST_HEAD(&pa->pa_group_list);
 	pa->pa_deleted = 0;
 	pa->pa_type = MB_GROUP_PA;
+	pa->pa_error = 0;
 
 	mb_debug(1, "new group pa %p: %llu/%u for %u\n", pa,
 			pa->pa_pstart, pa->pa_len, pa->pa_lstart);
@@ -3983,7 +3985,9 @@ static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
 	unsigned long long grp_blk_start;
 	int free = 0;
 
+	assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group));
 	BUG_ON(pa->pa_deleted == 0);
+	BUG_ON(pa->pa_inode == NULL);
 	ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
 	grp_blk_start = pa->pa_pstart - EXT4_C2B(sbi, bit);
 	BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
@@ -4006,12 +4010,19 @@ static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
 		mb_free_blocks(pa->pa_inode, e4b, bit, next - bit);
 		bit = next + 1;
 	}
-	if (free != pa->pa_free) {
-		ext4_msg(e4b->bd_sb, KERN_CRIT,
-			 "pa %p: logic %lu, phys. %lu, len %lu",
-			 pa, (unsigned long) pa->pa_lstart,
-			 (unsigned long) pa->pa_pstart,
-			 (unsigned long) pa->pa_len);
+
+	/* "free < pa->pa_free" means we maybe double alloc the same blocks,
+	 * otherwise maybe leave some free blocks unavailable, no need to BUG.
+	 */
+	if ((free > pa->pa_free && !pa->pa_error) || (free < pa->pa_free)) {
+		ext4_error(sb, "pa free mismatch: [pa %p] "
+				"[phy %lu] [logic %lu] [len %u] [free %u] "
+				"[error %u] [inode %lu] [freed %u]", pa,
+				(unsigned long)pa->pa_pstart,
+				(unsigned long)pa->pa_lstart,
+				(unsigned)pa->pa_len, (unsigned)pa->pa_free,
+				(unsigned)pa->pa_error, pa->pa_inode->i_ino,
+				free);
 		ext4_grp_locked_error(sb, group, 0, 0, "free %u, pa_free %u",
 					free, pa->pa_free);
 		/*
@@ -4019,6 +4030,8 @@ static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
 		 * from the bitmap and continue.
 		 */
 	}
+	/* do not verify if the file system is being umounted */
+	BUG_ON(atomic_read(&sb->s_active) > 0 && pa->pa_free != free);
 	atomic_add(free, &sbi->s_mb_discarded);
 
 	return 0;
@@ -4764,6 +4777,26 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 		ac->ac_b_ex.fe_len = 0;
 		ar->len = 0;
 		ext4_mb_show_ac(ac);
+		if (ac->ac_pa) {
+			struct ext4_prealloc_space *pa = ac->ac_pa;
+
+			/* We cannot tell whether the bitmap was
+			 * updated in the failure case, so pa_free
+			 * cannot be reverted; just record the
+			 * failure in pa_error. */
+			pa->pa_error++;
+			ext4_error(sb,
+				   "Updating bitmap error: [err %d] "
+				   "[pa %p] [phy %lu] [logic %lu] "
+				   "[len %u] [free %u] [error %u] "
+				   "[inode %lu]", *errp, pa,
+				   (unsigned long)pa->pa_pstart,
+				   (unsigned long)pa->pa_lstart,
+				   (unsigned)pa->pa_len,
+				   (unsigned)pa->pa_free,
+				   (unsigned)pa->pa_error,
+				   pa->pa_inode ? pa->pa_inode->i_ino : 0);
+		}
 	}
 	ext4_mb_release_context(ac);
 out:
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 8325ad9..e00c3b7 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -20,6 +20,7 @@
 #include <linux/seq_file.h>
 #include <linux/blkdev.h>
 #include <linux/mutex.h>
+#include <linux/genhd.h>
 #include "ext4_jbd2.h"
 #include "ext4.h"
 
@@ -111,6 +112,7 @@ struct ext4_prealloc_space {
 	ext4_grpblk_t		pa_len;		/* len of preallocated chunk */
 	ext4_grpblk_t		pa_free;	/* how many blocks are free */
 	unsigned short		pa_type;	/* pa type. inode or group */
+	unsigned short		pa_error;
 	spinlock_t		*pa_obj_lock;
 	struct inode		*pa_inode;	/* hack, for history only */
 };
-- 
1.8.3.1


* [lustre-devel] [PATCH 10/22] ext4: add data in dentry feature
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (8 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 09/22] ext4: fix mballoc pa free mismatch James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  1:23 ` [lustre-devel] [PATCH 11/22] ext4: over ride current_time James Simmons
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Add support for storing extra data in on-disk directory entries,
flagged in the entry's file_type field. Lustre uses this to store its
file identifier (LUFID) alongside the entry name.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/dir.c    |  14 ++--
 fs/ext4/ext4.h   |  92 +++++++++++++++++++++--
 fs/ext4/inline.c |  16 ++--
 fs/ext4/namei.c  | 217 +++++++++++++++++++++++++++++++++++++++++++------------
 fs/ext4/super.c  |   4 +-
 5 files changed, 278 insertions(+), 65 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index c7843b1..20e0d32 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -70,11 +70,11 @@ int __ext4_check_dir_entry(const char *function, unsigned int line,
 	const int rlen = ext4_rec_len_from_disk(de->rec_len,
 						dir->i_sb->s_blocksize);
 
-	if (unlikely(rlen < EXT4_DIR_REC_LEN(1)))
+	if (unlikely(rlen < __EXT4_DIR_REC_LEN(1)))
 		error_msg = "rec_len is smaller than minimal";
 	else if (unlikely(rlen % 4 != 0))
 		error_msg = "rec_len % 4 != 0";
-	else if (unlikely(rlen < EXT4_DIR_REC_LEN(de->name_len)))
+	else if (unlikely(rlen < __EXT4_DIR_REC_LEN(de->name_len)))
 		error_msg = "rec_len is too small for name_len";
 	else if (unlikely(((char *) de - buf) + rlen > size))
 		error_msg = "directory entry overrun";
@@ -219,7 +219,7 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx)
 				 * failure will be detected in the
 				 * dirent test below. */
 				if (ext4_rec_len_from_disk(de->rec_len,
-					sb->s_blocksize) < EXT4_DIR_REC_LEN(1))
+					sb->s_blocksize) < __EXT4_DIR_REC_LEN(1))
 					break;
 				i += ext4_rec_len_from_disk(de->rec_len,
 							    sb->s_blocksize);
@@ -441,13 +441,17 @@ int ext4_htree_store_dirent(struct file *dir_file, __u32 hash,
 	struct rb_node **p, *parent = NULL;
 	struct fname *fname, *new_fn;
 	struct dir_private_info *info;
+	int extra_data = 0;
 	int len;
 
 	info = dir_file->private_data;
 	p = &info->root.rb_node;
 
 	/* Create and allocate the fname structure */
-	len = sizeof(struct fname) + ent_name->len + 1;
+	if (dirent->file_type & EXT4_DIRENT_LUFID)
+		extra_data = ext4_get_dirent_data_len(dirent);
+
+	len = sizeof(struct fname) + ent_name->len + extra_data + 1;
 	new_fn = kzalloc(len, GFP_KERNEL);
 	if (!new_fn)
 		return -ENOMEM;
@@ -456,7 +460,7 @@ int ext4_htree_store_dirent(struct file *dir_file, __u32 hash,
 	new_fn->inode = le32_to_cpu(dirent->inode);
 	new_fn->name_len = ent_name->len;
 	new_fn->file_type = dirent->file_type;
-	memcpy(new_fn->name, ent_name->name, ent_name->len);
+	memcpy(new_fn->name, ent_name->name, ent_name->len + extra_data);
 	new_fn->name[ent_name->len] = 0;
 
 	while (*p) {
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e321286..51b6159 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1069,6 +1069,7 @@ struct ext4_inode_info {
 	__u32 i_csum_seed;
 
 	kprojid_t i_projid;
+	void *i_dirdata;
 };
 
 /*
@@ -1112,6 +1113,7 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_POSIX_ACL		0x08000	/* POSIX Access Control Lists */
 #define EXT4_MOUNT_NO_AUTO_DA_ALLOC	0x10000	/* No auto delalloc mapping */
 #define EXT4_MOUNT_BARRIER		0x20000 /* Use block barriers */
+#define EXT4_MOUNT_DIRDATA		0x40000 /* Data in directory entries */
 #define EXT4_MOUNT_QUOTA		0x40000 /* Some quota option set */
 #define EXT4_MOUNT_USRQUOTA		0x80000 /* "old" user quota,
 						 * enable enforcement for hidden
@@ -1805,6 +1807,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
 					 EXT4_FEATURE_INCOMPAT_FLEX_BG| \
 					 EXT4_FEATURE_INCOMPAT_EA_INODE| \
 					 EXT4_FEATURE_INCOMPAT_MMP | \
+					 EXT4_FEATURE_INCOMPAT_DIRDATA | \
 					 EXT4_FEATURE_INCOMPAT_INLINE_DATA | \
 					 EXT4_FEATURE_INCOMPAT_ENCRYPT | \
 					 EXT4_FEATURE_INCOMPAT_CASEFOLD | \
@@ -1984,6 +1987,43 @@ struct ext4_dir_entry_tail {
 #define EXT4_FT_SYMLINK		7
 
 #define EXT4_FT_MAX		8
+#define EXT4_FT_MASK           0xf
+
+#if EXT4_FT_MAX > EXT4_FT_MASK
+#error "conflicting EXT4_FT_MAX and EXT4_FT_MASK"
+#endif
+
+/*
+ * d_type has 4 unused high bits, so it can hold four types of extra data.
+ * These data (e.g. Lustre data, the high 32 bits of a 64-bit inode number)
+ * are stored, in flag order, after the file name in the ext4 dirent.
+ */
+/*
+ * This flag is added to d_type if the ext4 dirent has extra data after the
+ * filename. The length of this data is variable and is stored in the first
+ * byte of the data, which starts after the filename's NUL terminator.
+ * This is used by the Lustre filesystem.
+ */
+#define EXT4_DIRENT_LUFID	0x10
+
+#define EXT4_LUFID_MAGIC	0xAD200907UL
+struct ext4_dentry_param {
+	__u32  edp_magic;	/* EXT4_LUFID_MAGIC */
+	char   edp_len;		/* size of edp_data in bytes */
+	char   edp_data[0];	/* packed array of data */
+} __packed;
+
+static inline unsigned char *ext4_dentry_get_data(struct super_block *sb,
+						  struct ext4_dentry_param *p)
+{
+	if (!ext4_has_feature_dirdata(sb))
+		return NULL;
+
+	if (p && p->edp_magic == EXT4_LUFID_MAGIC)
+		return &p->edp_len;
+	else
+		return NULL;
+}
 
 #define EXT4_FT_DIR_CSUM	0xDE
 
@@ -1994,8 +2034,11 @@ struct ext4_dir_entry_tail {
  */
 #define EXT4_DIR_PAD			4
 #define EXT4_DIR_ROUND			(EXT4_DIR_PAD - 1)
-#define EXT4_DIR_REC_LEN(name_len)	(((name_len) + 8 + EXT4_DIR_ROUND) & \
+#define __EXT4_DIR_REC_LEN(name_len)	(((name_len) + 8 + EXT4_DIR_ROUND) & \
 					 ~EXT4_DIR_ROUND)
+#define EXT4_DIR_REC_LEN(de)		(__EXT4_DIR_REC_LEN((de)->name_len +\
+					 ext4_get_dirent_data_len(de)))
+
 #define EXT4_MAX_REC_LEN		((1<<16)-1)
 
 /*
@@ -2422,11 +2465,11 @@ extern int ext4_find_dest_de(struct inode *dir, struct inode *inode,
 			     struct buffer_head *bh,
 			     void *buf, int buf_size,
 			     struct ext4_filename *fname,
-			     struct ext4_dir_entry_2 **dest_de);
+			     struct ext4_dir_entry_2 **dest_de, int *dlen);
 void ext4_insert_dentry(struct inode *inode,
 			struct ext4_dir_entry_2 *de,
 			int buf_size,
-			struct ext4_filename *fname);
+			struct ext4_filename *fname, void *data);
 static inline void ext4_update_dx_flag(struct inode *inode)
 {
 	if (!ext4_has_feature_dir_index(inode->i_sb))
@@ -2438,11 +2481,18 @@ static inline void ext4_update_dx_flag(struct inode *inode)
 
 static inline  unsigned char get_dtype(struct super_block *sb, int filetype)
 {
-	if (!ext4_has_feature_filetype(sb) || filetype >= EXT4_FT_MAX)
+	int fl_index = filetype & EXT4_FT_MASK;
+
+	if (!ext4_has_feature_filetype(sb) || fl_index >= EXT4_FT_MAX)
 		return DT_UNKNOWN;
 
-	return ext4_filetype_table[filetype];
+	if (!test_opt(sb, DIRDATA))
+		return ext4_filetype_table[fl_index];
+
+	return (ext4_filetype_table[fl_index]) |
+		(filetype & EXT4_DIRENT_LUFID);
 }
+
 extern int ext4_check_all_de(struct inode *dir, struct buffer_head *bh,
 			     void *buf, int buf_size);
 
@@ -2604,6 +2654,8 @@ extern struct inode *ext4_create_inode(handle_t *handle,
 extern int ext4_delete_entry(handle_t *handle, struct inode * dir,
 			     struct ext4_dir_entry_2 *de_del,
 			     struct buffer_head *bh);
+extern int ext4_add_dot_dotdot(handle_t *handle, struct inode *dir,
+			       struct inode *inode, const void *data1, const void *data2);
 extern int ext4_htree_fill_tree(struct file *dir_file, __u32 start_hash,
 				__u32 start_minor_hash, __u32 *next_hash);
 extern int ext4_search_dir(struct buffer_head *bh,
@@ -3339,6 +3391,36 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
 
 extern const struct iomap_ops ext4_iomap_ops;
 
+/*
+ * Compute the total directory entry data length.
+ * This includes the filename and an implicit NUL terminator (always present),
+ * and optional extensions.  Each extension has a bit set in the high 4 bits of
+ * de->file_type, and the extension length is the first byte in each entry.
+ */
+static inline int ext4_get_dirent_data_len(struct ext4_dir_entry_2 *de)
+{
+	char *len = de->name + de->name_len + 1 /* NUL terminator */;
+	int dlen = 0;
+	u8 extra_data_flags = (de->file_type & ~EXT4_FT_MASK) >> 4;
+	struct ext4_dir_entry_tail *t = (struct ext4_dir_entry_tail *)de;
+
+	if (!t->det_reserved_zero1 &&
+	    le16_to_cpu(t->det_rec_len) ==
+		sizeof(struct ext4_dir_entry_tail) &&
+	    !t->det_reserved_zero2 &&
+	    t->det_reserved_ft == EXT4_FT_DIR_CSUM)
+		return 0;
+
+	while (extra_data_flags) {
+		if (extra_data_flags & 1) {
+			dlen += *len + (dlen == 0);
+			len += *len;
+		}
+		extra_data_flags >>= 1;
+	}
+	return dlen;
+}
+
 #endif	/* __KERNEL__ */
 
 #define EFSBADCRC	EBADMSG		/* Bad CRC detected */
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index f73bc39..7610cfe 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1023,7 +1023,7 @@ static int ext4_add_dirent_to_inline(handle_t *handle,
 	struct ext4_dir_entry_2 *de;
 
 	err = ext4_find_dest_de(dir, inode, iloc->bh, inline_start,
-				inline_size, fname, &de);
+				inline_size, fname, &de, NULL);
 	if (err)
 		return err;
 
@@ -1031,7 +1031,7 @@ static int ext4_add_dirent_to_inline(handle_t *handle,
 	err = ext4_journal_get_write_access(handle, iloc->bh);
 	if (err)
 		return err;
-	ext4_insert_dentry(inode, de, inline_size, fname);
+	ext4_insert_dentry(inode, de, inline_size, fname, NULL);
 
 	ext4_show_inline_dir(dir, iloc->bh, inline_start, inline_size);
 
@@ -1100,7 +1100,7 @@ static int ext4_update_inline_dir(handle_t *handle, struct inode *dir,
 	int old_size = EXT4_I(dir)->i_inline_size - EXT4_MIN_INLINE_DATA_SIZE;
 	int new_size = get_max_inline_xattr_value_size(dir, iloc);
 
-	if (new_size - old_size <= EXT4_DIR_REC_LEN(1))
+	if (new_size - old_size <= __EXT4_DIR_REC_LEN(1))
 		return -ENOSPC;
 
 	ret = ext4_update_inline_data(handle, dir,
@@ -1381,7 +1381,7 @@ int htree_inlinedir_to_tree(struct file *dir_file,
 			fake.name_len = 1;
 			strcpy(fake.name, ".");
 			fake.rec_len = ext4_rec_len_to_disk(
-						EXT4_DIR_REC_LEN(fake.name_len),
+						EXT4_DIR_REC_LEN(&fake),
 						inline_size);
 			ext4_set_de_type(inode->i_sb, &fake, S_IFDIR);
 			de = &fake;
@@ -1391,7 +1391,7 @@ int htree_inlinedir_to_tree(struct file *dir_file,
 			fake.name_len = 2;
 			strcpy(fake.name, "..");
 			fake.rec_len = ext4_rec_len_to_disk(
-						EXT4_DIR_REC_LEN(fake.name_len),
+						EXT4_DIR_REC_LEN(&fake),
 						inline_size);
 			ext4_set_de_type(inode->i_sb, &fake, S_IFDIR);
 			de = &fake;
@@ -1489,8 +1489,8 @@ int ext4_read_inline_dir(struct file *file,
 	 * So we will use extra_offset and extra_size to indicate them
 	 * during the inline dir iteration.
 	 */
-	dotdot_offset = EXT4_DIR_REC_LEN(1);
-	dotdot_size = dotdot_offset + EXT4_DIR_REC_LEN(2);
+	dotdot_offset = __EXT4_DIR_REC_LEN(1);
+	dotdot_size = dotdot_offset + __EXT4_DIR_REC_LEN(2);
 	extra_offset = dotdot_size - EXT4_INLINE_DOTDOT_SIZE;
 	extra_size = extra_offset + inline_size;
 
@@ -1525,7 +1525,7 @@ int ext4_read_inline_dir(struct file *file,
 			 * failure will be detected in the
 			 * dirent test below. */
 			if (ext4_rec_len_from_disk(de->rec_len, extra_size)
-				< EXT4_DIR_REC_LEN(1))
+				< __EXT4_DIR_REC_LEN(1))
 				break;
 			i += ext4_rec_len_from_disk(de->rec_len,
 						    extra_size);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 4c45570..9cb86e4 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -243,7 +243,9 @@ struct dx_tail {
 static unsigned dx_get_limit(struct dx_entry *entries);
 static void dx_set_count(struct dx_entry *entries, unsigned value);
 static void dx_set_limit(struct dx_entry *entries, unsigned value);
-static unsigned dx_root_limit(struct inode *dir, unsigned infosize);
+static inline unsigned int dx_root_limit(struct inode *dir,
+					 struct ext4_dir_entry_2 *dot_de,
+					 unsigned int infosize);
 static unsigned dx_node_limit(struct inode *dir);
 static struct dx_frame *dx_probe(struct ext4_filename *fname,
 				 struct inode *dir,
@@ -384,24 +386,26 @@ static struct dx_countlimit *get_dx_countlimit(struct inode *inode,
 					       struct ext4_dir_entry *dirent,
 					       int *offset)
 {
+	int dot_rec_len, dotdot_rec_len;
 	struct ext4_dir_entry *dp;
 	struct dx_root_info *root;
 	int count_offset;
 
 	if (le16_to_cpu(dirent->rec_len) == EXT4_BLOCK_SIZE(inode->i_sb))
 		count_offset = 8;
-	else if (le16_to_cpu(dirent->rec_len) == 12) {
-		dp = (struct ext4_dir_entry *)(((void *)dirent) + 12);
+	else {
+		dot_rec_len = le16_to_cpu(dirent->rec_len);
+		dp = (struct ext4_dir_entry *)(((void *)dirent) + dot_rec_len);
 		if (le16_to_cpu(dp->rec_len) !=
-		    EXT4_BLOCK_SIZE(inode->i_sb) - 12)
+		    EXT4_BLOCK_SIZE(inode->i_sb) - dot_rec_len)
 			return NULL;
-		root = (struct dx_root_info *)(((void *)dp + 12));
+		dotdot_rec_len = EXT4_DIR_REC_LEN((struct ext4_dir_entry_2 *)dp);
+		root = (struct dx_root_info *)(((void *)dp + dotdot_rec_len));
 		if (root->reserved_zero ||
 		    root->info_length != sizeof(struct dx_root_info))
 			return NULL;
-		count_offset = 32;
-	} else
-		return NULL;
+		count_offset = 8 + dot_rec_len + dotdot_rec_len;
+	}
 
 	if (offset)
 		*offset = count_offset;
@@ -507,10 +511,10 @@ static inline int ext4_handle_dirty_dx_node(handle_t *handle,
 struct dx_root_info *dx_get_dx_info(struct ext4_dir_entry_2 *de)
 {
 	/* get dotdot first */
-	de = (struct ext4_dir_entry_2 *)((char *)de + EXT4_DIR_REC_LEN(1));
+	de = (struct ext4_dir_entry_2 *)((char *)de + EXT4_DIR_REC_LEN(de));
 
 	/* dx root info is after dotdot entry */
-	de = (struct ext4_dir_entry_2 *)((char *)de + EXT4_DIR_REC_LEN(2));
+	de = (struct ext4_dir_entry_2 *)((char *)de + EXT4_DIR_REC_LEN(de));
 
 	return (struct dx_root_info *)de;
 }
@@ -555,10 +559,17 @@ static inline void dx_set_limit(struct dx_entry *entries, unsigned value)
 	((struct dx_countlimit *) entries)->limit = cpu_to_le16(value);
 }
 
-static inline unsigned dx_root_limit(struct inode *dir, unsigned infosize)
+static inline unsigned int dx_root_limit(struct inode *dir,
+					 struct ext4_dir_entry_2 *dot_de,
+					 unsigned int infosize)
 {
-	unsigned entry_space = dir->i_sb->s_blocksize - EXT4_DIR_REC_LEN(1) -
-		EXT4_DIR_REC_LEN(2) - infosize;
+	struct ext4_dir_entry_2 *dotdot_de;
+	unsigned int entry_space;
+
+	BUG_ON(dot_de->name_len != 1);
+	dotdot_de = ext4_next_entry(dot_de, dir->i_sb->s_blocksize);
+	entry_space = dir->i_sb->s_blocksize - EXT4_DIR_REC_LEN(dot_de) -
+		      EXT4_DIR_REC_LEN(dotdot_de) - infosize;
 
 	if (ext4_has_metadata_csum(dir->i_sb))
 		entry_space -= sizeof(struct dx_tail);
@@ -567,7 +578,7 @@ static inline unsigned dx_root_limit(struct inode *dir, unsigned infosize)
 
 static inline unsigned dx_node_limit(struct inode *dir)
 {
-	unsigned entry_space = dir->i_sb->s_blocksize - EXT4_DIR_REC_LEN(0);
+	unsigned entry_space = dir->i_sb->s_blocksize - __EXT4_DIR_REC_LEN(0);
 
 	if (ext4_has_metadata_csum(dir->i_sb))
 		entry_space -= sizeof(struct dx_tail);
@@ -679,7 +690,7 @@ static struct stats dx_show_leaf(struct inode *dir,
 				       (unsigned) ((char *) de - base));
 #endif
 			}
-			space += EXT4_DIR_REC_LEN(de->name_len);
+			space += EXT4_DIR_REC_LEN(de);
 			names++;
 		}
 		de = ext4_next_entry(de, size);
@@ -787,11 +798,14 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
 
 	entries = (struct dx_entry *)(((char *)info) + info->info_length);
 
-	if (dx_get_limit(entries) != dx_root_limit(dir,
-						   info->info_length)) {
+	if (dx_get_limit(entries) !=
+	    dx_root_limit(dir, (struct ext4_dir_entry_2 *)frame->bh->b_data,
+			  info->info_length)) {
 		ext4_warning_inode(dir, "dx entry: limit %u != root limit %u",
 				   dx_get_limit(entries),
-				   dx_root_limit(dir, info->info_length));
+				   dx_root_limit(dir,
+						 (struct ext4_dir_entry_2 *)frame->bh->b_data,
+						 info->info_length));
 		goto fail;
 	}
 
@@ -986,7 +1000,7 @@ static int htree_dirblock_to_tree(struct file *dir_file,
 	de = (struct ext4_dir_entry_2 *) bh->b_data;
 	top = (struct ext4_dir_entry_2 *) ((char *) de +
 					   dir->i_sb->s_blocksize -
-					   EXT4_DIR_REC_LEN(0));
+					   __EXT4_DIR_REC_LEN(0));
 #ifdef CONFIG_FS_ENCRYPTION
 	/* Check if the directory is encrypted */
 	if (IS_ENCRYPTED(dir)) {
@@ -1743,7 +1757,7 @@ struct dentry *ext4_get_parent(struct dentry *child)
 	while (count--) {
 		struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)
 						(from + (map->offs<<2));
-		rec_len = EXT4_DIR_REC_LEN(de->name_len);
+		rec_len = EXT4_DIR_REC_LEN(de);
 		memcpy (to, de, rec_len);
 		((struct ext4_dir_entry_2 *) to)->rec_len =
 				ext4_rec_len_to_disk(rec_len, blocksize);
@@ -1767,7 +1781,7 @@ static struct ext4_dir_entry_2* dx_pack_dirents(char *base, unsigned blocksize)
 	while ((char*)de < base + blocksize) {
 		next = ext4_next_entry(de, blocksize);
 		if (de->inode && de->name_len) {
-			rec_len = EXT4_DIR_REC_LEN(de->name_len);
+			rec_len = EXT4_DIR_REC_LEN(de);
 			if (de > to)
 				memmove(to, de, rec_len);
 			to->rec_len = ext4_rec_len_to_disk(rec_len, blocksize);
@@ -1898,14 +1912,16 @@ int ext4_find_dest_de(struct inode *dir, struct inode *inode,
 		      struct buffer_head *bh,
 		      void *buf, int buf_size,
 		      struct ext4_filename *fname,
-		      struct ext4_dir_entry_2 **dest_de)
+		      struct ext4_dir_entry_2 **dest_de, int *dlen)
 {
 	struct ext4_dir_entry_2 *de;
-	unsigned short reclen = EXT4_DIR_REC_LEN(fname_len(fname));
+	unsigned short reclen = __EXT4_DIR_REC_LEN(fname_len(fname)) +
+						   (dlen ? *dlen : 0);
 	int nlen, rlen;
 	unsigned int offset = 0;
 	char *top;
 
+	if (dlen) *dlen = 0;	/* default to 0 */
 	de = (struct ext4_dir_entry_2 *)buf;
 	top = buf + buf_size - reclen;
 	while ((char *) de <= top) {
@@ -1914,10 +1930,29 @@ int ext4_find_dest_de(struct inode *dir, struct inode *inode,
 			return -EFSCORRUPTED;
 		if (ext4_match(dir, fname, de))
 			return -EEXIST;
-		nlen = EXT4_DIR_REC_LEN(de->name_len);
+		nlen = EXT4_DIR_REC_LEN(de);
 		rlen = ext4_rec_len_from_disk(de->rec_len, buf_size);
 		if ((de->inode ? rlen - nlen : rlen) >= reclen)
 			break;
+		/* Then for dotdot entries, check for the smaller space
+		 * required for just the entry, no FID
+		 */
+		if (fname_len(fname) == 2 && memcmp(fname_name(fname), "..", 2) == 0) {
+			if ((de->inode ? rlen - nlen : rlen) >=
+			    __EXT4_DIR_REC_LEN(fname_len(fname))) {
+				/* set *dlen = 1 to indicate there is
+				 * not enough space to store the FID
+				 */
+				if (dlen) *dlen = 1;
+				break;
+			}
+			/* The new ".." entry must be written over the
+			 * previous ".." entry, which is the first
+			 * entry traversed by this scan. If it doesn't
+			 * fit, something is badly wrong, so -EIO.
+			 */
+			return -EIO;
+		}
 		de = (struct ext4_dir_entry_2 *)((char *)de + rlen);
 		offset += rlen;
 	}
@@ -1931,12 +1966,12 @@ int ext4_find_dest_de(struct inode *dir, struct inode *inode,
 void ext4_insert_dentry(struct inode *inode,
 			struct ext4_dir_entry_2 *de,
 			int buf_size,
-			struct ext4_filename *fname)
+			struct ext4_filename *fname, void *data)
 {
 
 	int nlen, rlen;
 
-	nlen = EXT4_DIR_REC_LEN(de->name_len);
+	nlen = EXT4_DIR_REC_LEN(de);
 	rlen = ext4_rec_len_from_disk(de->rec_len, buf_size);
 	if (de->inode) {
 		struct ext4_dir_entry_2 *de1 =
@@ -1950,6 +1985,11 @@ void ext4_insert_dentry(struct inode *inode,
 	ext4_set_de_type(inode->i_sb, de, inode->i_mode);
 	de->name_len = fname_len(fname);
 	memcpy(de->name, fname_name(fname), fname_len(fname));
+	if (data) {
+		de->name[fname_len(fname)] = 0;
+		memcpy(&de->name[fname_len(fname) + 1], data, *(char *)data);
+		de->file_type |= EXT4_DIRENT_LUFID;
+	}
 }
 
 /*
@@ -1966,15 +2006,21 @@ static int add_dirent_to_buf(handle_t *handle, struct ext4_filename *fname,
 			     struct buffer_head *bh)
 {
 	unsigned int	blocksize = dir->i_sb->s_blocksize;
+	unsigned char *data;
 	int		csum_size = 0;
+	int dlen = 0;
 	int		err;
 
+	data = ext4_dentry_get_data(inode->i_sb,
+				    (struct ext4_dentry_param *)EXT4_I(inode)->i_dirdata);
 	if (ext4_has_metadata_csum(inode->i_sb))
 		csum_size = sizeof(struct ext4_dir_entry_tail);
 
 	if (!de) {
+		if (data)
+			dlen = (*data) + 1;
 		err = ext4_find_dest_de(dir, inode, bh, bh->b_data,
-					blocksize - csum_size, fname, &de);
+					blocksize - csum_size, fname, &de, &dlen);
 		if (err)
 			return err;
 	}
@@ -1986,7 +2032,10 @@ static int add_dirent_to_buf(handle_t *handle, struct ext4_filename *fname,
 	}
 
 	/* By now the buffer is marked for journaling */
-	ext4_insert_dentry(inode, de, blocksize, fname);
+	/* If writing the short form of "dotdot", don't add the data section */
+	if (dlen == 1)
+		data = NULL;
+	ext4_insert_dentry(inode, de, blocksize, fname, data);
 
 	/*
 	 * XXX shouldn't update any times until successful
@@ -2095,7 +2144,8 @@ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname,
 
 	dx_set_block(entries, 1);
 	dx_set_count(entries, 1);
-	dx_set_limit(entries, dx_root_limit(dir, sizeof(*dx_info)));
+	dx_set_limit(entries, dx_root_limit(dir, dot_de,
+					    sizeof(*dx_info)));
 
 	/* Initialize as for dx_probe */
 	fname->hinfo.hash_version = dx_info->hash_version;
@@ -2145,6 +2195,8 @@ static int ext4_update_dotdot(handle_t *handle, struct dentry *dentry,
 	struct buffer_head *dir_block;
 	struct ext4_dir_entry_2 *de;
 	int len, journal = 0, err = 0;
+	int dlen = 0;
+	char *data;
 
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
@@ -2162,19 +2214,24 @@ static int ext4_update_dotdot(handle_t *handle, struct dentry *dentry,
 	/* the first item must be "." */
 	assert(de->name_len == 1 && de->name[0] == '.');
 	len = le16_to_cpu(de->rec_len);
-	assert(len >= EXT4_DIR_REC_LEN(1));
-	if (len > EXT4_DIR_REC_LEN(1)) {
+	assert(len >= __EXT4_DIR_REC_LEN(1));
+	if (len > __EXT4_DIR_REC_LEN(1)) {
 		BUFFER_TRACE(dir_block, "get_write_access");
 		err = ext4_journal_get_write_access(handle, dir_block);
 		if (err)
 			goto out_journal;
 
 		journal = 1;
-		de->rec_len = cpu_to_le16(EXT4_DIR_REC_LEN(1));
+		de->rec_len = cpu_to_le16(EXT4_DIR_REC_LEN(de));
 	}
 
-	len -= EXT4_DIR_REC_LEN(1);
-	assert(len == 0 || len >= EXT4_DIR_REC_LEN(2));
+	len -= EXT4_DIR_REC_LEN(de);
+	data = ext4_dentry_get_data(dir->i_sb,
+			(struct ext4_dentry_param *)dentry->d_fsdata);
+	if (data)
+		dlen = *data + 1;
+	assert(len == 0 || len >= __EXT4_DIR_REC_LEN(2 + dlen));
+
 	de = (struct ext4_dir_entry_2 *)
 		((char *) de + le16_to_cpu(de->rec_len));
 	if (!journal) {
@@ -2188,10 +2245,15 @@ static int ext4_update_dotdot(handle_t *handle, struct dentry *dentry,
 	if (len > 0)
 		de->rec_len = cpu_to_le16(len);
 	else
-		assert(le16_to_cpu(de->rec_len) >= EXT4_DIR_REC_LEN(2));
+		assert(le16_to_cpu(de->rec_len) >= __EXT4_DIR_REC_LEN(2));
 	de->name_len = 2;
 	strcpy(de->name, "..");
-	ext4_set_de_type(dir->i_sb, de, S_IFDIR);
+	if (data && ext4_get_dirent_data_len(de) >= dlen) {
+		de->name[2] = 0;
+		memcpy(&de->name[2 + 1], data, *data);
+		ext4_set_de_type(dir->i_sb, de, S_IFDIR);
+		de->file_type |= EXT4_DIRENT_LUFID;
+	}
 
 out_journal:
 	if (journal) {
@@ -2230,6 +2292,7 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
 	ext4_lblk_t block, blocks;
 	int	csum_size = 0;
 
+	EXT4_I(inode)->i_dirdata = dentry->d_fsdata;
 	if (ext4_has_metadata_csum(inode->i_sb))
 		csum_size = sizeof(struct ext4_dir_entry_tail);
 
@@ -2757,37 +2820,71 @@ static int ext4_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 	return err;
 }
 
+struct tp_block {
+	struct inode	*inode;
+	void		*data1;
+	void		*data2;
+};
+
 struct ext4_dir_entry_2 *ext4_init_dot_dotdot(struct inode *inode,
 			  struct ext4_dir_entry_2 *de,
 			  int blocksize, int csum_size,
 			  unsigned int parent_ino, int dotdot_real_len)
 {
+	void *data1 = NULL, *data2 = NULL;
+	int dot_reclen = 0;
+
+	if (dotdot_real_len == 10) {
+		struct tp_block *tpb = (struct tp_block *)inode;
+
+		data1 = tpb->data1;
+		data2 = tpb->data2;
+		inode = tpb->inode;
+		dotdot_real_len = 0;
+	}
 	de->inode = cpu_to_le32(inode->i_ino);
 	de->name_len = 1;
-	de->rec_len = ext4_rec_len_to_disk(EXT4_DIR_REC_LEN(de->name_len),
-					   blocksize);
 	strcpy(de->name, ".");
 	ext4_set_de_type(inode->i_sb, de, S_IFDIR);
 
+	/* get packed FID data */
+	data1 = ext4_dentry_get_data(inode->i_sb,
+			(struct ext4_dentry_param *) data1);
+	if (data1) {
+		de->name[1] = 0;
+		memcpy(&de->name[2], data1, *(char *) data1);
+		de->file_type |= EXT4_DIRENT_LUFID;
+	}
+	de->rec_len = cpu_to_le16(EXT4_DIR_REC_LEN(de));
+	dot_reclen = le16_to_cpu(de->rec_len);
 	de = ext4_next_entry(de, blocksize);
 	de->inode = cpu_to_le32(parent_ino);
 	de->name_len = 2;
+	strcpy(de->name, "..");
+	ext4_set_de_type(inode->i_sb, de, S_IFDIR);
+	data2 = ext4_dentry_get_data(inode->i_sb,
+			(struct ext4_dentry_param *) data2);
+	if (data2) {
+		de->name[2] = 0;
+		memcpy(&de->name[3], data2, *(char *) data2);
+		de->file_type |= EXT4_DIRENT_LUFID;
+	}
 	if (!dotdot_real_len)
 		de->rec_len = ext4_rec_len_to_disk(blocksize -
-					(csum_size + EXT4_DIR_REC_LEN(1)),
+					(csum_size + dot_reclen),
 					blocksize);
 	else
 		de->rec_len = ext4_rec_len_to_disk(
-				EXT4_DIR_REC_LEN(de->name_len), blocksize);
-	strcpy(de->name, "..");
-	ext4_set_de_type(inode->i_sb, de, S_IFDIR);
+				EXT4_DIR_REC_LEN(de), blocksize);
 
 	return ext4_next_entry(de, blocksize);
 }
 
 static int ext4_init_new_dir(handle_t *handle, struct inode *dir,
-			     struct inode *inode)
+			     struct inode *inode,
+			     const void *data1, const void *data2)
 {
+	struct tp_block param;
 	struct buffer_head *dir_block = NULL;
 	struct ext4_dir_entry_2 *de;
 	struct ext4_dir_entry_tail *t;
@@ -2812,7 +2909,11 @@ static int ext4_init_new_dir(handle_t *handle, struct inode *dir,
 	if (IS_ERR(dir_block))
 		return PTR_ERR(dir_block);
 	de = (struct ext4_dir_entry_2 *)dir_block->b_data;
-	ext4_init_dot_dotdot(inode, de, blocksize, csum_size, dir->i_ino, 0);
+	param.inode = inode;
+	param.data1 = (void *)data1;
+	param.data2 = (void *)data2;
+	ext4_init_dot_dotdot((struct inode *)(&param), de, blocksize,
+			     csum_size, dir->i_ino, 10);
 	set_nlink(inode, 2);
 	if (csum_size) {
 		t = EXT4_DIRENT_TAIL(dir_block->b_data, blocksize);
@@ -2829,6 +2930,30 @@ static int ext4_init_new_dir(handle_t *handle, struct inode *dir,
 	return err;
 }
 
+/* Initialize @inode as a subdirectory of @dir, and add the
+ * "." and ".." entries into the first directory block.
+ */
+int ext4_add_dot_dotdot(handle_t *handle, struct inode *dir,
+			struct inode *inode,
+			const void *data1, const void *data2)
+{
+	int rc;
+
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	if (IS_DIRSYNC(dir))
+		ext4_handle_sync(handle);
+
+	inode->i_op = &ext4_dir_inode_operations;
+	inode->i_fop = &ext4_dir_operations;
+	rc = ext4_init_new_dir(handle, dir, inode, data1, data2);
+	if (!rc)
+		rc = ext4_mark_inode_dirty(handle, inode);
+	return rc;
+}
+EXPORT_SYMBOL(ext4_add_dot_dotdot);
+
 static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 {
 	handle_t *handle;
@@ -2855,7 +2980,7 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 
 	inode->i_op = &ext4_dir_inode_operations;
 	inode->i_fop = &ext4_dir_operations;
-	err = ext4_init_new_dir(handle, dir, inode);
+	err = ext4_init_new_dir(handle, dir, inode, NULL, NULL);
 	if (err)
 		goto out_clear_inode;
 	err = ext4_mark_inode_dirty(handle, inode);
@@ -2906,7 +3031,7 @@ bool ext4_empty_dir(struct inode *inode)
 	}
 
 	sb = inode->i_sb;
-	if (inode->i_size < EXT4_DIR_REC_LEN(1) + EXT4_DIR_REC_LEN(2)) {
+	if (inode->i_size < __EXT4_DIR_REC_LEN(1) + __EXT4_DIR_REC_LEN(2)) {
 		EXT4_ERROR_INODE(inode, "invalid size");
 		return true;
 	}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d15c26c..564eb35 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1447,7 +1447,7 @@ enum {
 	Opt_data_err_abort, Opt_data_err_ignore, Opt_test_dummy_encryption,
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
-	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
+	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err, Opt_dirdata,
 	Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version, Opt_dax,
 	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_warn_on_error,
 	Opt_nowarn_on_error, Opt_mblk_io_submit,
@@ -1474,6 +1474,7 @@ enum {
 	{Opt_err_ro, "errors=remount-ro"},
 	{Opt_nouid32, "nouid32"},
 	{Opt_debug, "debug"},
+	{Opt_dirdata, "dirdata"},
 	{Opt_removed, "oldalloc"},
 	{Opt_removed, "orlov"},
 	{Opt_user_xattr, "user_xattr"},
@@ -1747,6 +1748,7 @@ static int clear_qf_name(struct super_block *sb, int qtype)
 	{Opt_grpjquota, 0, MOPT_Q},
 	{Opt_offusrjquota, 0, MOPT_Q},
 	{Opt_offgrpjquota, 0, MOPT_Q},
+	{Opt_dirdata, EXT4_MOUNT_DIRDATA, MOPT_SET},
 	{Opt_jqfmt_vfsold, QFMT_VFS_OLD, MOPT_QFMT},
 	{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
 	{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
-- 
1.8.3.1


* [lustre-devel] [PATCH 11/22] ext4: over ride current_time
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (9 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 10/22] ext4: add data in dentry feature James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  5:06   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 12/22] ext4: add htree lock implementation James Simmons
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Return i_ctime if IS_NOCMTIME is set on the inode.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/ext4.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 51b6159..80601a9 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -661,6 +661,13 @@ enum {
 #define EXT4_GOING_FLAGS_LOGFLUSH		0x1	/* flush log but not data */
 #define EXT4_GOING_FLAGS_NOLOGFLUSH		0x2	/* don't flush log nor data */
 
+static inline struct timespec64 ext4_current_time(struct inode *inode)
+{
+	if (IS_NOCMTIME(inode))
+		return inode->i_ctime;
+	return current_time(inode);
+}
+#define current_time(a) ext4_current_time(a)
 
 #if defined(__KERNEL__) && defined(CONFIG_COMPAT)
 /*
-- 
1.8.3.1


* [lustre-devel] [PATCH 12/22] ext4: add htree lock implementation
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (10 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 11/22] ext4: over ride current_time James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  5:10   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 13/22] ext4: Add a proc interface for max_dir_size James Simmons
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Add an htree lock implementation, used for parallel directory operations.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/Makefile     |   4 +-
 fs/ext4/ext4.h       |  80 +++++
 fs/ext4/htree_lock.c | 891 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/htree_lock.h | 187 +++++++++++
 fs/ext4/namei.c      | 461 +++++++++++++++++++++++---
 fs/ext4/super.c      |   1 +
 6 files changed, 1583 insertions(+), 41 deletions(-)
 create mode 100644 fs/ext4/htree_lock.c
 create mode 100644 fs/ext4/htree_lock.h

diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index 8fdfcd3..0d0a28f 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -6,8 +6,8 @@
 obj-$(CONFIG_EXT4_FS) += ext4.o
 
 ext4-y	:= balloc.o bitmap.o block_validity.o dir.o ext4_jbd2.o extents.o \
-		extents_status.o file.o fsmap.o fsync.o hash.o ialloc.o \
-		indirect.o inline.o inode.o ioctl.o mballoc.o migrate.o \
+		extents_status.o file.o fsmap.o fsync.o hash.o htree_lock.o \
+		ialloc.o indirect.o inline.o inode.o ioctl.o mballoc.o migrate.o \
 		mmp.o move_extent.o namei.o page-io.o readpage.o resize.o \
 		super.o symlink.o sysfs.o xattr.o xattr_trusted.o xattr_user.o
 
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 80601a9..bf74c7c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -44,6 +44,8 @@
 
 #include <linux/compiler.h>
 
+#include "htree_lock.h"
+
 /*
  * The fourth extended filesystem constants/structures
  */
@@ -936,6 +938,9 @@ struct ext4_inode_info {
 	__u32	i_dtime;
 	ext4_fsblk_t	i_file_acl;
 
+	/* following fields for parallel directory operations -bzzz */
+	struct semaphore i_append_sem;
+
 	/*
 	 * i_block_group is the number of the block group which contains
 	 * this file's inode.  Constant across the lifetime of the inode,
@@ -2145,6 +2150,72 @@ struct dx_hash_info
  */
 #define HASH_NB_ALWAYS		1
 
+/* assume name-hash is protected by upper layer */
+#define EXT4_HTREE_LOCK_HASH	0
+
+enum ext4_pdo_lk_types {
+#if EXT4_HTREE_LOCK_HASH
+	EXT4_LK_HASH,
+#endif
+	EXT4_LK_DX,		/* index block */
+	EXT4_LK_DE,		/* directory entry block */
+	EXT4_LK_SPIN,		/* spinlock */
+	EXT4_LK_MAX,
+};
+
+/* read-only bit */
+#define EXT4_LB_RO(b)		(1 << (b))
+/* read + write, high bits for writer */
+#define EXT4_LB_RW(b)		((1 << (b)) | (1 << (EXT4_LK_MAX + (b))))
+
+enum ext4_pdo_lock_bits {
+	/* DX lock bits */
+	EXT4_LB_DX_RO		= EXT4_LB_RO(EXT4_LK_DX),
+	EXT4_LB_DX		= EXT4_LB_RW(EXT4_LK_DX),
+	/* DE lock bits */
+	EXT4_LB_DE_RO		= EXT4_LB_RO(EXT4_LK_DE),
+	EXT4_LB_DE		= EXT4_LB_RW(EXT4_LK_DE),
+	/* DX spinlock bits */
+	EXT4_LB_SPIN_RO		= EXT4_LB_RO(EXT4_LK_SPIN),
+	EXT4_LB_SPIN		= EXT4_LB_RW(EXT4_LK_SPIN),
+	/* accurate searching */
+	EXT4_LB_EXACT		= EXT4_LB_RO(EXT4_LK_MAX << 1),
+};
+
+enum ext4_pdo_lock_opc {
+	/* external */
+	EXT4_HLOCK_READDIR	= (EXT4_LB_DE_RO | EXT4_LB_DX_RO),
+	EXT4_HLOCK_LOOKUP	= (EXT4_LB_DE_RO | EXT4_LB_SPIN_RO |
+				   EXT4_LB_EXACT),
+	EXT4_HLOCK_DEL		= (EXT4_LB_DE | EXT4_LB_SPIN_RO |
+				   EXT4_LB_EXACT),
+	EXT4_HLOCK_ADD		= (EXT4_LB_DE | EXT4_LB_SPIN_RO),
+
+	/* internal */
+	EXT4_HLOCK_LOOKUP_SAFE	= (EXT4_LB_DE_RO | EXT4_LB_DX_RO |
+				   EXT4_LB_EXACT),
+	EXT4_HLOCK_DEL_SAFE	= (EXT4_LB_DE | EXT4_LB_DX_RO | EXT4_LB_EXACT),
+	EXT4_HLOCK_SPLIT	= (EXT4_LB_DE | EXT4_LB_DX | EXT4_LB_SPIN),
+};
+
+extern struct htree_lock_head *ext4_htree_lock_head_alloc(unsigned hbits);
+#define ext4_htree_lock_head_free(lhead)	htree_lock_head_free(lhead)
+
+extern struct htree_lock *ext4_htree_lock_alloc(void);
+#define ext4_htree_lock_free(lck)		htree_lock_free(lck)
+
+extern void ext4_htree_lock(struct htree_lock *lck,
+			    struct htree_lock_head *lhead,
+			    struct inode *dir, unsigned flags);
+#define ext4_htree_unlock(lck)			htree_unlock(lck)
+
+extern struct buffer_head *ext4_find_entry_locked(struct inode *dir,
+					const struct qstr *d_name,
+					struct ext4_dir_entry_2 **res_dir,
+					int *inlined, struct htree_lock *lck);
+extern int ext4_add_entry_locked(handle_t *handle, struct dentry *dentry,
+				 struct inode *inode, struct htree_lock *lck);
+
 struct ext4_filename {
 	const struct qstr *usr_fname;
 	struct fscrypt_str disk_name;
@@ -2479,8 +2550,17 @@ void ext4_insert_dentry(struct inode *inode,
 			struct ext4_filename *fname, void *data);
 static inline void ext4_update_dx_flag(struct inode *inode)
 {
+	/* Disable this for ldiskfs, because converting a DX directory to
+	 * a non-DX directory while it is in use would completely break
+	 * the htree locking.
+	 * If we really want to support this operation in the future,
+	 * we would need to exclusively lock the directory here, which
+	 * would increase the complexity of the code.
+	 */
+#if 0
 	if (!ext4_has_feature_dir_index(inode->i_sb))
 		ext4_clear_inode_flag(inode, EXT4_INODE_INDEX);
+#endif
 }
 static const unsigned char ext4_filetype_table[] = {
 	DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
diff --git a/fs/ext4/htree_lock.c b/fs/ext4/htree_lock.c
new file mode 100644
index 0000000..9c1e8f0
--- /dev/null
+++ b/fs/ext4/htree_lock.c
@@ -0,0 +1,891 @@
+/*
+ * fs/ext4/htree_lock.c
+ *
+ * Copyright (c) 2011, 2012, Intel Corporation.
+ *
+ * Author: Liang Zhen <liang@whamcloud.com>
+ */
+#include <linux/jbd2.h>
+#include <linux/hash.h>
+#include <linux/module.h>
+#include "htree_lock.h"
+
+enum {
+	HTREE_LOCK_BIT_EX	= (1 << HTREE_LOCK_EX),
+	HTREE_LOCK_BIT_PW	= (1 << HTREE_LOCK_PW),
+	HTREE_LOCK_BIT_PR	= (1 << HTREE_LOCK_PR),
+	HTREE_LOCK_BIT_CW	= (1 << HTREE_LOCK_CW),
+	HTREE_LOCK_BIT_CR	= (1 << HTREE_LOCK_CR),
+};
+
+enum {
+	HTREE_LOCK_COMPAT_EX	= 0,
+	HTREE_LOCK_COMPAT_PW	= HTREE_LOCK_COMPAT_EX | HTREE_LOCK_BIT_CR,
+	HTREE_LOCK_COMPAT_PR	= HTREE_LOCK_COMPAT_PW | HTREE_LOCK_BIT_PR,
+	HTREE_LOCK_COMPAT_CW	= HTREE_LOCK_COMPAT_PW | HTREE_LOCK_BIT_CW,
+	HTREE_LOCK_COMPAT_CR	= HTREE_LOCK_COMPAT_CW | HTREE_LOCK_BIT_PR |
+				  HTREE_LOCK_BIT_PW,
+};
+
+static int htree_lock_compat[] = {
+	[HTREE_LOCK_EX]		HTREE_LOCK_COMPAT_EX,
+	[HTREE_LOCK_PW]		HTREE_LOCK_COMPAT_PW,
+	[HTREE_LOCK_PR]		HTREE_LOCK_COMPAT_PR,
+	[HTREE_LOCK_CW]		HTREE_LOCK_COMPAT_CW,
+	[HTREE_LOCK_CR]		HTREE_LOCK_COMPAT_CR,
+};
+
+/* Maximum allowed htree-lock depth.
+ * ext4 only needs depth=3, although users may request a higher value. */
+#define HTREE_LOCK_DEP_MAX	16
+
+#ifdef HTREE_LOCK_DEBUG
+
+static char *hl_name[] = {
+	[HTREE_LOCK_EX]		"EX",
+	[HTREE_LOCK_PW]		"PW",
+	[HTREE_LOCK_PR]		"PR",
+	[HTREE_LOCK_CW]		"CW",
+	[HTREE_LOCK_CR]		"CR",
+};
+
+/* lock stats */
+struct htree_lock_node_stats {
+	unsigned long long	blocked[HTREE_LOCK_MAX];
+	unsigned long long	granted[HTREE_LOCK_MAX];
+	unsigned long long	retried[HTREE_LOCK_MAX];
+	unsigned long long	events;
+};
+
+struct htree_lock_stats {
+	struct htree_lock_node_stats	nodes[HTREE_LOCK_DEP_MAX];
+	unsigned long long	granted[HTREE_LOCK_MAX];
+	unsigned long long	blocked[HTREE_LOCK_MAX];
+};
+
+static struct htree_lock_stats hl_stats;
+
+void htree_lock_stat_reset(void)
+{
+	memset(&hl_stats, 0, sizeof(hl_stats));
+}
+
+void htree_lock_stat_print(int depth)
+{
+	int     i;
+	int	j;
+
+	printk(KERN_DEBUG "HTREE LOCK STATS:\n");
+	for (i = 0; i < HTREE_LOCK_MAX; i++) {
+		printk(KERN_DEBUG "[%s]: G [%10llu], B [%10llu]\n",
+		       hl_name[i], hl_stats.granted[i], hl_stats.blocked[i]);
+	}
+	for (i = 0; i < depth; i++) {
+		printk(KERN_DEBUG "HTREE CHILD [%d] STATS:\n", i);
+		for (j = 0; j < HTREE_LOCK_MAX; j++) {
+			printk(KERN_DEBUG
+				"[%s]: G [%10llu], B [%10llu], R [%10llu]\n",
+				hl_name[j], hl_stats.nodes[i].granted[j],
+				hl_stats.nodes[i].blocked[j],
+				hl_stats.nodes[i].retried[j]);
+		}
+	}
+}
+
+#define lk_grant_inc(m)       do { hl_stats.granted[m]++; } while (0)
+#define lk_block_inc(m)       do { hl_stats.blocked[m]++; } while (0)
+#define ln_grant_inc(d, m)    do { hl_stats.nodes[d].granted[m]++; } while (0)
+#define ln_block_inc(d, m)    do { hl_stats.nodes[d].blocked[m]++; } while (0)
+#define ln_retry_inc(d, m)    do { hl_stats.nodes[d].retried[m]++; } while (0)
+#define ln_event_inc(d)       do { hl_stats.nodes[d].events++; } while (0)
+
+#else /* !HTREE_LOCK_DEBUG */
+
+void htree_lock_stat_reset(void) {}
+void htree_lock_stat_print(int depth) {}
+
+#define lk_grant_inc(m)	      do {} while (0)
+#define lk_block_inc(m)	      do {} while (0)
+#define ln_grant_inc(d, m)    do {} while (0)
+#define ln_block_inc(d, m)    do {} while (0)
+#define ln_retry_inc(d, m)    do {} while (0)
+#define ln_event_inc(d)	      do {} while (0)
+
+#endif /* HTREE_LOCK_DEBUG */
+
+EXPORT_SYMBOL(htree_lock_stat_reset);
+EXPORT_SYMBOL(htree_lock_stat_print);
+
+#define HTREE_DEP_ROOT		  (-1)
+
+#define htree_spin_lock(lhead, dep)				\
+	bit_spin_lock((dep) + 1, &(lhead)->lh_lock)
+#define htree_spin_unlock(lhead, dep)				\
+	bit_spin_unlock((dep) + 1, &(lhead)->lh_lock)
+
+#define htree_key_event_ignore(child, ln)			\
+	(!((child)->lc_events & (1 << (ln)->ln_mode)))
+
+static int
+htree_key_list_empty(struct htree_lock_node *ln)
+{
+	return list_empty(&ln->ln_major_list) && list_empty(&ln->ln_minor_list);
+}
+
+static void
+htree_key_list_del_init(struct htree_lock_node *ln)
+{
+	struct htree_lock_node *tmp = NULL;
+
+	if (!list_empty(&ln->ln_minor_list)) {
+		tmp = list_entry(ln->ln_minor_list.next,
+				 struct htree_lock_node, ln_minor_list);
+		list_del_init(&ln->ln_minor_list);
+	}
+
+	if (list_empty(&ln->ln_major_list))
+		return;
+
+	if (tmp == NULL) { /* not on minor key list */
+		list_del_init(&ln->ln_major_list);
+	} else {
+		BUG_ON(!list_empty(&tmp->ln_major_list));
+		list_replace_init(&ln->ln_major_list, &tmp->ln_major_list);
+	}
+}
+
+static void
+htree_key_list_replace_init(struct htree_lock_node *old,
+			    struct htree_lock_node *new)
+{
+	if (!list_empty(&old->ln_major_list))
+		list_replace_init(&old->ln_major_list, &new->ln_major_list);
+
+	if (!list_empty(&old->ln_minor_list))
+		list_replace_init(&old->ln_minor_list, &new->ln_minor_list);
+}
+
+static void
+htree_key_event_enqueue(struct htree_lock_child *child,
+			struct htree_lock_node *ln, int dep, void *event)
+{
+	struct htree_lock_node *tmp;
+
+	/* NB: ALWAYS called holding lhead::lh_lock(dep) */
+	BUG_ON(ln->ln_mode == HTREE_LOCK_NL);
+	if (event == NULL || htree_key_event_ignore(child, ln))
+		return;
+
+	/* shouldn't be a very long list */
+	list_for_each_entry(tmp, &ln->ln_alive_list, ln_alive_list) {
+		if (tmp->ln_mode == HTREE_LOCK_NL) {
+			ln_event_inc(dep);
+			if (child->lc_callback != NULL)
+				child->lc_callback(tmp->ln_ev_target, event);
+		}
+	}
+}
+
+static int
+htree_node_lock_enqueue(struct htree_lock *newlk, struct htree_lock *curlk,
+			unsigned dep, int wait, void *event)
+{
+	struct htree_lock_child *child = &newlk->lk_head->lh_children[dep];
+	struct htree_lock_node *newln = &newlk->lk_nodes[dep];
+	struct htree_lock_node *curln = &curlk->lk_nodes[dep];
+
+	/* NB: ALWAYS called holding lhead::lh_lock(dep) */
+	/* NB: we only expect PR/PW lock modes here; only these two modes are
+	 * allowed for htree_node_lock (asserted in htree_node_lock_internal).
+	 * NL is only used for listeners; users can't directly request NL mode. */
+	if ((curln->ln_mode == HTREE_LOCK_NL) ||
+	    (curln->ln_mode != HTREE_LOCK_PW &&
+	     newln->ln_mode != HTREE_LOCK_PW)) {
+		/* no conflict, attach it on granted list of @curlk */
+		if (curln->ln_mode != HTREE_LOCK_NL) {
+			list_add(&newln->ln_granted_list,
+				 &curln->ln_granted_list);
+		} else {
+			/* replace key owner */
+			htree_key_list_replace_init(curln, newln);
+		}
+
+		list_add(&newln->ln_alive_list, &curln->ln_alive_list);
+		htree_key_event_enqueue(child, newln, dep, event);
+		ln_grant_inc(dep, newln->ln_mode);
+		return 1; /* still hold lh_lock */
+	}
+
+	if (!wait) { /* can't grant and don't want to wait */
+		ln_retry_inc(dep, newln->ln_mode);
+		newln->ln_mode = HTREE_LOCK_INVAL;
+		return -1; /* don't wait and just return -1 */
+	}
+
+	newlk->lk_task = current;
+	set_current_state(TASK_UNINTERRUPTIBLE);
+	/* conflict, attach it on blocked list of curlk */
+	list_add_tail(&newln->ln_blocked_list, &curln->ln_blocked_list);
+	list_add(&newln->ln_alive_list, &curln->ln_alive_list);
+	ln_block_inc(dep, newln->ln_mode);
+
+	htree_spin_unlock(newlk->lk_head, dep);
+	/* wait to be given the lock */
+	if (newlk->lk_task != NULL)
+		schedule();
+	/* granted, no doubt; the wake-up set us RUNNING again */
+	if (event == NULL || htree_key_event_ignore(child, newln))
+		return 0; /* granted without lh_lock */
+
+	htree_spin_lock(newlk->lk_head, dep);
+	htree_key_event_enqueue(child, newln, dep, event);
+	return 1; /* still hold lh_lock */
+}
+
+/*
+ * get PR/PW access to a particular tree node according to @dep and @key;
+ * it returns -1 if @wait is false and the lock can't be granted immediately.
+ * All listeners (HTREE_LOCK_NL) on @dep with the same @key will receive
+ * @event if it's not NULL.
+ * NB: ALWAYS called holding lhead::lh_lock
+ */
+static int
+htree_node_lock_internal(struct htree_lock_head *lhead, struct htree_lock *lck,
+			 htree_lock_mode_t mode, u32 key, unsigned dep,
+			 int wait, void *event)
+{
+	LIST_HEAD(list);
+	struct htree_lock	*tmp;
+	struct htree_lock	*tmp2;
+	u16			major;
+	u16			minor;
+	u8			reverse;
+	u8			ma_bits;
+	u8			mi_bits;
+
+	BUG_ON(mode != HTREE_LOCK_PW && mode != HTREE_LOCK_PR);
+	BUG_ON(htree_node_is_granted(lck, dep));
+
+	key = hash_long(key, lhead->lh_hbits);
+
+	mi_bits = lhead->lh_hbits >> 1;
+	ma_bits = lhead->lh_hbits - mi_bits;
+
+	lck->lk_nodes[dep].ln_major_key = major = key & ((1U << ma_bits) - 1);
+	lck->lk_nodes[dep].ln_minor_key = minor = key >> ma_bits;
+	lck->lk_nodes[dep].ln_mode = mode;
+
+	/*
+	 * The major key list is an ordered list, so searches are started
+	 * at the end of the list that is numerically closer to major_key,
+	 * so at most half of the list will be walked (for well-distributed
+	 * keys). The list traversal aborts early if the expected key
+	 * location is passed.
+	 */
+	reverse = (major >= (1 << (ma_bits - 1)));
+
+	if (reverse) {
+		list_for_each_entry_reverse(tmp,
+					&lhead->lh_children[dep].lc_list,
+					lk_nodes[dep].ln_major_list) {
+			if (tmp->lk_nodes[dep].ln_major_key == major) {
+				goto search_minor;
+
+			} else if (tmp->lk_nodes[dep].ln_major_key < major) {
+				/* attach _after_ @tmp */
+				list_add(&lck->lk_nodes[dep].ln_major_list,
+					 &tmp->lk_nodes[dep].ln_major_list);
+				goto out_grant_major;
+			}
+		}
+
+		list_add(&lck->lk_nodes[dep].ln_major_list,
+			 &lhead->lh_children[dep].lc_list);
+		goto out_grant_major;
+
+	} else {
+		list_for_each_entry(tmp, &lhead->lh_children[dep].lc_list,
+				    lk_nodes[dep].ln_major_list) {
+			if (tmp->lk_nodes[dep].ln_major_key == major) {
+				goto search_minor;
+
+			} else if (tmp->lk_nodes[dep].ln_major_key > major) {
+				/* insert _before_ @tmp */
+				list_add_tail(&lck->lk_nodes[dep].ln_major_list,
+					&tmp->lk_nodes[dep].ln_major_list);
+				goto out_grant_major;
+			}
+		}
+
+		list_add_tail(&lck->lk_nodes[dep].ln_major_list,
+			      &lhead->lh_children[dep].lc_list);
+		goto out_grant_major;
+	}
+
+ search_minor:
+	/*
+	 * NB: the minor_key list doesn't have a "head"; @list is just a
+	 * temporary stub to help the list search, and must be removed
+	 * once the search is done.
+	 * minor_key list is an ordered list too.
+	 */
+	list_add_tail(&list, &tmp->lk_nodes[dep].ln_minor_list);
+
+	reverse = (minor >= (1 << (mi_bits - 1)));
+
+	if (reverse) {
+		list_for_each_entry_reverse(tmp2, &list,
+					    lk_nodes[dep].ln_minor_list) {
+			if (tmp2->lk_nodes[dep].ln_minor_key == minor) {
+				goto out_enqueue;
+
+			} else if (tmp2->lk_nodes[dep].ln_minor_key < minor) {
+				/* attach _after_ @tmp2 */
+				list_add(&lck->lk_nodes[dep].ln_minor_list,
+					 &tmp2->lk_nodes[dep].ln_minor_list);
+				goto out_grant_minor;
+			}
+		}
+
+		list_add(&lck->lk_nodes[dep].ln_minor_list, &list);
+
+	} else {
+		list_for_each_entry(tmp2, &list,
+				    lk_nodes[dep].ln_minor_list) {
+			if (tmp2->lk_nodes[dep].ln_minor_key == minor) {
+				goto out_enqueue;
+
+			} else if (tmp2->lk_nodes[dep].ln_minor_key > minor) {
+				/* insert _before_ @tmp2 */
+				list_add_tail(&lck->lk_nodes[dep].ln_minor_list,
+					&tmp2->lk_nodes[dep].ln_minor_list);
+				goto out_grant_minor;
+			}
+		}
+
+		list_add_tail(&lck->lk_nodes[dep].ln_minor_list, &list);
+	}
+
+ out_grant_minor:
+	if (list.next == &lck->lk_nodes[dep].ln_minor_list) {
+		/* new lock @lck is the first one on minor_key list, which
+		 * means it has the smallest minor_key and it should
+		 * replace @tmp as minor_key owner */
+		list_replace_init(&tmp->lk_nodes[dep].ln_major_list,
+				  &lck->lk_nodes[dep].ln_major_list);
+	}
+	/* remove the temporary head */
+	list_del(&list);
+
+ out_grant_major:
+	ln_grant_inc(dep, lck->lk_nodes[dep].ln_mode);
+	return 1; /* granted with holding lh_lock */
+
+ out_enqueue:
+	list_del(&list); /* remove temporary head */
+	return htree_node_lock_enqueue(lck, tmp2, dep, wait, event);
+}
+
+/*
+ * release the key of @lck at level @dep, and grant any blocked locks.
+ * the caller will still listen on @key if @event is not NULL, which means
+ * the caller can see an event (via event_cb) when any lock with the same
+ * key at level @dep is granted.
+ * NB: ALWAYS called holding lhead::lh_lock
+ * NB: listener will not block anyone because listening mode is HTREE_LOCK_NL
+ */
+static void
+htree_node_unlock_internal(struct htree_lock_head *lhead,
+			   struct htree_lock *curlk, unsigned dep, void *event)
+{
+	struct htree_lock_node	*curln = &curlk->lk_nodes[dep];
+	struct htree_lock	*grtlk = NULL;
+	struct htree_lock_node	*grtln;
+	struct htree_lock	*poslk;
+	struct htree_lock	*tmplk;
+
+	if (!htree_node_is_granted(curlk, dep))
+		return;
+
+	if (!list_empty(&curln->ln_granted_list)) {
+		/* there is another granted lock */
+		grtlk = list_entry(curln->ln_granted_list.next,
+				   struct htree_lock,
+				   lk_nodes[dep].ln_granted_list);
+		list_del_init(&curln->ln_granted_list);
+	}
+
+	if (grtlk == NULL && !list_empty(&curln->ln_blocked_list)) {
+		/*
+		 * @curlk is the only granted lock, so we confirmed:
+		 * a) curln is key owner (attached on major/minor_list),
+		 *    so if there is any blocked lock, it should be attached
+		 *    on curln->ln_blocked_list
+		 * b) we always can grant the first blocked lock
+		 */
+		grtlk = list_entry(curln->ln_blocked_list.next,
+				   struct htree_lock,
+				   lk_nodes[dep].ln_blocked_list);
+		BUG_ON(grtlk->lk_task == NULL);
+		wake_up_process(grtlk->lk_task);
+	}
+
+	if (event != NULL &&
+	    lhead->lh_children[dep].lc_events != HTREE_EVENT_DISABLE) {
+		curln->ln_ev_target = event;
+		curln->ln_mode = HTREE_LOCK_NL; /* listen! */
+	} else {
+		curln->ln_mode = HTREE_LOCK_INVAL;
+	}
+
+	if (grtlk == NULL) { /* I must be the only one locking this key */
+		struct htree_lock_node *tmpln;
+
+		BUG_ON(htree_key_list_empty(curln));
+
+		if (curln->ln_mode == HTREE_LOCK_NL) /* listening */
+			return;
+
+		/* not listening */
+		if (list_empty(&curln->ln_alive_list)) { /* no more listener */
+			htree_key_list_del_init(curln);
+			return;
+		}
+
+		tmpln = list_entry(curln->ln_alive_list.next,
+				   struct htree_lock_node, ln_alive_list);
+
+		BUG_ON(tmpln->ln_mode != HTREE_LOCK_NL);
+
+		htree_key_list_replace_init(curln, tmpln);
+		list_del_init(&curln->ln_alive_list);
+
+		return;
+	}
+
+	/* have a granted lock */
+	grtln = &grtlk->lk_nodes[dep];
+	if (!list_empty(&curln->ln_blocked_list)) {
+		/* only key owner can be on both lists */
+		BUG_ON(htree_key_list_empty(curln));
+
+		if (list_empty(&grtln->ln_blocked_list)) {
+			list_add(&grtln->ln_blocked_list,
+				 &curln->ln_blocked_list);
+		}
+		list_del_init(&curln->ln_blocked_list);
+	}
+	/*
+	 * NB: this is the tricky part:
+	 * We have only two modes for child-lock (PR and PW), also,
+	 * only owner of the key (attached on major/minor_list) can be on
+	 * both blocked_list and granted_list, so @grtlk must be one
+	 * of these two cases:
+	 *
+	 * a) @grtlk is taken from granted_list, which means we've granted
+	 *    more than one lock so @grtlk has to be PR, the first blocked
+	 *    lock must be PW and we can't grant it at all.
+	 *    So even if @grtlk is not the owner of the key (empty
+	 *    blocked_list), we don't care because we can't grant any lock.
+	 * b) we just grant a new lock which is taken from head of blocked
+	 *    list, and it should be the first granted lock, and it should
+	 *    be the first one linked on blocked_list.
+	 *
+	 * Either way, we get the correct result by iterating over the
+	 * blocked_list of @grtlk, and don't have to bother finding the
+	 * owner of the current key.
+	 */
+	list_for_each_entry_safe(poslk, tmplk, &grtln->ln_blocked_list,
+				 lk_nodes[dep].ln_blocked_list) {
+		if (grtlk->lk_nodes[dep].ln_mode == HTREE_LOCK_PW ||
+		    poslk->lk_nodes[dep].ln_mode == HTREE_LOCK_PW)
+			break;
+		/* grant all readers */
+		list_del_init(&poslk->lk_nodes[dep].ln_blocked_list);
+		list_add(&poslk->lk_nodes[dep].ln_granted_list,
+			 &grtln->ln_granted_list);
+
+		BUG_ON(poslk->lk_task == NULL);
+		wake_up_process(poslk->lk_task);
+	}
+
+	/* if @curln is the owner of this key, replace it with @grtln */
+	if (!htree_key_list_empty(curln))
+		htree_key_list_replace_init(curln, grtln);
+
+	if (curln->ln_mode == HTREE_LOCK_INVAL)
+		list_del_init(&curln->ln_alive_list);
+}
+
+/*
+ * Wrapper of htree_node_lock_internal: it returns 1 when the lock is granted,
+ * and 0 only if @wait is false and the lock can't be granted immediately
+ */
+int
+htree_node_lock_try(struct htree_lock *lck, htree_lock_mode_t mode,
+		    u32 key, unsigned dep, int wait, void *event)
+{
+	struct htree_lock_head *lhead = lck->lk_head;
+	int rc;
+
+	BUG_ON(dep >= lck->lk_depth);
+	BUG_ON(lck->lk_mode == HTREE_LOCK_INVAL);
+
+	htree_spin_lock(lhead, dep);
+	rc = htree_node_lock_internal(lhead, lck, mode, key, dep, wait, event);
+	if (rc != 0)
+		htree_spin_unlock(lhead, dep);
+	return rc >= 0;
+}
+EXPORT_SYMBOL(htree_node_lock_try);
+
+/* Wrapper of htree_node_unlock_internal */
+void
+htree_node_unlock(struct htree_lock *lck, unsigned dep, void *event)
+{
+	struct htree_lock_head *lhead = lck->lk_head;
+
+	BUG_ON(dep >= lck->lk_depth);
+	BUG_ON(lck->lk_mode == HTREE_LOCK_INVAL);
+
+	htree_spin_lock(lhead, dep);
+	htree_node_unlock_internal(lhead, lck, dep, event);
+	htree_spin_unlock(lhead, dep);
+}
+EXPORT_SYMBOL(htree_node_unlock);
+
+/* stop listening on child-lock level @dep */
+void
+htree_node_stop_listen(struct htree_lock *lck, unsigned dep)
+{
+	struct htree_lock_node *ln = &lck->lk_nodes[dep];
+	struct htree_lock_node *tmp;
+
+	BUG_ON(htree_node_is_granted(lck, dep));
+	BUG_ON(!list_empty(&ln->ln_blocked_list));
+	BUG_ON(!list_empty(&ln->ln_granted_list));
+
+	if (!htree_node_is_listening(lck, dep))
+		return;
+
+	htree_spin_lock(lck->lk_head, dep);
+	ln->ln_mode = HTREE_LOCK_INVAL;
+	ln->ln_ev_target = NULL;
+
+	if (htree_key_list_empty(ln)) { /* not owner */
+		list_del_init(&ln->ln_alive_list);
+		goto out;
+	}
+
+	/* I'm the owner... */
+	if (list_empty(&ln->ln_alive_list)) { /* no more listener */
+		htree_key_list_del_init(ln);
+		goto out;
+	}
+
+	tmp = list_entry(ln->ln_alive_list.next,
+			 struct htree_lock_node, ln_alive_list);
+
+	BUG_ON(tmp->ln_mode != HTREE_LOCK_NL);
+	htree_key_list_replace_init(ln, tmp);
+	list_del_init(&ln->ln_alive_list);
+ out:
+	htree_spin_unlock(lck->lk_head, dep);
+}
+EXPORT_SYMBOL(htree_node_stop_listen);
+
+/* release all child-locks if we have any */
+static void
+htree_node_release_all(struct htree_lock *lck)
+{
+	int	i;
+
+	for (i = 0; i < lck->lk_depth; i++) {
+		if (htree_node_is_granted(lck, i))
+			htree_node_unlock(lck, i, NULL);
+		else if (htree_node_is_listening(lck, i))
+			htree_node_stop_listen(lck, i);
+	}
+}
+
+/*
+ * obtain the htree lock; the caller may block inside if there is a
+ * conflict with any granted or blocked lock and @wait is true.
+ * NB: ALWAYS called holding lhead::lh_lock
+ */
+static int
+htree_lock_internal(struct htree_lock *lck, int wait)
+{
+	struct htree_lock_head *lhead = lck->lk_head;
+	int	granted = 0;
+	int	blocked = 0;
+	int	i;
+
+	for (i = 0; i < HTREE_LOCK_MAX; i++) {
+		if (lhead->lh_ngranted[i] != 0)
+			granted |= 1 << i;
+		if (lhead->lh_nblocked[i] != 0)
+			blocked |= 1 << i;
+	}
+	if ((htree_lock_compat[lck->lk_mode] & granted) != granted ||
+	    (htree_lock_compat[lck->lk_mode] & blocked) != blocked) {
+		/* block the current lock even if it only conflicts with
+		 * another blocked lock, so that locks like EX can't starve */
+		if (!wait)
+			return -1;
+		lhead->lh_nblocked[lck->lk_mode]++;
+		lk_block_inc(lck->lk_mode);
+
+		lck->lk_task = current;
+		list_add_tail(&lck->lk_blocked_list, &lhead->lh_blocked_list);
+
+retry:
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		htree_spin_unlock(lhead, HTREE_DEP_ROOT);
+		/* wait to be given the lock */
+		if (lck->lk_task != NULL)
+			schedule();
+		/* granted, no doubt; the wake-up set us RUNNING again.
+		 * Since the thread may be woken up spuriously, we need
+		 * to check again whether the lock was really granted. */
+		if (!list_empty(&lck->lk_blocked_list)) {
+			htree_spin_lock(lhead, HTREE_DEP_ROOT);
+			if (list_empty(&lck->lk_blocked_list)) {
+				htree_spin_unlock(lhead, HTREE_DEP_ROOT);
+				return 0;
+			}
+			goto retry;
+		}
+		return 0; /* without lh_lock */
+	}
+	lhead->lh_ngranted[lck->lk_mode]++;
+	lk_grant_inc(lck->lk_mode);
+	return 1;
+}
+
+/* release htree lock. NB: ALWAYS called holding lhead::lh_lock */
+static void
+htree_unlock_internal(struct htree_lock *lck)
+{
+	struct htree_lock_head *lhead = lck->lk_head;
+	struct htree_lock *tmp;
+	struct htree_lock *tmp2;
+	int granted = 0;
+	int i;
+
+	BUG_ON(lhead->lh_ngranted[lck->lk_mode] == 0);
+
+	lhead->lh_ngranted[lck->lk_mode]--;
+	lck->lk_mode = HTREE_LOCK_INVAL;
+
+	for (i = 0; i < HTREE_LOCK_MAX; i++) {
+		if (lhead->lh_ngranted[i] != 0)
+			granted |= 1 << i;
+	}
+	list_for_each_entry_safe(tmp, tmp2,
+				 &lhead->lh_blocked_list, lk_blocked_list) {
+		/* conflict with any granted lock? */
+		if ((htree_lock_compat[tmp->lk_mode] & granted) != granted)
+			break;
+
+		list_del_init(&tmp->lk_blocked_list);
+
+		BUG_ON(lhead->lh_nblocked[tmp->lk_mode] == 0);
+
+		lhead->lh_nblocked[tmp->lk_mode]--;
+		lhead->lh_ngranted[tmp->lk_mode]++;
+		granted |= 1 << tmp->lk_mode;
+
+		BUG_ON(tmp->lk_task == NULL);
+		wake_up_process(tmp->lk_task);
+	}
+}
+
+/* Wrapper of htree_lock_internal and the exported interface.
+ * It always returns 1 with the lock granted if @wait is true; it can return
+ * 0 if @wait is false and the locking request can't be granted immediately */
+int
+htree_lock_try(struct htree_lock *lck, struct htree_lock_head *lhead,
+	       htree_lock_mode_t mode, int wait)
+{
+	int	rc;
+
+	BUG_ON(lck->lk_depth > lhead->lh_depth);
+	BUG_ON(lck->lk_head != NULL);
+	BUG_ON(lck->lk_task != NULL);
+
+	lck->lk_head = lhead;
+	lck->lk_mode = mode;
+
+	htree_spin_lock(lhead, HTREE_DEP_ROOT);
+	rc = htree_lock_internal(lck, wait);
+	if (rc != 0)
+		htree_spin_unlock(lhead, HTREE_DEP_ROOT);
+	return rc >= 0;
+}
+EXPORT_SYMBOL(htree_lock_try);
+
+/* Wrapper of htree_unlock_internal and the exported interface.
+ * It releases all htree_node_locks and then the htree_lock itself */
+void
+htree_unlock(struct htree_lock *lck)
+{
+	BUG_ON(lck->lk_head == NULL);
+	BUG_ON(lck->lk_mode == HTREE_LOCK_INVAL);
+
+	htree_node_release_all(lck);
+
+	htree_spin_lock(lck->lk_head, HTREE_DEP_ROOT);
+	htree_unlock_internal(lck);
+	htree_spin_unlock(lck->lk_head, HTREE_DEP_ROOT);
+	lck->lk_head = NULL;
+	lck->lk_task = NULL;
+}
+EXPORT_SYMBOL(htree_unlock);
+
+/* change lock mode */
+void
+htree_change_mode(struct htree_lock *lck, htree_lock_mode_t mode)
+{
+	BUG_ON(lck->lk_mode == HTREE_LOCK_INVAL);
+	lck->lk_mode = mode;
+}
+EXPORT_SYMBOL(htree_change_mode);
+
+/* release the htree lock, and lock it again with a new mode.
+ * This function first releases all htree_node_locks and the htree_lock,
+ * then tries to reacquire the htree_lock with the new @mode.
+ * It always returns 1 with the lock granted if @wait is true; it can return
+ * 0 if @wait is false and the locking request can't be granted immediately */
+int
+htree_change_lock_try(struct htree_lock *lck, htree_lock_mode_t mode, int wait)
+{
+	struct htree_lock_head *lhead = lck->lk_head;
+	int rc;
+
+	BUG_ON(lhead == NULL);
+	BUG_ON(lck->lk_mode == mode);
+	BUG_ON(lck->lk_mode == HTREE_LOCK_INVAL || mode == HTREE_LOCK_INVAL);
+
+	htree_node_release_all(lck);
+
+	htree_spin_lock(lhead, HTREE_DEP_ROOT);
+	htree_unlock_internal(lck);
+	lck->lk_mode = mode;
+	rc = htree_lock_internal(lck, wait);
+	if (rc != 0)
+		htree_spin_unlock(lhead, HTREE_DEP_ROOT);
+	return rc >= 0;
+}
+EXPORT_SYMBOL(htree_change_lock_try);
+
+/* create an htree_lock head with @depth levels (number of child-locks);
+ * it is a per-resource structure */
+struct htree_lock_head *
+htree_lock_head_alloc(unsigned depth, unsigned hbits, unsigned priv)
+{
+	struct htree_lock_head *lhead;
+	int  i;
+
+	if (depth > HTREE_LOCK_DEP_MAX) {
+		printk(KERN_ERR "%d is larger than max htree_lock depth %d\n",
+			depth, HTREE_LOCK_DEP_MAX);
+		return NULL;
+	}
+
+	lhead = kzalloc(offsetof(struct htree_lock_head,
+				 lh_children[depth]) + priv, GFP_NOFS);
+	if (lhead == NULL)
+		return NULL;
+
+	if (hbits < HTREE_HBITS_MIN)
+		hbits = HTREE_HBITS_MIN;
+	else if (hbits > HTREE_HBITS_MAX)
+		hbits = HTREE_HBITS_MAX;
+	lhead->lh_hbits = hbits;
+
+	lhead->lh_lock = 0;
+	lhead->lh_depth = depth;
+	INIT_LIST_HEAD(&lhead->lh_blocked_list);
+	if (priv > 0) {
+		lhead->lh_private = (void *)lhead +
+			offsetof(struct htree_lock_head, lh_children[depth]);
+	}
+
+	for (i = 0; i < depth; i++) {
+		INIT_LIST_HEAD(&lhead->lh_children[i].lc_list);
+		lhead->lh_children[i].lc_events = HTREE_EVENT_DISABLE;
+	}
+	return lhead;
+}
+EXPORT_SYMBOL(htree_lock_head_alloc);
+
+/* free the htree_lock head */
+void
+htree_lock_head_free(struct htree_lock_head *lhead)
+{
+	int     i;
+
+	BUG_ON(!list_empty(&lhead->lh_blocked_list));
+	for (i = 0; i < lhead->lh_depth; i++)
+		BUG_ON(!list_empty(&lhead->lh_children[i].lc_list));
+	kfree(lhead);
+}
+EXPORT_SYMBOL(htree_lock_head_free);
+
+/* register event callback for @events of child-lock at level @dep */
+void
+htree_lock_event_attach(struct htree_lock_head *lhead, unsigned dep,
+			unsigned events, htree_event_cb_t callback)
+{
+	BUG_ON(lhead->lh_depth <= dep);
+	lhead->lh_children[dep].lc_events = events;
+	lhead->lh_children[dep].lc_callback = callback;
+}
+EXPORT_SYMBOL(htree_lock_event_attach);
+
+/* allocate an htree_lock, which is a per-thread structure; @pbytes reserves
+ * extra bytes of private data for the caller */
+struct htree_lock *
+htree_lock_alloc(unsigned depth, unsigned pbytes)
+{
+	struct htree_lock *lck;
+	int i = offsetof(struct htree_lock, lk_nodes[depth]);
+
+	if (depth > HTREE_LOCK_DEP_MAX) {
+		printk(KERN_ERR "%d is larger than max htree_lock depth %d\n",
+			depth, HTREE_LOCK_DEP_MAX);
+		return NULL;
+	}
+	lck = kzalloc(i + pbytes, GFP_NOFS);
+	if (lck == NULL)
+		return NULL;
+
+	if (pbytes != 0)
+		lck->lk_private = (void *)lck + i;
+	lck->lk_mode = HTREE_LOCK_INVAL;
+	lck->lk_depth = depth;
+	INIT_LIST_HEAD(&lck->lk_blocked_list);
+
+	for (i = 0; i < depth; i++) {
+		struct htree_lock_node *node = &lck->lk_nodes[i];
+
+		node->ln_mode = HTREE_LOCK_INVAL;
+		INIT_LIST_HEAD(&node->ln_major_list);
+		INIT_LIST_HEAD(&node->ln_minor_list);
+		INIT_LIST_HEAD(&node->ln_alive_list);
+		INIT_LIST_HEAD(&node->ln_blocked_list);
+		INIT_LIST_HEAD(&node->ln_granted_list);
+	}
+
+	return lck;
+}
+EXPORT_SYMBOL(htree_lock_alloc);
+
+/* free htree_lock node */
+void
+htree_lock_free(struct htree_lock *lck)
+{
+	BUG_ON(lck->lk_mode != HTREE_LOCK_INVAL);
+	kfree(lck);
+}
+EXPORT_SYMBOL(htree_lock_free);
diff --git a/fs/ext4/htree_lock.h b/fs/ext4/htree_lock.h
new file mode 100644
index 0000000..9dc7788
--- /dev/null
+++ b/fs/ext4/htree_lock.h
@@ -0,0 +1,187 @@
+/*
+ * include/linux/htree_lock.h
+ *
+ * Copyright (c) 2011, 2012, Intel Corporation.
+ *
+ * Author: Liang Zhen <liang@whamcloud.com>
+ */
+
+/*
+ * htree lock
+ *
+ * htree_lock is an advanced sleeping lock; it supports five lock modes
+ * (a concept taken from DLM).
+ *
+ * The most common use case is:
+ * - create a htree_lock_head for the data
+ * - each thread (contender) creates its own htree_lock
+ * - a contender calls htree_lock(lock_node, mode) to protect the data and
+ *   htree_unlock to release the lock
+ *
+ * There is also a more complex, advanced use case: a user can take a PW/PR
+ * lock on a particular key, mostly while already holding a shared lock on
+ * the htree (CW, CR)
+ *
+ * htree_lock(lock_node, HTREE_LOCK_CR); lock the htree with CR
+ * htree_node_lock(lock_node, HTREE_LOCK_PR, key...); lock @key with PR
+ * ...
+ * htree_node_unlock(lock_node); unlock the key
+ *
+ * Another tip: we can have N levels of such keys; all we need to do is
+ * specify N levels when creating the htree_lock_head, and then we can
+ * lock/unlock a specific level with:
+ * htree_node_lock(lock_node, mode1, key1, level1...);
+ * do something;
+ * htree_node_lock(lock_node, mode1, key2, level2...);
+ * do something;
+ * htree_node_unlock(lock_node, level2);
+ * htree_node_unlock(lock_node, level1);
+ *
+ * NB: with multiple levels, be careful about the locking order to avoid deadlock
+ */
+
+#ifndef _LINUX_HTREE_LOCK_H
+#define _LINUX_HTREE_LOCK_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+
+/*
+ * Lock Modes
+ * more details can be found here:
+ * http://en.wikipedia.org/wiki/Distributed_lock_manager
+ */
+typedef enum {
+	HTREE_LOCK_EX	= 0, /* exclusive lock: incompatible with all others */
+	HTREE_LOCK_PW,	     /* protected write: allows only CR users */
+	HTREE_LOCK_PR,	     /* protected read: allow PR, CR users */
+	HTREE_LOCK_CW,	     /* concurrent write: allow CR, CW users */
+	HTREE_LOCK_CR,	     /* concurrent read: allow all but EX users */
+	HTREE_LOCK_MAX,	     /* number of lock modes */
+} htree_lock_mode_t;
+
+#define HTREE_LOCK_NL		HTREE_LOCK_MAX
+#define HTREE_LOCK_INVAL	0xdead10c
+
+enum {
+	HTREE_HBITS_MIN		= 2,
+	HTREE_HBITS_DEF		= 14,
+	HTREE_HBITS_MAX		= 32,
+};
+
+enum {
+	HTREE_EVENT_DISABLE	= (0),
+	HTREE_EVENT_RD		= (1 << HTREE_LOCK_PR),
+	HTREE_EVENT_WR		= (1 << HTREE_LOCK_PW),
+	HTREE_EVENT_RDWR	= (HTREE_EVENT_RD | HTREE_EVENT_WR),
+};
+
+struct htree_lock;
+
+typedef void (*htree_event_cb_t)(void *target, void *event);
+
+struct htree_lock_child {
+	struct list_head	lc_list;	/* granted list */
+	htree_event_cb_t	lc_callback;	/* event callback */
+	unsigned		lc_events;	/* event types */
+};
+
+struct htree_lock_head {
+	unsigned long		lh_lock;	/* bits lock */
+	/* blocked lock list (htree_lock) */
+	struct list_head	lh_blocked_list;
+	/* # key levels */
+	u16			lh_depth;
+	/* hash bits for key and limit number of locks */
+	u16			lh_hbits;
+	/* counters for blocked locks */
+	u16			lh_nblocked[HTREE_LOCK_MAX];
+	/* counters for granted locks */
+	u16			lh_ngranted[HTREE_LOCK_MAX];
+	/* private data */
+	void			*lh_private;
+	/* array of children locks */
+	struct htree_lock_child	lh_children[0];
+};
+
+/* htree_lock_node_t is child-lock for a specific key (ln_value) */
+struct htree_lock_node {
+	htree_lock_mode_t	ln_mode;
+	/* major hash key */
+	u16			ln_major_key;
+	/* minor hash key */
+	u16			ln_minor_key;
+	struct list_head	ln_major_list;
+	struct list_head	ln_minor_list;
+	/* alive list, all locks (granted, blocked, listening) are on it */
+	struct list_head	ln_alive_list;
+	/* blocked list */
+	struct list_head	ln_blocked_list;
+	/* granted list */
+	struct list_head	ln_granted_list;
+	void			*ln_ev_target;
+};
+
+struct htree_lock {
+	struct task_struct	*lk_task;
+	struct htree_lock_head	*lk_head;
+	void			*lk_private;
+	unsigned		lk_depth;
+	htree_lock_mode_t	lk_mode;
+	struct list_head	lk_blocked_list;
+	struct htree_lock_node	lk_nodes[0];
+};
+
+/* create a lock head, which stands for a resource */
+struct htree_lock_head *htree_lock_head_alloc(unsigned depth,
+					      unsigned hbits, unsigned priv);
+/* free a lock head */
+void htree_lock_head_free(struct htree_lock_head *lhead);
+/* register event callback for child lock at level @depth */
+void htree_lock_event_attach(struct htree_lock_head *lhead, unsigned depth,
+			     unsigned events, htree_event_cb_t callback);
+/* create a lock handle, which stands for a thread */
+struct htree_lock *htree_lock_alloc(unsigned depth, unsigned pbytes);
+/* free a lock handle */
+void htree_lock_free(struct htree_lock *lck);
+/* lock htree; when @wait is false, 0 is returned if the lock can't
+ * be granted immediately */
+int htree_lock_try(struct htree_lock *lck, struct htree_lock_head *lhead,
+		   htree_lock_mode_t mode, int wait);
+/* unlock htree */
+void htree_unlock(struct htree_lock *lck);
+/* unlock and relock htree with @new_mode */
+int htree_change_lock_try(struct htree_lock *lck,
+			  htree_lock_mode_t new_mode, int wait);
+void htree_change_mode(struct htree_lock *lck, htree_lock_mode_t mode);
+/* acquire child lock (key) of htree at level @dep; @event will be sent
+ * to all listeners on this @key when the lock is granted */
+int htree_node_lock_try(struct htree_lock *lck, htree_lock_mode_t mode,
+			u32 key, unsigned dep, int wait, void *event);
+/* release child lock at level @dep; if @event isn't NULL this lock will
+ * keep listening on its key, and event_cb will be called against @lck when
+ * any other lock at level @dep with the same key is granted */
+void htree_node_unlock(struct htree_lock *lck, unsigned dep, void *event);
+/* stop listening on child lock at level @dep */
+void htree_node_stop_listen(struct htree_lock *lck, unsigned dep);
+/* for debug */
+void htree_lock_stat_print(int depth);
+void htree_lock_stat_reset(void);
+
+#define htree_lock(lck, lh, mode)	htree_lock_try(lck, lh, mode, 1)
+#define htree_change_lock(lck, mode)	htree_change_lock_try(lck, mode, 1)
+
+#define htree_lock_mode(lck)		((lck)->lk_mode)
+
+#define htree_node_lock(lck, mode, key, dep)	\
+	htree_node_lock_try(lck, mode, key, dep, 1, NULL)
+/* this is only safe in thread context of lock owner */
+#define htree_node_is_granted(lck, dep)		\
+	((lck)->lk_nodes[dep].ln_mode != HTREE_LOCK_INVAL && \
+	 (lck)->lk_nodes[dep].ln_mode != HTREE_LOCK_NL)
+/* this is only safe in thread context of lock owner */
+#define htree_node_is_listening(lck, dep)	\
+	((lck)->lk_nodes[dep].ln_mode == HTREE_LOCK_NL)
+
+#endif
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 9cb86e4..0153c4d 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -54,6 +54,7 @@ struct buffer_head *ext4_append(handle_t *handle,
 				struct inode *inode,
 				ext4_lblk_t *block)
 {
+	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct buffer_head *bh;
 	int err;
 
@@ -62,15 +63,24 @@ struct buffer_head *ext4_append(handle_t *handle,
 		      EXT4_SB(inode->i_sb)->s_max_dir_size_kb)))
 		return ERR_PTR(-ENOSPC);
 
+	/* with parallel dir operations all appends
+	 * have to be serialized -bzzz
+	 */
+	down(&ei->i_append_sem);
+
 	*block = inode->i_size >> inode->i_sb->s_blocksize_bits;
 
 	bh = ext4_bread(handle, inode, *block, EXT4_GET_BLOCKS_CREATE);
-	if (IS_ERR(bh))
+	if (IS_ERR(bh)) {
+		up(&ei->i_append_sem);
 		return bh;
+	}
+
 	inode->i_size += inode->i_sb->s_blocksize;
 	EXT4_I(inode)->i_disksize = inode->i_size;
 	BUFFER_TRACE(bh, "get_write_access");
 	err = ext4_journal_get_write_access(handle, bh);
+	up(&ei->i_append_sem);
 	if (err) {
 		brelse(bh);
 		ext4_std_error(inode->i_sb, err);
@@ -250,7 +260,8 @@ static unsigned int inline dx_root_limit(struct inode *dir,
 static struct dx_frame *dx_probe(struct ext4_filename *fname,
 				 struct inode *dir,
 				 struct dx_hash_info *hinfo,
-				 struct dx_frame *frame);
+				 struct dx_frame *frame,
+				 struct htree_lock *lck);
 static void dx_release(struct dx_frame *frames);
 static int dx_make_map(struct inode *dir, struct ext4_dir_entry_2 *de,
 		       unsigned blocksize, struct dx_hash_info *hinfo,
@@ -264,12 +275,13 @@ static void dx_insert_block(struct dx_frame *frame,
 static int ext4_htree_next_block(struct inode *dir, __u32 hash,
 				 struct dx_frame *frame,
 				 struct dx_frame *frames,
-				 __u32 *start_hash);
+				 __u32 *start_hash, struct htree_lock *lck);
 static struct buffer_head * ext4_dx_find_entry(struct inode *dir,
 		struct ext4_filename *fname,
-		struct ext4_dir_entry_2 **res_dir);
+		struct ext4_dir_entry_2 **res_dir, struct htree_lock *lck);
 static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
-			     struct inode *dir, struct inode *inode);
+			     struct inode *dir, struct inode *inode,
+			     struct htree_lock *lck);
 
 /* checksumming functions */
 void initialize_dirent_tail(struct ext4_dir_entry_tail *t,
@@ -734,6 +746,226 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
 }
 #endif /* DX_DEBUG */
 
+/* private data for htree_lock */
+struct ext4_dir_lock_data {
+	unsigned int		ld_flags;	/* bits-map for lock types */
+	unsigned int		ld_count;	/* # entries of the last DX block */
+	struct dx_entry		ld_at_entry;	/* copy of leaf dx_entry */
+	struct dx_entry		*ld_at;		/* position of leaf dx_entry */
+};
+
+#define ext4_htree_lock_data(l)	((struct ext4_dir_lock_data *)(l)->lk_private)
+#define ext4_find_entry(dir, name, dirent, inline) \
+			ext4_find_entry_locked(dir, name, dirent, inline, NULL)
+#define ext4_add_entry(handle, dentry, inode) \
+		       ext4_add_entry_locked(handle, dentry, inode, NULL)
+
+/* NB: ext4_lblk_t is 32 bits so we use high bits to identify invalid blk */
+#define EXT4_HTREE_NODE_CHANGED		(0xcafeULL << 32)
+
+static void ext4_htree_event_cb(void *target, void *event)
+{
+	u64 *block = (u64 *)target;
+
+	if (*block == dx_get_block((struct dx_entry *)event))
+		*block = EXT4_HTREE_NODE_CHANGED;
+}
+
+struct htree_lock_head *ext4_htree_lock_head_alloc(unsigned hbits)
+{
+	struct htree_lock_head *lhead;
+
+	lhead = htree_lock_head_alloc(EXT4_LK_MAX, hbits, 0);
+	if (lhead) {
+		htree_lock_event_attach(lhead, EXT4_LK_SPIN, HTREE_EVENT_WR,
+					ext4_htree_event_cb);
+	}
+	return lhead;
+}
+EXPORT_SYMBOL(ext4_htree_lock_head_alloc);
+
+struct htree_lock *ext4_htree_lock_alloc(void)
+{
+	return htree_lock_alloc(EXT4_LK_MAX,
+				sizeof(struct ext4_dir_lock_data));
+}
+EXPORT_SYMBOL(ext4_htree_lock_alloc);
+
+static htree_lock_mode_t ext4_htree_mode(unsigned flags)
+{
+	switch (flags) {
+	default: /* 0 or unknown flags require EX lock */
+		return HTREE_LOCK_EX;
+	case EXT4_HLOCK_READDIR:
+		return HTREE_LOCK_PR;
+	case EXT4_HLOCK_LOOKUP:
+		return HTREE_LOCK_CR;
+	case EXT4_HLOCK_DEL:
+	case EXT4_HLOCK_ADD:
+		return HTREE_LOCK_CW;
+	}
+}
+
+/* return PR for read-only operations, otherwise return EX */
+static inline htree_lock_mode_t ext4_htree_safe_mode(unsigned flags)
+{
+	int writer = (flags & EXT4_LB_DE) == EXT4_LB_DE;
+
+	/* 0 requires EX lock */
+	return (flags == 0 || writer) ? HTREE_LOCK_EX : HTREE_LOCK_PR;
+}
+
+static int ext4_htree_safe_locked(struct htree_lock *lck)
+{
+	int writer;
+
+	if (!lck || lck->lk_mode == HTREE_LOCK_EX)
+		return 1;
+
+	writer = (ext4_htree_lock_data(lck)->ld_flags & EXT4_LB_DE) ==
+		  EXT4_LB_DE;
+	if (writer) /* all readers & writers are excluded? */
+		return lck->lk_mode == HTREE_LOCK_EX;
+
+	/* all writers are excluded? */
+	return lck->lk_mode == HTREE_LOCK_PR ||
+	       lck->lk_mode == HTREE_LOCK_PW ||
+	       lck->lk_mode == HTREE_LOCK_EX;
+}
+
+/* relock htree_lock with EX mode if it's a change operation, otherwise
+ * relock it with PR mode. It's a noop if PDO is disabled.
+ */
+static void ext4_htree_safe_relock(struct htree_lock *lck)
+{
+	if (!ext4_htree_safe_locked(lck)) {
+		unsigned int flags = ext4_htree_lock_data(lck)->ld_flags;
+
+		htree_change_lock(lck, ext4_htree_safe_mode(flags));
+	}
+}
+
+void ext4_htree_lock(struct htree_lock *lck, struct htree_lock_head *lhead,
+		     struct inode *dir, unsigned flags)
+{
+	htree_lock_mode_t mode = is_dx(dir) ? ext4_htree_mode(flags) :
+					      ext4_htree_safe_mode(flags);
+
+	ext4_htree_lock_data(lck)->ld_flags = flags;
+	htree_lock(lck, lhead, mode);
+	if (!is_dx(dir))
+		ext4_htree_safe_relock(lck); /* make sure it's safe locked */
+}
+EXPORT_SYMBOL(ext4_htree_lock);
+
+static int ext4_htree_node_lock(struct htree_lock *lck, struct dx_entry *at,
+				unsigned int lmask, int wait, void *ev)
+{
+	u32 key = (at == NULL) ? 0 : dx_get_block(at);
+	u32 mode;
+
+	/* NOOP if htree is well protected or caller doesn't require the lock */
+	if (ext4_htree_safe_locked(lck) ||
+	    !(ext4_htree_lock_data(lck)->ld_flags & lmask))
+		return 1;
+
+	mode = (ext4_htree_lock_data(lck)->ld_flags & lmask) == lmask ?
+		HTREE_LOCK_PW : HTREE_LOCK_PR;
+	while (1) {
+		if (htree_node_lock_try(lck, mode, key, ffz(~lmask), wait, ev))
+			return 1;
+		if (!(lmask & EXT4_LB_SPIN)) /* not a spinlock */
+			return 0;
+		cpu_relax(); /* spin until granted */
+	}
+}
+
+static int ext4_htree_node_locked(struct htree_lock *lck, unsigned lmask)
+{
+	return ext4_htree_safe_locked(lck) ||
+	       htree_node_is_granted(lck, ffz(~lmask));
+}
+
+static void ext4_htree_node_unlock(struct htree_lock *lck,
+				   unsigned int lmask, void *buf)
+{
+	/* NB: it's safe to call this multiple times, even if it's not locked */
+	if (!ext4_htree_safe_locked(lck) &&
+	    htree_node_is_granted(lck, ffz(~lmask)))
+		htree_node_unlock(lck, ffz(~lmask), buf);
+}
+
+#define ext4_htree_dx_lock(lck, key)		\
+	ext4_htree_node_lock(lck, key, EXT4_LB_DX, 1, NULL)
+#define ext4_htree_dx_lock_try(lck, key)	\
+	ext4_htree_node_lock(lck, key, EXT4_LB_DX, 0, NULL)
+#define ext4_htree_dx_unlock(lck)		\
+	ext4_htree_node_unlock(lck, EXT4_LB_DX, NULL)
+#define ext4_htree_dx_locked(lck)		\
+	ext4_htree_node_locked(lck, EXT4_LB_DX)
+
+static void ext4_htree_dx_need_lock(struct htree_lock *lck)
+{
+	struct ext4_dir_lock_data *ld;
+
+	if (ext4_htree_safe_locked(lck))
+		return;
+
+	ld = ext4_htree_lock_data(lck);
+	switch (ld->ld_flags) {
+	default:
+		return;
+	case EXT4_HLOCK_LOOKUP:
+		ld->ld_flags = EXT4_HLOCK_LOOKUP_SAFE;
+		return;
+	case EXT4_HLOCK_DEL:
+		ld->ld_flags = EXT4_HLOCK_DEL_SAFE;
+		return;
+	case EXT4_HLOCK_ADD:
+		ld->ld_flags = EXT4_HLOCK_SPLIT;
+		return;
+	}
+}
+
+#define ext4_htree_de_lock(lck, key)		\
+	ext4_htree_node_lock(lck, key, EXT4_LB_DE, 1, NULL)
+#define ext4_htree_de_unlock(lck)		\
+	ext4_htree_node_unlock(lck, EXT4_LB_DE, NULL)
+
+#define ext4_htree_spin_lock(lck, key, event)	\
+	ext4_htree_node_lock(lck, key, EXT4_LB_SPIN, 0, event)
+#define ext4_htree_spin_unlock(lck)		\
+	ext4_htree_node_unlock(lck, EXT4_LB_SPIN, NULL)
+#define ext4_htree_spin_unlock_listen(lck, p)	\
+	ext4_htree_node_unlock(lck, EXT4_LB_SPIN, p)
+
+static void ext4_htree_spin_stop_listen(struct htree_lock *lck)
+{
+	if (!ext4_htree_safe_locked(lck) &&
+	    htree_node_is_listening(lck, ffz(~EXT4_LB_SPIN)))
+		htree_node_stop_listen(lck, ffz(~EXT4_LB_SPIN));
+}
+
+enum {
+	DX_HASH_COL_IGNORE,	/* ignore collision while probing frames */
+	DX_HASH_COL_YES,	/* there is collision and it does matter */
+	DX_HASH_COL_NO,		/* there is no collision */
+};
+
+static int dx_probe_hash_collision(struct htree_lock *lck,
+				   struct dx_entry *entries,
+				   struct dx_entry *at, u32 hash)
+{
+	if (!(lck && ext4_htree_lock_data(lck)->ld_flags & EXT4_LB_EXACT)) {
+		return DX_HASH_COL_IGNORE; /* don't care about collision */
+	} else if (at == entries + dx_get_count(entries) - 1) {
+		return DX_HASH_COL_IGNORE; /* not in any leaf of this DX */
+	} else { /* hash collision? */
+		return ((dx_get_hash(at + 1) & ~1) == hash) ?
+			DX_HASH_COL_YES : DX_HASH_COL_NO;
+	}
+}
+
 /*
  * Probe for a directory leaf block to search.
  *
@@ -745,10 +977,11 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
  */
 static struct dx_frame *
 dx_probe(struct ext4_filename *fname, struct inode *dir,
-	 struct dx_hash_info *hinfo, struct dx_frame *frame_in)
+	 struct dx_hash_info *hinfo, struct dx_frame *frame_in,
+	 struct htree_lock *lck)
 {
 	unsigned count, indirect;
-	struct dx_entry *at, *entries, *p, *q, *m;
+	struct dx_entry *at, *entries, *p, *q, *m, *dx = NULL;
 	struct dx_root_info *info;
 	struct dx_frame *frame = frame_in;
 	struct dx_frame *ret_err = ERR_PTR(ERR_BAD_DX_DIR);
@@ -811,6 +1044,13 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
 
 	dxtrace(printk("Look up %x", hash));
 	while (1) {
+		if (indirect == 0) { /* the last index level */
+			/* NB: ext4_htree_dx_lock() could be noop if
+			 * DX-lock flag is not set for current operation
+			 */
+			ext4_htree_dx_lock(lck, dx);
+			ext4_htree_spin_lock(lck, dx, NULL);
+		}
 		count = dx_get_count(entries);
 		if (!count || count > dx_get_limit(entries)) {
 			ext4_warning_inode(dir,
@@ -851,8 +1091,75 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
 			       dx_get_block(at)));
 		frame->entries = entries;
 		frame->at = at;
-		if (!indirect--)
+
+		if (indirect == 0) { /* the last index level */
+			struct ext4_dir_lock_data *ld;
+			u64 myblock;
+
+			/* By default we only lock the DE-block; however, we
+			 * will also lock the last level DX-block if:
+			 * a) there is a hash collision:
+			 *    we will set the DX-lock flag (a few lines below)
+			 *    and retry to lock the DX-block,
+			 *    see details in dx_probe_hash_collision()
+			 * b) it's a retry from splitting:
+			 *    we need to lock the last level DX-block so nobody
+			 *    else can split any leaf blocks under the same
+			 *    DX-block, see details in ext4_dx_add_entry()
+			 */
+			if (ext4_htree_dx_locked(lck)) {
+				/* DX-block is locked, just lock DE-block
+				 * and return
+				 */
+				ext4_htree_spin_unlock(lck);
+				if (!ext4_htree_safe_locked(lck))
+					ext4_htree_de_lock(lck, frame->at);
+				return frame;
+			}
+			/* it's pdirop and no DX lock */
+			if (dx_probe_hash_collision(lck, entries, at, hash) ==
+			    DX_HASH_COL_YES) {
+				/* found hash collision, set DX-lock flag
+				 * and retry to obtain the DX-lock
+				 */
+				ext4_htree_spin_unlock(lck);
+				ext4_htree_dx_need_lock(lck);
+				continue;
+			}
+			ld = ext4_htree_lock_data(lck);
+			/* because we don't lock DX, @at can't be trusted
+			 * after the spinlock is released, so save a copy
+			 */
+			ld->ld_at = at;
+			ld->ld_at_entry = *at;
+			ld->ld_count = dx_get_count(entries);
+
+			frame->at = &ld->ld_at_entry;
+			myblock = dx_get_block(at);
+
+			/* NB: ordering locking */
+			ext4_htree_spin_unlock_listen(lck, &myblock);
+			/* another thread can split this DE-block because:
+			 * a) we don't hold the lock for the DE-block yet
+			 * b) we released the spinlock on the DX-block
+			 * if that happens we can detect it by listening for
+			 * the splitting event on this DE-block
+			 */
+			ext4_htree_de_lock(lck, frame->at);
+			ext4_htree_spin_stop_listen(lck);
+
+			if (myblock == EXT4_HTREE_NODE_CHANGED) {
+				/* someone split this DE-block before
+				 * we locked it; retry and lock a
+				 * valid DE-block
+				 */
+				ext4_htree_de_unlock(lck);
+				continue;
+			}
 			return frame;
+		}
+		dx = at;
+		indirect--;
 		frame++;
 		frame->bh = ext4_read_dirblock(dir, dx_get_block(at), INDEX);
 		if (IS_ERR(frame->bh)) {
@@ -921,7 +1228,7 @@ static void dx_release(struct dx_frame *frames)
 static int ext4_htree_next_block(struct inode *dir, __u32 hash,
 				 struct dx_frame *frame,
 				 struct dx_frame *frames,
-				 __u32 *start_hash)
+				 u32 *start_hash, struct htree_lock *lck)
 {
 	struct dx_frame *p;
 	struct buffer_head *bh;
@@ -936,12 +1243,23 @@ static int ext4_htree_next_block(struct inode *dir, __u32 hash,
 	 * this loop, num_frames indicates the number of interior
 	 * nodes need to be read.
 	 */
+	ext4_htree_de_unlock(lck);
 	while (1) {
-		if (++(p->at) < p->entries + dx_get_count(p->entries))
-			break;
+		if (num_frames > 0 || ext4_htree_dx_locked(lck)) {
+			/* num_frames > 0 :
+			 *   DX block
+			 * ext4_htree_dx_locked:
+			 *   frame->at is a reliable pointer returned by dx_probe,
+			 *   otherwise dx_probe already knew there was no collision
+			 */
+			if (++(p->at) < p->entries + dx_get_count(p->entries))
+				break;
+		}
 		if (p == frames)
 			return 0;
 		num_frames++;
+		if (num_frames == 1)
+			ext4_htree_dx_unlock(lck);
 		p--;
 	}
 
@@ -964,6 +1282,14 @@ static int ext4_htree_next_block(struct inode *dir, __u32 hash,
 	 * block so no check is necessary
 	 */
 	while (num_frames--) {
+		if (num_frames == 0) {
+			/* this is not always necessary; we just don't want
+			 * to detect hash collisions again
+			 */
+			ext4_htree_dx_need_lock(lck);
+			ext4_htree_dx_lock(lck, p->at);
+		}
+
 		bh = ext4_read_dirblock(dir, dx_get_block(p->at), INDEX);
 		if (IS_ERR(bh))
 			return PTR_ERR(bh);
@@ -972,10 +1298,10 @@ static int ext4_htree_next_block(struct inode *dir, __u32 hash,
 		p->bh = bh;
 		p->at = p->entries = ((struct dx_node *) bh->b_data)->entries;
 	}
+	ext4_htree_de_lock(lck, p->at);
 	return 1;
 }
 
-
 /*
  * This function fills a red-black tree with information from a
  * directory block.  It returns the number directory entries loaded
@@ -1119,7 +1445,8 @@ int ext4_htree_fill_tree(struct file *dir_file, __u32 start_hash,
 	}
 	hinfo.hash = start_hash;
 	hinfo.minor_hash = 0;
-	frame = dx_probe(NULL, dir, &hinfo, frames);
+	/* assume it's PR locked */
+	frame = dx_probe(NULL, dir, &hinfo, frames, NULL);
 	if (IS_ERR(frame))
 		return PTR_ERR(frame);
 
@@ -1162,7 +1489,7 @@ int ext4_htree_fill_tree(struct file *dir_file, __u32 start_hash,
 		count += ret;
 		hashval = ~0;
 		ret = ext4_htree_next_block(dir, HASH_NB_ALWAYS,
-					    frame, frames, &hashval);
+					    frame, frames, &hashval, NULL);
 		*next_hash = hashval;
 		if (ret < 0) {
 			err = ret;
@@ -1400,7 +1727,7 @@ static int is_dx_internal_node(struct inode *dir, ext4_lblk_t block,
 static struct buffer_head *__ext4_find_entry(struct inode *dir,
 					     struct ext4_filename *fname,
 					     struct ext4_dir_entry_2 **res_dir,
-					     int *inlined)
+					     int *inlined, struct htree_lock *lck)
 {
 	struct super_block *sb;
 	struct buffer_head *bh_use[NAMEI_RA_SIZE];
@@ -1442,7 +1769,7 @@ static struct buffer_head *__ext4_find_entry(struct inode *dir,
 		goto restart;
 	}
 	if (is_dx(dir)) {
-		ret = ext4_dx_find_entry(dir, fname, res_dir);
+		ret = ext4_dx_find_entry(dir, fname, res_dir, lck);
 		/*
 		 * On success, or if the error was file not found,
 		 * return.  Otherwise, fall back to doing a search the
@@ -1452,6 +1779,7 @@ static struct buffer_head *__ext4_find_entry(struct inode *dir,
 			goto cleanup_and_exit;
 		dxtrace(printk(KERN_DEBUG "ext4_find_entry: dx failed, "
 			       "falling back\n"));
+		ext4_htree_safe_relock(lck);
 		ret = NULL;
 	}
 	nblocks = dir->i_size >> EXT4_BLOCK_SIZE_BITS(sb);
@@ -1540,10 +1868,10 @@ static struct buffer_head *__ext4_find_entry(struct inode *dir,
 	return ret;
 }
 
-static struct buffer_head *ext4_find_entry(struct inode *dir,
+struct buffer_head *ext4_find_entry_locked(struct inode *dir,
 					   const struct qstr *d_name,
 					   struct ext4_dir_entry_2 **res_dir,
-					   int *inlined)
+					   int *inlined, struct htree_lock *lck)
 {
 	int err;
 	struct ext4_filename fname;
@@ -1555,11 +1883,12 @@ static struct buffer_head *ext4_find_entry(struct inode *dir,
 	if (err)
 		return ERR_PTR(err);
 
-	bh = __ext4_find_entry(dir, &fname, res_dir, inlined);
+	bh = __ext4_find_entry(dir, &fname, res_dir, inlined, lck);
 
 	ext4_fname_free_filename(&fname);
 	return bh;
 }
+EXPORT_SYMBOL(ext4_find_entry_locked);
 
 static struct buffer_head *ext4_lookup_entry(struct inode *dir,
 					     struct dentry *dentry,
@@ -1575,7 +1904,7 @@ static struct buffer_head *ext4_lookup_entry(struct inode *dir,
 	if (err)
 		return ERR_PTR(err);
 
-	bh = __ext4_find_entry(dir, &fname, res_dir, NULL);
+	bh = __ext4_find_entry(dir, &fname, res_dir, NULL, NULL);
 
 	ext4_fname_free_filename(&fname);
 	return bh;
@@ -1583,7 +1912,8 @@ static struct buffer_head *ext4_lookup_entry(struct inode *dir,
 
 static struct buffer_head * ext4_dx_find_entry(struct inode *dir,
 			struct ext4_filename *fname,
-			struct ext4_dir_entry_2 **res_dir)
+			struct ext4_dir_entry_2 **res_dir,
+			struct htree_lock *lck)
 {
 	struct super_block * sb = dir->i_sb;
 	struct dx_frame frames[EXT4_HTREE_LEVEL], *frame;
@@ -1594,7 +1924,7 @@ static struct buffer_head * ext4_dx_find_entry(struct inode *dir,
 #ifdef CONFIG_FS_ENCRYPTION
 	*res_dir = NULL;
 #endif
-	frame = dx_probe(fname, dir, NULL, frames);
+	frame = dx_probe(fname, dir, NULL, frames, lck);
 	if (IS_ERR(frame))
 		return (struct buffer_head *) frame;
 	do {
@@ -1616,7 +1946,7 @@ static struct buffer_head * ext4_dx_find_entry(struct inode *dir,
 
 		/* Check to see if we should continue to search */
 		retval = ext4_htree_next_block(dir, fname->hinfo.hash, frame,
-					       frames, NULL);
+					       frames, NULL, lck);
 		if (retval < 0) {
 			ext4_warning_inode(dir,
 				"error %d reading directory index block",
@@ -1799,8 +2129,9 @@ static struct ext4_dir_entry_2* dx_pack_dirents(char *base, unsigned blocksize)
  * Returns pointer to de in block into which the new entry will be inserted.
  */
 static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
-			struct buffer_head **bh,struct dx_frame *frame,
-			struct dx_hash_info *hinfo)
+			struct buffer_head **bh, struct dx_frame *frames,
+			struct dx_frame *frame, struct dx_hash_info *hinfo,
+			struct htree_lock *lck)
 {
 	unsigned blocksize = dir->i_sb->s_blocksize;
 	unsigned count, continued;
@@ -1862,8 +2193,15 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
 					hash2, split, count-split));
 
 	/* Fancy dance to stay within two buffers */
-	de2 = dx_move_dirents(data1, data2, map + split, count - split,
-			      blocksize);
+	if (hinfo->hash < hash2) {
+		de2 = dx_move_dirents(data1, data2, map + split,
+				      count - split, blocksize);
+	} else {
+		/* make sure we will add the entry to the same block
+		 * which we have already locked
+		 */
+		de2 = dx_move_dirents(data1, data2, map, split, blocksize);
+	}
 	de = dx_pack_dirents(data1, blocksize);
 	de->rec_len = ext4_rec_len_to_disk(data1 + (blocksize - csum_size) -
 					   (char *) de,
@@ -1884,12 +2222,20 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
 	dxtrace(dx_show_leaf(dir, hinfo, (struct ext4_dir_entry_2 *) data2,
 			blocksize, 1));
 
-	/* Which block gets the new entry? */
-	if (hinfo->hash >= hash2) {
-		swap(*bh, bh2);
-		de = de2;
+	ext4_htree_spin_lock(lck, frame > frames ? (frame - 1)->at : NULL,
+			     frame->at); /* notify block is being split */
+	if (hinfo->hash < hash2) {
+		dx_insert_block(frame, hash2 + continued, newblock);
+	} else {
+		/* switch block number */
+		dx_insert_block(frame, hash2 + continued,
+				dx_get_block(frame->at));
+		dx_set_block(frame->at, newblock);
+		(frame->at)++;
 	}
-	dx_insert_block(frame, hash2 + continued, newblock);
+	ext4_htree_spin_unlock(lck);
+	ext4_htree_dx_unlock(lck);
+
 	err = ext4_handle_dirty_dirent_node(handle, dir, bh2);
 	if (err)
 		goto journal_error;
@@ -2167,7 +2513,7 @@ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname,
 	if (retval)
 		goto out_frames;	
 
-	de = do_split(handle,dir, &bh2, frame, &fname->hinfo);
+	de = do_split(handle, dir, &bh2, frames, frame, &fname->hinfo, NULL);
 	if (IS_ERR(de)) {
 		retval = PTR_ERR(de);
 		goto out_frames;
@@ -2276,8 +2622,8 @@ static int ext4_update_dotdot(handle_t *handle, struct dentry *dentry,
  * may not sleep between calling this and putting something into
  * the entry, as someone else might have used it while you slept.
  */
-static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
-			  struct inode *inode)
+int ext4_add_entry_locked(handle_t *handle, struct dentry *dentry,
+			  struct inode *inode, struct htree_lock *lck)
 {
 	struct inode *dir = d_inode(dentry->d_parent);
 	struct buffer_head *bh = NULL;
@@ -2326,9 +2672,10 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
 		if (dentry->d_name.len == 2 &&
 		    memcmp(dentry->d_name.name, "..", 2) == 0)
 			return ext4_update_dotdot(handle, dentry, inode);
-		retval = ext4_dx_add_entry(handle, &fname, dir, inode);
+		retval = ext4_dx_add_entry(handle, &fname, dir, inode, lck);
 		if (!retval || (retval != ERR_BAD_DX_DIR))
 			goto out;
+		ext4_htree_safe_relock(lck);
 		ext4_clear_inode_flag(dir, EXT4_INODE_INDEX);
 		dx_fallback++;
 		ext4_mark_inode_dirty(handle, dir);
@@ -2378,12 +2725,14 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
 		ext4_set_inode_state(inode, EXT4_STATE_NEWENTRY);
 	return retval;
 }
+EXPORT_SYMBOL(ext4_add_entry_locked);
 
 /*
  * Returns 0 for success, or a negative error value
  */
 static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
-			     struct inode *dir, struct inode *inode)
+			     struct inode *dir, struct inode *inode,
+			     struct htree_lock *lck)
 {
 	struct dx_frame frames[EXT4_HTREE_LEVEL], *frame;
 	struct dx_entry *entries, *at;
@@ -2395,7 +2744,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
 
 again:
 	restart = 0;
-	frame = dx_probe(fname, dir, NULL, frames);
+	frame = dx_probe(fname, dir, NULL, frames, lck);
 	if (IS_ERR(frame))
 		return PTR_ERR(frame);
 	entries = frame->entries;
@@ -2430,6 +2779,12 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
 		struct dx_node *node2;
 		struct buffer_head *bh2;
 
+		if (!ext4_htree_safe_locked(lck)) { /* retry with EX lock */
+			ext4_htree_safe_relock(lck);
+			restart = 1;
+			goto cleanup;
+		}
+
 		while (frame > frames) {
 			if (dx_get_count((frame - 1)->entries) <
 			    dx_get_limit((frame - 1)->entries)) {
@@ -2533,8 +2888,34 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
 			restart = 1;
 			goto journal_error;
 		}
+	} else if (!ext4_htree_dx_locked(lck)) {
+		struct ext4_dir_lock_data *ld = ext4_htree_lock_data(lck);
+
+		/* not well protected, require DX lock */
+		ext4_htree_dx_need_lock(lck);
+		at = frame > frames ? (frame - 1)->at : NULL;
+
+		/* NB: no risk of deadlock because it's just a try.
+		 *
+		 * NB: we check ld_count twice, the first time before
+		 * taking the DX lock, the second time while holding it.
+		 *
+		 * NB: we never free directory blocks so far, which means
+		 * the value returned by dx_get_count() should equal
+		 * ld->ld_count if nobody has split any DE-block under @at,
+		 * and ld->ld_at still points to a valid dx_entry.
+		 */
+		if ((ld->ld_count != dx_get_count(entries)) ||
+		    !ext4_htree_dx_lock_try(lck, at) ||
+		   (ld->ld_count != dx_get_count(entries))) {
+			restart = 1;
+			goto cleanup;
+		}
+		/* OK, we've got the DX lock and nothing changed */
+		frame->at = ld->ld_at;
 	}
-	de = do_split(handle, dir, &bh, frame, &fname->hinfo);
+
+	de = do_split(handle, dir, &bh, frames, frame, &fname->hinfo, lck);
 	if (IS_ERR(de)) {
 		err = PTR_ERR(de);
 		goto cleanup;
@@ -2545,6 +2926,8 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
 journal_error:
 	ext4_std_error(dir->i_sb, err); /* this is a no-op if err == 0 */
 cleanup:
+	ext4_htree_dx_unlock(lck);
+	ext4_htree_de_unlock(lck);
 	brelse(bh);
 	dx_release(frames);
 	/* @restart is true means htree-path has been changed, we need to
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 564eb35..07242d7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1077,6 +1077,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 		return NULL;
 
 	inode_set_iversion(&ei->vfs_inode, 1);
+	sema_init(&ei->i_append_sem, 1);
 	spin_lock_init(&ei->i_raw_lock);
 	INIT_LIST_HEAD(&ei->i_prealloc_list);
 	spin_lock_init(&ei->i_prealloc_lock);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [lustre-devel] [PATCH 13/22] ext4: Add a proc interface for max_dir_size.
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (11 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 12/22] ext4: add htree lock implementation James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  5:14   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 14/22] ext4: remove inode_lock handling James Simmons
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/sysfs.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 1375815..3a71a16 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -180,6 +180,8 @@ static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
 EXT4_ATTR_OFFSET(inode_readahead_blks, 0644, inode_readahead,
 		 ext4_sb_info, s_inode_readahead_blks);
 EXT4_RW_ATTR_SBI_UI(inode_goal, s_inode_goal);
+EXT4_RW_ATTR_SBI_UI(max_dir_size, s_max_dir_size_kb);
+EXT4_RW_ATTR_SBI_UI(max_dir_size_kb, s_max_dir_size_kb);
 EXT4_RW_ATTR_SBI_UI(mb_stats, s_mb_stats);
 EXT4_RW_ATTR_SBI_UI(mb_max_to_scan, s_mb_max_to_scan);
 EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
@@ -210,6 +212,8 @@ static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
 	ATTR_LIST(reserved_clusters),
 	ATTR_LIST(inode_readahead_blks),
 	ATTR_LIST(inode_goal),
+	ATTR_LIST(max_dir_size),
+	ATTR_LIST(max_dir_size_kb),
 	ATTR_LIST(mb_stats),
 	ATTR_LIST(mb_max_to_scan),
 	ATTR_LIST(mb_min_to_scan),
@@ -350,6 +354,8 @@ static ssize_t ext4_attr_store(struct kobject *kobj,
 		ret = kstrtoul(skip_spaces(buf), 0, &t);
 		if (ret)
 			return ret;
+		if (strcmp("max_dir_size", a->attr.name) == 0)
+			t >>= 10;
 		if (a->attr_ptr == ptr_ext4_super_block_offset)
 			*((__le32 *) ptr) = cpu_to_le32(t);
 		else
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread
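The store-handler change above means both attribute names share one backing field, s_max_dir_size_kb: "max_dir_size_kb" is read and written in KB, while the legacy "max_dir_size" name accepts bytes and is shifted right by 10 before being stored. A sketch of the conversion (the /sys paths and the "sda1" device name are hypothetical and depend on the mounted filesystem):

```shell
# Hypothetical usage against a mounted filesystem:
#   echo 4194304 > /sys/fs/ext4/sda1/max_dir_size     # legacy name, bytes
#   cat /sys/fs/ext4/sda1/max_dir_size_kb             # same field, in KB
# The conversion the store handler applies is a plain right shift by 10:
bytes=4194304
echo $(( bytes >> 10 ))
```

So writing 4194304 (4 MiB) through the legacy byte-based name stores 4096 in s_max_dir_size_kb, the same value a direct write of 4096 to max_dir_size_kb would store.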

* [lustre-devel] [PATCH 14/22] ext4: remove inode_lock handling
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (12 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 13/22] ext4: Add a proc interface for max_dir_size James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  5:16   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 15/22] ext4: remove bitmap corruption warnings James Simmons
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Invoking ext4_truncate() with i_mutex held causes a deadlock in Lustre.
Since Lustre provides its own locking for this protection, these checks
are not needed at all.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/inode.c | 2 --
 fs/ext4/namei.c | 4 ----
 2 files changed, 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5561351..9296611 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4474,8 +4474,6 @@ int ext4_truncate(struct inode *inode)
 	 * or it's a completely new inode. In those cases we might not
 	 * have i_mutex locked because it's not necessary.
 	 */
-	if (!(inode->i_state & (I_NEW|I_FREEING)))
-		WARN_ON(!inode_is_locked(inode));
 	trace_ext4_truncate_enter(inode);
 
 	if (!ext4_can_truncate(inode))
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 0153c4d..1b6d22a 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -3485,8 +3485,6 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
 	if (!sbi->s_journal || is_bad_inode(inode))
 		return 0;
 
-	WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
-		     !inode_is_locked(inode));
 	/*
 	 * Exit early if inode already is on orphan list. This is a big speedup
 	 * since we don't have to contend on the global s_orphan_lock.
@@ -3569,8 +3567,6 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
 	if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
 		return 0;
 
-	WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
-		     !inode_is_locked(inode));
 	/* Do this quick check before taking global s_orphan_lock. */
 	if (list_empty(&ei->i_orphan))
 		return 0;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [lustre-devel] [PATCH 15/22] ext4: remove bitmap corruption warnings
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (13 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 14/22] ext4: remove inode_lock handling James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  5:18   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 16/22] ext4: add warning for directory htree growth James Simmons
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Since corrupt block groups can be skipped, this patch uses
ext4_warning() instead of ext4_error() so that the filesystem is not
remounted read-only by default.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/balloc.c  | 10 ++++-----
 fs/ext4/ialloc.c  |  6 ++---
 fs/ext4/mballoc.c | 66 +++++++++++++++++++++++--------------------------------
 3 files changed, 35 insertions(+), 47 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index bca75c1..aff3981 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -374,7 +374,7 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
 	if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group,
 			desc, bh))) {
 		ext4_unlock_group(sb, block_group);
-		ext4_error(sb, "bg %u: bad block bitmap checksum", block_group);
+		ext4_warning(sb, "bg %u: bad block bitmap checksum", block_group);
 		ext4_mark_group_bitmap_corrupted(sb, block_group,
 					EXT4_GROUP_INFO_BBITMAP_CORRUPT);
 		return -EFSBADCRC;
@@ -382,8 +382,8 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
 	blk = ext4_valid_block_bitmap(sb, desc, block_group, bh);
 	if (unlikely(blk != 0)) {
 		ext4_unlock_group(sb, block_group);
-		ext4_error(sb, "bg %u: block %llu: invalid block bitmap",
-			   block_group, blk);
+		ext4_warning(sb, "bg %u: block %llu: invalid block bitmap",
+			     block_group, blk);
 		ext4_mark_group_bitmap_corrupted(sb, block_group,
 					EXT4_GROUP_INFO_BBITMAP_CORRUPT);
 		return -EFSCORRUPTED;
@@ -459,8 +459,8 @@ struct buffer_head *
 		ext4_unlock_group(sb, block_group);
 		unlock_buffer(bh);
 		if (err) {
-			ext4_error(sb, "Failed to init block bitmap for group "
-				   "%u: %d", block_group, err);
+			ext4_warning(sb, "Failed to init block bitmap for group "
+				     "%u: %d", block_group, err);
 			goto out;
 		}
 		goto verify;
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 68d41e6..a72fe63 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -96,8 +96,8 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
 	if (!ext4_inode_bitmap_csum_verify(sb, block_group, desc, bh,
 					   EXT4_INODES_PER_GROUP(sb) / 8)) {
 		ext4_unlock_group(sb, block_group);
-		ext4_error(sb, "Corrupt inode bitmap - block_group = %u, "
-			   "inode_bitmap = %llu", block_group, blk);
+		ext4_warning(sb, "Corrupt inode bitmap - block_group = %u, "
+			     "inode_bitmap = %llu", block_group, blk);
 		ext4_mark_group_bitmap_corrupted(sb, block_group,
 					EXT4_GROUP_INFO_IBITMAP_CORRUPT);
 		return -EFSBADCRC;
@@ -346,7 +346,7 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
 		if (!fatal)
 			fatal = err;
 	} else {
-		ext4_error(sb, "bit already cleared for inode %lu", ino);
+		ext4_warning(sb, "bit already cleared for inode %lu", ino);
 		ext4_mark_group_bitmap_corrupted(sb, block_group,
 					EXT4_GROUP_INFO_IBITMAP_CORRUPT);
 	}
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 463fba6..82398b0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -741,10 +741,15 @@ int ext4_mb_generate_buddy(struct super_block *sb,
 	grp->bb_fragments = fragments;
 
 	if (free != grp->bb_free) {
-		ext4_grp_locked_error(sb, group, 0, 0,
-				      "block bitmap and bg descriptor "
-				      "inconsistent: %u vs %u free clusters",
-				      free, grp->bb_free);
+		struct ext4_group_desc *gdp;
+
+		gdp = ext4_get_group_desc(sb, group, NULL);
+		ext4_warning(sb, "group %lu: block bitmap and bg descriptor "
+			     "inconsistent: %u vs %u free clusters "
+			     "%u in gd, %lu pa's",
+			     (long unsigned int)group, free, grp->bb_free,
+			     ext4_free_group_clusters(sb, gdp),
+			     grp->bb_prealloc_nr);
 		/*
 		 * If we intend to continue, we consider group descriptor
 		 * corrupt and update bb_free using bitmap value
@@ -1106,7 +1111,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
 	int block;
 	int pnum;
 	int poff;
-	struct page *page;
+	struct page *page = NULL;
 	int ret;
 	struct ext4_group_info *grp;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -1132,7 +1137,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
 		 */
 		ret = ext4_mb_init_group(sb, group, gfp);
 		if (ret)
-			return ret;
+			goto err;
 	}
 
 	/*
@@ -1235,6 +1240,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
 		put_page(e4b->bd_buddy_page);
 	e4b->bd_buddy = NULL;
 	e4b->bd_bitmap = NULL;
+	ext4_warning(sb, "Error loading buddy information for %u", group);
 	return ret;
 }
 
@@ -3631,8 +3637,10 @@ int ext4_mb_check_ondisk_bitmap(struct super_block *sb, void *bitmap,
 	}
 
 	if (free != free_in_gdp) {
-		ext4_error(sb, "on-disk bitmap for group %d corrupted: %u blocks free in bitmap, %u - in gd\n",
-			   group, free, free_in_gdp);
+		ext4_warning(sb, "on-disk bitmap for group %d corrupted: %u blocks free in bitmap, %u - in gd\n",
+			     group, free, free_in_gdp);
+		ext4_mark_group_bitmap_corrupted(sb, group,
+						 EXT4_GROUP_INFO_BBITMAP_CORRUPT);
 		return -EIO;
 	}
 	return 0;
@@ -4015,14 +4023,6 @@ static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
 	 * otherwise maybe leave some free blocks unavailable, no need to BUG.
 	 */
 	if ((free > pa->pa_free && !pa->pa_error) || (free < pa->pa_free)) {
-		ext4_error(sb, "pa free mismatch: [pa %p] "
-				"[phy %lu] [logic %lu] [len %u] [free %u] "
-				"[error %u] [inode %lu] [freed %u]", pa,
-				(unsigned long)pa->pa_pstart,
-				(unsigned long)pa->pa_lstart,
-				(unsigned)pa->pa_len, (unsigned)pa->pa_free,
-				(unsigned)pa->pa_error, pa->pa_inode->i_ino,
-				free);
 		ext4_grp_locked_error(sb, group, 0, 0, "free %u, pa_free %u",
 					free, pa->pa_free);
 		/*
@@ -4086,15 +4086,11 @@ static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
 	bitmap_bh = ext4_read_block_bitmap(sb, group);
 	if (IS_ERR(bitmap_bh)) {
 		err = PTR_ERR(bitmap_bh);
-		ext4_error(sb, "Error %d reading block bitmap for %u",
-			   err, group);
 		return 0;
 	}
 
 	err = ext4_mb_load_buddy(sb, group, &e4b);
 	if (err) {
-		ext4_warning(sb, "Error %d loading buddy information for %u",
-			     err, group);
 		put_bh(bitmap_bh);
 		return 0;
 	}
@@ -4255,17 +4251,12 @@ void ext4_discard_preallocations(struct inode *inode)
 
 		err = ext4_mb_load_buddy_gfp(sb, group, &e4b,
 					     GFP_NOFS|__GFP_NOFAIL);
-		if (err) {
-			ext4_error(sb, "Error %d loading buddy information for %u",
-				   err, group);
+		if (err)
 			return;
-		}
 
 		bitmap_bh = ext4_read_block_bitmap(sb, group);
 		if (IS_ERR(bitmap_bh)) {
 			err = PTR_ERR(bitmap_bh);
-			ext4_error(sb, "Error %d reading block bitmap for %u",
-					err, group);
 			ext4_mb_unload_buddy(&e4b);
 			continue;
 		}
@@ -4527,11 +4518,8 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 		group = ext4_get_group_number(sb, pa->pa_pstart);
 		err = ext4_mb_load_buddy_gfp(sb, group, &e4b,
 					     GFP_NOFS|__GFP_NOFAIL);
-		if (err) {
-			ext4_error(sb, "Error %d loading buddy information for %u",
-				   err, group);
+		if (err)
 			continue;
-		}
 		ext4_lock_group(sb, group);
 		list_del(&pa->pa_group_list);
 		ext4_get_group_info(sb, group)->bb_prealloc_nr--;
@@ -4785,7 +4773,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 			 * not revert pa_free back, just mark pa_error
 			 */
 			pa->pa_error++;
-			ext4_error(sb,
+			ext4_warning(sb,
 				   "Updating bitmap error: [err %d] "
 				   "[pa %p] [phy %lu] [logic %lu] "
 				   "[len %u] [free %u] [error %u] "
@@ -4796,6 +4784,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 				   (unsigned)pa->pa_free,
 				   (unsigned)pa->pa_error,
 				   pa->pa_inode ? pa->pa_inode->i_ino : 0);
+			ext4_mark_group_bitmap_corrupted(sb, 0, 0);
 		}
 	}
 	ext4_mb_release_context(ac);
@@ -5081,7 +5070,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 	err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b,
 				     GFP_NOFS|__GFP_NOFAIL);
 	if (err)
-		goto error_return;
+		goto error_brelse;
 
 	/*
 	 * We need to make sure we don't reuse the freed block until after the
@@ -5171,8 +5160,9 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 		goto do_more;
 	}
 error_return:
-	brelse(bitmap_bh);
 	ext4_std_error(sb, err);
+error_brelse:
+	brelse(bitmap_bh);
 	return;
 }
 
@@ -5272,7 +5262,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
 
 	err = ext4_mb_load_buddy(sb, block_group, &e4b);
 	if (err)
-		goto error_return;
+		goto error_brelse;
 
 	/*
 	 * need to update group_info->bb_free and bitmap
@@ -5310,8 +5300,9 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
 		err = ret;
 
 error_return:
-	brelse(bitmap_bh);
 	ext4_std_error(sb, err);
+error_brelse:
+	brelse(bitmap_bh);
 	return err;
 }
 
@@ -5386,11 +5377,8 @@ static int ext4_trim_extent(struct super_block *sb, int start, int count,
 	trace_ext4_trim_all_free(sb, group, start, max);
 
 	ret = ext4_mb_load_buddy(sb, group, &e4b);
-	if (ret) {
-		ext4_warning(sb, "Error %d loading buddy information for %u",
-			     ret, group);
+	if (ret)
 		return ret;
-	}
 	bitmap = e4b.bd_bitmap;
 
 	ext4_lock_group(sb, group);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [lustre-devel] [PATCH 16/22] ext4: add warning for directory htree growth
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (14 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 15/22] ext4: remove bitmap corruption warnings James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  5:24   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 17/22] ext4: optimize ext4_journal_callback_add James Simmons
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/ext4.h  |  1 +
 fs/ext4/namei.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/super.c |  2 ++
 fs/ext4/sysfs.c |  2 ++
 4 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf74c7c..5f73e19 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1472,6 +1472,7 @@ struct ext4_sb_info {
 	unsigned int s_mb_group_prealloc;
 	unsigned long *s_mb_prealloc_table;
 	unsigned int s_max_dir_size_kb;
+	unsigned long s_warning_dir_size;
 	/* where last allocation was done - for stream allocation */
 	unsigned long s_mb_last_group;
 	unsigned long s_mb_last_start;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 1b6d22a..9b30cc6 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -757,12 +757,20 @@ struct ext4_dir_lock_data {
 #define ext4_htree_lock_data(l)	((struct ext4_dir_lock_data *)(l)->lk_private)
 #define ext4_find_entry(dir, name, dirent, inline) \
 			ext4_find_entry_locked(dir, name, dirent, inline, NULL)
-#define ext4_add_entry(handle, dentry, inode) \
-		       ext4_add_entry_locked(handle, dentry, inode, NULL)
 
 /* NB: ext4_lblk_t is 32 bits so we use high bits to identify invalid blk */
 #define EXT4_HTREE_NODE_CHANGED		(0xcafeULL << 32)
 
+inline int ext4_add_entry(handle_t *handle, struct dentry *dentry,
+			  struct inode *inode)
+{
+	int ret = ext4_add_entry_locked(handle, dentry, inode, NULL);
+
+	if (ret == -ENOBUFS)
+		ret = 0;
+	return ret;
+}
+
 static void ext4_htree_event_cb(void *target, void *event)
 {
 	u64 *block = (u64 *)target;
@@ -2612,6 +2620,55 @@ static int ext4_update_dotdot(handle_t *handle, struct dentry *dentry,
 	return err;
 }
 
+static unsigned long __ext4_max_dir_size(struct dx_frame *frames,
+					 struct dx_frame *frame,
+					 struct inode *dir)
+{
+	unsigned long max_dir_size;
+
+	if (EXT4_SB(dir->i_sb)->s_max_dir_size_kb) {
+		max_dir_size = EXT4_SB(dir->i_sb)->s_max_dir_size_kb << 10;
+	} else {
+		max_dir_size = EXT4_BLOCK_SIZE(dir->i_sb);
+		while (frame >= frames) {
+			max_dir_size *= dx_get_limit(frame->entries);
+			if (frame == frames)
+				break;
+			frame--;
+		}
+		/* use 75% of max dir size in average */
+		max_dir_size = max_dir_size / 4 * 3;
+	}
+	return max_dir_size;
+}
+
+/*
+ * With hash tree growing, it is easy to hit ENOSPC, but it is hard
+ * to predict when it will happen. let's give administrators warning
+ * when reaching 3/5 and 2/3 of limit
+ */
+static inline bool dir_size_in_warning_range(struct dx_frame *frames,
+					     struct dx_frame *frame,
+					     struct inode *dir)
+{
+	struct super_block *sb = dir->i_sb;
+	unsigned long size1, size2;
+
+	if (unlikely(!EXT4_SB(sb)->s_warning_dir_size))
+		EXT4_SB(sb)->s_warning_dir_size =
+			__ext4_max_dir_size(frames, frame, dir);
+
+	size1 = EXT4_SB(sb)->s_warning_dir_size / 16 * 10;
+	size1 = size1 & ~(EXT4_BLOCK_SIZE(sb) - 1);
+	size2 = EXT4_SB(sb)->s_warning_dir_size / 16 * 11;
+	size2 = size2 & ~(EXT4_BLOCK_SIZE(sb) - 1);
+	if (in_range(dir->i_size, size1, EXT4_BLOCK_SIZE(sb)) ||
+	    in_range(dir->i_size, size2, EXT4_BLOCK_SIZE(sb)))
+		return true;
+
+	return false;
+}
+
 /*
  *	ext4_add_entry()
  *
@@ -2739,6 +2796,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
 	struct buffer_head *bh;
 	struct super_block *sb = dir->i_sb;
 	struct ext4_dir_entry_2 *de;
+	bool ret_warn = false;
 	int restart;
 	int err;
 
@@ -2769,6 +2827,11 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
 	/* Block full, should compress but for now just split */
 	dxtrace(printk(KERN_DEBUG "using %u of %u node entries\n",
 		       dx_get_count(entries), dx_get_limit(entries)));
+
+	if (frame - frames + 1 >= ext4_dir_htree_level(sb) ||
+	    EXT4_SB(sb)->s_warning_dir_size)
+		ret_warn = dir_size_in_warning_range(frames, frame, dir);
+
 	/* Need to split index? */
 	if (dx_get_count(entries) == dx_get_limit(entries)) {
 		ext4_lblk_t newblock;
@@ -2935,6 +2998,8 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
 	 */
 	if (restart && err == 0)
 		goto again;
+	if (err == 0 && ret_warn)
+		err = -ENOBUFS;
 	return err;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 07242d7..a3179b2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1901,6 +1901,8 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 		sbi->s_li_wait_mult = arg;
 	} else if (token == Opt_max_dir_size_kb) {
 		sbi->s_max_dir_size_kb = arg;
+		/* reset s_warning_dir_size and make it re-calculated */
+		sbi->s_warning_dir_size = 0;
 	} else if (token == Opt_stripe) {
 		sbi->s_stripe = arg;
 	} else if (token == Opt_resuid) {
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 3a71a16..575f318 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -182,6 +182,7 @@ static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
 EXT4_RW_ATTR_SBI_UI(inode_goal, s_inode_goal);
 EXT4_RW_ATTR_SBI_UI(max_dir_size, s_max_dir_size_kb);
 EXT4_RW_ATTR_SBI_UI(max_dir_size_kb, s_max_dir_size_kb);
+EXT4_RW_ATTR_SBI_UI(warning_dir_size, s_warning_dir_size);
 EXT4_RW_ATTR_SBI_UI(mb_stats, s_mb_stats);
 EXT4_RW_ATTR_SBI_UI(mb_max_to_scan, s_mb_max_to_scan);
 EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
@@ -214,6 +215,7 @@ static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
 	ATTR_LIST(inode_goal),
 	ATTR_LIST(max_dir_size),
 	ATTR_LIST(max_dir_size_kb),
+	ATTR_LIST(warning_dir_size),
 	ATTR_LIST(mb_stats),
 	ATTR_LIST(mb_max_to_scan),
 	ATTR_LIST(mb_min_to_scan),
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [lustre-devel] [PATCH 17/22] ext4: optimize ext4_journal_callback_add
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (15 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 16/22] ext4: add warning for directory htree growth James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  5:27   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 18/22] ext4: attach jinode in writepages James Simmons
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Change list_add_tail() to list_add(). This benefits ldiskfs in
tgt_cb_last_committed(): the thandles with the highest transaction
numbers are placed at the beginning of the list, so the first
iterations see the highest transno. This saves an extra call of
ptlrpc_commit_replies().

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/ext4_jbd2.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 75a5309..5ebf8ee 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -169,7 +169,7 @@ static inline void _ext4_journal_callback_add(handle_t *handle,
 			struct ext4_journal_cb_entry *jce)
 {
 	/* Add the jce to transaction's private list */
-	list_add_tail(&jce->jce_list, &handle->h_transaction->t_private_list);
+	list_add(&jce->jce_list, &handle->h_transaction->t_private_list);
 }
 
 static inline void ext4_journal_callback_add(handle_t *handle,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [lustre-devel] [PATCH 18/22] ext4: attach jinode in writepages
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (16 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 17/22] ext4: optimize ext4_journal_callback_add James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  1:23 ` [lustre-devel] [PATCH 19/22] ext4: don't check before replay James Simmons
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/ext4.h  | 1 +
 fs/ext4/inode.c | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5f73e19..0ee4606 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2643,6 +2643,7 @@ extern int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
 extern void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid);
 
 /* inode.c */
+#define HAVE_LDISKFS_INFO_JINODE
 int ext4_inode_is_fast_symlink(struct inode *inode);
 struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
 struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9296611..bf00dfc 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -731,6 +731,9 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
 		    !ext4_is_quota_file(inode) &&
 		    ext4_should_order_data(inode)) {
+			ret = ext4_inode_attach_jinode(inode);
+			if (ret)
+				return ret;
 			if (flags & EXT4_GET_BLOCKS_IO_SUBMIT)
 				ret = ext4_jbd2_inode_add_wait(handle, inode);
 			else
@@ -2815,6 +2818,9 @@ static int ext4_writepages(struct address_space *mapping,
 		mpd.last_page = wbc->range_end >> PAGE_SHIFT;
 	}
 
+	ret = ext4_inode_attach_jinode(inode);
+	if (ret)
+		return ret;
 	mpd.inode = inode;
 	mpd.wbc = wbc;
 	ext4_io_submit_init(&mpd.io_submit, wbc);
@@ -4432,6 +4438,7 @@ int ext4_inode_attach_jinode(struct inode *inode)
 		jbd2_free_inode(jinode);
 	return 0;
 }
+EXPORT_SYMBOL(ext4_inode_attach_jinode);
 
 /*
  * ext4_truncate()
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [lustre-devel] [PATCH 19/22] ext4: don't check before replay
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (17 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 18/22] ext4: attach jinode in writepages James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  5:29   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 20/22] ext4: use GFP_NOFS in ext4_inode_attach_jinode James Simmons
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

When ldiskfs runs in failover mode with a read-only disk, some
allocation updates are lost and ldiskfs may fail to mount because of
the inconsistent state of the group descriptors. Move the group
descriptor check to after journal replay.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/super.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a3179b2..b818acb 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4255,11 +4255,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		}
 	}
 	sbi->s_gdb_count = db_count;
-	if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
-		ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
-		ret = -EFSCORRUPTED;
-		goto failed_mount2;
-	}
 
 	timer_setup(&sbi->s_err_report, print_daily_error_info, 0);
 
@@ -4401,6 +4396,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
 
 no_journal:
+	if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
+		ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
+		ret = -EFSCORRUPTED;
+		goto failed_mount_wq;
+	}
+
 	if (!test_opt(sb, NO_MBCACHE)) {
 		sbi->s_ea_block_cache = ext4_xattr_create_cache();
 		if (!sbi->s_ea_block_cache) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [lustre-devel] [PATCH 20/22] ext4: use GFP_NOFS in ext4_inode_attach_jinode
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (18 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 19/22] ext4: don't check before replay James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  5:30   ` NeilBrown
  2019-07-22  1:23 ` [lustre-devel] [PATCH 21/22] ext4: export ext4_orphan_add James Simmons
  2019-07-22  1:23 ` [lustre-devel] [PATCH 22/22] ext4: export mb stream allocator variables James Simmons
  21 siblings, 1 reply; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf00dfc..ca72097 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4422,7 +4422,7 @@ int ext4_inode_attach_jinode(struct inode *inode)
 	if (ei->jinode || !EXT4_SB(inode->i_sb)->s_journal)
 		return 0;
 
-	jinode = jbd2_alloc_inode(GFP_KERNEL);
+	jinode = jbd2_alloc_inode(GFP_NOFS);
 	spin_lock(&inode->i_lock);
 	if (!ei->jinode) {
 		if (!jinode) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [lustre-devel] [PATCH 21/22] ext4: export ext4_orphan_add
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (19 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 20/22] ext4: use GFP_NOFS in ext4_inode_attach_jinode James Simmons
@ 2019-07-22  1:23 ` James Simmons
  2019-07-22  1:23 ` [lustre-devel] [PATCH 22/22] ext4: export mb stream allocator variables James Simmons
  21 siblings, 0 replies; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/namei.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 9b30cc6..6cb3f63 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -3615,6 +3615,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
 	ext4_std_error(sb, err);
 	return err;
 }
+EXPORT_SYMBOL(ext4_orphan_add);
 
 /*
  * ext4_orphan_del() removes an unlinked or truncated inode from the list
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [lustre-devel] [PATCH 22/22] ext4: export mb stream allocator variables
  2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
                   ` (20 preceding siblings ...)
  2019-07-22  1:23 ` [lustre-devel] [PATCH 21/22] ext4: export ext4_orphan_add James Simmons
@ 2019-07-22  1:23 ` James Simmons
  21 siblings, 0 replies; 53+ messages in thread
From: James Simmons @ 2019-07-22  1:23 UTC (permalink / raw)
  To: lustre-devel

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/ext4/ext4.h    |  2 ++
 fs/ext4/mballoc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/sysfs.c   |  4 ++++
 3 files changed, 65 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0ee4606..46f2619 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2620,6 +2620,8 @@ extern int ext4_init_inode_table(struct super_block *sb,
 /* mballoc.c */
 extern const struct file_operations ext4_seq_prealloc_table_fops;
 extern const struct seq_operations ext4_mb_seq_groups_ops;
+extern const struct file_operations ext4_seq_mb_last_group_fops;
+extern int ext4_mb_seq_last_start_seq_show(struct seq_file *m, void *v);
 extern long ext4_mb_stats;
 extern long ext4_mb_max_to_scan;
 extern int ext4_mb_init(struct super_block *);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 82398b0..8270e35 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2459,6 +2459,65 @@ static struct kmem_cache *get_groupinfo_cache(int blocksize_bits)
 	return cachep;
 }
 
+#define EXT4_MB_MAX_INPUT_STRING_SIZE 32
+
+static ssize_t ext4_mb_last_group_write(struct file *file,
+					const char __user *buf,
+					size_t cnt, loff_t *pos)
+{
+	char dummy[EXT4_MB_MAX_INPUT_STRING_SIZE + 1];
+	struct super_block *sb = PDE_DATA(file_inode(file));
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	unsigned long val;
+	char *end;
+
+	if (cnt > EXT4_MB_MAX_INPUT_STRING_SIZE)
+		return -EINVAL;
+	if (copy_from_user(dummy, buf, cnt))
+		return -EFAULT;
+	dummy[cnt] = '\0';
+	val = simple_strtoul(dummy, &end, 0);
+	if (dummy == end)
+		return -EINVAL;
+	if (val >= ext4_get_groups_count(sb))
+		return -ERANGE;
+	spin_lock(&sbi->s_md_lock);
+	sbi->s_mb_last_group = val;
+	sbi->s_mb_last_start = 0;
+	spin_unlock(&sbi->s_md_lock);
+	return cnt;
+}
+
+static int ext4_mb_seq_last_group_seq_show(struct seq_file *m, void *v)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(m->private);
+
+	seq_printf(m , "%ld\n", sbi->s_mb_last_group);
+	return 0;
+}
+
+static int ext4_mb_seq_last_group_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, ext4_mb_seq_last_group_seq_show, PDE_DATA(inode));
+}
+
+const struct file_operations ext4_seq_mb_last_group_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ext4_mb_seq_last_group_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+	.write		= ext4_mb_last_group_write,
+};
+
+int ext4_mb_seq_last_start_seq_show(struct seq_file *m, void *v)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(m->private);
+
+	seq_printf(m , "%ld\n", sbi->s_mb_last_start);
+	return 0;
+}
+
 /*
  * Allocate the top-level s_group_info array for the specified number
  * of groups
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 575f318..6bcb455 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -425,6 +425,10 @@ int ext4_register_sysfs(struct super_block *sb)
 				&ext4_mb_seq_groups_ops, sb);
 		proc_create_data("prealloc_table", S_IRUGO, sbi->s_proc,
 				 &ext4_seq_prealloc_table_fops, sb);
+		proc_create_data("mb_last_group", S_IRUGO, sbi->s_proc,
+				 &ext4_seq_mb_last_group_fops, sb);
+		proc_create_single_data("mb_last_start", S_IRUGO, sbi->s_proc,
+					ext4_mb_seq_last_start_seq_show, sb);
 	}
 	return 0;
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [lustre-devel] [PATCH 01/22] ext4: add i_fs_version
  2019-07-22  1:23 ` [lustre-devel] [PATCH 01/22] ext4: add i_fs_version James Simmons
@ 2019-07-22  4:13   ` NeilBrown
  2019-07-23  0:07     ` James Simmons
  2019-07-31 22:03     ` Andreas Dilger
  0 siblings, 2 replies; 53+ messages in thread
From: NeilBrown @ 2019-07-22  4:13 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> From: James Simmons <uja.ornl@yahoo.com>
>
> Add inode version field that osd-ldiskfs uses.

We need more words here.  What is i_fs_version used for?

(I'll probably find out as I keep reading, but I need to know
 the intention first, or I cannot properly review the patches
 that make use of it.
)
NeilBrown


>
> Signed-off-by: James Simmons <uja.ornl@yahoo.com>
> ---
>  fs/ext4/ext4.h   | 2 ++
>  fs/ext4/ialloc.c | 1 +
>  fs/ext4/inode.c  | 4 ++--
>  3 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 1cb6785..8abbcab 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1063,6 +1063,8 @@ struct ext4_inode_info {
>  	struct dquot *i_dquot[MAXQUOTAS];
>  #endif
>  
> +	u64 i_fs_version;
> +
>  	/* Precomputed uuid+inum+igen checksum for seeding inode checksums */
>  	__u32 i_csum_seed;
>  
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 764ff4c..09ae4a4 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -1100,6 +1100,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>  	ei->i_dtime = 0;
>  	ei->i_block_group = group;
>  	ei->i_last_alloc_group = ~0;
> +	ei->i_fs_version = 0;
>  
>  	ext4_set_inode_flags(inode);
>  	if (IS_DIRSYNC(inode))
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c7f77c6..6e66175 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4806,14 +4806,14 @@ static inline void ext4_inode_set_iversion_queried(struct inode *inode, u64 val)
>  	if (unlikely(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
>  		inode_set_iversion_raw(inode, val);
>  	else
> -		inode_set_iversion_queried(inode, val);
> +		EXT4_I(inode)->i_fs_version = val;
>  }
>  static inline u64 ext4_inode_peek_iversion(const struct inode *inode)
>  {
>  	if (unlikely(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
>  		return inode_peek_iversion_raw(inode);
>  	else
> -		return inode_peek_iversion(inode);
> +		return EXT4_I(inode)->i_fs_version;
>  }
>  
>  struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
> -- 
> 1.8.3.1
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20190722/4860f6d8/attachment.sig>


* [lustre-devel] [PATCH 02/22] ext4: use d_find_alias() in ext4_lookup
  2019-07-22  1:23 ` [lustre-devel] [PATCH 02/22] ext4: use d_find_alias() in ext4_lookup James Simmons
@ 2019-07-22  4:16   ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2019-07-22  4:16 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> From: James Simmons <uja.ornl@yahoo.com>
>
> Really old bug that might not even exist anymore?

If there is a bug, it is somewhere else.
ext4_lookup() should never be called on "." or "..".
Whatever calls ext4_lookup() must filter those out if there is any
possibility of them.

NeilBrown
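
[Editorial note: the name test the quoted patch adds below can be extracted
as a standalone predicate.  A minimal sketch; is_dot_or_dotdot is a
hypothetical helper name, the real code tests dentry->d_name inline.]

```c
#include <stdbool.h>
#include <stddef.h>

/* The condition the patch tests: the name is "." (len 1) or ".."
 * (len 2), matching the dentry->d_name.name / d_name.len checks. */
static bool is_dot_or_dotdot(const char *name, size_t len)
{
	return name[0] == '.' &&
	       (len == 1 || (len == 2 && name[1] == '.'));
}
```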


>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  fs/ext4/namei.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
>
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index cd01c4a..a616f58 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -1664,6 +1664,35 @@ static struct dentry *ext4_lookup(struct inode *dir, struct dentry *dentry, unsi
>  		}
>  	}
>  
> +	/* ".." shouldn't go into dcache to preserve dcache hierarchy
> +	 * otherwise we'll get parent being a child of actual child.
> +	 * see bug 10458 for details -bzzz
> +	 */
> +	if (inode && (dentry->d_name.name[0] == '.' &&
> +		      (dentry->d_name.len == 1 || (dentry->d_name.len == 2 &&
> +						   dentry->d_name.name[1] == '.')))) {
> +		struct dentry *goal = NULL;
> +
> +		/* first, look for an existing dentry - any one is good */
> +		goal = d_find_any_alias(inode);
> +		if (!goal) {
> +			spin_lock(&dentry->d_lock);
> +			/* there is no alias, we need to make current dentry:
> +			 *  a) inaccessible for __d_lookup()
> +			 *  b) inaccessible for iopen
> +			 */
> +			J_ASSERT(hlist_unhashed(&dentry->d_u.d_alias));
> +			dentry->d_flags |= DCACHE_NFSFS_RENAMED;
> +			/* this is d_instantiate() ... */
> +			hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
> +			dentry->d_inode = inode;
> +			spin_unlock(&dentry->d_lock);
> +		}
> +		if (goal)
> +			iput(inode);
> +		return goal;
> +	}
> +
>  #ifdef CONFIG_UNICODE
>  	if (!inode && IS_CASEFOLDED(dir)) {
>  		/* Eventually we want to call d_add_ci(dentry, NULL)
> -- 
> 1.8.3.1

* [lustre-devel] [PATCH 03/22] ext4: prealloc table optimization
  2019-07-22  1:23 ` [lustre-devel] [PATCH 03/22] ext4: prealloc table optimization James Simmons
@ 2019-07-22  4:29   ` NeilBrown
  2019-08-05  7:07   ` Artem Blagodarenko
  1 sibling, 0 replies; 53+ messages in thread
From: NeilBrown @ 2019-07-22  4:29 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> Optimize prealloc table

"Optimize"???  What does that even mean here?

I notice that the patch removes a comment:

  /* XXX: should this table be tunable? */

and at least some of what the patch does is to make that table tunable.
There should be a patch which *only* makes that table tunable, hopefully
with a suggestion of why this is a good thing, and then anything else
needs to go in a separate patch (with text telling me why it is an
improvement).

NeilBrown

* [lustre-devel] [PATCH 04/22] ext4: export inode management
  2019-07-22  1:23 ` [lustre-devel] [PATCH 04/22] ext4: export inode management James Simmons
@ 2019-07-22  4:34   ` NeilBrown
  2019-07-22  7:16     ` Oleg Drokin
  0 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2019-07-22  4:34 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> Make ext4_delete_entry() exportable for osd-ldiskfs. Also add
> exportable ext4_create_inode().

Why can't lustre use vfs_unlink like everyone else?  And vfs_mknod()?
My hope is that what we land upstream will use generic interfaces only.
If there is something that osd needs to do that the generic interfaces
cannot support, then we need to look at how to add that to the generic
interface.

NeilBrown

* [lustre-devel] [PATCH 06/22] ext4: add extra checks for mballoc
  2019-07-22  1:23 ` [lustre-devel] [PATCH 06/22] ext4: add extra checks for mballoc James Simmons
@ 2019-07-22  4:37   ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2019-07-22  4:37 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> Handle mballoc corruptions.

"Handle" here means "return -EIO if corruptions are found" ??

Sounds like a good thing to do.  I cannot comment on the code without
understanding the fs more, but...

> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> index 88c98f1..8325ad9 100644
> --- a/fs/ext4/mballoc.h
> +++ b/fs/ext4/mballoc.h
> @@ -70,7 +70,7 @@
>  /*
>   * for which requests use 2^N search using buddies
>   */
> -#define MB_DEFAULT_ORDER2_REQS		2
> +#define MB_DEFAULT_ORDER2_REQS		8

I don't think this belongs with the rest of the patch.

NeilBrown


>  
>  /*
>   * default group prealloc size 512 blocks
> -- 
> 1.8.3.1

* [lustre-devel] [PATCH 07/22] ext4: update .. for hash indexed directory
  2019-07-22  1:23 ` [lustre-devel] [PATCH 07/22] ext4: update .. for hash indexed directory James Simmons
@ 2019-07-22  4:45   ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2019-07-22  4:45 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> Handle special .. case.

This seems to be for adding ".." to a directory.
That happens at "mkdir".  When else would you want to do that?
Is this an on-line fsck-repair thing?

NeilBrown

* [lustre-devel] [PATCH 08/22] ext4: kill off struct dx_root
  2019-07-22  1:23 ` [lustre-devel] [PATCH 08/22] ext4: kill off struct dx_root James Simmons
@ 2019-07-22  4:52   ` NeilBrown
  2019-07-23  2:07     ` Andreas Dilger
  2019-08-05  7:31     ` Artem Blagodarenko
  0 siblings, 2 replies; 53+ messages in thread
From: NeilBrown @ 2019-07-22  4:52 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> Use struct dx_root_info directly.

This looks like a style change.
Instead of a 'struct' that models the start of an on-disk directory
block, the location of some fields is computed with explicit offsets.

Why would we want to do that?  Might we want dx_root_info to be at a
different offset in the directory ??

NeilBrown

* [lustre-devel] [PATCH 09/22] ext4: fix mballoc pa free mismatch
  2019-07-22  1:23 ` [lustre-devel] [PATCH 09/22] ext4: fix mballoc pa free mismatch James Simmons
@ 2019-07-22  4:56   ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2019-07-22  4:56 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> Introduce pa_error so we can find any mballoc pa cleanup problems.

This patch seems to make sense, though
> +	BUG_ON(atomic_read(&sb->s_active) > 0 && pa->pa_free != free);

should probably be a WARN_ON and

> +#include <linux/genhd.h>

seems un-called for.

NeilBrown
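
[Editorial note: the check behind the "pa free mismatch" can be sketched in
userspace.  The patch walks the block bitmap over the preallocation's range
and compares the count of still-free blocks with the recorded pa_free.  A
simplified model, assuming one byte per block rather than one bit;
count_free is a hypothetical name.]

```c
#include <stdint.h>

/* Count still-free (zero) entries in bitmap[start .. start+len);
 * a healthy preallocation has exactly pa_free of them, and the patch
 * reports (rather than BUGs on) any mismatch. */
static unsigned count_free(const uint8_t *bitmap, unsigned start,
			   unsigned len)
{
	unsigned i, nfree = 0;

	for (i = start; i < start + len; i++)
		if (bitmap[i] == 0)
			nfree++;
	return nfree;
}
```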


>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  fs/ext4/mballoc.c | 45 +++++++++++++++++++++++++++++++++++++++------
>  fs/ext4/mballoc.h |  2 ++
>  2 files changed, 41 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 483fc0f..463fba6 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -3863,6 +3863,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
>  	INIT_LIST_HEAD(&pa->pa_group_list);
>  	pa->pa_deleted = 0;
>  	pa->pa_type = MB_INODE_PA;
> +	pa->pa_error = 0;
>  
>  	mb_debug(1, "new inode pa %p: %llu/%u for %u\n", pa,
>  			pa->pa_pstart, pa->pa_len, pa->pa_lstart);
> @@ -3924,6 +3925,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
>  	INIT_LIST_HEAD(&pa->pa_group_list);
>  	pa->pa_deleted = 0;
>  	pa->pa_type = MB_GROUP_PA;
> +	pa->pa_error = 0;
>  
>  	mb_debug(1, "new group pa %p: %llu/%u for %u\n", pa,
>  			pa->pa_pstart, pa->pa_len, pa->pa_lstart);
> @@ -3983,7 +3985,9 @@ static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
>  	unsigned long long grp_blk_start;
>  	int free = 0;
>  
> +	assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group));
>  	BUG_ON(pa->pa_deleted == 0);
> +	BUG_ON(pa->pa_inode == NULL);
>  	ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
>  	grp_blk_start = pa->pa_pstart - EXT4_C2B(sbi, bit);
>  	BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
> @@ -4006,12 +4010,19 @@ static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
>  		mb_free_blocks(pa->pa_inode, e4b, bit, next - bit);
>  		bit = next + 1;
>  	}
> -	if (free != pa->pa_free) {
> -		ext4_msg(e4b->bd_sb, KERN_CRIT,
> -			 "pa %p: logic %lu, phys. %lu, len %lu",
> -			 pa, (unsigned long) pa->pa_lstart,
> -			 (unsigned long) pa->pa_pstart,
> -			 (unsigned long) pa->pa_len);
> +
> +	/* "free < pa->pa_free" means we maybe double alloc the same blocks,
> +	 * otherwise maybe leave some free blocks unavailable, no need to BUG.
> +	 */
> +	if ((free > pa->pa_free && !pa->pa_error) || (free < pa->pa_free)) {
> +		ext4_error(sb, "pa free mismatch: [pa %p] "
> +				"[phy %lu] [logic %lu] [len %u] [free %u] "
> +				"[error %u] [inode %lu] [freed %u]", pa,
> +				(unsigned long)pa->pa_pstart,
> +				(unsigned long)pa->pa_lstart,
> +				(unsigned)pa->pa_len, (unsigned)pa->pa_free,
> +				(unsigned)pa->pa_error, pa->pa_inode->i_ino,
> +				free);
>  		ext4_grp_locked_error(sb, group, 0, 0, "free %u, pa_free %u",
>  					free, pa->pa_free);
>  		/*
> @@ -4019,6 +4030,8 @@ static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
>  		 * from the bitmap and continue.
>  		 */
>  	}
> +	/* do not verify if the file system is being umounted */
> +	BUG_ON(atomic_read(&sb->s_active) > 0 && pa->pa_free != free);
>  	atomic_add(free, &sbi->s_mb_discarded);
>  
>  	return 0;
> @@ -4764,6 +4777,26 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
>  		ac->ac_b_ex.fe_len = 0;
>  		ar->len = 0;
>  		ext4_mb_show_ac(ac);
> +		if (ac->ac_pa) {
> +			struct ext4_prealloc_space *pa = ac->ac_pa;
> +
> +			/* We can not make sure whether the bitmap has
> +			 * been updated or not when fail case. So can
> +			 * not revert pa_free back, just mark pa_error
> +			 */
> +			pa->pa_error++;
> +			ext4_error(sb,
> +				   "Updating bitmap error: [err %d] "
> +				   "[pa %p] [phy %lu] [logic %lu] "
> +				   "[len %u] [free %u] [error %u] "
> +				   "[inode %lu]", *errp, pa,
> +				   (unsigned long)pa->pa_pstart,
> +				   (unsigned long)pa->pa_lstart,
> +				   (unsigned)pa->pa_len,
> +				   (unsigned)pa->pa_free,
> +				   (unsigned)pa->pa_error,
> +				   pa->pa_inode ? pa->pa_inode->i_ino : 0);
> +		}
>  	}
>  	ext4_mb_release_context(ac);
>  out:
> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> index 8325ad9..e00c3b7 100644
> --- a/fs/ext4/mballoc.h
> +++ b/fs/ext4/mballoc.h
> @@ -20,6 +20,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/blkdev.h>
>  #include <linux/mutex.h>
> +#include <linux/genhd.h>
>  #include "ext4_jbd2.h"
>  #include "ext4.h"
>  
> @@ -111,6 +112,7 @@ struct ext4_prealloc_space {
>  	ext4_grpblk_t		pa_len;		/* len of preallocated chunk */
>  	ext4_grpblk_t		pa_free;	/* how many blocks are free */
>  	unsigned short		pa_type;	/* pa type. inode or group */
> +	unsigned short		pa_error;
>  	spinlock_t		*pa_obj_lock;
>  	struct inode		*pa_inode;	/* hack, for history only */
>  };
> -- 
> 1.8.3.1

* [lustre-devel] [PATCH 11/22] ext4: over ride current_time
  2019-07-22  1:23 ` [lustre-devel] [PATCH 11/22] ext4: over ride current_time James Simmons
@ 2019-07-22  5:06   ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2019-07-22  5:06 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> Return i_ctime if IS_NOCMTIME is true for the inode.

So ... this prevents ext4 from ever changing the i_ctime
on any file with S_NOCMTIME set - and osd-ldiskfs sets that on
all inodes.

Presumably osd wants full control of the ctime.

That's probably a reasonable goal.  I think it should be achieved by
adding a mount/sb flag which suppresses all m/ctime updates except those
made through utimes().

NeilBrown
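
[Editorial note: the effect being discussed can be modelled in userspace.
A minimal sketch only; struct model_inode and model_current_time are
hypothetical names standing in for the kernel's inode and the patched
ext4_current_time().]

```c
#include <stdbool.h>

/* With S_NOCMTIME set (as osd-ldiskfs does on every inode), the
 * override hands back the stored ctime unchanged, so ordinary ext4
 * code paths can never advance the timestamp - only explicit setters
 * (e.g. utimes) can. */
struct model_inode {
	bool nocmtime;  /* stands in for IS_NOCMTIME(inode) */
	long ctime;     /* stands in for inode->i_ctime     */
};

static long model_current_time(const struct model_inode *inode, long now)
{
	if (inode->nocmtime)
		return inode->ctime;  /* timestamp frozen for ext4 */
	return now;                   /* normal current_time() path */
}
```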


>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  fs/ext4/ext4.h | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 51b6159..80601a9 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -661,6 +661,13 @@ enum {
>  #define EXT4_GOING_FLAGS_LOGFLUSH		0x1	/* flush log but not data */
>  #define EXT4_GOING_FLAGS_NOLOGFLUSH		0x2	/* don't flush log nor data */
>  
> +static inline struct timespec64 ext4_current_time(struct inode *inode)
> +{
> +	if (IS_NOCMTIME(inode))
> +		return inode->i_ctime;
> +	return current_time(inode);
> +}
> +#define current_time(a) ext4_current_time(a)
>  
>  #if defined(__KERNEL__) && defined(CONFIG_COMPAT)
>  /*
> -- 
> 1.8.3.1

* [lustre-devel] [PATCH 12/22] ext4: add htree lock implementation
  2019-07-22  1:23 ` [lustre-devel] [PATCH 12/22] ext4: add htree lock implementation James Simmons
@ 2019-07-22  5:10   ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2019-07-22  5:10 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> Used for parallel lookup

This needs more time for review than I have just now ...
I suspect the full EX,PW,PR,CW,CR locking is overkill, but
I won't know until I read the code (unlikely to happen this year).

Also, we need parallel lookups at the VFS level - I do have code for
that.

NeilBrown
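
[Editorial note: the EX,PW,PR,CW,CR mode names come from the classic
VAX/DLM lock model.  Its standard compatibility matrix is sketched below;
whether htree_lock encodes exactly this table is an assumption, since the
implementation is not quoted here.]

```c
#include <stdbool.h>

/* Classic DLM lock modes, strongest to weakest (NL = no lock). */
enum lock_mode { LCK_EX, LCK_PW, LCK_PR, LCK_CW, LCK_CR, LCK_NL };

/* Standard DLM compatibility matrix: compat[a][b] is true when a
 * granted lock in mode a allows a new lock in mode b. */
static const bool compat[6][6] = {
	/*            EX     PW     PR     CW     CR     NL  */
	/* EX */    { false, false, false, false, false, true  },
	/* PW */    { false, false, false, false, true,  true  },
	/* PR */    { false, false, true,  false, true,  true  },
	/* CW */    { false, false, false, true,  true,  true  },
	/* CR */    { false, true,  true,  true,  true,  true  },
	/* NL */    { true,  true,  true,  true,  true,  true  },
};

static bool lock_compat(enum lock_mode a, enum lock_mode b)
{
	return compat[a][b];
}
```

The "overkill" question above amounts to asking whether a directory-tree
lock really needs all five non-NL modes, or whether plain shared/exclusive
(PR/EX) would do.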

* [lustre-devel] [PATCH 13/22] ext4: Add a proc interface for max_dir_size.
  2019-07-22  1:23 ` [lustre-devel] [PATCH 13/22] ext4: Add a proc interface for max_dir_size James Simmons
@ 2019-07-22  5:14   ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2019-07-22  5:14 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  fs/ext4/sysfs.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> index 1375815..3a71a16 100644
> --- a/fs/ext4/sysfs.c
> +++ b/fs/ext4/sysfs.c
> @@ -180,6 +180,8 @@ static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
>  EXT4_ATTR_OFFSET(inode_readahead_blks, 0644, inode_readahead,
>  		 ext4_sb_info, s_inode_readahead_blks);
>  EXT4_RW_ATTR_SBI_UI(inode_goal, s_inode_goal);
> +EXT4_RW_ATTR_SBI_UI(max_dir_size, s_max_dir_size_kb);
> +EXT4_RW_ATTR_SBI_UI(max_dir_size_kb, s_max_dir_size_kb);

Why do we need both?


>  EXT4_RW_ATTR_SBI_UI(mb_stats, s_mb_stats);
>  EXT4_RW_ATTR_SBI_UI(mb_max_to_scan, s_mb_max_to_scan);
>  EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
> @@ -210,6 +212,8 @@ static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
>  	ATTR_LIST(reserved_clusters),
>  	ATTR_LIST(inode_readahead_blks),
>  	ATTR_LIST(inode_goal),
> +	ATTR_LIST(max_dir_size),
> +	ATTR_LIST(max_dir_size_kb),
>  	ATTR_LIST(mb_stats),
>  	ATTR_LIST(mb_max_to_scan),
>  	ATTR_LIST(mb_min_to_scan),
> @@ -350,6 +354,8 @@ static ssize_t ext4_attr_store(struct kobject *kobj,
>  		ret = kstrtoul(skip_spaces(buf), 0, &t);
>  		if (ret)
>  			return ret;
> +		if (strcmp("max_dir_size", a->attr.name) == 0)
> +			t >>= 10;

Given that the attrs are read/write, surely we need a <<= 10 in the read
path too??

NeilBrown
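
[Editorial note: the asymmetry being pointed out can be sketched in two
lines.  store_max_dir_size is what the patch's write path does; the
show_max_dir_size counterpart is the shift the read path would need but the
patch does not add - both names are hypothetical.]

```c
/* s_max_dir_size_kb is stored in KiB.  The byte-named "max_dir_size"
 * attribute shifts down on store, so a symmetric read path must shift
 * back up; without it, reading the byte-named attribute returns KiB. */
static unsigned long store_max_dir_size(unsigned long bytes)
{
	return bytes >> 10;   /* what the patch does on write */
}

static unsigned long show_max_dir_size(unsigned long kb)
{
	return kb << 10;      /* the missing read-path shift */
}
```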


>  		if (a->attr_ptr == ptr_ext4_super_block_offset)
>  			*((__le32 *) ptr) = cpu_to_le32(t);
>  		else
> -- 
> 1.8.3.1

* [lustre-devel] [PATCH 14/22] ext4: remove inode_lock handling
  2019-07-22  1:23 ` [lustre-devel] [PATCH 14/22] ext4: remove inode_lock handling James Simmons
@ 2019-07-22  5:16   ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2019-07-22  5:16 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> Invoking ext4_truncate() with i_mutex held would cause a deadlock in
> lustre. Since lustre has its own lock to provide protection, we don't
> need this check at all.

I guess you mean that if you took i_mutex in lustre, that would cause a
deadlock?

More evidence needed, and I strongly suspect that this is a bad idea.

NeilBrown


>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  fs/ext4/inode.c | 2 --
>  fs/ext4/namei.c | 4 ----
>  2 files changed, 6 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5561351..9296611 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4474,8 +4474,6 @@ int ext4_truncate(struct inode *inode)
>  	 * or it's a completely new inode. In those cases we might not
>  	 * have i_mutex locked because it's not necessary.
>  	 */
> -	if (!(inode->i_state & (I_NEW|I_FREEING)))
> -		WARN_ON(!inode_is_locked(inode));
>  	trace_ext4_truncate_enter(inode);
>  
>  	if (!ext4_can_truncate(inode))
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 0153c4d..1b6d22a 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -3485,8 +3485,6 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
>  	if (!sbi->s_journal || is_bad_inode(inode))
>  		return 0;
>  
> -	WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
> -		     !inode_is_locked(inode));
>  	/*
>  	 * Exit early if inode already is on orphan list. This is a big speedup
>  	 * since we don't have to contend on the global s_orphan_lock.
> @@ -3569,8 +3567,6 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
>  	if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
>  		return 0;
>  
> -	WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
> -		     !inode_is_locked(inode));
>  	/* Do this quick check before taking global s_orphan_lock. */
>  	if (list_empty(&ei->i_orphan))
>  		return 0;
> -- 
> 1.8.3.1

* [lustre-devel] [PATCH 15/22] ext4: remove bitmap corruption warnings
  2019-07-22  1:23 ` [lustre-devel] [PATCH 15/22] ext4: remove bitmap corruption warnings James Simmons
@ 2019-07-22  5:18   ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2019-07-22  5:18 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> Since we can skip corrupt block groups, this patch uses ext4_warning()
> instead of ext4_error() so the FS is not mounted read-only by default.

Does this relate to the previous patch:
  ext4: add extra checks for mballoc

??
If so, it seems like a good idea, but the two patches should be
consecutive.
The description could read "Now that we skip corrupt ....".

NeilBrown


>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  fs/ext4/balloc.c  | 10 ++++-----
>  fs/ext4/ialloc.c  |  6 ++---
>  fs/ext4/mballoc.c | 66 +++++++++++++++++++++++--------------------------------
>  3 files changed, 35 insertions(+), 47 deletions(-)
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index bca75c1..aff3981 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -374,7 +374,7 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
>  	if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group,
>  			desc, bh))) {
>  		ext4_unlock_group(sb, block_group);
> -		ext4_error(sb, "bg %u: bad block bitmap checksum", block_group);
> +		ext4_warning(sb, "bg %u: bad block bitmap checksum", block_group);
>  		ext4_mark_group_bitmap_corrupted(sb, block_group,
>  					EXT4_GROUP_INFO_BBITMAP_CORRUPT);
>  		return -EFSBADCRC;
> @@ -382,8 +382,8 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
>  	blk = ext4_valid_block_bitmap(sb, desc, block_group, bh);
>  	if (unlikely(blk != 0)) {
>  		ext4_unlock_group(sb, block_group);
> -		ext4_error(sb, "bg %u: block %llu: invalid block bitmap",
> -			   block_group, blk);
> +		ext4_warning(sb, "bg %u: block %llu: invalid block bitmap",
> +			     block_group, blk);
>  		ext4_mark_group_bitmap_corrupted(sb, block_group,
>  					EXT4_GROUP_INFO_BBITMAP_CORRUPT);
>  		return -EFSCORRUPTED;
> @@ -459,8 +459,8 @@ struct buffer_head *
>  		ext4_unlock_group(sb, block_group);
>  		unlock_buffer(bh);
>  		if (err) {
> -			ext4_error(sb, "Failed to init block bitmap for group "
> -				   "%u: %d", block_group, err);
> +			ext4_warning(sb, "Failed to init block bitmap for group "
> +				     "%u: %d", block_group, err);
>  			goto out;
>  		}
>  		goto verify;
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 68d41e6..a72fe63 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -96,8 +96,8 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
>  	if (!ext4_inode_bitmap_csum_verify(sb, block_group, desc, bh,
>  					   EXT4_INODES_PER_GROUP(sb) / 8)) {
>  		ext4_unlock_group(sb, block_group);
> -		ext4_error(sb, "Corrupt inode bitmap - block_group = %u, "
> -			   "inode_bitmap = %llu", block_group, blk);
> +		ext4_warning(sb, "Corrupt inode bitmap - block_group = %u, "
> +			     "inode_bitmap = %llu", block_group, blk);
>  		ext4_mark_group_bitmap_corrupted(sb, block_group,
>  					EXT4_GROUP_INFO_IBITMAP_CORRUPT);
>  		return -EFSBADCRC;
> @@ -346,7 +346,7 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
>  		if (!fatal)
>  			fatal = err;
>  	} else {
> -		ext4_error(sb, "bit already cleared for inode %lu", ino);
> +		ext4_warning(sb, "bit already cleared for inode %lu", ino);
>  		ext4_mark_group_bitmap_corrupted(sb, block_group,
>  					EXT4_GROUP_INFO_IBITMAP_CORRUPT);
>  	}
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 463fba6..82398b0 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -741,10 +741,15 @@ int ext4_mb_generate_buddy(struct super_block *sb,
>  	grp->bb_fragments = fragments;
>  
>  	if (free != grp->bb_free) {
> -		ext4_grp_locked_error(sb, group, 0, 0,
> -				      "block bitmap and bg descriptor "
> -				      "inconsistent: %u vs %u free clusters",
> -				      free, grp->bb_free);
> +		struct ext4_group_desc *gdp;
> +
> +		gdp = ext4_get_group_desc(sb, group, NULL);
> +		ext4_warning(sb, "group %lu: block bitmap and bg descriptor "
> +			     "inconsistent: %u vs %u free clusters "
> +			     "%u in gd, %lu pa's",
> +			     (long unsigned int)group, free, grp->bb_free,
> +			     ext4_free_group_clusters(sb, gdp),
> +			     grp->bb_prealloc_nr);
>  		/*
>  		 * If we intend to continue, we consider group descriptor
>  		 * corrupt and update bb_free using bitmap value
> @@ -1106,7 +1111,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
>  	int block;
>  	int pnum;
>  	int poff;
> -	struct page *page;
> +	struct page *page = NULL;
>  	int ret;
>  	struct ext4_group_info *grp;
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> @@ -1132,7 +1137,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
>  		 */
>  		ret = ext4_mb_init_group(sb, group, gfp);
>  		if (ret)
> -			return ret;
> +			goto err;
>  	}
>  
>  	/*
> @@ -1235,6 +1240,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
>  		put_page(e4b->bd_buddy_page);
>  	e4b->bd_buddy = NULL;
>  	e4b->bd_bitmap = NULL;
> +	ext4_warning(sb, "Error loading buddy information for %u", group);
>  	return ret;
>  }
>  
> @@ -3631,8 +3637,10 @@ int ext4_mb_check_ondisk_bitmap(struct super_block *sb, void *bitmap,
>  	}
>  
>  	if (free != free_in_gdp) {
> -		ext4_error(sb, "on-disk bitmap for group %d corrupted: %u blocks free in bitmap, %u - in gd\n",
> -			   group, free, free_in_gdp);
> +		ext4_warning(sb, "on-disk bitmap for group %d corrupted: %u blocks free in bitmap, %u - in gd\n",
> +			     group, free, free_in_gdp);
> +		ext4_mark_group_bitmap_corrupted(sb, group,
> +						 EXT4_GROUP_INFO_BBITMAP_CORRUPT);
>  		return -EIO;
>  	}
>  	return 0;
> @@ -4015,14 +4023,6 @@ static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
>  	 * otherwise maybe leave some free blocks unavailable, no need to BUG.
>  	 */
>  	if ((free > pa->pa_free && !pa->pa_error) || (free < pa->pa_free)) {
> -		ext4_error(sb, "pa free mismatch: [pa %p] "
> -				"[phy %lu] [logic %lu] [len %u] [free %u] "
> -				"[error %u] [inode %lu] [freed %u]", pa,
> -				(unsigned long)pa->pa_pstart,
> -				(unsigned long)pa->pa_lstart,
> -				(unsigned)pa->pa_len, (unsigned)pa->pa_free,
> -				(unsigned)pa->pa_error, pa->pa_inode->i_ino,
> -				free);
>  		ext4_grp_locked_error(sb, group, 0, 0, "free %u, pa_free %u",
>  					free, pa->pa_free);
>  		/*
> @@ -4086,15 +4086,11 @@ static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
>  	bitmap_bh = ext4_read_block_bitmap(sb, group);
>  	if (IS_ERR(bitmap_bh)) {
>  		err = PTR_ERR(bitmap_bh);
> -		ext4_error(sb, "Error %d reading block bitmap for %u",
> -			   err, group);
>  		return 0;
>  	}
>  
>  	err = ext4_mb_load_buddy(sb, group, &e4b);
>  	if (err) {
> -		ext4_warning(sb, "Error %d loading buddy information for %u",
> -			     err, group);
>  		put_bh(bitmap_bh);
>  		return 0;
>  	}
> @@ -4255,17 +4251,12 @@ void ext4_discard_preallocations(struct inode *inode)
>  
>  		err = ext4_mb_load_buddy_gfp(sb, group, &e4b,
>  					     GFP_NOFS|__GFP_NOFAIL);
> -		if (err) {
> -			ext4_error(sb, "Error %d loading buddy information for %u",
> -				   err, group);
> +		if (err)
>  			return;
> -		}
>  
>  		bitmap_bh = ext4_read_block_bitmap(sb, group);
>  		if (IS_ERR(bitmap_bh)) {
>  			err = PTR_ERR(bitmap_bh);
> -			ext4_error(sb, "Error %d reading block bitmap for %u",
> -					err, group);
>  			ext4_mb_unload_buddy(&e4b);
>  			continue;
>  		}
> @@ -4527,11 +4518,8 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
>  		group = ext4_get_group_number(sb, pa->pa_pstart);
>  		err = ext4_mb_load_buddy_gfp(sb, group, &e4b,
>  					     GFP_NOFS|__GFP_NOFAIL);
> -		if (err) {
> -			ext4_error(sb, "Error %d loading buddy information for %u",
> -				   err, group);
> +		if (err)
>  			continue;
> -		}
>  		ext4_lock_group(sb, group);
>  		list_del(&pa->pa_group_list);
>  		ext4_get_group_info(sb, group)->bb_prealloc_nr--;
> @@ -4785,7 +4773,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
>  			 * not revert pa_free back, just mark pa_error
>  			 */
>  			pa->pa_error++;
> -			ext4_error(sb,
> +			ext4_warning(sb,
>  				   "Updating bitmap error: [err %d] "
>  				   "[pa %p] [phy %lu] [logic %lu] "
>  				   "[len %u] [free %u] [error %u] "
> @@ -4796,6 +4784,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
>  				   (unsigned)pa->pa_free,
>  				   (unsigned)pa->pa_error,
>  				   pa->pa_inode ? pa->pa_inode->i_ino : 0);
> +			ext4_mark_group_bitmap_corrupted(sb, 0, 0);
>  		}
>  	}
>  	ext4_mb_release_context(ac);
> @@ -5081,7 +5070,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
>  	err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b,
>  				     GFP_NOFS|__GFP_NOFAIL);
>  	if (err)
> -		goto error_return;
> +		goto error_brelse;
>  
>  	/*
>  	 * We need to make sure we don't reuse the freed block until after the
> @@ -5171,8 +5160,9 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
>  		goto do_more;
>  	}
>  error_return:
> -	brelse(bitmap_bh);
>  	ext4_std_error(sb, err);
> +error_brelse:
> +	brelse(bitmap_bh);
>  	return;
>  }
>  
> @@ -5272,7 +5262,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
>  
>  	err = ext4_mb_load_buddy(sb, block_group, &e4b);
>  	if (err)
> -		goto error_return;
> +		goto error_brelse;
>  
>  	/*
>  	 * need to update group_info->bb_free and bitmap
> @@ -5310,8 +5300,9 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
>  		err = ret;
>  
>  error_return:
> -	brelse(bitmap_bh);
>  	ext4_std_error(sb, err);
> +error_brelse:
> +	brelse(bitmap_bh);
>  	return err;
>  }
>  
> @@ -5386,11 +5377,8 @@ static int ext4_trim_extent(struct super_block *sb, int start, int count,
>  	trace_ext4_trim_all_free(sb, group, start, max);
>  
>  	ret = ext4_mb_load_buddy(sb, group, &e4b);
> -	if (ret) {
> -		ext4_warning(sb, "Error %d loading buddy information for %u",
> -			     ret, group);
> +	if (ret)
>  		return ret;
> -	}
>  	bitmap = e4b.bd_bitmap;
>  
>  	ext4_lock_group(sb, group);
> -- 
> 1.8.3.1
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20190722/2a84fac1/attachment.sig>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [lustre-devel] [PATCH 16/22] ext4: add warning for directory htree growth
  2019-07-22  1:23 ` [lustre-devel] [PATCH 16/22] ext4: add warning for directory htree growth James Simmons
@ 2019-07-22  5:24   ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2019-07-22  5:24 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

You have really out-done yourself with the commit message here !!!

I think this allows lustre to generate warnings if any single directory
exceeds some particular size ??
And by default, the max size is 33% larger than the first directory that
anything is added to??
I guess lustre just uses one big directory??

I appreciate that this might be useful functionality.  I suspect a
better interface is needed.

NeilBrown


> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  fs/ext4/ext4.h  |  1 +
>  fs/ext4/namei.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/ext4/super.c |  2 ++
>  fs/ext4/sysfs.c |  2 ++
>  4 files changed, 72 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index bf74c7c..5f73e19 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1472,6 +1472,7 @@ struct ext4_sb_info {
>  	unsigned int s_mb_group_prealloc;
>  	unsigned long *s_mb_prealloc_table;
>  	unsigned int s_max_dir_size_kb;
> +	unsigned long s_warning_dir_size;
>  	/* where last allocation was done - for stream allocation */
>  	unsigned long s_mb_last_group;
>  	unsigned long s_mb_last_start;
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 1b6d22a..9b30cc6 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -757,12 +757,20 @@ struct ext4_dir_lock_data {
>  #define ext4_htree_lock_data(l)	((struct ext4_dir_lock_data *)(l)->lk_private)
>  #define ext4_find_entry(dir, name, dirent, inline) \
>  			ext4_find_entry_locked(dir, name, dirent, inline, NULL)
> -#define ext4_add_entry(handle, dentry, inode) \
> -		       ext4_add_entry_locked(handle, dentry, inode, NULL)
>  
>  /* NB: ext4_lblk_t is 32 bits so we use high bits to identify invalid blk */
>  #define EXT4_HTREE_NODE_CHANGED		(0xcafeULL << 32)
>  
> +inline int ext4_add_entry(handle_t *handle, struct dentry *dentry,
> +			  struct inode *inode)
> +{
> +	int ret = ext4_add_entry_locked(handle, dentry, inode, NULL);
> +
> +	if (ret == -ENOBUFS)
> +		ret = 0;
> +	return ret;
> +}
> +
>  static void ext4_htree_event_cb(void *target, void *event)
>  {
>  	u64 *block = (u64 *)target;
> @@ -2612,6 +2620,55 @@ static int ext4_update_dotdot(handle_t *handle, struct dentry *dentry,
>  	return err;
>  }
>  
> +static unsigned long __ext4_max_dir_size(struct dx_frame *frames,
> +					 struct dx_frame *frame,
> +					 struct inode *dir)
> +{
> +	unsigned long max_dir_size;
> +
> +	if (EXT4_SB(dir->i_sb)->s_max_dir_size_kb) {
> +		max_dir_size = EXT4_SB(dir->i_sb)->s_max_dir_size_kb << 10;
> +	} else {
> +		max_dir_size = EXT4_BLOCK_SIZE(dir->i_sb);
> +		while (frame >= frames) {
> +			max_dir_size *= dx_get_limit(frame->entries);
> +			if (frame == frames)
> +				break;
> +			frame--;
> +		}
> +		/* use 75% of max dir size in average */
> +		max_dir_size = max_dir_size / 4 * 3;
> +	}
> +	return max_dir_size;
> +}
> +
> +/*
> + * With hash tree growing, it is easy to hit ENOSPC, but it is hard
> + * to predict when it will happen. let's give administrators warning
> + * when reaching 3/5 and 2/3 of limit
> + */
> +static inline bool dir_size_in_warning_range(struct dx_frame *frames,
> +					     struct dx_frame *frame,
> +					     struct inode *dir)
> +{
> +	struct super_block *sb = dir->i_sb;
> +	unsigned long size1, size2;
> +
> +	if (unlikely(!EXT4_SB(sb)->s_warning_dir_size))
> +		EXT4_SB(sb)->s_warning_dir_size =
> +			__ext4_max_dir_size(frames, frame, dir);
> +
> +	size1 = EXT4_SB(sb)->s_warning_dir_size / 16 * 10;
> +	size1 = size1 & ~(EXT4_BLOCK_SIZE(sb) - 1);
> +	size2 = EXT4_SB(sb)->s_warning_dir_size / 16 * 11;
> +	size2 = size2 & ~(EXT4_BLOCK_SIZE(sb) - 1);
> +	if (in_range(dir->i_size, size1, EXT4_BLOCK_SIZE(sb)) ||
> +	    in_range(dir->i_size, size2, EXT4_BLOCK_SIZE(sb)))
> +		return true;
> +
> +	return false;
> +}
> +
>  /*
>   *	ext4_add_entry()
>   *
> @@ -2739,6 +2796,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
>  	struct buffer_head *bh;
>  	struct super_block *sb = dir->i_sb;
>  	struct ext4_dir_entry_2 *de;
> +	bool ret_warn = false;
>  	int restart;
>  	int err;
>  
> @@ -2769,6 +2827,11 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
>  	/* Block full, should compress but for now just split */
>  	dxtrace(printk(KERN_DEBUG "using %u of %u node entries\n",
>  		       dx_get_count(entries), dx_get_limit(entries)));
> +
> +	if (frame - frames + 1 >= ext4_dir_htree_level(sb) ||
> +	    EXT4_SB(sb)->s_warning_dir_size)
> +		ret_warn = dir_size_in_warning_range(frames, frame, dir);
> +
>  	/* Need to split index? */
>  	if (dx_get_count(entries) == dx_get_limit(entries)) {
>  		ext4_lblk_t newblock;
> @@ -2935,6 +2998,8 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
>  	 */
>  	if (restart && err == 0)
>  		goto again;
> +	if (err == 0 && ret_warn)
> +		err = -ENOBUFS;
>  	return err;
>  }
>  
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 07242d7..a3179b2 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1901,6 +1901,8 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>  		sbi->s_li_wait_mult = arg;
>  	} else if (token == Opt_max_dir_size_kb) {
>  		sbi->s_max_dir_size_kb = arg;
> +		/* reset s_warning_dir_size and make it re-calculated */
> +		sbi->s_warning_dir_size = 0;
>  	} else if (token == Opt_stripe) {
>  		sbi->s_stripe = arg;
>  	} else if (token == Opt_resuid) {
> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> index 3a71a16..575f318 100644
> --- a/fs/ext4/sysfs.c
> +++ b/fs/ext4/sysfs.c
> @@ -182,6 +182,7 @@ static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
>  EXT4_RW_ATTR_SBI_UI(inode_goal, s_inode_goal);
>  EXT4_RW_ATTR_SBI_UI(max_dir_size, s_max_dir_size_kb);
>  EXT4_RW_ATTR_SBI_UI(max_dir_size_kb, s_max_dir_size_kb);
> +EXT4_RW_ATTR_SBI_UI(warning_dir_size, s_warning_dir_size);
>  EXT4_RW_ATTR_SBI_UI(mb_stats, s_mb_stats);
>  EXT4_RW_ATTR_SBI_UI(mb_max_to_scan, s_mb_max_to_scan);
>  EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
> @@ -214,6 +215,7 @@ static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
>  	ATTR_LIST(inode_goal),
>  	ATTR_LIST(max_dir_size),
>  	ATTR_LIST(max_dir_size_kb),
> +	ATTR_LIST(warning_dir_size),
>  	ATTR_LIST(mb_stats),
>  	ATTR_LIST(mb_max_to_scan),
>  	ATTR_LIST(mb_min_to_scan),
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [lustre-devel] [PATCH 17/22] ext4: optimize ext4_journal_callback_add
  2019-07-22  1:23 ` [lustre-devel] [PATCH 17/22] ext4: optimize ext4_journal_callback_add James Simmons
@ 2019-07-22  5:27   ` NeilBrown
  2019-07-23  2:01     ` Andreas Dilger
  0 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2019-07-22  5:27 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> Change list_add_tail to list_add. This benefits ldiskfs in
> tgt_cb_last_committed: thandles with the highest transaction
> numbers are placed at the beginning of the list, so the first
> iterations see the highest transno. This saves an extra call
> of ptlrpc_commit_replies.

If only more commit messages were like this !!  Thanks!

I suspect it would be better for that thing that needs to see the
highest transno first, to find a way to look at the end of the list
instead of the beginning.

This isn't the sort of change that can land in ext4.

NeilBrown


>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  fs/ext4/ext4_jbd2.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 75a5309..5ebf8ee 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -169,7 +169,7 @@ static inline void _ext4_journal_callback_add(handle_t *handle,
>  			struct ext4_journal_cb_entry *jce)
>  {
>  	/* Add the jce to transaction's private list */
> -	list_add_tail(&jce->jce_list, &handle->h_transaction->t_private_list);
> +	list_add(&jce->jce_list, &handle->h_transaction->t_private_list);
>  }
>  
>  static inline void ext4_journal_callback_add(handle_t *handle,
> -- 
> 1.8.3.1


* [lustre-devel] [PATCH 19/22] ext4: don't check before replay
  2019-07-22  1:23 ` [lustre-devel] [PATCH 19/22] ext4: don't check before replay James Simmons
@ 2019-07-22  5:29   ` NeilBrown
       [not found]     ` <506765DD-0068-469E-ADA4-2C71B8B60114@cloudlinux.com>
  2019-07-23  1:57     ` Andreas Dilger
  0 siblings, 2 replies; 53+ messages in thread
From: NeilBrown @ 2019-07-22  5:29 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

> When ldiskfs runs in failover mode with a read-only disk,
> part of the allocation updates are lost and ldiskfs may fail
> while mounting; this is due to an inconsistent state of the
> group descriptor. The group-descriptor check is moved to after
> journal replay.

I think this needs to be enabled by a mount option or super-block flag.

NeilBrown


>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  fs/ext4/super.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index a3179b2..b818acb 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4255,11 +4255,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  		}
>  	}
>  	sbi->s_gdb_count = db_count;
> -	if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
> -		ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
> -		ret = -EFSCORRUPTED;
> -		goto failed_mount2;
> -	}
>  
>  	timer_setup(&sbi->s_err_report, print_daily_error_info, 0);
>  
> @@ -4401,6 +4396,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
>  
>  no_journal:
> +	if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
> +		ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
> +		ret = -EFSCORRUPTED;
> +		goto failed_mount_wq;
> +	}
> +
>  	if (!test_opt(sb, NO_MBCACHE)) {
>  		sbi->s_ea_block_cache = ext4_xattr_create_cache();
>  		if (!sbi->s_ea_block_cache) {
> -- 
> 1.8.3.1


* [lustre-devel] [PATCH 20/22] ext4: use GFP_NOFS in ext4_inode_attach_jinode
  2019-07-22  1:23 ` [lustre-devel] [PATCH 20/22] ext4: use GFP_NOFS in ext4_inode_attach_jinode James Simmons
@ 2019-07-22  5:30   ` NeilBrown
  2019-07-23  1:56     ` Andreas Dilger
  0 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2019-07-22  5:30 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 21 2019, James Simmons wrote:

If you need to call this from a context where GFP_KERNEL is not safe,
use memalloc_nofs_{save,restore}().

NeilBrown
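Concretely, the pattern NeilBrown suggests looks like the following. This is a kernel-context sketch, not runnable here; memalloc_nofs_save()/memalloc_nofs_restore() are declared in <linux/sched/mm.h>:

```c
/* Scope the NOFS constraint to the caller instead of changing
 * the allocation flag at the call site. */
unsigned int flags = memalloc_nofs_save();

jinode = jbd2_alloc_inode(GFP_KERNEL);	/* implicitly treated as GFP_NOFS */

memalloc_nofs_restore(flags);
```

This lets jbd2_alloc_inode() keep GFP_KERNEL while any allocation inside the marked region behaves as NOFS; as Andreas notes later in the thread, though, this API only exists in relatively recent kernels.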



> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  fs/ext4/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index bf00dfc..ca72097 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4422,7 +4422,7 @@ int ext4_inode_attach_jinode(struct inode *inode)
>  	if (ei->jinode || !EXT4_SB(inode->i_sb)->s_journal)
>  		return 0;
>  
> -	jinode = jbd2_alloc_inode(GFP_KERNEL);
> +	jinode = jbd2_alloc_inode(GFP_NOFS);
>  	spin_lock(&inode->i_lock);
>  	if (!ei->jinode) {
>  		if (!jinode) {
> -- 
> 1.8.3.1


* [lustre-devel] [PATCH 19/22] ext4: don't check before replay
       [not found]     ` <506765DD-0068-469E-ADA4-2C71B8B60114@cloudlinux.com>
@ 2019-07-22  6:46       ` NeilBrown
  2019-07-22  6:56         ` Oleg Drokin
  0 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2019-07-22  6:46 UTC (permalink / raw)
  To: lustre-devel

On Mon, Jul 22 2019, Alexey Lyashkov wrote:

> Why?
> The purpose of this patch is simple and doesn't address failover in general.
> A crash can occur at commit time, when the journal is only _partially_ flushed to the FS. Checking any FS metadata at that point is wrong, because we have no guarantee of consistency.
> But checking after journal replay is fine; that check verifies that no corruption hit and the FS is fine.

If the corruption can occur in non-ldiskfs usage, and would be fixed by
a journal replay, then yes - the patch looks like a good idea.

Possibly I misunderstood the source of the corruption... maybe if that
could be made clearer in the commit message, that would help.

Thanks,
NeilBrown


>
>
>> On 22 Jul 2019, at 8:29, NeilBrown <neilb@suse.com> wrote:
>> 
>> On Sun, Jul 21 2019, James Simmons wrote:
>> 
>>> When ldiskfs runs in failover mode with a read-only disk,
>>> part of the allocation updates are lost and ldiskfs may fail
>>> while mounting; this is due to an inconsistent state of the
>>> group descriptor. The group-descriptor check is moved to after
>>> journal replay.
>> 
>> I think this needs to be enabled by a mount option or super-block flag.
>> 
>> NeilBrown
>> 
>> 
>>> 
>>> Signed-off-by: James Simmons <jsimmons@infradead.org>
>>> ---
>>> fs/ext4/super.c | 11 ++++++-----
>>> 1 file changed, 6 insertions(+), 5 deletions(-)
>>> 
>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>> index a3179b2..b818acb 100644
>>> --- a/fs/ext4/super.c
>>> +++ b/fs/ext4/super.c
>>> @@ -4255,11 +4255,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>> 		}
>>> 	}
>>> 	sbi->s_gdb_count = db_count;
>>> -	if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
>>> -		ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
>>> -		ret = -EFSCORRUPTED;
>>> -		goto failed_mount2;
>>> -	}
>>> 
>>> 	timer_setup(&sbi->s_err_report, print_daily_error_info, 0);
>>> 
>>> @@ -4401,6 +4396,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>> 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
>>> 
>>> no_journal:
>>> +	if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
>>> +		ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
>>> +		ret = -EFSCORRUPTED;
>>> +		goto failed_mount_wq;
>>> +	}
>>> +
>>> 	if (!test_opt(sb, NO_MBCACHE)) {
>>> 		sbi->s_ea_block_cache = ext4_xattr_create_cache();
>>> 		if (!sbi->s_ea_block_cache) {
>>> -- 
>>> 1.8.3.1
>> _______________________________________________
>> lustre-devel mailing list
>> lustre-devel at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org


* [lustre-devel] [PATCH 19/22] ext4: don't check before replay
  2019-07-22  6:46       ` NeilBrown
@ 2019-07-22  6:56         ` Oleg Drokin
  2019-07-22  9:51           ` Alexey Lyashkov
  0 siblings, 1 reply; 53+ messages in thread
From: Oleg Drokin @ 2019-07-22  6:56 UTC (permalink / raw)
  To: lustre-devel



> On Jul 22, 2019, at 2:46 AM, NeilBrown <neilb@suse.com> wrote:
> 
> On Mon, Jul 22 2019, Alexey Lyashkov wrote:
> 
>> Why?
>> The purpose of this patch is simple and doesn't address failover in general.
>> A crash can occur at commit time, when the journal is only _partially_ flushed to the FS. Checking any FS metadata at that point is wrong, because we have no guarantee of consistency.
>> But checking after journal replay is fine; that check verifies that no corruption hit and the FS is fine.
> 
> If the corruption can occur in non-ldiskfs usage, and would be fixed by
> a journal replay, then yes - the patch looks like a good idea.
> 
> Possibly I misunderstood the source of the corruption... maybe if that
> could be made clearer in the commit message, that would help.

I think the argument is: wtf do we even look at metadata (or anything, really)
before journal replay.
The patch is actually good because it does move the read to after the journal
replay, though?


> 
>> 
>> 
>>> On 22 Jul 2019, at 8:29, NeilBrown <neilb@suse.com> wrote:
>>> 
>>> On Sun, Jul 21 2019, James Simmons wrote:
>>> 
>>>> When ldiskfs runs in failover mode with a read-only disk,
>>>> part of the allocation updates are lost and ldiskfs may fail
>>>> while mounting; this is due to an inconsistent state of the
>>>> group descriptor. The group-descriptor check is moved to after
>>>> journal replay.
>>> 
>>> I think this needs to be enabled by a mount option or super-block flag.
>>> 
>>> NeilBrown
>>> 
>>> 
>>>> 
>>>> Signed-off-by: James Simmons <jsimmons@infradead.org>
>>>> ---
>>>> fs/ext4/super.c | 11 ++++++-----
>>>> 1 file changed, 6 insertions(+), 5 deletions(-)
>>>> 
>>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>>> index a3179b2..b818acb 100644
>>>> --- a/fs/ext4/super.c
>>>> +++ b/fs/ext4/super.c
>>>> @@ -4255,11 +4255,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>>> 		}
>>>> 	}
>>>> 	sbi->s_gdb_count = db_count;
>>>> -	if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
>>>> -		ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
>>>> -		ret = -EFSCORRUPTED;
>>>> -		goto failed_mount2;
>>>> -	}
>>>> 
>>>> 	timer_setup(&sbi->s_err_report, print_daily_error_info, 0);
>>>> 
>>>> @@ -4401,6 +4396,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>>> 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
>>>> 
>>>> no_journal:
>>>> +	if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
>>>> +		ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
>>>> +		ret = -EFSCORRUPTED;
>>>> +		goto failed_mount_wq;
>>>> +	}
>>>> +
>>>> 	if (!test_opt(sb, NO_MBCACHE)) {
>>>> 		sbi->s_ea_block_cache = ext4_xattr_create_cache();
>>>> 		if (!sbi->s_ea_block_cache) {
>>>> -- 
>>>> 1.8.3.1
>>> _______________________________________________
>>> lustre-devel mailing list
>>> lustre-devel at lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org


* [lustre-devel] [PATCH 04/22] ext4: export inode management
  2019-07-22  4:34   ` NeilBrown
@ 2019-07-22  7:16     ` Oleg Drokin
  0 siblings, 0 replies; 53+ messages in thread
From: Oleg Drokin @ 2019-07-22  7:16 UTC (permalink / raw)
  To: lustre-devel



> On Jul 22, 2019, at 12:34 AM, NeilBrown <neilb@suse.com> wrote:
> 
> On Sun, Jul 21 2019, James Simmons wrote:
> 
>> Make ext4_delete_entry() exportable for osd-ldiskfs. Also add
>> exportable ext4_create_inode().
> 
> Why can't lustre use vfs_unlink like everyone else?  And vfs_mknod()?
> My hope is that what we land upstream will use generic interfaces only.
> If there is something that osd needs to do that the generic interfaces
> cannot support, then we need to look at how to add that to the generic
> interface.

That's probably going to be a tough thing to do. At times we need certain things
that are bad from a VFS standpoint in the general sense, but because we have various
outside guarantees from locking and such, we can still do them.

Nothing precise comes to mind right now, but I remember various games with
nlink count/moving and resurrecting such files.

We also have wider "transaction boundaries". Where, say, a normal via-VFS transaction
is metadata only, we can request certain data (data from the ext4 perspective, but metadata
from Lustre's) to be part of a particular transaction.


* [lustre-devel] [PATCH 19/22] ext4: don't check before replay
  2019-07-22  6:56         ` Oleg Drokin
@ 2019-07-22  9:51           ` Alexey Lyashkov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Lyashkov @ 2019-07-22  9:51 UTC (permalink / raw)
  To: lustre-devel


> On 22 Jul 2019, at 9:56, Oleg Drokin <green@whamcloud.com> wrote:
> 
> 
> 
>> On Jul 22, 2019, at 2:46 AM, NeilBrown <neilb@suse.com> wrote:
>> 
>> On Mon, Jul 22 2019, Alexey Lyashkov wrote:
>> 
>>> Why?
>>> The purpose of this patch is simple and doesn't address failover in general.
>>> A crash can occur at commit time, when the journal is only _partially_ flushed to the FS. Checking any FS metadata at that point is wrong, because we have no guarantee of consistency.
>>> But checking after journal replay is fine; that check verifies that no corruption hit and the FS is fine.
>> 
>> If the corruption can occur in non-ldiskfs usage, and would be fixed by
>> a journal replay, then yes - the patch looks like a good idea.
Yes, the bug should be fixed by journal replay. BUT this check was run _before_ journal replay, and the patch moves the group check to _after_ journal replay is done.


>> 
>> Possibly I misunderstood the source of the corruption... maybe if that
>> could be made clearer in the commit message, that would help.
> 
> I think the argument is: wtf do we even look at metadata (or anything, really)
> before journal replay.
> The patch is actually good because it does move the read after the journal replay
> though?

Oleg,

you are right. 
> 
> 
>> 
>>> 
>>> 
>>>> On 22 Jul 2019, at 8:29, NeilBrown <neilb@suse.com> wrote:
>>>> 
>>>> On Sun, Jul 21 2019, James Simmons wrote:
>>>> 
>>>>> When ldiskfs runs in failover mode with a read-only disk,
>>>>> part of the allocation updates are lost and ldiskfs may fail
>>>>> while mounting; this is due to an inconsistent state of the
>>>>> group descriptor. The group-descriptor check is moved to after
>>>>> journal replay.
>>>> 
>>>> I think this needs to be enabled by a mount option or super-block flag.
>>>> 
>>>> NeilBrown
>>>> 
>>>> 
>>>>> 
>>>>> Signed-off-by: James Simmons <jsimmons@infradead.org>
>>>>> ---
>>>>> fs/ext4/super.c | 11 ++++++-----
>>>>> 1 file changed, 6 insertions(+), 5 deletions(-)
>>>>> 
>>>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>>>> index a3179b2..b818acb 100644
>>>>> --- a/fs/ext4/super.c
>>>>> +++ b/fs/ext4/super.c
>>>>> @@ -4255,11 +4255,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>>>> 		}
>>>>> 	}
>>>>> 	sbi->s_gdb_count = db_count;
>>>>> -	if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
>>>>> -		ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
>>>>> -		ret = -EFSCORRUPTED;
>>>>> -		goto failed_mount2;
>>>>> -	}
>>>>> 
>>>>> 	timer_setup(&sbi->s_err_report, print_daily_error_info, 0);
>>>>> 
>>>>> @@ -4401,6 +4396,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>>>> 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
>>>>> 
>>>>> no_journal:
>>>>> +	if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
>>>>> +		ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
>>>>> +		ret = -EFSCORRUPTED;
>>>>> +		goto failed_mount_wq;
>>>>> +	}
>>>>> +
>>>>> 	if (!test_opt(sb, NO_MBCACHE)) {
>>>>> 		sbi->s_ea_block_cache = ext4_xattr_create_cache();
>>>>> 		if (!sbi->s_ea_block_cache) {
>>>>> -- 
>>>>> 1.8.3.1
>>>> _______________________________________________
>>>> lustre-devel mailing list
>>>> lustre-devel at lists.lustre.org
>>>> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
> 
> _______________________________________________
> lustre-devel mailing list
> lustre-devel at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org


* [lustre-devel] [PATCH 01/22] ext4: add i_fs_version
  2019-07-22  4:13   ` NeilBrown
@ 2019-07-23  0:07     ` James Simmons
  2019-07-31 22:03     ` Andreas Dilger
  1 sibling, 0 replies; 53+ messages in thread
From: James Simmons @ 2019-07-23  0:07 UTC (permalink / raw)
  To: lustre-devel


> > From: James Simmons <uja.ornl@yahoo.com>
> >
> > Add inode version field that osd-ldiskfs uses.
> 
> We need more words here.  What is i_fs_version used for?
> 
> (I'll probably find out as I keep reading, but I need to know
>  the intention first, or I cannot properly review the patches
>  that make use of it.
> )

Sadly these patches don't contain much in terms of why they exist.
I did some digging and it seems at one time Lustre did use the
i_version field directly for its recovery process. It seems ext4 stomps
on i_version independently of Lustre, so it can't be trusted for
use in recovery.

> NeilBrown
> 
> 
> >
> > Signed-off-by: James Simmons <uja.ornl@yahoo.com>
> > ---
> >  fs/ext4/ext4.h   | 2 ++
> >  fs/ext4/ialloc.c | 1 +
> >  fs/ext4/inode.c  | 4 ++--
> >  3 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 1cb6785..8abbcab 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1063,6 +1063,8 @@ struct ext4_inode_info {
> >  	struct dquot *i_dquot[MAXQUOTAS];
> >  #endif
> >  
> > +	u64 i_fs_version;
> > +
> >  	/* Precomputed uuid+inum+igen checksum for seeding inode checksums */
> >  	__u32 i_csum_seed;
> >  
> > diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> > index 764ff4c..09ae4a4 100644
> > --- a/fs/ext4/ialloc.c
> > +++ b/fs/ext4/ialloc.c
> > @@ -1100,6 +1100,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
> >  	ei->i_dtime = 0;
> >  	ei->i_block_group = group;
> >  	ei->i_last_alloc_group = ~0;
> > +	ei->i_fs_version = 0;
> >  
> >  	ext4_set_inode_flags(inode);
> >  	if (IS_DIRSYNC(inode))
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index c7f77c6..6e66175 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4806,14 +4806,14 @@ static inline void ext4_inode_set_iversion_queried(struct inode *inode, u64 val)
> >  	if (unlikely(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> >  		inode_set_iversion_raw(inode, val);
> >  	else
> > -		inode_set_iversion_queried(inode, val);
> > +		EXT4_I(inode)->i_fs_version = val;
> >  }
> >  static inline u64 ext4_inode_peek_iversion(const struct inode *inode)
> >  {
> >  	if (unlikely(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> >  		return inode_peek_iversion_raw(inode);
> >  	else
> > -		return inode_peek_iversion(inode);
> > +		return EXT4_I(inode)->i_fs_version;
> >  }
> >  
> >  struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
> > -- 
> > 1.8.3.1
> 


* [lustre-devel] [PATCH 20/22] ext4: use GFP_NOFS in ext4_inode_attach_jinode
  2019-07-22  5:30   ` NeilBrown
@ 2019-07-23  1:56     ` Andreas Dilger
  0 siblings, 0 replies; 53+ messages in thread
From: Andreas Dilger @ 2019-07-23  1:56 UTC (permalink / raw)
  To: lustre-devel

That interface has only existed for a handful of kernel releases, so was/is
not a viable option for earlier kernels. 

Cheers, Andreas

> On Jul 21, 2019, at 23:31, NeilBrown <neilb@suse.com> wrote:
> 
> On Sun, Jul 21 2019, James Simmons wrote:
> 
> If you need to call this from a context where GFP_KERNEL is not safe,
> use memalloc_nofs_{save,restore}().
> 
> NeilBrown
> 
> 
> 
>> Signed-off-by: James Simmons <jsimmons@infradead.org>
>> ---
>> fs/ext4/inode.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index bf00dfc..ca72097 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4422,7 +4422,7 @@ int ext4_inode_attach_jinode(struct inode *inode)
>>    if (ei->jinode || !EXT4_SB(inode->i_sb)->s_journal)
>>        return 0;
>> 
>> -    jinode = jbd2_alloc_inode(GFP_KERNEL);
>> +    jinode = jbd2_alloc_inode(GFP_NOFS);
>>    spin_lock(&inode->i_lock);
>>    if (!ei->jinode) {
>>        if (!jinode) {
>> -- 
>> 1.8.3.1


* [lustre-devel] [PATCH 19/22] ext4: don't check before replay
  2019-07-22  5:29   ` NeilBrown
       [not found]     ` <506765DD-0068-469E-ADA4-2C71B8B60114@cloudlinux.com>
@ 2019-07-23  1:57     ` Andreas Dilger
  2019-07-23  2:01       ` Oleg Drokin
  1 sibling, 1 reply; 53+ messages in thread
From: Andreas Dilger @ 2019-07-23  1:57 UTC (permalink / raw)
  To: lustre-devel

Actually, I think this patch would be OK to push upstream. 

Cheers, Andreas

> On Jul 21, 2019, at 23:29, NeilBrown <neilb@suse.com> wrote:
> 
>> On Sun, Jul 21 2019, James Simmons wrote:
>> 
>> When ldiskfs runs in failover mode with a read-only disk,
>> part of the allocation updates are lost and ldiskfs may fail
>> while mounting; this is due to an inconsistent state of the
>> group descriptor. The group-descriptor check is moved to after
>> journal replay.
> 
> I think this needs to be enabled by a mount option or super-block flag.
> 
> NeilBrown
> 
> 
>> 
>> Signed-off-by: James Simmons <jsimmons@infradead.org>
>> ---
>> fs/ext4/super.c | 11 ++++++-----
>> 1 file changed, 6 insertions(+), 5 deletions(-)
>> 
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index a3179b2..b818acb 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -4255,11 +4255,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>        }
>>    }
>>    sbi->s_gdb_count = db_count;
>> -    if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
>> -        ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
>> -        ret = -EFSCORRUPTED;
>> -        goto failed_mount2;
>> -    }
>> 
>>    timer_setup(&sbi->s_err_report, print_daily_error_info, 0);
>> 
>> @@ -4401,6 +4396,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>    sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
>> 
>> no_journal:
>> +    if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
>> +        ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
>> +        ret = -EFSCORRUPTED;
>> +        goto failed_mount_wq;
>> +    }
>> +
>>    if (!test_opt(sb, NO_MBCACHE)) {
>>        sbi->s_ea_block_cache = ext4_xattr_create_cache();
>>        if (!sbi->s_ea_block_cache) {
>> -- 
>> 1.8.3.1


* [lustre-devel] [PATCH 19/22] ext4: don't check before replay
  2019-07-23  1:57     ` Andreas Dilger
@ 2019-07-23  2:01       ` Oleg Drokin
  0 siblings, 0 replies; 53+ messages in thread
From: Oleg Drokin @ 2019-07-23  2:01 UTC (permalink / raw)
  To: lustre-devel

What I think needs to happen is a better description.

Something like:

After a crash, group descriptors might not have been written out
completely, which would lead to an FS error message on a subsequent
mount.

Move the check to after journal replay to ensure we are dealing with
up-to-date (and hopefully correct) information before declaring the
FS as bad.

> On Jul 22, 2019, at 9:57 PM, Andreas Dilger <adilger@whamcloud.com> wrote:
> 
> Actually, I think this patch would be OK to push upstream. 
> 
> Cheers, Andreas
> 
>> On Jul 21, 2019, at 23:29, NeilBrown <neilb@suse.com> wrote:
>> 
>>> On Sun, Jul 21 2019, James Simmons wrote:
>>> 
>>> When ldiskfs runs in failover mode with a read-only disk, some
>>> allocation updates are lost and ldiskfs may fail while mounting;
>>> this is due to an inconsistent state of the group descriptors.
>>> A group-descriptor check is added after journal replay.
>> 
>> I think this needs to be enabled by a mount option or super-block flag.
>> 
>> NeilBrown
>> 
>> 
>>> 
>>> Signed-off-by: James Simmons <jsimmons@infradead.org>
>>> ---
>>> fs/ext4/super.c | 11 ++++++-----
>>> 1 file changed, 6 insertions(+), 5 deletions(-)
>>> 
>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>> index a3179b2..b818acb 100644
>>> --- a/fs/ext4/super.c
>>> +++ b/fs/ext4/super.c
>>> @@ -4255,11 +4255,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>>       }
>>>   }
>>>   sbi->s_gdb_count = db_count;
>>> -    if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
>>> -        ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
>>> -        ret = -EFSCORRUPTED;
>>> -        goto failed_mount2;
>>> -    }
>>> 
>>>   timer_setup(&sbi->s_err_report, print_daily_error_info, 0);
>>> 
>>> @@ -4401,6 +4396,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>>   sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
>>> 
>>> no_journal:
>>> +    if (!ext4_check_descriptors(sb, logical_sb_block, &first_not_zeroed)) {
>>> +        ext4_msg(sb, KERN_ERR, "group descriptors corrupted!");
>>> +        ret = -EFSCORRUPTED;
>>> +        goto failed_mount_wq;
>>> +    }
>>> +
>>>   if (!test_opt(sb, NO_MBCACHE)) {
>>>       sbi->s_ea_block_cache = ext4_xattr_create_cache();
>>>       if (!sbi->s_ea_block_cache) {
>>> -- 
>>> 1.8.3.1


* [lustre-devel] [PATCH 17/22] ext4: optimize ext4_journal_callback_add
  2019-07-22  5:27   ` NeilBrown
@ 2019-07-23  2:01     ` Andreas Dilger
  0 siblings, 0 replies; 53+ messages in thread
From: Andreas Dilger @ 2019-07-23  2:01 UTC (permalink / raw)
  To: lustre-devel

This might be accommodated by a new API variant that allows the caller
to add this commit callback at the head of the list instead of the tail. 

The main question is whether such an API would be accepted upstream
without a user?  

Cheers, Andreas

> On Jul 21, 2019, at 23:27, NeilBrown <neilb@suse.com> wrote:
> 
>> On Sun, Jul 21 2019, James Simmons wrote:
>> 
>> Change list_add_tail() to list_add(). This benefits ldiskfs in
>> tgt_cb_last_committed(): the thandles with the highest transaction
>> numbers are placed at the beginning of the list, so the first
>> iterations see the highest transno. This saves an extra call to
>> ptlrpc_commit_replies().
> 
> If only more commit messages were like this !!  Thanks!
> 
> I suspect it would be better for that thing that needs to see the
> highest transno first, to find a way to look at the end of the list
> instead of the beginning.
> 
> This isn't the sort of change that can land in ext4.
> 
> NeilBrown
> 
> 
>> 
>> Signed-off-by: James Simmons <jsimmons@infradead.org>
>> ---
>> fs/ext4/ext4_jbd2.h | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>> index 75a5309..5ebf8ee 100644
>> --- a/fs/ext4/ext4_jbd2.h
>> +++ b/fs/ext4/ext4_jbd2.h
>> @@ -169,7 +169,7 @@ static inline void _ext4_journal_callback_add(handle_t *handle,
>>            struct ext4_journal_cb_entry *jce)
>> {
>>    /* Add the jce to transaction's private list */
>> -    list_add_tail(&jce->jce_list, &handle->h_transaction->t_private_list);
>> +    list_add(&jce->jce_list, &handle->h_transaction->t_private_list);
>> }
>> 
>> static inline void ext4_journal_callback_add(handle_t *handle,
>> -- 
>> 1.8.3.1


* [lustre-devel] [PATCH 08/22] ext4: kill off struct dx_root
  2019-07-22  4:52   ` NeilBrown
@ 2019-07-23  2:07     ` Andreas Dilger
  2019-08-05  7:31     ` Artem Blagodarenko
  1 sibling, 0 replies; 53+ messages in thread
From: Andreas Dilger @ 2019-07-23  2:07 UTC (permalink / raw)
  To: lustre-devel

This is needed so that we can add extra data to struct ext4_dir_entry
(Lustre FID) via the "dirdata" feature, and struct dx_root assumes "." and
".." entries are a fixed size.

So there is a reason for the patch, and I think it would be acceptable
upstream. Artem @ Cray has an updated version of this patch that
was submitted upstream to linux-ext4 and reviewed, but it needs to
be refreshed again. 

Cheers, Andreas

> On Jul 21, 2019, at 22:52, NeilBrown <neilb@suse.com> wrote:
> 
>> On Sun, Jul 21 2019, James Simmons wrote:
>> 
>> Use struct dx_root_info directly.
> 
> This looks like a style change.
> Instead of a 'struct' that models the start of an on-disk directory
> block, the location of some fields is done using calculations instead.
> 
> Why would we want to do that?  Might we want dx_root_info to be at a
> different offset in the directory ??
> 
> NeilBrown


* [lustre-devel] [PATCH 01/22] ext4: add i_fs_version
  2019-07-22  4:13   ` NeilBrown
  2019-07-23  0:07     ` James Simmons
@ 2019-07-31 22:03     ` Andreas Dilger
  1 sibling, 0 replies; 53+ messages in thread
From: Andreas Dilger @ 2019-07-31 22:03 UTC (permalink / raw)
  To: lustre-devel

On Jul 21, 2019, at 22:13, NeilBrown <neilb@suse.com> wrote:
> 
> On Sun, Jul 21 2019, James Simmons wrote:
> 
>> From: James Simmons <uja.ornl@yahoo.com>
>> 
>> Add an inode version field that osd-ldiskfs uses.
> 
> We need more words here.  What is i_fs_version used for?
> 
> (I'll probably find out as I keep reading, but I need to know
> the intention first, or I cannot properly review the patches
> that make use of it.)

We use the 64-bit i_version field to store the Lustre transno in
which the inode was last modified.  This maintains the NFS
required semantics that the version change when the inode is
modified, but the number has a specific meaning so we don't
want the ext4 code or VFS to just increment it.

When a client modifies an inode, the old version is returned in
the RPC reply to the client.  The old version number can be used
during Lustre recovery (RPC replay using "Version Based Recovery",
VBR) after an MDS crash to determine if it is safe to apply this
change to the inode, in the case that some clients have been
evicted and we don't have a full stream of RPCs to replay.

If we could get some way to prevent the kernel from updating the
inode->i_version (which is now also a 64-bit value) then we could
remove one more patch from our code.  In the meantime, the current
kernel is not conducive to hijacking inode->i_version so we need
to store our own value in the ext4_inode_info and write that to
disk instead of the kernel version.

Cheers, Andreas

> 
>> 
>> Signed-off-by: James Simmons <uja.ornl@yahoo.com>
>> ---
>> fs/ext4/ext4.h   | 2 ++
>> fs/ext4/ialloc.c | 1 +
>> fs/ext4/inode.c  | 4 ++--
>> 3 files changed, 5 insertions(+), 2 deletions(-)
>> 
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 1cb6785..8abbcab 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1063,6 +1063,8 @@ struct ext4_inode_info {
>> 	struct dquot *i_dquot[MAXQUOTAS];
>> #endif
>> 
>> +	u64 i_fs_version;
>> +
>> 	/* Precomputed uuid+inum+igen checksum for seeding inode checksums */
>> 	__u32 i_csum_seed;
>> 
>> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
>> index 764ff4c..09ae4a4 100644
>> --- a/fs/ext4/ialloc.c
>> +++ b/fs/ext4/ialloc.c
>> @@ -1100,6 +1100,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>> 	ei->i_dtime = 0;
>> 	ei->i_block_group = group;
>> 	ei->i_last_alloc_group = ~0;
>> +	ei->i_fs_version = 0;
>> 
>> 	ext4_set_inode_flags(inode);
>> 	if (IS_DIRSYNC(inode))
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index c7f77c6..6e66175 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4806,14 +4806,14 @@ static inline void ext4_inode_set_iversion_queried(struct inode *inode, u64 val)
>> 	if (unlikely(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
>> 		inode_set_iversion_raw(inode, val);
>> 	else
>> -		inode_set_iversion_queried(inode, val);
>> +		EXT4_I(inode)->i_fs_version = val;
>> }
>> static inline u64 ext4_inode_peek_iversion(const struct inode *inode)
>> {
>> 	if (unlikely(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
>> 		return inode_peek_iversion_raw(inode);
>> 	else
>> -		return inode_peek_iversion(inode);
>> +		return EXT4_I(inode)->i_fs_version;
>> }
>> 
>> struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
>> -- 
>> 1.8.3.1

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud


* [lustre-devel] [PATCH 03/22] ext4: prealloc table optimization
  2019-07-22  1:23 ` [lustre-devel] [PATCH 03/22] ext4: prealloc table optimization James Simmons
  2019-07-22  4:29   ` NeilBrown
@ 2019-08-05  7:07   ` Artem Blagodarenko
  1 sibling, 0 replies; 53+ messages in thread
From: Artem Blagodarenko @ 2019-08-05  7:07 UTC (permalink / raw)
  To: lustre-devel

Hello,

I am glad to see the LU-12335 patch integrated here.

Reviewed-by:  Artem Blagodarenko <c17828@cray.com>

Best regards,
Artem Blagodarenko.


On 22/07/2019, 04:24, "James Simmons" <jsimmons@infradead.org> wrote:

    Optimize prealloc table
    
    Signed-off-by: James Simmons <jsimmons@infradead.org>
    ---
     fs/ext4/ext4.h    |   7 +-
     fs/ext4/inode.c   |   3 +
     fs/ext4/mballoc.c | 221 +++++++++++++++++++++++++++++++++++++++++-------------
     fs/ext4/namei.c   |   4 +-
     fs/ext4/sysfs.c   |   8 +-
     5 files changed, 186 insertions(+), 57 deletions(-)
    
    diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
    index 8abbcab..423ab4d 100644
    --- a/fs/ext4/ext4.h
    +++ b/fs/ext4/ext4.h
    @@ -1190,6 +1190,8 @@ struct ext4_inode_info {
     /* Metadata checksum algorithm codes */
     #define EXT4_CRC32C_CHKSUM		1
     
    +#define EXT4_MAX_PREALLOC_TABLE		64
    +
     /*
      * Structure of the super block
      */
    @@ -1447,12 +1449,14 @@ struct ext4_sb_info {
     
     	/* tunables */
     	unsigned long s_stripe;
    -	unsigned int s_mb_stream_request;
    +	unsigned long s_mb_small_req;
    +	unsigned long s_mb_large_req;
     	unsigned int s_mb_max_to_scan;
     	unsigned int s_mb_min_to_scan;
     	unsigned int s_mb_stats;
     	unsigned int s_mb_order2_reqs;
     	unsigned int s_mb_group_prealloc;
    +	unsigned long *s_mb_prealloc_table;
     	unsigned int s_max_dir_size_kb;
     	/* where last allocation was done - for stream allocation */
     	unsigned long s_mb_last_group;
    @@ -2457,6 +2461,7 @@ extern int ext4_init_inode_table(struct super_block *sb,
     extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
     
     /* mballoc.c */
    +extern const struct file_operations ext4_seq_prealloc_table_fops;
     extern const struct seq_operations ext4_mb_seq_groups_ops;
     extern long ext4_mb_stats;
     extern long ext4_mb_max_to_scan;
    diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
    index 6e66175..c37418a 100644
    --- a/fs/ext4/inode.c
    +++ b/fs/ext4/inode.c
    @@ -2796,6 +2796,9 @@ static int ext4_writepages(struct address_space *mapping,
     		ext4_journal_stop(handle);
     	}
     
    +	if (wbc->nr_to_write < sbi->s_mb_small_req)
    +		wbc->nr_to_write = sbi->s_mb_small_req;
    +
     	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
     		range_whole = 1;
     
    diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
    index 99ba720..3be3bef 100644
    --- a/fs/ext4/mballoc.c
    +++ b/fs/ext4/mballoc.c
    @@ -2339,6 +2339,101 @@ static void ext4_mb_seq_groups_stop(struct seq_file *seq, void *v)
     	.show   = ext4_mb_seq_groups_show,
     };
     
    +static int ext4_mb_check_and_update_prealloc(struct ext4_sb_info *sbi,
    +					     char *str, size_t cnt,
    +					     int update)
    +{
    +	unsigned long value;
    +	unsigned long prev = 0;
    +	char *cur;
    +	char *next;
    +	char *end;
    +	int num = 0;
    +
    +	cur = str;
    +	end = str + cnt;
    +	while (cur < end) {
    +		while ((cur < end) && (*cur == ' ')) cur++;
    +		/* Yuck - simple_strtol */
    +		value = simple_strtol(cur, &next, 0);
    +		if (value == 0)
    +			break;
    +		if (cur == next)
    +			return -EINVAL;
    +
    +		cur = next;
    +
    +		if (value > (sbi->s_blocks_per_group - 1 - 1 - sbi->s_itb_per_group))
    +			return -EINVAL;
    +
    +		/* they should add values in order */
    +		if (value <= prev)
    +			return -EINVAL;
    +
    +		if (update)
    +			sbi->s_mb_prealloc_table[num] = value;
    +
    +		prev = value;
    +		num++;
    +	}
    +
    +	if (num > EXT4_MAX_PREALLOC_TABLE - 1)
    +		return -EOVERFLOW;
    +
    +	if (update)
    +		sbi->s_mb_prealloc_table[num] = 0;
    +
    +	return 0;
    +}
    +
    +static ssize_t ext4_mb_prealloc_table_proc_write(struct file *file,
    +						 const char __user *buf,
    +						 size_t cnt, loff_t *pos)
    +{
    +	struct ext4_sb_info *sbi = EXT4_SB(PDE_DATA(file_inode(file)));
    +	char str[128];
    +	int rc;
    +
    +	if (cnt >= sizeof(str))
    +		return -EINVAL;
    +	if (copy_from_user(str, buf, cnt))
    +		return -EFAULT;
    +
    +	rc = ext4_mb_check_and_update_prealloc(sbi, str, cnt, 0);
    +	if (rc)
    +		return rc;
    +
    +	rc = ext4_mb_check_and_update_prealloc(sbi, str, cnt, 1);
    +	return rc ? rc : cnt;
    +}
    +
    +static int mb_prealloc_table_seq_show(struct seq_file *m, void *v)
    +{
    +	struct ext4_sb_info *sbi = EXT4_SB(m->private);
    +	int i;
    +
    +	for (i = 0; i < EXT4_MAX_PREALLOC_TABLE &&
    +			sbi->s_mb_prealloc_table[i] != 0; i++)
    +		seq_printf(m, "%ld ", sbi->s_mb_prealloc_table[i]);
    +	seq_printf(m, "\n");
    +
    +	return 0;
    +}
    +
    +static int mb_prealloc_table_seq_open(struct inode *inode, struct file *file)
    +{
    +	return single_open(file, mb_prealloc_table_seq_show, PDE_DATA(inode));
    +}
    +
    +const struct file_operations ext4_seq_prealloc_table_fops = {
    +	.owner	 = THIS_MODULE,
    +	.open	 = mb_prealloc_table_seq_open,
    +	.read	 = seq_read,
    +	.llseek	 = seq_lseek,
    +	.release = single_release,
    +	.write	 = ext4_mb_prealloc_table_proc_write,
    +};
    +
     static struct kmem_cache *get_groupinfo_cache(int blocksize_bits)
     {
     	int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
    @@ -2567,7 +2662,7 @@ static int ext4_groupinfo_create_slab(size_t size)
     int ext4_mb_init(struct super_block *sb)
     {
     	struct ext4_sb_info *sbi = EXT4_SB(sb);
    -	unsigned i, j;
    +	unsigned i, j, k, l;
     	unsigned offset, offset_incr;
     	unsigned max;
     	int ret;
    @@ -2616,7 +2711,6 @@ int ext4_mb_init(struct super_block *sb)
     	sbi->s_mb_max_to_scan = MB_DEFAULT_MAX_TO_SCAN;
     	sbi->s_mb_min_to_scan = MB_DEFAULT_MIN_TO_SCAN;
     	sbi->s_mb_stats = MB_DEFAULT_STATS;
    -	sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
     	sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
     	/*
     	 * The default group preallocation is 512, which for 4k block
    @@ -2640,9 +2734,28 @@ int ext4_mb_init(struct super_block *sb)
     	 * RAID stripe size so that preallocations don't fragment
     	 * the stripes.
     	 */
    -	if (sbi->s_stripe > 1) {
    -		sbi->s_mb_group_prealloc = roundup(
    -			sbi->s_mb_group_prealloc, sbi->s_stripe);
    +	/* Allocate table once */
    +	sbi->s_mb_prealloc_table = kzalloc(
    +		EXT4_MAX_PREALLOC_TABLE * sizeof(unsigned long), GFP_NOFS);
    +	if (!sbi->s_mb_prealloc_table) {
    +		ret = -ENOMEM;
    +		goto out;
    +	}
    +
    +	if (sbi->s_stripe == 0) {
    +		for (k = 0, l = 4; k <= 9; ++k, l *= 2)
    +			sbi->s_mb_prealloc_table[k] = l;
    +
    +		sbi->s_mb_small_req = 256;
    +		sbi->s_mb_large_req = 1024;
    +		sbi->s_mb_group_prealloc = 512;
    +	} else {
    +		for (k = 0, l = sbi->s_stripe; k <= 2; ++k, l *= 2)
    +			sbi->s_mb_prealloc_table[k] = l;
    +
    +		sbi->s_mb_small_req = sbi->s_stripe;
    +		sbi->s_mb_large_req = sbi->s_stripe * 8;
    +		sbi->s_mb_group_prealloc = sbi->s_stripe * 4;
     	}
     
     	sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
    @@ -2670,6 +2783,7 @@ int ext4_mb_init(struct super_block *sb)
     	free_percpu(sbi->s_locality_groups);
     	sbi->s_locality_groups = NULL;
     out:
    +	kfree(sbi->s_mb_prealloc_table);
     	kfree(sbi->s_mb_offsets);
     	sbi->s_mb_offsets = NULL;
     	kfree(sbi->s_mb_maxs);
    @@ -2932,7 +3046,6 @@ void ext4_exit_mballoc(void)
     	int err, len;
     
     	BUG_ON(ac->ac_status != AC_STATUS_FOUND);
    -	BUG_ON(ac->ac_b_ex.fe_len <= 0);
     
     	sb = ac->ac_sb;
     	sbi = EXT4_SB(sb);
    @@ -3062,13 +3175,14 @@ static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
     				struct ext4_allocation_request *ar)
     {
     	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
    -	int bsbits, max;
    +	int bsbits, i, wind;
     	ext4_lblk_t end;
    -	loff_t size, start_off;
    +	loff_t size;
     	loff_t orig_size __maybe_unused;
     	ext4_lblk_t start;
     	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
     	struct ext4_prealloc_space *pa;
    +	unsigned long value, last_non_zero;
     
     	/* do normalize only data requests, metadata requests
     	   do not need preallocation */
    @@ -3097,51 +3211,47 @@ static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
     	size = size << bsbits;
     	if (size < i_size_read(ac->ac_inode))
     		size = i_size_read(ac->ac_inode);
    -	orig_size = size;
    +	size = (size + ac->ac_sb->s_blocksize - 1) >> bsbits;
     
    -	/* max size of free chunks */
    -	max = 2 << bsbits;
    -
    -#define NRL_CHECK_SIZE(req, size, max, chunk_size)	\
    -		(req <= (size) || max <= (chunk_size))
    -
    -	/* first, try to predict filesize */
    -	/* XXX: should this table be tunable? */
    -	start_off = 0;
    -	if (size <= 16 * 1024) {
    -		size = 16 * 1024;
    -	} else if (size <= 32 * 1024) {
    -		size = 32 * 1024;
    -	} else if (size <= 64 * 1024) {
    -		size = 64 * 1024;
    -	} else if (size <= 128 * 1024) {
    -		size = 128 * 1024;
    -	} else if (size <= 256 * 1024) {
    -		size = 256 * 1024;
    -	} else if (size <= 512 * 1024) {
    -		size = 512 * 1024;
    -	} else if (size <= 1024 * 1024) {
    -		size = 1024 * 1024;
    -	} else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, 2 * 1024)) {
    -		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
    -						(21 - bsbits)) << 21;
    -		size = 2 * 1024 * 1024;
    -	} else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, 4 * 1024)) {
    -		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
    -							(22 - bsbits)) << 22;
    -		size = 4 * 1024 * 1024;
    -	} else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
    -					(8<<20)>>bsbits, max, 8 * 1024)) {
    -		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
    -							(23 - bsbits)) << 23;
    -		size = 8 * 1024 * 1024;
    +	start = wind = 0;
    +	value = last_non_zero = 0;
    +
    +	/* let's choose preallocation window depending on file size */
    +	for (i = 0; i < EXT4_MAX_PREALLOC_TABLE; i++) {
    +		value = sbi->s_mb_prealloc_table[i];
    +		if (value == 0)
    +			break;
    +		else
    +			last_non_zero = value;
    +
    +		if (size <= value) {
    +			wind = value;
    +			break;
    +		}
    +	}
    +
    +	if (wind == 0) {
    +		if (last_non_zero != 0) {
    +			u64 tstart, tend;
    +
    +			/* file is quite large, we now preallocate with
     +			 * the biggest configured window with regard to
    +			 * logical offset
    +			 */
    +			wind = last_non_zero;
    +			tstart = ac->ac_o_ex.fe_logical;
    +			do_div(tstart, wind);
    +			start = tstart * wind;
    +			tend = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len - 1;
    +			do_div(tend, wind);
    +			tend = tend * wind + wind;
    +			size = tend - start;
    +		}
     	} else {
    -		start_off = (loff_t) ac->ac_o_ex.fe_logical << bsbits;
    -		size	  = (loff_t) EXT4_C2B(EXT4_SB(ac->ac_sb),
    -					      ac->ac_o_ex.fe_len) << bsbits;
    +		size = wind;
     	}
    -	size = size >> bsbits;
    -	start = start_off >> bsbits;
    +
    +	orig_size = size;
     
     	/* don't cover already allocated blocks in selected range */
     	if (ar->pleft && start <= ar->lleft) {
    @@ -3223,7 +3333,6 @@ static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
     			 (unsigned long) ac->ac_o_ex.fe_logical);
     		BUG();
     	}
    -	BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
     
     	/* now prepare goal request */
     
    @@ -4191,11 +4300,19 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
     
     	/* don't use group allocation for large files */
     	size = max(size, isize);
    -	if (size > sbi->s_mb_stream_request) {
    +	if ((ac->ac_o_ex.fe_len >= sbi->s_mb_small_req) ||
    +	    (size >= sbi->s_mb_large_req)) {
     		ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
     		return;
     	}
     
    +	/*
    +	 * request is so large that we don't care about
     +	 * streaming - it outweighs any possible seek
    +	 */
    +	if (ac->ac_o_ex.fe_len >= sbi->s_mb_large_req)
    +		return;
    +
     	BUG_ON(ac->ac_lg != NULL);
     	/*
     	 * locality group prealloc space are per cpu. The reason for having
    diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
    index a616f58..a52b311 100644
    --- a/fs/ext4/namei.c
    +++ b/fs/ext4/namei.c
    @@ -752,8 +752,8 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
     	if (root->info.hash_version != DX_HASH_TEA &&
     	    root->info.hash_version != DX_HASH_HALF_MD4 &&
     	    root->info.hash_version != DX_HASH_LEGACY) {
    -		ext4_warning_inode(dir, "Unrecognised inode hash code %u",
    -				   root->info.hash_version);
    +		ext4_warning_inode(dir, "Unrecognised inode hash code %u for directory %lu",
    +				   root->info.hash_version, dir->i_ino);
     		goto fail;
     	}
     	if (fname)
    diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
    index 04b4f53..1375815 100644
    --- a/fs/ext4/sysfs.c
    +++ b/fs/ext4/sysfs.c
    @@ -184,7 +184,8 @@ static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
     EXT4_RW_ATTR_SBI_UI(mb_max_to_scan, s_mb_max_to_scan);
     EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
     EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
    -EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
    +EXT4_RW_ATTR_SBI_UI(mb_small_req, s_mb_small_req);
    +EXT4_RW_ATTR_SBI_UI(mb_large_req, s_mb_large_req);
     EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
     EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
     EXT4_ATTR(trigger_fs_error, 0200, trigger_test_error);
    @@ -213,7 +214,8 @@ static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
     	ATTR_LIST(mb_max_to_scan),
     	ATTR_LIST(mb_min_to_scan),
     	ATTR_LIST(mb_order2_req),
    -	ATTR_LIST(mb_stream_req),
    +	ATTR_LIST(mb_small_req),
    +	ATTR_LIST(mb_large_req),
     	ATTR_LIST(mb_group_prealloc),
     	ATTR_LIST(max_writeback_mb_bump),
     	ATTR_LIST(extent_max_zeroout_kb),
    @@ -413,6 +415,8 @@ int ext4_register_sysfs(struct super_block *sb)
     				sb);
     		proc_create_seq_data("mb_groups", S_IRUGO, sbi->s_proc,
     				&ext4_mb_seq_groups_ops, sb);
    +		proc_create_data("prealloc_table", S_IRUGO, sbi->s_proc,
    +				 &ext4_seq_prealloc_table_fops, sb);
     	}
     	return 0;
     }
    -- 
    1.8.3.1
    
    


* [lustre-devel] [PATCH 08/22] ext4: kill off struct dx_root
  2019-07-22  4:52   ` NeilBrown
  2019-07-23  2:07     ` Andreas Dilger
@ 2019-08-05  7:31     ` Artem Blagodarenko
  1 sibling, 0 replies; 53+ messages in thread
From: Artem Blagodarenko @ 2019-08-05  7:31 UTC (permalink / raw)
  To: lustre-devel

We need this for the dirdata feature.

On 22/07/2019, 07:52, "NeilBrown" <neilb@suse.com> wrote:

    On Sun, Jul 21 2019, James Simmons wrote:
    
    > Use struct dx_root_info directly.
    
    This looks like a style change.
    Instead of a 'struct' that models the start of an on-disk directory
    block, the location of some fields is done using calculations instead.
    
    Why would we want to do that?  Might we want dx_root_info to be at a
    different offset in the directory ??
    
    NeilBrown
    


end of thread, other threads:[~2019-08-05  7:31 UTC | newest]

Thread overview: 53+ messages
2019-07-22  1:23 [lustre-devel] [PATCH 00/22] [RFC] ldiskfs patches against 5.2-rc2+ James Simmons
2019-07-22  1:23 ` [lustre-devel] [PATCH 01/22] ext4: add i_fs_version James Simmons
2019-07-22  4:13   ` NeilBrown
2019-07-23  0:07     ` James Simmons
2019-07-31 22:03     ` Andreas Dilger
2019-07-22  1:23 ` [lustre-devel] [PATCH 02/22] ext4: use d_find_alias() in ext4_lookup James Simmons
2019-07-22  4:16   ` NeilBrown
2019-07-22  1:23 ` [lustre-devel] [PATCH 03/22] ext4: prealloc table optimization James Simmons
2019-07-22  4:29   ` NeilBrown
2019-08-05  7:07   ` Artem Blagodarenko
2019-07-22  1:23 ` [lustre-devel] [PATCH 04/22] ext4: export inode management James Simmons
2019-07-22  4:34   ` NeilBrown
2019-07-22  7:16     ` Oleg Drokin
2019-07-22  1:23 ` [lustre-devel] [PATCH 05/22] ext4: various misc changes James Simmons
2019-07-22  1:23 ` [lustre-devel] [PATCH 06/22] ext4: add extra checks for mballoc James Simmons
2019-07-22  4:37   ` NeilBrown
2019-07-22  1:23 ` [lustre-devel] [PATCH 07/22] ext4: update .. for hash indexed directory James Simmons
2019-07-22  4:45   ` NeilBrown
2019-07-22  1:23 ` [lustre-devel] [PATCH 08/22] ext4: kill off struct dx_root James Simmons
2019-07-22  4:52   ` NeilBrown
2019-07-23  2:07     ` Andreas Dilger
2019-08-05  7:31     ` Artem Blagodarenko
2019-07-22  1:23 ` [lustre-devel] [PATCH 09/22] ext4: fix mballoc pa free mismatch James Simmons
2019-07-22  4:56   ` NeilBrown
2019-07-22  1:23 ` [lustre-devel] [PATCH 10/22] ext4: add data in dentry feature James Simmons
2019-07-22  1:23 ` [lustre-devel] [PATCH 11/22] ext4: over ride current_time James Simmons
2019-07-22  5:06   ` NeilBrown
2019-07-22  1:23 ` [lustre-devel] [PATCH 12/22] ext4: add htree lock implementation James Simmons
2019-07-22  5:10   ` NeilBrown
2019-07-22  1:23 ` [lustre-devel] [PATCH 13/22] ext4: Add a proc interface for max_dir_size James Simmons
2019-07-22  5:14   ` NeilBrown
2019-07-22  1:23 ` [lustre-devel] [PATCH 14/22] ext4: remove inode_lock handling James Simmons
2019-07-22  5:16   ` NeilBrown
2019-07-22  1:23 ` [lustre-devel] [PATCH 15/22] ext4: remove bitmap corruption warnings James Simmons
2019-07-22  5:18   ` NeilBrown
2019-07-22  1:23 ` [lustre-devel] [PATCH 16/22] ext4: add warning for directory htree growth James Simmons
2019-07-22  5:24   ` NeilBrown
2019-07-22  1:23 ` [lustre-devel] [PATCH 17/22] ext4: optimize ext4_journal_callback_add James Simmons
2019-07-22  5:27   ` NeilBrown
2019-07-23  2:01     ` Andreas Dilger
2019-07-22  1:23 ` [lustre-devel] [PATCH 18/22] ext4: attach jinode in writepages James Simmons
2019-07-22  1:23 ` [lustre-devel] [PATCH 19/22] ext4: don't check before replay James Simmons
2019-07-22  5:29   ` NeilBrown
     [not found]     ` <506765DD-0068-469E-ADA4-2C71B8B60114@cloudlinux.com>
2019-07-22  6:46       ` NeilBrown
2019-07-22  6:56         ` Oleg Drokin
2019-07-22  9:51           ` Alexey Lyashkov
2019-07-23  1:57     ` Andreas Dilger
2019-07-23  2:01       ` Oleg Drokin
2019-07-22  1:23 ` [lustre-devel] [PATCH 20/22] ext4: use GFP_NOFS in ext4_inode_attach_jinode James Simmons
2019-07-22  5:30   ` NeilBrown
2019-07-23  1:56     ` Andreas Dilger
2019-07-22  1:23 ` [lustre-devel] [PATCH 21/22] ext4: export ext4_orphan_add James Simmons
2019-07-22  1:23 ` [lustre-devel] [PATCH 22/22] ext4: export mb stream allocator variables James Simmons
