* [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races
@ 2022-07-12 10:54 Jan Kara
  2022-07-12 10:54 ` [PATCH 01/10] mbcache: Don't reclaim used entries Jan Kara
                   ` (10 more replies)
  0 siblings, 11 replies; 31+ messages in thread
From: Jan Kara @ 2022-07-12 10:54 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Ritesh Harjani, Jan Kara

Hello!

I've noticed this series didn't get merged yet. I was waiting for more review
feedback from Ritesh but somehow that didn't happen. So this is the third
submission of the series fixing the races of ext4 xattr block reuse with the
few changes that have accumulated since v2. Ted, do you think you can add this
series to your tree so that we can merge it during the merge window? Thanks!

Changes since v1:
* Reworked the series to fix all corner cases and make the API less error-prone.

Changes since v2:
* Renamed mb_cache_entry_try_delete() to mb_cache_entry_delete_or_get()
* Added Tested-by tag from Ritesh

								Honza

Previous versions:
Link: http://lore.kernel.org/r/20220606142215.17962-1-jack@suse.cz # v1
Link: http://lore.kernel.org/r/20220614124146.21594-1-jack@suse.cz # v2


* [PATCH 01/10] mbcache: Don't reclaim used entries
  2022-07-12 10:54 [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Jan Kara
@ 2022-07-12 10:54 ` Jan Kara
  2022-07-14 11:47   ` Ritesh Harjani
  2022-07-22 13:58   ` Theodore Ts'o
  2022-07-12 10:54 ` [PATCH 02/10] mbcache: Add functions to delete entry if unused Jan Kara
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 31+ messages in thread
From: Jan Kara @ 2022-07-12 10:54 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Ritesh Harjani, Jan Kara, stable

Do not reclaim entries that are currently used by somebody from a
shrinker. Firstly, these entries are likely useful. Secondly, we will
need to keep such entries to protect pending increment of xattr block
refcount.
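
For reference, the refcount convention this check relies on (a sketch, not part
of the patch): an unused entry holds one reference for the hash list and one
for the LRU list, so anything above 2 means there is an active user. A
hypothetical helper spelling that out could look like:

	/* illustration only - assumes <linux/atomic.h> and struct mb_cache_entry */
	static inline bool mb_entry_in_use(struct mb_cache_entry *entry)
	{
		/* 2 == hash list ref + LRU list ref; more means somebody holds it */
		return atomic_read(&entry->e_refcnt) > 2;
	}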

CC: stable@vger.kernel.org
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/mbcache.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 97c54d3a2227..cfc28129fb6f 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -288,7 +288,7 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
 	while (nr_to_scan-- && !list_empty(&cache->c_list)) {
 		entry = list_first_entry(&cache->c_list,
 					 struct mb_cache_entry, e_list);
-		if (entry->e_referenced) {
+		if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
 			entry->e_referenced = 0;
 			list_move_tail(&entry->e_list, &cache->c_list);
 			continue;
@@ -302,6 +302,14 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
 		spin_unlock(&cache->c_list_lock);
 		head = mb_cache_entry_head(cache, entry->e_key);
 		hlist_bl_lock(head);
+		/* Now a reliable check if the entry didn't get used... */
+		if (atomic_read(&entry->e_refcnt) > 2) {
+			hlist_bl_unlock(head);
+			spin_lock(&cache->c_list_lock);
+			list_add_tail(&entry->e_list, &cache->c_list);
+			cache->c_entry_count++;
+			continue;
+		}
 		if (!hlist_bl_unhashed(&entry->e_hash_list)) {
 			hlist_bl_del_init(&entry->e_hash_list);
 			atomic_dec(&entry->e_refcnt);
-- 
2.35.3



* [PATCH 02/10] mbcache: Add functions to delete entry if unused
  2022-07-12 10:54 [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Jan Kara
  2022-07-12 10:54 ` [PATCH 01/10] mbcache: Don't reclaim used entries Jan Kara
@ 2022-07-12 10:54 ` Jan Kara
  2022-07-14 12:15   ` Ritesh Harjani
  2022-07-12 10:54 ` [PATCH 03/10] ext4: Remove EA inode entry from mbcache on inode eviction Jan Kara
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2022-07-12 10:54 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Ritesh Harjani, Jan Kara, stable

Add a function mb_cache_entry_delete_or_get() to delete an mbcache entry if
it is unused and also add a function to wait for an entry to become unused
- mb_cache_entry_wait_unused(). We do not share code between the two
deleting functions as one of them will go away soon.
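
The intended calling pattern on the deleting side (a sketch that mirrors how
later patches in this series use the API; cache/key/value are placeholders) is
roughly:

	struct mb_cache_entry *oe;

	/* wait for the entry to become unused so that it can be removed */
	while ((oe = mb_cache_entry_delete_or_get(cache, key, value))) {
		mb_cache_entry_wait_unused(oe);
		mb_cache_entry_put(cache, oe);
	}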

CC: stable@vger.kernel.org
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/mbcache.c            | 66 +++++++++++++++++++++++++++++++++++++++--
 include/linux/mbcache.h | 10 ++++++-
 2 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index cfc28129fb6f..2010bc80a3f2 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -11,7 +11,7 @@
 /*
  * Mbcache is a simple key-value store. Keys need not be unique, however
  * key-value pairs are expected to be unique (we use this fact in
- * mb_cache_entry_delete()).
+ * mb_cache_entry_delete_or_get()).
  *
  * Ext2 and ext4 use this cache for deduplication of extended attribute blocks.
  * Ext4 also uses it for deduplication of xattr values stored in inodes.
@@ -125,6 +125,19 @@ void __mb_cache_entry_free(struct mb_cache_entry *entry)
 }
 EXPORT_SYMBOL(__mb_cache_entry_free);
 
+/*
+ * mb_cache_entry_wait_unused - wait to be the last user of the entry
+ *
+ * @entry - entry to work on
+ *
+ * Wait to be the last user of the entry.
+ */
+void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
+{
+	wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);
+}
+EXPORT_SYMBOL(mb_cache_entry_wait_unused);
+
 static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
 					   struct mb_cache_entry *entry,
 					   u32 key)
@@ -217,7 +230,7 @@ struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
 }
 EXPORT_SYMBOL(mb_cache_entry_get);
 
-/* mb_cache_entry_delete - remove a cache entry
+/* mb_cache_entry_delete - try to remove a cache entry
  * @cache - cache we work with
  * @key - key
  * @value - value
@@ -254,6 +267,55 @@ void mb_cache_entry_delete(struct mb_cache *cache, u32 key, u64 value)
 }
 EXPORT_SYMBOL(mb_cache_entry_delete);
 
+/* mb_cache_entry_delete_or_get - remove a cache entry if it has no users
+ * @cache - cache we work with
+ * @key - key
+ * @value - value
+ *
+ * Remove entry from cache @cache with key @key and value @value. The removal
+ * happens only if the entry is unused. The function returns NULL in case the
+ * entry was successfully removed or there's no entry in the cache. Otherwise
+ * the function grabs a reference to the entry that we failed to delete
+ * because it still has users and returns it.
+ */
+struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
+						    u32 key, u64 value)
+{
+	struct hlist_bl_node *node;
+	struct hlist_bl_head *head;
+	struct mb_cache_entry *entry;
+
+	head = mb_cache_entry_head(cache, key);
+	hlist_bl_lock(head);
+	hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
+		if (entry->e_key == key && entry->e_value == value) {
+			if (atomic_read(&entry->e_refcnt) > 2) {
+				atomic_inc(&entry->e_refcnt);
+				hlist_bl_unlock(head);
+				return entry;
+			}
+			/* We keep hash list reference to keep entry alive */
+			hlist_bl_del_init(&entry->e_hash_list);
+			hlist_bl_unlock(head);
+			spin_lock(&cache->c_list_lock);
+			if (!list_empty(&entry->e_list)) {
+				list_del_init(&entry->e_list);
+				if (!WARN_ONCE(cache->c_entry_count == 0,
+		"mbcache: attempt to decrement c_entry_count past zero"))
+					cache->c_entry_count--;
+				atomic_dec(&entry->e_refcnt);
+			}
+			spin_unlock(&cache->c_list_lock);
+			mb_cache_entry_put(cache, entry);
+			return NULL;
+		}
+	}
+	hlist_bl_unlock(head);
+
+	return NULL;
+}
+EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
+
 /* mb_cache_entry_touch - cache entry got used
  * @cache - cache the entry belongs to
  * @entry - entry that got used
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 20f1e3ff6013..8eca7f25c432 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -30,15 +30,23 @@ void mb_cache_destroy(struct mb_cache *cache);
 int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
 			  u64 value, bool reusable);
 void __mb_cache_entry_free(struct mb_cache_entry *entry);
+void mb_cache_entry_wait_unused(struct mb_cache_entry *entry);
 static inline int mb_cache_entry_put(struct mb_cache *cache,
 				     struct mb_cache_entry *entry)
 {
-	if (!atomic_dec_and_test(&entry->e_refcnt))
+	unsigned int cnt = atomic_dec_return(&entry->e_refcnt);
+
+	if (cnt > 0) {
+		if (cnt <= 3)
+			wake_up_var(&entry->e_refcnt);
 		return 0;
+	}
 	__mb_cache_entry_free(entry);
 	return 1;
 }
 
+struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
+						    u32 key, u64 value);
 void mb_cache_entry_delete(struct mb_cache *cache, u32 key, u64 value);
 struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
 					  u64 value);
-- 
2.35.3



* [PATCH 03/10] ext4: Remove EA inode entry from mbcache on inode eviction
  2022-07-12 10:54 [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Jan Kara
  2022-07-12 10:54 ` [PATCH 01/10] mbcache: Don't reclaim used entries Jan Kara
  2022-07-12 10:54 ` [PATCH 02/10] mbcache: Add functions to delete entry if unused Jan Kara
@ 2022-07-12 10:54 ` Jan Kara
  2022-07-12 10:54 ` [PATCH 04/10] ext4: Unindent codeblock in ext4_xattr_block_set() Jan Kara
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2022-07-12 10:54 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Ritesh Harjani, Jan Kara, stable

Currently we remove the EA inode from the mbcache as soon as its xattr
refcount drops to zero. However there can be pending attempts to reuse the
inode and thus the refcount handling code has to handle the situation when
the refcount increases from zero anyway. So save some work and just keep the
EA inode in the mbcache until it gets evicted. At that moment we are sure the
following iget() of the EA inode will fail anyway (or wait for eviction to
finish and load things from the disk again) and so removing the mbcache
entry at that moment is fine and simplifies the code a bit.

CC: stable@vger.kernel.org
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |  2 ++
 fs/ext4/xattr.c | 24 ++++++++----------------
 fs/ext4/xattr.h |  1 +
 3 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3dce7d058985..7450ee734262 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -177,6 +177,8 @@ void ext4_evict_inode(struct inode *inode)
 
 	trace_ext4_evict_inode(inode);
 
+	if (EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL)
+		ext4_evict_ea_inode(inode);
 	if (inode->i_nlink) {
 		/*
 		 * When journalling data dirty buffers are tracked only in the
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 042325349098..7fc40fb1e6b3 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -436,6 +436,14 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
 	return err;
 }
 
+/* Remove entry from mbcache when EA inode is getting evicted */
+void ext4_evict_ea_inode(struct inode *inode)
+{
+	if (EA_INODE_CACHE(inode))
+		mb_cache_entry_delete(EA_INODE_CACHE(inode),
+			ext4_xattr_inode_get_hash(inode), inode->i_ino);
+}
+
 static int
 ext4_xattr_inode_verify_hashes(struct inode *ea_inode,
 			       struct ext4_xattr_entry *entry, void *buffer,
@@ -976,10 +984,8 @@ int __ext4_xattr_set_credits(struct super_block *sb, struct inode *inode,
 static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
 				       int ref_change)
 {
-	struct mb_cache *ea_inode_cache = EA_INODE_CACHE(ea_inode);
 	struct ext4_iloc iloc;
 	s64 ref_count;
-	u32 hash;
 	int ret;
 
 	inode_lock(ea_inode);
@@ -1002,14 +1008,6 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
 
 			set_nlink(ea_inode, 1);
 			ext4_orphan_del(handle, ea_inode);
-
-			if (ea_inode_cache) {
-				hash = ext4_xattr_inode_get_hash(ea_inode);
-				mb_cache_entry_create(ea_inode_cache,
-						      GFP_NOFS, hash,
-						      ea_inode->i_ino,
-						      true /* reusable */);
-			}
 		}
 	} else {
 		WARN_ONCE(ref_count < 0, "EA inode %lu ref_count=%lld",
@@ -1022,12 +1020,6 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
 
 			clear_nlink(ea_inode);
 			ext4_orphan_add(handle, ea_inode);
-
-			if (ea_inode_cache) {
-				hash = ext4_xattr_inode_get_hash(ea_inode);
-				mb_cache_entry_delete(ea_inode_cache, hash,
-						      ea_inode->i_ino);
-			}
 		}
 	}
 
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 77efb9a627ad..060b7a2f485c 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -178,6 +178,7 @@ extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);
 
 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 			    struct ext4_inode *raw_inode, handle_t *handle);
+extern void ext4_evict_ea_inode(struct inode *inode);
 
 extern const struct xattr_handler *ext4_xattr_handlers[];
 
-- 
2.35.3



* [PATCH 04/10] ext4: Unindent codeblock in ext4_xattr_block_set()
  2022-07-12 10:54 [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Jan Kara
                   ` (2 preceding siblings ...)
  2022-07-12 10:54 ` [PATCH 03/10] ext4: Remove EA inode entry from mbcache on inode eviction Jan Kara
@ 2022-07-12 10:54 ` Jan Kara
  2022-07-14 12:19   ` Ritesh Harjani
  2022-07-12 10:54 ` [PATCH 05/10] ext4: Fix race when reusing xattr blocks Jan Kara
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2022-07-12 10:54 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Ritesh Harjani, Jan Kara, stable

Remove an unnecessary else (and thus an indentation level) from a code block
in ext4_xattr_block_set(). It will also make the following code changes
easier. No functional changes.

CC: stable@vger.kernel.org
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/xattr.c | 79 ++++++++++++++++++++++++-------------------------
 1 file changed, 39 insertions(+), 40 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 7fc40fb1e6b3..aadfae53d055 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1850,6 +1850,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 #define header(x) ((struct ext4_xattr_header *)(x))
 
 	if (s->base) {
+		int offset = (char *)s->here - bs->bh->b_data;
+
 		BUFFER_TRACE(bs->bh, "get_write_access");
 		error = ext4_journal_get_write_access(handle, sb, bs->bh,
 						      EXT4_JTR_NONE);
@@ -1882,50 +1884,47 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 			if (error)
 				goto cleanup;
 			goto inserted;
-		} else {
-			int offset = (char *)s->here - bs->bh->b_data;
+		}
+		unlock_buffer(bs->bh);
+		ea_bdebug(bs->bh, "cloning");
+		s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
+		error = -ENOMEM;
+		if (s->base == NULL)
+			goto cleanup;
+		memcpy(s->base, BHDR(bs->bh), bs->bh->b_size);
+		s->first = ENTRY(header(s->base)+1);
+		header(s->base)->h_refcount = cpu_to_le32(1);
+		s->here = ENTRY(s->base + offset);
+		s->end = s->base + bs->bh->b_size;
 
-			unlock_buffer(bs->bh);
-			ea_bdebug(bs->bh, "cloning");
-			s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
-			error = -ENOMEM;
-			if (s->base == NULL)
+		/*
+		 * If existing entry points to an xattr inode, we need
+		 * to prevent ext4_xattr_set_entry() from decrementing
+		 * ref count on it because the reference belongs to the
+		 * original block. In this case, make the entry look
+		 * like it has an empty value.
+		 */
+		if (!s->not_found && s->here->e_value_inum) {
+			ea_ino = le32_to_cpu(s->here->e_value_inum);
+			error = ext4_xattr_inode_iget(inode, ea_ino,
+				      le32_to_cpu(s->here->e_hash),
+				      &tmp_inode);
+			if (error)
 				goto cleanup;
-			memcpy(s->base, BHDR(bs->bh), bs->bh->b_size);
-			s->first = ENTRY(header(s->base)+1);
-			header(s->base)->h_refcount = cpu_to_le32(1);
-			s->here = ENTRY(s->base + offset);
-			s->end = s->base + bs->bh->b_size;
 
-			/*
-			 * If existing entry points to an xattr inode, we need
-			 * to prevent ext4_xattr_set_entry() from decrementing
-			 * ref count on it because the reference belongs to the
-			 * original block. In this case, make the entry look
-			 * like it has an empty value.
-			 */
-			if (!s->not_found && s->here->e_value_inum) {
-				ea_ino = le32_to_cpu(s->here->e_value_inum);
-				error = ext4_xattr_inode_iget(inode, ea_ino,
-					      le32_to_cpu(s->here->e_hash),
-					      &tmp_inode);
-				if (error)
-					goto cleanup;
-
-				if (!ext4_test_inode_state(tmp_inode,
-						EXT4_STATE_LUSTRE_EA_INODE)) {
-					/*
-					 * Defer quota free call for previous
-					 * inode until success is guaranteed.
-					 */
-					old_ea_inode_quota = le32_to_cpu(
-							s->here->e_value_size);
-				}
-				iput(tmp_inode);
-
-				s->here->e_value_inum = 0;
-				s->here->e_value_size = 0;
+			if (!ext4_test_inode_state(tmp_inode,
+					EXT4_STATE_LUSTRE_EA_INODE)) {
+				/*
+				 * Defer quota free call for previous
+				 * inode until success is guaranteed.
+				 */
+				old_ea_inode_quota = le32_to_cpu(
+						s->here->e_value_size);
 			}
+			iput(tmp_inode);
+
+			s->here->e_value_inum = 0;
+			s->here->e_value_size = 0;
 		}
 	} else {
 		/* Allocate a buffer where we construct the new block. */
-- 
2.35.3



* [PATCH 05/10] ext4: Fix race when reusing xattr blocks
  2022-07-12 10:54 [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Jan Kara
                   ` (3 preceding siblings ...)
  2022-07-12 10:54 ` [PATCH 04/10] ext4: Unindent codeblock in ext4_xattr_block_set() Jan Kara
@ 2022-07-12 10:54 ` Jan Kara
  2022-07-14 12:26   ` Ritesh Harjani
  2022-07-16  3:00   ` Zhihao Cheng
  2022-07-12 10:54 ` [PATCH 06/10] ext2: Factor out freeing of xattr block reference Jan Kara
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 31+ messages in thread
From: Jan Kara @ 2022-07-12 10:54 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Ritesh Harjani, Jan Kara, stable

When ext4_xattr_block_set() decides to remove the xattr block, the following
race can happen:

CPU1                                    CPU2
ext4_xattr_block_set()                  ext4_xattr_release_block()
  new_bh = ext4_xattr_block_cache_find()

                                          lock_buffer(bh);
                                          ref = le32_to_cpu(BHDR(bh)->h_refcount);
                                          if (ref == 1) {
                                            ...
                                            mb_cache_entry_delete();
                                            unlock_buffer(bh);
                                            ext4_free_blocks();
                                              ...
                                              ext4_forget(..., bh, ...);
                                                jbd2_journal_revoke(..., bh);

  ext4_journal_get_write_access(..., new_bh, ...)
    do_get_write_access()
      jbd2_journal_cancel_revoke(..., new_bh);

Later the code in ext4_xattr_block_set() finds out the block got freed
and cancels reuse of the block but the revoke stays canceled and so in
case of block reuse and journal replay the filesystem can get corrupted.
If the race works out slightly differently, we can also hit assertions
in the jbd2 code.

Fix the problem by making sure that once a matching mbcache entry is
found, the code dropping the last xattr block reference (or trying to
modify the xattr block in place) waits until the mbcache entry reference
is dropped. This way the code trying to reuse the xattr block is
protected from someone trying to drop the last reference to the xattr
block.
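
In other words (a sketch of the two sides, using the helpers added earlier in
this series; the reusing side stands in for ext4_xattr_block_cache_find() and
its caller):

	/* reusing side: holding the mbcache entry blocks the freer */
	ce = mb_cache_entry_find_first(ea_block_cache, hash);
	/* ... bump the xattr block refcount under the buffer lock ... */
	mb_cache_entry_put(ea_block_cache, ce);	/* wakes up any waiter */

	/* freeing side, as in ext4_xattr_release_block() below */
	oe = mb_cache_entry_delete_or_get(ea_block_cache, hash, bh->b_blocknr);
	if (oe) {
		unlock_buffer(bh);
		mb_cache_entry_wait_unused(oe);
		mb_cache_entry_put(ea_block_cache, oe);
		goto retry_ref;
	}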

Reported-and-tested-by: Ritesh Harjani <ritesh.list@gmail.com>
CC: stable@vger.kernel.org
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/xattr.c | 67 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 45 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index aadfae53d055..3a0928c8720e 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -439,9 +439,16 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
 /* Remove entry from mbcache when EA inode is getting evicted */
 void ext4_evict_ea_inode(struct inode *inode)
 {
-	if (EA_INODE_CACHE(inode))
-		mb_cache_entry_delete(EA_INODE_CACHE(inode),
-			ext4_xattr_inode_get_hash(inode), inode->i_ino);
+	struct mb_cache_entry *oe;
+
+	if (!EA_INODE_CACHE(inode))
+		return;
+	/* Wait for entry to get unused so that we can remove it */
+	while ((oe = mb_cache_entry_delete_or_get(EA_INODE_CACHE(inode),
+			ext4_xattr_inode_get_hash(inode), inode->i_ino))) {
+		mb_cache_entry_wait_unused(oe);
+		mb_cache_entry_put(EA_INODE_CACHE(inode), oe);
+	}
 }
 
 static int
@@ -1229,6 +1236,7 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 	if (error)
 		goto out;
 
+retry_ref:
 	lock_buffer(bh);
 	hash = le32_to_cpu(BHDR(bh)->h_hash);
 	ref = le32_to_cpu(BHDR(bh)->h_refcount);
@@ -1238,9 +1246,18 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 		 * This must happen under buffer lock for
 		 * ext4_xattr_block_set() to reliably detect freed block
 		 */
-		if (ea_block_cache)
-			mb_cache_entry_delete(ea_block_cache, hash,
-					      bh->b_blocknr);
+		if (ea_block_cache) {
+			struct mb_cache_entry *oe;
+
+			oe = mb_cache_entry_delete_or_get(ea_block_cache, hash,
+							  bh->b_blocknr);
+			if (oe) {
+				unlock_buffer(bh);
+				mb_cache_entry_wait_unused(oe);
+				mb_cache_entry_put(ea_block_cache, oe);
+				goto retry_ref;
+			}
+		}
 		get_bh(bh);
 		unlock_buffer(bh);
 
@@ -1867,9 +1884,20 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 			 * ext4_xattr_block_set() to reliably detect modified
 			 * block
 			 */
-			if (ea_block_cache)
-				mb_cache_entry_delete(ea_block_cache, hash,
-						      bs->bh->b_blocknr);
+			if (ea_block_cache) {
+				struct mb_cache_entry *oe;
+
+				oe = mb_cache_entry_delete_or_get(ea_block_cache,
+					hash, bs->bh->b_blocknr);
+				if (oe) {
+					/*
+					 * Xattr block is getting reused. Leave
+					 * it alone.
+					 */
+					mb_cache_entry_put(ea_block_cache, oe);
+					goto clone_block;
+				}
+			}
 			ea_bdebug(bs->bh, "modifying in-place");
 			error = ext4_xattr_set_entry(i, s, handle, inode,
 						     true /* is_block */);
@@ -1885,6 +1913,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 				goto cleanup;
 			goto inserted;
 		}
+clone_block:
 		unlock_buffer(bs->bh);
 		ea_bdebug(bs->bh, "cloning");
 		s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
@@ -1991,18 +2020,13 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 				lock_buffer(new_bh);
 				/*
 				 * We have to be careful about races with
-				 * freeing, rehashing or adding references to
-				 * xattr block. Once we hold buffer lock xattr
-				 * block's state is stable so we can check
-				 * whether the block got freed / rehashed or
-				 * not.  Since we unhash mbcache entry under
-				 * buffer lock when freeing / rehashing xattr
-				 * block, checking whether entry is still
-				 * hashed is reliable. Same rules hold for
-				 * e_reusable handling.
+				 * adding references to xattr block. Once we
+				 * hold buffer lock xattr block's state is
+				 * stable so we can check the additional
+				 * reference fits.
 				 */
-				if (hlist_bl_unhashed(&ce->e_hash_list) ||
-				    !ce->e_reusable) {
+				ref = le32_to_cpu(BHDR(new_bh)->h_refcount) + 1;
+				if (ref > EXT4_XATTR_REFCOUNT_MAX) {
 					/*
 					 * Undo everything and check mbcache
 					 * again.
@@ -2017,9 +2041,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 					new_bh = NULL;
 					goto inserted;
 				}
-				ref = le32_to_cpu(BHDR(new_bh)->h_refcount) + 1;
 				BHDR(new_bh)->h_refcount = cpu_to_le32(ref);
-				if (ref >= EXT4_XATTR_REFCOUNT_MAX)
+				if (ref == EXT4_XATTR_REFCOUNT_MAX)
 					ce->e_reusable = 0;
 				ea_bdebug(new_bh, "reusing; refcount now=%d",
 					  ref);
-- 
2.35.3



* [PATCH 06/10] ext2: Factor out freeing of xattr block reference
  2022-07-12 10:54 [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Jan Kara
                   ` (4 preceding siblings ...)
  2022-07-12 10:54 ` [PATCH 05/10] ext4: Fix race when reusing xattr blocks Jan Kara
@ 2022-07-12 10:54 ` Jan Kara
  2022-07-14 12:37   ` Ritesh Harjani
  2022-07-12 10:54 ` [PATCH 07/10] ext2: Unindent codeblock in ext2_xattr_set() Jan Kara
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2022-07-12 10:54 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Ritesh Harjani, Jan Kara

Freeing of an xattr block reference is open coded in two places. Factor it
out into a separate function and use it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/xattr.c | 90 +++++++++++++++++++++----------------------------
 1 file changed, 38 insertions(+), 52 deletions(-)

diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index 841fa6d9d744..9885294993ef 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -651,6 +651,42 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
 	return error;
 }
 
+static void ext2_xattr_release_block(struct inode *inode,
+				     struct buffer_head *bh)
+{
+	struct mb_cache *ea_block_cache = EA_BLOCK_CACHE(inode);
+
+	lock_buffer(bh);
+	if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
+		__u32 hash = le32_to_cpu(HDR(bh)->h_hash);
+
+		/*
+		 * This must happen under buffer lock for
+		 * ext2_xattr_set2() to reliably detect freed block
+		 */
+		mb_cache_entry_delete(ea_block_cache, hash,
+				      bh->b_blocknr);
+		/* Free the old block. */
+		ea_bdebug(bh, "freeing");
+		ext2_free_blocks(inode, bh->b_blocknr, 1);
+		/* We let our caller release bh, so we
+		 * need to duplicate the buffer before. */
+		get_bh(bh);
+		bforget(bh);
+		unlock_buffer(bh);
+	} else {
+		/* Decrement the refcount only. */
+		le32_add_cpu(&HDR(bh)->h_refcount, -1);
+		dquot_free_block(inode, 1);
+		mark_buffer_dirty(bh);
+		unlock_buffer(bh);
+		ea_bdebug(bh, "refcount now=%d",
+			le32_to_cpu(HDR(bh)->h_refcount));
+		if (IS_SYNC(inode))
+			sync_dirty_buffer(bh);
+	}
+}
+
 /*
  * Second half of ext2_xattr_set(): Update the file system.
  */
@@ -747,34 +783,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
 		 * If there was an old block and we are no longer using it,
 		 * release the old block.
 		 */
-		lock_buffer(old_bh);
-		if (HDR(old_bh)->h_refcount == cpu_to_le32(1)) {
-			__u32 hash = le32_to_cpu(HDR(old_bh)->h_hash);
-
-			/*
-			 * This must happen under buffer lock for
-			 * ext2_xattr_set2() to reliably detect freed block
-			 */
-			mb_cache_entry_delete(ea_block_cache, hash,
-					      old_bh->b_blocknr);
-			/* Free the old block. */
-			ea_bdebug(old_bh, "freeing");
-			ext2_free_blocks(inode, old_bh->b_blocknr, 1);
-			mark_inode_dirty(inode);
-			/* We let our caller release old_bh, so we
-			 * need to duplicate the buffer before. */
-			get_bh(old_bh);
-			bforget(old_bh);
-		} else {
-			/* Decrement the refcount only. */
-			le32_add_cpu(&HDR(old_bh)->h_refcount, -1);
-			dquot_free_block_nodirty(inode, 1);
-			mark_inode_dirty(inode);
-			mark_buffer_dirty(old_bh);
-			ea_bdebug(old_bh, "refcount now=%d",
-				le32_to_cpu(HDR(old_bh)->h_refcount));
-		}
-		unlock_buffer(old_bh);
+		ext2_xattr_release_block(inode, old_bh);
 	}
 
 cleanup:
@@ -828,30 +837,7 @@ ext2_xattr_delete_inode(struct inode *inode)
 			EXT2_I(inode)->i_file_acl);
 		goto cleanup;
 	}
-	lock_buffer(bh);
-	if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
-		__u32 hash = le32_to_cpu(HDR(bh)->h_hash);
-
-		/*
-		 * This must happen under buffer lock for ext2_xattr_set2() to
-		 * reliably detect freed block
-		 */
-		mb_cache_entry_delete(EA_BLOCK_CACHE(inode), hash,
-				      bh->b_blocknr);
-		ext2_free_blocks(inode, EXT2_I(inode)->i_file_acl, 1);
-		get_bh(bh);
-		bforget(bh);
-		unlock_buffer(bh);
-	} else {
-		le32_add_cpu(&HDR(bh)->h_refcount, -1);
-		ea_bdebug(bh, "refcount now=%d",
-			le32_to_cpu(HDR(bh)->h_refcount));
-		unlock_buffer(bh);
-		mark_buffer_dirty(bh);
-		if (IS_SYNC(inode))
-			sync_dirty_buffer(bh);
-		dquot_free_block_nodirty(inode, 1);
-	}
+	ext2_xattr_release_block(inode, bh);
 	EXT2_I(inode)->i_file_acl = 0;
 
 cleanup:
-- 
2.35.3



* [PATCH 07/10] ext2: Unindent codeblock in ext2_xattr_set()
  2022-07-12 10:54 [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Jan Kara
                   ` (5 preceding siblings ...)
  2022-07-12 10:54 ` [PATCH 06/10] ext2: Factor out freeing of xattr block reference Jan Kara
@ 2022-07-12 10:54 ` Jan Kara
  2022-07-14 12:38   ` Ritesh Harjani
  2022-07-12 10:54 ` [PATCH 08/10] ext2: Avoid deleting xattr block that is being reused Jan Kara
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2022-07-12 10:54 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Ritesh Harjani, Jan Kara

Replace one else branch in ext2_xattr_set() with a goto. This makes the
following code changes simpler to follow. No functional changes.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/xattr.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index 9885294993ef..37ce495eb279 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -517,7 +517,8 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
 	/* Here we know that we can set the new attribute. */
 
 	if (header) {
-		/* assert(header == HDR(bh)); */
+		int offset;
+
 		lock_buffer(bh);
 		if (header->h_refcount == cpu_to_le32(1)) {
 			__u32 hash = le32_to_cpu(header->h_hash);
@@ -531,22 +532,20 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
 					      bh->b_blocknr);
 
 			/* keep the buffer locked while modifying it. */
-		} else {
-			int offset;
-
-			unlock_buffer(bh);
-			ea_bdebug(bh, "cloning");
-			header = kmemdup(HDR(bh), bh->b_size, GFP_KERNEL);
-			error = -ENOMEM;
-			if (header == NULL)
-				goto cleanup;
-			header->h_refcount = cpu_to_le32(1);
-
-			offset = (char *)here - bh->b_data;
-			here = ENTRY((char *)header + offset);
-			offset = (char *)last - bh->b_data;
-			last = ENTRY((char *)header + offset);
+			goto update_block;
 		}
+		unlock_buffer(bh);
+		ea_bdebug(bh, "cloning");
+		header = kmemdup(HDR(bh), bh->b_size, GFP_KERNEL);
+		error = -ENOMEM;
+		if (header == NULL)
+			goto cleanup;
+		header->h_refcount = cpu_to_le32(1);
+
+		offset = (char *)here - bh->b_data;
+		here = ENTRY((char *)header + offset);
+		offset = (char *)last - bh->b_data;
+		last = ENTRY((char *)header + offset);
 	} else {
 		/* Allocate a buffer where we construct the new block. */
 		header = kzalloc(sb->s_blocksize, GFP_KERNEL);
@@ -559,6 +558,7 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
 		last = here = ENTRY(header+1);
 	}
 
+update_block:
 	/* Iff we are modifying the block in-place, bh is locked here. */
 
 	if (not_found) {
-- 
2.35.3



* [PATCH 08/10] ext2: Avoid deleting xattr block that is being reused
  2022-07-12 10:54 [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Jan Kara
                   ` (6 preceding siblings ...)
  2022-07-12 10:54 ` [PATCH 07/10] ext2: Unindent codeblock in ext2_xattr_set() Jan Kara
@ 2022-07-12 10:54 ` Jan Kara
  2022-07-12 10:54 ` [PATCH 09/10] mbcache: Remove mb_cache_entry_delete() Jan Kara
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2022-07-12 10:54 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Ritesh Harjani, Jan Kara

Currently, when we decide to reuse an xattr block, we detect the case when
the last reference to the xattr block is being dropped at the same time and
cancel the reuse attempt. Convert ext2 to a new scheme where, as soon as a
matching mbcache entry is found, we wait with dropping the last xattr block
reference until the mbcache entry reference is dropped (meaning either the
xattr block reference was increased or we decided not to reuse the block).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/xattr.c | 58 ++++++++++++++++++++++++-------------------------
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index 37ce495eb279..641abfa4b718 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -522,17 +522,18 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
 		lock_buffer(bh);
 		if (header->h_refcount == cpu_to_le32(1)) {
 			__u32 hash = le32_to_cpu(header->h_hash);
+			struct mb_cache_entry *oe;
 
-			ea_bdebug(bh, "modifying in-place");
+			oe = mb_cache_entry_delete_or_get(EA_BLOCK_CACHE(inode),
+					hash, bh->b_blocknr);
+			if (!oe) {
+				ea_bdebug(bh, "modifying in-place");
+				goto update_block;
+			}
 			/*
-			 * This must happen under buffer lock for
-			 * ext2_xattr_set2() to reliably detect modified block
+			 * Someone is trying to reuse the block, leave it alone
 			 */
-			mb_cache_entry_delete(EA_BLOCK_CACHE(inode), hash,
-					      bh->b_blocknr);
-
-			/* keep the buffer locked while modifying it. */
-			goto update_block;
+			mb_cache_entry_put(EA_BLOCK_CACHE(inode), oe);
 		}
 		unlock_buffer(bh);
 		ea_bdebug(bh, "cloning");
@@ -656,16 +657,29 @@ static void ext2_xattr_release_block(struct inode *inode,
 {
 	struct mb_cache *ea_block_cache = EA_BLOCK_CACHE(inode);
 
+retry_ref:
 	lock_buffer(bh);
 	if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
 		__u32 hash = le32_to_cpu(HDR(bh)->h_hash);
+		struct mb_cache_entry *oe;
 
 		/*
-		 * This must happen under buffer lock for
-		 * ext2_xattr_set2() to reliably detect freed block
+		 * This must happen under buffer lock to properly
+		 * serialize with ext2_xattr_set() reusing the block.
 		 */
-		mb_cache_entry_delete(ea_block_cache, hash,
-				      bh->b_blocknr);
+		oe = mb_cache_entry_delete_or_get(ea_block_cache, hash,
+						  bh->b_blocknr);
+		if (oe) {
+			/*
+			 * Someone is trying to reuse the block. Wait
+			 * and retry.
+			 */
+			unlock_buffer(bh);
+			mb_cache_entry_wait_unused(oe);
+			mb_cache_entry_put(ea_block_cache, oe);
+			goto retry_ref;
+		}
+
 		/* Free the old block. */
 		ea_bdebug(bh, "freeing");
 		ext2_free_blocks(inode, bh->b_blocknr, 1);
@@ -929,7 +943,7 @@ ext2_xattr_cache_find(struct inode *inode, struct ext2_xattr_header *header)
 	if (!header->h_hash)
 		return NULL;  /* never share */
 	ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
-again:
+
 	ce = mb_cache_entry_find_first(ea_block_cache, hash);
 	while (ce) {
 		struct buffer_head *bh;
@@ -941,22 +955,8 @@ ext2_xattr_cache_find(struct inode *inode, struct ext2_xattr_header *header)
 				inode->i_ino, (unsigned long) ce->e_value);
 		} else {
 			lock_buffer(bh);
-			/*
-			 * We have to be careful about races with freeing or
-			 * rehashing of xattr block. Once we hold buffer lock
-			 * xattr block's state is stable so we can check
-			 * whether the block got freed / rehashed or not.
-			 * Since we unhash mbcache entry under buffer lock when
-			 * freeing / rehashing xattr block, checking whether
-			 * entry is still hashed is reliable.
-			 */
-			if (hlist_bl_unhashed(&ce->e_hash_list)) {
-				mb_cache_entry_put(ea_block_cache, ce);
-				unlock_buffer(bh);
-				brelse(bh);
-				goto again;
-			} else if (le32_to_cpu(HDR(bh)->h_refcount) >
-				   EXT2_XATTR_REFCOUNT_MAX) {
+			if (le32_to_cpu(HDR(bh)->h_refcount) >
+			    EXT2_XATTR_REFCOUNT_MAX) {
 				ea_idebug(inode, "block %ld refcount %d>%d",
 					  (unsigned long) ce->e_value,
 					  le32_to_cpu(HDR(bh)->h_refcount),
-- 
2.35.3



* [PATCH 09/10] mbcache: Remove mb_cache_entry_delete()
  2022-07-12 10:54 [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Jan Kara
                   ` (7 preceding siblings ...)
  2022-07-12 10:54 ` [PATCH 08/10] ext2: Avoid deleting xattr block that is being reused Jan Kara
@ 2022-07-12 10:54 ` Jan Kara
  2022-07-12 10:54 ` [PATCH 10/10] mbcache: Automatically delete entries from cache on freeing Jan Kara
  2022-07-12 12:47 ` [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Ritesh Harjani
  10 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2022-07-12 10:54 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Ritesh Harjani, Jan Kara

Nobody uses mb_cache_entry_delete() anymore. Remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/mbcache.c            | 37 -------------------------------------
 include/linux/mbcache.h |  1 -
 2 files changed, 38 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 2010bc80a3f2..d1ebb5df2856 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -230,43 +230,6 @@ struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
 }
 EXPORT_SYMBOL(mb_cache_entry_get);
 
-/* mb_cache_entry_delete - try to remove a cache entry
- * @cache - cache we work with
- * @key - key
- * @value - value
- *
- * Remove entry from cache @cache with key @key and value @value.
- */
-void mb_cache_entry_delete(struct mb_cache *cache, u32 key, u64 value)
-{
-	struct hlist_bl_node *node;
-	struct hlist_bl_head *head;
-	struct mb_cache_entry *entry;
-
-	head = mb_cache_entry_head(cache, key);
-	hlist_bl_lock(head);
-	hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
-		if (entry->e_key == key && entry->e_value == value) {
-			/* We keep hash list reference to keep entry alive */
-			hlist_bl_del_init(&entry->e_hash_list);
-			hlist_bl_unlock(head);
-			spin_lock(&cache->c_list_lock);
-			if (!list_empty(&entry->e_list)) {
-				list_del_init(&entry->e_list);
-				if (!WARN_ONCE(cache->c_entry_count == 0,
-		"mbcache: attempt to decrement c_entry_count past zero"))
-					cache->c_entry_count--;
-				atomic_dec(&entry->e_refcnt);
-			}
-			spin_unlock(&cache->c_list_lock);
-			mb_cache_entry_put(cache, entry);
-			return;
-		}
-	}
-	hlist_bl_unlock(head);
-}
-EXPORT_SYMBOL(mb_cache_entry_delete);
-
 /* mb_cache_entry_delete_or_get - remove a cache entry if it has no users
  * @cache - cache we work with
  * @key - key
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 8eca7f25c432..452b579856d4 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -47,7 +47,6 @@ static inline int mb_cache_entry_put(struct mb_cache *cache,
 
 struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
 						    u32 key, u64 value);
-void mb_cache_entry_delete(struct mb_cache *cache, u32 key, u64 value);
 struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
 					  u64 value);
 struct mb_cache_entry *mb_cache_entry_find_first(struct mb_cache *cache,
-- 
2.35.3



* [PATCH 10/10] mbcache: Automatically delete entries from cache on freeing
  2022-07-12 10:54 [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Jan Kara
                   ` (8 preceding siblings ...)
  2022-07-12 10:54 ` [PATCH 09/10] mbcache: Remove mb_cache_entry_delete() Jan Kara
@ 2022-07-12 10:54 ` Jan Kara
  2022-07-14 13:09   ` Ritesh Harjani
  2022-07-12 12:47 ` [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Ritesh Harjani
  10 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2022-07-12 10:54 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Ritesh Harjani, Jan Kara

Use the fact that entries with an elevated refcount are not removed from
the hash and just move removal of the entry from the hash to entry freeing
time. When doing this we also change the generic code to hold one reference
to the cache entry, not two of them, which makes the code somewhat more
obvious.
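
The resulting reference convention (a sketch for illustration, not a quote
from the patch): creation leaves a single hash reference, lookups take
additional references with atomic_inc_not_zero(), and the freeing paths claim
the final reference with cmpxchg. A hypothetical helper making the last step
explicit:

	/* illustration only - succeeds iff nobody holds more than 'expected' refs */
	static inline bool mb_entry_claim_final_ref(struct mb_cache_entry *entry,
						    int expected)
	{
		return atomic_cmpxchg(&entry->e_refcnt, expected, 0) == expected;
	}

The shrinker would use expected == 1 (hash reference only) and
mb_cache_entry_delete_or_get() expected == 2 (hash reference plus the one it
just took), matching the hunks below.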

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/mbcache.c            | 108 +++++++++++++++-------------------------
 include/linux/mbcache.h |  24 ++++++---
 2 files changed, 55 insertions(+), 77 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index d1ebb5df2856..96f1d49d30a5 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -90,7 +90,7 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
 		return -ENOMEM;
 
 	INIT_LIST_HEAD(&entry->e_list);
-	/* One ref for hash, one ref returned */
+	/* Initial hash reference */
 	atomic_set(&entry->e_refcnt, 1);
 	entry->e_key = key;
 	entry->e_value = value;
@@ -106,21 +106,28 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
 		}
 	}
 	hlist_bl_add_head(&entry->e_hash_list, head);
-	hlist_bl_unlock(head);
-
+	/*
+	 * Add entry to LRU list before it can be found by
+	 * mb_cache_entry_delete() to avoid races
+	 */
 	spin_lock(&cache->c_list_lock);
 	list_add_tail(&entry->e_list, &cache->c_list);
-	/* Grab ref for LRU list */
-	atomic_inc(&entry->e_refcnt);
 	cache->c_entry_count++;
 	spin_unlock(&cache->c_list_lock);
+	hlist_bl_unlock(head);
 
 	return 0;
 }
 EXPORT_SYMBOL(mb_cache_entry_create);
 
-void __mb_cache_entry_free(struct mb_cache_entry *entry)
+void __mb_cache_entry_free(struct mb_cache *cache, struct mb_cache_entry *entry)
 {
+	struct hlist_bl_head *head;
+
+	head = mb_cache_entry_head(cache, entry->e_key);
+	hlist_bl_lock(head);
+	hlist_bl_del(&entry->e_hash_list);
+	hlist_bl_unlock(head);
 	kmem_cache_free(mb_entry_cache, entry);
 }
 EXPORT_SYMBOL(__mb_cache_entry_free);
@@ -134,7 +141,7 @@ EXPORT_SYMBOL(__mb_cache_entry_free);
  */
 void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
 {
-	wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);
+	wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 2);
 }
 EXPORT_SYMBOL(mb_cache_entry_wait_unused);
 
@@ -155,10 +162,9 @@ static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
 	while (node) {
 		entry = hlist_bl_entry(node, struct mb_cache_entry,
 				       e_hash_list);
-		if (entry->e_key == key && entry->e_reusable) {
-			atomic_inc(&entry->e_refcnt);
+		if (entry->e_key == key && entry->e_reusable &&
+		    atomic_inc_not_zero(&entry->e_refcnt))
 			goto out;
-		}
 		node = node->next;
 	}
 	entry = NULL;
@@ -218,10 +224,9 @@ struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
 	head = mb_cache_entry_head(cache, key);
 	hlist_bl_lock(head);
 	hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
-		if (entry->e_key == key && entry->e_value == value) {
-			atomic_inc(&entry->e_refcnt);
+		if (entry->e_key == key && entry->e_value == value &&
+		    atomic_inc_not_zero(&entry->e_refcnt))
 			goto out;
-		}
 	}
 	entry = NULL;
 out:
@@ -244,37 +249,25 @@ EXPORT_SYMBOL(mb_cache_entry_get);
 struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
 						    u32 key, u64 value)
 {
-	struct hlist_bl_node *node;
-	struct hlist_bl_head *head;
 	struct mb_cache_entry *entry;
 
-	head = mb_cache_entry_head(cache, key);
-	hlist_bl_lock(head);
-	hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
-		if (entry->e_key == key && entry->e_value == value) {
-			if (atomic_read(&entry->e_refcnt) > 2) {
-				atomic_inc(&entry->e_refcnt);
-				hlist_bl_unlock(head);
-				return entry;
-			}
-			/* We keep hash list reference to keep entry alive */
-			hlist_bl_del_init(&entry->e_hash_list);
-			hlist_bl_unlock(head);
-			spin_lock(&cache->c_list_lock);
-			if (!list_empty(&entry->e_list)) {
-				list_del_init(&entry->e_list);
-				if (!WARN_ONCE(cache->c_entry_count == 0,
-		"mbcache: attempt to decrement c_entry_count past zero"))
-					cache->c_entry_count--;
-				atomic_dec(&entry->e_refcnt);
-			}
-			spin_unlock(&cache->c_list_lock);
-			mb_cache_entry_put(cache, entry);
-			return NULL;
-		}
-	}
-	hlist_bl_unlock(head);
+	entry = mb_cache_entry_get(cache, key, value);
+	if (!entry)
+		return NULL;
+
+	/*
+	 * Drop the ref we got from mb_cache_entry_get() and the initial hash
+	 * ref if we are the last user
+	 */
+	if (atomic_cmpxchg(&entry->e_refcnt, 2, 0) != 2)
+		return entry;
 
+	spin_lock(&cache->c_list_lock);
+	if (!list_empty(&entry->e_list))
+		list_del_init(&entry->e_list);
+	cache->c_entry_count--;
+	spin_unlock(&cache->c_list_lock);
+	__mb_cache_entry_free(cache, entry);
 	return NULL;
 }
 EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
@@ -306,42 +299,24 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
 				     unsigned long nr_to_scan)
 {
 	struct mb_cache_entry *entry;
-	struct hlist_bl_head *head;
 	unsigned long shrunk = 0;
 
 	spin_lock(&cache->c_list_lock);
 	while (nr_to_scan-- && !list_empty(&cache->c_list)) {
 		entry = list_first_entry(&cache->c_list,
 					 struct mb_cache_entry, e_list);
-		if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
+		/* Drop initial hash reference if there is no user */
+		if (entry->e_referenced ||
+		    atomic_cmpxchg(&entry->e_refcnt, 1, 0) != 1) {
 			entry->e_referenced = 0;
 			list_move_tail(&entry->e_list, &cache->c_list);
 			continue;
 		}
 		list_del_init(&entry->e_list);
 		cache->c_entry_count--;
-		/*
-		 * We keep LRU list reference so that entry doesn't go away
-		 * from under us.
-		 */
 		spin_unlock(&cache->c_list_lock);
-		head = mb_cache_entry_head(cache, entry->e_key);
-		hlist_bl_lock(head);
-		/* Now a reliable check if the entry didn't get used... */
-		if (atomic_read(&entry->e_refcnt) > 2) {
-			hlist_bl_unlock(head);
-			spin_lock(&cache->c_list_lock);
-			list_add_tail(&entry->e_list, &cache->c_list);
-			cache->c_entry_count++;
-			continue;
-		}
-		if (!hlist_bl_unhashed(&entry->e_hash_list)) {
-			hlist_bl_del_init(&entry->e_hash_list);
-			atomic_dec(&entry->e_refcnt);
-		}
-		hlist_bl_unlock(head);
-		if (mb_cache_entry_put(cache, entry))
-			shrunk++;
+		__mb_cache_entry_free(cache, entry);
+		shrunk++;
 		cond_resched();
 		spin_lock(&cache->c_list_lock);
 	}
@@ -433,11 +408,6 @@ void mb_cache_destroy(struct mb_cache *cache)
 	 * point.
 	 */
 	list_for_each_entry_safe(entry, next, &cache->c_list, e_list) {
-		if (!hlist_bl_unhashed(&entry->e_hash_list)) {
-			hlist_bl_del_init(&entry->e_hash_list);
-			atomic_dec(&entry->e_refcnt);
-		} else
-			WARN_ON(1);
 		list_del(&entry->e_list);
 		WARN_ON(atomic_read(&entry->e_refcnt) != 1);
 		mb_cache_entry_put(cache, entry);
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 452b579856d4..2da63fd7b98f 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -13,8 +13,16 @@ struct mb_cache;
 struct mb_cache_entry {
 	/* List of entries in cache - protected by cache->c_list_lock */
 	struct list_head	e_list;
-	/* Hash table list - protected by hash chain bitlock */
+	/*
+	 * Hash table list - protected by hash chain bitlock. The entry is
+	 * guaranteed to be hashed while e_refcnt > 0.
+	 */
 	struct hlist_bl_node	e_hash_list;
+	/*
+	 * Entry refcount. Once it reaches zero, entry is unhashed and freed.
+	 * While refcount > 0, the entry is guaranteed to stay in the hash and
+	 * e.g. mb_cache_entry_delete_or_get() will fail.
+	 */
 	atomic_t		e_refcnt;
 	/* Key in hash - stable during lifetime of the entry */
 	u32			e_key;
@@ -29,20 +37,20 @@ void mb_cache_destroy(struct mb_cache *cache);
 
 int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
 			  u64 value, bool reusable);
-void __mb_cache_entry_free(struct mb_cache_entry *entry);
+void __mb_cache_entry_free(struct mb_cache *cache,
+			   struct mb_cache_entry *entry);
 void mb_cache_entry_wait_unused(struct mb_cache_entry *entry);
-static inline int mb_cache_entry_put(struct mb_cache *cache,
-				     struct mb_cache_entry *entry)
+static inline void mb_cache_entry_put(struct mb_cache *cache,
+				      struct mb_cache_entry *entry)
 {
 	unsigned int cnt = atomic_dec_return(&entry->e_refcnt);
 
 	if (cnt > 0) {
-		if (cnt <= 3)
+		if (cnt <= 2)
 			wake_up_var(&entry->e_refcnt);
-		return 0;
+		return;
 	}
-	__mb_cache_entry_free(entry);
-	return 1;
+	__mb_cache_entry_free(cache, entry);
 }
 
 struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
-- 
2.35.3



* Re: [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races
  2022-07-12 10:54 [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Jan Kara
                   ` (9 preceding siblings ...)
  2022-07-12 10:54 ` [PATCH 10/10] mbcache: Automatically delete entries from cache on freeing Jan Kara
@ 2022-07-12 12:47 ` Ritesh Harjani
  10 siblings, 0 replies; 31+ messages in thread
From: Ritesh Harjani @ 2022-07-12 12:47 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On 22/07/12 12:54PM, Jan Kara wrote:
> Hello!
>
> I've noticed this series didn't get merged yet. I was waiting for more review
> feedback from Ritesh but somehow that didn't happen. So this is the third

Hello Jan,

I had reviewed this series up to 05/10, which were meant as fixes for stable
too. But I didn't quite add any Reviewed-by because I didn't find any obvious
problem (also my familiarity with the mbcache and revoke code paths is not as
good).

But fair point, I did want to continue reviewing the later patches of the
series too. I will complete those before our next call (btw, I forgot to check
on these in the last call actually).

But this doesn't have to delay picking this patch series up for merge any
further. Please feel free to pick it up, if required.

Thanks again for your help!!
-ritesh

> submission of the series fixing the races of ext4 xattr block reuse with the
> few changes that have accumulated since v2. Ted, do you think you can add this
> series to your tree so that we can merge it during the merge window? Thanks!
>
> Changes since v1:
> * Reworked the series to fix all corner cases and make the API less error-prone.
>
> Changes since v2:
> * Renamed mb_cache_entry_try_delete() to mb_cache_entry_delete_or_get()
> * Added Tested-by tag from Ritesh
>
> 								Honza
>
> Previous versions:
> Link: http://lore.kernel.org/r/20220606142215.17962-1-jack@suse.cz # v1
> Link: http://lore.kernel.org/r/20220614124146.21594-1-jack@suse.cz # v2


* Re: [PATCH 01/10] mbcache: Don't reclaim used entries
  2022-07-12 10:54 ` [PATCH 01/10] mbcache: Don't reclaim used entries Jan Kara
@ 2022-07-14 11:47   ` Ritesh Harjani
  2022-07-14 14:36     ` Jan Kara
  2022-07-22 13:58   ` Theodore Ts'o
  1 sibling, 1 reply; 31+ messages in thread
From: Ritesh Harjani @ 2022-07-14 11:47 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4, stable

On 22/07/12 12:54PM, Jan Kara wrote:
> Do not reclaim entries that are currently used by somebody from a
> shrinker. Firstly, these entries are likely useful. Secondly, we will
> need to keep such entries to protect pending increment of xattr block
> refcount.
>
> CC: stable@vger.kernel.org
> Fixes: 82939d7999df ("ext4: convert to mbcache2")
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/mbcache.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/mbcache.c b/fs/mbcache.c
> index 97c54d3a2227..cfc28129fb6f 100644
> --- a/fs/mbcache.c
> +++ b/fs/mbcache.c
> @@ -288,7 +288,7 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
>  	while (nr_to_scan-- && !list_empty(&cache->c_list)) {
>  		entry = list_first_entry(&cache->c_list,
>  					 struct mb_cache_entry, e_list);
> -		if (entry->e_referenced) {
> +		if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
>  			entry->e_referenced = 0;
>  			list_move_tail(&entry->e_list, &cache->c_list);
>  			continue;
> @@ -302,6 +302,14 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
>  		spin_unlock(&cache->c_list_lock);
>  		head = mb_cache_entry_head(cache, entry->e_key);
>  		hlist_bl_lock(head);
> +		/* Now a reliable check if the entry didn't get used... */
> +		if (atomic_read(&entry->e_refcnt) > 2) {

Taking a look at this patchset again, I think if we move this "if" condition
checking the refcnt up, i.e. before we delete the entry from c_list,
then we can avoid:
removing the entry -> checking its refcnt under the lock -> adding it back
if the refcnt is elevated.

Thoughts?

-ritesh

> +			hlist_bl_unlock(head);
> +			spin_lock(&cache->c_list_lock);
> +			list_add_tail(&entry->e_list, &cache->c_list);
> +			cache->c_entry_count++;
> +			continue;
> +		}
>  		if (!hlist_bl_unhashed(&entry->e_hash_list)) {
>  			hlist_bl_del_init(&entry->e_hash_list);
>  			atomic_dec(&entry->e_refcnt);
> --
> 2.35.3
>


* Re: [PATCH 02/10] mbcache: Add functions to delete entry if unused
  2022-07-12 10:54 ` [PATCH 02/10] mbcache: Add functions to delete entry if unused Jan Kara
@ 2022-07-14 12:15   ` Ritesh Harjani
  2022-07-14 14:49     ` Jan Kara
  0 siblings, 1 reply; 31+ messages in thread
From: Ritesh Harjani @ 2022-07-14 12:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4, stable

On 22/07/12 12:54PM, Jan Kara wrote:
> Add a function mb_cache_entry_delete_or_get() to delete an mbcache entry if
> it is unused and also add a function to wait for an entry to become unused
> - mb_cache_entry_wait_unused(). We do not share code between the two
> deleting functions as one of them will go away soon.
>
> CC: stable@vger.kernel.org
> Fixes: 82939d7999df ("ext4: convert to mbcache2")
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/mbcache.c            | 66 +++++++++++++++++++++++++++++++++++++++--
>  include/linux/mbcache.h | 10 ++++++-
>  2 files changed, 73 insertions(+), 3 deletions(-)
>
> diff --git a/fs/mbcache.c b/fs/mbcache.c
> index cfc28129fb6f..2010bc80a3f2 100644
> --- a/fs/mbcache.c
> +++ b/fs/mbcache.c
> @@ -11,7 +11,7 @@
>  /*
>   * Mbcache is a simple key-value store. Keys need not be unique, however
>   * key-value pairs are expected to be unique (we use this fact in
> - * mb_cache_entry_delete()).
> + * mb_cache_entry_delete_or_get()).
>   *
>   * Ext2 and ext4 use this cache for deduplication of extended attribute blocks.
>   * Ext4 also uses it for deduplication of xattr values stored in inodes.
> @@ -125,6 +125,19 @@ void __mb_cache_entry_free(struct mb_cache_entry *entry)
>  }
>  EXPORT_SYMBOL(__mb_cache_entry_free);
>
> +/*
> + * mb_cache_entry_wait_unused - wait to be the last user of the entry
> + *
> + * @entry - entry to work on
> + *
> + * Wait to be the last user of the entry.
> + */
> +void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
> +{
> +	wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);

It's not very intuitive why we check for refcnt <= 3.
A small note at the top of this function might be helpful.
IIUC, it is because by default, when anyone creates an entry, we start with
a refcnt of 2 (in mb_cache_entry_create()).
- Now when the user of the entry wants to delete it, it will try and call
  mb_cache_entry_delete_or_get(). If during this function call it sees that the
  refcnt is elevated above 2, that means there is another user of this entry
  currently active and hence we should wait before we remove this entry from the
  cache. So it will take an extra refcnt and return.
- So then this caller will call mb_cache_entry_wait_unused(), waiting for the
  refcnt to be <= 3, so that the entry can be deleted.
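
In numbers (my reading of it): 2 references from creation (one for the hash
list, one for the LRU list) plus the 1 reference this caller took in
mb_cache_entry_delete_or_get() gives 3, so refcnt <= 3 means nobody else is
currently using the entry.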

Quick qn -
So now the design is that ext4_evict_ea_inode() will be waiting indefinitely
until the other user of this mb_cache entry releases the reference, right?
And that will not happen until,
- either the shrinker removes this entry from the cache, during which we check
  whether the refcnt is <= 3 and then call a wakeup event,
- or the user removes/deletes the xattr entry.
Is the above understanding correct?

-ritesh


> +}
> +EXPORT_SYMBOL(mb_cache_entry_wait_unused);
> +
>  static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
>  					   struct mb_cache_entry *entry,
>  					   u32 key)
> @@ -217,7 +230,7 @@ struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
>  }
>  EXPORT_SYMBOL(mb_cache_entry_get);
>
> -/* mb_cache_entry_delete - remove a cache entry
> +/* mb_cache_entry_delete - try to remove a cache entry
>   * @cache - cache we work with
>   * @key - key
>   * @value - value
> @@ -254,6 +267,55 @@ void mb_cache_entry_delete(struct mb_cache *cache, u32 key, u64 value)
>  }
>  EXPORT_SYMBOL(mb_cache_entry_delete);
>
> +/* mb_cache_entry_delete_or_get - remove a cache entry if it has no users
> + * @cache - cache we work with
> + * @key - key
> + * @value - value
> + *
> + * Remove entry from cache @cache with key @key and value @value. The removal
> + * happens only if the entry is unused. The function returns NULL in case the
> + * entry was successfully removed or there's no entry in cache. Otherwise the
> + * function grabs reference of the entry that we failed to delete because it
> + * still has users and return it.
> + */
> +struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
> +						    u32 key, u64 value)
> +{
> +	struct hlist_bl_node *node;
> +	struct hlist_bl_head *head;
> +	struct mb_cache_entry *entry;
> +
> +	head = mb_cache_entry_head(cache, key);
> +	hlist_bl_lock(head);
> +	hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
> +		if (entry->e_key == key && entry->e_value == value) {
> +			if (atomic_read(&entry->e_refcnt) > 2) {
> +				atomic_inc(&entry->e_refcnt);
> +				hlist_bl_unlock(head);
> +				return entry;
> +			}
> +			/* We keep hash list reference to keep entry alive */
> +			hlist_bl_del_init(&entry->e_hash_list);
> +			hlist_bl_unlock(head);
> +			spin_lock(&cache->c_list_lock);
> +			if (!list_empty(&entry->e_list)) {
> +				list_del_init(&entry->e_list);
> +				if (!WARN_ONCE(cache->c_entry_count == 0,
> +		"mbcache: attempt to decrement c_entry_count past zero"))
> +					cache->c_entry_count--;
> +				atomic_dec(&entry->e_refcnt);
> +			}
> +			spin_unlock(&cache->c_list_lock);
> +			mb_cache_entry_put(cache, entry);
> +			return NULL;
> +		}
> +	}
> +	hlist_bl_unlock(head);
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
> +
>  /* mb_cache_entry_touch - cache entry got used
>   * @cache - cache the entry belongs to
>   * @entry - entry that got used
> diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
> index 20f1e3ff6013..8eca7f25c432 100644
> --- a/include/linux/mbcache.h
> +++ b/include/linux/mbcache.h
> @@ -30,15 +30,23 @@ void mb_cache_destroy(struct mb_cache *cache);
>  int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
>  			  u64 value, bool reusable);
>  void __mb_cache_entry_free(struct mb_cache_entry *entry);
> +void mb_cache_entry_wait_unused(struct mb_cache_entry *entry);
>  static inline int mb_cache_entry_put(struct mb_cache *cache,
>  				     struct mb_cache_entry *entry)
>  {
> -	if (!atomic_dec_and_test(&entry->e_refcnt))
> +	unsigned int cnt = atomic_dec_return(&entry->e_refcnt);
> +
> +	if (cnt > 0) {
> +		if (cnt <= 3)
> +			wake_up_var(&entry->e_refcnt);
>  		return 0;
> +	}
>  	__mb_cache_entry_free(entry);
>  	return 1;
>  }
>
> +struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
> +						    u32 key, u64 value);
>  void mb_cache_entry_delete(struct mb_cache *cache, u32 key, u64 value);
>  struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
>  					  u64 value);
> --
> 2.35.3
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 04/10] ext4: Unindent codeblock in ext4_xattr_block_set()
  2022-07-12 10:54 ` [PATCH 04/10] ext4: Unindent codeblock in ext4_xattr_block_set() Jan Kara
@ 2022-07-14 12:19   ` Ritesh Harjani
  0 siblings, 0 replies; 31+ messages in thread
From: Ritesh Harjani @ 2022-07-14 12:19 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4, stable

On 22/07/12 12:54PM, Jan Kara wrote:
> Remove unnecessary else (and thus indentation level) from a code block
> in ext4_xattr_block_set(). It will also make following code changes
> easier. No functional changes.

The patch looks good to me. Just a note: while applying this patch on the
ext4-dev tree, I found a minor conflict with the patch below.

ext4: use kmemdup() to replace kmalloc + memcpy

Replace kmalloc + memcpy with kmemdup()
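
Just for context, that cleanup is roughly the following transformation of
the hunk quoted further down, which is why the conflict should be trivial
to resolve (illustration only, not the actual conflicting patch):

    /* before, as in ext4_xattr_block_set() below: */
    s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
    error = -ENOMEM;
    if (s->base == NULL)
        goto cleanup;
    memcpy(s->base, BHDR(bs->bh), bs->bh->b_size);

    /* after the kmemdup() cleanup, roughly: */
    s->base = kmemdup(BHDR(bs->bh), bs->bh->b_size, GFP_NOFS);
    error = -ENOMEM;
    if (s->base == NULL)
        goto cleanup;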

-ritesh

>
> CC: stable@vger.kernel.org
> Fixes: 82939d7999df ("ext4: convert to mbcache2")
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/xattr.c | 79 ++++++++++++++++++++++++-------------------------
>  1 file changed, 39 insertions(+), 40 deletions(-)
>
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 7fc40fb1e6b3..aadfae53d055 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1850,6 +1850,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>  #define header(x) ((struct ext4_xattr_header *)(x))
>
>  	if (s->base) {
> +		int offset = (char *)s->here - bs->bh->b_data;
> +
>  		BUFFER_TRACE(bs->bh, "get_write_access");
>  		error = ext4_journal_get_write_access(handle, sb, bs->bh,
>  						      EXT4_JTR_NONE);
> @@ -1882,50 +1884,47 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>  			if (error)
>  				goto cleanup;
>  			goto inserted;
> -		} else {
> -			int offset = (char *)s->here - bs->bh->b_data;
> +		}
> +		unlock_buffer(bs->bh);
> +		ea_bdebug(bs->bh, "cloning");
> +		s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
> +		error = -ENOMEM;
> +		if (s->base == NULL)
> +			goto cleanup;
> +		memcpy(s->base, BHDR(bs->bh), bs->bh->b_size);
> +		s->first = ENTRY(header(s->base)+1);
> +		header(s->base)->h_refcount = cpu_to_le32(1);
> +		s->here = ENTRY(s->base + offset);
> +		s->end = s->base + bs->bh->b_size;
>
> -			unlock_buffer(bs->bh);
> -			ea_bdebug(bs->bh, "cloning");
> -			s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
> -			error = -ENOMEM;
> -			if (s->base == NULL)
> +		/*
> +		 * If existing entry points to an xattr inode, we need
> +		 * to prevent ext4_xattr_set_entry() from decrementing
> +		 * ref count on it because the reference belongs to the
> +		 * original block. In this case, make the entry look
> +		 * like it has an empty value.
> +		 */
> +		if (!s->not_found && s->here->e_value_inum) {
> +			ea_ino = le32_to_cpu(s->here->e_value_inum);
> +			error = ext4_xattr_inode_iget(inode, ea_ino,
> +				      le32_to_cpu(s->here->e_hash),
> +				      &tmp_inode);
> +			if (error)
>  				goto cleanup;
> -			memcpy(s->base, BHDR(bs->bh), bs->bh->b_size);
> -			s->first = ENTRY(header(s->base)+1);
> -			header(s->base)->h_refcount = cpu_to_le32(1);
> -			s->here = ENTRY(s->base + offset);
> -			s->end = s->base + bs->bh->b_size;
>
> -			/*
> -			 * If existing entry points to an xattr inode, we need
> -			 * to prevent ext4_xattr_set_entry() from decrementing
> -			 * ref count on it because the reference belongs to the
> -			 * original block. In this case, make the entry look
> -			 * like it has an empty value.
> -			 */
> -			if (!s->not_found && s->here->e_value_inum) {
> -				ea_ino = le32_to_cpu(s->here->e_value_inum);
> -				error = ext4_xattr_inode_iget(inode, ea_ino,
> -					      le32_to_cpu(s->here->e_hash),
> -					      &tmp_inode);
> -				if (error)
> -					goto cleanup;
> -
> -				if (!ext4_test_inode_state(tmp_inode,
> -						EXT4_STATE_LUSTRE_EA_INODE)) {
> -					/*
> -					 * Defer quota free call for previous
> -					 * inode until success is guaranteed.
> -					 */
> -					old_ea_inode_quota = le32_to_cpu(
> -							s->here->e_value_size);
> -				}
> -				iput(tmp_inode);
> -
> -				s->here->e_value_inum = 0;
> -				s->here->e_value_size = 0;
> +			if (!ext4_test_inode_state(tmp_inode,
> +					EXT4_STATE_LUSTRE_EA_INODE)) {
> +				/*
> +				 * Defer quota free call for previous
> +				 * inode until success is guaranteed.
> +				 */
> +				old_ea_inode_quota = le32_to_cpu(
> +						s->here->e_value_size);
>  			}
> +			iput(tmp_inode);
> +
> +			s->here->e_value_inum = 0;
> +			s->here->e_value_size = 0;
>  		}
>  	} else {
>  		/* Allocate a buffer where we construct the new block. */
> --
> 2.35.3
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 05/10] ext4: Fix race when reusing xattr blocks
  2022-07-12 10:54 ` [PATCH 05/10] ext4: Fix race when reusing xattr blocks Jan Kara
@ 2022-07-14 12:26   ` Ritesh Harjani
  2022-07-16  3:00   ` Zhihao Cheng
  1 sibling, 0 replies; 31+ messages in thread
From: Ritesh Harjani @ 2022-07-14 12:26 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4, stable

On 22/07/12 12:54PM, Jan Kara wrote:
> When ext4_xattr_block_set() decides to remove xattr block the following
> race can happen:
>
> CPU1                                    CPU2
> ext4_xattr_block_set()                  ext4_xattr_release_block()
>   new_bh = ext4_xattr_block_cache_find()
>
>                                           lock_buffer(bh);
>                                           ref = le32_to_cpu(BHDR(bh)->h_refcount);
>                                           if (ref == 1) {
>                                             ...
>                                             mb_cache_entry_delete();
>                                             unlock_buffer(bh);
>                                             ext4_free_blocks();
>                                               ...
>                                               ext4_forget(..., bh, ...);
>                                                 jbd2_journal_revoke(..., bh);
>
>   ext4_journal_get_write_access(..., new_bh, ...)
>     do_get_write_access()
>       jbd2_journal_cancel_revoke(..., new_bh);
>
> Later the code in ext4_xattr_block_set() finds out the block got freed
> and cancels reusal of the block but the revoke stays canceled and so in
> case of block reuse and journal replay the filesystem can get corrupted.
> If the race works out slightly differently, we can also hit assertions
> in the jbd2 code.
>
> Fix the problem by making sure that once matching mbcache entry is
> found, code dropping the last xattr block reference (or trying to modify
> xattr block in place) waits until the mbcache entry reference is
> dropped. This way code trying to reuse xattr block is protected from
> someone trying to drop the last reference to xattr block.
>
> Reported-and-tested-by: Ritesh Harjani <ritesh.list@gmail.com>
> CC: stable@vger.kernel.org
> Fixes: 82939d7999df ("ext4: convert to mbcache2")
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks Jan,
Just a note - I retested the patches only up to here (the ones marked for
stable) with stress-ng --xattr 16, and I didn't find any issues so far for
ext2, ext3 and ext4 with default mkfs options.

I also re-ran the full v3 patch series with the same test case on all 3
filesystems and didn't find any failures there either.

-ritesh




> ---
>  fs/ext4/xattr.c | 67 +++++++++++++++++++++++++++++++++----------------
>  1 file changed, 45 insertions(+), 22 deletions(-)
>
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index aadfae53d055..3a0928c8720e 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -439,9 +439,16 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
>  /* Remove entry from mbcache when EA inode is getting evicted */
>  void ext4_evict_ea_inode(struct inode *inode)
>  {
> -	if (EA_INODE_CACHE(inode))
> -		mb_cache_entry_delete(EA_INODE_CACHE(inode),
> -			ext4_xattr_inode_get_hash(inode), inode->i_ino);
> +	struct mb_cache_entry *oe;
> +
> +	if (!EA_INODE_CACHE(inode))
> +		return;
> +	/* Wait for entry to get unused so that we can remove it */
> +	while ((oe = mb_cache_entry_delete_or_get(EA_INODE_CACHE(inode),
> +			ext4_xattr_inode_get_hash(inode), inode->i_ino))) {
> +		mb_cache_entry_wait_unused(oe);
> +		mb_cache_entry_put(EA_INODE_CACHE(inode), oe);
> +	}
>  }
>
>  static int
> @@ -1229,6 +1236,7 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
>  	if (error)
>  		goto out;
>
> +retry_ref:
>  	lock_buffer(bh);
>  	hash = le32_to_cpu(BHDR(bh)->h_hash);
>  	ref = le32_to_cpu(BHDR(bh)->h_refcount);
> @@ -1238,9 +1246,18 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
>  		 * This must happen under buffer lock for
>  		 * ext4_xattr_block_set() to reliably detect freed block
>  		 */
> -		if (ea_block_cache)
> -			mb_cache_entry_delete(ea_block_cache, hash,
> -					      bh->b_blocknr);
> +		if (ea_block_cache) {
> +			struct mb_cache_entry *oe;
> +
> +			oe = mb_cache_entry_delete_or_get(ea_block_cache, hash,
> +							  bh->b_blocknr);
> +			if (oe) {
> +				unlock_buffer(bh);
> +				mb_cache_entry_wait_unused(oe);
> +				mb_cache_entry_put(ea_block_cache, oe);
> +				goto retry_ref;
> +			}
> +		}
>  		get_bh(bh);
>  		unlock_buffer(bh);
>
> @@ -1867,9 +1884,20 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>  			 * ext4_xattr_block_set() to reliably detect modified
>  			 * block
>  			 */
> -			if (ea_block_cache)
> -				mb_cache_entry_delete(ea_block_cache, hash,
> -						      bs->bh->b_blocknr);
> +			if (ea_block_cache) {
> +				struct mb_cache_entry *oe;
> +
> +				oe = mb_cache_entry_delete_or_get(ea_block_cache,
> +					hash, bs->bh->b_blocknr);
> +				if (oe) {
> +					/*
> +					 * Xattr block is getting reused. Leave
> +					 * it alone.
> +					 */
> +					mb_cache_entry_put(ea_block_cache, oe);
> +					goto clone_block;
> +				}
> +			}
>  			ea_bdebug(bs->bh, "modifying in-place");
>  			error = ext4_xattr_set_entry(i, s, handle, inode,
>  						     true /* is_block */);
> @@ -1885,6 +1913,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>  				goto cleanup;
>  			goto inserted;
>  		}
> +clone_block:
>  		unlock_buffer(bs->bh);
>  		ea_bdebug(bs->bh, "cloning");
>  		s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
> @@ -1991,18 +2020,13 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>  				lock_buffer(new_bh);
>  				/*
>  				 * We have to be careful about races with
> -				 * freeing, rehashing or adding references to
> -				 * xattr block. Once we hold buffer lock xattr
> -				 * block's state is stable so we can check
> -				 * whether the block got freed / rehashed or
> -				 * not.  Since we unhash mbcache entry under
> -				 * buffer lock when freeing / rehashing xattr
> -				 * block, checking whether entry is still
> -				 * hashed is reliable. Same rules hold for
> -				 * e_reusable handling.
> +				 * adding references to xattr block. Once we
> +				 * hold buffer lock xattr block's state is
> +				 * stable so we can check the additional
> +				 * reference fits.
>  				 */
> -				if (hlist_bl_unhashed(&ce->e_hash_list) ||
> -				    !ce->e_reusable) {
> +				ref = le32_to_cpu(BHDR(new_bh)->h_refcount) + 1;
> +				if (ref > EXT4_XATTR_REFCOUNT_MAX) {
>  					/*
>  					 * Undo everything and check mbcache
>  					 * again.
> @@ -2017,9 +2041,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>  					new_bh = NULL;
>  					goto inserted;
>  				}
> -				ref = le32_to_cpu(BHDR(new_bh)->h_refcount) + 1;
>  				BHDR(new_bh)->h_refcount = cpu_to_le32(ref);
> -				if (ref >= EXT4_XATTR_REFCOUNT_MAX)
> +				if (ref == EXT4_XATTR_REFCOUNT_MAX)
>  					ce->e_reusable = 0;
>  				ea_bdebug(new_bh, "reusing; refcount now=%d",
>  					  ref);
> --
> 2.35.3
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 06/10] ext2: Factor our freeing of xattr block reference
  2022-07-12 10:54 ` [PATCH 06/10] ext2: Factor our freeing of xattr block reference Jan Kara
@ 2022-07-14 12:37   ` Ritesh Harjani
  2022-07-14 14:55     ` Jan Kara
  0 siblings, 1 reply; 31+ messages in thread
From: Ritesh Harjani @ 2022-07-14 12:37 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On 22/07/12 12:54PM, Jan Kara wrote:
> Free of xattr block reference is opencode in two places. Factor it out
> into a separate function and use it.

Looked into the refactoring logic. The patch looks good to me.
Small queries below -

>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext2/xattr.c | 90 +++++++++++++++++++++----------------------------
>  1 file changed, 38 insertions(+), 52 deletions(-)
>
> diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
> index 841fa6d9d744..9885294993ef 100644
> --- a/fs/ext2/xattr.c
> +++ b/fs/ext2/xattr.c
> @@ -651,6 +651,42 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
>  	return error;
>  }
>
> +static void ext2_xattr_release_block(struct inode *inode,
> +				     struct buffer_head *bh)
> +{
> +	struct mb_cache *ea_block_cache = EA_BLOCK_CACHE(inode);
> +
> +	lock_buffer(bh);
> +	if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
> +		__u32 hash = le32_to_cpu(HDR(bh)->h_hash);
> +
> +		/*
> +		 * This must happen under buffer lock for
> +		 * ext2_xattr_set2() to reliably detect freed block
> +		 */
> +		mb_cache_entry_delete(ea_block_cache, hash,
> +				      bh->b_blocknr);
> +		/* Free the old block. */
> +		ea_bdebug(bh, "freeing");
> +		ext2_free_blocks(inode, bh->b_blocknr, 1);
> +		/* We let our caller release bh, so we
> +		 * need to duplicate the buffer before. */
> +		get_bh(bh);
> +		bforget(bh);
> +		unlock_buffer(bh);
> +	} else {
> +		/* Decrement the refcount only. */
> +		le32_add_cpu(&HDR(bh)->h_refcount, -1);
> +		dquot_free_block(inode, 1);
> +		mark_buffer_dirty(bh);
> +		unlock_buffer(bh);
> +		ea_bdebug(bh, "refcount now=%d",
> +			le32_to_cpu(HDR(bh)->h_refcount));
> +		if (IS_SYNC(inode))
> +			sync_dirty_buffer(bh);
> +	}
> +}
> +
>  /*
>   * Second half of ext2_xattr_set(): Update the file system.
>   */
> @@ -747,34 +783,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
>  		 * If there was an old block and we are no longer using it,
>  		 * release the old block.
>  		 */
> -		lock_buffer(old_bh);
> -		if (HDR(old_bh)->h_refcount == cpu_to_le32(1)) {
> -			__u32 hash = le32_to_cpu(HDR(old_bh)->h_hash);
> -
> -			/*
> -			 * This must happen under buffer lock for
> -			 * ext2_xattr_set2() to reliably detect freed block
> -			 */
> -			mb_cache_entry_delete(ea_block_cache, hash,
> -					      old_bh->b_blocknr);
> -			/* Free the old block. */
> -			ea_bdebug(old_bh, "freeing");
> -			ext2_free_blocks(inode, old_bh->b_blocknr, 1);
> -			mark_inode_dirty(inode);

^^^ this is not needed because ext2_free_blocks() will take care of it.
Hence you have dropped it in ext2_xattr_release_block().

> -			/* We let our caller release old_bh, so we
> -			 * need to duplicate the buffer before. */
> -			get_bh(old_bh);
> -			bforget(old_bh);
> -		} else {
> -			/* Decrement the refcount only. */
> -			le32_add_cpu(&HDR(old_bh)->h_refcount, -1);
> -			dquot_free_block_nodirty(inode, 1);
> -			mark_inode_dirty(inode);

Quick qn -> Don't we need mark_inode_dirty() here?

> -			mark_buffer_dirty(old_bh);
> -			ea_bdebug(old_bh, "refcount now=%d",
> -				le32_to_cpu(HDR(old_bh)->h_refcount));
> -		}
> -		unlock_buffer(old_bh);
> +		ext2_xattr_release_block(inode, old_bh);
>  	}
>
>  cleanup:
> @@ -828,30 +837,7 @@ ext2_xattr_delete_inode(struct inode *inode)
>  			EXT2_I(inode)->i_file_acl);
>  		goto cleanup;
>  	}
> -	lock_buffer(bh);
> -	if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
> -		__u32 hash = le32_to_cpu(HDR(bh)->h_hash);
> -
> -		/*
> -		 * This must happen under buffer lock for ext2_xattr_set2() to
> -		 * reliably detect freed block
> -		 */
> -		mb_cache_entry_delete(EA_BLOCK_CACHE(inode), hash,
> -				      bh->b_blocknr);
> -		ext2_free_blocks(inode, EXT2_I(inode)->i_file_acl, 1);
> -		get_bh(bh);
> -		bforget(bh);
> -		unlock_buffer(bh);
> -	} else {
> -		le32_add_cpu(&HDR(bh)->h_refcount, -1);
> -		ea_bdebug(bh, "refcount now=%d",
> -			le32_to_cpu(HDR(bh)->h_refcount));
> -		unlock_buffer(bh);
> -		mark_buffer_dirty(bh);
> -		if (IS_SYNC(inode))
> -			sync_dirty_buffer(bh);
> -		dquot_free_block_nodirty(inode, 1);
> -	}
> +	ext2_xattr_release_block(inode, bh);
>  	EXT2_I(inode)->i_file_acl = 0;
>
>  cleanup:
> --
> 2.35.3
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 07/10] ext2: Unindent codeblock in ext2_xattr_set()
  2022-07-12 10:54 ` [PATCH 07/10] ext2: Unindent codeblock in ext2_xattr_set() Jan Kara
@ 2022-07-14 12:38   ` Ritesh Harjani
  0 siblings, 0 replies; 31+ messages in thread
From: Ritesh Harjani @ 2022-07-14 12:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On 22/07/12 12:54PM, Jan Kara wrote:
> Replace one else in ext2_xattr_set() with a goto. This makes following
> code changes simpler to follow. No functional changes.

Straightforward refactoring. Looks good to me.

Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext2/xattr.c | 32 ++++++++++++++++----------------
>  1 file changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
> index 9885294993ef..37ce495eb279 100644
> --- a/fs/ext2/xattr.c
> +++ b/fs/ext2/xattr.c
> @@ -517,7 +517,8 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
>  	/* Here we know that we can set the new attribute. */
>
>  	if (header) {
> -		/* assert(header == HDR(bh)); */
> +		int offset;
> +
>  		lock_buffer(bh);
>  		if (header->h_refcount == cpu_to_le32(1)) {
>  			__u32 hash = le32_to_cpu(header->h_hash);
> @@ -531,22 +532,20 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
>  					      bh->b_blocknr);
>
>  			/* keep the buffer locked while modifying it. */
> -		} else {
> -			int offset;
> -
> -			unlock_buffer(bh);
> -			ea_bdebug(bh, "cloning");
> -			header = kmemdup(HDR(bh), bh->b_size, GFP_KERNEL);
> -			error = -ENOMEM;
> -			if (header == NULL)
> -				goto cleanup;
> -			header->h_refcount = cpu_to_le32(1);
> -
> -			offset = (char *)here - bh->b_data;
> -			here = ENTRY((char *)header + offset);
> -			offset = (char *)last - bh->b_data;
> -			last = ENTRY((char *)header + offset);
> +			goto update_block;
>  		}
> +		unlock_buffer(bh);
> +		ea_bdebug(bh, "cloning");
> +		header = kmemdup(HDR(bh), bh->b_size, GFP_KERNEL);
> +		error = -ENOMEM;
> +		if (header == NULL)
> +			goto cleanup;
> +		header->h_refcount = cpu_to_le32(1);
> +
> +		offset = (char *)here - bh->b_data;
> +		here = ENTRY((char *)header + offset);
> +		offset = (char *)last - bh->b_data;
> +		last = ENTRY((char *)header + offset);
>  	} else {
>  		/* Allocate a buffer where we construct the new block. */
>  		header = kzalloc(sb->s_blocksize, GFP_KERNEL);
> @@ -559,6 +558,7 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
>  		last = here = ENTRY(header+1);
>  	}
>
> +update_block:
>  	/* Iff we are modifying the block in-place, bh is locked here. */
>
>  	if (not_found) {
> --
> 2.35.3
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 10/10] mbcache: Automatically delete entries from cache on freeing
  2022-07-12 10:54 ` [PATCH 10/10] mbcache: Automatically delete entries from cache on freeing Jan Kara
@ 2022-07-14 13:09   ` Ritesh Harjani
  2022-07-14 15:05     ` Jan Kara
  0 siblings, 1 reply; 31+ messages in thread
From: Ritesh Harjani @ 2022-07-14 13:09 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On 22/07/12 12:54PM, Jan Kara wrote:
> Use the fact that entries with elevated refcount are not removed from

The elevated refcnt means >= 2?

> the hash and just move removal of the entry from the hash to the entry
> freeing time. When doing this we also change the generic code to hold
> one reference to the cache entry, not two of them, which makes code
> somewhat more obvious.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/mbcache.c            | 108 +++++++++++++++-------------------------
>  include/linux/mbcache.h |  24 ++++++---
>  2 files changed, 55 insertions(+), 77 deletions(-)
>
> diff --git a/fs/mbcache.c b/fs/mbcache.c
> index d1ebb5df2856..96f1d49d30a5 100644
> --- a/fs/mbcache.c
> +++ b/fs/mbcache.c
> @@ -90,7 +90,7 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
>  		return -ENOMEM;
>
>  	INIT_LIST_HEAD(&entry->e_list);
> -	/* One ref for hash, one ref returned */
> +	/* Initial hash reference */
>  	atomic_set(&entry->e_refcnt, 1);
>  	entry->e_key = key;
>  	entry->e_value = value;
> @@ -106,21 +106,28 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
>  		}
>  	}
>  	hlist_bl_add_head(&entry->e_hash_list, head);
> -	hlist_bl_unlock(head);
> -
> +	/*
> +	 * Add entry to LRU list before it can be found by
> +	 * mb_cache_entry_delete() to avoid races
> +	 */

There is no mb_cache_entry_delete() now; this should refer to
mb_cache_entry_delete_or_get() instead.

>  	spin_lock(&cache->c_list_lock);
>  	list_add_tail(&entry->e_list, &cache->c_list);
> -	/* Grab ref for LRU list */
> -	atomic_inc(&entry->e_refcnt);
>  	cache->c_entry_count++;
>  	spin_unlock(&cache->c_list_lock);
> +	hlist_bl_unlock(head);
>
>  	return 0;
>  }
>  EXPORT_SYMBOL(mb_cache_entry_create);
>
> -void __mb_cache_entry_free(struct mb_cache_entry *entry)
> +void __mb_cache_entry_free(struct mb_cache *cache, struct mb_cache_entry *entry)
>  {
> +	struct hlist_bl_head *head;
> +
> +	head = mb_cache_entry_head(cache, entry->e_key);
> +	hlist_bl_lock(head);
> +	hlist_bl_del(&entry->e_hash_list);
> +	hlist_bl_unlock(head);
>  	kmem_cache_free(mb_entry_cache, entry);
>  }
>  EXPORT_SYMBOL(__mb_cache_entry_free);
> @@ -134,7 +141,7 @@ EXPORT_SYMBOL(__mb_cache_entry_free);
>   */
>  void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
>  {
> -	wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);
> +	wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 2);
>  }
>  EXPORT_SYMBOL(mb_cache_entry_wait_unused);
>
> @@ -155,10 +162,9 @@ static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
>  	while (node) {
>  		entry = hlist_bl_entry(node, struct mb_cache_entry,
>  				       e_hash_list);
> -		if (entry->e_key == key && entry->e_reusable) {
> -			atomic_inc(&entry->e_refcnt);
> +		if (entry->e_key == key && entry->e_reusable &&
> +		    atomic_inc_not_zero(&entry->e_refcnt))
>  			goto out;
> -		}
>  		node = node->next;
>  	}
>  	entry = NULL;
> @@ -218,10 +224,9 @@ struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
>  	head = mb_cache_entry_head(cache, key);
>  	hlist_bl_lock(head);
>  	hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
> -		if (entry->e_key == key && entry->e_value == value) {
> -			atomic_inc(&entry->e_refcnt);
> +		if (entry->e_key == key && entry->e_value == value &&
> +		    atomic_inc_not_zero(&entry->e_refcnt))
>  			goto out;
> -		}
>  	}
>  	entry = NULL;
>  out:
> @@ -244,37 +249,25 @@ EXPORT_SYMBOL(mb_cache_entry_get);
>  struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
>  						    u32 key, u64 value)
>  {
> -	struct hlist_bl_node *node;
> -	struct hlist_bl_head *head;
>  	struct mb_cache_entry *entry;
>
> -	head = mb_cache_entry_head(cache, key);
> -	hlist_bl_lock(head);
> -	hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
> -		if (entry->e_key == key && entry->e_value == value) {
> -			if (atomic_read(&entry->e_refcnt) > 2) {
> -				atomic_inc(&entry->e_refcnt);
> -				hlist_bl_unlock(head);
> -				return entry;
> -			}
> -			/* We keep hash list reference to keep entry alive */
> -			hlist_bl_del_init(&entry->e_hash_list);
> -			hlist_bl_unlock(head);
> -			spin_lock(&cache->c_list_lock);
> -			if (!list_empty(&entry->e_list)) {
> -				list_del_init(&entry->e_list);
> -				if (!WARN_ONCE(cache->c_entry_count == 0,
> -		"mbcache: attempt to decrement c_entry_count past zero"))
> -					cache->c_entry_count--;
> -				atomic_dec(&entry->e_refcnt);
> -			}
> -			spin_unlock(&cache->c_list_lock);
> -			mb_cache_entry_put(cache, entry);
> -			return NULL;
> -		}
> -	}
> -	hlist_bl_unlock(head);
> +	entry = mb_cache_entry_get(cache, key, value);
> +	if (!entry)
> +		return NULL;
> +
> +	/*
> +	 * Drop the ref we got from mb_cache_entry_get() and the initial hash
> +	 * ref if we are the last user
> +	 */
> +	if (atomic_cmpxchg(&entry->e_refcnt, 2, 0) != 2)
> +		return entry;
>
> +	spin_lock(&cache->c_list_lock);
> +	if (!list_empty(&entry->e_list))
> +		list_del_init(&entry->e_list);
> +	cache->c_entry_count--;
> +	spin_unlock(&cache->c_list_lock);
> +	__mb_cache_entry_free(cache, entry);
>  	return NULL;
>  }
>  EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
> @@ -306,42 +299,24 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
>  				     unsigned long nr_to_scan)
>  {
>  	struct mb_cache_entry *entry;
> -	struct hlist_bl_head *head;
>  	unsigned long shrunk = 0;
>
>  	spin_lock(&cache->c_list_lock);
>  	while (nr_to_scan-- && !list_empty(&cache->c_list)) {
>  		entry = list_first_entry(&cache->c_list,
>  					 struct mb_cache_entry, e_list);
> -		if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
> +		/* Drop initial hash reference if there is no user */
> +		if (entry->e_referenced ||
> +		    atomic_cmpxchg(&entry->e_refcnt, 1, 0) != 1) {

So here, if the refcnt of an entry is 1, that means it is still in use,
right? So the shrinker will do the atomic_cmpxchg to bring it to 0 and then
delete the entry from the cache?
This will only happen for an entry with a reference count of exactly 1.

Is that the correct understanding?

-ritesh


>  			entry->e_referenced = 0;
>  			list_move_tail(&entry->e_list, &cache->c_list);
>  			continue;
>  		}
>  		list_del_init(&entry->e_list);
>  		cache->c_entry_count--;
> -		/*
> -		 * We keep LRU list reference so that entry doesn't go away
> -		 * from under us.
> -		 */
>  		spin_unlock(&cache->c_list_lock);
> -		head = mb_cache_entry_head(cache, entry->e_key);
> -		hlist_bl_lock(head);
> -		/* Now a reliable check if the entry didn't get used... */
> -		if (atomic_read(&entry->e_refcnt) > 2) {
> -			hlist_bl_unlock(head);
> -			spin_lock(&cache->c_list_lock);
> -			list_add_tail(&entry->e_list, &cache->c_list);
> -			cache->c_entry_count++;
> -			continue;
> -		}
> -		if (!hlist_bl_unhashed(&entry->e_hash_list)) {
> -			hlist_bl_del_init(&entry->e_hash_list);
> -			atomic_dec(&entry->e_refcnt);
> -		}
> -		hlist_bl_unlock(head);
> -		if (mb_cache_entry_put(cache, entry))
> -			shrunk++;
> +		__mb_cache_entry_free(cache, entry);
> +		shrunk++;
>  		cond_resched();
>  		spin_lock(&cache->c_list_lock);
>  	}
> @@ -433,11 +408,6 @@ void mb_cache_destroy(struct mb_cache *cache)
>  	 * point.
>  	 */
>  	list_for_each_entry_safe(entry, next, &cache->c_list, e_list) {
> -		if (!hlist_bl_unhashed(&entry->e_hash_list)) {
> -			hlist_bl_del_init(&entry->e_hash_list);
> -			atomic_dec(&entry->e_refcnt);
> -		} else
> -			WARN_ON(1);
>  		list_del(&entry->e_list);
>  		WARN_ON(atomic_read(&entry->e_refcnt) != 1);
>  		mb_cache_entry_put(cache, entry);
> diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
> index 452b579856d4..2da63fd7b98f 100644
> --- a/include/linux/mbcache.h
> +++ b/include/linux/mbcache.h
> @@ -13,8 +13,16 @@ struct mb_cache;
>  struct mb_cache_entry {
>  	/* List of entries in cache - protected by cache->c_list_lock */
>  	struct list_head	e_list;
> -	/* Hash table list - protected by hash chain bitlock */
> +	/*
> +	 * Hash table list - protected by hash chain bitlock. The entry is
> +	 * guaranteed to be hashed while e_refcnt > 0.
> +	 */
>  	struct hlist_bl_node	e_hash_list;
> +	/*
> +	 * Entry refcount. Once it reaches zero, entry is unhashed and freed.
> +	 * While refcount > 0, the entry is guaranteed to stay in the hash and
> +	 * e.g. mb_cache_entry_try_delete() will fail.
> +	 */
>  	atomic_t		e_refcnt;
>  	/* Key in hash - stable during lifetime of the entry */
>  	u32			e_key;
> @@ -29,20 +37,20 @@ void mb_cache_destroy(struct mb_cache *cache);
>
>  int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
>  			  u64 value, bool reusable);
> -void __mb_cache_entry_free(struct mb_cache_entry *entry);
> +void __mb_cache_entry_free(struct mb_cache *cache,
> +			   struct mb_cache_entry *entry);
>  void mb_cache_entry_wait_unused(struct mb_cache_entry *entry);
> -static inline int mb_cache_entry_put(struct mb_cache *cache,
> -				     struct mb_cache_entry *entry)
> +static inline void mb_cache_entry_put(struct mb_cache *cache,
> +				      struct mb_cache_entry *entry)
>  {
>  	unsigned int cnt = atomic_dec_return(&entry->e_refcnt);
>
>  	if (cnt > 0) {
> -		if (cnt <= 3)
> +		if (cnt <= 2)
>  			wake_up_var(&entry->e_refcnt);
> -		return 0;
> +		return;
>  	}
> -	__mb_cache_entry_free(entry);
> -	return 1;
> +	__mb_cache_entry_free(cache, entry);
>  }
>
>  struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
> --
> 2.35.3
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 01/10] mbcache: Don't reclaim used entries
  2022-07-14 11:47   ` Ritesh Harjani
@ 2022-07-14 14:36     ` Jan Kara
  2022-07-14 14:49       ` Ritesh Harjani
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2022-07-14 14:36 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Jan Kara, Ted Tso, linux-ext4, stable

On Thu 14-07-22 17:17:02, Ritesh Harjani wrote:
> On 22/07/12 12:54PM, Jan Kara wrote:
> > Do not reclaim entries that are currently used by somebody from a
> > shrinker. Firstly, these entries are likely useful. Secondly, we will
> > need to keep such entries to protect pending increment of xattr block
> > refcount.
> >
> > CC: stable@vger.kernel.org
> > Fixes: 82939d7999df ("ext4: convert to mbcache2")
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/mbcache.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/mbcache.c b/fs/mbcache.c
> > index 97c54d3a2227..cfc28129fb6f 100644
> > --- a/fs/mbcache.c
> > +++ b/fs/mbcache.c
> > @@ -288,7 +288,7 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
> >  	while (nr_to_scan-- && !list_empty(&cache->c_list)) {
> >  		entry = list_first_entry(&cache->c_list,
> >  					 struct mb_cache_entry, e_list);
> > -		if (entry->e_referenced) {
> > +		if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
> >  			entry->e_referenced = 0;
> >  			list_move_tail(&entry->e_list, &cache->c_list);
> >  			continue;
> > @@ -302,6 +302,14 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
> >  		spin_unlock(&cache->c_list_lock);
> >  		head = mb_cache_entry_head(cache, entry->e_key);
> >  		hlist_bl_lock(head);
> > +		/* Now a reliable check if the entry didn't get used... */
> > +		if (atomic_read(&entry->e_refcnt) > 2) {
> 
> On taking a look at this patchset again. I think if we move this "if" condition
> of checking refcnt to above i.e. before we delete the entry from c_list.
> Then we can avoid =>
> removing of the entry -> checking it's refcnt under lock -> adding it back
> if the refcnt is elevated.
> 
> Thoughts?

Well, but synchronization would get more complicated because we don't want
to acquire hlist_bl_lock() under c_list_lock (technically we could at this
point in the series but it would make life harder for the last patch in the
series). And we need c_list_lock to remove the entry from the LRU list. It
could all be done but I don't think what you suggest is really that much
simpler, and this code will go away later in the patchset anyway...
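
(To spell out the ordering issue with a rough sketch - doing the reliable
refcount check before dropping the entry from c_list would mean nesting the
hash chain bitlock inside c_list_lock, roughly like below. Sketch only, not
proposed code:)

    spin_lock(&cache->c_list_lock);
    entry = list_first_entry(&cache->c_list, struct mb_cache_entry, e_list);
    head = mb_cache_entry_head(cache, entry->e_key);
    hlist_bl_lock(head);        /* hash bitlock nested under c_list_lock */
    if (atomic_read(&entry->e_refcnt) <= 2) {
        /* reliably unused - safe to drop it from the LRU here */
        list_del_init(&entry->e_list);
    }
    hlist_bl_unlock(head);
    spin_unlock(&cache->c_list_lock);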

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 01/10] mbcache: Don't reclaim used entries
  2022-07-14 14:36     ` Jan Kara
@ 2022-07-14 14:49       ` Ritesh Harjani
  0 siblings, 0 replies; 31+ messages in thread
From: Ritesh Harjani @ 2022-07-14 14:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4, stable

On 22/07/14 04:36PM, Jan Kara wrote:
> On Thu 14-07-22 17:17:02, Ritesh Harjani wrote:
> > On 22/07/12 12:54PM, Jan Kara wrote:
> > > Do not reclaim entries that are currently used by somebody from a
> > > shrinker. Firstly, these entries are likely useful. Secondly, we will
> > > need to keep such entries to protect pending increment of xattr block
> > > refcount.
> > >
> > > CC: stable@vger.kernel.org
> > > Fixes: 82939d7999df ("ext4: convert to mbcache2")
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  fs/mbcache.c | 10 +++++++++-
> > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/mbcache.c b/fs/mbcache.c
> > > index 97c54d3a2227..cfc28129fb6f 100644
> > > --- a/fs/mbcache.c
> > > +++ b/fs/mbcache.c
> > > @@ -288,7 +288,7 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
> > >  	while (nr_to_scan-- && !list_empty(&cache->c_list)) {
> > >  		entry = list_first_entry(&cache->c_list,
> > >  					 struct mb_cache_entry, e_list);
> > > -		if (entry->e_referenced) {
> > > +		if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
> > >  			entry->e_referenced = 0;
> > >  			list_move_tail(&entry->e_list, &cache->c_list);
> > >  			continue;
> > > @@ -302,6 +302,14 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
> > >  		spin_unlock(&cache->c_list_lock);
> > >  		head = mb_cache_entry_head(cache, entry->e_key);
> > >  		hlist_bl_lock(head);
> > > +		/* Now a reliable check if the entry didn't get used... */
> > > +		if (atomic_read(&entry->e_refcnt) > 2) {
> >
> > On taking a look at this patchset again. I think if we move this "if" condition
> > of checking refcnt to above i.e. before we delete the entry from c_list.
> > Then we can avoid =>
> > removing of the entry -> checking it's refcnt under lock -> adding it back
> > if the refcnt is elevated.
> >
> > Thoughts?
>
> Well, but synchronization would get more complicated because we don't want
> to acquire hlist_bl_lock() under c_list_lock (technically we could at this
Ok, yes. I tried implementing it and it becomes a lock()/unlock() mess.

> point in the series but it would make life harder for the last patch in the
> series). And we need c_list_lock to remove entry from the LRU list. It
> could be all done but I don't think what you suggest is really that simpler
> and this code will go away later in the patchset anyway...

I agree. Thanks for re-checking it.

-ritesh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 02/10] mbcache: Add functions to delete entry if unused
  2022-07-14 12:15   ` Ritesh Harjani
@ 2022-07-14 14:49     ` Jan Kara
  2022-07-14 15:00       ` Ritesh Harjani
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2022-07-14 14:49 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Jan Kara, Ted Tso, linux-ext4, stable

On Thu 14-07-22 17:45:32, Ritesh Harjani wrote:
> On 22/07/12 12:54PM, Jan Kara wrote:
> > Add function mb_cache_entry_delete_or_get() to delete mbcache entry if
> > it is unused and also add a function to wait for entry to become unused
> > - mb_cache_entry_wait_unused(). We do not share code between the two
> > deleting function as one of them will go away soon.
> >
> > CC: stable@vger.kernel.org
> > Fixes: 82939d7999df ("ext4: convert to mbcache2")
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/mbcache.c            | 66 +++++++++++++++++++++++++++++++++++++++--
> >  include/linux/mbcache.h | 10 ++++++-
> >  2 files changed, 73 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/mbcache.c b/fs/mbcache.c
> > index cfc28129fb6f..2010bc80a3f2 100644
> > --- a/fs/mbcache.c
> > +++ b/fs/mbcache.c
> > @@ -11,7 +11,7 @@
> >  /*
> >   * Mbcache is a simple key-value store. Keys need not be unique, however
> >   * key-value pairs are expected to be unique (we use this fact in
> > - * mb_cache_entry_delete()).
> > + * mb_cache_entry_delete_or_get()).
> >   *
> >   * Ext2 and ext4 use this cache for deduplication of extended attribute blocks.
> >   * Ext4 also uses it for deduplication of xattr values stored in inodes.
> > @@ -125,6 +125,19 @@ void __mb_cache_entry_free(struct mb_cache_entry *entry)
> >  }
> >  EXPORT_SYMBOL(__mb_cache_entry_free);
> >
> > +/*
> > + * mb_cache_entry_wait_unused - wait to be the last user of the entry
> > + *
> > + * @entry - entry to work on
> > + *
> > + * Wait to be the last user of the entry.
> > + */
> > +void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
> > +{
> > +	wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);
> 
> It's not very intuitive of why we check for refcnt <= 3.
> A small note at top of this function might be helpful.
> IIUC, it is because by default when anyone creates an entry we start with
> a refcnt of 2 (in mb_cache_entry_create.
> - Now when the user of the entry wants to delete this, it will try and call
>   mb_cache_entry_delete_or_get(). If during this function call it sees that the
>   refcnt is elevated more than 2, that means there is another user of this entry
>   currently active and hence we should wait before we remove this entry from the
>   cache. So it will take an extra refcnt and return.
> - So then this caller will call mb_cache_entry_wait_unused() for the refcnt to
>   be <= 3, so that the entry can be deleted.

Correct. I will add a comment as you suggest.

> Quick qn -
> So now is the design like, ext4_evict_ea_inode() will be waiting indefinitely
> until the other user of this mb_cache entry releases the reference right?

Correct. Similarly for ext4_xattr_release_block().

> And that will not happen until,
> - either the shrinker removes this entry from the cache during which we are
>   checking if the refcnt <= 3, then we call a wakeup event

No, the shrinker will not touch these entries with active users anymore.

> - Or the user removes/deletes the xattr entry

No. We hold a reference to the mbcache entry only while we are trying to
reuse it. So ext4_xattr_block_cache_find() and
ext4_xattr_inode_cache_find() look up a potential mbcache entry that may
have the same contents and get a reference to it. Then we do the comparisons
verifying whether the contents really match; if yes, we increment the
on-disk inode/block refcount. Then we drop the mbcache entry reference,
which unblocks the waiters in mb_cache_entry_wait_unused().
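
In other words, the reuse side looks roughly like this (heavily simplified
sketch of the lookup path; xattr_block_matches() is a made-up stand-in for
the comparison ext4_xattr_block_cache_find() does via ext4_xattr_cmp(), and
all error handling / journaling is left out):

    struct mb_cache_entry *ce;

    ce = mb_cache_entry_find_first(ea_block_cache, hash);
    while (ce) {
        if (xattr_block_matches(inode, ce, s)) {   /* hypothetical helper */
            /*
             * Reuse the block: bump its on-disk h_refcount, then drop
             * our mbcache reference - that put() is what unblocks the
             * mb_cache_entry_wait_unused() waiters.
             */
            mb_cache_entry_touch(ea_block_cache, ce);
            mb_cache_entry_put(ea_block_cache, ce);
            break;
        }
        ce = mb_cache_entry_find_next(ea_block_cache, ce);
    }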

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 06/10] ext2: Factor our freeing of xattr block reference
  2022-07-14 12:37   ` Ritesh Harjani
@ 2022-07-14 14:55     ` Jan Kara
  2022-07-14 16:17       ` Ritesh Harjani
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2022-07-14 14:55 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Jan Kara, Ted Tso, linux-ext4

On Thu 14-07-22 18:07:14, Ritesh Harjani wrote:
> On 22/07/12 12:54PM, Jan Kara wrote:
> > Free of xattr block reference is opencode in two places. Factor it out
> > into a separate function and use it.
> 
> Looked into the refactoring logic. The patch looks good to me.
> Small queries below -
> 
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext2/xattr.c | 90 +++++++++++++++++++++----------------------------
> >  1 file changed, 38 insertions(+), 52 deletions(-)
> >
> > diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
> > index 841fa6d9d744..9885294993ef 100644
> > --- a/fs/ext2/xattr.c
> > +++ b/fs/ext2/xattr.c
> > @@ -651,6 +651,42 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
> >  	return error;
> >  }
> >
> > +static void ext2_xattr_release_block(struct inode *inode,
> > +				     struct buffer_head *bh)
> > +{
> > +	struct mb_cache *ea_block_cache = EA_BLOCK_CACHE(inode);
> > +
> > +	lock_buffer(bh);
> > +	if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
> > +		__u32 hash = le32_to_cpu(HDR(bh)->h_hash);
> > +
> > +		/*
> > +		 * This must happen under buffer lock for
> > +		 * ext2_xattr_set2() to reliably detect freed block
> > +		 */
> > +		mb_cache_entry_delete(ea_block_cache, hash,
> > +				      bh->b_blocknr);
> > +		/* Free the old block. */
> > +		ea_bdebug(bh, "freeing");
> > +		ext2_free_blocks(inode, bh->b_blocknr, 1);
> > +		/* We let our caller release bh, so we
> > +		 * need to duplicate the buffer before. */
> > +		get_bh(bh);
> > +		bforget(bh);
> > +		unlock_buffer(bh);
> > +	} else {
> > +		/* Decrement the refcount only. */
> > +		le32_add_cpu(&HDR(bh)->h_refcount, -1);
> > +		dquot_free_block(inode, 1);
> > +		mark_buffer_dirty(bh);
> > +		unlock_buffer(bh);
> > +		ea_bdebug(bh, "refcount now=%d",
> > +			le32_to_cpu(HDR(bh)->h_refcount));
> > +		if (IS_SYNC(inode))
> > +			sync_dirty_buffer(bh);
> > +	}
> > +}
> > +
> >  /*
> >   * Second half of ext2_xattr_set(): Update the file system.
> >   */
> > @@ -747,34 +783,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
> >  		 * If there was an old block and we are no longer using it,
> >  		 * release the old block.
> >  		 */
> > -		lock_buffer(old_bh);
> > -		if (HDR(old_bh)->h_refcount == cpu_to_le32(1)) {
> > -			__u32 hash = le32_to_cpu(HDR(old_bh)->h_hash);
> > -
> > -			/*
> > -			 * This must happen under buffer lock for
> > -			 * ext2_xattr_set2() to reliably detect freed block
> > -			 */
> > -			mb_cache_entry_delete(ea_block_cache, hash,
> > -					      old_bh->b_blocknr);
> > -			/* Free the old block. */
> > -			ea_bdebug(old_bh, "freeing");
> > -			ext2_free_blocks(inode, old_bh->b_blocknr, 1);
> > -			mark_inode_dirty(inode);
> 
> ^^^ this is not needed because ext2_free_blocks() will take care of it.
> Hence you have dropped this in ext2_xattr_release_block()

Correct. ext2_free_blocks() always dirties the inode (unless a metadata
inconsistency is found, in which case we don't really care).

> > -			/* We let our caller release old_bh, so we
> > -			 * need to duplicate the buffer before. */
> > -			get_bh(old_bh);
> > -			bforget(old_bh);
> > -		} else {
> > -			/* Decrement the refcount only. */
> > -			le32_add_cpu(&HDR(old_bh)->h_refcount, -1);
> > -			dquot_free_block_nodirty(inode, 1);
> > -			mark_inode_dirty(inode);
> 
> Quick qn -> Don't we need mark_inode_dirty() here?

Notice that I've changed dquot_free_block_nodirty() to dquot_free_block()
because the quota info update is the only reason why we need to dirty the
inode, so why not let the quota code handle it...
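
I.e. the net effect on that hunk is just the following (matching the diff
quoted above, shown here without the context lines):

    /* old code in ext2_xattr_set2(): */
    le32_add_cpu(&HDR(old_bh)->h_refcount, -1);
    dquot_free_block_nodirty(inode, 1);
    mark_inode_dirty(inode);

    /* new helper - dquot_free_block() takes care of dirtying the inode: */
    le32_add_cpu(&HDR(bh)->h_refcount, -1);
    dquot_free_block(inode, 1);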

> 
> > -			mark_buffer_dirty(old_bh);
> > -			ea_bdebug(old_bh, "refcount now=%d",
> > -				le32_to_cpu(HDR(old_bh)->h_refcount));
> > -		}
> > -		unlock_buffer(old_bh);
> > +		ext2_xattr_release_block(inode, old_bh);
> >  	}

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 02/10] mbcache: Add functions to delete entry if unused
  2022-07-14 14:49     ` Jan Kara
@ 2022-07-14 15:00       ` Ritesh Harjani
  0 siblings, 0 replies; 31+ messages in thread
From: Ritesh Harjani @ 2022-07-14 15:00 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4, stable

On 22/07/14 04:49PM, Jan Kara wrote:
> On Thu 14-07-22 17:45:32, Ritesh Harjani wrote:
> > On 22/07/12 12:54PM, Jan Kara wrote:
> > > Add function mb_cache_entry_delete_or_get() to delete mbcache entry if
> > > it is unused and also add a function to wait for entry to become unused
> > > - mb_cache_entry_wait_unused(). We do not share code between the two
> > > deleting function as one of them will go away soon.
> > >
> > > CC: stable@vger.kernel.org
> > > Fixes: 82939d7999df ("ext4: convert to mbcache2")
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  fs/mbcache.c            | 66 +++++++++++++++++++++++++++++++++++++++--
> > >  include/linux/mbcache.h | 10 ++++++-
> > >  2 files changed, 73 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/mbcache.c b/fs/mbcache.c
> > > index cfc28129fb6f..2010bc80a3f2 100644
> > > --- a/fs/mbcache.c
> > > +++ b/fs/mbcache.c
> > > @@ -11,7 +11,7 @@
> > >  /*
> > >   * Mbcache is a simple key-value store. Keys need not be unique, however
> > >   * key-value pairs are expected to be unique (we use this fact in
> > > - * mb_cache_entry_delete()).
> > > + * mb_cache_entry_delete_or_get()).
> > >   *
> > >   * Ext2 and ext4 use this cache for deduplication of extended attribute blocks.
> > >   * Ext4 also uses it for deduplication of xattr values stored in inodes.
> > > @@ -125,6 +125,19 @@ void __mb_cache_entry_free(struct mb_cache_entry *entry)
> > >  }
> > >  EXPORT_SYMBOL(__mb_cache_entry_free);
> > >
> > > +/*
> > > + * mb_cache_entry_wait_unused - wait to be the last user of the entry
> > > + *
> > > + * @entry - entry to work on
> > > + *
> > > + * Wait to be the last user of the entry.
> > > + */
> > > +void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
> > > +{
> > > +	wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);
> >
> > It's not very intuitive of why we check for refcnt <= 3.
> > A small note at top of this function might be helpful.
> > IIUC, it is because by default when anyone creates an entry we start with
> > a refcnt of 2 (in mb_cache_entry_create.
> > - Now when the user of the entry wants to delete this, it will try and call
> >   mb_cache_entry_delete_or_get(). If during this function call it sees that the
> >   refcnt is elevated more than 2, that means there is another user of this entry
> >   currently active and hence we should wait before we remove this entry from the
> >   cache. So it will take an extra refcnt and return.
> > - So then this caller will call mb_cache_entry_wait_unused() for the refcnt to
> >   be <= 3, so that the entry can be deleted.
>
> Correct. I will add a comment as you suggest.
>
> > Quick qn -
> > So now is the design like, ext4_evict_ea_inode() will be waiting indefinitely
> > until the other user of this mb_cache entry releases the reference right?
>
> Correct. Similarly for ext4_xattr_release_block().
>
> > And that will not happen until,
> > - either the shrinker removes this entry from the cache during which we are
> >   checking if the refcnt <= 3, then we call a wakeup event
>
> No, shrinker will not touch these entries with active users anymore.
>
> > - Or the user removes/deletes the xattr entry
>
> No. We hold reference to mbcache entry only while we are trying to reuse
> it. So functions ext4_xattr_block_cache_find() and
> ext4_xattr_inode_cache_find() will lookup potential mbcache entry that may
> have the same contents and get reference to it. Then we do comparisons
> verifying whether the contents really matches, if yes, we increment on-disk
> inode/block refcount. Then we drop mbcache entry reference which unblocks
> waiters in mb_cache_entry_wait_unused().
>

Ok, yes. This is where I was a bit confused.
Thanks for explaining it - this makes more sense. I did go through the
mbcache implementation, but I was missing the info on how the callers use
it.

-ritesh

> 								Honza
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 10/10] mbcache: Automatically delete entries from cache on freeing
  2022-07-14 13:09   ` Ritesh Harjani
@ 2022-07-14 15:05     ` Jan Kara
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2022-07-14 15:05 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Jan Kara, Ted Tso, linux-ext4

On Thu 14-07-22 18:39:51, Ritesh Harjani wrote:
> On 22/07/12 12:54PM, Jan Kara wrote:
> > Use the fact that entries with elevated refcount are not removed from
> 
> The elevated refcnt means >= 2?

Well, it means there is a real user of the mbcache entry. So 3 before
this patch, 2 after this patch...
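
To spell the reference accounting out (roughly):

    /*
     * Who holds references to an mb_cache_entry:
     *
     * before this patch:
     *   +1 hash table, +1 LRU list, +1 per active user
     *   => an unused cached entry sits at 2, "in use" means > 2, and
     *      mb_cache_entry_wait_unused() waits for <= 3 because the
     *      waiter itself holds one extra reference.
     *
     * after this patch:
     *   +1 combined hash+LRU ("cache") reference, +1 per active user
     *   => unused means 1, "in use" means > 1, and the waiter waits
     *      for <= 2.
     */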

> > the hash and just move removal of the entry from the hash to the entry
> > freeing time. When doing this we also change the generic code to hold
> > one reference to the cache entry, not two of them, which makes code
> > somewhat more obvious.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/mbcache.c            | 108 +++++++++++++++-------------------------
> >  include/linux/mbcache.h |  24 ++++++---
> >  2 files changed, 55 insertions(+), 77 deletions(-)
> >
> > diff --git a/fs/mbcache.c b/fs/mbcache.c
> > index d1ebb5df2856..96f1d49d30a5 100644
> > --- a/fs/mbcache.c
> > +++ b/fs/mbcache.c
> > @@ -90,7 +90,7 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
> >  		return -ENOMEM;
> >
> >  	INIT_LIST_HEAD(&entry->e_list);
> > -	/* One ref for hash, one ref returned */
> > +	/* Initial hash reference */
> >  	atomic_set(&entry->e_refcnt, 1);
> >  	entry->e_key = key;
> >  	entry->e_value = value;
> > @@ -106,21 +106,28 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
> >  		}
> >  	}
> >  	hlist_bl_add_head(&entry->e_hash_list, head);
> > -	hlist_bl_unlock(head);
> > -
> > +	/*
> > +	 * Add entry to LRU list before it can be found by
> > +	 * mb_cache_entry_delete() to avoid races
> > +	 */
> 
> No reference to mb_cache_entry_delete() now. It is
> mb_cache_entry_delete_or_get()

Thanks, will fix.

> >  	spin_lock(&cache->c_list_lock);
> >  	list_add_tail(&entry->e_list, &cache->c_list);
> > -	/* Grab ref for LRU list */
> > -	atomic_inc(&entry->e_refcnt);
> >  	cache->c_entry_count++;
> >  	spin_unlock(&cache->c_list_lock);
> > +	hlist_bl_unlock(head);
> >
> >  	return 0;
> >  }
> >  EXPORT_SYMBOL(mb_cache_entry_create);
> >
> > -void __mb_cache_entry_free(struct mb_cache_entry *entry)
> > +void __mb_cache_entry_free(struct mb_cache *cache, struct mb_cache_entry *entry)
> >  {
> > +	struct hlist_bl_head *head;
> > +
> > +	head = mb_cache_entry_head(cache, entry->e_key);
> > +	hlist_bl_lock(head);
> > +	hlist_bl_del(&entry->e_hash_list);
> > +	hlist_bl_unlock(head);
> >  	kmem_cache_free(mb_entry_cache, entry);
> >  }
> >  EXPORT_SYMBOL(__mb_cache_entry_free);
> > @@ -134,7 +141,7 @@ EXPORT_SYMBOL(__mb_cache_entry_free);
> >   */
> >  void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
> >  {
> > -	wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);
> > +	wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 2);
> >  }
> >  EXPORT_SYMBOL(mb_cache_entry_wait_unused);
> >
> > @@ -155,10 +162,9 @@ static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
> >  	while (node) {
> >  		entry = hlist_bl_entry(node, struct mb_cache_entry,
> >  				       e_hash_list);
> > -		if (entry->e_key == key && entry->e_reusable) {
> > -			atomic_inc(&entry->e_refcnt);
> > +		if (entry->e_key == key && entry->e_reusable &&
> > +		    atomic_inc_not_zero(&entry->e_refcnt))
> >  			goto out;
> > -		}
> >  		node = node->next;
> >  	}
> >  	entry = NULL;
> > @@ -218,10 +224,9 @@ struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
> >  	head = mb_cache_entry_head(cache, key);
> >  	hlist_bl_lock(head);
> >  	hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
> > -		if (entry->e_key == key && entry->e_value == value) {
> > -			atomic_inc(&entry->e_refcnt);
> > +		if (entry->e_key == key && entry->e_value == value &&
> > +		    atomic_inc_not_zero(&entry->e_refcnt))
> >  			goto out;
> > -		}
> >  	}
> >  	entry = NULL;
> >  out:
> > @@ -244,37 +249,25 @@ EXPORT_SYMBOL(mb_cache_entry_get);
> >  struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
> >  						    u32 key, u64 value)
> >  {
> > -	struct hlist_bl_node *node;
> > -	struct hlist_bl_head *head;
> >  	struct mb_cache_entry *entry;
> >
> > -	head = mb_cache_entry_head(cache, key);
> > -	hlist_bl_lock(head);
> > -	hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
> > -		if (entry->e_key == key && entry->e_value == value) {
> > -			if (atomic_read(&entry->e_refcnt) > 2) {
> > -				atomic_inc(&entry->e_refcnt);
> > -				hlist_bl_unlock(head);
> > -				return entry;
> > -			}
> > -			/* We keep hash list reference to keep entry alive */
> > -			hlist_bl_del_init(&entry->e_hash_list);
> > -			hlist_bl_unlock(head);
> > -			spin_lock(&cache->c_list_lock);
> > -			if (!list_empty(&entry->e_list)) {
> > -				list_del_init(&entry->e_list);
> > -				if (!WARN_ONCE(cache->c_entry_count == 0,
> > -		"mbcache: attempt to decrement c_entry_count past zero"))
> > -					cache->c_entry_count--;
> > -				atomic_dec(&entry->e_refcnt);
> > -			}
> > -			spin_unlock(&cache->c_list_lock);
> > -			mb_cache_entry_put(cache, entry);
> > -			return NULL;
> > -		}
> > -	}
> > -	hlist_bl_unlock(head);
> > +	entry = mb_cache_entry_get(cache, key, value);
> > +	if (!entry)
> > +		return NULL;
> > +
> > +	/*
> > +	 * Drop the ref we got from mb_cache_entry_get() and the initial hash
> > +	 * ref if we are the last user
> > +	 */
> > +	if (atomic_cmpxchg(&entry->e_refcnt, 2, 0) != 2)
> > +		return entry;
> >
> > +	spin_lock(&cache->c_list_lock);
> > +	if (!list_empty(&entry->e_list))
> > +		list_del_init(&entry->e_list);
> > +	cache->c_entry_count--;
> > +	spin_unlock(&cache->c_list_lock);
> > +	__mb_cache_entry_free(cache, entry);
> >  	return NULL;
> >  }
> >  EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
> > @@ -306,42 +299,24 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
> >  				     unsigned long nr_to_scan)
> >  {
> >  	struct mb_cache_entry *entry;
> > -	struct hlist_bl_head *head;
> >  	unsigned long shrunk = 0;
> >
> >  	spin_lock(&cache->c_list_lock);
> >  	while (nr_to_scan-- && !list_empty(&cache->c_list)) {
> >  		entry = list_first_entry(&cache->c_list,
> >  					 struct mb_cache_entry, e_list);
> > -		if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
> > +		/* Drop initial hash reference if there is no user */
> > +		if (entry->e_referenced ||
> > +		    atomic_cmpxchg(&entry->e_refcnt, 1, 0) != 1) {
> 
> So here, if the refcnt of an entry is 1, does that mean it is still in use?

No. One reference is held by the hashtable/LRU itself, so a refcount of 1
means the entry is free.

> So the shrinker will do the atomic_cmpxchg and drop it to 0, and then
> delete the entry from the cache?
> This will only happen for entries with a reference count of exactly 1.
> 
> Is that the correct understanding?

Correct. Basically the atomic 1 -> 0 transition makes sure we are not
racing with anybody else doing the 1 -> 2 transition. And once the refcount
gets to 0, we make sure no new references can be created.
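
To make that concrete, the lookup and the shrinker cooperate roughly like
this (a minimal sketch based on the hunks quoted above, with the list and
hash locking trimmed):

	/* Lookup side: only take a reference if the entry is still live. */
	if (entry->e_key == key && entry->e_reusable &&
	    atomic_inc_not_zero(&entry->e_refcnt))	/* fails once refcnt is 0 */
		return entry;

	/*
	 * Shrinker side: claim a free entry (refcnt == 1, i.e. only the
	 * hash/LRU reference is left) by moving the count to 0. That also
	 * blocks any concurrent 1 -> 2 transition, because the lookup's
	 * atomic_inc_not_zero() fails from that point on.
	 */
	if (!entry->e_referenced &&
	    atomic_cmpxchg(&entry->e_refcnt, 1, 0) == 1) {
		list_del_init(&entry->e_list);
		cache->c_entry_count--;
		__mb_cache_entry_free(cache, entry);
	}

So both sides race on the same counter, and only one of the 1 -> 0 and
1 -> 2 transitions can win.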

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 06/10] ext2: Factor out freeing of xattr block reference
  2022-07-14 14:55     ` Jan Kara
@ 2022-07-14 16:17       ` Ritesh Harjani
  0 siblings, 0 replies; 31+ messages in thread
From: Ritesh Harjani @ 2022-07-14 16:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On 22/07/14 04:55PM, Jan Kara wrote:
> On Thu 14-07-22 18:07:14, Ritesh Harjani wrote:
> > On 22/07/12 12:54PM, Jan Kara wrote:
> > > Freeing of an xattr block reference is open coded in two places. Factor it
> > > out into a separate function and use it.
> >
> > Looked into the refactoring logic. The patch looks good to me.
> > Small queries below -
> >
> > >
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  fs/ext2/xattr.c | 90 +++++++++++++++++++++----------------------------
> > >  1 file changed, 38 insertions(+), 52 deletions(-)
> > >
> > > diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
> > > index 841fa6d9d744..9885294993ef 100644
> > > --- a/fs/ext2/xattr.c
> > > +++ b/fs/ext2/xattr.c
> > > @@ -651,6 +651,42 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
> > >  	return error;
> > >  }
> > >
> > > +static void ext2_xattr_release_block(struct inode *inode,
> > > +				     struct buffer_head *bh)
> > > +{
> > > +	struct mb_cache *ea_block_cache = EA_BLOCK_CACHE(inode);
> > > +
> > > +	lock_buffer(bh);
> > > +	if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
> > > +		__u32 hash = le32_to_cpu(HDR(bh)->h_hash);
> > > +
> > > +		/*
> > > +		 * This must happen under buffer lock for
> > > +		 * ext2_xattr_set2() to reliably detect freed block
> > > +		 */
> > > +		mb_cache_entry_delete(ea_block_cache, hash,
> > > +				      bh->b_blocknr);
> > > +		/* Free the old block. */
> > > +		ea_bdebug(bh, "freeing");
> > > +		ext2_free_blocks(inode, bh->b_blocknr, 1);
> > > +		/* We let our caller release bh, so we
> > > +		 * need to duplicate the buffer before. */
> > > +		get_bh(bh);
> > > +		bforget(bh);
> > > +		unlock_buffer(bh);
> > > +	} else {
> > > +		/* Decrement the refcount only. */
> > > +		le32_add_cpu(&HDR(bh)->h_refcount, -1);
> > > +		dquot_free_block(inode, 1);
> > > +		mark_buffer_dirty(bh);
> > > +		unlock_buffer(bh);
> > > +		ea_bdebug(bh, "refcount now=%d",
> > > +			le32_to_cpu(HDR(bh)->h_refcount));
> > > +		if (IS_SYNC(inode))
> > > +			sync_dirty_buffer(bh);
> > > +	}
> > > +}
> > > +
> > >  /*
> > >   * Second half of ext2_xattr_set(): Update the file system.
> > >   */
> > > @@ -747,34 +783,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
> > >  		 * If there was an old block and we are no longer using it,
> > >  		 * release the old block.
> > >  		 */
> > > -		lock_buffer(old_bh);
> > > -		if (HDR(old_bh)->h_refcount == cpu_to_le32(1)) {
> > > -			__u32 hash = le32_to_cpu(HDR(old_bh)->h_hash);
> > > -
> > > -			/*
> > > -			 * This must happen under buffer lock for
> > > -			 * ext2_xattr_set2() to reliably detect freed block
> > > -			 */
> > > -			mb_cache_entry_delete(ea_block_cache, hash,
> > > -					      old_bh->b_blocknr);
> > > -			/* Free the old block. */
> > > -			ea_bdebug(old_bh, "freeing");
> > > -			ext2_free_blocks(inode, old_bh->b_blocknr, 1);
> > > -			mark_inode_dirty(inode);
> >
> > ^^^ this is not needed because ext2_free_blocks() will take care of it.
> > Hence you have dropped this in ext2_xattr_release_block()
>
> Correct. ext2_free_blocks() always dirties the inode (unless a metadata
> inconsistency is found, in which case we don't really care).
>
> > > -			/* We let our caller release old_bh, so we
> > > -			 * need to duplicate the buffer before. */
> > > -			get_bh(old_bh);
> > > -			bforget(old_bh);
> > > -		} else {
> > > -			/* Decrement the refcount only. */
> > > -			le32_add_cpu(&HDR(old_bh)->h_refcount, -1);
> > > -			dquot_free_block_nodirty(inode, 1);
> > > -			mark_inode_dirty(inode);
> >
> > Quick qn -> Don't we need mark_inode_dirty() here?
>
> Notice that I've changed dquot_free_block_nodirty() to dquot_free_block()
> because the quota info update is the only reason we need to dirty the inode,
> so why not let the quota code handle it...
>

Ok, yes. Missed it. Thanks for pointing it out.
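
For the record, the equivalence is roughly the following (a sketch based on
the quotaops helpers as I understand them; the surrounding xattr code is
elided):

	/* Old code: update quota without dirtying, then dirty the inode
	 * by hand. */
	dquot_free_block_nodirty(inode, 1);
	mark_inode_dirty(inode);

	/* New code: dquot_free_block() updates the quota information and
	 * marks the inode dirty itself, so the explicit call goes away. */
	dquot_free_block(inode, 1);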

-ritesh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 05/10] ext4: Fix race when reusing xattr blocks
  2022-07-12 10:54 ` [PATCH 05/10] ext4: Fix race when reusing xattr blocks Jan Kara
  2022-07-14 12:26   ` Ritesh Harjani
@ 2022-07-16  3:00   ` Zhihao Cheng
  2022-07-25 15:23     ` Jan Kara
  1 sibling, 1 reply; 31+ messages in thread
From: Zhihao Cheng @ 2022-07-16  3:00 UTC (permalink / raw)
  To: Jan Kara, Ted Tso; +Cc: linux-ext4, Ritesh Harjani, stable

On 2022/7/12 18:54, Jan Kara wrote:
Hi Jan, one little question confuses me:
> When ext4_xattr_block_set() decides to remove xattr block the following
> race can happen:
> 
> CPU1                                    CPU2
> ext4_xattr_block_set()                  ext4_xattr_release_block()
>    new_bh = ext4_xattr_block_cache_find()
> 
>                                            lock_buffer(bh);
>                                            ref = le32_to_cpu(BHDR(bh)->h_refcount);
>                                            if (ref == 1) {
>                                              ...
>                                              mb_cache_entry_delete();
>                                              unlock_buffer(bh);
>                                              ext4_free_blocks();
>                                                ...
>                                                ext4_forget(..., bh, ...);
>                                                  jbd2_journal_revoke(..., bh);
> 
>    ext4_journal_get_write_access(..., new_bh, ...)
>      do_get_write_access()
>        jbd2_journal_cancel_revoke(..., new_bh);
> 
> Later the code in ext4_xattr_block_set() finds out the block got freed
> and cancels reusal of the block but the revoke stays canceled and so in
> case of block reuse and journal replay the filesystem can get corrupted.
> If the race works out slightly differently, we can also hit assertions
> in the jbd2 code.
> 
> Fix the problem by making sure that once matching mbcache entry is
> found, code dropping the last xattr block reference (or trying to modify
> xattr block in place) waits until the mbcache entry reference is
> dropped. This way code trying to reuse xattr block is protected from
> someone trying to drop the last reference to xattr block.
> 
> Reported-and-tested-by: Ritesh Harjani <ritesh.list@gmail.com>
> CC: stable@vger.kernel.org
> Fixes: 82939d7999df ("ext4: convert to mbcache2")
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>   fs/ext4/xattr.c | 67 +++++++++++++++++++++++++++++++++----------------
>   1 file changed, 45 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index aadfae53d055..3a0928c8720e 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -439,9 +439,16 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
>   /* Remove entry from mbcache when EA inode is getting evicted */
>   void ext4_evict_ea_inode(struct inode *inode)
>   {
> -	if (EA_INODE_CACHE(inode))
> -		mb_cache_entry_delete(EA_INODE_CACHE(inode),
> -			ext4_xattr_inode_get_hash(inode), inode->i_ino);
> +	struct mb_cache_entry *oe;
> +
> +	if (!EA_INODE_CACHE(inode))
> +		return;
> +	/* Wait for entry to get unused so that we can remove it */
> +	while ((oe = mb_cache_entry_delete_or_get(EA_INODE_CACHE(inode),
> +			ext4_xattr_inode_get_hash(inode), inode->i_ino))) {
> +		mb_cache_entry_wait_unused(oe);
> +		mb_cache_entry_put(EA_INODE_CACHE(inode), oe);
> +	}
>   }
>   
>   static int
> @@ -1229,6 +1236,7 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
>   	if (error)
>   		goto out;
>   
> +retry_ref:
>   	lock_buffer(bh);
>   	hash = le32_to_cpu(BHDR(bh)->h_hash);
>   	ref = le32_to_cpu(BHDR(bh)->h_refcount);
> @@ -1238,9 +1246,18 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
>   		 * This must happen under buffer lock for
>   		 * ext4_xattr_block_set() to reliably detect freed block
>   		 */
> -		if (ea_block_cache)
> -			mb_cache_entry_delete(ea_block_cache, hash,
> -					      bh->b_blocknr);
> +		if (ea_block_cache) {
> +			struct mb_cache_entry *oe;
> +
> +			oe = mb_cache_entry_delete_or_get(ea_block_cache, hash,
> +							  bh->b_blocknr);
> +			if (oe) {
> +				unlock_buffer(bh);
> +				mb_cache_entry_wait_unused(oe);
> +				mb_cache_entry_put(ea_block_cache, oe);
> +				goto retry_ref;
> +			}
> +		}
>   		get_bh(bh);
>   		unlock_buffer(bh);
>   
> @@ -1867,9 +1884,20 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>   			 * ext4_xattr_block_set() to reliably detect modified
>   			 * block
>   			 */
> -			if (ea_block_cache)
> -				mb_cache_entry_delete(ea_block_cache, hash,
> -						      bs->bh->b_blocknr);
> +			if (ea_block_cache) {
> +				struct mb_cache_entry *oe;
> +
> +				oe = mb_cache_entry_delete_or_get(ea_block_cache,
> +					hash, bs->bh->b_blocknr);
> +				if (oe) {
> +					/*
> +					 * Xattr block is getting reused. Leave
> +					 * it alone.
> +					 */
> +					mb_cache_entry_put(ea_block_cache, oe);
> +					goto clone_block;
> +				}
> +			}
>   			ea_bdebug(bs->bh, "modifying in-place");
>   			error = ext4_xattr_set_entry(i, s, handle, inode,
>   						     true /* is_block */);
> @@ -1885,6 +1913,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>   				goto cleanup;
>   			goto inserted;
>   		}
> +clone_block:
>   		unlock_buffer(bs->bh);
>   		ea_bdebug(bs->bh, "cloning");
>   		s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
> @@ -1991,18 +2020,13 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>   				lock_buffer(new_bh);
>   				/*
>   				 * We have to be careful about races with
> -				 * freeing, rehashing or adding references to
> -				 * xattr block. Once we hold buffer lock xattr
> -				 * block's state is stable so we can check
> -				 * whether the block got freed / rehashed or
> -				 * not.  Since we unhash mbcache entry under
> -				 * buffer lock when freeing / rehashing xattr
> -				 * block, checking whether entry is still
> -				 * hashed is reliable. Same rules hold for
> -				 * e_reusable handling.
> +				 * adding references to xattr block. Once we
> +				 * hold buffer lock xattr block's state is
> +				 * stable so we can check the additional
> +				 * reference fits.
>   				 */
> -				if (hlist_bl_unhashed(&ce->e_hash_list) ||
> -				    !ce->e_reusable) {
> +				ref = le32_to_cpu(BHDR(new_bh)->h_refcount) + 1;
> +				if (ref > EXT4_XATTR_REFCOUNT_MAX) {

So far, we have mb_cache_entry_delete_or_get() and
mb_cache_entry_wait_unused(), so a used cache entry cannot be concurrently
removed. Removing the check 'hlist_bl_unhashed(&ce->e_hash_list)' is okay.

What is the effect of changing the other two checks, 'ref >=
EXT4_XATTR_REFCOUNT_MAX' and '!ce->e_reusable'? Is it just to make the code
more obvious, i.e. to spell out the condition 'ref <=
EXT4_XATTR_REFCOUNT_MAX' rather than rely on 'ce->e_reusable', given that we
have already checked 'ce->e_reusable' in ext4_xattr_block_cache_find()?
>   					/*
>   					 * Undo everything and check mbcache
>   					 * again.
> @@ -2017,9 +2041,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>   					new_bh = NULL;
>   					goto inserted;
>   				}
> -				ref = le32_to_cpu(BHDR(new_bh)->h_refcount) + 1;
>   				BHDR(new_bh)->h_refcount = cpu_to_le32(ref);
> -				if (ref >= EXT4_XATTR_REFCOUNT_MAX)
> +				if (ref == EXT4_XATTR_REFCOUNT_MAX)
>   					ce->e_reusable = 0;
>   				ea_bdebug(new_bh, "reusing; refcount now=%d",
>   					  ref);
> 
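
(For context, the observation above that a used cache entry cannot be
concurrently removed comes from the pattern the series uses on the freeing
side; a condensed sketch of the ext4_xattr_release_block() hunk quoted
earlier:)

	/* Either the entry is unused and gets deleted right away, or we
	 * take a reference and wait for the current users to finish. */
	oe = mb_cache_entry_delete_or_get(ea_block_cache, hash, bh->b_blocknr);
	if (oe) {
		unlock_buffer(bh);
		mb_cache_entry_wait_unused(oe);
		mb_cache_entry_put(ea_block_cache, oe);
		goto retry_ref;	/* re-take the buffer lock and re-check */
	}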


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 01/10] mbcache: Don't reclaim used entries
  2022-07-12 10:54 ` [PATCH 01/10] mbcache: Don't reclaim used entries Jan Kara
  2022-07-14 11:47   ` Ritesh Harjani
@ 2022-07-22 13:58   ` Theodore Ts'o
  1 sibling, 0 replies; 31+ messages in thread
From: Theodore Ts'o @ 2022-07-22 13:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Ts'o, linux-ext4, Ritesh Harjani, stable

On Tue, 12 Jul 2022 12:54:20 +0200, Jan Kara wrote:
> Do not reclaim entries that are currently used by somebody from a
> shrinker. Firstly, these entries are likely useful. Secondly, we will
> need to keep such entries to protect pending increment of xattr block
> refcount.
> 

Applied, thanks!  (Some slight adjustments were needed to resolve a
merge conflict.)

[01/10] mbcache: Don't reclaim used entries
        commit: ee595bcf21a86af4cff673000e2728d61c7c0e7b
[02/10] mbcache: Add functions to delete entry if unused
        commit: ad3923aa44185f5f65e17764fe5c30501c6dfd22
[03/10] ext4: Remove EA inode entry from mbcache on inode eviction
        commit: 428dc374a6cb6c0cbbf6fe8984b667ef78dc7d75
[04/10] ext4: Unindent codeblock in ext4_xattr_block_set()
        commit: d52086dcf26a6284b08b5544210a7475b4837d52
[05/10] ext4: Fix race when reusing xattr blocks
        commit: 132991ed28822cfb4be41ac72195f00fc0baf3c8
[06/10] ext2: Factor out freeing of xattr block reference
        commit: c30e78a5f165244985aa346bdd460d459094470e
[07/10] ext2: Unindent codeblock in ext2_xattr_set()
        commit: 0e85fb030d13e427deca44a95aabb2475614f8d2
[08/10] ext2: Avoid deleting xattr block that is being reused
        commit: 44ce98e77ab4583b17ff4f501c2076eec3b759d7
[09/10] mbcache: Remove mb_cache_entry_delete()
        commit: c3671ffa0919f2d433576c99c4e211cd367afda0
[10/10] mbcache: Automatically delete entries from cache on freeing
        commit: b51539a7d04fb7d05b28ab9387364ccde88b6b6d

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 05/10] ext4: Fix race when reusing xattr blocks
  2022-07-16  3:00   ` Zhihao Cheng
@ 2022-07-25 15:23     ` Jan Kara
  2022-07-26  1:14       ` Zhihao Cheng
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2022-07-25 15:23 UTC (permalink / raw)
  To: Zhihao Cheng; +Cc: Jan Kara, Ted Tso, linux-ext4, Ritesh Harjani, stable

On Sat 16-07-22 11:00:46, Zhihao Cheng wrote:
> On 2022/7/12 18:54, Jan Kara wrote:
> Hi Jan, one little question confuses me:
> > When ext4_xattr_block_set() decides to remove xattr block the following
> > race can happen:
> > 
> > CPU1                                    CPU2
> > ext4_xattr_block_set()                  ext4_xattr_release_block()
> >    new_bh = ext4_xattr_block_cache_find()
> > 
> >                                            lock_buffer(bh);
> >                                            ref = le32_to_cpu(BHDR(bh)->h_refcount);
> >                                            if (ref == 1) {
> >                                              ...
> >                                              mb_cache_entry_delete();
> >                                              unlock_buffer(bh);
> >                                              ext4_free_blocks();
> >                                                ...
> >                                                ext4_forget(..., bh, ...);
> >                                                  jbd2_journal_revoke(..., bh);
> > 
> >    ext4_journal_get_write_access(..., new_bh, ...)
> >      do_get_write_access()
> >        jbd2_journal_cancel_revoke(..., new_bh);
> > 
> > Later the code in ext4_xattr_block_set() finds out the block got freed
> > and cancels reusal of the block but the revoke stays canceled and so in
> > case of block reuse and journal replay the filesystem can get corrupted.
> > If the race works out slightly differently, we can also hit assertions
> > in the jbd2 code.
> > 
> > Fix the problem by making sure that once matching mbcache entry is
> > found, code dropping the last xattr block reference (or trying to modify
> > xattr block in place) waits until the mbcache entry reference is
> > dropped. This way code trying to reuse xattr block is protected from
> > someone trying to drop the last reference to xattr block.
> > 
> > Reported-and-tested-by: Ritesh Harjani <ritesh.list@gmail.com>
> > CC: stable@vger.kernel.org
> > Fixes: 82939d7999df ("ext4: convert to mbcache2")
> > Signed-off-by: Jan Kara <jack@suse.cz>

...

> > @@ -1991,18 +2020,13 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
> >   				lock_buffer(new_bh);
> >   				/*
> >   				 * We have to be careful about races with
> > -				 * freeing, rehashing or adding references to
> > -				 * xattr block. Once we hold buffer lock xattr
> > -				 * block's state is stable so we can check
> > -				 * whether the block got freed / rehashed or
> > -				 * not.  Since we unhash mbcache entry under
> > -				 * buffer lock when freeing / rehashing xattr
> > -				 * block, checking whether entry is still
> > -				 * hashed is reliable. Same rules hold for
> > -				 * e_reusable handling.
> > +				 * adding references to xattr block. Once we
> > +				 * hold buffer lock xattr block's state is
> > +				 * stable so we can check the additional
> > +				 * reference fits.
> >   				 */
> > -				if (hlist_bl_unhashed(&ce->e_hash_list) ||
> > -				    !ce->e_reusable) {
> > +				ref = le32_to_cpu(BHDR(new_bh)->h_refcount) + 1;
> > +				if (ref > EXT4_XATTR_REFCOUNT_MAX) {
> 
> So far, we have mb_cache_entry_delete_or_get() and
> mb_cache_entry_wait_unused(), so a used cache entry cannot be concurrently
> removed. Removing the check 'hlist_bl_unhashed(&ce->e_hash_list)' is okay.
> 
> What is the effect of changing the other two checks, 'ref >=
> EXT4_XATTR_REFCOUNT_MAX' and '!ce->e_reusable'? Is it just to make the code
> more obvious, i.e. to spell out the condition 'ref <= EXT4_XATTR_REFCOUNT_MAX'
> rather than rely on 'ce->e_reusable', given that we have already checked
> 'ce->e_reusable' in ext4_xattr_block_cache_find()?

Well, ce->e_reusable is set if and only if BHDR(new_bh)->h_refcount <
EXT4_XATTR_REFCOUNT_MAX. So checking whether the refcount is small enough
is all that is needed and we don't need the ce->e_reusable check here.
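
In other words, the invariant is roughly the following (a sketch of how I
read fs/ext4/xattr.c, with details elided):

	/* When the xattr block is inserted into the block cache ... */
	reusable = le32_to_cpu(BHDR(bh)->h_refcount) < EXT4_XATTR_REFCOUNT_MAX;
	mb_cache_entry_create(ea_block_cache, GFP_NOFS, hash, bh->b_blocknr,
			      reusable);

	/* ... and when a reference is added under the buffer lock: */
	ref = le32_to_cpu(BHDR(new_bh)->h_refcount) + 1;
	if (ref <= EXT4_XATTR_REFCOUNT_MAX) {
		BHDR(new_bh)->h_refcount = cpu_to_le32(ref);
		if (ref == EXT4_XATTR_REFCOUNT_MAX)
			ce->e_reusable = 0;
	}

So the refcount check under the buffer lock already covers everything that
e_reusable would tell us at this point.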

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 05/10] ext4: Fix race when reusing xattr blocks
  2022-07-25 15:23     ` Jan Kara
@ 2022-07-26  1:14       ` Zhihao Cheng
  0 siblings, 0 replies; 31+ messages in thread
From: Zhihao Cheng @ 2022-07-26  1:14 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4, Ritesh Harjani, stable

On 2022/7/25 23:23, Jan Kara wrote:

>> So far, we have mb_cache_entry_delete_or_get() and
>> mb_cache_entry_wait_unused(), so a used cache entry cannot be concurrently
>> removed. Removing the check 'hlist_bl_unhashed(&ce->e_hash_list)' is okay.
>>
>> What is the effect of changing the other two checks, 'ref >=
>> EXT4_XATTR_REFCOUNT_MAX' and '!ce->e_reusable'? Is it just to make the code
>> more obvious, i.e. to spell out the condition 'ref <= EXT4_XATTR_REFCOUNT_MAX'
>> rather than rely on 'ce->e_reusable', given that we have already checked
>> 'ce->e_reusable' in ext4_xattr_block_cache_find()?
> 
> Well, ce->e_reusable is set if and only if BHDR(new_bh)->h_refcount <
> EXT4_XATTR_REFCOUNT_MAX. So checking whether the refcount is small enough
> is all that is needed and we don't need the ce->e_reusable check here.
> 

Thanks for replying, I get it now.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 06/10] ext2: Factor out freeing of xattr block reference
  2022-06-14 16:05 [PATCH 0/10 v2] " Jan Kara
@ 2022-06-14 16:05 ` Jan Kara
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2022-06-14 16:05 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Ritesh Harjani, Jan Kara

Freeing of an xattr block reference is open coded in two places. Factor it
out into a separate function and use it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/xattr.c | 90 +++++++++++++++++++++----------------------------
 1 file changed, 38 insertions(+), 52 deletions(-)

diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index 841fa6d9d744..9885294993ef 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -651,6 +651,42 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
 	return error;
 }
 
+static void ext2_xattr_release_block(struct inode *inode,
+				     struct buffer_head *bh)
+{
+	struct mb_cache *ea_block_cache = EA_BLOCK_CACHE(inode);
+
+	lock_buffer(bh);
+	if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
+		__u32 hash = le32_to_cpu(HDR(bh)->h_hash);
+
+		/*
+		 * This must happen under buffer lock for
+		 * ext2_xattr_set2() to reliably detect freed block
+		 */
+		mb_cache_entry_delete(ea_block_cache, hash,
+				      bh->b_blocknr);
+		/* Free the old block. */
+		ea_bdebug(bh, "freeing");
+		ext2_free_blocks(inode, bh->b_blocknr, 1);
+		/* We let our caller release bh, so we
+		 * need to duplicate the buffer before. */
+		get_bh(bh);
+		bforget(bh);
+		unlock_buffer(bh);
+	} else {
+		/* Decrement the refcount only. */
+		le32_add_cpu(&HDR(bh)->h_refcount, -1);
+		dquot_free_block(inode, 1);
+		mark_buffer_dirty(bh);
+		unlock_buffer(bh);
+		ea_bdebug(bh, "refcount now=%d",
+			le32_to_cpu(HDR(bh)->h_refcount));
+		if (IS_SYNC(inode))
+			sync_dirty_buffer(bh);
+	}
+}
+
 /*
  * Second half of ext2_xattr_set(): Update the file system.
  */
@@ -747,34 +783,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
 		 * If there was an old block and we are no longer using it,
 		 * release the old block.
 		 */
-		lock_buffer(old_bh);
-		if (HDR(old_bh)->h_refcount == cpu_to_le32(1)) {
-			__u32 hash = le32_to_cpu(HDR(old_bh)->h_hash);
-
-			/*
-			 * This must happen under buffer lock for
-			 * ext2_xattr_set2() to reliably detect freed block
-			 */
-			mb_cache_entry_delete(ea_block_cache, hash,
-					      old_bh->b_blocknr);
-			/* Free the old block. */
-			ea_bdebug(old_bh, "freeing");
-			ext2_free_blocks(inode, old_bh->b_blocknr, 1);
-			mark_inode_dirty(inode);
-			/* We let our caller release old_bh, so we
-			 * need to duplicate the buffer before. */
-			get_bh(old_bh);
-			bforget(old_bh);
-		} else {
-			/* Decrement the refcount only. */
-			le32_add_cpu(&HDR(old_bh)->h_refcount, -1);
-			dquot_free_block_nodirty(inode, 1);
-			mark_inode_dirty(inode);
-			mark_buffer_dirty(old_bh);
-			ea_bdebug(old_bh, "refcount now=%d",
-				le32_to_cpu(HDR(old_bh)->h_refcount));
-		}
-		unlock_buffer(old_bh);
+		ext2_xattr_release_block(inode, old_bh);
 	}
 
 cleanup:
@@ -828,30 +837,7 @@ ext2_xattr_delete_inode(struct inode *inode)
 			EXT2_I(inode)->i_file_acl);
 		goto cleanup;
 	}
-	lock_buffer(bh);
-	if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
-		__u32 hash = le32_to_cpu(HDR(bh)->h_hash);
-
-		/*
-		 * This must happen under buffer lock for ext2_xattr_set2() to
-		 * reliably detect freed block
-		 */
-		mb_cache_entry_delete(EA_BLOCK_CACHE(inode), hash,
-				      bh->b_blocknr);
-		ext2_free_blocks(inode, EXT2_I(inode)->i_file_acl, 1);
-		get_bh(bh);
-		bforget(bh);
-		unlock_buffer(bh);
-	} else {
-		le32_add_cpu(&HDR(bh)->h_refcount, -1);
-		ea_bdebug(bh, "refcount now=%d",
-			le32_to_cpu(HDR(bh)->h_refcount));
-		unlock_buffer(bh);
-		mark_buffer_dirty(bh);
-		if (IS_SYNC(inode))
-			sync_dirty_buffer(bh);
-		dquot_free_block_nodirty(inode, 1);
-	}
+	ext2_xattr_release_block(inode, bh);
 	EXT2_I(inode)->i_file_acl = 0;
 
 cleanup:
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2022-07-26  1:14 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-12 10:54 [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Jan Kara
2022-07-12 10:54 ` [PATCH 01/10] mbcache: Don't reclaim used entries Jan Kara
2022-07-14 11:47   ` Ritesh Harjani
2022-07-14 14:36     ` Jan Kara
2022-07-14 14:49       ` Ritesh Harjani
2022-07-22 13:58   ` Theodore Ts'o
2022-07-12 10:54 ` [PATCH 02/10] mbcache: Add functions to delete entry if unused Jan Kara
2022-07-14 12:15   ` Ritesh Harjani
2022-07-14 14:49     ` Jan Kara
2022-07-14 15:00       ` Ritesh Harjani
2022-07-12 10:54 ` [PATCH 03/10] ext4: Remove EA inode entry from mbcache on inode eviction Jan Kara
2022-07-12 10:54 ` [PATCH 04/10] ext4: Unindent codeblock in ext4_xattr_block_set() Jan Kara
2022-07-14 12:19   ` Ritesh Harjani
2022-07-12 10:54 ` [PATCH 05/10] ext4: Fix race when reusing xattr blocks Jan Kara
2022-07-14 12:26   ` Ritesh Harjani
2022-07-16  3:00   ` Zhihao Cheng
2022-07-25 15:23     ` Jan Kara
2022-07-26  1:14       ` Zhihao Cheng
2022-07-12 10:54 ` [PATCH 06/10] ext2: Factor out freeing of xattr block reference Jan Kara
2022-07-14 12:37   ` Ritesh Harjani
2022-07-14 14:55     ` Jan Kara
2022-07-14 16:17       ` Ritesh Harjani
2022-07-12 10:54 ` [PATCH 07/10] ext2: Unindent codeblock in ext2_xattr_set() Jan Kara
2022-07-14 12:38   ` Ritesh Harjani
2022-07-12 10:54 ` [PATCH 08/10] ext2: Avoid deleting xattr block that is being reused Jan Kara
2022-07-12 10:54 ` [PATCH 09/10] mbcache: Remove mb_cache_entry_delete() Jan Kara
2022-07-12 10:54 ` [PATCH 10/10] mbcache: Automatically delete entries from cache on freeing Jan Kara
2022-07-14 13:09   ` Ritesh Harjani
2022-07-14 15:05     ` Jan Kara
2022-07-12 12:47 ` [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races Ritesh Harjani
  -- strict thread matches above, loose matches on Subject: below --
2022-06-14 16:05 [PATCH 0/10 v2] " Jan Kara
2022-06-14 16:05 ` [PATCH 06/10] ext2: Factor out freeing of xattr block reference Jan Kara
