* [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option
@ 2023-08-24  9:26 Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 01/16] ext4: correct the start block of counting reserved clusters Zhang Yi
                   ` (16 more replies)
  0 siblings, 17 replies; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Hello,

The delayed allocation method allocates blocks during page writeback in
ext4_writepages(), which cannot handle block allocation failures (e.g.
ENOSPC while acquiring more extent blocks). In order to deal with this,
commit '79f0be8d2e6e ("ext4: Switch to non delalloc mode when we are low
on free blocks count.")' introduced ext4_nonda_switch() to switch to
nodelalloc mode if the free blocks count is less than 150% of the dirty
blocks, or below the watermark. In the meantime, commit '27dd43854227
("ext4: introduce reserved space")' reserves some of the file system
space (2% or 4096 clusters, whichever is smaller). Both of these
solutions make sure that space is not exhausted when mapping delalloc
blocks in most cases, but they cannot guarantee it in all cases, which
could lead to an infinite loop or data loss (please see patch 14 for
details).

This patch set aims to reserve metadata space more accurately for the
delalloc mount option. Metadata block reservation is very tricky and is
also related to the contiguity of physical blocks; an effective way is
to reserve for the worst case, which means that every data block is
discontiguous and each data block costs one extent entry. Reserving
metadata space for the worst case makes sure that enough blocks are
reserved during data writeback, and the unused reservation can be
released after mapping the data blocks. After doing this, add a worker
to submit delayed allocations to prevent excessive reservations.
Finally, we can completely drop the policy of switching back to
non-delayed allocation. A rough model of the worst-case math is
sketched below.
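
As an illustration of the worst-case math, here is a minimal userspace
sketch (not the kernel code; worst_case_meta_blocks() and the ~340
entries-per-block figure are assumptions for a 4k block size):

#include <stdio.h>

/*
 * Worst case for 'n' delalloc blocks: every data block is
 * discontiguous, so each one costs its own extent entry. With
 * 'per_leaf' entries per leaf block and 'per_idx' entries per index
 * block, the reservation is the sum over all extent tree levels.
 */
static unsigned long worst_case_meta_blocks(unsigned long n,
					    unsigned long per_leaf,
					    unsigned long per_idx)
{
	unsigned long meta = 0;

	if (n == 0)
		return 0;
	/* leaf level: one extent entry per discontiguous data block */
	n = (n + per_leaf - 1) / per_leaf;
	meta += n;
	/* index levels, up to a single root */
	while (n > 1) {
		n = (n + per_idx - 1) / per_idx;
		meta += n;
	}
	return meta;
}

int main(void)
{
	/* roughly 340 extent entries fit in a 4k block */
	printf("%lu\n", worst_case_meta_blocks(100000, 340, 340));
	return 0;
}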

The patch set is based on the latest ext4 dev branch.
Patch 1-2:   Fix two reserved data blocks problems triggered when
             bigalloc feature is enabled.
Patch 3-6:   Move reserved data blocks updating from
             ext4_{ext|ind}_map_blocks() to ext4_es_insert_extent(),
             preparing for reserving metadata.
Patch 7-14:  Reserve metadata blocks for delayed allocation as the worst
             case, and update count after allocating or releasing.
Patch 15-16: In order to prevent excessive reserved metadata blocks
             that could trigger false-positive out-of-space errors
             (the actual usage is much smaller after allocating data
             blocks), add a worker to submit IO if the reservation is
             too big.

About tests:
1. This patch set has passed 'kvm-xfstests -g auto' many times.
2. Performance does not look significantly affected after running the
   following tests on my virtual machine with 4 CPU cores and 32GB of
   memory, based on a Kunpeng-920 arm64 CPU and a 1.5TB NVMe SSD.

 fio -directory=/test -direct=0 -iodepth=10 -fsync=$sync -rw=$rw \
     -numjobs=${numjobs} -bs=${bs}k -ioengine=libaio -size=10G \
     -ramp_time=10 -runtime=60 -norandommap=0 -group_reporting \
     -name=tests

 Disable bigalloc:
                               | Before           | After
 rw         fsync jobs  bs(kB) | iops   bw(MiB/s) | iops   bw(MiB/s)
 ------------------------------|------------------|-----------------
 write      0     1     4      | 27500  107       | 27100  106
 write      0     4     4      | 33900  132       | 35300  138
 write      0     1     1024   | 134    135       | 149    150
 write      0     4     1024   | 172    174       | 199    200
 write      1     1     4      | 1530   6.1       | 1651   6.6
 write      1     4     4      | 3139   12.3      | 3131   12.2
 write      1     1     1024   | 184    185       | 195    196
 write      1     4     1024   | 117    119       | 114    115
 randwrite  0     1     4      | 17900  69.7      | 17600  68.9
 randwrite  0     4     4      | 32700  128       | 34600  135
 randwrite  0     1     1024   | 145    146       | 155    155
 randwrite  0     4     1024   | 193    194       | 207    209
 randwrite  1     1     4      | 1335   5.3       | 1444   5.7
 randwrite  1     4     4      | 3364   13.1      | 3428   13.4
 randwrite  1     1     1024   | 180    180       | 171    172
 randwrite  1     4     1024   | 132    134       | 141    142

 Enable bigalloc:
                               | Before           | After
 rw         fsync jobs  bs(kB) | iops   bw(MiB/s) | iops   bw(MiB/s)
 ------------------------------|------------------|-----------------
 write      0     1     4      | 27500  107       | 30300  118
 write      0     4     4      | 28800  112       | 34000  137
 write      0     1     1024   | 141    142       | 162    162
 write      0     4     1024   | 172    173       | 195    196
 write      1     1     4      | 1410   5.6       | 1302   5.2
 write      1     4     4      | 3052   11.9      | 3002   11.7
 write      1     1     1024   | 153    153       | 163    164
 write      1     4     1024   | 113    114       | 110    111
 randwrite  0     1     4      | 17500  68.5      | 18400  72
 randwrite  0     4     4      | 26600  104       | 24800  96
 randwrite  0     1     1024   | 170    171       | 165    165
 randwrite  0     4     1024   | 168    169       | 152    153
 randwrite  1     1     4      | 1281   5.1       | 1335   5.3
 randwrite  1     4     4      | 3115   12.2      | 3315   12
 randwrite  1     1     1024   | 150    150       | 151    152
 randwrite  1     4     1024   | 134    135       | 132    133

 Tests on ramdisk:

 Disable bigalloc
                               | Before           | After
 rw         fsync jobs  bs(kB) | iops   bw(MiB/s) | iops   bw(MiB/s)
 ------------------------------|------------------|-----------------
 write      1     1     4      | 4699   18.4      | 4858   18
 write      1     1     1024   | 245    246       | 247    248

 Enable bigalloc 
                               | Before           | After
 rw         fsync jobs  bs(kB) | iops   bw(MiB/s) | iops   bw(MiB/s)
 ------------------------------|------------------|-----------------
 write      1     1     4      | 4634   18.1      | 5073   19.8
 write      1     1     1024   | 246    247       | 268    269

Thanks,
Yi.

Zhang Yi (16):
  ext4: correct the start block of counting reserved clusters
  ext4: make sure allocating a pending entry does not fail
  ext4: let __revise_pending() return the number of newly inserted pendings
  ext4: count removed reserved blocks for delalloc only es entry
  ext4: pass real delayed status into ext4_es_insert_extent()
  ext4: move delalloc data reserve space updating into
    ext4_es_insert_extent()
  ext4: count inode's total delalloc data blocks into ext4_es_tree
  ext4: refactor delalloc space reservation
  ext4: count reserved metadata blocks for delalloc per inode
  ext4: reserve meta blocks in ext4_da_reserve_space()
  ext4: factor out common part of
    ext4_da_{release|update_reserve}_space()
  ext4: update reserved meta blocks in
    ext4_da_{release|update_reserve}_space()
  ext4: calculate the worst extent blocks needed of a delalloc es entry
  ext4: reserve extent blocks for delalloc
  ext4: flush delalloc blocks if no free space
  ext4: drop ext4_nonda_switch()

 fs/ext4/balloc.c            |  47 ++++-
 fs/ext4/ext4.h              |  14 +-
 fs/ext4/extents.c           |  65 +++----
 fs/ext4/extents_status.c    | 340 +++++++++++++++++++++---------------
 fs/ext4/extents_status.h    |   3 +-
 fs/ext4/indirect.c          |   7 -
 fs/ext4/inode.c             | 191 ++++++++++----------
 fs/ext4/super.c             |  22 ++-
 include/trace/events/ext4.h |  70 ++++++--
 9 files changed, 439 insertions(+), 320 deletions(-)

-- 
2.39.2



* [RFC PATCH 01/16] ext4: correct the start block of counting reserved clusters
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-30 13:10   ` Jan Kara
  2023-08-24  9:26 ` [RFC PATCH 02/16] ext4: make sure allocating a pending entry does not fail Zhang Yi
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

When the bigalloc feature is enabled, we need to count and update
reserved clusters before removing a delayed-only extent_status entry.
{init|count|get}_rsvd() already do this, but the start block number of
this counting isn't correct in the following case.

  lblk            end
   |               |
   v               v
          -------------------------
          |                       | orig_es
          -------------------------
                   ^              ^
      len1 is 0    |     len2     |

If the start block of the orig_es entry found is bigger than lblk, we
passed lblk as the start block to count_rsvd(), but the length is
correct, so the range that gets counted ends up shifted. Fix this by
passing 'orig_es.es_lblk + len1' as the start block instead, as shown
in the sketch below.
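
A minimal userspace sketch of the corrected computation (rsvd_start()
is a hypothetical helper, not the kernel code):

#include <assert.h>

typedef unsigned int ext4_lblk_t;

struct es {
	ext4_lblk_t es_lblk;	/* first logical block */
	ext4_lblk_t es_len;	/* length in blocks */
};

/*
 * When trimming 'orig' against a removal range, 'len1' blocks survive
 * at the front of the extent, so counting reserved clusters must start
 * at orig->es_lblk + len1, not at the removal start 'lblk', which may
 * lie before the extent (len1 == 0 in the diagram above).
 */
static ext4_lblk_t rsvd_start(const struct es *orig, ext4_lblk_t lblk,
			      ext4_lblk_t len1)
{
	(void)lblk;	/* the old, incorrect start */
	return orig->es_lblk + len1;
}

int main(void)
{
	struct es orig = { .es_lblk = 100, .es_len = 50 };

	/* removal starts at 90, before the extent: len1 == 0 */
	assert(rsvd_start(&orig, 90, 0) == 100);	/* old code used 90 */
	return 0;
}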

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents_status.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 6f7de14c0fa8..5e625ea4545d 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -1405,8 +1405,8 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 			}
 		}
 		if (count_reserved)
-			count_rsvd(inode, lblk, orig_es.es_len - len1 - len2,
-				   &orig_es, &rc);
+			count_rsvd(inode, orig_es.es_lblk + len1,
+				   orig_es.es_len - len1 - len2, &orig_es, &rc);
 		goto out_get_reserved;
 	}
 
-- 
2.39.2



* [RFC PATCH 02/16] ext4: make sure allocating a pending entry does not fail
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 01/16] ext4: correct the start block of counting reserved clusters Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-30 13:25   ` Jan Kara
  2023-08-24  9:26 ` [RFC PATCH 03/16] ext4: let __revise_pending() return the number of newly inserted pendings Zhang Yi
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

__insert_pending() allocates memory in atomic context, so the
allocation could fail, but we are not handling that failure now. It
could lead ext4_es_remove_extent() to get a wrong reserved clusters
count, and the global data block reservation count would then be
incorrect. Do the same as the extents_status entry preallocation:
preallocate the pending entry outside of i_es_lock with __GFP_NOFAIL,
making sure that __insert_pending() and __revise_pending() always
succeed. The pattern is sketched below.
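
A self-contained userspace model of that pattern (hypothetical names; a
pthread rwlock and malloc() stand in for i_es_lock and the kmem
caches):

#include <pthread.h>
#include <stdlib.h>

static pthread_rwlock_t es_lock = PTHREAD_RWLOCK_INITIALIZER;

struct pending { long lclu; };

/* Fails like GFP_ATOMIC would when no preallocated entry is at hand. */
static int insert_pending(long lclu, struct pending **prealloc)
{
	struct pending *pr = *prealloc;

	if (pr)
		*prealloc = NULL;		/* consume the preallocation */
	else
		pr = malloc(sizeof(*pr));	/* models GFP_ATOMIC */
	if (!pr)
		return -1;			/* -ENOMEM */
	pr->lclu = lclu;
	/* ... link pr into the pending tree ... */
	return 0;
}

static void insert_with_retry(long lclu)
{
	struct pending *pr = NULL;
	int err = 0;

retry:
	if (err && !pr)		/* preallocate outside the lock on retry */
		pr = malloc(sizeof(*pr));	/* models __GFP_NOFAIL */
	pthread_rwlock_wrlock(&es_lock);
	err = insert_pending(lclu, &pr);
	pthread_rwlock_unlock(&es_lock);
	if (err)
		goto retry;
	free(pr);	/* NULL-safe: frees a preallocation left unused */
}

int main(void)
{
	insert_with_retry(42);
	return 0;
}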

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents_status.c | 123 ++++++++++++++++++++++++++++-----------
 1 file changed, 89 insertions(+), 34 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 5e625ea4545d..f4b50652f0cc 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -152,8 +152,9 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan);
 static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
 		       struct ext4_inode_info *locked_ei);
-static void __revise_pending(struct inode *inode, ext4_lblk_t lblk,
-			     ext4_lblk_t len);
+static int __revise_pending(struct inode *inode, ext4_lblk_t lblk,
+			    ext4_lblk_t len,
+			    struct pending_reservation **prealloc);
 
 int __init ext4_init_es(void)
 {
@@ -448,6 +449,19 @@ static void ext4_es_list_del(struct inode *inode)
 	spin_unlock(&sbi->s_es_lock);
 }
 
+static inline struct pending_reservation *__alloc_pending(bool nofail)
+{
+	if (!nofail)
+		return kmem_cache_alloc(ext4_pending_cachep, GFP_ATOMIC);
+
+	return kmem_cache_zalloc(ext4_pending_cachep, GFP_KERNEL | __GFP_NOFAIL);
+}
+
+static inline void __free_pending(struct pending_reservation *pr)
+{
+	kmem_cache_free(ext4_pending_cachep, pr);
+}
+
 /*
  * Returns true if we cannot fail to allocate memory for this extent_status
  * entry and cannot reclaim it until its status changes.
@@ -836,11 +850,12 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 {
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
-	int err1 = 0;
-	int err2 = 0;
+	int err1 = 0, err2 = 0, err3 = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct extent_status *es1 = NULL;
 	struct extent_status *es2 = NULL;
+	struct pending_reservation *pr = NULL;
+	bool revise_pending = false;
 
 	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
 		return;
@@ -868,11 +883,17 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 
 	ext4_es_insert_extent_check(inode, &newes);
 
+	revise_pending = sbi->s_cluster_ratio > 1 &&
+			 test_opt(inode->i_sb, DELALLOC) &&
+			 (status & (EXTENT_STATUS_WRITTEN |
+				    EXTENT_STATUS_UNWRITTEN));
 retry:
 	if (err1 && !es1)
 		es1 = __es_alloc_extent(true);
 	if ((err1 || err2) && !es2)
 		es2 = __es_alloc_extent(true);
+	if ((err1 || err2 || err3) && revise_pending && !pr)
+		pr = __alloc_pending(true);
 	write_lock(&EXT4_I(inode)->i_es_lock);
 
 	err1 = __es_remove_extent(inode, lblk, end, NULL, es1);
@@ -897,13 +918,18 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 		es2 = NULL;
 	}
 
-	if (sbi->s_cluster_ratio > 1 && test_opt(inode->i_sb, DELALLOC) &&
-	    (status & EXTENT_STATUS_WRITTEN ||
-	     status & EXTENT_STATUS_UNWRITTEN))
-		__revise_pending(inode, lblk, len);
+	if (revise_pending) {
+		err3 = __revise_pending(inode, lblk, len, &pr);
+		if (err3 != 0)
+			goto error;
+		if (pr) {
+			__free_pending(pr);
+			pr = NULL;
+		}
+	}
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
-	if (err1 || err2)
+	if (err1 || err2 || err3)
 		goto retry;
 
 	ext4_es_print_tree(inode);
@@ -1311,7 +1337,7 @@ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end,
 				rc->ndelonly--;
 				node = rb_next(&pr->rb_node);
 				rb_erase(&pr->rb_node, &tree->root);
-				kmem_cache_free(ext4_pending_cachep, pr);
+				__free_pending(pr);
 				if (!node)
 					break;
 				pr = rb_entry(node, struct pending_reservation,
@@ -1907,11 +1933,13 @@ static struct pending_reservation *__get_pending(struct inode *inode,
  *
  * @inode - file containing the cluster
  * @lblk - logical block in the cluster to be added
+ * @prealloc - preallocated pending entry
  *
  * Returns 0 on successful insertion and -ENOMEM on failure.  If the
  * pending reservation is already in the set, returns successfully.
  */
-static int __insert_pending(struct inode *inode, ext4_lblk_t lblk)
+static int __insert_pending(struct inode *inode, ext4_lblk_t lblk,
+			    struct pending_reservation **prealloc)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct ext4_pending_tree *tree = &EXT4_I(inode)->i_pending_tree;
@@ -1937,10 +1965,15 @@ static int __insert_pending(struct inode *inode, ext4_lblk_t lblk)
 		}
 	}
 
-	pr = kmem_cache_alloc(ext4_pending_cachep, GFP_ATOMIC);
-	if (pr == NULL) {
-		ret = -ENOMEM;
-		goto out;
+	if (likely(*prealloc == NULL)) {
+		pr = __alloc_pending(false);
+		if (!pr) {
+			ret = -ENOMEM;
+			goto out;
+		}
+	} else {
+		pr = *prealloc;
+		*prealloc = NULL;
 	}
 	pr->lclu = lclu;
 
@@ -1970,7 +2003,7 @@ static void __remove_pending(struct inode *inode, ext4_lblk_t lblk)
 	if (pr != NULL) {
 		tree = &EXT4_I(inode)->i_pending_tree;
 		rb_erase(&pr->rb_node, &tree->root);
-		kmem_cache_free(ext4_pending_cachep, pr);
+		__free_pending(pr);
 	}
 }
 
@@ -2029,10 +2062,10 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 				  bool allocated)
 {
 	struct extent_status newes;
-	int err1 = 0;
-	int err2 = 0;
+	int err1 = 0, err2 = 0, err3 = 0;
 	struct extent_status *es1 = NULL;
 	struct extent_status *es2 = NULL;
+	struct pending_reservation *pr = NULL;
 
 	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
 		return;
@@ -2052,6 +2085,8 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 		es1 = __es_alloc_extent(true);
 	if ((err1 || err2) && !es2)
 		es2 = __es_alloc_extent(true);
+	if ((err1 || err2 || err3) && allocated && !pr)
+		pr = __alloc_pending(true);
 	write_lock(&EXT4_I(inode)->i_es_lock);
 
 	err1 = __es_remove_extent(inode, lblk, lblk, NULL, es1);
@@ -2074,11 +2109,18 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 		es2 = NULL;
 	}
 
-	if (allocated)
-		__insert_pending(inode, lblk);
+	if (allocated) {
+		err3 = __insert_pending(inode, lblk, &pr);
+		if (err3 != 0)
+			goto error;
+		if (pr) {
+			__free_pending(pr);
+			pr = NULL;
+		}
+	}
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
-	if (err1 || err2)
+	if (err1 || err2 || err3)
 		goto retry;
 
 	ext4_es_print_tree(inode);
@@ -2184,21 +2226,24 @@ unsigned int ext4_es_delayed_clu(struct inode *inode, ext4_lblk_t lblk,
  * @inode - file containing the range
  * @lblk - logical block defining the start of range
  * @len  - length of range in blocks
+ * @prealloc - preallocated pending entry
  *
  * Used after a newly allocated extent is added to the extents status tree.
  * Requires that the extents in the range have either written or unwritten
  * status.  Must be called while holding i_es_lock.
  */
-static void __revise_pending(struct inode *inode, ext4_lblk_t lblk,
-			     ext4_lblk_t len)
+static int __revise_pending(struct inode *inode, ext4_lblk_t lblk,
+			    ext4_lblk_t len,
+			    struct pending_reservation **prealloc)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	ext4_lblk_t end = lblk + len - 1;
 	ext4_lblk_t first, last;
 	bool f_del = false, l_del = false;
+	int ret = 0;
 
 	if (len == 0)
-		return;
+		return 0;
 
 	/*
 	 * Two cases - block range within single cluster and block range
@@ -2219,7 +2264,9 @@ static void __revise_pending(struct inode *inode, ext4_lblk_t lblk,
 			f_del = __es_scan_range(inode, &ext4_es_is_delonly,
 						first, lblk - 1);
 		if (f_del) {
-			__insert_pending(inode, first);
+			ret = __insert_pending(inode, first, prealloc);
+			if (ret < 0)
+				goto out;
 		} else {
 			last = EXT4_LBLK_CMASK(sbi, end) +
 			       sbi->s_cluster_ratio - 1;
@@ -2227,9 +2274,11 @@ static void __revise_pending(struct inode *inode, ext4_lblk_t lblk,
 				l_del = __es_scan_range(inode,
 							&ext4_es_is_delonly,
 							end + 1, last);
-			if (l_del)
-				__insert_pending(inode, last);
-			else
+			if (l_del) {
+				ret = __insert_pending(inode, last, prealloc);
+				if (ret < 0)
+					goto out;
+			} else
 				__remove_pending(inode, last);
 		}
 	} else {
@@ -2237,18 +2286,24 @@ static void __revise_pending(struct inode *inode, ext4_lblk_t lblk,
 		if (first != lblk)
 			f_del = __es_scan_range(inode, &ext4_es_is_delonly,
 						first, lblk - 1);
-		if (f_del)
-			__insert_pending(inode, first);
-		else
+		if (f_del) {
+			ret = __insert_pending(inode, first, prealloc);
+			if (ret < 0)
+				goto out;
+		} else
 			__remove_pending(inode, first);
 
 		last = EXT4_LBLK_CMASK(sbi, end) + sbi->s_cluster_ratio - 1;
 		if (last != end)
 			l_del = __es_scan_range(inode, &ext4_es_is_delonly,
 						end + 1, last);
-		if (l_del)
-			__insert_pending(inode, last);
-		else
+		if (l_del) {
+			ret = __insert_pending(inode, last, prealloc);
+			if (ret < 0)
+				goto out;
+		} else
 			__remove_pending(inode, last);
 	}
+out:
+	return ret;
 }
-- 
2.39.2



* [RFC PATCH 03/16] ext4: let __revise_pending() return the number of newly inserted pendings
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 01/16] ext4: correct the start block of counting reserved clusters Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 02/16] ext4: make sure allocating a pending entry does not fail Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 04/16] ext4: count removed reserved blocks for delalloc only es entry Zhang Yi
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Change __insert_pending() to return 1 on successfully inserting a new
pending cluster, and then change __revise_pending() to return the
number of newly inserted pendings, as modeled below.
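
A tiny model of the new convention (insert_pending() is a hypothetical
stand-in for __insert_pending(), not the kernel function):

#include <assert.h>

/*
 * New return convention: < 0 is an error, 0 means the pending
 * reservation already existed, 1 means a new entry was inserted.
 * Callers must now check 'ret < 0' rather than 'ret != 0', and can
 * sum the return values to count new inserts.
 */
static int insert_pending(int already_present, int nomem)
{
	if (nomem)
		return -12;		/* -ENOMEM */
	return already_present ? 0 : 1;
}

int main(void)
{
	int pendings = 0, ret;

	ret = insert_pending(0, 0);	/* new entry */
	assert(ret >= 0);
	pendings += ret;

	ret = insert_pending(1, 0);	/* already present */
	assert(ret >= 0);
	pendings += ret;

	assert(pendings == 1);
	return 0;
}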

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents_status.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index f4b50652f0cc..67ac09930541 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -892,7 +892,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 		es1 = __es_alloc_extent(true);
 	if ((err1 || err2) && !es2)
 		es2 = __es_alloc_extent(true);
-	if ((err1 || err2 || err3) && revise_pending && !pr)
+	if ((err1 || err2 || err3 < 0) && revise_pending && !pr)
 		pr = __alloc_pending(true);
 	write_lock(&EXT4_I(inode)->i_es_lock);
 
@@ -920,7 +920,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 
 	if (revise_pending) {
 		err3 = __revise_pending(inode, lblk, len, &pr);
-		if (err3 != 0)
+		if (err3 < 0)
 			goto error;
 		if (pr) {
 			__free_pending(pr);
@@ -929,7 +929,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	}
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
-	if (err1 || err2 || err3)
+	if (err1 || err2 || err3 < 0)
 		goto retry;
 
 	ext4_es_print_tree(inode);
@@ -1935,7 +1935,7 @@ static struct pending_reservation *__get_pending(struct inode *inode,
  * @lblk - logical block in the cluster to be added
  * @prealloc - preallocated pending entry
  *
- * Returns 0 on successful insertion and -ENOMEM on failure.  If the
+ * Returns 1 on successful insertion and -ENOMEM on failure.  If the
  * pending reservation is already in the set, returns successfully.
  */
 static int __insert_pending(struct inode *inode, ext4_lblk_t lblk,
@@ -1979,6 +1979,7 @@ static int __insert_pending(struct inode *inode, ext4_lblk_t lblk,
 
 	rb_link_node(&pr->rb_node, parent, p);
 	rb_insert_color(&pr->rb_node, &tree->root);
+	ret = 1;
 
 out:
 	return ret;
@@ -2085,7 +2086,7 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 		es1 = __es_alloc_extent(true);
 	if ((err1 || err2) && !es2)
 		es2 = __es_alloc_extent(true);
-	if ((err1 || err2 || err3) && allocated && !pr)
+	if ((err1 || err2 || err3 < 0) && allocated && !pr)
 		pr = __alloc_pending(true);
 	write_lock(&EXT4_I(inode)->i_es_lock);
 
@@ -2111,7 +2112,7 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 
 	if (allocated) {
 		err3 = __insert_pending(inode, lblk, &pr);
-		if (err3 != 0)
+		if (err3 < 0)
 			goto error;
 		if (pr) {
 			__free_pending(pr);
@@ -2120,7 +2121,7 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 	}
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
-	if (err1 || err2 || err3)
+	if (err1 || err2 || err3 < 0)
 		goto retry;
 
 	ext4_es_print_tree(inode);
@@ -2230,7 +2231,9 @@ unsigned int ext4_es_delayed_clu(struct inode *inode, ext4_lblk_t lblk,
  *
  * Used after a newly allocated extent is added to the extents status tree.
  * Requires that the extents in the range have either written or unwritten
- * status.  Must be called while holding i_es_lock.
+ * status.  Must be called while holding i_es_lock.  Returns the number
+ * of newly inserted pending clusters on insertion, 0 when only removing
+ * pendings, and -ENOMEM on failure.
  */
 static int __revise_pending(struct inode *inode, ext4_lblk_t lblk,
 			    ext4_lblk_t len,
@@ -2240,6 +2243,7 @@ static int __revise_pending(struct inode *inode, ext4_lblk_t lblk,
 	ext4_lblk_t end = lblk + len - 1;
 	ext4_lblk_t first, last;
 	bool f_del = false, l_del = false;
+	int pendings = 0;
 	int ret = 0;
 
 	if (len == 0)
@@ -2267,6 +2271,7 @@ static int __revise_pending(struct inode *inode, ext4_lblk_t lblk,
 			ret = __insert_pending(inode, first, prealloc);
 			if (ret < 0)
 				goto out;
+			pendings += ret;
 		} else {
 			last = EXT4_LBLK_CMASK(sbi, end) +
 			       sbi->s_cluster_ratio - 1;
@@ -2278,6 +2283,7 @@ static int __revise_pending(struct inode *inode, ext4_lblk_t lblk,
 				ret = __insert_pending(inode, last, prealloc);
 				if (ret < 0)
 					goto out;
+				pendings += ret;
 			} else
 				__remove_pending(inode, last);
 		}
@@ -2290,6 +2296,7 @@ static int __revise_pending(struct inode *inode, ext4_lblk_t lblk,
 			ret = __insert_pending(inode, first, prealloc);
 			if (ret < 0)
 				goto out;
+			pendings += ret;
 		} else
 			__remove_pending(inode, first);
 
@@ -2301,9 +2308,10 @@ static int __revise_pending(struct inode *inode, ext4_lblk_t lblk,
 			ret = __insert_pending(inode, last, prealloc);
 			if (ret < 0)
 				goto out;
+			pendings += ret;
 		} else
 			__remove_pending(inode, last);
 	}
 out:
-	return ret;
+	return (ret < 0) ? ret : pendings;
 }
-- 
2.39.2



* [RFC PATCH 04/16] ext4: count removed reserved blocks for delalloc only es entry
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
                   ` (2 preceding siblings ...)
  2023-08-24  9:26 ` [RFC PATCH 03/16] ext4: let __revise_pending() return the number of newly inserted pendings Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 05/16] ext4: pass real delayed status into ext4_es_insert_extent() Zhang Yi
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Currently, __es_remove_extent() counts reserved clusters if the removed
es entry is delalloc only. This equals the number of blocks if the
bigalloc feature is disabled, but we cannot derive the number of blocks
from it if that feature is enabled. So add a parameter to count the
number of reserved blocks; it is not used in this patch yet, but it
will be used to calculate reserved meta blocks. The sketch below
illustrates the distinction.
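
For intuition, a minimal sketch of the cluster/block distinction
(illustrative helpers, not the kernel's EXT4_B2C() machinery):

#include <assert.h>

typedef unsigned int ext4_lblk_t;

/* logical block to logical cluster, with 2^bits blocks per cluster */
static ext4_lblk_t b2c(ext4_lblk_t blk, unsigned int bits)
{
	return blk >> bits;
}

static unsigned int clusters_in_range(ext4_lblk_t lblk, ext4_lblk_t len,
				      unsigned int bits)
{
	if (len == 0)
		return 0;
	return b2c(lblk + len - 1, bits) - b2c(lblk, bits) + 1;
}

int main(void)
{
	/* no bigalloc (ratio 1): clusters == blocks, 8 == 8 */
	assert(clusters_in_range(3, 8, 0) == 8);
	/*
	 * bigalloc, 16 blocks per cluster: 8 blocks span only 2
	 * clusters, so the block count can no longer be derived from
	 * the cluster count and must be tracked separately.
	 */
	assert(clusters_in_range(12, 8, 4) == 2);
	return 0;
}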

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents_status.c | 40 ++++++++++++++++++++++++++++------------
 1 file changed, 28 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 67ac09930541..3a004ed04570 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -141,13 +141,18 @@
  *   -- Extent-level locking
  */
 
+struct rsvd_info {
+	int ndelonly_clu;	/* reserved clusters for delalloc es entry */
+	int ndelonly_blk;	/* reserved blocks for delalloc es entry */
+};
+
 static struct kmem_cache *ext4_es_cachep;
 static struct kmem_cache *ext4_pending_cachep;
 
 static int __es_insert_extent(struct inode *inode, struct extent_status *newes,
 			      struct extent_status *prealloc);
 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-			      ext4_lblk_t end, int *reserved,
+			      ext4_lblk_t end, struct rsvd_info *rinfo,
 			      struct extent_status *prealloc);
 static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan);
 static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
@@ -1050,6 +1055,7 @@ int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
 
 struct rsvd_count {
 	int ndelonly;
+	int ndelonly_blk;
 	bool first_do_lblk_found;
 	ext4_lblk_t first_do_lblk;
 	ext4_lblk_t last_do_lblk;
@@ -1076,6 +1082,7 @@ static void init_rsvd(struct inode *inode, ext4_lblk_t lblk,
 	struct rb_node *node;
 
 	rc->ndelonly = 0;
+	rc->ndelonly_blk = 0;
 
 	/*
 	 * for bigalloc, note the first delonly block in the range has not
@@ -1124,10 +1131,12 @@ static void count_rsvd(struct inode *inode, ext4_lblk_t lblk, long len,
 
 	if (sbi->s_cluster_ratio == 1) {
 		rc->ndelonly += (int) len;
+		rc->ndelonly_blk = rc->ndelonly;
 		return;
 	}
 
 	/* bigalloc */
+	rc->ndelonly_blk += (int)len;
 
 	i = (lblk < es->es_lblk) ? es->es_lblk : lblk;
 	end = lblk + (ext4_lblk_t) len - 1;
@@ -1355,16 +1364,17 @@ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end,
  * @inode - file containing range
  * @lblk - first block in range
  * @end - last block in range
- * @reserved - number of cluster reservations released
+ * @rinfo - reserved information collected, includes number of
+ *          block/cluster reservations released
  * @prealloc - pre-allocated es to avoid memory allocation failures
  *
- * If @reserved is not NULL and delayed allocation is enabled, counts
+ * If @rinfo is not NULL and delayed allocation is enabled, counts
  * block/cluster reservations freed by removing range and if bigalloc
  * enabled cancels pending reservations as needed. Returns 0 on success,
  * error code on failure.
  */
 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-			      ext4_lblk_t end, int *reserved,
+			      ext4_lblk_t end, struct rsvd_info *rinfo,
 			      struct extent_status *prealloc)
 {
 	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
@@ -1374,11 +1384,15 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	ext4_lblk_t len1, len2;
 	ext4_fsblk_t block;
 	int err = 0;
-	bool count_reserved = true;
+	bool count_reserved = false;
 	struct rsvd_count rc;
 
-	if (reserved == NULL || !test_opt(inode->i_sb, DELALLOC))
-		count_reserved = false;
+	if (rinfo) {
+		rinfo->ndelonly_clu = 0;
+		rinfo->ndelonly_blk = 0;
+		if (test_opt(inode->i_sb, DELALLOC))
+			count_reserved = true;
+	}
 
 	es = __es_tree_search(&tree->root, lblk);
 	if (!es)
@@ -1476,8 +1490,10 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	}
 
 out_get_reserved:
-	if (count_reserved)
-		*reserved = get_rsvd(inode, end, es, &rc);
+	if (count_reserved) {
+		rinfo->ndelonly_clu = get_rsvd(inode, end, es, &rc);
+		rinfo->ndelonly_blk = rc.ndelonly_blk;
+	}
 out:
 	return err;
 }
@@ -1496,8 +1512,8 @@ void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 			   ext4_lblk_t len)
 {
 	ext4_lblk_t end;
+	struct rsvd_info rinfo;
 	int err = 0;
-	int reserved = 0;
 	struct extent_status *es = NULL;
 
 	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
@@ -1522,7 +1538,7 @@ void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	 * is reclaimed.
 	 */
 	write_lock(&EXT4_I(inode)->i_es_lock);
-	err = __es_remove_extent(inode, lblk, end, &reserved, es);
+	err = __es_remove_extent(inode, lblk, end, &rinfo, es);
 	/* Free preallocated extent if it didn't get used. */
 	if (es) {
 		if (!es->es_len)
@@ -1534,7 +1550,7 @@ void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 		goto retry;
 
 	ext4_es_print_tree(inode);
-	ext4_da_release_space(inode, reserved);
+	ext4_da_release_space(inode, rinfo.ndelonly_clu);
 	return;
 }
 
-- 
2.39.2



* [RFC PATCH 05/16] ext4: pass real delayed status into ext4_es_insert_extent()
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
                   ` (3 preceding siblings ...)
  2023-08-24  9:26 ` [RFC PATCH 04/16] ext4: count removed reserved blocks for delalloc only es entry Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 06/16] ext4: move delalloc data reserve space updating " Zhang Yi
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Commit 'd2dc317d564a ("ext4: fix data corruption caused by unwritten
and delayed extents")' fixed a data corruption issue by no longer
passing the delayed status into ext4_es_insert_extent() if the mapping
range has been written. This patch changes it to still pass the real
delayed status and to handle the 'delayed && written' case inside
ext4_es_insert_extent(). If the delayed bit is set in the status, it
means that the delayed allocation path is still running and that this
insert is not allocating the delayed allocated blocks.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents_status.c | 13 +++++++------
 fs/ext4/inode.c          |  2 --
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 3a004ed04570..62191c772b82 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -873,13 +873,14 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 
 	BUG_ON(end < lblk);
 
+	/*
+	 * Inserting an extent as both delayed and written can potentially
+	 * cause data loss. Since the extent has been written, it's safe to
+	 * remove the delayed flag even if it's still delayed.
+	 */
 	if ((status & EXTENT_STATUS_DELAYED) &&
-	    (status & EXTENT_STATUS_WRITTEN)) {
-		ext4_warning(inode->i_sb, "Inserting extent [%u/%u] as "
-				" delayed and written which can potentially "
-				" cause data loss.", lblk, len);
-		WARN_ON(1);
-	}
+	    (status & EXTENT_STATUS_WRITTEN))
+		status &= ~EXTENT_STATUS_DELAYED;
 
 	newes.es_lblk = lblk;
 	newes.es_len = len;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6c490f05e2ba..82115d6656d3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -563,7 +563,6 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
 				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
 		if (!(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) &&
-		    !(status & EXTENT_STATUS_WRITTEN) &&
 		    ext4_es_scan_range(inode, &ext4_es_is_delayed, map->m_lblk,
 				       map->m_lblk + map->m_len - 1))
 			status |= EXTENT_STATUS_DELAYED;
@@ -673,7 +672,6 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
 				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
 		if (!(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) &&
-		    !(status & EXTENT_STATUS_WRITTEN) &&
 		    ext4_es_scan_range(inode, &ext4_es_is_delayed, map->m_lblk,
 				       map->m_lblk + map->m_len - 1))
 			status |= EXTENT_STATUS_DELAYED;
-- 
2.39.2



* [RFC PATCH 06/16] ext4: move delalloc data reserve space updating into ext4_es_insert_extent()
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
                   ` (4 preceding siblings ...)
  2023-08-24  9:26 ` [RFC PATCH 05/16] ext4: pass real delayed status into ext4_es_insert_extent() Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 07/16] ext4: count inode's total delalloc data blocks into ext4_es_tree Zhang Yi
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

We update the reserved data space for delalloc after allocating new
blocks in ext4_{ind|ext}_map_blocks(). If the bigalloc feature is
enabled, we also need to query the extents_status tree to calculate the
exact number of reserved clusters. If we move this into
ext4_es_insert_extent(), just after dropping the delalloc
extents_status entry, it becomes simpler because __es_remove_extent()
has already done most of the work, and we can remove
ext4_es_delayed_clu() entirely.

One important thing to take care of is that if bigalloc is enabled, we
should update the reserved data count when first converting some of the
delayed-only es entries of a cluster that has many other delayed-only
entries left over.

  |                   one cluster                        |
  --------------------------------------------------------
  | da es 0 | .. | da es 1 | .. | da es 2 | .. | da es 3 |
  --------------------------------------------------------
  ^         ^
  |         | <- first allocating this delayed extent

Later allocations in that cluster will not be counted again. We can do
this by counting the newly inserted pending clusters.

Another important thing is the quota claiming and the i_blocks count.
If the delayed allocation has been raced by another non-delayed
allocation (from fallocate, filemap, DIO...), we cannot claim quota as
usual because the racer has already done it. We can distinguish this
case easily by checking EXTENT_STATUS_DELAYED and the reserved-only
block count returned by __es_remove_extent(). If EXTENT_STATUS_DELAYED
is set, it always means that the allocation is not from the delayed
allocation path. But the opposite conclusion only holds if bigalloc is
not enabled. If bigalloc is enabled, the write could be raced by
another fallocate that is writing to other non-delayed areas of the
same cluster. In this case, EXTENT_STATUS_DELAYED is not set, but we
still must not claim quota again.

  |             one cluster                 |
  -------------------------------------------
  |                            | delayed es |
  -------------------------------------------
  ^           ^
  | fallocate |

So we also need to check the counted reserved-only blocks: if it is
zero, it means that the allocation is not from the delayed allocation
path, and we should release the reserved quota instead of claiming it.
The decision is summarized in the sketch below.
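
The resulting decision, extracted as a standalone predicate (a sketch
mirroring the ext4_da_update_reserve_space() call added below, with
hypothetical naming):

#include <assert.h>
#include <stdbool.h>

/*
 * Claim quota only when the inserted extent does not carry
 * EXTENT_STATUS_DELAYED (i.e. we are the delayed allocation writeback
 * path itself) and the removed range actually contained reserved,
 * delayed-only blocks. Otherwise the reservation is released because
 * the racer has already claimed quota.
 */
static bool quota_claim(bool status_delayed, int ndelonly_blk)
{
	return !status_delayed && ndelonly_blk > 0;
}

int main(void)
{
	/* writeback converting reserved delalloc blocks: claim */
	assert(quota_claim(false, 4));
	/*
	 * fallocate hitting non-delayed areas of a shared cluster:
	 * no reserved-only blocks removed, so release instead.
	 */
	assert(!quota_claim(false, 0));
	/* insert still carrying the delayed bit: not the allocator */
	assert(!quota_claim(true, 3));
	return 0;
}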

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c        |  37 -------------
 fs/ext4/extents_status.c | 115 +++++++++------------------------------
 fs/ext4/extents_status.h |   2 -
 fs/ext4/indirect.c       |   7 ---
 fs/ext4/inode.c          |   5 +-
 5 files changed, 30 insertions(+), 136 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e4115d338f10..592383effe80 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4323,43 +4323,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		goto out;
 	}
 
-	/*
-	 * Reduce the reserved cluster count to reflect successful deferred
-	 * allocation of delayed allocated clusters or direct allocation of
-	 * clusters discovered to be delayed allocated.  Once allocated, a
-	 * cluster is not included in the reserved count.
-	 */
-	if (test_opt(inode->i_sb, DELALLOC) && allocated_clusters) {
-		if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
-			/*
-			 * When allocating delayed allocated clusters, simply
-			 * reduce the reserved cluster count and claim quota
-			 */
-			ext4_da_update_reserve_space(inode, allocated_clusters,
-							1);
-		} else {
-			ext4_lblk_t lblk, len;
-			unsigned int n;
-
-			/*
-			 * When allocating non-delayed allocated clusters
-			 * (from fallocate, filemap, DIO, or clusters
-			 * allocated when delalloc has been disabled by
-			 * ext4_nonda_switch), reduce the reserved cluster
-			 * count by the number of allocated clusters that
-			 * have previously been delayed allocated.  Quota
-			 * has been claimed by ext4_mb_new_blocks() above,
-			 * so release the quota reservations made for any
-			 * previously delayed allocated clusters.
-			 */
-			lblk = EXT4_LBLK_CMASK(sbi, map->m_lblk);
-			len = allocated_clusters << sbi->s_cluster_bits;
-			n = ext4_es_delayed_clu(inode, lblk, len);
-			if (n > 0)
-				ext4_da_update_reserve_space(inode, (int) n, 0);
-		}
-	}
-
 	/*
 	 * Cache the extent and update transaction to commit on fdatasync only
 	 * when it is _not_ an unwritten extent.
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 62191c772b82..34164c2827f2 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -856,11 +856,14 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
 	int err1 = 0, err2 = 0, err3 = 0;
+	struct rsvd_info rinfo;
+	int pending = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct extent_status *es1 = NULL;
 	struct extent_status *es2 = NULL;
 	struct pending_reservation *pr = NULL;
 	bool revise_pending = false;
+	bool delayed = false;
 
 	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
 		return;
@@ -878,6 +881,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	 * cause data loss. Since the extent has been written, it's safe to
 	 * remove the delayed flag even if it's still delayed.
 	 */
+	delayed = status & EXTENT_STATUS_DELAYED;
 	if ((status & EXTENT_STATUS_DELAYED) &&
 	    (status & EXTENT_STATUS_WRITTEN))
 		status &= ~EXTENT_STATUS_DELAYED;
@@ -902,7 +906,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 		pr = __alloc_pending(true);
 	write_lock(&EXT4_I(inode)->i_es_lock);
 
-	err1 = __es_remove_extent(inode, lblk, end, NULL, es1);
+	err1 = __es_remove_extent(inode, lblk, end, &rinfo, es1);
 	if (err1 != 0)
 		goto error;
 	/* Free preallocated extent if it didn't get used. */
@@ -932,9 +936,30 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 			__free_pending(pr);
 			pr = NULL;
 		}
+		/*
+		 * When first allocating part of the delayed extents in
+		 * one cluster, we also need to count the data cluster
+		 * when allocating delayed-only extent entries.
+		 */
+		pending = err3;
 	}
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
+	/*
+	 * If EXTENT_STATUS_DELAYED is not set and the delayed-only block
+	 * count is not zero, we are allocating delayed allocated clusters;
+	 * simply reduce the reserved cluster count and claim quota.
+	 *
+	 * Otherwise, we aren't allocating delayed allocated clusters
+	 * (from fallocate, filemap, DIO, or clusters allocated when
+	 * delalloc has been disabled by ext4_nonda_switch()), reduce the
+	 * reserved cluster count by the number of allocated clusters that
+	 * have previously been delayed allocated. Quota has been claimed
+	 * by ext4_mb_new_blocks(), so release the quota reservations made
+	 * for any previously delayed allocated clusters.
+	 */
+	ext4_da_update_reserve_space(inode, rinfo.ndelonly_clu + pending,
+				     !delayed && rinfo.ndelonly_blk);
 	if (err1 || err2 || err3 < 0)
 		goto retry;
 
@@ -2146,94 +2171,6 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 	return;
 }
 
-/*
- * __es_delayed_clu - count number of clusters containing blocks that
- *                    are delayed only
- *
- * @inode - file containing block range
- * @start - logical block defining start of range
- * @end - logical block defining end of range
- *
- * Returns the number of clusters containing only delayed (not delayed
- * and unwritten) blocks in the range specified by @start and @end.  Any
- * cluster or part of a cluster within the range and containing a delayed
- * and not unwritten block within the range is counted as a whole cluster.
- */
-static unsigned int __es_delayed_clu(struct inode *inode, ext4_lblk_t start,
-				     ext4_lblk_t end)
-{
-	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
-	struct extent_status *es;
-	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-	struct rb_node *node;
-	ext4_lblk_t first_lclu, last_lclu;
-	unsigned long long last_counted_lclu;
-	unsigned int n = 0;
-
-	/* guaranteed to be unequal to any ext4_lblk_t value */
-	last_counted_lclu = ~0ULL;
-
-	es = __es_tree_search(&tree->root, start);
-
-	while (es && (es->es_lblk <= end)) {
-		if (ext4_es_is_delonly(es)) {
-			if (es->es_lblk <= start)
-				first_lclu = EXT4_B2C(sbi, start);
-			else
-				first_lclu = EXT4_B2C(sbi, es->es_lblk);
-
-			if (ext4_es_end(es) >= end)
-				last_lclu = EXT4_B2C(sbi, end);
-			else
-				last_lclu = EXT4_B2C(sbi, ext4_es_end(es));
-
-			if (first_lclu == last_counted_lclu)
-				n += last_lclu - first_lclu;
-			else
-				n += last_lclu - first_lclu + 1;
-			last_counted_lclu = last_lclu;
-		}
-		node = rb_next(&es->rb_node);
-		if (!node)
-			break;
-		es = rb_entry(node, struct extent_status, rb_node);
-	}
-
-	return n;
-}
-
-/*
- * ext4_es_delayed_clu - count number of clusters containing blocks that
- *                       are both delayed and unwritten
- *
- * @inode - file containing block range
- * @lblk - logical block defining start of range
- * @len - number of blocks in range
- *
- * Locking for external use of __es_delayed_clu().
- */
-unsigned int ext4_es_delayed_clu(struct inode *inode, ext4_lblk_t lblk,
-				 ext4_lblk_t len)
-{
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	ext4_lblk_t end;
-	unsigned int n;
-
-	if (len == 0)
-		return 0;
-
-	end = lblk + len - 1;
-	WARN_ON(end < lblk);
-
-	read_lock(&ei->i_es_lock);
-
-	n = __es_delayed_clu(inode, lblk, end);
-
-	read_unlock(&ei->i_es_lock);
-
-	return n;
-}
-
 /*
  * __revise_pending - makes, cancels, or leaves unchanged pending cluster
  *                    reservations for a specified block range depending
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index d9847a4a25db..7344667eb2cd 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -251,8 +251,6 @@ extern void ext4_remove_pending(struct inode *inode, ext4_lblk_t lblk);
 extern bool ext4_is_pending(struct inode *inode, ext4_lblk_t lblk);
 extern void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 					 bool allocated);
-extern unsigned int ext4_es_delayed_clu(struct inode *inode, ext4_lblk_t lblk,
-					ext4_lblk_t len);
 extern void ext4_clear_inode_es(struct inode *inode);
 
 #endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index a9f3716119d3..448401e02c55 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -652,13 +652,6 @@ int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
 	ext4_update_inode_fsync_trans(handle, inode, 1);
 	count = ar.len;
 
-	/*
-	 * Update reserved blocks/metadata blocks after successful block
-	 * allocation which had been deferred till now.
-	 */
-	if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
-		ext4_da_update_reserve_space(inode, count, 1);
-
 got_it:
 	map->m_flags |= EXT4_MAP_MAPPED;
 	map->m_pblk = le32_to_cpu(chain[depth-1].key);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 82115d6656d3..546a3b09fd0a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -330,11 +330,14 @@ qsize_t *ext4_get_reserved_space(struct inode *inode)
  * ext4_discard_preallocations() from here.
  */
 void ext4_da_update_reserve_space(struct inode *inode,
-					int used, int quota_claim)
+				  int used, int quota_claim)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct ext4_inode_info *ei = EXT4_I(inode);
 
+	if (!used)
+		return;
+
 	spin_lock(&ei->i_block_reservation_lock);
 	trace_ext4_da_update_reserve_space(inode, used, quota_claim);
 	if (unlikely(used > ei->i_reserved_data_blocks)) {
-- 
2.39.2



* [RFC PATCH 07/16] ext4: count inode's total delalloc data blocks into ext4_es_tree
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
                   ` (5 preceding siblings ...)
  2023-08-24  9:26 ` [RFC PATCH 06/16] ext4: move delalloc data reserve space updating " Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 08/16] ext4: refactor delalloc space reservation Zhang Yi
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Add a field to struct ext4_es_tree to count the inode's total number of
delalloc data blocks. It will be used to calculate the reserved meta
blocks when creating a new delalloc extent entry, mapping a delalloc
entry to a real one, or releasing a delalloc entry.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents_status.c | 19 +++++++++++++++++++
 fs/ext4/extents_status.h |  1 +
 2 files changed, 20 insertions(+)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 34164c2827f2..b098c3316189 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -178,6 +178,7 @@ void ext4_es_init_tree(struct ext4_es_tree *tree)
 {
 	tree->root = RB_ROOT;
 	tree->cache_es = NULL;
+	tree->da_es_len = 0;
 }
 
 #ifdef ES_DEBUG__
@@ -787,6 +788,20 @@ static inline void ext4_es_insert_extent_check(struct inode *inode,
 }
 #endif
 
+/*
+ * Update total delay allocated extent length.
+ */
+static inline void ext4_es_update_da_block(struct inode *inode, long es_len)
+{
+	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
+
+	if (!es_len)
+		return;
+
+	tree->da_es_len += es_len;
+	es_debug("update da blocks %ld, to %u\n", es_len, tree->da_es_len);
+}
+
 static int __es_insert_extent(struct inode *inode, struct extent_status *newes,
 			      struct extent_status *prealloc)
 {
@@ -915,6 +930,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 			__es_free_extent(es1);
 		es1 = NULL;
 	}
+	ext4_es_update_da_block(inode, -rinfo.ndelonly_blk);
 
 	err2 = __es_insert_extent(inode, &newes, es2);
 	if (err2 == -ENOMEM && !ext4_es_must_keep(&newes))
@@ -1571,6 +1587,7 @@ void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 			__es_free_extent(es);
 		es = NULL;
 	}
+	ext4_es_update_da_block(inode, -rinfo.ndelonly_blk);
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 	if (err)
 		goto retry;
@@ -2161,6 +2178,8 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 			pr = NULL;
 		}
 	}
+
+	ext4_es_update_da_block(inode, newes.es_len);
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 	if (err1 || err2 || err3 < 0)
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 7344667eb2cd..ee873b305210 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -66,6 +66,7 @@ struct extent_status {
 struct ext4_es_tree {
 	struct rb_root root;
 	struct extent_status *cache_es;	/* recently accessed extent */
+	ext4_lblk_t da_es_len;	/* total delalloc len */
 };
 
 struct ext4_es_stats {
-- 
2.39.2



* [RFC PATCH 08/16] ext4: refactor delalloc space reservation
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
                   ` (6 preceding siblings ...)
  2023-08-24  9:26 ` [RFC PATCH 07/16] ext4: count inode's total delalloc data blocks into ext4_es_tree Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 09/16] ext4: count reserved metadata blocks for delalloc per inode Zhang Yi
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Clean up the delalloc space reservation calls by splitting them from
the bigalloc checks: call ext4_da_reserve_space() only if there is an
unmapped block that needs to be reserved. No logical changes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 546a3b09fd0a..861602903b4d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1623,8 +1623,9 @@ static void ext4_print_free_blocks(struct inode *inode)
 static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-	int ret;
+	unsigned int rsv_dlen = 1;
 	bool allocated = false;
+	int ret;
 
 	/*
 	 * If the cluster containing lblk is shared with a delayed,
@@ -1637,11 +1638,8 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
 	 * it's necessary to examine the extent tree if a search of the
 	 * extents status tree doesn't get a match.
 	 */
-	if (sbi->s_cluster_ratio == 1) {
-		ret = ext4_da_reserve_space(inode);
-		if (ret != 0)   /* ENOSPC */
-			return ret;
-	} else {   /* bigalloc */
+	if (sbi->s_cluster_ratio > 1) {
+		rsv_dlen = 0;
 		if (!ext4_es_scan_clu(inode, &ext4_es_is_delonly, lblk)) {
 			if (!ext4_es_scan_clu(inode,
 					      &ext4_es_is_mapped, lblk)) {
@@ -1649,19 +1647,22 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
 						      EXT4_B2C(sbi, lblk));
 				if (ret < 0)
 					return ret;
-				if (ret == 0) {
-					ret = ext4_da_reserve_space(inode);
-					if (ret != 0)   /* ENOSPC */
-						return ret;
-				} else {
+				if (ret == 0)
+					rsv_dlen = 1;
+				else
 					allocated = true;
-				}
 			} else {
 				allocated = true;
 			}
 		}
 	}
 
+	if (rsv_dlen > 0) {
+		ret = ext4_da_reserve_space(inode);
+		if (ret)   /* ENOSPC */
+			return ret;
+	}
+
 	ext4_es_insert_delayed_block(inode, lblk, allocated);
 	return 0;
 }
-- 
2.39.2



* [RFC PATCH 09/16] ext4: count reserved metadata blocks for delalloc per inode
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
                   ` (7 preceding siblings ...)
  2023-08-24  9:26 ` [RFC PATCH 08/16] ext4: refactor delalloc space reservation Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 10/16] ext4: reserve meta blocks in ext4_da_reserve_space() Zhang Yi
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Add a new field, ei->i_reserved_ext_blocks, to prepare for reserving
metadata blocks for delalloc. This field will be used to count the
inode's total reserved metadata blocks, and it should always be zero
when the inode is dying. Also update the corresponding tracepoints and
the debug interface.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h              |  1 +
 fs/ext4/inode.c             |  2 ++
 fs/ext4/super.c             | 10 +++++++---
 include/trace/events/ext4.h | 25 +++++++++++++++++--------
 4 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 84618c46f239..ee2dbbde176e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1104,6 +1104,7 @@ struct ext4_inode_info {
 	/* allocation reservation info for delalloc */
 	/* In case of bigalloc, this refer to clusters rather than blocks */
 	unsigned int i_reserved_data_blocks;
+	unsigned int i_reserved_ext_blocks;
 
 	/* pending cluster reservations for bigalloc file systems */
 	struct ext4_pending_tree i_pending_tree;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 861602903b4d..dda17b3340ce 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1606,6 +1606,8 @@ static void ext4_print_free_blocks(struct inode *inode)
 	ext4_msg(sb, KERN_CRIT, "Block reservation details");
 	ext4_msg(sb, KERN_CRIT, "i_reserved_data_blocks=%u",
 		 ei->i_reserved_data_blocks);
+	ext4_msg(sb, KERN_CRIT, "i_reserved_ext_blocks=%u",
+		 ei->i_reserved_ext_blocks);
 	return;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index bb42525de8d0..7bc7c8c0ed71 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1436,6 +1436,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ei->i_es_shk_nr = 0;
 	ei->i_es_shrink_lblk = 0;
 	ei->i_reserved_data_blocks = 0;
+	ei->i_reserved_ext_blocks = 0;
 	spin_lock_init(&(ei->i_block_reservation_lock));
 	ext4_init_pending_tree(&ei->i_pending_tree);
 #ifdef CONFIG_QUOTA
@@ -1487,11 +1488,14 @@ static void ext4_destroy_inode(struct inode *inode)
 		dump_stack();
 	}
 
-	if (EXT4_I(inode)->i_reserved_data_blocks)
+	if (EXT4_I(inode)->i_reserved_data_blocks ||
+	    EXT4_I(inode)->i_reserved_ext_blocks)
 		ext4_msg(inode->i_sb, KERN_ERR,
-			 "Inode %lu (%p): i_reserved_data_blocks (%u) not cleared!",
+			 "Inode %lu (%p): i_reserved_data_blocks (%u) or "
+			 "i_reserved_ext_blocks (%u) not cleared!",
 			 inode->i_ino, EXT4_I(inode),
-			 EXT4_I(inode)->i_reserved_data_blocks);
+			 EXT4_I(inode)->i_reserved_data_blocks,
+			 EXT4_I(inode)->i_reserved_ext_blocks);
 }
 
 static void ext4_shutdown(struct super_block *sb)
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 65029dfb92fb..115f96f444ff 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -1224,6 +1224,7 @@ TRACE_EVENT(ext4_da_update_reserve_space,
 		__field(	__u64,	i_blocks		)
 		__field(	int,	used_blocks		)
 		__field(	int,	reserved_data_blocks	)
+		__field(	int,	reserved_ext_blocks	)
 		__field(	int,	quota_claim		)
 		__field(	__u16,	mode			)
 	),
@@ -1233,18 +1234,19 @@ TRACE_EVENT(ext4_da_update_reserve_space,
 		__entry->ino	= inode->i_ino;
 		__entry->i_blocks = inode->i_blocks;
 		__entry->used_blocks = used_blocks;
-		__entry->reserved_data_blocks =
-				EXT4_I(inode)->i_reserved_data_blocks;
+		__entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
+		__entry->reserved_ext_blocks = EXT4_I(inode)->i_reserved_ext_blocks;
 		__entry->quota_claim = quota_claim;
 		__entry->mode	= inode->i_mode;
 	),
 
 	TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu used_blocks %d "
-		  "reserved_data_blocks %d quota_claim %d",
+		  "reserved_data_blocks %d reserved_ext_blocks %d quota_claim %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  __entry->mode, __entry->i_blocks,
-		  __entry->used_blocks, __entry->reserved_data_blocks,
+		  __entry->used_blocks,
+		  __entry->reserved_data_blocks, __entry->reserved_ext_blocks,
 		  __entry->quota_claim)
 );
 
@@ -1258,6 +1260,7 @@ TRACE_EVENT(ext4_da_reserve_space,
 		__field(	ino_t,	ino			)
 		__field(	__u64,	i_blocks		)
 		__field(	int,	reserved_data_blocks	)
+		__field(	int,	reserved_ext_blocks	)
 		__field(	__u16,  mode			)
 	),
 
@@ -1266,15 +1269,17 @@ TRACE_EVENT(ext4_da_reserve_space,
 		__entry->ino	= inode->i_ino;
 		__entry->i_blocks = inode->i_blocks;
 		__entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
+		__entry->reserved_ext_blocks = EXT4_I(inode)->i_reserved_ext_blocks;
 		__entry->mode	= inode->i_mode;
 	),
 
 	TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu "
-		  "reserved_data_blocks %d",
+		  "reserved_data_blocks %d reserved_ext_blocks %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  __entry->mode, __entry->i_blocks,
-		  __entry->reserved_data_blocks)
+		  __entry->reserved_data_blocks,
+		  __entry->reserved_ext_blocks)
 );
 
 TRACE_EVENT(ext4_da_release_space,
@@ -1288,6 +1293,7 @@ TRACE_EVENT(ext4_da_release_space,
 		__field(	__u64,	i_blocks		)
 		__field(	int,	freed_blocks		)
 		__field(	int,	reserved_data_blocks	)
+		__field(	int,	reserved_ext_blocks	)
 		__field(	__u16,  mode			)
 	),
 
@@ -1297,15 +1303,18 @@ TRACE_EVENT(ext4_da_release_space,
 		__entry->i_blocks = inode->i_blocks;
 		__entry->freed_blocks = freed_blocks;
 		__entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
+		__entry->reserved_ext_blocks = EXT4_I(inode)->i_reserved_ext_blocks;
 		__entry->mode	= inode->i_mode;
 	),
 
 	TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu freed_blocks %d "
-		  "reserved_data_blocks %d",
+		  "reserved_data_blocks %d reserved_ext_blocks %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  __entry->mode, __entry->i_blocks,
-		  __entry->freed_blocks, __entry->reserved_data_blocks)
+		  __entry->freed_blocks,
+		  __entry->reserved_data_blocks,
+		  __entry->reserved_ext_blocks)
 );
 
 DECLARE_EVENT_CLASS(ext4__bitmap_load,
-- 
2.39.2



* [RFC PATCH 10/16] ext4: reserve meta blocks in ext4_da_reserve_space()
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
                   ` (8 preceding siblings ...)
  2023-08-24  9:26 ` [RFC PATCH 09/16] ext4: count reserved metadata blocks for delalloc per inode Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 11/16] ext4: factor out common part of ext4_da_{release|update_reserve}_space() Zhang Yi
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Prepare to reserve metadata blocks for delayed allocation in
ext4_da_reserve_space(), claim the reserved space from the global
sbi->s_dirtyclusters_counter, and also update the tracepoints to show
the new reserved metadata blocks. This patch is just a preparation:
the reserved rsv_extlen is always zero.
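
As a minimal usage sketch (not taken from the diff below), the existing
single-cluster reservation becomes one call with the widened signature;
the metadata argument stays zero until a later patch in this series:

	/* reserve one data cluster and, for now, zero extent blocks */
	ret = ext4_da_reserve_space(inode, 1, 0);
	if (ret)	/* ENOSPC */
		return ret;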

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c             | 28 ++++++++++++++++------------
 include/trace/events/ext4.h | 10 ++++++++--
 2 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dda17b3340ce..a189009d20fa 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1439,31 +1439,37 @@ static int ext4_journalled_write_end(struct file *file,
 }
 
 /*
- * Reserve space for a single cluster
+ * Reserve space for 'rsv_dlen' data blocks/clusters and 'rsv_extlen'
+ * extent metadata blocks.
  */
-static int ext4_da_reserve_space(struct inode *inode)
+static int ext4_da_reserve_space(struct inode *inode, unsigned int rsv_dlen,
+				 unsigned int rsv_extlen)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	int ret;
 
+	if (!rsv_dlen && !rsv_extlen)
+		return 0;
+
 	/*
 	 * We will charge metadata quota at writeout time; this saves
 	 * us from metadata over-estimation, though we may go over by
 	 * a small amount in the end.  Here we just reserve for data.
 	 */
-	ret = dquot_reserve_block(inode, EXT4_C2B(sbi, 1));
+	ret = dquot_reserve_block(inode, EXT4_C2B(sbi, rsv_dlen));
 	if (ret)
 		return ret;
 
 	spin_lock(&ei->i_block_reservation_lock);
-	if (ext4_claim_free_clusters(sbi, 1, 0)) {
+	if (ext4_claim_free_clusters(sbi, rsv_dlen + rsv_extlen, 0)) {
 		spin_unlock(&ei->i_block_reservation_lock);
-		dquot_release_reservation_block(inode, EXT4_C2B(sbi, 1));
+		dquot_release_reservation_block(inode, EXT4_C2B(sbi, rsv_dlen));
 		return -ENOSPC;
 	}
-	ei->i_reserved_data_blocks++;
-	trace_ext4_da_reserve_space(inode);
+	ei->i_reserved_data_blocks += rsv_dlen;
+	ei->i_reserved_ext_blocks += rsv_extlen;
+	trace_ext4_da_reserve_space(inode, rsv_dlen, rsv_extlen);
 	spin_unlock(&ei->i_block_reservation_lock);
 
 	return 0;       /* success */
@@ -1659,11 +1665,9 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
 		}
 	}
 
-	if (rsv_dlen > 0) {
-		ret = ext4_da_reserve_space(inode);
-		if (ret)   /* ENOSPC */
-			return ret;
-	}
+	ret = ext4_da_reserve_space(inode, rsv_dlen, 0);
+	if (ret)   /* ENOSPC */
+		return ret;
 
 	ext4_es_insert_delayed_block(inode, lblk, allocated);
 	return 0;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 115f96f444ff..7a9839f2d681 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -1251,14 +1251,16 @@ TRACE_EVENT(ext4_da_update_reserve_space,
 );
 
 TRACE_EVENT(ext4_da_reserve_space,
-	TP_PROTO(struct inode *inode),
+	TP_PROTO(struct inode *inode, int data_blocks, int meta_blocks),
 
-	TP_ARGS(inode),
+	TP_ARGS(inode, data_blocks, meta_blocks),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
 		__field(	ino_t,	ino			)
 		__field(	__u64,	i_blocks		)
+		__field(	int,	data_blocks		)
+		__field(	int,	meta_blocks		)
 		__field(	int,	reserved_data_blocks	)
 		__field(	int,	reserved_ext_blocks	)
 		__field(	__u16,  mode			)
@@ -1268,16 +1270,20 @@ TRACE_EVENT(ext4_da_reserve_space,
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
 		__entry->i_blocks = inode->i_blocks;
+		__entry->data_blocks = data_blocks;
+		__entry->meta_blocks = meta_blocks;
 		__entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
 		__entry->reserved_ext_blocks = EXT4_I(inode)->i_reserved_ext_blocks;
 		__entry->mode	= inode->i_mode;
 	),
 
 	TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu "
+		  "data_blocks %d meta_blocks %d "
 		  "reserved_data_blocks %d reserved_ext_blocks %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  __entry->mode, __entry->i_blocks,
+		  __entry->data_blocks, __entry->meta_blocks,
 		  __entry->reserved_data_blocks,
 		  __entry->reserved_ext_blocks)
 );
-- 
2.39.2



* [RFC PATCH 11/16] ext4: factor out common part of ext4_da_{release|update_reserve}_space()
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
                   ` (9 preceding siblings ...)
  2023-08-24  9:26 ` [RFC PATCH 10/16] ext4: reserve meta blocks in ext4_da_reserve_space() Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 12/16] ext4: update reserved meta blocks in ext4_da_{release|update_reserve}_space() Zhang Yi
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

The reserved-blocks updating parts of ext4_da_release_space() and
ext4_da_update_reserve_space() are almost the same, so factor them out
into a common helper.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 60 +++++++++++++++++++++----------------------------
 1 file changed, 25 insertions(+), 35 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a189009d20fa..13036cecbcc0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -325,6 +325,27 @@ qsize_t *ext4_get_reserved_space(struct inode *inode)
 }
 #endif
 
+static void __ext4_da_update_reserve_space(const char *where,
+					   struct inode *inode,
+					   int data_len)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	if (unlikely(data_len > ei->i_reserved_data_blocks)) {
+		ext4_warning(inode->i_sb, "%s: ino %lu, clear %d "
+			     "with only %d reserved data blocks",
+			     where, inode->i_ino, data_len,
+			     ei->i_reserved_data_blocks);
+		WARN_ON(1);
+		data_len = ei->i_reserved_data_blocks;
+	}
+
+	/* Update per-inode reservations */
+	ei->i_reserved_data_blocks -= data_len;
+	percpu_counter_sub(&sbi->s_dirtyclusters_counter, data_len);
+}
+
 /*
  * Called with i_data_sem down, which is important since we can call
  * ext4_discard_preallocations() from here.
@@ -340,19 +361,7 @@ void ext4_da_update_reserve_space(struct inode *inode,
 
 	spin_lock(&ei->i_block_reservation_lock);
 	trace_ext4_da_update_reserve_space(inode, used, quota_claim);
-	if (unlikely(used > ei->i_reserved_data_blocks)) {
-		ext4_warning(inode->i_sb, "%s: ino %lu, used %d "
-			 "with only %d reserved data blocks",
-			 __func__, inode->i_ino, used,
-			 ei->i_reserved_data_blocks);
-		WARN_ON(1);
-		used = ei->i_reserved_data_blocks;
-	}
-
-	/* Update per-inode reservations */
-	ei->i_reserved_data_blocks -= used;
-	percpu_counter_sub(&sbi->s_dirtyclusters_counter, used);
-
+	__ext4_da_update_reserve_space(__func__, inode, used);
 	spin_unlock(&ei->i_block_reservation_lock);
 
 	/* Update quota subsystem for data blocks */
@@ -1483,29 +1492,10 @@ void ext4_da_release_space(struct inode *inode, int to_free)
 	if (!to_free)
 		return;		/* Nothing to release, exit */
 
-	spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
-
+	spin_lock(&ei->i_block_reservation_lock);
 	trace_ext4_da_release_space(inode, to_free);
-	if (unlikely(to_free > ei->i_reserved_data_blocks)) {
-		/*
-		 * if there aren't enough reserved blocks, then the
-		 * counter is messed up somewhere.  Since this
-		 * function is called from invalidate page, it's
-		 * harmless to return without any action.
-		 */
-		ext4_warning(inode->i_sb, "ext4_da_release_space: "
-			 "ino %lu, to_free %d with only %d reserved "
-			 "data blocks", inode->i_ino, to_free,
-			 ei->i_reserved_data_blocks);
-		WARN_ON(1);
-		to_free = ei->i_reserved_data_blocks;
-	}
-	ei->i_reserved_data_blocks -= to_free;
-
-	/* update fs dirty data blocks counter */
-	percpu_counter_sub(&sbi->s_dirtyclusters_counter, to_free);
-
-	spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
+	__ext4_da_update_reserve_space(__func__, inode, to_free);
+	spin_unlock(&ei->i_block_reservation_lock);
 
 	dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free));
 }
-- 
2.39.2



* [RFC PATCH 12/16] ext4: update reserved meta blocks in ext4_da_{release|update_reserve}_space()
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
                   ` (10 preceding siblings ...)
  2023-08-24  9:26 ` [RFC PATCH 11/16] ext4: factor out common part of ext4_da_{release|update_reserve}_space() Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-09-06  7:35   ` kernel test robot
  2023-08-24  9:26 ` [RFC PATCH 13/16] ext4: calculate the worst extent blocks needed of a delalloc es entry Zhang Yi
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

As with ext4_da_reserve_space(), we also need to update the reserved
metadata blocks when we release or convert a delalloc space range in
ext4_da_release_space() and ext4_da_update_reserve_space(). So prepare
to update the metadata block reservations in these two functions as
well; the update logic is the same as for data blocks. This patch is
just a preparation: the reserved ext_len is always zero.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h              |  4 ++--
 fs/ext4/inode.c             | 47 +++++++++++++++++++++----------------
 include/trace/events/ext4.h | 28 ++++++++++++++--------
 3 files changed, 47 insertions(+), 32 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ee2dbbde176e..3e0a39653469 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2998,9 +2998,9 @@ extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf);
 extern qsize_t *ext4_get_reserved_space(struct inode *inode);
 extern int ext4_get_projid(struct inode *inode, kprojid_t *projid);
-extern void ext4_da_release_space(struct inode *inode, int to_free);
+extern void ext4_da_release_space(struct inode *inode, unsigned int data_len);
 extern void ext4_da_update_reserve_space(struct inode *inode,
-					int used, int quota_claim);
+					unsigned int data_len, int quota_claim);
 extern int ext4_issue_zeroout(struct inode *inode, ext4_lblk_t lblk,
 			      ext4_fsblk_t pblk, ext4_lblk_t len);
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 13036cecbcc0..38c47ce1333b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -327,53 +327,59 @@ qsize_t *ext4_get_reserved_space(struct inode *inode)
 
 static void __ext4_da_update_reserve_space(const char *where,
 					   struct inode *inode,
-					   int data_len)
+					   unsigned int data_len, int ext_len)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct ext4_inode_info *ei = EXT4_I(inode);
 
-	if (unlikely(data_len > ei->i_reserved_data_blocks)) {
-		ext4_warning(inode->i_sb, "%s: ino %lu, clear %d "
-			     "with only %d reserved data blocks",
-			     where, inode->i_ino, data_len,
-			     ei->i_reserved_data_blocks);
+	if (unlikely(data_len > ei->i_reserved_data_blocks ||
+		     ext_len > (long)ei->i_reserved_ext_blocks)) {
+		ext4_warning(inode->i_sb, "%s: ino %lu, clear %d,%d "
+			     "with only %d,%d reserved data blocks",
+			     where, inode->i_ino, data_len, ext_len,
+			     ei->i_reserved_data_blocks,
+			     ei->i_reserved_ext_blocks);
 		WARN_ON(1);
-		data_len = ei->i_reserved_data_blocks;
+		data_len = min(data_len, ei->i_reserved_data_blocks);
+		ext_len = min_t(unsigned int, ext_len, ei->i_reserved_ext_blocks);
 	}
 
 	/* Update per-inode reservations */
 	ei->i_reserved_data_blocks -= data_len;
-	percpu_counter_sub(&sbi->s_dirtyclusters_counter, data_len);
+	ei->i_reserved_ext_blocks -= ext_len;
+	percpu_counter_sub(&sbi->s_dirtyclusters_counter, (s64)data_len + ext_len);
 }
 
 /*
  * Called with i_data_sem down, which is important since we can call
  * ext4_discard_preallocations() from here.
  */
-void ext4_da_update_reserve_space(struct inode *inode,
-				  int used, int quota_claim)
+void ext4_da_update_reserve_space(struct inode *inode, unsigned int data_len,
+				  int quota_claim)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	int ext_len = 0;
 
-	if (!used)
+	if (!data_len)
 		return;
 
 	spin_lock(&ei->i_block_reservation_lock);
-	trace_ext4_da_update_reserve_space(inode, used, quota_claim);
-	__ext4_da_update_reserve_space(__func__, inode, used);
+	trace_ext4_da_update_reserve_space(inode, data_len, ext_len,
+					   quota_claim);
+	__ext4_da_update_reserve_space(__func__, inode, data_len, ext_len);
 	spin_unlock(&ei->i_block_reservation_lock);
 
 	/* Update quota subsystem for data blocks */
 	if (quota_claim)
-		dquot_claim_block(inode, EXT4_C2B(sbi, used));
+		dquot_claim_block(inode, EXT4_C2B(sbi, data_len));
 	else {
 		/*
 		 * We did fallocate with an offset that is already delayed
 		 * allocated. So on delayed allocated writeback we should
 		 * not re-claim the quota for fallocated blocks.
 		 */
-		dquot_release_reservation_block(inode, EXT4_C2B(sbi, used));
+		dquot_release_reservation_block(inode, EXT4_C2B(sbi, data_len));
 	}
 
 	/*
@@ -1484,20 +1490,21 @@ static int ext4_da_reserve_space(struct inode *inode, unsigned int rsv_dlen,
 	return 0;       /* success */
 }
 
-void ext4_da_release_space(struct inode *inode, int to_free)
+void ext4_da_release_space(struct inode *inode, unsigned int data_len)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	int ext_len = 0;
 
-	if (!to_free)
+	if (!data_len)
 		return;		/* Nothing to release, exit */
 
 	spin_lock(&ei->i_block_reservation_lock);
-	trace_ext4_da_release_space(inode, to_free);
-	__ext4_da_update_reserve_space(__func__, inode, to_free);
+	trace_ext4_da_release_space(inode, data_len, ext_len);
+	__ext4_da_update_reserve_space(__func__, inode, data_len, ext_len);
 	spin_unlock(&ei->i_block_reservation_lock);
 
-	dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free));
+	dquot_release_reservation_block(inode, EXT4_C2B(sbi, data_len));
 }
 
 /*
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 7a9839f2d681..e1e9d7ead20f 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -1214,15 +1214,19 @@ TRACE_EVENT(ext4_forget,
 );
 
 TRACE_EVENT(ext4_da_update_reserve_space,
-	TP_PROTO(struct inode *inode, int used_blocks, int quota_claim),
+	TP_PROTO(struct inode *inode,
+		 int data_blocks,
+		 int meta_blocks,
+		 int quota_claim),
 
-	TP_ARGS(inode, used_blocks, quota_claim),
+	TP_ARGS(inode, data_blocks, meta_blocks, quota_claim),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
 		__field(	ino_t,	ino			)
 		__field(	__u64,	i_blocks		)
-		__field(	int,	used_blocks		)
+		__field(	int,	data_blocks		)
+		__field(	int,	meta_blocks		)
 		__field(	int,	reserved_data_blocks	)
 		__field(	int,	reserved_ext_blocks	)
 		__field(	int,	quota_claim		)
@@ -1233,19 +1237,20 @@ TRACE_EVENT(ext4_da_update_reserve_space,
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
 		__entry->i_blocks = inode->i_blocks;
-		__entry->used_blocks = used_blocks;
+		__entry->data_blocks = data_blocks;
+		__entry->meta_blocks = meta_blocks;
 		__entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
 		__entry->reserved_ext_blocks = EXT4_I(inode)->i_reserved_ext_blocks;
 		__entry->quota_claim = quota_claim;
 		__entry->mode	= inode->i_mode;
 	),
 
-	TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu used_blocks %d "
+	TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu data_blocks %d meta_blocks %d "
 		  "reserved_data_blocks %d reserved_ext_blocks %d quota_claim %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  __entry->mode, __entry->i_blocks,
-		  __entry->used_blocks,
+		  __entry->data_blocks, __entry->meta_blocks,
 		  __entry->reserved_data_blocks, __entry->reserved_ext_blocks,
 		  __entry->quota_claim)
 );
@@ -1289,15 +1294,16 @@ TRACE_EVENT(ext4_da_reserve_space,
 );
 
 TRACE_EVENT(ext4_da_release_space,
-	TP_PROTO(struct inode *inode, int freed_blocks),
+	TP_PROTO(struct inode *inode, int freed_blocks, int meta_blocks),
 
-	TP_ARGS(inode, freed_blocks),
+	TP_ARGS(inode, freed_blocks, meta_blocks),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
 		__field(	ino_t,	ino			)
 		__field(	__u64,	i_blocks		)
 		__field(	int,	freed_blocks		)
+		__field(	int,	meta_blocks		)
 		__field(	int,	reserved_data_blocks	)
 		__field(	int,	reserved_ext_blocks	)
 		__field(	__u16,  mode			)
@@ -1308,17 +1314,19 @@ TRACE_EVENT(ext4_da_release_space,
 		__entry->ino	= inode->i_ino;
 		__entry->i_blocks = inode->i_blocks;
 		__entry->freed_blocks = freed_blocks;
+		__entry->meta_blocks = meta_blocks;
 		__entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
 		__entry->reserved_ext_blocks = EXT4_I(inode)->i_reserved_ext_blocks;
 		__entry->mode	= inode->i_mode;
 	),
 
-	TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu freed_blocks %d "
+	TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu "
+		  "freed_blocks %d meta_blocks %d "
 		  "reserved_data_blocks %d reserved_ext_blocks %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  __entry->mode, __entry->i_blocks,
-		  __entry->freed_blocks,
+		  __entry->freed_blocks, __entry->meta_blocks,
 		  __entry->reserved_data_blocks,
 		  __entry->reserved_ext_blocks)
 );
-- 
2.39.2



* [RFC PATCH 13/16] ext4: calculate the worst extent blocks needed of a delalloc es entry
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
                   ` (11 preceding siblings ...)
  2023-08-24  9:26 ` [RFC PATCH 12/16] ext4: update reserved meta blocks in ext4_da_{release|update_reserve}_space() Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 14/16] ext4: reserve extent blocks for delalloc Zhang Yi
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Add a new helper to calculate the worst case of extent blocks needed
while mapping a new delalloc extent_status entry. In the worst case,
each delayed data block consumes one extent entry, so the worst-case
extent blocks count is 'leaf blocks + index blocks + (max depth -
depth increasing costs)'. The detailed calculation formula is:

        / DIV_ROUND_UP(da_blocks, ext_per_block);  (i = 0)
 f(i) =
        \ DIV_ROUND_UP(f(i-1), idx_per_block);     (0 < i < max_depth)

 SUM = f(0) + .. + f(n) + max_depth - n - 1;  (0 <= n < max_depth, f(n) > 0)

For example:
On the default 4k block size, the default ext_per_block and
idx_per_block are both 340. (1) If we map 50 blocks, the worst-case
extent blocks count is DIV_ROUND_UP(50, 340) + EXT4_MAX_EXTENT_DEPTH -
1 = 5; (2) if we map 500 blocks, the worst-case extent blocks count is
DIV_ROUND_UP(500, 340) + DIV_ROUND_UP(DIV_ROUND_UP(500, 340), 340) +
EXT4_MAX_EXTENT_DEPTH - 2 = 6, and so on. This is a preparation for
reserving metadata blocks for delalloc.
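
As an editorial illustration (not part of the patch), here is a minimal
userspace C sketch of the same calculation, assuming ext_per_block =
idx_per_block = 340 and EXT4_MAX_EXTENT_DEPTH = 5 as in the 4k example
above; it reproduces both worked results:

	#include <stdio.h>

	#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
	#define EXT4_MAX_EXTENT_DEPTH	5

	/* Worst-case extent tree blocks needed to map 'len' delayed blocks. */
	static unsigned int worst_ext_blocks(unsigned int len,
					     unsigned int ext_per_block,
					     unsigned int idx_per_block)
	{
		unsigned int ext_blocks = 0;
		int depth;

		if (!len)
			return 0;

		for (depth = 0; depth < EXT4_MAX_EXTENT_DEPTH; depth++) {
			/* level 0 holds extent entries, upper levels hold indexes */
			len = DIV_ROUND_UP(len, depth ? idx_per_block : ext_per_block);
			ext_blocks += len;
			if (len == 1)
				break;
		}
		/* plus the cost of growing the tree to its maximum depth */
		return ext_blocks + EXT4_MAX_EXTENT_DEPTH - depth - 1;
	}

	int main(void)
	{
		printf("%u\n", worst_ext_blocks(50, 340, 340));   /* prints 5 */
		printf("%u\n", worst_ext_blocks(500, 340, 340));  /* prints 6 */
		return 0;
	}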

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h    |  2 ++
 fs/ext4/extents.c | 28 ++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3e0a39653469..11813382fbcc 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3699,6 +3699,8 @@ extern int ext4_swap_extents(handle_t *handle, struct inode *inode1,
 			     ext4_lblk_t lblk2,  ext4_lblk_t count,
 			     int mark_unwritten,int *err);
 extern int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu);
+extern unsigned int ext4_map_worst_ext_blocks(struct inode *inode,
+					      unsigned int len);
 extern int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
 				       int check_cred, int restart_cred,
 				       int revoke_cred);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 592383effe80..43c251a42144 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5797,6 +5797,34 @@ int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu)
 	return err ? err : mapped;
 }
 
+/*
+ * Calculate the worst case of extent blocks needed while mapping 'len'
+ * data blocks.
+ */
+unsigned int ext4_map_worst_ext_blocks(struct inode *inode, unsigned int len)
+{
+	unsigned int ext_blocks = 0;
+	int max_entries;
+	int depth, max_depth;
+
+	if (!len)
+		return 0;
+
+	max_entries = ext4_ext_space_block(inode, 0);
+	max_depth = EXT4_MAX_EXTENT_DEPTH;
+
+	for (depth = 0; depth < max_depth; depth++) {
+		len = DIV_ROUND_UP(len, max_entries);
+		ext_blocks += len;
+		if (len == 1)
+			break;
+		if (depth == 0)
+			max_entries = ext4_ext_space_block_idx(inode, 0);
+	}
+
+	return ext_blocks + max_depth - depth - 1;
+}
+
 /*
  * Updates physical block address and unwritten status of extent
  * starting at lblk start and of len. If such an extent doesn't exist,
-- 
2.39.2



* [RFC PATCH 14/16] ext4: reserve extent blocks for delalloc
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
                   ` (12 preceding siblings ...)
  2023-08-24  9:26 ` [RFC PATCH 13/16] ext4: calculate the worst extent blocks needed of a delalloc es entry Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 15/16] ext4: flush delalloc blocks if no free space Zhang Yi
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Now ext4 only reserves data blocks for delalloc in
ext4_da_reserve_space(), and switches to non-delalloc mode if the free
blocks are less than 150% of the dirty blocks or below the watermark.
In the meantime, '27dd43854227 ("ext4: introduce reserved space")'
reserves some of the file system space (2% or 4096 clusters, whichever
is smaller). Both of them make sure that space is not exhausted when
mapping delalloc entries in most cases, but they cannot guarantee it
(under highly concurrent writes, ext4_nonda_switch() does not work
because it only reads the counter on the current CPU, and the reserved
clusters can also be exhausted easily). So it could lead to an infinite
loop in ext4_do_writepages(); consider the case where only one free
block is left while we want to allocate a data block plus a new extent
block in ext4_writepages().

ext4_do_writepages()
 // <-- 1
 mpage_map_and_submit_extent()
  mpage_map_one_extent()
   ext4_map_blocks()
    ext4_ext_map_blocks()
     ext4_mb_new_blocks() //allocate the last block
     ext4_ext_insert_extent //allocate failed
     ext4_free_blocks() //free the data block just allocated
     return -ENOSPC;
  ext4_count_free_clusters() //is true
  return -ENOSPC;
 --> goto 1 and infinite loop

One more thing: it could also lead to data loss and trigger the error
message below.

  EXT4-fs (sda): delayed block allocation failed for inode
                 X at logical offset X with max blocks X with error -28
  EXT4-fs (sda): This should not happen!!  Data will be lost

The best solution is to calculate and reserve the extent blocks
(metadata blocks) that could be allocated when mapping a delalloc es
entry. The reservation is very tricky and is related to the continuity
of physical blocks. An effective way is to reserve for the worst case,
which means every block is discontinuous and costs an extent entry;
ext4_map_worst_ext_blocks() does this calculation. We have already
counted the total delayed data blocks in the ext4_es_tree, so we can
use it to calculate the worst-case metadata blocks that should be
reserved and save the result in the prepared ei->i_reserved_ext_blocks.
Once a delalloc entry is mapped, recalculate it and release the unused
reservation.
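
A worked example of this bookkeeping (derived from the code below,
assuming the 4k defaults: ext_per_block = idx_per_block = 340,
EXT4_MAX_EXTENT_DEPTH = 5):

  insert 1st delayed block: reserve 1 data + worst(1) = 5 extent blocks
  insert 2nd delayed block: reserve worst(1) = 5 more, then recalculate
                            against the total: worst(2) = 5, so the
                            surplus 5 blocks are released again
  ...
  340 contiguous delayed blocks therefore hold only worst(340) = 5
  reserved extent blocks in total, not 340 * 5.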

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h              |  6 ++++--
 fs/ext4/extents_status.c    | 29 +++++++++++++++++++-------
 fs/ext4/inode.c             | 41 ++++++++++++++++++++++++++++---------
 include/trace/events/ext4.h | 25 +++++++++++++++-------
 4 files changed, 75 insertions(+), 26 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 11813382fbcc..67b12f9ffc50 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2998,9 +2998,11 @@ extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf);
 extern qsize_t *ext4_get_reserved_space(struct inode *inode);
 extern int ext4_get_projid(struct inode *inode, kprojid_t *projid);
-extern void ext4_da_release_space(struct inode *inode, unsigned int data_len);
+extern void ext4_da_release_space(struct inode *inode, unsigned int data_len,
+				  unsigned int total_da_len, long da_len);
 extern void ext4_da_update_reserve_space(struct inode *inode,
-					unsigned int data_len, int quota_claim);
+				unsigned int data_len, unsigned int total_da_len,
+				long da_len, int quota_claim);
 extern int ext4_issue_zeroout(struct inode *inode, ext4_lblk_t lblk,
 			      ext4_fsblk_t pblk, ext4_lblk_t len);
 
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index b098c3316189..8e0dec27f967 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -789,17 +789,20 @@ static inline void ext4_es_insert_extent_check(struct inode *inode,
 #endif
 
 /*
- * Update total delay allocated extent length.
+ * Update and return total delay allocated extent length.
  */
-static inline void ext4_es_update_da_block(struct inode *inode, long es_len)
+static inline unsigned int ext4_es_update_da_block(struct inode *inode,
+						   long es_len)
 {
 	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
 
 	if (!es_len)
-		return;
+		goto out;
 
 	tree->da_es_len += es_len;
 	es_debug("update da blocks %ld, to %u\n", es_len, tree->da_es_len);
+out:
+	return tree->da_es_len;
 }
 
 static int __es_insert_extent(struct inode *inode, struct extent_status *newes,
@@ -870,6 +873,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 {
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
+	ext4_lblk_t da_blocks = 0;
 	int err1 = 0, err2 = 0, err3 = 0;
 	struct rsvd_info rinfo;
 	int pending = 0;
@@ -930,7 +934,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 			__es_free_extent(es1);
 		es1 = NULL;
 	}
-	ext4_es_update_da_block(inode, -rinfo.ndelonly_blk);
+	da_blocks = ext4_es_update_da_block(inode, -rinfo.ndelonly_blk);
 
 	err2 = __es_insert_extent(inode, &newes, es2);
 	if (err2 == -ENOMEM && !ext4_es_must_keep(&newes))
@@ -975,6 +979,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	 * for any previously delayed allocated clusters.
 	 */
 	ext4_da_update_reserve_space(inode, rinfo.ndelonly_clu + pending,
+				     da_blocks, -rinfo.ndelonly_blk,
 				     !delayed && rinfo.ndelonly_blk);
 	if (err1 || err2 || err3 < 0)
 		goto retry;
@@ -1554,6 +1559,7 @@ void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 			   ext4_lblk_t len)
 {
 	ext4_lblk_t end;
+	ext4_lblk_t da_blocks = 0;
 	struct rsvd_info rinfo;
 	int err = 0;
 	struct extent_status *es = NULL;
@@ -1587,13 +1593,14 @@ void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 			__es_free_extent(es);
 		es = NULL;
 	}
-	ext4_es_update_da_block(inode, -rinfo.ndelonly_blk);
+	da_blocks = ext4_es_update_da_block(inode, -rinfo.ndelonly_blk);
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 	if (err)
 		goto retry;
 
 	ext4_es_print_tree(inode);
-	ext4_da_release_space(inode, rinfo.ndelonly_clu);
+	ext4_da_release_space(inode, rinfo.ndelonly_clu, da_blocks,
+			      -rinfo.ndelonly_blk);
 	return;
 }
 
@@ -2122,6 +2129,7 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 				  bool allocated)
 {
 	struct extent_status newes;
+	ext4_lblk_t da_blocks;
 	int err1 = 0, err2 = 0, err3 = 0;
 	struct extent_status *es1 = NULL;
 	struct extent_status *es2 = NULL;
@@ -2179,12 +2187,19 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 		}
 	}
 
-	ext4_es_update_da_block(inode, newes.es_len);
+	da_blocks = ext4_es_update_da_block(inode, newes.es_len);
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 	if (err1 || err2 || err3 < 0)
 		goto retry;
 
+	/*
+	 * New reserved meta space has been claimed for a single newly added
+	 * delayed block in ext4_da_reserve_space(), but most of the reserved
+	 * count of meta blocks could be merged, so recalculate it according
+	 * to latest total delayed blocks.
+	 */
+	ext4_da_update_reserve_space(inode, 0, da_blocks, newes.es_len, 0);
 	ext4_es_print_tree(inode);
 	ext4_print_pending_tree(inode);
 	return;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 38c47ce1333b..d714bf2e4171 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -332,6 +332,9 @@ static void __ext4_da_update_reserve_space(const char *where,
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct ext4_inode_info *ei = EXT4_I(inode);
 
+	if (!data_len && !ext_len)
+		return;
+
 	if (unlikely(data_len > ei->i_reserved_data_blocks ||
 		     ext_len > (long)ei->i_reserved_ext_blocks)) {
 		ext4_warning(inode->i_sb, "%s: ino %lu, clear %d,%d "
@@ -355,21 +358,30 @@ static void __ext4_da_update_reserve_space(const char *where,
  * ext4_discard_preallocations() from here.
  */
 void ext4_da_update_reserve_space(struct inode *inode, unsigned int data_len,
+				  unsigned int total_da_len, long da_len,
 				  int quota_claim)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	int ext_len = 0;
+	unsigned int new_ext_len;
+	int ext_len;
 
-	if (!data_len)
+	if (!data_len && !da_len)
 		return;
 
+	if (da_len)
+		new_ext_len = ext4_map_worst_ext_blocks(inode, total_da_len);
+
 	spin_lock(&ei->i_block_reservation_lock);
-	trace_ext4_da_update_reserve_space(inode, data_len, ext_len,
-					   quota_claim);
+	ext_len = da_len ? ei->i_reserved_ext_blocks - new_ext_len : 0;
+	trace_ext4_da_update_reserve_space(inode, data_len, total_da_len,
+					   ext_len, quota_claim);
 	__ext4_da_update_reserve_space(__func__, inode, data_len, ext_len);
 	spin_unlock(&ei->i_block_reservation_lock);
 
+	if (!data_len)
+		return;
+
 	/* Update quota subsystem for data blocks */
 	if (quota_claim)
 		dquot_claim_block(inode, EXT4_C2B(sbi, data_len));
@@ -1490,21 +1502,28 @@ static int ext4_da_reserve_space(struct inode *inode, unsigned int rsv_dlen,
 	return 0;       /* success */
 }
 
-void ext4_da_release_space(struct inode *inode, unsigned int data_len)
+void ext4_da_release_space(struct inode *inode, unsigned int data_len,
+			   unsigned int total_da_len, long da_len)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	int ext_len = 0;
+	unsigned int new_ext_len;
+	int ext_len;
 
-	if (!data_len)
+	if (!data_len && !da_len)
 		return;		/* Nothing to release, exit */
 
+	if (da_len)
+		new_ext_len = ext4_map_worst_ext_blocks(inode, total_da_len);
+
 	spin_lock(&ei->i_block_reservation_lock);
-	trace_ext4_da_release_space(inode, data_len, ext_len);
+	ext_len = da_len ? (ei->i_reserved_ext_blocks - new_ext_len) : 0;
+	trace_ext4_da_release_space(inode, data_len, total_da_len, ext_len);
 	__ext4_da_update_reserve_space(__func__, inode, data_len, ext_len);
 	spin_unlock(&ei->i_block_reservation_lock);
 
-	dquot_release_reservation_block(inode, EXT4_C2B(sbi, data_len));
+	if (data_len)
+		dquot_release_reservation_block(inode, EXT4_C2B(sbi, data_len));
 }
 
 /*
@@ -1629,6 +1648,7 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	unsigned int rsv_dlen = 1;
+	unsigned int rsv_extlen;
 	bool allocated = false;
 	int ret;
 
@@ -1662,7 +1682,8 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
 		}
 	}
 
-	ret = ext4_da_reserve_space(inode, rsv_dlen, 0);
+	rsv_extlen = ext4_map_worst_ext_blocks(inode, 1);
+	ret = ext4_da_reserve_space(inode, rsv_dlen, rsv_extlen);
 	if (ret)   /* ENOSPC */
 		return ret;
 
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index e1e9d7ead20f..6916b1c5dff6 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -1216,16 +1216,18 @@ TRACE_EVENT(ext4_forget,
 TRACE_EVENT(ext4_da_update_reserve_space,
 	TP_PROTO(struct inode *inode,
 		 int data_blocks,
+		 unsigned int total_da_blocks,
 		 int meta_blocks,
 		 int quota_claim),
 
-	TP_ARGS(inode, data_blocks, meta_blocks, quota_claim),
+	TP_ARGS(inode, data_blocks, total_da_blocks, meta_blocks, quota_claim),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
 		__field(	ino_t,	ino			)
 		__field(	__u64,	i_blocks		)
 		__field(	int,	data_blocks		)
+		__field(	unsigned int, total_da_blocks	)
 		__field(	int,	meta_blocks		)
 		__field(	int,	reserved_data_blocks	)
 		__field(	int,	reserved_ext_blocks	)
@@ -1238,6 +1240,7 @@ TRACE_EVENT(ext4_da_update_reserve_space,
 		__entry->ino	= inode->i_ino;
 		__entry->i_blocks = inode->i_blocks;
 		__entry->data_blocks = data_blocks;
+		__entry->total_da_blocks = total_da_blocks;
 		__entry->meta_blocks = meta_blocks;
 		__entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
 		__entry->reserved_ext_blocks = EXT4_I(inode)->i_reserved_ext_blocks;
@@ -1245,12 +1248,14 @@ TRACE_EVENT(ext4_da_update_reserve_space,
 		__entry->mode	= inode->i_mode;
 	),
 
-	TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu data_blocks %d meta_blocks %d "
+	TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu "
+		  "data_blocks %d total_da_blocks %u meta_blocks %d "
 		  "reserved_data_blocks %d reserved_ext_blocks %d quota_claim %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  __entry->mode, __entry->i_blocks,
-		  __entry->data_blocks, __entry->meta_blocks,
+		  __entry->data_blocks,
+		  __entry->total_da_blocks, __entry->meta_blocks,
 		  __entry->reserved_data_blocks, __entry->reserved_ext_blocks,
 		  __entry->quota_claim)
 );
@@ -1294,15 +1299,19 @@ TRACE_EVENT(ext4_da_reserve_space,
 );
 
 TRACE_EVENT(ext4_da_release_space,
-	TP_PROTO(struct inode *inode, int freed_blocks, int meta_blocks),
+	TP_PROTO(struct inode *inode,
+		 int freed_blocks,
+		 unsigned int total_da_blocks,
+		 int meta_blocks),
 
-	TP_ARGS(inode, freed_blocks, meta_blocks),
+	TP_ARGS(inode, freed_blocks, total_da_blocks, meta_blocks),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
 		__field(	ino_t,	ino			)
 		__field(	__u64,	i_blocks		)
 		__field(	int,	freed_blocks		)
+		__field(	unsigned int, total_da_blocks	)
 		__field(	int,	meta_blocks		)
 		__field(	int,	reserved_data_blocks	)
 		__field(	int,	reserved_ext_blocks	)
@@ -1314,6 +1323,7 @@ TRACE_EVENT(ext4_da_release_space,
 		__entry->ino	= inode->i_ino;
 		__entry->i_blocks = inode->i_blocks;
 		__entry->freed_blocks = freed_blocks;
+		__entry->total_da_blocks = total_da_blocks;
 		__entry->meta_blocks = meta_blocks;
 		__entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
 		__entry->reserved_ext_blocks = EXT4_I(inode)->i_reserved_ext_blocks;
@@ -1321,12 +1331,13 @@ TRACE_EVENT(ext4_da_release_space,
 	),
 
 	TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu "
-		  "freed_blocks %d meta_blocks %d "
+		  "freed_blocks %d total_da_blocks %u meta_blocks %d "
 		  "reserved_data_blocks %d reserved_ext_blocks %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  __entry->mode, __entry->i_blocks,
-		  __entry->freed_blocks, __entry->meta_blocks,
+		  __entry->freed_blocks,
+		  __entry->total_da_blocks, __entry->meta_blocks,
 		  __entry->reserved_data_blocks,
 		  __entry->reserved_ext_blocks)
 );
-- 
2.39.2



* [RFC PATCH 15/16] ext4: flush delalloc blocks if no free space
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
                   ` (13 preceding siblings ...)
  2023-08-24  9:26 ` [RFC PATCH 14/16] ext4: reserve extent blocks for delalloc Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-24  9:26 ` [RFC PATCH 16/16] ext4: drop ext4_nonda_switch() Zhang Yi
  2023-08-30 15:30 ` [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Jan Kara
  16 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

For delalloc, the reserved metadata blocks count is calculated for the
worst case, so the reservation could be larger than what is really
needed, which could lead to returning a false positive -ENOSPC when
claiming free space. So start a worker to flush delalloc blocks in
ext4_should_retry_alloc(): if s_dirtyclusters_counter is not zero,
there may be some reserved delalloc metadata blocks that could be
freed.
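
For context, the retry loop below is the usual caller idiom this hooks
into (a sketch of existing ext4 usage, not new code from this patch):

	int retries = 0;
retry:
	err = ext4_map_blocks(handle, inode, &map, flags);
	if (err == -ENOSPC &&
	    ext4_should_retry_alloc(inode->i_sb, &retries))
		goto retry;	/* the flush worker may have freed space */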

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/balloc.c | 47 +++++++++++++++++++++++++++++++++++++++++------
 fs/ext4/ext4.h   |  5 +++++
 fs/ext4/super.c  | 12 ++++++++++++
 3 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 79b20d6ae39e..e8acc21ef56d 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -667,6 +667,30 @@ int ext4_claim_free_clusters(struct ext4_sb_info *sbi,
 		return -ENOSPC;
 }
 
+void ext4_writeback_da_blocks(struct work_struct *work)
+{
+	struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info,
+						s_da_flush_work);
+
+	try_to_writeback_inodes_sb(sbi->s_sb, WB_REASON_FS_FREE_SPACE);
+}
+
+/*
+ * Write back delalloc blocks and try to free unused reserved extent
+ * blocks. Return 0 if no delalloc blocks need to be written back,
+ * 1 otherwise.
+ */
+static int ext4_flush_da_blocks(struct ext4_sb_info *sbi)
+{
+	if (!percpu_counter_read_positive(&sbi->s_dirtyclusters_counter) &&
+	    !percpu_counter_sum(&sbi->s_dirtyclusters_counter))
+		return 0;
+
+	if (!work_busy(&sbi->s_da_flush_work))
+		queue_work(sbi->s_da_flush_wq, &sbi->s_da_flush_work);
+	flush_work(&sbi->s_da_flush_work);
+	return 1;
+}
+
 /**
  * ext4_should_retry_alloc() - check if a block allocation should be retried
  * @sb:			superblock
@@ -681,15 +705,22 @@ int ext4_claim_free_clusters(struct ext4_sb_info *sbi,
 int ext4_should_retry_alloc(struct super_block *sb, int *retries)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
-
-	if (!sbi->s_journal)
-		return 0;
+	int result = 0;
 
 	if (++(*retries) > 3) {
 		percpu_counter_inc(&sbi->s_sra_exceeded_retry_limit);
 		return 0;
 	}
 
+	/*
+	 * Flush allocated delalloc blocks and try to free unused
+	 * reserved extent blocks.
+	 */
+	if (test_opt(sb, DELALLOC))
+		result += ext4_flush_da_blocks(sbi);
+
+	if (!sbi->s_journal)
+		goto out;
 	/*
 	 * if there's no indication that blocks are about to be freed it's
 	 * possible we just missed a transaction commit that did so
@@ -701,16 +732,20 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries)
 			flush_work(&sbi->s_discard_work);
 			atomic_dec(&sbi->s_retry_alloc_pending);
 		}
-		return ext4_has_free_clusters(sbi, 1, 0);
+		result += ext4_has_free_clusters(sbi, 1, 0);
+		goto out;
 	}
 
 	/*
 	 * it's possible we've just missed a transaction commit here,
 	 * so ignore the returned status
 	 */
-	ext4_debug("%s: retrying operation after ENOSPC\n", sb->s_id);
+	result += 1;
 	(void) jbd2_journal_force_commit_nested(sbi->s_journal);
-	return 1;
+out:
+	if (result)
+		ext4_debug("%s: retrying operation after ENOSPC\n", sb->s_id);
+	return result;
 }
 
 /*
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 67b12f9ffc50..6f4259ea6751 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1627,6 +1627,10 @@ struct ext4_sb_info {
 	/* workqueue for reserved extent conversions (buffered io) */
 	struct workqueue_struct *rsv_conversion_wq;
 
+	/* workqueue for delalloc buffer IO flushing */
+	struct workqueue_struct *s_da_flush_wq;
+	struct work_struct s_da_flush_work;
+
 	/* timer for periodic error stats printing */
 	struct timer_list s_err_report;
 
@@ -2716,6 +2720,7 @@ extern int ext4_wait_block_bitmap(struct super_block *sb,
 				  struct buffer_head *bh);
 extern struct buffer_head *ext4_read_block_bitmap(struct super_block *sb,
 						  ext4_group_t block_group);
+extern void ext4_writeback_da_blocks(struct work_struct *work);
 extern unsigned ext4_free_clusters_after_init(struct super_block *sb,
 					      ext4_group_t block_group,
 					      struct ext4_group_desc *gdp);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7bc7c8c0ed71..6f50975ba42e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1335,6 +1335,8 @@ static void ext4_put_super(struct super_block *sb)
 
 	flush_work(&sbi->s_sb_upd_work);
 	destroy_workqueue(sbi->rsv_conversion_wq);
+	flush_work(&sbi->s_da_flush_work);
+	destroy_workqueue(sbi->s_da_flush_wq);
 	ext4_release_orphan_info(sb);
 
 	if (sbi->s_journal) {
@@ -5491,6 +5493,14 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 		goto failed_mount4;
 	}
 
+	INIT_WORK(&sbi->s_da_flush_work, ext4_writeback_da_blocks);
+	sbi->s_da_flush_wq = alloc_workqueue("ext4_delalloc_flush", WQ_UNBOUND, 1);
+	if (!sbi->s_da_flush_wq) {
+		printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
+		err = -ENOMEM;
+		goto failed_mount4;
+	}
+
 	/*
 	 * The jbd2_journal_load will have done any necessary log recovery,
 	 * so we can safely mount the rest of the filesystem now.
@@ -5660,6 +5670,8 @@ failed_mount9: __maybe_unused
 	sb->s_root = NULL;
 failed_mount4:
 	ext4_msg(sb, KERN_ERR, "mount failed");
+	if (sbi->s_da_flush_wq)
+		destroy_workqueue(sbi->s_da_flush_wq);
 	if (EXT4_SB(sb)->rsv_conversion_wq)
 		destroy_workqueue(EXT4_SB(sb)->rsv_conversion_wq);
 failed_mount_wq:
-- 
2.39.2



* [RFC PATCH 16/16] ext4: drop ext4_nonda_switch()
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Zhang Yi
                   ` (14 preceding siblings ...)
  2023-08-24  9:26 ` [RFC PATCH 15/16] ext4: flush delalloc blocks if no free space Zhang Yi
@ 2023-08-24  9:26 ` Zhang Yi
  2023-08-30 15:30 ` [RFC PATCH 00/16] ext4: more accurate metadata reservation for delalloc mount option Jan Kara
  16 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2023-08-24  9:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Now that we reserve enough metadata blocks for delalloc,
ext4_nonda_switch() can be dropped. It's safe to stay in delalloc mode
for buffered writes even if the dirty space is high and the free space
is low, because we can make sure the metadata block allocation always
succeeds while mapping delalloc entries in ext4_writepages().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents_status.c |  9 ++++-----
 fs/ext4/inode.c          | 39 ++-------------------------------------
 2 files changed, 6 insertions(+), 42 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 8e0dec27f967..954c6e49182e 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -971,11 +971,10 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	 * reduce the reserved cluster count and claim quota.
 	 *
 	 * Otherwise, we aren't allocating delayed allocated clusters
-	 * (from fallocate, filemap, DIO, or clusters allocated when
-	 * delalloc has been disabled by ext4_nonda_switch()), reduce the
-	 * reserved cluster count by the number of allocated clusters that
-	 * have previously been delayed allocated. Quota has been claimed
-	 * by ext4_mb_new_blocks(), so release the quota reservations made
+	 * (from fallocate, filemap, DIO), reduce the reserved cluster
+	 * count by the number of allocated clusters that have previously
+	 * been delayed allocated. Quota has been claimed by
+	 * ext4_mb_new_blocks(), so release the quota reservations made
 	 * for any previously delayed allocated clusters.
 	 */
 	ext4_da_update_reserve_space(inode, rinfo.ndelonly_clu + pending,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d714bf2e4171..0a76c99ea8c6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2838,40 +2838,6 @@ static int ext4_dax_writepages(struct address_space *mapping,
 	return ret;
 }
 
-static int ext4_nonda_switch(struct super_block *sb)
-{
-	s64 free_clusters, dirty_clusters;
-	struct ext4_sb_info *sbi = EXT4_SB(sb);
-
-	/*
-	 * switch to non delalloc mode if we are running low
-	 * on free block. The free block accounting via percpu
-	 * counters can get slightly wrong with percpu_counter_batch getting
-	 * accumulated on each CPU without updating global counters
-	 * Delalloc need an accurate free block accounting. So switch
-	 * to non delalloc when we are near to error range.
-	 */
-	free_clusters =
-		percpu_counter_read_positive(&sbi->s_freeclusters_counter);
-	dirty_clusters =
-		percpu_counter_read_positive(&sbi->s_dirtyclusters_counter);
-	/*
-	 * Start pushing delalloc when 1/2 of free blocks are dirty.
-	 */
-	if (dirty_clusters && (free_clusters < 2 * dirty_clusters))
-		try_to_writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
-
-	if (2 * free_clusters < 3 * dirty_clusters ||
-	    free_clusters < (dirty_clusters + EXT4_FREECLUSTERS_WATERMARK)) {
-		/*
-		 * free block count is less than 150% of dirty blocks
-		 * or free blocks is less than watermark
-		 */
-		return 1;
-	}
-	return 0;
-}
-
 static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 			       loff_t pos, unsigned len,
 			       struct page **pagep, void **fsdata)
@@ -2886,7 +2852,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 
 	index = pos >> PAGE_SHIFT;
 
-	if (ext4_nonda_switch(inode->i_sb) || ext4_verity_in_progress(inode)) {
+	if (ext4_verity_in_progress(inode)) {
 		*fsdata = (void *)FALL_BACK_TO_NONDELALLOC;
 		return ext4_write_begin(file, mapping, pos,
 					len, pagep, fsdata);
@@ -6117,8 +6083,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 		goto retry_alloc;
 
 	/* Delalloc case is easy... */
-	if (test_opt(inode->i_sb, DELALLOC) &&
-	    !ext4_nonda_switch(inode->i_sb)) {
+	if (test_opt(inode->i_sb, DELALLOC)) {
 		do {
 			err = block_page_mkwrite(vma, vmf,
 						   ext4_da_get_block_prep);
-- 
2.39.2



* Re: [RFC PATCH 01/16] ext4: correct the start block of counting reserved clusters
  2023-08-24  9:26 ` [RFC PATCH 01/16] ext4: correct the start block of counting reserved clusters Zhang Yi
@ 2023-08-30 13:10   ` Jan Kara
  2023-10-06  2:33     ` Theodore Ts'o
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2023-08-30 13:10 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, chengzhihao1, yukuai3

On Thu 24-08-23 17:26:04, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When the bigalloc feature is enabled, we need to count and update
> reserved clusters before removing a delayed-only extent_status entry.
> {init|count|get}_rsvd() already do this, but the start block number
> of this counting isn't correct in the following case.
> 
>   lblk            end
>    |               |
>    v               v
>           -------------------------
>           |                       | orig_es
>           -------------------------
>                    ^              ^
>       len1 is 0    |     len2     |
> 
> If the start block of the orig_es entry found is bigger than lblk, we
> passed lblk as the start block to count_rsvd(), but the length is
> correct, so the range being counted is offset. Fix this by passing
> 'orig_es.es_lblk + len1' as the start block.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/extents_status.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index 6f7de14c0fa8..5e625ea4545d 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -1405,8 +1405,8 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  			}
>  		}
>  		if (count_reserved)
> -			count_rsvd(inode, lblk, orig_es.es_len - len1 - len2,
> -				   &orig_es, &rc);
> +			count_rsvd(inode, orig_es.es_lblk + len1,
> +				   orig_es.es_len - len1 - len2, &orig_es, &rc);
>  		goto out_get_reserved;
>  	}
>  
> -- 
> 2.39.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [RFC PATCH 02/16] ext4: make sure allocate pending entry not fail
  2023-08-24  9:26 ` [RFC PATCH 02/16] ext4: make sure allocate pending entry not fail Zhang Yi
@ 2023-08-30 13:25   ` Jan Kara
  2023-10-06  2:33     ` Theodore Ts'o
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2023-08-30 13:25 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, chengzhihao1, yukuai3

On Thu 24-08-23 17:26:05, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> __insert_pending() allocates memory in atomic context, so the
> allocation could fail, but we are not handling that failure now. It
> could cause ext4_es_remove_extent() to get a wrong reserved clusters
> count, and then the global data blocks reservation count would be
> incorrect. As with the extents_status entry preallocation, preallocate
> the pending entry outside of the i_es_lock with __GFP_NOFAIL to make
> sure __insert_pending() and __revise_pending() always succeed.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks sensible. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/extents_status.c | 123 ++++++++++++++++++++++++++++-----------
>  1 file changed, 89 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index 5e625ea4545d..f4b50652f0cc 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -152,8 +152,9 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan);
>  static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
>  		       struct ext4_inode_info *locked_ei);
> -static void __revise_pending(struct inode *inode, ext4_lblk_t lblk,
> -			     ext4_lblk_t len);
> +static int __revise_pending(struct inode *inode, ext4_lblk_t lblk,
> +			    ext4_lblk_t len,
> +			    struct pending_reservation **prealloc);
>  
>  int __init ext4_init_es(void)
>  {
> @@ -448,6 +449,19 @@ static void ext4_es_list_del(struct inode *inode)
>  	spin_unlock(&sbi->s_es_lock);
>  }
>  
> +static inline struct pending_reservation *__alloc_pending(bool nofail)
> +{
> +	if (!nofail)
> +		return kmem_cache_alloc(ext4_pending_cachep, GFP_ATOMIC);
> +
> +	return kmem_cache_zalloc(ext4_pending_cachep, GFP_KERNEL | __GFP_NOFAIL);
> +}
> +
> +static inline void __free_pending(struct pending_reservation *pr)
> +{
> +	kmem_cache_free(ext4_pending_cachep, pr);
> +}
> +
>  /*
>   * Returns true if we cannot fail to allocate memory for this extent_status
>   * entry and cannot reclaim it until its status changes.
> @@ -836,11 +850,12 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>  {
>  	struct extent_status newes;
>  	ext4_lblk_t end = lblk + len - 1;
> -	int err1 = 0;
> -	int err2 = 0;
> +	int err1 = 0, err2 = 0, err3 = 0;
>  	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>  	struct extent_status *es1 = NULL;
>  	struct extent_status *es2 = NULL;
> +	struct pending_reservation *pr = NULL;
> +	bool revise_pending = false;
>  
>  	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
>  		return;
> @@ -868,11 +883,17 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>  
>  	ext4_es_insert_extent_check(inode, &newes);
>  
> +	revise_pending = sbi->s_cluster_ratio > 1 &&
> +			 test_opt(inode->i_sb, DELALLOC) &&
> +			 (status & (EXTENT_STATUS_WRITTEN |
> +				    EXTENT_STATUS_UNWRITTEN));
>  retry:
>  	if (err1 && !es1)
>  		es1 = __es_alloc_extent(true);
>  	if ((err1 || err2) && !es2)
>  		es2 = __es_alloc_extent(true);
> +	if ((err1 || err2 || err3) && revise_pending && !pr)
> +		pr = __alloc_pending(true);
>  	write_lock(&EXT4_I(inode)->i_es_lock);
>  
>  	err1 = __es_remove_extent(inode, lblk, end, NULL, es1);
> @@ -897,13 +918,18 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>  		es2 = NULL;
>  	}
>  
> -	if (sbi->s_cluster_ratio > 1 && test_opt(inode->i_sb, DELALLOC) &&
> -	    (status & EXTENT_STATUS_WRITTEN ||
> -	     status & EXTENT_STATUS_UNWRITTEN))
> -		__revise_pending(inode, lblk, len);
> +	if (revise_pending) {
> +		err3 = __revise_pending(inode, lblk, len, &pr);
> +		if (err3 != 0)
> +			goto error;
> +		if (pr) {
> +			__free_pending(pr);
> +			pr = NULL;
> +		}
> +	}
>  error:
>  	write_unlock(&EXT4_I(inode)->i_es_lock);
> -	if (err1 || err2)
> +	if (err1 || err2 || err3)
>  		goto retry;
>  
>  	ext4_es_print_tree(inode);
> @@ -1311,7 +1337,7 @@ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end,
>  				rc->ndelonly--;
>  				node = rb_next(&pr->rb_node);
>  				rb_erase(&pr->rb_node, &tree->root);
> -				kmem_cache_free(ext4_pending_cachep, pr);
> +				__free_pending(pr);
>  				if (!node)
>  					break;
>  				pr = rb_entry(node, struct pending_reservation,
> @@ -1907,11 +1933,13 @@ static struct pending_reservation *__get_pending(struct inode *inode,
>   *
>   * @inode - file containing the cluster
>   * @lblk - logical block in the cluster to be added
> + * @prealloc - preallocated pending entry
>   *
>   * Returns 0 on successful insertion and -ENOMEM on failure.  If the
>   * pending reservation is already in the set, returns successfully.
>   */
> -static int __insert_pending(struct inode *inode, ext4_lblk_t lblk)
> +static int __insert_pending(struct inode *inode, ext4_lblk_t lblk,
> +			    struct pending_reservation **prealloc)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>  	struct ext4_pending_tree *tree = &EXT4_I(inode)->i_pending_tree;
> @@ -1937,10 +1965,15 @@ static int __insert_pending(struct inode *inode, ext4_lblk_t lblk)
>  		}
>  	}
>  
> -	pr = kmem_cache_alloc(ext4_pending_cachep, GFP_ATOMIC);
> -	if (pr == NULL) {
> -		ret = -ENOMEM;
> -		goto out;
> +	if (likely(*prealloc == NULL)) {
> +		pr = __alloc_pending(false);
> +		if (!pr) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +	} else {
> +		pr = *prealloc;
> +		*prealloc = NULL;
>  	}
>  	pr->lclu = lclu;
>  
> @@ -1970,7 +2003,7 @@ static void __remove_pending(struct inode *inode, ext4_lblk_t lblk)
>  	if (pr != NULL) {
>  		tree = &EXT4_I(inode)->i_pending_tree;
>  		rb_erase(&pr->rb_node, &tree->root);
> -		kmem_cache_free(ext4_pending_cachep, pr);
> +		__free_pending(pr);
>  	}
>  }
>  
> @@ -2029,10 +2062,10 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
>  				  bool allocated)
>  {
>  	struct extent_status newes;
> -	int err1 = 0;
> -	int err2 = 0;
> +	int err1 = 0, err2 = 0, err3 = 0;
>  	struct extent_status *es1 = NULL;
>  	struct extent_status *es2 = NULL;
> +	struct pending_reservation *pr = NULL;
>  
>  	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
>  		return;
> @@ -2052,6 +2085,8 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
>  		es1 = __es_alloc_extent(true);
>  	if ((err1 || err2) && !es2)
>  		es2 = __es_alloc_extent(true);
> +	if ((err1 || err2 || err3) && allocated && !pr)
> +		pr = __alloc_pending(true);
>  	write_lock(&EXT4_I(inode)->i_es_lock);
>  
>  	err1 = __es_remove_extent(inode, lblk, lblk, NULL, es1);
> @@ -2074,11 +2109,18 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
>  		es2 = NULL;
>  	}
>  
> -	if (allocated)
> -		__insert_pending(inode, lblk);
> +	if (allocated) {
> +		err3 = __insert_pending(inode, lblk, &pr);
> +		if (err3 != 0)
> +			goto error;
> +		if (pr) {
> +			__free_pending(pr);
> +			pr = NULL;
> +		}
> +	}
>  error:
>  	write_unlock(&EXT4_I(inode)->i_es_lock);
> -	if (err1 || err2)
> +	if (err1 || err2 || err3)
>  		goto retry;
>  
>  	ext4_es_print_tree(inode);
> @@ -2184,21 +2226,24 @@ unsigned int ext4_es_delayed_clu(struct inode *inode, ext4_lblk_t lblk,
>   * @inode - file containing the range
>   * @lblk - logical block defining the start of range
>   * @len  - length of range in blocks
> + * @prealloc - preallocated pending entry
>   *
>   * Used after a newly allocated extent is added to the extents status tree.
>   * Requires that the extents in the range have either written or unwritten
>   * status.  Must be called while holding i_es_lock.
>   */
> -static void __revise_pending(struct inode *inode, ext4_lblk_t lblk,
> -			     ext4_lblk_t len)
> +static int __revise_pending(struct inode *inode, ext4_lblk_t lblk,
> +			    ext4_lblk_t len,
> +			    struct pending_reservation **prealloc)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>  	ext4_lblk_t end = lblk + len - 1;
>  	ext4_lblk_t first, last;
>  	bool f_del = false, l_del = false;
> +	int ret = 0;
>  
>  	if (len == 0)
> -		return;
> +		return 0;
>  
>  	/*
>  	 * Two cases - block range within single cluster and block range
> @@ -2219,7 +2264,9 @@ static void __revise_pending(struct inode *inode, ext4_lblk_t lblk,
>  			f_del = __es_scan_range(inode, &ext4_es_is_delonly,
>  						first, lblk - 1);
>  		if (f_del) {
> -			__insert_pending(inode, first);
> +			ret = __insert_pending(inode, first, prealloc);
> +			if (ret < 0)
> +				goto out;
>  		} else {
>  			last = EXT4_LBLK_CMASK(sbi, end) +
>  			       sbi->s_cluster_ratio - 1;
> @@ -2227,9 +2274,11 @@ static void __revise_pending(struct inode *inode, ext4_lblk_t lblk,
>  				l_del = __es_scan_range(inode,
>  							&ext4_es_is_delonly,
>  							end + 1, last);
> -			if (l_del)
> -				__insert_pending(inode, last);
> -			else
> +			if (l_del) {
> +				ret = __insert_pending(inode, last, prealloc);
> +				if (ret < 0)
> +					goto out;
> +			} else
>  				__remove_pending(inode, last);
>  		}
>  	} else {
> @@ -2237,18 +2286,24 @@ static void __revise_pending(struct inode *inode, ext4_lblk_t lblk,
>  		if (first != lblk)
>  			f_del = __es_scan_range(inode, &ext4_es_is_delonly,
>  						first, lblk - 1);
> -		if (f_del)
> -			__insert_pending(inode, first);
> -		else
> +		if (f_del) {
> +			ret = __insert_pending(inode, first, prealloc);
> +			if (ret < 0)
> +				goto out;
> +		} else
>  			__remove_pending(inode, first);
>  
>  		last = EXT4_LBLK_CMASK(sbi, end) + sbi->s_cluster_ratio - 1;
>  		if (last != end)
>  			l_del = __es_scan_range(inode, &ext4_es_is_delonly,
>  						end + 1, last);
> -		if (l_del)
> -			__insert_pending(inode, last);
> -		else
> +		if (l_del) {
> +			ret = __insert_pending(inode, last, prealloc);
> +			if (ret < 0)
> +				goto out;
> +		} else
>  			__remove_pending(inode, last);
>  	}
> +out:
> +	return ret;
>  }
> -- 
> 2.39.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/16] ext4: more accurate metadata reservaion for delalloc mount option
  2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservaion for delalloc mount option Zhang Yi
                   ` (15 preceding siblings ...)
  2023-08-24  9:26 ` [RFC PATCH 16/16] ext4: drop ext4_nonda_switch() Zhang Yi
@ 2023-08-30 15:30 ` Jan Kara
  2023-09-01  2:33   ` Zhang Yi
  16 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2023-08-30 15:30 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, chengzhihao1, yukuai3

Hello!

On Thu 24-08-23 17:26:03, Zhang Yi wrote:
> The delayed allocation method allocates blocks during page writeback in
> ext4_writepages(), which cannot handle block allocation failures due to
> e.g. ENOSPC if it needs to acquire more extent blocks. In order to deal
> with this, commit '79f0be8d2e6e ("ext4: Switch to non delalloc mode when
> we are low on free blocks count.")' introduced ext4_nonda_switch() to
> convert to nodelalloc mode if the free block count is less than 150% of
> the dirty block count or below the watermark.

Well, that functionality is there mainly so that we can really allocate all
blocks available in the filesystem. But you are right that since we are not
reserving metadata blocks explicitly anymore, it is questionable whether
we really still need this.

> In the meantime, commit '27dd43854227 ("ext4:
> introduce reserved space")' reserves some of the file system space (2% or
> 4096 clusters, whichever is smaller). Both of these two solutions can
> make sure that space is not exhausted when mapping delalloc blocks in
> most cases, but they cannot guarantee it in all cases, which could lead
> to an infinite loop or data loss (please see patch 14 for details).

OK, I agree that in principle there could be problems due to percpu
counters inaccuracy etc. but were you able to reproduce the problem under
some at least somewhat realistic conditions? We were discussing making
free space percpu counters switch to exact counting in case we are running
tight on space to avoid these problems but it never proved to be a problem
in practice so we never bothered to actually implement it.
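
Something like this untested sketch is what I mean (the helper name is
made up; percpu_counter_sum_positive() is the existing exact primitive):

static bool ext4_enough_free_clusters(struct ext4_sb_info *sbi,
				      s64 nclusters)
{
	struct percpu_counter *fcc = &sbi->s_freeclusters_counter;

	/* Fast path: the cheap approximate read is far from the limit. */
	if (percpu_counter_read_positive(fcc) - nclusters >
	    EXT4_FREECLUSTERS_WATERMARK)
		return true;

	/* Tight on space: pay for the exact, summed-over-CPUs value. */
	return percpu_counter_sum_positive(fcc) >= nclusters;
}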

> This patch set wants to reserve metadata space more accurately for the
> delalloc mount option. The metadata block reservation is very tricky
> and is also related to the contiguity of physical blocks; an effective
> way is to reserve for the worst case, which means that every data block
> is discontinuous and one data block costs an extent entry. Reserving
> metadata space for the worst case can make sure enough blocks are
> reserved during data writeback, and the unused reservation space can be
> released after mapping the data blocks.

Well, as you say, there is a problem with the worst case estimates - either
you *heavily* overestimate the number of needed metadata blocks or the code
to estimate the number of needed metadata blocks is really complex. We used
to have estimates of needed metadata and we ditched that code (in favor of
reserved clusters) exactly because it was complex and suffered from
cornercases that were hard to fix. I haven't quite digested the other
patches in this series to judge which case is it but it seems to lean on
the "complex" side :).

So I'm somewhat skeptical this complexity is really needed but I can be
convinced :).

> After doing this, add a worker to submit delayed
> allocations to prevent excessive reservations. Finally, we could
> completely drop the policy of switching back to non-delayed allocation.

BTW the worker there in patch 15 seems really pointless. If you do:
queue_work(), flush_work() then you could just directly do the work inline
and get as a bonus more efficiency and proper lockdep tracking of
dependencies. But that's just a minor issue I have noticed.
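
That is, schematically:

	struct work_struct work;

	INIT_WORK(&work, ext4_da_convert_work_fn); /* hypothetical fn */
	queue_work(system_wq, &work);	/* hand off to a kworker... */
	flush_work(&work);		/* ...and immediately block on it */

is functionally just ext4_da_convert_work_fn(&work) called inline, only
with extra scheduling overhead and with the lock dependencies hidden
from the caller's lockdep context.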

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/16] ext4: more accurate metadata reservaion for delalloc mount option
  2023-08-30 15:30 ` [RFC PATCH 00/16] ext4: more accurate metadata reservaion for delalloc mount option Jan Kara
@ 2023-09-01  2:33   ` Zhang Yi
  0 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2023-09-01  2:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, yi.zhang, chengzhihao1, yukuai3

[-- Attachment #1: Type: text/plain, Size: 7245 bytes --]

Hello! Thanks for your reply.

On 2023/8/30 23:30, Jan Kara wrote:
> Hello!
> 
> On Thu 24-08-23 17:26:03, Zhang Yi wrote:
>> The delayed allocation method allocates blocks during page writeback in
>> ext4_writepages(), which cannot handle block allocation failures due to
>> e.g. ENOSPC if it needs to acquire more extent blocks. In order to deal
>> with this, commit '79f0be8d2e6e ("ext4: Switch to non delalloc mode when
>> we are low on free blocks count.")' introduced ext4_nonda_switch() to
>> convert to nodelalloc mode if the free block count is less than 150% of
>> the dirty block count or below the watermark.
> 
> Well, that functionality is there mainly so that we can really allocate all
> blocks available in the filesystem. But you are right that since we are not
> reserving metadata blocks explicitly anymore, it is questionable whether
> we really still need this.
> 
>> In the meantime, commit '27dd43854227 ("ext4:
>> introduce reserved space")' reserves some of the file system space (2% or
>> 4096 clusters, whichever is smaller). Both of these two solutions can
>> make sure that space is not exhausted when mapping delalloc blocks in
>> most cases, but they cannot guarantee it in all cases, which could lead
>> to an infinite loop or data loss (please see patch 14 for details).
> 
> OK, I agree that in principle there could be problems due to percpu
> counters inaccuracy etc. but were you able to reproduce the problem under
> some at least somewhat realistic conditions? We were discussing making
> free space percpu counters switch to exact counting in case we are running
> tight on space to avoid these problems but it never proved to be a problem
> in practice so we never bothered to actually implement it.
> 

Yes, we first caught this problem in our products and we reproduced it
when doing stress tests on a disk with low free space, but the frequency
is very low. After analyzing it we found the root cause, and Zhihao
helped to write a 100% reproducer below.

1. Apply the 'infinite_loop.diff' in the attachment, which adds debug
   info and delays into the ext4 code.
2. Run 'enospc.sh' on a virtual machine with 4 CPUs (important, because
   the CPU number affects EXT4_FREECLUSTERS_WATERMARK and thus the
   reproduction).

After several minutes, the writeback process will loop infinitely, and
other processes which rely on it will hang.

[  304.815575] INFO: task sync:7292 blocked for more than 153 seconds.
[  304.818130]       Not tainted 6.5.0-dirty #578
[  304.819926] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  304.822747] task:sync            state:D stack:0     pid:7292  ppid:1      flags:0x00004000
[  304.825677] Call Trace:
[  304.826538]  <TASK>
[  304.827307]  __schedule+0x577/0x12b0
[  304.828300]  ? sync_fs_one_sb+0x50/0x50
[  304.829433]  schedule+0x9d/0x1e0
[  304.830451]  wb_wait_for_completion+0x82/0xd0
[  304.831811]  ? cpuacct_css_alloc+0x100/0x100
[  304.833090]  sync_inodes_sb+0xf1/0x440
[  304.834207]  ? sync_fs_one_sb+0x50/0x50
[  304.835304]  sync_inodes_one_sb+0x21/0x30
[  304.836528]  iterate_supers+0xd2/0x180
[  304.837614]  ksys_sync+0x50/0xf0
[  304.838356]  __do_sys_sync+0x12/0x20
[  304.839207]  do_syscall_64+0x68/0xf0
[  304.839964]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

On the contrary, after tweaking the delay injection procedure a little,
we could reproduce the data loss problem easily.

1. Apply the 'data_loss.diff' in the attachment.
2. Run 'enospc.sh' like the previous one; then we get the error message
   below.

[   52.226320] EXT4-fs (sda): Delayed block allocation failed for inode 571 at logical offset 8 with max blocks 1 with error 28
[   52.229126] EXT4-fs (sda): This should not happen!! Data will be lost

>> This patch set wants to reserve metadata space more accurately for the
>> delalloc mount option. The metadata block reservation is very tricky
>> and is also related to the contiguity of physical blocks; an effective
>> way is to reserve for the worst case, which means that every data block
>> is discontinuous and one data block costs an extent entry. Reserving
>> metadata space for the worst case can make sure enough blocks are
>> reserved during data writeback, and the unused reservation space can be
>> released after mapping the data blocks.
> 
> Well, as you say, there is a problem with the worst case estimates - either
> you *heavily* overestimate the number of needed metadata blocks or the code
> to estimate the number of needed metadata blocks is really complex. We used
> to have estimates of needed metadata and we ditched that code (in favor of
> reserved clusters) exactly because it was complex and suffered from
> cornercases that were hard to fix. I haven't quite digested the other
> patches in this series to judge which case is it but it seems to lean on
> the "complex" side :).
> 
> So I'm somewhat skeptical this complexity is really needed but I can be
> convinced :).

I understand your concern. At first I tried to solve this problem with
other, simpler solutions, but failed. I suppose reserving blocks for the
worst case is the only way to cover all cases, and I noticed that xfs
also uses this reservation method, so I learned the implementation from
it, although ours is not exactly the same.

Although it's a worst-case reservation, it is not that complicated.
Firstly, the estimate formula is simple: just add the 'extent & node'
blocks calculated from the **total** number of delayed allocated data
blocks and the remaining btree heights (there is no need to care whether
the logical positions of the extents are contiguous or not; the btree
heights can be merged between the discontinuous extent entries of one
inode, which reduces the overestimate to some extent), see the sketch
below. Secondly, the method of reserving metadata blocks is similar to
that of reserving data blocks, just with a different estimate formula.
Fortunately, there already are data reservation helpers like
ext4_da_update_reserve_space() and ext4_da_release_reserve_space(), so
making this work only takes some minor changes. BTW, I don't really know
the ditched estimation code and the corner cases you mentioned, what
were they like?
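
A simplified sketch of the worst-case formula (the helper name is
hypothetical, and the real patches also account for the already existing
tree, but the shape is this):

/*
 * Worst case: every delalloc data block becomes a separate extent, so
 * size the extent tree for 'nr_blocks' leaf entries plus every index
 * level above them. Index entries are the same size as extent entries,
 * so one per-block fanout is close enough for an estimate.
 */
static unsigned int ext4_da_meta_worst_case(struct super_block *sb,
					    unsigned int nr_blocks)
{
	unsigned int per_block = (sb->s_blocksize -
				  sizeof(struct ext4_extent_header)) /
				 sizeof(struct ext4_extent);
	unsigned int meta = 0;

	while (nr_blocks > 1) {		/* one tree level per iteration */
		nr_blocks = DIV_ROUND_UP(nr_blocks, per_block);
		meta += nr_blocks;
	}
	return meta;
}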

Finally, maybe this reservation could bring other benefits in the long
run. For example, after we've done this, maybe we could also reserve
metadata for DIO and buffered IO with dioread_nolock in the future; then
we could safely drop EXT4_GET_BLOCKS_PRE_IO, which looks like a
compromise, and maybe we could get some performance improvement from
that (I haven't thought about it deeply, just a whim :) ). But that's a
different thing.

> 
>> After doing this, add a worker to submit delayed
>> allocations to prevent excessive reservations. Finally, we could
>> completely drop the policy of switching back to non-delayed allocation.
> 
> BTW the worker there in patch 15 seems really pointless. If you do:
> queue_work(), flush_work() then you could just directly do the work inline
> and get as a bonus more efficiency and proper lockdep tracking of
> dependencies. But that's just a minor issue I have noticed.
> 

Yes, I added this worker because I want to run the work asynchronously
when s_dirtyclusters_counter runs beyond a watermark; that way, the I/O
flow could be smoother. But I haven't implemented it that way yet
because I didn't know whether you would like this estimate solution; I
can do it if so.
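
Roughly like this (everything except s_dirtyclusters_counter is a
hypothetical name):

	/* kick the flusher only when reservations pile up */
	if (percpu_counter_read_positive(&sbi->s_dirtyclusters_counter) >
	    ext4_da_reserve_watermark(sbi))
		queue_work(sbi->s_da_flush_wq, &sbi->s_da_flush_work);
	/* no flush_work() here, the caller keeps going while it runs */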

Thanks,
Yi.



[-- Attachment #2: data_loss.diff --]
[-- Type: text/plain, Size: 2808 bytes --]

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 43775a6ca505..6772cbc74224 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2192,6 +2192,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
  * mapped so that it can be written out (and thus forward progress is
  * guaranteed). After mapping we submit all mapped pages for IO.
  */
+#include <linux/delay.h>
 static int mpage_map_and_submit_extent(handle_t *handle,
 				       struct mpage_da_data *mpd,
 				       bool *give_up_on_write)
@@ -2203,6 +2204,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 	int progress = 0;
 	ext4_io_end_t *io_end = mpd->io_submit.io_end;
 	struct ext4_io_end_vec *io_end_vec;
+	static int wait = 0;
 
 	io_end_vec = ext4_alloc_io_end_vec(io_end);
 	if (IS_ERR(io_end_vec))
@@ -2213,6 +2215,26 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 		if (err < 0) {
 			struct super_block *sb = inode->i_sb;
 
+			if (!wait && err == -ENOSPC) {
+				wait = 1;
+				if (!ext4_count_free_clusters(sb)) {
+					/*
+					 * Failed to allocate metadata block,
+					 * will trigger infinite loop and hung.
+					 */
+					pr_err("will hung\n");
+				} else {
+					/*
+					 * Failed to allocate data block, wait
+					 * test.sh to free a block.
+					 */
+					pr_err("wait free\n");
+					msleep(3000);
+					pr_err("after free, now %llu\n",
+						ext4_count_free_clusters(sb));
+				}
+			}
+
 			if (ext4_forced_shutdown(EXT4_SB(sb)) ||
 			    ext4_test_mount_flag(sb, EXT4_MF_FS_ABORTED))
 				goto invalidate_dirty_pages;
@@ -2888,6 +2910,10 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 	/* In case writeback began while the folio was unlocked */
 	folio_wait_stable(folio);
 
+	/* Use task name and DISCARD mount option as delay inject filter. */
+	if (!strcmp(current->comm, "dd") && test_opt(inode->i_sb, DISCARD))
+		msleep(3000);
+
 #ifdef CONFIG_FS_ENCRYPTION
 	ret = ext4_block_write_begin(folio, pos, len, ext4_da_get_block_prep);
 #else
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c94ebf704616..79f4e96b8691 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5715,6 +5715,15 @@ static int ext4_fill_super(struct super_block *sb, struct fs_context *fc)
 
 	/* Update the s_overhead_clusters if necessary */
 	ext4_update_overhead(sb, false);
+
+	if (!strcmp(sb->s_bdev->bd_disk->disk_name, "sda")) {
+		pr_err("r_blocks %lld s_resv_clusters %llu free %lld dirty %lld EXT4_FREECLUSTERS_WATERMARK %u\n",
+			(ext4_r_blocks_count(sbi->s_es) >> sbi->s_cluster_bits),
+			atomic64_read(&sbi->s_resv_clusters),
+			percpu_counter_read_positive(&sbi->s_freeclusters_counter),
+			percpu_counter_read_positive(&sbi->s_dirtyclusters_counter),
+			EXT4_FREECLUSTERS_WATERMARK);
+	}
 	return 0;
 
 free_sbi:

[-- Attachment #3: infinite_loop.diff --]
[-- Type: text/plain, Size: 2808 bytes --]

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 43775a6ca505..11e47a530435 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2192,6 +2192,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
  * mapped so that it can be written out (and thus forward progress is
  * guaranteed). After mapping we submit all mapped pages for IO.
  */
+#include <linux/delay.h>
 static int mpage_map_and_submit_extent(handle_t *handle,
 				       struct mpage_da_data *mpd,
 				       bool *give_up_on_write)
@@ -2203,6 +2204,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 	int progress = 0;
 	ext4_io_end_t *io_end = mpd->io_submit.io_end;
 	struct ext4_io_end_vec *io_end_vec;
+	static int wait = 0;
 
 	io_end_vec = ext4_alloc_io_end_vec(io_end);
 	if (IS_ERR(io_end_vec))
@@ -2213,6 +2215,26 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 		if (err < 0) {
 			struct super_block *sb = inode->i_sb;
 
+			if (!wait && err == -ENOSPC) {
+				wait = 1;
+				if (!ext4_count_free_clusters(sb)) {
+					/*
+					 * Failed to allocate data block, wait
+					 * test.sh to free a block.
+					 */
+					pr_err("wait free\n");
+					msleep(3000);
+					pr_err("after free, now %llu\n",
+						ext4_count_free_clusters(sb));
+				} else {
+					/*
+					 * Failed to allocate metadata block,
+					 * will trigger infinite loop and hung.
+					 */
+					pr_err("will hung\n");
+				}
+			}
+
 			if (ext4_forced_shutdown(EXT4_SB(sb)) ||
 			    ext4_test_mount_flag(sb, EXT4_MF_FS_ABORTED))
 				goto invalidate_dirty_pages;
@@ -2888,6 +2910,10 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 	/* In case writeback began while the folio was unlocked */
 	folio_wait_stable(folio);
 
+	/* Use task name and DISCARD mount option as delay inject filter. */
+	if (!strcmp(current->comm, "dd") && test_opt(inode->i_sb, DISCARD))
+		msleep(3000);
+
 #ifdef CONFIG_FS_ENCRYPTION
 	ret = ext4_block_write_begin(folio, pos, len, ext4_da_get_block_prep);
 #else
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c94ebf704616..79f4e96b8691 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5715,6 +5715,15 @@ static int ext4_fill_super(struct super_block *sb, struct fs_context *fc)
 
 	/* Update the s_overhead_clusters if necessary */
 	ext4_update_overhead(sb, false);
+
+	if (!strcmp(sb->s_bdev->bd_disk->disk_name, "sda")) {
+		pr_err("r_blocks %lld s_resv_clusters %llu free %lld dirty %lld EXT4_FREECLUSTERS_WATERMARK %u\n",
+			(ext4_r_blocks_count(sbi->s_es) >> sbi->s_cluster_bits),
+			atomic64_read(&sbi->s_resv_clusters),
+			percpu_counter_read_positive(&sbi->s_freeclusters_counter),
+			percpu_counter_read_positive(&sbi->s_dirtyclusters_counter),
+			EXT4_FREECLUSTERS_WATERMARK);
+	}
 	return 0;
 
 free_sbi:

[-- Attachment #4: enospc.sh --]
[-- Type: text/plain, Size: 908 bytes --]

#!/bin/bash

sysctl -w kernel.hung_task_timeout_secs=15
umount /root/temp
mkfs.ext4 -F -b 4096 /dev/sda 100M
mount /dev/sda /root/temp
dd if=/dev/zero of=/root/temp/file bs=4K count=1
for i in {0..1100}
do
	touch /root/temp/f_$i
	dd if=/dev/zero of=/root/temp/f_$i bs=4K count=1
	dd if=/dev/zero of=/root/temp/f_$i bs=4K count=1 seek=2
	dd if=/dev/zero of=/root/temp/f_$i bs=4K count=1 seek=4
	dd if=/dev/zero of=/root/temp/f_$i bs=4K count=1 seek=6
done
dd if=/dev/zero of=/root/temp/consumer bs=1M count=68
umount /root/temp

mount -odiscard /dev/sda /root/temp
for i in {0..1100}
do
	dd if=/dev/zero of=/root/temp/f_$i bs=4K count=1 seek=8 &
done
sleep 1
dmesg -c > /dev/null
wait
sync &
sleep 1
while true
do
	res=`dmesg -c`
	if [[ "$res" =~ "wait free" ]]
	then
		echo "delete file"
		rm -f /root/temp/file
		break;
	elif [[ "$res" =~ "will hung" ]]
	then
		echo "will hung"
		break;
	fi
	sleep 1
done


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 12/16] ext4: update reserved meta blocks in ext4_da_{release|update_reserve}_space()
  2023-08-24  9:26 ` [RFC PATCH 12/16] ext4: update reserved meta blocks in ext4_da_{release|update_reserve}_space() Zhang Yi
@ 2023-09-06  7:35   ` kernel test robot
  0 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2023-09-06  7:35 UTC (permalink / raw)
  To: Zhang Yi
  Cc: oe-lkp, lkp, linux-ext4, ying.huang, feng.tang, fengwei.yin,
	tytso, adilger.kernel, jack, yi.zhang, yi.zhang, chengzhihao1,
	yukuai3, oliver.sang



Hello,

kernel test robot noticed a 25.3% improvement of stress-ng.msg.ops_per_sec on:


commit: 235f4f5bfea93e33e13ba9d8c553d9cf613a58ee ("[RFC PATCH 12/16] ext4: update reserved meta blocks in ext4_da_{release|update_reserve}_space()")
url: https://github.com/intel-lab-lkp/linux/commits/Zhang-Yi/ext4-correct-the-start-block-of-counting-reserved-clusters/20230824-173242
base: https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git dev
patch link: https://lore.kernel.org/all/20230824092619.1327976-13-yi.zhang@huaweicloud.com/
patch subject: [RFC PATCH 12/16] ext4: update reserved meta blocks in ext4_da_{release|update_reserve}_space()

testcase: stress-ng
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

	nr_threads: 100%
	testtime: 60s
	class: pts
	test: msg
	cpufreq_governor: performance






Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230906/202309061536.1b59f59d-oliver.sang@intel.com

=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  pts/gcc-12/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp7/msg/stress-ng/60s

commit: 
  637653488a ("ext4: factor out common part of ext4_da_{release|update_reserve}_space()")
  235f4f5bfe ("ext4: update reserved meta blocks in ext4_da_{release|update_reserve}_space()")

637653488ad95a0c 235f4f5bfea93e33e13ba9d8c55 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      2359            +1.3%       2390        boot-time.idle
   9537488           -18.9%    7734959        cpuidle..usage
      3.17            +0.6        3.78        mpstat.cpu.all.usr%
    336783 ± 17%     -28.6%     240355 ± 18%  numa-meminfo.node1.Active
    336689 ± 17%     -28.6%     240299 ± 18%  numa-meminfo.node1.Active(anon)
    460770 ± 16%     -30.0%     322742 ± 19%  numa-meminfo.node1.Shmem
    450337 ±  5%     +43.9%     648260 ±  6%  numa-numastat.node0.local_node
    485519 ±  4%     +40.1%     680005 ±  5%  numa-numastat.node0.numa_hit
    680896 ±  3%     +10.0%     748983 ±  4%  numa-numastat.node1.numa_hit
    136742           -14.2%     117325        sched_debug.cpu.nr_switches.avg
    116638 ±  3%     -13.7%     100689 ±  2%  sched_debug.cpu.nr_switches.min
      0.31 ± 10%     +18.2%       0.36 ±  7%  sched_debug.cpu.nr_uninterruptible.avg
     36.67 ±  4%     +10.9%      40.67 ±  2%  vmstat.procs.r
    273206           -14.6%     233356        vmstat.system.cs
    104671            +4.9%     109774        vmstat.system.in
     46.83 ±  7%     -31.0%      32.33 ± 14%  perf-c2c.DRAM.local
     10329 ±  8%     +26.9%      13107 ±  6%  perf-c2c.DRAM.remote
      8544 ±  8%     +31.0%      11194 ±  6%  perf-c2c.HITM.remote
     30099 ±  7%     +15.1%      34636 ±  5%  perf-c2c.HITM.total
    485599 ±  4%     +40.1%     680300 ±  5%  numa-vmstat.node0.numa_hit
    450417 ±  5%     +44.0%     648555 ±  6%  numa-vmstat.node0.numa_local
     84324 ± 17%     -28.7%      60140 ± 18%  numa-vmstat.node1.nr_active_anon
    115201 ± 16%     -30.0%      80698 ± 19%  numa-vmstat.node1.nr_shmem
     84323 ± 17%     -28.7%      60140 ± 18%  numa-vmstat.node1.nr_zone_active_anon
    680786 ±  3%     +10.0%     749101 ±  4%  numa-vmstat.node1.numa_hit
 7.181e+08           +24.4%  8.932e+08        stress-ng.msg.ops
  11872476           +25.3%   14879722        stress-ng.msg.ops_per_sec
    153117 ±  4%     +39.5%     213629 ±  3%  stress-ng.time.involuntary_context_switches
      3843            +5.3%       4048        stress-ng.time.percent_of_cpu_this_job_got
      2316            +4.4%       2418        stress-ng.time.system_time
     91.29           +28.5%     117.34        stress-ng.time.user_time
   9318632           -14.7%    7951016        stress-ng.time.voluntary_context_switches
    112355            -3.2%     108777 ±  2%  proc-vmstat.nr_active_anon
    154552            -3.9%     148580        proc-vmstat.nr_shmem
    112355            -3.2%     108777 ±  2%  proc-vmstat.nr_zone_active_anon
   1168434           +22.5%    1430780        proc-vmstat.numa_hit
   1102166           +23.8%    1364548        proc-vmstat.numa_local
    206664            -4.7%     196864        proc-vmstat.pgactivate
   1211499           +21.6%    1473352        proc-vmstat.pgalloc_normal
    973632           +28.0%    1246003 ±  2%  proc-vmstat.pgfree
     92337 ±  4%    +158.2%     238396 ±  7%  turbostat.C1
      0.09 ±  5%      +0.2        0.25 ±  8%  turbostat.C1%
   9175706           -22.1%    7147445        turbostat.C1E
     30.93            -3.9       27.02        turbostat.C1E%
    247900 ±  3%     +18.3%     293206 ±  2%  turbostat.C6
      6.42 ±  3%      +0.7        7.14 ±  2%  turbostat.C6%
      0.10           +15.0%       0.12 ±  4%  turbostat.IPC
     13954 ±  4%    +232.9%      46453 ± 10%  turbostat.POLL
    211.09            +2.4%     216.18        turbostat.PkgWatt
     59.62            +3.8%      61.89        turbostat.RAMWatt
      0.01 ± 12%     +34.4%       0.01 ± 14%  perf-sched.sch_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
      0.01           +13.0%       0.01 ±  3%  perf-sched.sch_delay.avg.ms.schedule_preempt_disabled.rwsem_down_read_slowpath.down_read.sysvipc_proc_start
      0.01 ±  3%     +19.7%       0.01 ±  2%  perf-sched.sch_delay.avg.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.msgctl_down
      1.39 ± 52%    +350.2%       6.24 ± 63%  perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
      4.37 ±  3%     -12.2%       3.84 ±  4%  perf-sched.wait_and_delay.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
    189683           -20.7%     150342        perf-sched.wait_and_delay.count.schedule_preempt_disabled.rwsem_down_read_slowpath.down_read.msgctl_info.constprop
    103248           +15.7%     119498        perf-sched.wait_and_delay.count.schedule_preempt_disabled.rwsem_down_read_slowpath.down_read.sysvipc_proc_start
    859.83 ±  2%     +16.4%       1000 ±  4%  perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
     11.39 ± 22%     +88.6%      21.48 ± 38%  perf-sched.wait_and_delay.max.ms.schedule_preempt_disabled.rwsem_down_read_slowpath.down_read.msgctl_info.constprop
      0.55 ±  5%      -8.3%       0.50 ±  4%  perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
      4.36 ±  3%     -12.3%       3.83 ±  4%  perf-sched.wait_time.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
      1.57 ± 36%    +125.7%       3.55 ± 23%  perf-sched.wait_time.max.ms.__cond_resched.__kmem_cache_alloc_node.__kmalloc.alloc_msg.load_msg
      1.06 ± 19%    +109.5%       2.21 ± 25%  perf-sched.wait_time.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_call_function_single
 7.183e+09           +20.5%  8.658e+09        perf-stat.i.branch-instructions
  37085833           +19.0%   44140894        perf-stat.i.branch-misses
     24.92            +3.3       28.26        perf-stat.i.cache-miss-rate%
  53067160           +38.2%   73340095        perf-stat.i.cache-misses
 2.157e+08           +21.2%  2.614e+08        perf-stat.i.cache-references
    289025           -14.7%     246454        perf-stat.i.context-switches
      3.34           -13.6%       2.89        perf-stat.i.cpi
 1.282e+11            +4.3%  1.337e+11        perf-stat.i.cpu-cycles
     76033           -19.8%      60961        perf-stat.i.cpu-migrations
      2396           -24.1%       1818        perf-stat.i.cycles-between-cache-misses
 9.348e+09           +20.5%  1.126e+10        perf-stat.i.dTLB-loads
      0.00 ±  2%      -0.0        0.00 ±  5%  perf-stat.i.dTLB-store-miss-rate%
 5.595e+09           +21.4%  6.791e+09        perf-stat.i.dTLB-stores
 3.773e+10           +20.6%  4.552e+10        perf-stat.i.instructions
      0.32           +14.1%       0.36        perf-stat.i.ipc
      2.00            +4.3%       2.09        perf-stat.i.metric.GHz
    847.80           +37.2%       1162        perf-stat.i.metric.K/sec
    349.01           +20.7%     421.38        perf-stat.i.metric.M/sec
     96.81            +0.9       97.69        perf-stat.i.node-load-miss-rate%
  34681142           +41.8%   49163827        perf-stat.i.node-load-misses
    571211 ± 12%     -19.2%     461475 ±  7%  perf-stat.i.node-loads
     69.19            +4.7       73.94        perf-stat.i.node-store-miss-rate%
  12432335           +42.8%   17748149        perf-stat.i.node-store-misses
   5384441           +11.0%    5974384        perf-stat.i.node-stores
     24.59            +3.5       28.04        perf-stat.overall.cache-miss-rate%
      3.40           -13.6%       2.94        perf-stat.overall.cpi
      2416           -24.6%       1823        perf-stat.overall.cycles-between-cache-misses
      0.00 ±  2%      -0.0        0.00 ±  6%  perf-stat.overall.dTLB-store-miss-rate%
      0.29           +15.7%       0.34        perf-stat.overall.ipc
     69.80            +5.0       74.83        perf-stat.overall.node-store-miss-rate%
  7.07e+09           +20.5%   8.52e+09        perf-stat.ps.branch-instructions
  36445205           +19.0%   43382549        perf-stat.ps.branch-misses
  52231245           +38.2%   72171401        perf-stat.ps.cache-misses
 2.124e+08           +21.2%  2.574e+08        perf-stat.ps.cache-references
    284550           -14.8%     242556        perf-stat.ps.context-switches
 1.262e+11            +4.3%  1.316e+11        perf-stat.ps.cpu-cycles
     74864           -19.9%      59988        perf-stat.ps.cpu-migrations
 9.201e+09           +20.5%  1.109e+10        perf-stat.ps.dTLB-loads
 5.507e+09           +21.4%  6.684e+09        perf-stat.ps.dTLB-stores
 3.714e+10           +20.6%  4.479e+10        perf-stat.ps.instructions
  34137147           +41.7%   48383050        perf-stat.ps.node-load-misses
    561002 ± 12%     -18.8%     455270 ±  7%  perf-stat.ps.node-loads
  12237881           +42.7%   17467084        perf-stat.ps.node-store-misses
   5295571           +11.0%    5875597        perf-stat.ps.node-stores
 2.342e+12           +20.5%  2.823e+12        perf-stat.total.instructions
     11.71           -10.7        1.04        perf-profile.calltrace.cycles-pp.idr_find.ipc_obtain_object_check.do_msgrcv.do_syscall_64.entry_SYSCALL_64_after_hwframe
     11.06           -10.1        0.94 ±  2%  perf-profile.calltrace.cycles-pp.idr_find.ipc_obtain_object_check.do_msgsnd.do_syscall_64.entry_SYSCALL_64_after_hwframe
     15.49            -9.2        6.24        perf-profile.calltrace.cycles-pp.ipc_obtain_object_check.do_msgrcv.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgrcv
     15.74            -8.4        7.36        perf-profile.calltrace.cycles-pp.ipc_obtain_object_check.do_msgsnd.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgsnd
      7.11            -3.0        4.09        perf-profile.calltrace.cycles-pp.msgctl
      7.03            -3.0        4.02        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.msgctl
      7.01            -3.0        4.00        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.msgctl
      6.93            -3.0        3.94        perf-profile.calltrace.cycles-pp.ksys_msgctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.msgctl
      3.84            -1.7        2.14        perf-profile.calltrace.cycles-pp.msgctl_info.ksys_msgctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.msgctl
      2.74            -1.2        1.54 ±  2%  perf-profile.calltrace.cycles-pp.msgctl_down.ksys_msgctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.msgctl
      1.81 ±  2%      -1.2        0.62 ±  2%  perf-profile.calltrace.cycles-pp.down_read.msgctl_info.ksys_msgctl.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.59 ±  2%      -1.0        0.60 ±  2%  perf-profile.calltrace.cycles-pp.rwsem_down_read_slowpath.down_read.msgctl_info.ksys_msgctl.do_syscall_64
      1.71            -0.9        0.83 ±  4%  perf-profile.calltrace.cycles-pp.down_write.msgctl_down.ksys_msgctl.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.61            -0.8        0.80 ±  4%  perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write.msgctl_down.ksys_msgctl.do_syscall_64
      2.96            -0.7        2.25        perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
      2.90            -0.7        2.21        perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      2.90            -0.7        2.21        perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      2.90            -0.7        2.22        perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
      1.84            -0.5        1.33        perf-profile.calltrace.cycles-pp.seq_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.87            -0.5        1.36        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
      1.84            -0.5        1.34        perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
      1.85            -0.5        1.34        perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
      1.83            -0.5        1.33        perf-profile.calltrace.cycles-pp.seq_read_iter.seq_read.vfs_read.ksys_read.do_syscall_64
      1.89            -0.5        1.39        perf-profile.calltrace.cycles-pp.read
      1.87            -0.5        1.37        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.read
      0.92 ±  3%      -0.5        0.42 ± 44%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__percpu_counter_sum.msgctl_info.ksys_msgctl.do_syscall_64
      1.84            -0.4        1.39 ±  2%  perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      0.66            -0.4        0.25 ±100%  perf-profile.calltrace.cycles-pp.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      1.68            -0.4        1.28 ±  2%  perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
      1.66            -0.4        1.26 ±  2%  perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
      1.32            -0.4        0.94        perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      0.97            -0.4        0.61 ±  3%  perf-profile.calltrace.cycles-pp.sysvipc_proc_start.seq_read_iter.seq_read.vfs_read.ksys_read
      0.92 ±  2%      -0.3        0.58 ±  3%  perf-profile.calltrace.cycles-pp.down_read.sysvipc_proc_start.seq_read_iter.seq_read.vfs_read
      0.81            -0.3        0.51        perf-profile.calltrace.cycles-pp.up_write.msgctl_down.ksys_msgctl.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.84 ±  2%      -0.3        0.56 ±  4%  perf-profile.calltrace.cycles-pp.rwsem_down_read_slowpath.down_read.sysvipc_proc_start.seq_read_iter.seq_read
      0.64 ±  3%      -0.2        0.45 ± 44%  perf-profile.calltrace.cycles-pp.schedule.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.msgctl_down
      0.64 ±  3%      -0.2        0.45 ± 44%  perf-profile.calltrace.cycles-pp.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.msgctl_down.ksys_msgctl
      0.64 ±  3%      -0.2        0.44 ± 44%  perf-profile.calltrace.cycles-pp.__schedule.schedule.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write
      1.56 ±  2%      -0.2        1.40        perf-profile.calltrace.cycles-pp.__percpu_counter_sum.msgctl_info.ksys_msgctl.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.56            +0.1        0.61        perf-profile.calltrace.cycles-pp.ss_wakeup.do_msgrcv.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgrcv
      0.82 ±  5%      +0.1        0.96 ±  2%  perf-profile.calltrace.cycles-pp._copy_from_user.load_msg.do_msgsnd.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.66 ±  2%      +0.2        0.83        perf-profile.calltrace.cycles-pp.__check_object_size.load_msg.do_msgsnd.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.53            +0.2        0.73 ± 26%  perf-profile.calltrace.cycles-pp.__entry_text_start.__libc_msgrcv.stress_run
      0.70 ±  2%      +0.2        0.95 ±  2%  perf-profile.calltrace.cycles-pp.wake_up_q.do_msgrcv.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgrcv
      0.57 ±  3%      +0.4        0.94 ± 27%  perf-profile.calltrace.cycles-pp.__entry_text_start.__libc_msgsnd.stress_run
      1.58            +0.4        1.99        perf-profile.calltrace.cycles-pp.stress_msg.stress_run
      0.56 ±  2%      +0.4        0.98        perf-profile.calltrace.cycles-pp.___slab_alloc.__kmem_cache_alloc_node.__kmalloc.alloc_msg.load_msg
      0.00            +0.6        0.55 ±  2%  perf-profile.calltrace.cycles-pp.__x64_sys_msgsnd.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgsnd.stress_run
     44.35            +0.6       44.90        perf-profile.calltrace.cycles-pp.do_msgrcv.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgrcv.stress_run
      0.00            +0.6        0.56        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_safe_stack.__libc_msgsnd.stress_run
      2.27            +0.6        2.85        perf-profile.calltrace.cycles-pp.__list_del_entry_valid.do_msgrcv.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgrcv
      2.52            +0.6        3.12        perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.do_msgsnd.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgsnd
     44.94            +0.7       45.61        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgrcv.stress_run
      2.65            +0.7        3.36        perf-profile.calltrace.cycles-pp.memcg_slab_post_alloc_hook.__kmem_cache_alloc_node.__kmalloc.alloc_msg.load_msg
     45.15            +0.7       45.86        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__libc_msgrcv.stress_run
      3.08            +0.8        3.86 ±  2%  perf-profile.calltrace.cycles-pp.__kmem_cache_free.free_msg.do_msgrcv.do_syscall_64.entry_SYSCALL_64_after_hwframe
      2.41            +0.9        3.27        perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.do_msgrcv.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgrcv
     45.96            +0.9       46.84        perf-profile.calltrace.cycles-pp.__libc_msgrcv.stress_run
      2.56            +0.9        3.44        perf-profile.calltrace.cycles-pp.check_heap_object.__check_object_size.store_msg.do_msg_fill.do_msgrcv
      3.75            +1.1        4.82 ±  9%  perf-profile.calltrace.cycles-pp._copy_to_user.store_msg.do_msg_fill.do_msgrcv.do_syscall_64
      3.16            +1.1        4.24        perf-profile.calltrace.cycles-pp.__check_object_size.store_msg.do_msg_fill.do_msgrcv.do_syscall_64
      3.60            +1.2        4.78 ±  2%  perf-profile.calltrace.cycles-pp.__slab_free.free_msg.do_msgrcv.do_syscall_64.entry_SYSCALL_64_after_hwframe
      4.78            +1.4        6.22        perf-profile.calltrace.cycles-pp.__kmem_cache_alloc_node.__kmalloc.alloc_msg.load_msg.do_msgsnd
      4.99            +1.5        6.46        perf-profile.calltrace.cycles-pp.__kmalloc.alloc_msg.load_msg.do_msgsnd.do_syscall_64
      5.31            +1.7        6.97        perf-profile.calltrace.cycles-pp.alloc_msg.load_msg.do_msgsnd.do_syscall_64.entry_SYSCALL_64_after_hwframe
     36.29            +1.7       37.99        perf-profile.calltrace.cycles-pp.do_msgsnd.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgsnd.stress_run
      6.79            +1.9        8.66        perf-profile.calltrace.cycles-pp.load_msg.do_msgsnd.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgsnd
     37.42            +2.0       39.42        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgsnd.stress_run
      7.34            +2.2        9.50 ±  2%  perf-profile.calltrace.cycles-pp.free_msg.do_msgrcv.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgrcv
      2.59            +2.2        4.76        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.do_msgrcv.do_syscall_64.entry_SYSCALL_64_after_hwframe
      7.33            +2.3        9.62 ±  4%  perf-profile.calltrace.cycles-pp.store_msg.do_msg_fill.do_msgrcv.do_syscall_64.entry_SYSCALL_64_after_hwframe
      3.28            +2.4        5.66        perf-profile.calltrace.cycles-pp._raw_spin_lock.do_msgrcv.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgrcv
      7.85            +2.4       10.24 ±  4%  perf-profile.calltrace.cycles-pp.do_msg_fill.do_msgrcv.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgrcv
     38.05            +2.4       40.50        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__libc_msgsnd.stress_run
     39.32            +2.8       42.12        perf-profile.calltrace.cycles-pp.__libc_msgsnd.stress_run
      6.28            +3.7       10.01        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.do_msgsnd.do_syscall_64.entry_SYSCALL_64_after_hwframe
      7.10            +4.1       11.20        perf-profile.calltrace.cycles-pp._raw_spin_lock.do_msgsnd.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_msgsnd
     87.71            +4.3       92.00        perf-profile.calltrace.cycles-pp.stress_run
     22.97           -21.0        2.02        perf-profile.children.cycles-pp.idr_find
     31.41           -17.7       13.70        perf-profile.children.cycles-pp.ipc_obtain_object_check
      7.13            -3.0        4.12        perf-profile.children.cycles-pp.msgctl
      6.93            -3.0        3.94        perf-profile.children.cycles-pp.ksys_msgctl
      3.84            -1.7        2.14        perf-profile.children.cycles-pp.msgctl_info
      2.74            -1.5        1.21 ±  3%  perf-profile.children.cycles-pp.down_read
      1.49            -1.3        0.18 ±  4%  perf-profile.children.cycles-pp._raw_spin_lock_irq
      2.44 ±  2%      -1.3        1.16 ±  3%  perf-profile.children.cycles-pp.rwsem_down_read_slowpath
      2.74            -1.2        1.54 ±  2%  perf-profile.children.cycles-pp.msgctl_down
      1.71            -0.9        0.84 ±  4%  perf-profile.children.cycles-pp.down_write
     91.47            -0.8       90.65        perf-profile.children.cycles-pp.do_syscall_64
      1.61            -0.8        0.80 ±  4%  perf-profile.children.cycles-pp.rwsem_down_write_slowpath
      1.42            -0.8        0.63        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
      2.96            -0.7        2.25        perf-profile.children.cycles-pp.secondary_startup_64_no_verify
      2.96            -0.7        2.25        perf-profile.children.cycles-pp.cpu_startup_entry
      2.96            -0.7        2.25        perf-profile.children.cycles-pp.do_idle
      2.90            -0.7        2.22        perf-profile.children.cycles-pp.start_secondary
      1.84            -0.5        1.33        perf-profile.children.cycles-pp.seq_read
      1.84            -0.5        1.33        perf-profile.children.cycles-pp.seq_read_iter
      1.85            -0.5        1.34        perf-profile.children.cycles-pp.ksys_read
      1.85            -0.5        1.34        perf-profile.children.cycles-pp.vfs_read
      1.89            -0.5        1.39        perf-profile.children.cycles-pp.read
      1.88            -0.5        1.42 ±  2%  perf-profile.children.cycles-pp.cpuidle_idle_call
      0.99            -0.4        0.56        perf-profile.children.cycles-pp.rwsem_wake
      1.72            -0.4        1.30 ±  2%  perf-profile.children.cycles-pp.cpuidle_enter
      1.71            -0.4        1.29 ±  2%  perf-profile.children.cycles-pp.cpuidle_enter_state
      1.35            -0.4        0.95        perf-profile.children.cycles-pp.intel_idle
      0.49 ±  2%      -0.4        0.09 ±  5%  perf-profile.children.cycles-pp.up_read
      3.01 ±  3%      -0.4        2.65 ±  4%  perf-profile.children.cycles-pp.__schedule
      0.97            -0.4        0.61 ±  3%  perf-profile.children.cycles-pp.sysvipc_proc_start
     92.22            -0.3       91.87        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
      1.04            -0.3        0.70        perf-profile.children.cycles-pp.try_to_wake_up
      0.81            -0.3        0.51        perf-profile.children.cycles-pp.up_write
      2.67 ±  3%      -0.3        2.38 ±  5%  perf-profile.children.cycles-pp.schedule
      0.35 ±  3%      -0.3        0.10 ±  5%  perf-profile.children.cycles-pp.idr_get_next_ul
      0.35 ±  3%      -0.2        0.10        perf-profile.children.cycles-pp.idr_get_next
      1.66 ±  3%      -0.2        1.42 ±  5%  perf-profile.children.cycles-pp.schedule_preempt_disabled
      0.40            -0.2        0.17 ±  4%  perf-profile.children.cycles-pp.rwsem_optimistic_spin
      0.44 ±  2%      -0.2        0.23 ±  2%  perf-profile.children.cycles-pp.sysvipc_proc_next
      0.67            -0.2        0.50        perf-profile.children.cycles-pp.flush_smp_call_function_queue
      1.57 ±  2%      -0.2        1.40        perf-profile.children.cycles-pp.__percpu_counter_sum
      0.66            -0.2        0.51        perf-profile.children.cycles-pp.__flush_smp_call_function_queue
      0.56            -0.1        0.44        perf-profile.children.cycles-pp.sched_ttwu_pending
      0.51            -0.1        0.40 ±  2%  perf-profile.children.cycles-pp.activate_task
      0.49            -0.1        0.39 ±  2%  perf-profile.children.cycles-pp.ttwu_do_activate
      0.27            -0.1        0.17 ±  4%  perf-profile.children.cycles-pp.msgctl_stat
      0.48            -0.1        0.38        perf-profile.children.cycles-pp.dequeue_task_fair
      0.41 ±  2%      -0.1        0.32        perf-profile.children.cycles-pp.dequeue_entity
      0.15 ±  3%      -0.1        0.06        perf-profile.children.cycles-pp.osq_lock
      0.15 ±  3%      -0.1        0.06 ±  6%  perf-profile.children.cycles-pp.rwsem_spin_on_owner
      0.61            -0.1        0.52 ±  2%  perf-profile.children.cycles-pp.update_load_avg
      0.37            -0.1        0.28        perf-profile.children.cycles-pp.select_task_rq
      0.36            -0.1        0.28 ±  2%  perf-profile.children.cycles-pp.select_task_rq_fair
      0.43            -0.1        0.35 ±  2%  perf-profile.children.cycles-pp.enqueue_task_fair
      0.36 ±  2%      -0.1        0.28 ±  2%  perf-profile.children.cycles-pp.schedule_idle
      0.32            -0.1        0.25 ±  2%  perf-profile.children.cycles-pp.enqueue_entity
      1.84            -0.1        1.78        perf-profile.children.cycles-pp.wake_up_q
      0.18 ±  3%      -0.0        0.14 ±  4%  perf-profile.children.cycles-pp.select_idle_sibling
      0.17 ±  2%      -0.0        0.13        perf-profile.children.cycles-pp.wake_affine
      0.20 ±  2%      -0.0        0.16 ±  3%  perf-profile.children.cycles-pp.ttwu_queue_wakelist
      0.16 ±  3%      -0.0        0.12        perf-profile.children.cycles-pp.__smp_call_single_queue
      0.14 ±  3%      -0.0        0.11 ±  5%  perf-profile.children.cycles-pp.select_idle_cpu
      0.15 ±  2%      -0.0        0.12 ±  3%  perf-profile.children.cycles-pp.available_idle_cpu
      0.14 ±  5%      -0.0        0.11 ±  4%  perf-profile.children.cycles-pp.update_curr
      0.12 ±  4%      -0.0        0.09 ±  5%  perf-profile.children.cycles-pp.task_h_load
      0.14 ±  2%      -0.0        0.11 ±  3%  perf-profile.children.cycles-pp.rwsem_mark_wake
      0.10 ±  3%      -0.0        0.07 ±  5%  perf-profile.children.cycles-pp.switch_fpu_return
      0.12            -0.0        0.09 ±  6%  perf-profile.children.cycles-pp.menu_select
      0.10 ±  4%      -0.0        0.08 ±  8%  perf-profile.children.cycles-pp.select_idle_core
      0.09 ±  4%      -0.0        0.07 ±  7%  perf-profile.children.cycles-pp.switch_mm_irqs_off
      0.17 ±  4%      -0.0        0.15 ±  3%  perf-profile.children.cycles-pp.__irq_exit_rcu
      0.16 ±  3%      -0.0        0.14 ±  4%  perf-profile.children.cycles-pp.__do_softirq
      0.12 ±  4%      -0.0        0.09 ±  4%  perf-profile.children.cycles-pp.llist_add_batch
      0.08 ±  4%      -0.0        0.06 ±  6%  perf-profile.children.cycles-pp.sched_mm_cid_migrate_to
      0.08            -0.0        0.06        perf-profile.children.cycles-pp.__switch_to
      0.07            -0.0        0.05        perf-profile.children.cycles-pp.llist_reverse_order
      0.11 ±  5%      -0.0        0.09 ±  5%  perf-profile.children.cycles-pp.prepare_task_switch
      0.10 ±  4%      -0.0        0.09 ±  7%  perf-profile.children.cycles-pp.__update_load_avg_se
      0.07 ±  7%      -0.0        0.05        perf-profile.children.cycles-pp.restore_fpregs_from_fpstate
      0.09 ±  5%      -0.0        0.08 ±  4%  perf-profile.children.cycles-pp.rebalance_domains
      0.08 ±  4%      -0.0        0.07        perf-profile.children.cycles-pp.update_rq_clock_task
      0.06 ±  8%      +0.0        0.07        perf-profile.children.cycles-pp.security_msg_queue_msgsnd
      0.10 ±  4%      +0.0        0.11 ±  4%  perf-profile.children.cycles-pp.__cond_resched
      0.09            +0.0        0.11 ±  3%  perf-profile.children.cycles-pp.kmalloc_slab
      0.09 ±  6%      +0.0        0.11        perf-profile.children.cycles-pp.is_vmalloc_addr
      0.13 ±  5%      +0.0        0.15 ±  2%  perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
      0.10 ±  7%      +0.0        0.13 ±  3%  perf-profile.children.cycles-pp.format_decode
      0.18 ±  2%      +0.0        0.21 ±  2%  perf-profile.children.cycles-pp.__list_add_valid
      0.09 ± 11%      +0.0        0.12 ± 13%  perf-profile.children.cycles-pp.security_msg_msg_alloc
      0.14 ±  6%      +0.0        0.16 ±  4%  perf-profile.children.cycles-pp.security_ipc_permission
      0.10 ±  4%      +0.0        0.13 ±  3%  perf-profile.children.cycles-pp.number
      0.02 ±141%      +0.0        0.06 ±  8%  perf-profile.children.cycles-pp.allocate_slab
      0.19 ±  3%      +0.0        0.23 ±  2%  perf-profile.children.cycles-pp.obj_cgroup_charge
      0.26 ±  4%      +0.0        0.30 ±  3%  perf-profile.children.cycles-pp.syscall_enter_from_user_mode
      0.24 ±  5%      +0.1        0.29 ±  4%  perf-profile.children.cycles-pp.__get_obj_cgroup_from_memcg
      0.57            +0.1        0.63        perf-profile.children.cycles-pp.ss_wakeup
      0.47 ±  3%      +0.1        0.53 ±  5%  perf-profile.children.cycles-pp.get_obj_cgroup_from_current
      0.34 ±  2%      +0.1        0.41        perf-profile.children.cycles-pp.syscall_return_via_sysret
      0.32            +0.1        0.39        perf-profile.children.cycles-pp.vsnprintf
      0.41            +0.1        0.48        perf-profile.children.cycles-pp.__put_user_8
      0.32            +0.1        0.39        perf-profile.children.cycles-pp.seq_printf
      0.34            +0.1        0.42        perf-profile.children.cycles-pp.sysvipc_msg_proc_show
      0.43            +0.1        0.51 ±  2%  perf-profile.children.cycles-pp.__get_user_8
      0.50            +0.1        0.59 ±  2%  perf-profile.children.cycles-pp.__x64_sys_msgsnd
      0.36 ±  2%      +0.1        0.45 ±  2%  perf-profile.children.cycles-pp.exit_to_user_mode_prepare
      0.29 ±  3%      +0.1        0.38 ±  2%  perf-profile.children.cycles-pp.__virt_addr_valid
      0.39 ±  3%      +0.1        0.48        perf-profile.children.cycles-pp.__check_heap_object
      0.34 ±  2%      +0.1        0.45 ±  2%  perf-profile.children.cycles-pp.ipcperms
      0.61            +0.1        0.74        perf-profile.children.cycles-pp.mod_objcg_state
      0.83 ±  5%      +0.2        0.98 ±  2%  perf-profile.children.cycles-pp._copy_from_user
      0.66            +0.2        0.82        perf-profile.children.cycles-pp.syscall_exit_to_user_mode
      0.48 ±  2%      +0.2        0.65        perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
      0.86            +0.2        1.06        perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
      1.12            +0.2        1.37        perf-profile.children.cycles-pp.__entry_text_start
      0.56 ±  2%      +0.4        0.98        perf-profile.children.cycles-pp.___slab_alloc
      1.62            +0.4        2.04        perf-profile.children.cycles-pp.stress_msg
      2.35            +0.6        2.92        perf-profile.children.cycles-pp.__list_del_entry_valid
     44.55            +0.6       45.16        perf-profile.children.cycles-pp.do_msgrcv
      2.68            +0.7        3.39        perf-profile.children.cycles-pp.memcg_slab_post_alloc_hook
      3.11            +0.8        3.90 ±  2%  perf-profile.children.cycles-pp.__kmem_cache_free
     46.38            +1.0       47.36        perf-profile.children.cycles-pp.__libc_msgrcv
      2.99            +1.0        4.00        perf-profile.children.cycles-pp.check_heap_object
      3.87            +1.1        4.97 ±  8%  perf-profile.children.cycles-pp._copy_to_user
      3.61            +1.2        4.79 ±  2%  perf-profile.children.cycles-pp.__slab_free
      4.25            +1.4        5.69        perf-profile.children.cycles-pp.__check_object_size
      4.86            +1.4        6.30        perf-profile.children.cycles-pp.__kmem_cache_alloc_node
      4.97            +1.5        6.44        perf-profile.children.cycles-pp.percpu_counter_add_batch
      5.03            +1.5        6.51        perf-profile.children.cycles-pp.__kmalloc
      5.33            +1.7        7.00        perf-profile.children.cycles-pp.alloc_msg
     36.43            +1.7       38.16        perf-profile.children.cycles-pp.do_msgsnd
      7.07            +2.1        9.13        perf-profile.children.cycles-pp.load_msg
      7.44            +2.2        9.62 ±  3%  perf-profile.children.cycles-pp.free_msg
      7.39            +2.3        9.68 ±  4%  perf-profile.children.cycles-pp.store_msg
      7.88            +2.4       10.28 ±  4%  perf-profile.children.cycles-pp.do_msg_fill
     39.75            +2.9       42.64        perf-profile.children.cycles-pp.__libc_msgsnd
     11.25            +4.2       15.48        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
     87.71            +4.3       92.00        perf-profile.children.cycles-pp.stress_run
     11.07            +6.5       17.57        perf-profile.children.cycles-pp._raw_spin_lock
     22.77           -20.8        1.98        perf-profile.self.cycles-pp.idr_find
      1.35            -0.4        0.95        perf-profile.self.cycles-pp.intel_idle
      0.33 ±  2%      -0.2        0.08        perf-profile.self.cycles-pp.idr_get_next_ul
      0.44 ±  2%      -0.2        0.21 ±  2%  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      0.26            -0.2        0.10 ±  4%  perf-profile.self.cycles-pp.rwsem_down_read_slowpath
      0.17 ±  2%      -0.1        0.03 ± 70%  perf-profile.self.cycles-pp._raw_spin_lock_irq
      0.15 ±  3%      -0.1        0.06        perf-profile.self.cycles-pp.osq_lock
      0.30 ±  2%      -0.0        0.25        perf-profile.self.cycles-pp.update_load_avg
      0.16 ±  2%      -0.0        0.12 ±  3%  perf-profile.self.cycles-pp.__schedule
      0.15 ±  2%      -0.0        0.12 ±  4%  perf-profile.self.cycles-pp.available_idle_cpu
      0.12 ±  4%      -0.0        0.09 ±  5%  perf-profile.self.cycles-pp.task_h_load
      0.09 ±  7%      -0.0        0.06 ±  6%  perf-profile.self.cycles-pp.rwsem_optimistic_spin
      0.12 ±  4%      -0.0        0.09 ±  4%  perf-profile.self.cycles-pp.llist_add_batch
      0.09 ±  4%      -0.0        0.07 ±  7%  perf-profile.self.cycles-pp.switch_mm_irqs_off
      0.07            -0.0        0.05        perf-profile.self.cycles-pp.llist_reverse_order
      0.17 ±  2%      -0.0        0.15 ±  4%  perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
      0.10 ±  5%      -0.0        0.08 ±  6%  perf-profile.self.cycles-pp.enqueue_task_fair
      0.08 ±  4%      -0.0        0.06        perf-profile.self.cycles-pp.__switch_to
      0.10            -0.0        0.08 ±  5%  perf-profile.self.cycles-pp.__update_load_avg_se
      0.08 ±  6%      -0.0        0.06 ±  6%  perf-profile.self.cycles-pp.sched_mm_cid_migrate_to
      0.07 ±  7%      -0.0        0.05        perf-profile.self.cycles-pp.restore_fpregs_from_fpstate
      0.07 ±  6%      -0.0        0.06        perf-profile.self.cycles-pp.update_rq_clock_task
      0.06 ±  7%      -0.0        0.05        perf-profile.self.cycles-pp.menu_select
      0.06 ±  6%      +0.0        0.07        perf-profile.self.cycles-pp.security_msg_queue_msgrcv
      0.07 ±  5%      +0.0        0.08 ±  4%  perf-profile.self.cycles-pp.is_vmalloc_addr
      0.06 ±  9%      +0.0        0.07        perf-profile.self.cycles-pp.__cond_resched
      0.08 ±  6%      +0.0        0.09 ±  5%  perf-profile.self.cycles-pp.kmalloc_slab
      0.10 ±  5%      +0.0        0.11 ±  4%  perf-profile.self.cycles-pp.vsnprintf
      0.08 ±  4%      +0.0        0.10 ±  3%  perf-profile.self.cycles-pp.format_decode
      0.09 ±  4%      +0.0        0.11 ±  4%  perf-profile.self.cycles-pp.do_msg_fill
      0.16 ±  3%      +0.0        0.19        perf-profile.self.cycles-pp.__list_add_valid
      0.09            +0.0        0.11 ±  4%  perf-profile.self.cycles-pp.number
      0.08 ±  4%      +0.0        0.10 ±  4%  perf-profile.self.cycles-pp.alloc_msg
      0.11 ±  3%      +0.0        0.13 ±  2%  perf-profile.self.cycles-pp.__kmalloc
      0.12 ±  4%      +0.0        0.14 ±  5%  perf-profile.self.cycles-pp.security_ipc_permission
      0.08 ± 14%      +0.0        0.11 ± 15%  perf-profile.self.cycles-pp.security_msg_msg_alloc
      0.21 ±  3%      +0.0        0.24        perf-profile.self.cycles-pp.syscall_enter_from_user_mode
      0.14 ±  2%      +0.0        0.17 ±  3%  perf-profile.self.cycles-pp.exit_to_user_mode_prepare
      0.16 ±  3%      +0.0        0.19 ±  3%  perf-profile.self.cycles-pp.obj_cgroup_charge
      0.18 ±  2%      +0.0        0.22 ±  2%  perf-profile.self.cycles-pp.store_msg
      0.23 ±  7%      +0.0        0.28 ±  4%  perf-profile.self.cycles-pp.__get_obj_cgroup_from_memcg
      0.28 ±  2%      +0.1        0.34 ±  2%  perf-profile.self.cycles-pp.do_syscall_64
      0.54            +0.1        0.60        perf-profile.self.cycles-pp.ss_wakeup
      0.34 ±  2%      +0.1        0.41        perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.40            +0.1        0.47        perf-profile.self.cycles-pp.__put_user_8
      0.29 ±  3%      +0.1        0.36        perf-profile.self.cycles-pp.__entry_text_start
      0.41            +0.1        0.48 ±  3%  perf-profile.self.cycles-pp.__get_user_8
      0.27 ±  3%      +0.1        0.35 ±  2%  perf-profile.self.cycles-pp.__virt_addr_valid
      0.37 ±  3%      +0.1        0.46        perf-profile.self.cycles-pp.__check_heap_object
      0.35            +0.1        0.44 ±  6%  perf-profile.self.cycles-pp.kfree
      0.46 ±  2%      +0.1        0.55 ±  2%  perf-profile.self.cycles-pp.__libc_msgsnd
      0.31 ±  3%      +0.1        0.41 ±  2%  perf-profile.self.cycles-pp.ipcperms
      0.47            +0.1        0.57 ±  2%  perf-profile.self.cycles-pp.__libc_msgrcv
      0.57            +0.1        0.69        perf-profile.self.cycles-pp.mod_objcg_state
      0.80 ±  5%      +0.1        0.93 ±  2%  perf-profile.self.cycles-pp._copy_from_user
      0.26            +0.1        0.40        perf-profile.self.cycles-pp.syscall_exit_to_user_mode
      0.48 ±  2%      +0.2        0.65        perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
      0.84            +0.2        1.02        perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.36            +0.2        0.55 ±  3%  perf-profile.self.cycles-pp.load_msg
      0.88            +0.2        1.08        perf-profile.self.cycles-pp.__kmem_cache_alloc_node
      0.61            +0.2        0.85        perf-profile.self.cycles-pp.__percpu_counter_sum
      0.76 ±  2%      +0.3        1.01 ±  2%  perf-profile.self.cycles-pp.wake_up_q
      0.67            +0.3        0.93        perf-profile.self.cycles-pp.__check_object_size
      0.48 ±  2%      +0.4        0.87 ±  2%  perf-profile.self.cycles-pp.___slab_alloc
      1.57            +0.4        1.99        perf-profile.self.cycles-pp.stress_msg
      0.77            +0.5        1.25        perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      2.14            +0.5        2.67        perf-profile.self.cycles-pp._raw_spin_lock
      2.32            +0.6        2.88        perf-profile.self.cycles-pp.__list_del_entry_valid
      2.49            +0.7        3.14 ±  2%  perf-profile.self.cycles-pp.__kmem_cache_free
      2.43            +0.7        3.08        perf-profile.self.cycles-pp.memcg_slab_post_alloc_hook
      2.58            +0.9        3.48        perf-profile.self.cycles-pp.check_heap_object
      3.82            +1.1        4.91 ±  8%  perf-profile.self.cycles-pp._copy_to_user
      3.24            +1.2        4.41        perf-profile.self.cycles-pp.do_msgrcv
      3.58            +1.2        4.74 ±  2%  perf-profile.self.cycles-pp.__slab_free
      4.86            +1.4        6.32        perf-profile.self.cycles-pp.percpu_counter_add_batch
      7.55            +3.2       10.74        perf-profile.self.cycles-pp.ipc_obtain_object_check
      3.20            +3.2        6.42        perf-profile.self.cycles-pp.do_msgsnd
     11.14            +4.2       15.31        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 01/16] ext4: correct the start block of counting reserved clusters
  2023-08-30 13:10   ` Jan Kara
@ 2023-10-06  2:33     ` Theodore Ts'o
  0 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2023-10-06  2:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Zhang Yi, linux-ext4, adilger.kernel, yi.zhang, chengzhihao1, yukuai3

On Wed, Aug 30, 2023 at 03:10:31PM +0200, Jan Kara wrote:
> On Thu 24-08-23 17:26:04, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@huawei.com>
> > 
> > When the bigalloc feature is enabled, we need to count and update
> > reserved clusters before removing a delayed-only extent_status entry.
> > {init|count|get}_rsvd() have already done this, but the start block
> > number of this counting isn't correct in the following case.
> > 
> >   lblk            end
> >    |               |
> >    v               v
> >           -------------------------
> >           |                       | orig_es
> >           -------------------------
> >                    ^              ^
> >       len1 is 0    |     len2     |
> > 
> > If the start block of the orig_es entry found is bigger than lblk, we
> > passed lblk as the start block to count_rsvd(), but the length is
> > correct, so the range to be counted is shifted. Fix this by passing
> > 'orig_es->lblk + len1' as the start block.
> > 
> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Looks good. Feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks, I've applied the first two patches in this series, since these
are bug fixes.  The rest of the patch series requires more analysis
and review.

						- Ted
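
A minimal, self-contained sketch of the range fix described in the
quoted commit message above. The types and function names (lblk_t,
es_entry, remove_delayed_range) are simplified stand-ins, not the
actual fs/ext4/extents_status.c code; only count_rsvd() and the
len1/len2 arithmetic mirror the commit message:

	typedef unsigned int lblk_t;

	struct es_entry {
		lblk_t lblk;	/* first logical block covered by the entry */
		lblk_t len;	/* number of blocks covered */
	};

	/* stand-in for count_rsvd(): count reserved clusters in
	 * [start, start + len) */
	static void count_rsvd(lblk_t start, lblk_t len)
	{
		/* ... walk the range and update the reserved count ... */
	}

	static void remove_delayed_range(struct es_entry *orig_es,
					 lblk_t lblk, lblk_t end)
	{
		/* blocks of orig_es before lblk and after end, as in
		 * the diagram above */
		lblk_t len1 = lblk > orig_es->lblk ?
			      lblk - orig_es->lblk : 0;
		lblk_t es_end = orig_es->lblk + orig_es->len - 1;
		lblk_t len2 = es_end > end ? es_end - end : 0;
		lblk_t count = orig_es->len - len1 - len2;

		/*
		 * The buggy version called count_rsvd(lblk, count): when
		 * orig_es->lblk is bigger than lblk (len1 == 0), that
		 * range starts too early.  Starting at
		 * orig_es->lblk + len1 is correct in both cases.
		 */
		count_rsvd(orig_es->lblk + len1, count);
	}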

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 02/16] ext4: make sure allocate pending entry not fail
  2023-08-30 13:25   ` Jan Kara
@ 2023-10-06  2:33     ` Theodore Ts'o
  0 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2023-10-06  2:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Zhang Yi, linux-ext4, adilger.kernel, yi.zhang, chengzhihao1, yukuai3

On Wed, Aug 30, 2023 at 03:25:03PM +0200, Jan Kara wrote:
> On Thu 24-08-23 17:26:05, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@huawei.com>
> > 
> > __insert_pending() allocates memory in atomic context, so the
> > allocation could fail, but we are not handling that failure now. This
> > could cause ext4_es_remove_extent() to miscount reserved clusters,
> > leaving the global data block reservation count incorrect. The same
> > applies to extents_status entry preallocation. Preallocate the pending
> > entry outside of i_es_lock with __GFP_NOFAIL to make sure
> > __insert_pending() and __revise_pending() always succeed.
> > 
> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Looks sensible. Feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks, I've applied the first two patches in this series, since these
are bug fixes.  The rest of the patch series requires more analysis
and review.

						- Ted
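
A sketch of the preallocation pattern described in the quoted commit
message, with hypothetical helper names (pending_entry, insert_pending,
remove_extent_example); the real implementation is in
fs/ext4/extents_status.c. The point is that __GFP_NOFAIL may block
until memory is available, which is not allowed under a spinlock, so
the entry is allocated before taking the lock and freed afterwards if
it was not consumed:

	#include <linux/rbtree.h>
	#include <linux/slab.h>
	#include <linux/spinlock.h>

	struct pending_entry {
		struct rb_node node;
		unsigned int lclu;	/* cluster with a pending reservation */
	};

	static struct kmem_cache *pending_cachep;

	/* stand-in for __insert_pending(): consumes 'prealloc' when a
	 * new node is needed, otherwise hands it back to the caller */
	static struct pending_entry *insert_pending(struct rb_root *tree,
						    unsigned int lclu,
						    struct pending_entry *prealloc)
	{
		/* ... rb-tree insert using 'prealloc' if needed ... */
		return prealloc;	/* unused here: caller frees it */
	}

	static void remove_extent_example(spinlock_t *i_es_lock,
					  struct rb_root *tree,
					  unsigned int lclu)
	{
		struct pending_entry *pr;

		/* process context: __GFP_NOFAIL never returns NULL */
		pr = kmem_cache_alloc(pending_cachep,
				      GFP_KERNEL | __GFP_NOFAIL);

		spin_lock(i_es_lock);
		pr = insert_pending(tree, lclu, pr);	/* cannot fail */
		spin_unlock(i_es_lock);

		if (pr)		/* preallocated entry was not used */
			kmem_cache_free(pending_cachep, pr);
	}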

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-10-06  2:34 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-24  9:26 [RFC PATCH 00/16] ext4: more accurate metadata reservaion for delalloc mount option Zhang Yi
2023-08-24  9:26 ` [RFC PATCH 01/16] ext4: correct the start block of counting reserved clusters Zhang Yi
2023-08-30 13:10   ` Jan Kara
2023-10-06  2:33     ` Theodore Ts'o
2023-08-24  9:26 ` [RFC PATCH 02/16] ext4: make sure allocate pending entry not fail Zhang Yi
2023-08-30 13:25   ` Jan Kara
2023-10-06  2:33     ` Theodore Ts'o
2023-08-24  9:26 ` [RFC PATCH 03/16] ext4: let __revise_pending() return the number of new inserts pendings Zhang Yi
2023-08-24  9:26 ` [RFC PATCH 04/16] ext4: count removed reserved blocks for delalloc only es entry Zhang Yi
2023-08-24  9:26 ` [RFC PATCH 05/16] ext4: pass real delayed status into ext4_es_insert_extent() Zhang Yi
2023-08-24  9:26 ` [RFC PATCH 06/16] ext4: move delalloc data reserve spcae updating " Zhang Yi
2023-08-24  9:26 ` [RFC PATCH 07/16] ext4: count inode's total delalloc data blocks into ext4_es_tree Zhang Yi
2023-08-24  9:26 ` [RFC PATCH 08/16] ext4: refactor delalloc space reservation Zhang Yi
2023-08-24  9:26 ` [RFC PATCH 09/16] ext4: count reserved metadata blocks for delalloc per inode Zhang Yi
2023-08-24  9:26 ` [RFC PATCH 10/16] ext4: reserve meta blocks in ext4_da_reserve_space() Zhang Yi
2023-08-24  9:26 ` [RFC PATCH 11/16] ext4: factor out common part of ext4_da_{release|update_reserve}_space() Zhang Yi
2023-08-24  9:26 ` [RFC PATCH 12/16] ext4: update reserved meta blocks in ext4_da_{release|update_reserve}_space() Zhang Yi
2023-09-06  7:35   ` kernel test robot
2023-08-24  9:26 ` [RFC PATCH 13/16] ext4: calculate the worst extent blocks needed of a delalloc es entry Zhang Yi
2023-08-24  9:26 ` [RFC PATCH 14/16] ext4: reserve extent blocks for delalloc Zhang Yi
2023-08-24  9:26 ` [RFC PATCH 15/16] ext4: flush delalloc blocks if no free space Zhang Yi
2023-08-24  9:26 ` [RFC PATCH 16/16] ext4: drop ext4_nonda_switch() Zhang Yi
2023-08-30 15:30 ` [RFC PATCH 00/16] ext4: more accurate metadata reservaion for delalloc mount option Jan Kara
2023-09-01  2:33   ` Zhang Yi
