linux-ext4.vger.kernel.org archive mirror
* [PATCH v2 00/12] multiblock allocator improvements
@ 2023-05-30 12:33 Ojaswin Mujoo
  2023-05-30 12:33 ` [PATCH v2 01/12] Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits" Ojaswin Mujoo
                   ` (12 more replies)
  0 siblings, 13 replies; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-05-30 12:33 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, Jan Kara, Kemeng Shi

** Changed since v1 [2] **

 1. Rebase over Kemeng's recent mballoc patchset [3]
 2. Picked up Kemeng's RVB on patch 1/12

 [2] https://lore.kernel.org/all/cover.1685009579.git.ojaswin@linux.ibm.com/
 [3] https://lore.kernel.org/all/20230417110617.2664129-1-shikemeng@huaweicloud.com/

** Changes since RFC [1] **

[1] https://lore.kernel.org/linux-ext4/cover.1674822311.git.ojaswin@linux.ibm.com/

1. Patch 1 reverts the commit 32c08693 

Lore link:
https://lore.kernel.org/linux-ext4/20230209194825.511043-15-shikemeng@huaweicloud.com/ 

So this patch was intended to remove a dead if-condition, but the condition
was not actually dead and removing it caused a performance regression.
Unfortunately I somehow missed that when I was reviewing his patchset and it
had already gone in, so I had to revert the commit. I've added details of the
regression and root cause in the revert commit. Also attaching the
performance numbers I observed:

Workload: fsmark - 100GiB ramdisk, 64 threads writing ~42000 files nodelalloc
-----
Baseline kernel:             ~5000 files/sec, ~9,000,000 extents scanned

This patchset rebased on 
ted/dev w/o revert patch:    ~8000 files/sec, ~7,000,000 ex scanned (+40-50%)

This patchset on ted/dev 
with revert patch:           ~30000 files/sec, ~800,000 ex scanned (+500%)
-----

2. Added Patch 13, which introduces symbolic names for the allocation criteria

3. In the CR1.5 patch (Patch 12), ext4_mb_choose_next_group_cr1_5() now also
considers the stripe size while trimming: if a stripe size is specified, we
round the goal length up to it. Here, with bigalloc, I've made an assumption
that the stripe size in fs blocks is always a multiple of cluster_ratio. This
assumption is based on a yet unmerged patch:
https://lore.kernel.org/linux-ext4/20230417110617.2664129-5-shikemeng@huaweicloud.com/

4. In CR1.5 patch, slight optimization in ext4_mb_choose_next_group_cr1_5()
based on Jan's feedback.

I've run the xfstests quick group on the patchset and plan to run the auto
group overnight. I'll report if anything breaks.

** Original Cover letter **

This patchset intends to address some of the shortcomings of the mb allocator
that we noticed while running various tests and workloads on a POWERPC
machine with 64k block size.

** Problems **

More specifically, we were seeing a sharp drop in performance when the
FS was highly fragmented (64K bs). We noticed that:

Problem 1: The prefetch logic seemed to be skipping BLOCK_UNINIT groups,
which resulted in the buddy and CR0/1 caches not being initialized for these
groups even though it could be done without any IO. (Not sure if there was
any history behind this design; do let me know if so.)

Problem 2: With a 64K bs FS, we were commonly seeing cases where CR1
would correctly identify a good group but due to very high
fragmentation, complex scan would exit early due to ac->ac_found >
s_mb_max_to_scan, resulting in trimming of the allocated len.

Problem 3: Even though our avg free extent was, say, 4MB and the original
request was merely 1 block of data, mballoc normalization kept adding PAs and
requesting 8MB chunks. This led to almost all the requests falling into the
slower CR 2, and with increased threads, we started seeing lots of CR3
requests as well.

** How did we address them **

Problem 1 (Patch 8,9): Make ext4_mb_prefetch also call
ext4_read_block_bitmap_nowait() in case of BLOCK_UNINIT, so it can init
the BG and exit early without an IO. Next, fix the calls to
prefetch_fini so these newly fetched BGs can have their buddy initialised.
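
To illustrate the intended flow (this is just a sketch based on the diffs in
patches 8 and 9 below, not new code):

  /* prefetch a batch of nr groups starting at 'group' */
  next_group = ext4_mb_prefetch(sb, group, nr, &prefetch_ios);
  /*
   * For BLOCK_UNINIT groups, ext4_read_block_bitmap_nowait() can build
   * the bitmap in memory, so such groups are now "fetched" without
   * submitting any IO.
   */
  ext4_mb_prefetch_fini(sb, next_group, nr);
  /*
   * prefetch_fini now covers the whole [next_group - nr, next_group)
   * range (patch 9), so these IO-free groups also get their buddy and
   * CR0/1 lists initialised via ext4_mb_init_group().
   */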

Problem 2 (Patch 7): When we come to the complex scan after CR1, my
understanding is that, because free/frag > goal_len, we can be sure that
there is at least one chunk big enough to accommodate the goal request.
Hence, we can skip the overhead of mb_find_extent() and other accounting for
each free extent and just process extents that are big enough.
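
The gist of the change (patch 7 below has the actual diff) is to first peek
at the length of the free extent and only pay the full per-extent cost when
it can actually hold the goal length:

  /* simplified sketch of the CR1 fast path inside the complex scan loop */
  if (ac->ac_criteria < CR2) {
          /* length of the free extent starting at bit i */
          j = mb_find_next_bit(bitmap, EXT4_CLUSTERS_PER_GROUP(sb), i);
          freelen = j - i;
          if (freelen < ac->ac_g_ex.fe_len) {
                  /* too small for the goal, skip it cheaply */
                  i = j;
                  free -= freelen;
                  continue;
          }
  }
  /* only now do mb_find_extent() and the related accounting */
  mb_find_extent(e4b, i, ac->ac_g_ex.fe_len, &ex);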

Problem 3 (Patch 11): To solve this problem, this patchset implements a
new allocation criteria (CR1.5 or CR1_5 in code). The idea is that if
CR1 fails to find a BG, it will jump to CR1.5. Here the flow is as
follows:

  * We make an assumption that if CR1 has failed, then none of the
    currently cached BGs have a big enough continuous extent to satisfy
    our request. In this case we fall to CR1.5.

  * In CR 1.5, we find the highest available free/frag BGs (from the CR1
    lists) and trim the PAs to this order so that we can find
    a BG without the IO overhead of CR2.

  * In parallel, prefetch will bring more groups into memory, and as more
    and more groups are cached, CR1.5 becomes a better replacement for
    CR2. This is because, for example, if all BGs are cached and we
    couldn't find anything in CR0/1, we can assume that no BG has a big
    enough continuous free extent and hence CR1.5 can directly trim and
    find the next biggest extent we could get. In this scenario, without
    CR1.5, we would have continued scanning in CR2, which would have
    most probably trimmed the request after scanning ~200 extents.
    
CR1.5 results in improved allocation speed at the cost of slightly increased
trimming of the allocated length.
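
For illustration, a very rough sketch of what the CR1.5 group chooser ends up
doing (the shape below is assumed for illustration only and reuses the
average fragment list helper from patch 10; patch 11 has the real code):

  /*
   * Rough sketch only: walk the average fragment size lists downwards
   * from the goal order and trim the goal to the best order that still
   * has suitable cached groups.
   */
  for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len) - 1;
       i >= 0; i--) {
          grp = ext4_mb_find_good_group_avg_frag_lists(ac, i);
          if (grp) {
                  /* trim the goal (and hence the PA) to this order */
                  ac->ac_g_ex.fe_len = 1 << i;
                  *group = grp->bb_group;
                  return;
          }
  }
  /* nothing cached is usable even after trimming, fall back to CR2 */
  *new_cr = CR2;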

** Performance Numbers **

Unless stated otherwise, these numbers are from fsmark and fio tests with 64k
BS, 64K pagesize on a 100Gi nvme0n1 with nodelalloc. These tests were performed
after the FS was fragmented until Avg Fragment Size == 4MB.

* Test 1: Writing ~40000 files of 64K each in a single directory (64 threads, fsmark)
* Test 2: Same as Test 1 on a 500GiB pmem device with dax
* Test 3: 5Gi write with mix of random and seq writes (fio)
* Test 4: 5Gi sequential writes (fio)

Here:
e = extents scanned
c = cr0 / cr1 / cr1.5 / cr2 / cr3 hits

+─────────+───────────────────────────────────+────────────────────────────────+
|         | Unpatched                         | Patched                        |
+─────────+───────────────────────────────────+────────────────────────────────+
| Test 1  | 6866 files/s                      | 13527 files/s                  |
|         | e: 8,188,644                      | e: 1,719,725                   |
|         | c: 381 / 330 / - / 4779 / 35534   | c: 381/ 280 / 33299/ 1000/ 6064|
+─────────+───────────────────────────────────+────────────────────────────────+
| Test 2  | 6927 files/s                      | 8422 files/s                   |
|         | e: 8,055,911                      | e: 261,268                     |
|         | cr: 1011 / 999 / - / 6153 / 32861 | c: 1721 / 1210 / 38093 / 0 / 0 |
+─────────+───────────────────────────────────+────────────────────────────────+
| Test 3  | 387 MiB/s                         | 443 MiB/s                      |
+─────────+───────────────────────────────────+────────────────────────────────+
| Test 4  | 3139 MiB/s                        | 3180 MiB/s                     |
+─────────+───────────────────────────────────+────────────────────────────────+

The numbers for the same tests with 4k bs and 64k pagesize are:

+─────────+───────────────────────────────────+────────────────────────────────+
|         | Unpatched                         | Patched                        |
+─────────+───────────────────────────────────+────────────────────────────────+
| Test 1  | 21618 files/s                     | 23528 files/s                  |
|         | e: 8,149,272                      | e: 223,013                     |
|         | c: 34 / 1380 / - / 5624 / 34710   | c: 34 / 1341 / 40387 / 0 / 0   |
+─────────+───────────────────────────────────+────────────────────────────────+
| Test 2  | 30739 files/s                     | 30946 files/s                  |
|         | e: 7,742,853                      | e: 2,176,475                   |
|         | c: 1131 / 2244 / - / 3914 / 34468 | c: 1596/1079/28425/1098/8547   |
+─────────+───────────────────────────────────+────────────────────────────────+
| Test 3  | 200 MiB/s                         | 186 MiB/s                      |
+─────────+───────────────────────────────────+────────────────────────────────+
| Test 4  | 621 MiB/s                         | 632 MiB/s                      |
+─────────+───────────────────────────────────+────────────────────────────────+

** Some Observations **

1. In the case of a highly fragmented 64k blocksize FS, most of the performance
is lost since we hold the BG lock while scanning a block group for the best
extent. As our goal len is 8MB and we only have ~4MB free extents, we take a
long time to scan, causing other threads to wait on the BG lock. This can be
seen in the perf diff of unpatched vs patched:

    83.14%    -24.89%  [kernel.vmlinux]            [k] do_raw_spin_lock

Using lockstat and the perf call graph, I was able to confirm that this was the
BG lock taken in ext4_mb_regular_allocator(), contended by other processes
trying to take the same BG's lock in ext4_mb_regular_allocator() and
__ext4_new_inode().
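
For context, the contended region is roughly the following (a simplified
sketch from my reading of ext4_mb_regular_allocator(), not a quote of the
code):

  ext4_lock_group(sb, group);             /* per-BG spinlock */
  /* ... re-validate the group ... */
  ext4_mb_complex_scan_group(ac, &e4b);   /* long scan under the lock when
                                             the group is heavily fragmented */
  ext4_unlock_group(sb, group);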


2. Currently, I do see some increase in fragmentation; I can take this
up as future work. Below are the e2freefrag results after Test 1 with
64k BS:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Unpatched:

Min. free extent: 128 KB
Max. free extent: 8000 KB
Avg. free extent: 4096 KB
Num. free extent: 12630

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
  128K...  256K-  :             1             2    0.00%
  256K...  512K-  :             1             6    0.00%
  512K... 1024K-  :             4            48    0.01%
    1M...    2M-  :             5           120    0.01%
    2M...    4M-  :         11947        725624   85.31%
    4M...    8M-  :           672         83796    9.85%

Patched:

Min. free extent: 64 KB
Max. free extent: 11648 KB
Avg. free extent: 2688 KB
Num. free extent: 18847

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
   64K...  128K-  :             1             1    0.00%
  128K...  256K-  :             2             5    0.00%
  256K...  512K-  :             1             5    0.00%
  512K... 1024K-  :           297          3909    0.48%
    1M...    2M-  :         11221        341065   42.13%
    2M...    4M-  :          4940        294260   36.35%
    4M...    8M-  :          2384        170169   21.02%
    8M...   16M-  :             1           182    0.02%

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

-------------------------------------

These changes are looking good from my end, so I'm posting them for
feedback from the ext4 community.

(gce-xfstests -c all quick went fine with no new failures reported)

Any thoughts/suggestions are welcome!! 

Regards,
Ojaswin

Ojaswin Mujoo (10):
  Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check
    in ext4_mb_check_limits"
  ext4: Convert mballoc cr (criteria) to enum
  ext4: Add per CR extent scanned counter
  ext4: Add counter to track successful allocation of goal length
  ext4: Avoid scanning smaller extents in BG during CR1
  ext4: Don't skip prefetching BLOCK_UNINIT groups
  ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs
  ext4: Abstract out logic to search average fragment list
  ext4: Add allocation criteria 1.5 (CR1_5)
  ext4: Give symbolic names to mballoc criterias

Ritesh Harjani (IBM) (2):
  ext4: mballoc: Remove useless setting of ac_criteria
  ext4: Remove unused extern variables declaration

 fs/ext4/ext4.h              |  70 +++++-
 fs/ext4/mballoc.c           | 453 ++++++++++++++++++++++++++----------
 fs/ext4/mballoc.h           |  16 +-
 fs/ext4/super.c             |  11 +-
 fs/ext4/sysfs.c             |   2 +
 include/trace/events/ext4.h |  18 +-
 6 files changed, 428 insertions(+), 142 deletions(-)

-- 
2.31.1



* [PATCH v2 01/12] Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits"
  2023-05-30 12:33 [PATCH v2 00/12] multiblock allocator improvements Ojaswin Mujoo
@ 2023-05-30 12:33 ` Ojaswin Mujoo
  2023-05-30 16:28   ` Sedat Dilek
  2023-05-30 12:33 ` [PATCH v2 02/12] ext4: mballoc: Remove useless setting of ac_criteria Ojaswin Mujoo
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-05-30 12:33 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, Jan Kara,
	Kemeng Shi, Ritesh Harjani

This reverts commit 32c0869370194ae5ac9f9f501953ef693040f6a1.

The reverted commit was intended to remove a dead check; however, it was
observed that this check was actually being used to exit early instead of
looping sbi->s_mb_max_to_scan times when we are able to find a free extent
bigger than the goal extent. Due to this, my performance tests (fsmark,
parallel file writes in a highly fragmented FS) were seeing a 2x-3x regression.

Example, the default value of the following variables is:

sbi->s_mb_max_to_scan = 200
sbi->s_mb_min_to_scan = 10

In ext4_mb_check_limits() if we find an extent smaller than goal, then we return
early and try again. This loop will go on until we have processed
sbi->s_mb_max_to_scan(=200) number of free extents at which point we exit and
just use whatever we have even if it is smaller than goal extent.

Now, the regression comes when we find an extent bigger than goal. Earlier, in
this case we would loop only sbi->s_mb_min_to_scan(=10) times and then just use
the bigger extent. However with commit 32c08693 that check was removed and hence
we would loop sbi->s_mb_max_to_scan(=200) times even though we have a big enough
free extent to satisfy the request. The only time we would exit early would be
when the free extent is *exactly* the size of our goal, which is a pretty
uncommon occurrence, so we would almost always end up looping 200 times.

Hence, revert the commit by adding the check back to fix the regression. Also
add a comment to outline this policy.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 fs/ext4/mballoc.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d4b6a2c1881d..7ac6d3524f29 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2063,7 +2063,7 @@ static void ext4_mb_check_limits(struct ext4_allocation_context *ac,
 	if (bex->fe_len < gex->fe_len)
 		return;
 
-	if (finish_group)
+	if (finish_group || ac->ac_found > sbi->s_mb_min_to_scan)
 		ext4_mb_use_best_found(ac, e4b);
 }
 
@@ -2075,6 +2075,20 @@ static void ext4_mb_check_limits(struct ext4_allocation_context *ac,
  * in the context. Later, the best found extent will be used, if
  * mballoc can't find good enough extent.
  *
+ * The algorithm used is roughly as follows:
+ *
+ * * If free extent found is exactly as big as goal, then
+ *   stop the scan and use it immediately
+ *
+ * * If free extent found is smaller than goal, then keep retrying
+ *   upto a max of sbi->s_mb_max_to_scan times (default 200). After
+ *   that stop scanning and use whatever we have.
+ *
+ * * If free extent found is bigger than goal, then keep retrying
+ *   upto a max of sbi->s_mb_min_to_scan times (default 10) before
+ *   stopping the scan and using the extent.
+ *
+ *
  * FIXME: real allocation policy is to be designed yet!
  */
 static void ext4_mb_measure_extent(struct ext4_allocation_context *ac,
-- 
2.31.1



* [PATCH v2 02/12] ext4: mballoc: Remove useless setting of ac_criteria
  2023-05-30 12:33 [PATCH v2 00/12] multiblock allocator improvements Ojaswin Mujoo
  2023-05-30 12:33 ` [PATCH v2 01/12] Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits" Ojaswin Mujoo
@ 2023-05-30 12:33 ` Ojaswin Mujoo
  2023-05-30 12:33 ` [PATCH v2 03/12] ext4: Remove unused extern variables declaration Ojaswin Mujoo
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-05-30 12:33 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, Jan Kara,
	Kemeng Shi, Ritesh Harjani (IBM)

From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>

There will be changes coming in future patches which will introduce a new
criteria for block allocation. This removes the useless setting of ac_criteria.
AFAIU, this is only used to differentiate between whether preallocated blocks
were allocated or the regular allocator was called to allocate blocks. Hence
this also adds debug prints to identify what type of block allocation was
done in ext4_mb_show_ac().

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 7ac6d3524f29..9d73f61458d4 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4627,7 +4627,6 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 			atomic_inc(&tmp_pa->pa_count);
 			ext4_mb_use_inode_pa(ac, tmp_pa);
 			spin_unlock(&tmp_pa->pa_lock);
-			ac->ac_criteria = 10;
 			read_unlock(&ei->i_prealloc_lock);
 			return true;
 		}
@@ -4670,7 +4669,6 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	}
 	if (cpa) {
 		ext4_mb_use_group_pa(ac, cpa);
-		ac->ac_criteria = 20;
 		return true;
 	}
 	return false;
@@ -5444,6 +5442,10 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
 			(unsigned long)ac->ac_b_ex.fe_logical,
 			(int)ac->ac_criteria);
 	mb_debug(sb, "%u found", ac->ac_found);
+	mb_debug(sb, "used pa: %s, ", ac->ac_pa ? "yes" : "no");
+	if (ac->ac_pa)
+		mb_debug(sb, "pa_type %s\n", ac->ac_pa->pa_type == MB_GROUP_PA ?
+			 "group pa" : "inode pa");
 	ext4_mb_show_pa(sb);
 }
 #else
-- 
2.31.1



* [PATCH v2 03/12] ext4: Remove unused extern variables declaration
  2023-05-30 12:33 [PATCH v2 00/12] multiblock allocator improvements Ojaswin Mujoo
  2023-05-30 12:33 ` [PATCH v2 01/12] Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits" Ojaswin Mujoo
  2023-05-30 12:33 ` [PATCH v2 02/12] ext4: mballoc: Remove useless setting of ac_criteria Ojaswin Mujoo
@ 2023-05-30 12:33 ` Ojaswin Mujoo
  2023-05-30 12:33 ` [PATCH v2 04/12] ext4: Convert mballoc cr (criteria) to enum Ojaswin Mujoo
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-05-30 12:33 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, Jan Kara,
	Kemeng Shi, Ritesh Harjani (IBM)

From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>

ext4_mb_stats & ext4_mb_max_to_scan are never used. We use
sbi->s_mb_stats and sbi->s_mb_max_to_scan instead.
Hence kill these extern declarations.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    | 2 --
 fs/ext4/mballoc.h | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b39a52b93a26..c075da665ec1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2835,8 +2835,6 @@ int ext4_fc_record_regions(struct super_block *sb, int ino,
 /* mballoc.c */
 extern const struct seq_operations ext4_mb_seq_groups_ops;
 extern const struct seq_operations ext4_mb_seq_structs_summary_ops;
-extern long ext4_mb_stats;
-extern long ext4_mb_max_to_scan;
 extern int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset);
 extern int ext4_mb_init(struct super_block *);
 extern int ext4_mb_release(struct super_block *);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 6d85ee8674a6..24b666e558f1 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -49,7 +49,7 @@
 #define MB_DEFAULT_MIN_TO_SCAN		10
 
 /*
- * with 'ext4_mb_stats' allocator will collect stats that will be
+ * with 's_mb_stats' allocator will collect stats that will be
  * shown at umount. The collecting costs though!
  */
 #define MB_DEFAULT_STATS		0
-- 
2.31.1



* [PATCH v2 04/12] ext4: Convert mballoc cr (criteria) to enum
  2023-05-30 12:33 [PATCH v2 00/12] multiblock allocator improvements Ojaswin Mujoo
                   ` (2 preceding siblings ...)
  2023-05-30 12:33 ` [PATCH v2 03/12] ext4: Remove unused extern variables declaration Ojaswin Mujoo
@ 2023-05-30 12:33 ` Ojaswin Mujoo
  2023-06-06 13:13   ` Jan Kara
  2023-05-30 12:33 ` [PATCH v2 05/12] ext4: Add per CR extent scanned counter Ojaswin Mujoo
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-05-30 12:33 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, Jan Kara,
	Kemeng Shi, Ritesh Harjani

Convert the mballoc criteria to an enum so it is easier to maintain, and
update the tracefiles to use the enum names. This change also makes
it easier to insert new criteria in the future.

There is no functional change in this patch.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/ext4.h              | 23 +++++++--
 fs/ext4/mballoc.c           | 96 ++++++++++++++++++-------------------
 include/trace/events/ext4.h | 16 ++++++-
 3 files changed, 82 insertions(+), 53 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c075da665ec1..f9a4eaa10c6a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -127,6 +127,23 @@ enum SHIFT_DIRECTION {
 	SHIFT_RIGHT,
 };
 
+/*
+ * Number of criterias defined. For each criteria, mballoc has slightly
+ * different way of finding the required blocks nad usually, higher the
+ * criteria the slower the allocation. We start at lower criterias and keep
+ * falling back to higher ones if we are not able to find any blocks.
+ */
+#define EXT4_MB_NUM_CRS 4
+/*
+ * All possible allocation criterias for mballoc
+ */
+enum criteria {
+	CR0,
+	CR1,
+	CR2,
+	CR3,
+};
+
 /*
  * Flags used in mballoc's allocation_context flags field.
  *
@@ -1542,9 +1559,9 @@ struct ext4_sb_info {
 	atomic_t s_bal_2orders;	/* 2^order hits */
 	atomic_t s_bal_cr0_bad_suggestions;
 	atomic_t s_bal_cr1_bad_suggestions;
-	atomic64_t s_bal_cX_groups_considered[4];
-	atomic64_t s_bal_cX_hits[4];
-	atomic64_t s_bal_cX_failed[4];		/* cX loop didn't find blocks */
+	atomic64_t s_bal_cX_groups_considered[EXT4_MB_NUM_CRS];
+	atomic64_t s_bal_cX_hits[EXT4_MB_NUM_CRS];
+	atomic64_t s_bal_cX_failed[EXT4_MB_NUM_CRS];		/* cX loop didn't find blocks */
 	atomic_t s_mb_buddies_generated;	/* number of buddies generated */
 	atomic64_t s_mb_generation_time;
 	atomic_t s_mb_lost_chunks;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 9d73f61458d4..97eaa22b907d 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -155,19 +155,19 @@
  * structures to decide the order in which groups are to be traversed for
  * fulfilling an allocation request.
  *
- * At CR = 0, we look for groups which have the largest_free_order >= the order
+ * At CR0 , we look for groups which have the largest_free_order >= the order
  * of the request. We directly look at the largest free order list in the data
  * structure (1) above where largest_free_order = order of the request. If that
  * list is empty, we look at remaining list in the increasing order of
- * largest_free_order. This allows us to perform CR = 0 lookup in O(1) time.
+ * largest_free_order. This allows us to perform CR0 lookup in O(1) time.
  *
- * At CR = 1, we only consider groups where average fragment size > request
+ * At CR1, we only consider groups where average fragment size > request
  * size. So, we lookup a group which has average fragment size just above or
  * equal to request size using our average fragment size group lists (data
  * structure 2) in O(1) time.
  *
  * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
- * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
+ * linear order which requires O(N) search time for each CR0 and CR1 phase.
  *
  * The regular allocator (using the buddy cache) supports a few tunables.
  *
@@ -410,7 +410,7 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
 static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
 
 static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
-			       ext4_group_t group, int cr);
+			       ext4_group_t group, enum criteria cr);
 
 static int ext4_try_to_trim_range(struct super_block *sb,
 		struct ext4_buddy *e4b, ext4_grpblk_t start,
@@ -860,7 +860,7 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
  * cr level needs an update.
  */
 static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
-			int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+			enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	struct ext4_group_info *iter, *grp;
@@ -885,8 +885,8 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
 		list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
 				    bb_largest_free_order_node) {
 			if (sbi->s_mb_stats)
-				atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
-			if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
+				atomic64_inc(&sbi->s_bal_cX_groups_considered[CR0]);
+			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR0))) {
 				grp = iter;
 				break;
 			}
@@ -898,7 +898,7 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
 
 	if (!grp) {
 		/* Increment cr and search again */
-		*new_cr = 1;
+		*new_cr = CR1;
 	} else {
 		*group = grp->bb_group;
 		ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
@@ -910,7 +910,7 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
  * order. Updates *new_cr if cr level needs an update.
  */
 static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
-		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+		enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	struct ext4_group_info *grp = NULL, *iter;
@@ -933,8 +933,8 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
 		list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
 				    bb_avg_fragment_size_node) {
 			if (sbi->s_mb_stats)
-				atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
-			if (likely(ext4_mb_good_group(ac, iter->bb_group, 1))) {
+				atomic64_inc(&sbi->s_bal_cX_groups_considered[CR1]);
+			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR1))) {
 				grp = iter;
 				break;
 			}
@@ -948,7 +948,7 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
 		*group = grp->bb_group;
 		ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
 	} else {
-		*new_cr = 2;
+		*new_cr = CR2;
 	}
 }
 
@@ -956,7 +956,7 @@ static inline int should_optimize_scan(struct ext4_allocation_context *ac)
 {
 	if (unlikely(!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN)))
 		return 0;
-	if (ac->ac_criteria >= 2)
+	if (ac->ac_criteria >= CR2)
 		return 0;
 	if (!ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
 		return 0;
@@ -1001,7 +1001,7 @@ next_linear_group(struct ext4_allocation_context *ac, int group, int ngroups)
  * @ngroups   Total number of groups
  */
 static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
-		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+		enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
 {
 	*new_cr = ac->ac_criteria;
 
@@ -1010,9 +1010,9 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
 		return;
 	}
 
-	if (*new_cr == 0) {
+	if (*new_cr == CR0) {
 		ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
-	} else if (*new_cr == 1) {
+	} else if (*new_cr == CR1) {
 		ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
 	} else {
 		/*
@@ -2409,13 +2409,13 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
  * for the allocation or not.
  */
 static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
-				ext4_group_t group, int cr)
+				ext4_group_t group, enum criteria cr)
 {
 	ext4_grpblk_t free, fragments;
 	int flex_size = ext4_flex_bg_size(EXT4_SB(ac->ac_sb));
 	struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
 
-	BUG_ON(cr < 0 || cr >= 4);
+	BUG_ON(cr < CR0 || cr >= EXT4_MB_NUM_CRS);
 
 	if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp) || !grp))
 		return false;
@@ -2429,7 +2429,7 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
 		return false;
 
 	switch (cr) {
-	case 0:
+	case CR0:
 		BUG_ON(ac->ac_2order == 0);
 
 		/* Avoid using the first bg of a flexgroup for data files */
@@ -2448,15 +2448,15 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
 			return false;
 
 		return true;
-	case 1:
+	case CR1:
 		if ((free / fragments) >= ac->ac_g_ex.fe_len)
 			return true;
 		break;
-	case 2:
+	case CR2:
 		if (free >= ac->ac_g_ex.fe_len)
 			return true;
 		break;
-	case 3:
+	case CR3:
 		return true;
 	default:
 		BUG();
@@ -2477,7 +2477,7 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
  * out"!
  */
 static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
-				     ext4_group_t group, int cr)
+				     ext4_group_t group, enum criteria cr)
 {
 	struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
 	struct super_block *sb = ac->ac_sb;
@@ -2497,7 +2497,7 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
 	free = grp->bb_free;
 	if (free == 0)
 		goto out;
-	if (cr <= 2 && free < ac->ac_g_ex.fe_len)
+	if (cr <= CR2 && free < ac->ac_g_ex.fe_len)
 		goto out;
 	if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
 		goto out;
@@ -2512,7 +2512,7 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
 			ext4_get_group_desc(sb, group, NULL);
 		int ret;
 
-		/* cr=0/1 is a very optimistic search to find large
+		/* cr=CR0/CR1 is a very optimistic search to find large
 		 * good chunks almost for free.  If buddy data is not
 		 * ready, then this optimization makes no sense.  But
 		 * we never skip the first block group in a flex_bg,
@@ -2520,7 +2520,7 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
 		 * and we want to make sure we locate metadata blocks
 		 * in the first block group in the flex_bg if possible.
 		 */
-		if (cr < 2 &&
+		if (cr < CR2 &&
 		    (!sbi->s_log_groups_per_flex ||
 		     ((group & ((1 << sbi->s_log_groups_per_flex) - 1)) != 0)) &&
 		    !(ext4_has_group_desc_csum(sb) &&
@@ -2626,7 +2626,7 @@ static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
 	ext4_group_t prefetch_grp = 0, ngroups, group, i;
-	int cr = -1, new_cr;
+	enum criteria cr, new_cr;
 	int err = 0, first_err = 0;
 	unsigned int nr = 0, prefetch_ios = 0;
 	struct ext4_sb_info *sbi;
@@ -2684,13 +2684,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	}
 
 	/* Let's just scan groups to find more-less suitable blocks */
-	cr = ac->ac_2order ? 0 : 1;
+	cr = ac->ac_2order ? CR0 : CR1;
 	/*
-	 * cr == 0 try to get exact allocation,
-	 * cr == 3  try to get anything
+	 * cr == CR0 try to get exact allocation,
+	 * cr == CR3 try to get anything
 	 */
 repeat:
-	for (; cr < 4 && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
+	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
 		ac->ac_criteria = cr;
 		/*
 		 * searching for the right group start
@@ -2717,7 +2717,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			 * spend a lot of time loading imperfect groups
 			 */
 			if ((prefetch_grp == group) &&
-			    (cr > 1 ||
+			    (cr > CR1 ||
 			     prefetch_ios < sbi->s_mb_prefetch_limit)) {
 				unsigned int curr_ios = prefetch_ios;
 
@@ -2759,9 +2759,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			}
 
 			ac->ac_groups_scanned++;
-			if (cr == 0)
+			if (cr == CR0)
 				ext4_mb_simple_scan_group(ac, &e4b);
-			else if (cr == 1 && sbi->s_stripe &&
+			else if (cr == CR1 && sbi->s_stripe &&
 				 !(ac->ac_g_ex.fe_len %
 				 EXT4_B2C(sbi, sbi->s_stripe)))
 				ext4_mb_scan_aligned(ac, &e4b);
@@ -2802,7 +2802,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			ac->ac_b_ex.fe_len = 0;
 			ac->ac_status = AC_STATUS_CONTINUE;
 			ac->ac_flags |= EXT4_MB_HINT_FIRST;
-			cr = 3;
+			cr = CR3;
 			goto repeat;
 		}
 	}
@@ -2927,36 +2927,36 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 	seq_printf(seq, "\tgroups_scanned: %u\n",  atomic_read(&sbi->s_bal_groups_scanned));
 
 	seq_puts(seq, "\tcr0_stats:\n");
-	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[0]));
+	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR0]));
 	seq_printf(seq, "\t\tgroups_considered: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_groups_considered[0]));
+		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR0]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_failed[0]));
+		   atomic64_read(&sbi->s_bal_cX_failed[CR0]));
 	seq_printf(seq, "\t\tbad_suggestions: %u\n",
 		   atomic_read(&sbi->s_bal_cr0_bad_suggestions));
 
 	seq_puts(seq, "\tcr1_stats:\n");
-	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[1]));
+	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR1]));
 	seq_printf(seq, "\t\tgroups_considered: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_groups_considered[1]));
+		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR1]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_failed[1]));
+		   atomic64_read(&sbi->s_bal_cX_failed[CR1]));
 	seq_printf(seq, "\t\tbad_suggestions: %u\n",
 		   atomic_read(&sbi->s_bal_cr1_bad_suggestions));
 
 	seq_puts(seq, "\tcr2_stats:\n");
-	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[2]));
+	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR2]));
 	seq_printf(seq, "\t\tgroups_considered: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_groups_considered[2]));
+		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR2]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_failed[2]));
+		   atomic64_read(&sbi->s_bal_cX_failed[CR2]));
 
 	seq_puts(seq, "\tcr3_stats:\n");
-	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[3]));
+	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR3]));
 	seq_printf(seq, "\t\tgroups_considered: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_groups_considered[3]));
+		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR3]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_failed[3]));
+		   atomic64_read(&sbi->s_bal_cX_failed[CR3]));
 	seq_printf(seq, "\textents_scanned: %u\n", atomic_read(&sbi->s_bal_ex_scanned));
 	seq_printf(seq, "\t\tgoal_hits: %u\n", atomic_read(&sbi->s_bal_goals));
 	seq_printf(seq, "\t\t2^n_hits: %u\n", atomic_read(&sbi->s_bal_2orders));
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index ebccf6a6aa1b..f062147ca32b 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -120,6 +120,18 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
 		{ EXT4_FC_REASON_INODE_JOURNAL_DATA,	"INODE_JOURNAL_DATA"}, \
 		{ EXT4_FC_REASON_ENCRYPTED_FILENAME,	"ENCRYPTED_FILENAME"})
 
+TRACE_DEFINE_ENUM(CR0);
+TRACE_DEFINE_ENUM(CR1);
+TRACE_DEFINE_ENUM(CR2);
+TRACE_DEFINE_ENUM(CR3);
+
+#define show_criteria(cr)                       \
+	__print_symbolic(cr,                    \
+			 { CR0, "CR0" },	\
+			 { CR1, "CR1" },        \
+			 { CR2, "CR2" },        \
+			 { CR3, "CR3" })
+
 TRACE_EVENT(ext4_other_inode_update_time,
 	TP_PROTO(struct inode *inode, ino_t orig_ino),
 
@@ -1063,7 +1075,7 @@ TRACE_EVENT(ext4_mballoc_alloc,
 	),
 
 	TP_printk("dev %d,%d inode %lu orig %u/%d/%u@%u goal %u/%d/%u@%u "
-		  "result %u/%d/%u@%u blks %u grps %u cr %u flags %s "
+		  "result %u/%d/%u@%u blks %u grps %u cr %s flags %s "
 		  "tail %u broken %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
@@ -1073,7 +1085,7 @@ TRACE_EVENT(ext4_mballoc_alloc,
 		  __entry->goal_len, __entry->goal_logical,
 		  __entry->result_group, __entry->result_start,
 		  __entry->result_len, __entry->result_logical,
-		  __entry->found, __entry->groups, __entry->cr,
+		  __entry->found, __entry->groups, show_criteria(__entry->cr),
 		  show_mballoc_flags(__entry->flags), __entry->tail,
 		  __entry->buddy ? 1 << __entry->buddy : 0)
 );
-- 
2.31.1



* [PATCH v2 05/12] ext4: Add per CR extent scanned counter
  2023-05-30 12:33 [PATCH v2 00/12] multiblock allocator improvements Ojaswin Mujoo
                   ` (3 preceding siblings ...)
  2023-05-30 12:33 ` [PATCH v2 04/12] ext4: Convert mballoc cr (criteria) to enum Ojaswin Mujoo
@ 2023-05-30 12:33 ` Ojaswin Mujoo
  2023-05-30 12:33 ` [PATCH v2 06/12] ext4: Add counter to track successful allocation of goal length Ojaswin Mujoo
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-05-30 12:33 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, Jan Kara,
	Kemeng Shi, Ritesh Harjani

This gives better visibility into the number of extents scanned in each
particular CR. For example, this information can be used to see how our
block group scanning logic is performing when the BG is fragmented.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    |  1 +
 fs/ext4/mballoc.c | 12 ++++++++++++
 fs/ext4/mballoc.h |  1 +
 3 files changed, 14 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f9a4eaa10c6a..2df4189ef778 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1553,6 +1553,7 @@ struct ext4_sb_info {
 	atomic_t s_bal_success;	/* we found long enough chunks */
 	atomic_t s_bal_allocated;	/* in blocks */
 	atomic_t s_bal_ex_scanned;	/* total extents scanned */
+	atomic_t s_bal_cX_ex_scanned[EXT4_MB_NUM_CRS];	/* total extents scanned */
 	atomic_t s_bal_groups_scanned;	/* number of groups scanned */
 	atomic_t s_bal_goals;	/* goal hits */
 	atomic_t s_bal_breaks;	/* too long searches */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 97eaa22b907d..a3106607486f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2104,6 +2104,7 @@ static void ext4_mb_measure_extent(struct ext4_allocation_context *ac,
 	BUG_ON(ac->ac_status != AC_STATUS_CONTINUE);
 
 	ac->ac_found++;
+	ac->ac_cX_found[ac->ac_criteria]++;
 
 	/*
 	 * The special case - take what you catch first
@@ -2278,6 +2279,7 @@ void ext4_mb_simple_scan_group(struct ext4_allocation_context *ac,
 			break;
 		}
 		ac->ac_found++;
+		ac->ac_cX_found[ac->ac_criteria]++;
 
 		ac->ac_b_ex.fe_len = 1 << i;
 		ac->ac_b_ex.fe_start = k << i;
@@ -2393,6 +2395,7 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
 			max = mb_find_extent(e4b, i, stripe, &ex);
 			if (max >= stripe) {
 				ac->ac_found++;
+				ac->ac_cX_found[ac->ac_criteria]++;
 				ex.fe_logical = 0xDEADF00D; /* debug value */
 				ac->ac_b_ex = ex;
 				ext4_mb_use_best_found(ac, e4b);
@@ -2930,6 +2933,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR0]));
 	seq_printf(seq, "\t\tgroups_considered: %llu\n",
 		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR0]));
+	seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR0]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
 		   atomic64_read(&sbi->s_bal_cX_failed[CR0]));
 	seq_printf(seq, "\t\tbad_suggestions: %u\n",
@@ -2939,6 +2943,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR1]));
 	seq_printf(seq, "\t\tgroups_considered: %llu\n",
 		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR1]));
+	seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR1]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
 		   atomic64_read(&sbi->s_bal_cX_failed[CR1]));
 	seq_printf(seq, "\t\tbad_suggestions: %u\n",
@@ -2948,6 +2953,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR2]));
 	seq_printf(seq, "\t\tgroups_considered: %llu\n",
 		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR2]));
+	seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR2]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
 		   atomic64_read(&sbi->s_bal_cX_failed[CR2]));
 
@@ -2955,6 +2961,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR3]));
 	seq_printf(seq, "\t\tgroups_considered: %llu\n",
 		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR3]));
+	seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR3]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
 		   atomic64_read(&sbi->s_bal_cX_failed[CR3]));
 	seq_printf(seq, "\textents_scanned: %u\n", atomic_read(&sbi->s_bal_ex_scanned));
@@ -4403,7 +4410,12 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
 		atomic_add(ac->ac_b_ex.fe_len, &sbi->s_bal_allocated);
 		if (ac->ac_b_ex.fe_len >= ac->ac_o_ex.fe_len)
 			atomic_inc(&sbi->s_bal_success);
+
 		atomic_add(ac->ac_found, &sbi->s_bal_ex_scanned);
+		for (int i=0; i<EXT4_MB_NUM_CRS; i++) {
+			atomic_add(ac->ac_cX_found[i], &sbi->s_bal_cX_ex_scanned[i]);
+		}
+
 		atomic_add(ac->ac_groups_scanned, &sbi->s_bal_groups_scanned);
 		if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
 				ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 24b666e558f1..acfdc204e15d 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -184,6 +184,7 @@ struct ext4_allocation_context {
 	__u16 ac_groups_scanned;
 	__u16 ac_groups_linear_remaining;
 	__u16 ac_found;
+	__u16 ac_cX_found[EXT4_MB_NUM_CRS];
 	__u16 ac_tail;
 	__u16 ac_buddy;
 	__u8 ac_status;
-- 
2.31.1



* [PATCH v2 06/12] ext4: Add counter to track successful allocation of goal length
  2023-05-30 12:33 [PATCH v2 00/12] multiblock allocator improvements Ojaswin Mujoo
                   ` (4 preceding siblings ...)
  2023-05-30 12:33 ` [PATCH v2 05/12] ext4: Add per CR extent scanned counter Ojaswin Mujoo
@ 2023-05-30 12:33 ` Ojaswin Mujoo
  2023-05-30 12:33 ` [PATCH v2 07/12] ext4: Avoid scanning smaller extents in BG during CR1 Ojaswin Mujoo
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-05-30 12:33 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, Jan Kara,
	Kemeng Shi, Ritesh Harjani

Track number of allocations where the length of blocks allocated is equal to the
length of goal blocks (post normalization). This metric could be useful when
making changes to the allocator logic in the future as it could give us
visibility into how often we trim our requests.

PS: ac_b_ex.fe_len might get modified due to preallocation efforts and
hence we use ac_f_ex.fe_len instead since we want to compare how much the
allocator was able to actually find.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    | 1 +
 fs/ext4/mballoc.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2df4189ef778..eae981ab2539 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1556,6 +1556,7 @@ struct ext4_sb_info {
 	atomic_t s_bal_cX_ex_scanned[EXT4_MB_NUM_CRS];	/* total extents scanned */
 	atomic_t s_bal_groups_scanned;	/* number of groups scanned */
 	atomic_t s_bal_goals;	/* goal hits */
+	atomic_t s_bal_len_goals;	/* len goal hits */
 	atomic_t s_bal_breaks;	/* too long searches */
 	atomic_t s_bal_2orders;	/* 2^order hits */
 	atomic_t s_bal_cr0_bad_suggestions;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a3106607486f..73e98a4d01f5 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2966,6 +2966,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 		   atomic64_read(&sbi->s_bal_cX_failed[CR3]));
 	seq_printf(seq, "\textents_scanned: %u\n", atomic_read(&sbi->s_bal_ex_scanned));
 	seq_printf(seq, "\t\tgoal_hits: %u\n", atomic_read(&sbi->s_bal_goals));
+	seq_printf(seq, "\t\tlen_goal_hits: %u\n", atomic_read(&sbi->s_bal_len_goals));
 	seq_printf(seq, "\t\t2^n_hits: %u\n", atomic_read(&sbi->s_bal_2orders));
 	seq_printf(seq, "\t\tbreaks: %u\n", atomic_read(&sbi->s_bal_breaks));
 	seq_printf(seq, "\t\tlost: %u\n", atomic_read(&sbi->s_mb_lost_chunks));
@@ -4420,6 +4421,8 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
 		if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
 				ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
 			atomic_inc(&sbi->s_bal_goals);
+		if (ac->ac_f_ex.fe_len == ac->ac_g_ex.fe_len)
+			atomic_inc(&sbi->s_bal_len_goals);
 		if (ac->ac_found > sbi->s_mb_max_to_scan)
 			atomic_inc(&sbi->s_bal_breaks);
 	}
-- 
2.31.1



* [PATCH v2 07/12] ext4: Avoid scanning smaller extents in BG during CR1
  2023-05-30 12:33 [PATCH v2 00/12] multiblock allocator improvements Ojaswin Mujoo
                   ` (5 preceding siblings ...)
  2023-05-30 12:33 ` [PATCH v2 06/12] ext4: Add counter to track successful allocation of goal length Ojaswin Mujoo
@ 2023-05-30 12:33 ` Ojaswin Mujoo
  2023-05-30 12:33 ` [PATCH v2 08/12] ext4: Don't skip prefetching BLOCK_UNINIT groups Ojaswin Mujoo
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-05-30 12:33 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, Jan Kara,
	Kemeng Shi, Ritesh Harjani

When we are inside ext4_mb_complex_scan_group() in CR1, we can be sure
that this group has at least 1 big enough continuous free extent to satisfy
our request because (free / fragments) > goal length.

Hence, instead of wasting time looping over smaller free extents, only
try to consider the free extent if we are sure that it has enough
continuous free space to satisfy goal length. This is particularly
useful when scanning highly fragmented BGs in CR1 as, without this
patch, the allocator might stop scanning early before reaching the big
enough free extent (due to ac_found > mb_max_to_scan), which causes us to
unnecessarily trim the request.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 73e98a4d01f5..c86565606359 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2308,7 +2308,7 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
 	struct super_block *sb = ac->ac_sb;
 	void *bitmap = e4b->bd_bitmap;
 	struct ext4_free_extent ex;
-	int i;
+	int i, j, freelen;
 	int free;
 
 	free = e4b->bd_info->bb_free;
@@ -2335,6 +2335,23 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
 			break;
 		}
 
+		if (ac->ac_criteria < CR2) {
+			/*
+			 * In CR1, we are sure that this group will
+			 * have a large enough continuous free extent, so skip
+			 * over the smaller free extents
+			 */
+			j = mb_find_next_bit(bitmap,
+						EXT4_CLUSTERS_PER_GROUP(sb), i);
+			freelen = j - i;
+
+			if (freelen < ac->ac_g_ex.fe_len) {
+				i = j;
+				free -= freelen;
+				continue;
+			}
+		}
+
 		mb_find_extent(e4b, i, ac->ac_g_ex.fe_len, &ex);
 		if (WARN_ON(ex.fe_len <= 0))
 			break;
-- 
2.31.1



* [PATCH v2 08/12] ext4: Don't skip prefetching BLOCK_UNINIT groups
  2023-05-30 12:33 [PATCH v2 00/12] multiblock allocator improvements Ojaswin Mujoo
                   ` (6 preceding siblings ...)
  2023-05-30 12:33 ` [PATCH v2 07/12] ext4: Avoid scanning smaller extents in BG during CR1 Ojaswin Mujoo
@ 2023-05-30 12:33 ` Ojaswin Mujoo
  2023-05-30 12:33 ` [PATCH v2 09/12] ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs Ojaswin Mujoo
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-05-30 12:33 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, Jan Kara,
	Kemeng Shi, Ritesh Harjani

Currently, ext4_mb_prefetch() and ext4_mb_prefetch_fini() skip
BLOCK_UNINIT groups since fetching their bitmaps doesn't need disk IO.
As a consequence, we end up not initializing the buddy structures and CR0/1
lists for these BGs, even though it can be done without any disk IO
overhead. Hence, don't skip such BGs during prefetch and prefetch_fini.

This improves the accuracy of CR0/1 allocation as earlier, we could have
essentially empty BLOCK_UNINIT groups being ignored by CR0/1 due to their buddy
not being initialized, leading to slower CR2 allocations. With this patch CR0/1
will be able to discover these groups as well, thus improving performance.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c86565606359..79455c7e645b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2590,9 +2590,7 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
 		 */
 		if (gdp && grp && !EXT4_MB_GRP_TEST_AND_SET_READ(grp) &&
 		    EXT4_MB_GRP_NEED_INIT(grp) &&
-		    ext4_free_group_clusters(sb, gdp) > 0 &&
-		    !(ext4_has_group_desc_csum(sb) &&
-		      (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
+		    ext4_free_group_clusters(sb, gdp) > 0 ) {
 			bh = ext4_read_block_bitmap_nowait(sb, group, true);
 			if (bh && !IS_ERR(bh)) {
 				if (!buffer_uptodate(bh) && cnt)
@@ -2633,9 +2631,7 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
 		grp = ext4_get_group_info(sb, group);
 
 		if (grp && gdp && EXT4_MB_GRP_NEED_INIT(grp) &&
-		    ext4_free_group_clusters(sb, gdp) > 0 &&
-		    !(ext4_has_group_desc_csum(sb) &&
-		      (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
+		    ext4_free_group_clusters(sb, gdp) > 0) {
 			if (ext4_mb_init_group(sb, group, GFP_NOFS))
 				break;
 		}
-- 
2.31.1



* [PATCH v2 09/12] ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs
  2023-05-30 12:33 [PATCH v2 00/12] multiblock allocator improvements Ojaswin Mujoo
                   ` (7 preceding siblings ...)
  2023-05-30 12:33 ` [PATCH v2 08/12] ext4: Don't skip prefetching BLOCK_UNINIT groups Ojaswin Mujoo
@ 2023-05-30 12:33 ` Ojaswin Mujoo
  2023-06-06 14:00   ` Guoqing Jiang
  2023-05-30 12:33 ` [PATCH v2 10/12] ext4: Abstract out logic to search average fragment list Ojaswin Mujoo
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-05-30 12:33 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, Jan Kara,
	Kemeng Shi, Ritesh Harjani

Before this patch, the call stack in ext4_run_li_request is as follows:

  /*
   * nr = no. of BGs we want to fetch (=s_mb_prefetch)
   * prefetch_ios = no. of BGs not uptodate after
   * 		    ext4_read_block_bitmap_nowait()
   */
  next_group = ext4_mb_prefetch(sb, group, nr, prefetch_ios);
  ext4_mb_prefetch_fini(sb, next_group, prefetch_ios);

ext4_mb_prefetch_fini() will only try to initialize buddies for BGs in
range [next_group - prefetch_ios, next_group). This is incorrect since
sometimes (prefetch_ios < nr), which causes ext4_mb_prefetch_fini() to
incorrectly ignore some of the BGs that might need initialization. This
issue is more notable now with the previous patch enabling "fetching" of
BLOCK_UNINIT BGs which are marked buffer_uptodate by default.

Fix this by passing nr to ext4_mb_prefetch_fini() instead of
prefetch_ios so that it considers the right range of groups.

Similarly, make sure we don't pass nr=0 to ext4_mb_prefetch_fini() in
ext4_mb_regular_allocator() since we might have prefetched BLOCK_UNINIT
groups that would need buddy initialization.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c |  4 ----
 fs/ext4/super.c   | 11 ++++-------
 2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 79455c7e645b..6775d73dfc68 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2735,8 +2735,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			if ((prefetch_grp == group) &&
 			    (cr > CR1 ||
 			     prefetch_ios < sbi->s_mb_prefetch_limit)) {
-				unsigned int curr_ios = prefetch_ios;
-
 				nr = sbi->s_mb_prefetch;
 				if (ext4_has_feature_flex_bg(sb)) {
 					nr = 1 << sbi->s_log_groups_per_flex;
@@ -2745,8 +2743,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 				}
 				prefetch_grp = ext4_mb_prefetch(sb, group,
 							nr, &prefetch_ios);
-				if (prefetch_ios == curr_ios)
-					nr = 0;
 			}
 
 			/* This now checks without needing the buddy page */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2da5476fa48b..27c1dabacd43 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3692,16 +3692,13 @@ static int ext4_run_li_request(struct ext4_li_request *elr)
 	ext4_group_t group = elr->lr_next_group;
 	unsigned int prefetch_ios = 0;
 	int ret = 0;
+	int nr = EXT4_SB(sb)->s_mb_prefetch;
 	u64 start_time;
 
 	if (elr->lr_mode == EXT4_LI_MODE_PREFETCH_BBITMAP) {
-		elr->lr_next_group = ext4_mb_prefetch(sb, group,
-				EXT4_SB(sb)->s_mb_prefetch, &prefetch_ios);
-		if (prefetch_ios)
-			ext4_mb_prefetch_fini(sb, elr->lr_next_group,
-					      prefetch_ios);
-		trace_ext4_prefetch_bitmaps(sb, group, elr->lr_next_group,
-					    prefetch_ios);
+		elr->lr_next_group = ext4_mb_prefetch(sb, group, nr, &prefetch_ios);
+		ext4_mb_prefetch_fini(sb, elr->lr_next_group, nr);
+		trace_ext4_prefetch_bitmaps(sb, group, elr->lr_next_group, nr);
 		if (group >= elr->lr_next_group) {
 			ret = 1;
 			if (elr->lr_first_not_zeroed != ngroups &&
-- 
2.31.1



* [PATCH v2 10/12] ext4: Abstract out logic to search average fragment list
  2023-05-30 12:33 [PATCH v2 00/12] multiblock allocator improvements Ojaswin Mujoo
                   ` (8 preceding siblings ...)
  2023-05-30 12:33 ` [PATCH v2 09/12] ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs Ojaswin Mujoo
@ 2023-05-30 12:33 ` Ojaswin Mujoo
  2023-05-30 12:33 ` [PATCH v2 11/12] ext4: Add allocation criteria 1.5 (CR1_5) Ojaswin Mujoo
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-05-30 12:33 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, Jan Kara,
	Kemeng Shi, Ritesh Harjani

Make the logic of searching the average fragment list of a given order reusable
by abstracting it out to a different function. This will also avoid
code duplication in upcoming patches.

No functional changes.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 51 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 33 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 6775d73dfc68..f59e1e0e01b1 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -905,6 +905,37 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
 	}
 }
 
+/*
+ * Find a suitable group of given order from the average fragments list.
+ */
+static struct ext4_group_info *
+ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int order)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	struct list_head *frag_list = &sbi->s_mb_avg_fragment_size[order];
+	rwlock_t *frag_list_lock = &sbi->s_mb_avg_fragment_size_locks[order];
+	struct ext4_group_info *grp = NULL, *iter;
+	enum criteria cr = ac->ac_criteria;
+
+	if (list_empty(frag_list))
+		return NULL;
+	read_lock(frag_list_lock);
+	if (list_empty(frag_list)) {
+		read_unlock(frag_list_lock);
+		return NULL;
+	}
+	list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
+		if (sbi->s_mb_stats)
+			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
+		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
+			grp = iter;
+			break;
+		}
+	}
+	read_unlock(frag_list_lock);
+	return grp;
+}
+
 /*
  * Choose next group by traversing average fragment size list of suitable
  * order. Updates *new_cr if cr level needs an update.
@@ -913,7 +944,7 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
 		enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	struct ext4_group_info *grp = NULL, *iter;
+	struct ext4_group_info *grp = NULL;
 	int i;
 
 	if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
@@ -923,23 +954,7 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
 
 	for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
 	     i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		if (list_empty(&sbi->s_mb_avg_fragment_size[i]))
-			continue;
-		read_lock(&sbi->s_mb_avg_fragment_size_locks[i]);
-		if (list_empty(&sbi->s_mb_avg_fragment_size[i])) {
-			read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
-			continue;
-		}
-		list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
-				    bb_avg_fragment_size_node) {
-			if (sbi->s_mb_stats)
-				atomic64_inc(&sbi->s_bal_cX_groups_considered[CR1]);
-			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR1))) {
-				grp = iter;
-				break;
-			}
-		}
-		read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
+		grp = ext4_mb_find_good_group_avg_frag_lists(ac, i);
 		if (grp)
 			break;
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 11/12] ext4: Add allocation criteria 1.5 (CR1_5)
  2023-05-30 12:33 [PATCH v2 00/12] multiblock allocator improvements Ojaswin Mujoo
                   ` (9 preceding siblings ...)
  2023-05-30 12:33 ` [PATCH v2 10/12] ext4: Abstract out logic to search average fragment list Ojaswin Mujoo
@ 2023-05-30 12:33 ` Ojaswin Mujoo
  2023-06-07 10:21   ` Jan Kara
  2023-05-30 12:33 ` [PATCH v2 12/12] ext4: Give symbolic names to mballoc criterias Ojaswin Mujoo
  2023-06-09  3:14 ` [PATCH v2 00/12] multiblock allocator improvements Theodore Ts'o
  12 siblings, 1 reply; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-05-30 12:33 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, Jan Kara,
	Kemeng Shi, Ritesh Harjani

CR1_5 aims to optimize allocations which can't be satisfied in CR1. The
fact that we couldn't find a group in CR1 suggests that it would be
difficult to find a continuous extent to completely satisfy our
allocations. So before falling to the slower CR2, in CR1.5 we
proactively trim the preallocations so we can find a group with
(free / fragments) big enough.  This speeds up our allocation at the
cost of slightly reduced preallocation.

The patch also adds a new sysfs tunable:

* /sys/fs/ext4/<partition>/mb_cr1_5_max_trim_order

This controls how much CR1.5 can trim a request before falling to CR2.
For example, for a request of order 7 and max trim order 2, CR1.5 can
trim this up to order 5.
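
As a rough sketch of how the trim order is derived (standalone userspace
code with made-up numbers; fls_u32() stands in for the kernel's fls(),
and the real ext4_mb_choose_next_group_cr1_5() additionally clamps
against the original request length and rounds to the stripe size):

#include <stdio.h>

static int fls_u32(unsigned int x)
{
	/* position of the highest set bit, 1-based, 0 for x == 0 */
	return x ? 32 - __builtin_clz(x) : 0;
}

int main(void)
{
	int goal_len = 100;		/* goal length in clusters, an order 7 request */
	int max_trim_order = 2;		/* mb_cr1_5_max_trim_order */
	int order = fls_u32(goal_len);	/* 7 */
	int min_order = order - max_trim_order;

	if (min_order < 0)
		min_order = 0;

	/* CR1.5 walks the candidate goal lengths from largest to smallest */
	for (int i = order; i >= min_order; i--)
		printf("try goal length of %d clusters (order %d)\n", 1 << i, i);

	return 0;
}

With these values the candidates are 128, 64 and 32 clusters, i.e. the
request can be trimmed down to order 5 at most, matching the example
above.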

Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

---
 fs/ext4/ext4.h              |   8 ++-
 fs/ext4/mballoc.c           | 135 +++++++++++++++++++++++++++++++++---
 fs/ext4/mballoc.h           |  13 ++++
 fs/ext4/sysfs.c             |   2 +
 include/trace/events/ext4.h |   2 +
 5 files changed, 150 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index eae981ab2539..942e97026a60 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -133,13 +133,14 @@ enum SHIFT_DIRECTION {
  * criteria the slower the allocation. We start at lower criterias and keep
  * falling back to higher ones if we are not able to find any blocks.
  */
-#define EXT4_MB_NUM_CRS 4
+#define EXT4_MB_NUM_CRS 5
 /*
  * All possible allocation criterias for mballoc
  */
 enum criteria {
 	CR0,
 	CR1,
+	CR1_5,
 	CR2,
 	CR3,
 };
@@ -185,6 +186,9 @@ enum criteria {
 #define EXT4_MB_CR0_OPTIMIZED		0x8000
 /* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
 #define EXT4_MB_CR1_OPTIMIZED		0x00010000
+/* Avg fragment size rb tree lookup succeeded at least once for cr = 1.5 */
+#define EXT4_MB_CR1_5_OPTIMIZED		0x00020000
+
 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
 	struct inode *inode;
@@ -1547,6 +1551,7 @@ struct ext4_sb_info {
 	unsigned long s_mb_last_start;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
+	unsigned int s_mb_cr1_5_max_trim_order;
 
 	/* stats for buddy allocator */
 	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
@@ -1561,6 +1566,7 @@ struct ext4_sb_info {
 	atomic_t s_bal_2orders;	/* 2^order hits */
 	atomic_t s_bal_cr0_bad_suggestions;
 	atomic_t s_bal_cr1_bad_suggestions;
+	atomic_t s_bal_cr1_5_bad_suggestions;
 	atomic64_t s_bal_cX_groups_considered[EXT4_MB_NUM_CRS];
 	atomic64_t s_bal_cX_hits[EXT4_MB_NUM_CRS];
 	atomic64_t s_bal_cX_failed[EXT4_MB_NUM_CRS];		/* cX loop didn't find blocks */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index f59e1e0e01b1..0cf037489e97 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -166,6 +166,14 @@
  * equal to request size using our average fragment size group lists (data
  * structure 2) in O(1) time.
  *
+ * At CR1.5 (aka CR1_5), we aim to optimize allocations which can't be satisfied
+ * in CR1. The fact that we couldn't find a group in CR1 suggests that there is
+ * no BG that has average fragment size > goal length. So before falling to the
+ * slower CR2, in CR1.5 we proactively trim goal length and then use the same
+ * fragment lists as CR1 to find a BG with a big enough average fragment size.
+ * This increases the chances of finding a suitable block group in O(1) time and
+ * results * in faster allocation at the cost of reduced size of allocation.
+ *
  * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
  * linear order which requires O(N) search time for each CR0 and CR1 phase.
  *
@@ -963,6 +971,91 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
 		*group = grp->bb_group;
 		ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
 	} else {
+		*new_cr = CR1_5;
+	}
+}
+
+/*
+ * We couldn't find a group in CR1 so try to find the highest free fragment
+ * order we have and proactively trim the goal request length to that order to
+ * find a suitable group faster.
+ *
+ * This optimizes allocation speed at the cost of slightly reduced
+ * preallocations. However, we make sure that we don't trim the request too
+ * much and fall to CR2 in that case.
+ */
+static void ext4_mb_choose_next_group_cr1_5(struct ext4_allocation_context *ac,
+		enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	struct ext4_group_info *grp = NULL;
+	int i, order, min_order;
+	unsigned long num_stripe_clusters = 0;
+
+	if (unlikely(ac->ac_flags & EXT4_MB_CR1_5_OPTIMIZED)) {
+		if (sbi->s_mb_stats)
+			atomic_inc(&sbi->s_bal_cr1_5_bad_suggestions);
+	}
+
+	/*
+	 * mb_avg_fragment_size_order() returns order in a way that makes
+	 * retrieving back the length using (1 << order) inaccurate. Hence, use
+	 * fls() instead since we need to know the actual length while modifying
+	 * goal length.
+	 */
+	order = fls(ac->ac_g_ex.fe_len);
+	min_order = order - sbi->s_mb_cr1_5_max_trim_order;
+	if (min_order < 0)
+		min_order = 0;
+
+	if (1 << min_order < ac->ac_o_ex.fe_len)
+		min_order = fls(ac->ac_o_ex.fe_len) + 1;
+
+	if (sbi->s_stripe > 0) {
+		/*
+		 * We are assuming that stripe size is always a multiple of
+		 * cluster ratio otherwise __ext4_fill_super exists early.
+		 */
+		num_stripe_clusters = EXT4_NUM_B2C(sbi, sbi->s_stripe);
+		if (1 << min_order < num_stripe_clusters)
+			min_order = fls(num_stripe_clusters);
+	}
+
+	for (i = order; i >= min_order; i--) {
+		int frag_order;
+		/*
+		 * Scale down goal len to make sure we find something
+		 * in the free fragments list. Basically, reduce
+		 * preallocations.
+		 */
+		ac->ac_g_ex.fe_len = 1 << i;
+
+		if (num_stripe_clusters > 0) {
+			/*
+			 * Try to round up the adjusted goal to stripe size
+			 * (in cluster units) multiple for efficiency.
+			 *
+			 * XXX: Is s->stripe always a power of 2? In that case
+			 * we can use the faster round_up() variant.
+			 */
+			ac->ac_g_ex.fe_len = roundup(ac->ac_g_ex.fe_len,
+						     num_stripe_clusters);
+		}
+
+		frag_order = mb_avg_fragment_size_order(ac->ac_sb,
+							ac->ac_g_ex.fe_len);
+
+		grp = ext4_mb_find_good_group_avg_frag_lists(ac, frag_order);
+		if (grp)
+			break;
+	}
+
+	if (grp) {
+		*group = grp->bb_group;
+		ac->ac_flags |= EXT4_MB_CR1_5_OPTIMIZED;
+	} else {
+		/* Reset goal length to original goal length before falling into CR2 */
+		ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
 		*new_cr = CR2;
 	}
 }
@@ -1029,6 +1122,8 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
 		ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
 	} else if (*new_cr == CR1) {
 		ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
+	} else if (*new_cr == CR1_5) {
+		ext4_mb_choose_next_group_cr1_5(ac, new_cr, group, ngroups);
 	} else {
 		/*
 		 * TODO: For CR=2, we can arrange groups in an rb tree sorted by
@@ -2352,7 +2447,7 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
 
 		if (ac->ac_criteria < CR2) {
 			/*
-			 * In CR1, we are sure that this group will
+			 * In CR1 and CR1_5, we are sure that this group will
 			 * have a large enough continuous free extent, so skip
 			 * over the smaller free extents
 			 */
@@ -2484,6 +2579,7 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
 
 		return true;
 	case CR1:
+	case CR1_5:
 		if ((free / fragments) >= ac->ac_g_ex.fe_len)
 			return true;
 		break;
@@ -2748,7 +2844,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			 * spend a lot of time loading imperfect groups
 			 */
 			if ((prefetch_grp == group) &&
-			    (cr > CR1 ||
+			    (cr > CR1_5 ||
 			     prefetch_ios < sbi->s_mb_prefetch_limit)) {
 				nr = sbi->s_mb_prefetch;
 				if (ext4_has_feature_flex_bg(sb)) {
@@ -2788,7 +2884,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			ac->ac_groups_scanned++;
 			if (cr == CR0)
 				ext4_mb_simple_scan_group(ac, &e4b);
-			else if (cr == CR1 && sbi->s_stripe &&
+			else if ((cr == CR1 || cr == CR1_5) && sbi->s_stripe &&
 				 !(ac->ac_g_ex.fe_len %
 				 EXT4_B2C(sbi, sbi->s_stripe)))
 				ext4_mb_scan_aligned(ac, &e4b);
@@ -2804,6 +2900,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		/* Processed all groups and haven't found blocks */
 		if (sbi->s_mb_stats && i == ngroups)
 			atomic64_inc(&sbi->s_bal_cX_failed[cr]);
+
+		if (i == ngroups && ac->ac_criteria == CR1_5)
+			/* Reset goal length to original goal length before
+			 * falling into CR2 */
+			ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
 	}
 
 	if (ac->ac_b_ex.fe_len > 0 && ac->ac_status != AC_STATUS_FOUND &&
@@ -2973,6 +3074,16 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 	seq_printf(seq, "\t\tbad_suggestions: %u\n",
 		   atomic_read(&sbi->s_bal_cr1_bad_suggestions));
 
+	seq_puts(seq, "\tcr1.5_stats:\n");
+	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR1_5]));
+	seq_printf(seq, "\t\tgroups_considered: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR1_5]));
+	seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR1_5]));
+	seq_printf(seq, "\t\tuseless_loops: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_failed[CR1_5]));
+	seq_printf(seq, "\t\tbad_suggestions: %u\n",
+		   atomic_read(&sbi->s_bal_cr1_5_bad_suggestions));
+
 	seq_puts(seq, "\tcr2_stats:\n");
 	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR2]));
 	seq_printf(seq, "\t\tgroups_considered: %llu\n",
@@ -3490,6 +3601,8 @@ int ext4_mb_init(struct super_block *sb)
 	sbi->s_mb_stats = MB_DEFAULT_STATS;
 	sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
 	sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
+	sbi->s_mb_cr1_5_max_trim_order = MB_DEFAULT_CR1_5_TRIM_ORDER;
+
 	/*
 	 * The default group preallocation is 512, which for 4k block
 	 * sizes translates to 2 megabytes.  However for bigalloc file
@@ -4402,6 +4515,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	 * placement or satisfy big request as is */
 	ac->ac_g_ex.fe_logical = start;
 	ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size);
+	ac->ac_orig_goal_len = ac->ac_g_ex.fe_len;
 
 	/* define goal start in order to merge */
 	if (ar->pright && (ar->lright == (start + size)) &&
@@ -4445,8 +4559,10 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
 		if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
 				ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
 			atomic_inc(&sbi->s_bal_goals);
-		if (ac->ac_f_ex.fe_len == ac->ac_g_ex.fe_len)
+		/* did we allocate as much as normalizer originally wanted? */
+		if (ac->ac_f_ex.fe_len == ac->ac_orig_goal_len)
 			atomic_inc(&sbi->s_bal_len_goals);
+
 		if (ac->ac_found > sbi->s_mb_max_to_scan)
 			atomic_inc(&sbi->s_bal_breaks);
 	}
@@ -4931,7 +5047,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 
 	pa = ac->ac_pa;
 
-	if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) {
+	if (ac->ac_b_ex.fe_len < ac->ac_orig_goal_len) {
 		int new_bex_start;
 		int new_bex_end;
 
@@ -4946,14 +5062,14 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 		 * fragmentation in check while ensuring logical range of best
 		 * extent doesn't overflow out of goal extent:
 		 *
-		 * 1. Check if best ex can be kept at end of goal and still
-		 *    cover original start
+		 * 1. Check if best ex can be kept at end of goal (before
+		 *    cr_best_avail trimmed it) and still cover original start
 		 * 2. Else, check if best ex can be kept at start of goal and
 		 *    still cover original start
 		 * 3. Else, keep the best ex at start of original request.
 		 */
 		new_bex_end = ac->ac_g_ex.fe_logical +
-			EXT4_C2B(sbi, ac->ac_g_ex.fe_len);
+			EXT4_C2B(sbi, ac->ac_orig_goal_len);
 		new_bex_start = new_bex_end - EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
 		if (ac->ac_o_ex.fe_logical >= new_bex_start)
 			goto adjust_bex;
@@ -4974,7 +5090,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 		BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
 		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
 		BUG_ON(new_bex_end > (ac->ac_g_ex.fe_logical +
-				      EXT4_C2B(sbi, ac->ac_g_ex.fe_len)));
+				      EXT4_C2B(sbi, ac->ac_orig_goal_len)));
 	}
 
 	pa->pa_lstart = ac->ac_b_ex.fe_logical;
@@ -5594,6 +5710,7 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
 	ac->ac_o_ex.fe_start = block;
 	ac->ac_o_ex.fe_len = len;
 	ac->ac_g_ex = ac->ac_o_ex;
+	ac->ac_orig_goal_len = ac->ac_g_ex.fe_len;
 	ac->ac_flags = ar->flags;
 
 	/* we have to define context: we'll work with a file or
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index acfdc204e15d..bddc0335c261 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -85,6 +85,13 @@
  */
 #define MB_DEFAULT_LINEAR_SCAN_THRESHOLD	16
 
+/*
+ * The maximum order upto which CR1.5 can trim a particular allocation request.
+ * Example, if we have an order 7 request and max trim order of 3, CR1.5 can
+ * trim this upto order 4.
+ */
+#define MB_DEFAULT_CR1_5_TRIM_ORDER	3
+
 /*
  * Number of valid buddy orders
  */
@@ -179,6 +186,12 @@ struct ext4_allocation_context {
 	/* copy of the best found extent taken before preallocation efforts */
 	struct ext4_free_extent ac_f_ex;
 
+	/*
+	 * goal len can change in CR1.5, so save the original len. This is
+	 * used while adjusting the PA window and for accounting.
+	 */
+	ext4_grpblk_t	ac_orig_goal_len;
+
 	__u32 ac_groups_considered;
 	__u32 ac_flags;		/* allocation hints */
 	__u16 ac_groups_scanned;
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 3042bc605bbf..4a5c08c8dddb 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -223,6 +223,7 @@ EXT4_RW_ATTR_SBI_UI(warning_ratelimit_interval_ms, s_warning_ratelimit_state.int
 EXT4_RW_ATTR_SBI_UI(warning_ratelimit_burst, s_warning_ratelimit_state.burst);
 EXT4_RW_ATTR_SBI_UI(msg_ratelimit_interval_ms, s_msg_ratelimit_state.interval);
 EXT4_RW_ATTR_SBI_UI(msg_ratelimit_burst, s_msg_ratelimit_state.burst);
+EXT4_RW_ATTR_SBI_UI(mb_cr1_5_max_trim_order, s_mb_cr1_5_max_trim_order);
 #ifdef CONFIG_EXT4_DEBUG
 EXT4_RW_ATTR_SBI_UL(simulate_fail, s_simulate_fail);
 #endif
@@ -273,6 +274,7 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(warning_ratelimit_burst),
 	ATTR_LIST(msg_ratelimit_interval_ms),
 	ATTR_LIST(msg_ratelimit_burst),
+	ATTR_LIST(mb_cr1_5_max_trim_order),
 	ATTR_LIST(errors_count),
 	ATTR_LIST(warning_count),
 	ATTR_LIST(msg_count),
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index f062147ca32b..7ea9b4fcb21f 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -122,6 +122,7 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
 
 TRACE_DEFINE_ENUM(CR0);
 TRACE_DEFINE_ENUM(CR1);
+TRACE_DEFINE_ENUM(CR1_5);
 TRACE_DEFINE_ENUM(CR2);
 TRACE_DEFINE_ENUM(CR3);
 
@@ -129,6 +130,7 @@ TRACE_DEFINE_ENUM(CR3);
 	__print_symbolic(cr,                    \
 			 { CR0, "CR0" },	\
 			 { CR1, "CR1" },        \
+			 { CR1_5, "CR1.5" }     \
 			 { CR2, "CR2" },        \
 			 { CR3, "CR3" })
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 12/12] ext4: Give symbolic names to mballoc criterias
  2023-05-30 12:33 [PATCH v2 00/12] multiblock allocator improvements Ojaswin Mujoo
                   ` (10 preceding siblings ...)
  2023-05-30 12:33 ` [PATCH v2 11/12] ext4: Add allocation criteria 1.5 (CR1_5) Ojaswin Mujoo
@ 2023-05-30 12:33 ` Ojaswin Mujoo
  2023-06-07 10:39   ` Jan Kara
  2023-06-09  3:14 ` [PATCH v2 00/12] multiblock allocator improvements Theodore Ts'o
  12 siblings, 1 reply; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-05-30 12:33 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, Jan Kara, Kemeng Shi

mballoc criterias have historically been referred to by numbers
like CR0, CR1, and so on. However, this makes it confusing to understand
what each criteria is about.

Change these criterias from numbers to symbolic names and add
relevant comments. While we are at it, also reformat and add some
comments to ext4_seq_mb_stats_show() for better readability.

Additionally, define CR_FAST, which signifies the criteria
below which we can make quicker decisions such as (see the sketch
after this list):
  * quitting early if (free blocks < requested len)
  * avoiding scanning free extents smaller than the required len
  * avoiding initializing the buddy cache and instead working with the
    existing cache
  * limiting prefetches
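
A condensed, standalone sketch of the new criteria names and the CR_FAST
cutoff (it compiles on its own for illustration; the real definitions
live in fs/ext4/ext4.h as shown in the diff below):

#include <stdio.h>

enum criteria {
	CR_POWER2_ALIGNED,
	CR_GOAL_LEN_FAST,
	CR_BEST_AVAIL_LEN,
	CR_GOAL_LEN_SLOW,
	CR_ANY_FREE,
};

/* criteria below this one get the quick-decision treatment listed above */
#define CR_FAST CR_GOAL_LEN_SLOW

int main(void)
{
	for (enum criteria cr = CR_POWER2_ALIGNED; cr <= CR_ANY_FREE; cr++)
		printf("cr %d: %s scanning\n", cr,
		       cr < CR_FAST ? "fast" : "slow");
	return 0;
}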

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/ext4.h              |  55 ++++++--
 fs/ext4/mballoc.c           | 271 ++++++++++++++++++++----------------
 fs/ext4/mballoc.h           |   8 +-
 fs/ext4/sysfs.c             |   4 +-
 include/trace/events/ext4.h |  26 ++--
 5 files changed, 214 insertions(+), 150 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 942e97026a60..c29a4e1fcd5d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -135,16 +135,45 @@ enum SHIFT_DIRECTION {
  */
 #define EXT4_MB_NUM_CRS 5
 /*
- * All possible allocation criterias for mballoc
+ * All possible allocation criterias for mballoc. Lower are faster.
  */
 enum criteria {
-	CR0,
-	CR1,
-	CR1_5,
-	CR2,
-	CR3,
+	/*
+	 * Used when number of blocks needed is a power of 2. This doesn't
+	 * trigger any disk IO except prefetch and is the fastest criteria.
+	 */
+	CR_POWER2_ALIGNED,
+
+	/*
+	 * Tries to lookup in-memory data structures to find the most suitable
+	 * group that satisfies goal request. No disk IO except block prefetch.
+	 */
+	CR_GOAL_LEN_FAST,
+
+        /*
+	 * Same as CR_GOAL_LEN_FAST but is allowed to reduce the goal length to
+         * the best available length for faster allocation.
+	 */
+	CR_BEST_AVAIL_LEN,
+
+	/*
+	 * Reads each block group sequentially, performing disk IO if necessary, to
+	 * find find_suitable block group. Tries to allocate goal length but might trim
+	 * the request if nothing is found after enough tries.
+	 */
+	CR_GOAL_LEN_SLOW,
+
+	/*
+	 * Finds the first free set of blocks and allocates those. This is only
+	 * used in rare cases when CR_GOAL_LEN_SLOW also fails to allocate
+	 * anything.
+	 */
+	CR_ANY_FREE,
 };
 
+/* criteria below which we use fast block scanning and avoid unnecessary IO */
+#define CR_FAST CR_GOAL_LEN_SLOW
+
 /*
  * Flags used in mballoc's allocation_context flags field.
  *
@@ -183,11 +212,11 @@ enum criteria {
 /* Do strict check for free blocks while retrying block allocation */
 #define EXT4_MB_STRICT_CHECK		0x4000
 /* Large fragment size list lookup succeeded at least once for cr = 0 */
-#define EXT4_MB_CR0_OPTIMIZED		0x8000
+#define EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED		0x8000
 /* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
-#define EXT4_MB_CR1_OPTIMIZED		0x00010000
+#define EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED		0x00010000
 /* Avg fragment size rb tree lookup succeeded at least once for cr = 1.5 */
-#define EXT4_MB_CR1_5_OPTIMIZED		0x00020000
+#define EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED		0x00020000
 
 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
@@ -1551,7 +1580,7 @@ struct ext4_sb_info {
 	unsigned long s_mb_last_start;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
-	unsigned int s_mb_cr1_5_max_trim_order;
+	unsigned int s_mb_best_avail_max_trim_order;
 
 	/* stats for buddy allocator */
 	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
@@ -1564,9 +1593,9 @@ struct ext4_sb_info {
 	atomic_t s_bal_len_goals;	/* len goal hits */
 	atomic_t s_bal_breaks;	/* too long searches */
 	atomic_t s_bal_2orders;	/* 2^order hits */
-	atomic_t s_bal_cr0_bad_suggestions;
-	atomic_t s_bal_cr1_bad_suggestions;
-	atomic_t s_bal_cr1_5_bad_suggestions;
+	atomic_t s_bal_p2_aligned_bad_suggestions;
+	atomic_t s_bal_goal_fast_bad_suggestions;
+	atomic_t s_bal_best_avail_bad_suggestions;
 	atomic64_t s_bal_cX_groups_considered[EXT4_MB_NUM_CRS];
 	atomic64_t s_bal_cX_hits[EXT4_MB_NUM_CRS];
 	atomic64_t s_bal_cX_failed[EXT4_MB_NUM_CRS];		/* cX loop didn't find blocks */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 0cf037489e97..4f2a1df98141 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -155,27 +155,31 @@
  * structures to decide the order in which groups are to be traversed for
  * fulfilling an allocation request.
  *
- * At CR0 , we look for groups which have the largest_free_order >= the order
- * of the request. We directly look at the largest free order list in the data
- * structure (1) above where largest_free_order = order of the request. If that
- * list is empty, we look at remaining list in the increasing order of
- * largest_free_order. This allows us to perform CR0 lookup in O(1) time.
+ * At CR_POWER2_ALIGNED , we look for groups which have the largest_free_order
+ * >= the order of the request. We directly look at the largest free order list
+ * in the data structure (1) above where largest_free_order = order of the
+ * request. If that list is empty, we look at remaining list in the increasing
+ * order of largest_free_order. This allows us to perform CR_POWER2_ALIGNED
+ * lookup in O(1) time.
  *
- * At CR1, we only consider groups where average fragment size > request
- * size. So, we lookup a group which has average fragment size just above or
- * equal to request size using our average fragment size group lists (data
- * structure 2) in O(1) time.
+ * At CR_GOAL_LEN_FAST, we only consider groups where
+ * average fragment size > request size. So, we lookup a group which has average
+ * fragment size just above or equal to request size using our average fragment
+ * size group lists (data structure 2) in O(1) time.
  *
- * At CR1.5 (aka CR1_5), we aim to optimize allocations which can't be satisfied
- * in CR1. The fact that we couldn't find a group in CR1 suggests that there is
- * no BG that has average fragment size > goal length. So before falling to the
- * slower CR2, in CR1.5 we proactively trim goal length and then use the same
- * fragment lists as CR1 to find a BG with a big enough average fragment size.
- * This increases the chances of finding a suitable block group in O(1) time and
- * results * in faster allocation at the cost of reduced size of allocation.
+ * At CR_BEST_AVAIL_LEN, we aim to optimize allocations which can't be satisfied
+ * in CR_GOAL_LEN_FAST. The fact that we couldn't find a group in
+ * CR_GOAL_LEN_FAST suggests that there is no BG that has avg
+ * fragment size > goal length. So before falling to the slower
+ * CR_GOAL_LEN_SLOW, in CR_BEST_AVAIL_LEN we proactively trim goal length and
+ * then use the same fragment lists as CR_GOAL_LEN_FAST to find a BG with a big
+ * enough average fragment size. This increases the chances of finding a
+ * suitable block group in O(1) time and results in faster allocation at the
+ * cost of reduced size of allocation.
  *
  * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
- * linear order which requires O(N) search time for each CR0 and CR1 phase.
+ * linear order which requires O(N) search time for each CR_POWER2_ALIGNED and
+ * CR_GOAL_LEN_FAST phase.
  *
  * The regular allocator (using the buddy cache) supports a few tunables.
  *
@@ -360,8 +364,8 @@
  *  - bitlock on a group	(group)
  *  - object (inode/locality)	(object)
  *  - per-pa lock		(pa)
- *  - cr0 lists lock		(cr0)
- *  - cr1 tree lock		(cr1)
+ *  - cr_power2_aligned lists lock	(cr_power2_aligned)
+ *  - cr_goal_len_fast lists lock	(cr_goal_len_fast)
  *
  * Paths:
  *  - new pa
@@ -393,7 +397,7 @@
  *
  *  - allocation path (ext4_mb_regular_allocator)
  *    group
- *    cr0/cr1
+ *    cr_power2_aligned/cr_goal_len_fast
  */
 static struct kmem_cache *ext4_pspace_cachep;
 static struct kmem_cache *ext4_ac_cachep;
@@ -867,7 +871,7 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
  * Choose next group by traversing largest_free_order lists. Updates *new_cr if
  * cr level needs an update.
  */
-static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
+static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context *ac,
 			enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
@@ -877,8 +881,8 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
 	if (ac->ac_status == AC_STATUS_FOUND)
 		return;
 
-	if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
-		atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
+	if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED))
+		atomic_inc(&sbi->s_bal_p2_aligned_bad_suggestions);
 
 	grp = NULL;
 	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
@@ -893,8 +897,8 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
 		list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
 				    bb_largest_free_order_node) {
 			if (sbi->s_mb_stats)
-				atomic64_inc(&sbi->s_bal_cX_groups_considered[CR0]);
-			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR0))) {
+				atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]);
+			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
 				grp = iter;
 				break;
 			}
@@ -906,10 +910,10 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
 
 	if (!grp) {
 		/* Increment cr and search again */
-		*new_cr = CR1;
+		*new_cr = CR_GOAL_LEN_FAST;
 	} else {
 		*group = grp->bb_group;
-		ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
+		ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
 	}
 }
 
@@ -948,16 +952,16 @@ ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int o
  * Choose next group by traversing average fragment size list of suitable
  * order. Updates *new_cr if cr level needs an update.
  */
-static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
+static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *ac,
 		enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	struct ext4_group_info *grp = NULL;
 	int i;
 
-	if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
+	if (unlikely(ac->ac_flags & EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED)) {
 		if (sbi->s_mb_stats)
-			atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
+			atomic_inc(&sbi->s_bal_goal_fast_bad_suggestions);
 	}
 
 	for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
@@ -969,22 +973,22 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
 
 	if (grp) {
 		*group = grp->bb_group;
-		ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
+		ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
 	} else {
-		*new_cr = CR1_5;
+		*new_cr = CR_BEST_AVAIL_LEN;
 	}
 }
 
 /*
- * We couldn't find a group in CR1 so try to find the highest free fragment
+ * We couldn't find a group in CR_GOAL_LEN_FAST so try to find the highest free fragment
  * order we have and proactively trim the goal request length to that order to
  * find a suitable group faster.
  *
  * This optimizes allocation speed at the cost of slightly reduced
  * preallocations. However, we make sure that we don't trim the request too
- * much and fall to CR2 in that case.
+ * much and fall to CR_GOAL_LEN_SLOW in that case.
  */
-static void ext4_mb_choose_next_group_cr1_5(struct ext4_allocation_context *ac,
+static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_context *ac,
 		enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
@@ -992,9 +996,9 @@ static void ext4_mb_choose_next_group_cr1_5(struct ext4_allocation_context *ac,
 	int i, order, min_order;
 	unsigned long num_stripe_clusters = 0;
 
-	if (unlikely(ac->ac_flags & EXT4_MB_CR1_5_OPTIMIZED)) {
+	if (unlikely(ac->ac_flags & EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED)) {
 		if (sbi->s_mb_stats)
-			atomic_inc(&sbi->s_bal_cr1_5_bad_suggestions);
+			atomic_inc(&sbi->s_bal_best_avail_bad_suggestions);
 	}
 
 	/*
@@ -1004,7 +1008,7 @@ static void ext4_mb_choose_next_group_cr1_5(struct ext4_allocation_context *ac,
 	 * goal length.
 	 */
 	order = fls(ac->ac_g_ex.fe_len);
-	min_order = order - sbi->s_mb_cr1_5_max_trim_order;
+	min_order = order - sbi->s_mb_best_avail_max_trim_order;
 	if (min_order < 0)
 		min_order = 0;
 
@@ -1052,11 +1056,11 @@ static void ext4_mb_choose_next_group_cr1_5(struct ext4_allocation_context *ac,
 
 	if (grp) {
 		*group = grp->bb_group;
-		ac->ac_flags |= EXT4_MB_CR1_5_OPTIMIZED;
+		ac->ac_flags |= EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED;
 	} else {
-		/* Reset goal length to original goal length before falling into CR2 */
+		/* Reset goal length to original goal length before falling into CR_GOAL_LEN_SLOW */
 		ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
-		*new_cr = CR2;
+		*new_cr = CR_GOAL_LEN_SLOW;
 	}
 }
 
@@ -1064,7 +1068,7 @@ static inline int should_optimize_scan(struct ext4_allocation_context *ac)
 {
 	if (unlikely(!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN)))
 		return 0;
-	if (ac->ac_criteria >= CR2)
+	if (ac->ac_criteria >= CR_GOAL_LEN_SLOW)
 		return 0;
 	if (!ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
 		return 0;
@@ -1118,12 +1122,12 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
 		return;
 	}
 
-	if (*new_cr == CR0) {
-		ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
-	} else if (*new_cr == CR1) {
-		ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
-	} else if (*new_cr == CR1_5) {
-		ext4_mb_choose_next_group_cr1_5(ac, new_cr, group, ngroups);
+	if (*new_cr == CR_POWER2_ALIGNED) {
+		ext4_mb_choose_next_group_p2_aligned(ac, new_cr, group, ngroups);
+	} else if (*new_cr == CR_GOAL_LEN_FAST) {
+		ext4_mb_choose_next_group_goal_fast(ac, new_cr, group, ngroups);
+	} else if (*new_cr == CR_BEST_AVAIL_LEN) {
+		ext4_mb_choose_next_group_best_avail(ac, new_cr, group, ngroups);
 	} else {
 		/*
 		 * TODO: For CR=2, we can arrange groups in an rb tree sorted by
@@ -2445,11 +2449,12 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
 			break;
 		}
 
-		if (ac->ac_criteria < CR2) {
+		if (ac->ac_criteria < CR_FAST) {
 			/*
-			 * In CR1 and CR1_5, we are sure that this group will
-			 * have a large enough continuous free extent, so skip
-			 * over the smaller free extents
+			 * In CR_GOAL_LEN_FAST and CR_BEST_AVAIL_LEN, we are
+			 * sure that this group will have a large enough
+			 * continuous free extent, so skip over the smaller free
+			 * extents
 			 */
 			j = mb_find_next_bit(bitmap,
 						EXT4_CLUSTERS_PER_GROUP(sb), i);
@@ -2545,7 +2550,7 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
 	int flex_size = ext4_flex_bg_size(EXT4_SB(ac->ac_sb));
 	struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
 
-	BUG_ON(cr < CR0 || cr >= EXT4_MB_NUM_CRS);
+	BUG_ON(cr < CR_POWER2_ALIGNED || cr >= EXT4_MB_NUM_CRS);
 
 	if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp) || !grp))
 		return false;
@@ -2559,7 +2564,7 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
 		return false;
 
 	switch (cr) {
-	case CR0:
+	case CR_POWER2_ALIGNED:
 		BUG_ON(ac->ac_2order == 0);
 
 		/* Avoid using the first bg of a flexgroup for data files */
@@ -2578,16 +2583,16 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
 			return false;
 
 		return true;
-	case CR1:
-	case CR1_5:
+	case CR_GOAL_LEN_FAST:
+	case CR_BEST_AVAIL_LEN:
 		if ((free / fragments) >= ac->ac_g_ex.fe_len)
 			return true;
 		break;
-	case CR2:
+	case CR_GOAL_LEN_SLOW:
 		if (free >= ac->ac_g_ex.fe_len)
 			return true;
 		break;
-	case CR3:
+	case CR_ANY_FREE:
 		return true;
 	default:
 		BUG();
@@ -2628,7 +2633,7 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
 	free = grp->bb_free;
 	if (free == 0)
 		goto out;
-	if (cr <= CR2 && free < ac->ac_g_ex.fe_len)
+	if (cr <= CR_FAST && free < ac->ac_g_ex.fe_len)
 		goto out;
 	if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
 		goto out;
@@ -2643,15 +2648,16 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
 			ext4_get_group_desc(sb, group, NULL);
 		int ret;
 
-		/* cr=CR0/CR1 is a very optimistic search to find large
-		 * good chunks almost for free.  If buddy data is not
-		 * ready, then this optimization makes no sense.  But
-		 * we never skip the first block group in a flex_bg,
-		 * since this gets used for metadata block allocation,
-		 * and we want to make sure we locate metadata blocks
-		 * in the first block group in the flex_bg if possible.
+		/*
+		 * cr=CR_POWER2_ALIGNED/CR_GOAL_LEN_FAST is a very optimistic
+		 * search to find large good chunks almost for free. If buddy
+		 * data is not ready, then this optimization makes no sense. But
+		 * we never skip the first block group in a flex_bg, since this
+		 * gets used for metadata block allocation, and we want to make
+		 * sure we locate metadata blocks in the first block group in
+		 * the flex_bg if possible.
 		 */
-		if (cr < CR2 &&
+		if (cr < CR_FAST &&
 		    (!sbi->s_log_groups_per_flex ||
 		     ((group & ((1 << sbi->s_log_groups_per_flex) - 1)) != 0)) &&
 		    !(ext4_has_group_desc_csum(sb) &&
@@ -2811,10 +2817,10 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	}
 
 	/* Let's just scan groups to find more-less suitable blocks */
-	cr = ac->ac_2order ? CR0 : CR1;
+	cr = ac->ac_2order ? CR_POWER2_ALIGNED : CR_GOAL_LEN_FAST;
 	/*
-	 * cr == CR0 try to get exact allocation,
-	 * cr == CR3 try to get anything
+	 * cr == CR_POWER2_ALIGNED try to get exact allocation,
+	 * cr == CR_ANY_FREE try to get anything
 	 */
 repeat:
 	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
@@ -2844,7 +2850,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			 * spend a lot of time loading imperfect groups
 			 */
 			if ((prefetch_grp == group) &&
-			    (cr > CR1_5 ||
+			    (cr >= CR_FAST ||
 			     prefetch_ios < sbi->s_mb_prefetch_limit)) {
 				nr = sbi->s_mb_prefetch;
 				if (ext4_has_feature_flex_bg(sb)) {
@@ -2882,9 +2888,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			}
 
 			ac->ac_groups_scanned++;
-			if (cr == CR0)
+			if (cr == CR_POWER2_ALIGNED)
 				ext4_mb_simple_scan_group(ac, &e4b);
-			else if ((cr == CR1 || cr == CR1_5) && sbi->s_stripe &&
+			else if ((cr == CR_GOAL_LEN_FAST ||
+				 cr == CR_BEST_AVAIL_LEN) &&
+				 sbi->s_stripe &&
 				 !(ac->ac_g_ex.fe_len %
 				 EXT4_B2C(sbi, sbi->s_stripe)))
 				ext4_mb_scan_aligned(ac, &e4b);
@@ -2901,9 +2909,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		if (sbi->s_mb_stats && i == ngroups)
 			atomic64_inc(&sbi->s_bal_cX_failed[cr]);
 
-		if (i == ngroups && ac->ac_criteria == CR1_5)
+		if (i == ngroups && ac->ac_criteria == CR_BEST_AVAIL_LEN)
 			/* Reset goal length to original goal length before
-			 * falling into CR2 */
+			 * falling into CR_GOAL_LEN_SLOW */
 			ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
 	}
 
@@ -2930,7 +2938,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			ac->ac_b_ex.fe_len = 0;
 			ac->ac_status = AC_STATUS_CONTINUE;
 			ac->ac_flags |= EXT4_MB_HINT_FIRST;
-			cr = CR3;
+			cr = CR_ANY_FREE;
 			goto repeat;
 		}
 	}
@@ -3046,66 +3054,94 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 	seq_puts(seq, "mballoc:\n");
 	if (!sbi->s_mb_stats) {
 		seq_puts(seq, "\tmb stats collection turned off.\n");
-		seq_puts(seq, "\tTo enable, please write \"1\" to sysfs file mb_stats.\n");
+		seq_puts(
+			seq,
+			"\tTo enable, please write \"1\" to sysfs file mb_stats.\n");
 		return 0;
 	}
 	seq_printf(seq, "\treqs: %u\n", atomic_read(&sbi->s_bal_reqs));
 	seq_printf(seq, "\tsuccess: %u\n", atomic_read(&sbi->s_bal_success));
 
-	seq_printf(seq, "\tgroups_scanned: %u\n",  atomic_read(&sbi->s_bal_groups_scanned));
-
-	seq_puts(seq, "\tcr0_stats:\n");
-	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR0]));
-	seq_printf(seq, "\t\tgroups_considered: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR0]));
-	seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR0]));
+	seq_printf(seq, "\tgroups_scanned: %u\n",
+		   atomic_read(&sbi->s_bal_groups_scanned));
+
+	/* CR_POWER2_ALIGNED stats */
+	seq_puts(seq, "\tcr_p2_aligned_stats:\n");
+	seq_printf(seq, "\t\thits: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_hits[CR_POWER2_ALIGNED]));
+	seq_printf(
+		seq, "\t\tgroups_considered: %llu\n",
+		atomic64_read(
+			&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]));
+	seq_printf(seq, "\t\textents_scanned: %u\n",
+		   atomic_read(&sbi->s_bal_cX_ex_scanned[CR_POWER2_ALIGNED]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_failed[CR0]));
+		   atomic64_read(&sbi->s_bal_cX_failed[CR_POWER2_ALIGNED]));
 	seq_printf(seq, "\t\tbad_suggestions: %u\n",
-		   atomic_read(&sbi->s_bal_cr0_bad_suggestions));
+		   atomic_read(&sbi->s_bal_p2_aligned_bad_suggestions));
 
-	seq_puts(seq, "\tcr1_stats:\n");
-	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR1]));
+	/* CR_GOAL_LEN_FAST stats */
+	seq_puts(seq, "\tcr_goal_fast_stats:\n");
+	seq_printf(seq, "\t\thits: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_hits[CR_GOAL_LEN_FAST]));
 	seq_printf(seq, "\t\tgroups_considered: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR1]));
-	seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR1]));
+		   atomic64_read(
+			   &sbi->s_bal_cX_groups_considered[CR_GOAL_LEN_FAST]));
+	seq_printf(seq, "\t\textents_scanned: %u\n",
+		   atomic_read(&sbi->s_bal_cX_ex_scanned[CR_GOAL_LEN_FAST]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_failed[CR1]));
+		   atomic64_read(&sbi->s_bal_cX_failed[CR_GOAL_LEN_FAST]));
 	seq_printf(seq, "\t\tbad_suggestions: %u\n",
-		   atomic_read(&sbi->s_bal_cr1_bad_suggestions));
-
-	seq_puts(seq, "\tcr1.5_stats:\n");
-	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR1_5]));
-	seq_printf(seq, "\t\tgroups_considered: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR1_5]));
-	seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR1_5]));
+		   atomic_read(&sbi->s_bal_goal_fast_bad_suggestions));
+
+	/* CR_BEST_AVAIL_LEN stats */
+	seq_puts(seq, "\tcr_best_avail_stats:\n");
+	seq_printf(seq, "\t\thits: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_hits[CR_BEST_AVAIL_LEN]));
+	seq_printf(
+		seq, "\t\tgroups_considered: %llu\n",
+		atomic64_read(
+			&sbi->s_bal_cX_groups_considered[CR_BEST_AVAIL_LEN]));
+	seq_printf(seq, "\t\textents_scanned: %u\n",
+		   atomic_read(&sbi->s_bal_cX_ex_scanned[CR_BEST_AVAIL_LEN]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_failed[CR1_5]));
+		   atomic64_read(&sbi->s_bal_cX_failed[CR_BEST_AVAIL_LEN]));
 	seq_printf(seq, "\t\tbad_suggestions: %u\n",
-		   atomic_read(&sbi->s_bal_cr1_5_bad_suggestions));
+		   atomic_read(&sbi->s_bal_best_avail_bad_suggestions));
 
-	seq_puts(seq, "\tcr2_stats:\n");
-	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR2]));
+	/* CR_GOAL_LEN_SLOW stats */
+	seq_puts(seq, "\tcr_goal_slow_stats:\n");
+	seq_printf(seq, "\t\thits: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_hits[CR_GOAL_LEN_SLOW]));
 	seq_printf(seq, "\t\tgroups_considered: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR2]));
-	seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR2]));
+		   atomic64_read(
+			   &sbi->s_bal_cX_groups_considered[CR_GOAL_LEN_SLOW]));
+	seq_printf(seq, "\t\textents_scanned: %u\n",
+		   atomic_read(&sbi->s_bal_cX_ex_scanned[CR_GOAL_LEN_SLOW]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_failed[CR2]));
-
-	seq_puts(seq, "\tcr3_stats:\n");
-	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR3]));
-	seq_printf(seq, "\t\tgroups_considered: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR3]));
-	seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR3]));
+		   atomic64_read(&sbi->s_bal_cX_failed[CR_GOAL_LEN_SLOW]));
+
+	/* CR_ANY_FREE stats */
+	seq_puts(seq, "\tcr_any_free_stats:\n");
+	seq_printf(seq, "\t\thits: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_hits[CR_ANY_FREE]));
+	seq_printf(
+		seq, "\t\tgroups_considered: %llu\n",
+		atomic64_read(&sbi->s_bal_cX_groups_considered[CR_ANY_FREE]));
+	seq_printf(seq, "\t\textents_scanned: %u\n",
+		   atomic_read(&sbi->s_bal_cX_ex_scanned[CR_ANY_FREE]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
-		   atomic64_read(&sbi->s_bal_cX_failed[CR3]));
-	seq_printf(seq, "\textents_scanned: %u\n", atomic_read(&sbi->s_bal_ex_scanned));
+		   atomic64_read(&sbi->s_bal_cX_failed[CR_ANY_FREE]));
+
+	/* Aggregates */
+	seq_printf(seq, "\textents_scanned: %u\n",
+		   atomic_read(&sbi->s_bal_ex_scanned));
 	seq_printf(seq, "\t\tgoal_hits: %u\n", atomic_read(&sbi->s_bal_goals));
-	seq_printf(seq, "\t\tlen_goal_hits: %u\n", atomic_read(&sbi->s_bal_len_goals));
+	seq_printf(seq, "\t\tlen_goal_hits: %u\n",
+		   atomic_read(&sbi->s_bal_len_goals));
 	seq_printf(seq, "\t\t2^n_hits: %u\n", atomic_read(&sbi->s_bal_2orders));
 	seq_printf(seq, "\t\tbreaks: %u\n", atomic_read(&sbi->s_bal_breaks));
 	seq_printf(seq, "\t\tlost: %u\n", atomic_read(&sbi->s_mb_lost_chunks));
-
 	seq_printf(seq, "\tbuddies_generated: %u/%u\n",
 		   atomic_read(&sbi->s_mb_buddies_generated),
 		   ext4_get_groups_count(sb));
@@ -3113,8 +3149,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 		   atomic64_read(&sbi->s_mb_generation_time));
 	seq_printf(seq, "\tpreallocated: %u\n",
 		   atomic_read(&sbi->s_mb_preallocated));
-	seq_printf(seq, "\tdiscarded: %u\n",
-		   atomic_read(&sbi->s_mb_discarded));
+	seq_printf(seq, "\tdiscarded: %u\n", atomic_read(&sbi->s_mb_discarded));
 	return 0;
 }
 
@@ -3601,7 +3636,7 @@ int ext4_mb_init(struct super_block *sb)
 	sbi->s_mb_stats = MB_DEFAULT_STATS;
 	sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
 	sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
-	sbi->s_mb_cr1_5_max_trim_order = MB_DEFAULT_CR1_5_TRIM_ORDER;
+	sbi->s_mb_best_avail_max_trim_order = MB_DEFAULT_BEST_AVAIL_TRIM_ORDER;
 
 	/*
 	 * The default group preallocation is 512, which for 4k block
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index bddc0335c261..df6b5e7c2274 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -86,11 +86,11 @@
 #define MB_DEFAULT_LINEAR_SCAN_THRESHOLD	16
 
 /*
- * The maximum order upto which CR1.5 can trim a particular allocation request.
- * Example, if we have an order 7 request and max trim order of 3, CR1.5 can
- * trim this upto order 4.
+ * The maximum order upto which CR_BEST_AVAIL_LEN can trim a particular
+ * allocation request. Example, if we have an order 7 request and max trim order
+ * of 3, we can trim this request upto order 4.
  */
-#define MB_DEFAULT_CR1_5_TRIM_ORDER	3
+#define MB_DEFAULT_BEST_AVAIL_TRIM_ORDER	3
 
 /*
  * Number of valid buddy orders
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 4a5c08c8dddb..6d332dff79dd 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -223,7 +223,7 @@ EXT4_RW_ATTR_SBI_UI(warning_ratelimit_interval_ms, s_warning_ratelimit_state.int
 EXT4_RW_ATTR_SBI_UI(warning_ratelimit_burst, s_warning_ratelimit_state.burst);
 EXT4_RW_ATTR_SBI_UI(msg_ratelimit_interval_ms, s_msg_ratelimit_state.interval);
 EXT4_RW_ATTR_SBI_UI(msg_ratelimit_burst, s_msg_ratelimit_state.burst);
-EXT4_RW_ATTR_SBI_UI(mb_cr1_5_max_trim_order, s_mb_cr1_5_max_trim_order);
+EXT4_RW_ATTR_SBI_UI(mb_best_avail_max_trim_order, s_mb_best_avail_max_trim_order);
 #ifdef CONFIG_EXT4_DEBUG
 EXT4_RW_ATTR_SBI_UL(simulate_fail, s_simulate_fail);
 #endif
@@ -274,7 +274,7 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(warning_ratelimit_burst),
 	ATTR_LIST(msg_ratelimit_interval_ms),
 	ATTR_LIST(msg_ratelimit_burst),
-	ATTR_LIST(mb_cr1_5_max_trim_order),
+	ATTR_LIST(mb_best_avail_max_trim_order),
 	ATTR_LIST(errors_count),
 	ATTR_LIST(warning_count),
 	ATTR_LIST(msg_count),
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 7ea9b4fcb21f..bab28121c7a4 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -120,19 +120,19 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
 		{ EXT4_FC_REASON_INODE_JOURNAL_DATA,	"INODE_JOURNAL_DATA"}, \
 		{ EXT4_FC_REASON_ENCRYPTED_FILENAME,	"ENCRYPTED_FILENAME"})
 
-TRACE_DEFINE_ENUM(CR0);
-TRACE_DEFINE_ENUM(CR1);
-TRACE_DEFINE_ENUM(CR1_5);
-TRACE_DEFINE_ENUM(CR2);
-TRACE_DEFINE_ENUM(CR3);
-
-#define show_criteria(cr)                       \
-	__print_symbolic(cr,                    \
-			 { CR0, "CR0" },	\
-			 { CR1, "CR1" },        \
-			 { CR1_5, "CR1.5" }     \
-			 { CR2, "CR2" },        \
-			 { CR3, "CR3" })
+TRACE_DEFINE_ENUM(CR_POWER2_ALIGNED);
+TRACE_DEFINE_ENUM(CR_GOAL_LEN_FAST);
+TRACE_DEFINE_ENUM(CR_BEST_AVAIL_LEN);
+TRACE_DEFINE_ENUM(CR_GOAL_LEN_SLOW);
+TRACE_DEFINE_ENUM(CR_ANY_FREE);
+
+#define show_criteria(cr)                                               \
+	__print_symbolic(cr,                                            \
+			 { CR_POWER2_ALIGNED, "CR_POWER2_ALIGNED" },	\
+			 { CR_GOAL_LEN_FAST, "CR_GOAL_LEN_FAST" },      \
+			 { CR_BEST_AVAIL_LEN, "CR_BEST_AVAIL_LEN" },    \
+			 { CR_GOAL_LEN_SLOW, "CR_GOAL_LEN_SLOW" },      \
+			 { CR_ANY_FREE, "CR_ANY_FREE" })
 
 TRACE_EVENT(ext4_other_inode_update_time,
 	TP_PROTO(struct inode *inode, ino_t orig_ino),
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 01/12] Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits"
  2023-05-30 12:33 ` [PATCH v2 01/12] Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits" Ojaswin Mujoo
@ 2023-05-30 16:28   ` Sedat Dilek
  2023-05-31  8:57     ` Ojaswin Mujoo
  0 siblings, 1 reply; 26+ messages in thread
From: Sedat Dilek @ 2023-05-30 16:28 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-fsdevel,
	linux-kernel, Jan Kara, Kemeng Shi, Ritesh Harjani

On Tue, May 30, 2023 at 3:25 PM Ojaswin Mujoo <ojaswin@linux.ibm.com> wrote:
>
> This reverts commit 32c0869370194ae5ac9f9f501953ef693040f6a1.
>
> The reverted commit was intended to remove a dead check; however, it was observed
> that this check was actually being used to exit early instead of looping
> sbi->s_mb_max_to_scan times when we are able to find a free extent bigger than
> the goal extent. Due to this, my performance tests (fsmark, parallel file
> writes in a highly fragmented FS) were seeing a 2x-3x regression.
>
> Example, the default value of the following variables is:
>
> sbi->s_mb_max_to_scan = 200
> sbi->s_mb_min_to_scan = 10
>
> In ext4_mb_check_limits() if we find an extent smaller than goal, then we return
> early and try again. This loop will go on until we have processed
> sbi->s_mb_max_to_scan(=200) number of free extents at which point we exit and
> just use whatever we have even if it is smaller than goal extent.
>
> Now, the regression comes when we find an extent bigger than goal. Earlier, in
> this case we would loop only sbi->s_mb_min_to_scan(=10) times and then just use
> the bigger extent. However with commit 32c08693 that check was removed and hence
> we would loop sbi->s_mb_max_to_scan(=200) times even though we have a big enough
> free extent to satisfy the request. The only time we would exit early would be
> when the free extent is *exactly* the size of our goal, which is a pretty uncommon
> occurrence and so we would almost always end up looping 200 times.
>
> Hence, revert the commit by adding the check back to fix the regression. Also
> add a comment to outline this policy.
>
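
A rough standalone model of the scan-limit policy described in the
quoted commit message (illustrative userspace code using the quoted
default tunables, not the actual mballoc implementation):

#include <stdio.h>

#define MB_MIN_TO_SCAN	10	/* sbi->s_mb_min_to_scan default */
#define MB_MAX_TO_SCAN	200	/* sbi->s_mb_max_to_scan default */

static const char *decide(int found, int extent_len, int goal_len)
{
	if (extent_len == goal_len)
		return "use it immediately (exact match)";
	if (extent_len > goal_len && found > MB_MIN_TO_SCAN)
		return "use it (bigger than goal, scanned past min_to_scan)";
	if (found > MB_MAX_TO_SCAN)
		return "stop scanning, use the best extent seen so far";
	return "keep scanning";
}

int main(void)
{
	printf("%s\n", decide(5, 64, 64));	/* exact match */
	printf("%s\n", decide(11, 96, 64));	/* bigger than goal, 11th extent */
	printf("%s\n", decide(201, 32, 64));	/* only smaller ones after 201 extents */
	printf("%s\n", decide(5, 32, 64));	/* still early, keep looking */
	return 0;
}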

Hi,

I applied this single patch of your series v2 on top of Linux v6.4-rc4.

So, if this is a regression, I ask myself if this is material for Linux 6.4?

Can you comment on this, please?

Thanks.

Regards,
-Sedat-


> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>  fs/ext4/mballoc.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index d4b6a2c1881d..7ac6d3524f29 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2063,7 +2063,7 @@ static void ext4_mb_check_limits(struct ext4_allocation_context *ac,
>         if (bex->fe_len < gex->fe_len)
>                 return;
>
> -       if (finish_group)
> +       if (finish_group || ac->ac_found > sbi->s_mb_min_to_scan)
>                 ext4_mb_use_best_found(ac, e4b);
>  }
>
> @@ -2075,6 +2075,20 @@ static void ext4_mb_check_limits(struct ext4_allocation_context *ac,
>   * in the context. Later, the best found extent will be used, if
>   * mballoc can't find good enough extent.
>   *
> + * The algorithm used is roughly as follows:
> + *
> + * * If free extent found is exactly as big as goal, then
> + *   stop the scan and use it immediately
> + *
> + * * If free extent found is smaller than goal, then keep retrying
> + *   upto a max of sbi->s_mb_max_to_scan times (default 200). After
> + *   that stop scanning and use whatever we have.
> + *
> + * * If free extent found is bigger than goal, then keep retrying
> + *   upto a max of sbi->s_mb_min_to_scan times (default 10) before
> + *   stopping the scan and using the extent.
> + *
> + *
>   * FIXME: real allocation policy is to be designed yet!
>   */
>  static void ext4_mb_measure_extent(struct ext4_allocation_context *ac,
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 01/12] Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits"
  2023-05-30 16:28   ` Sedat Dilek
@ 2023-05-31  8:57     ` Ojaswin Mujoo
  2023-06-02 13:45       ` Thorsten Leemhuis
  0 siblings, 1 reply; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-05-31  8:57 UTC (permalink / raw)
  To: Sedat Dilek
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-fsdevel,
	linux-kernel, Jan Kara, Kemeng Shi, Ritesh Harjani

On Tue, May 30, 2023 at 06:28:22PM +0200, Sedat Dilek wrote:
> On Tue, May 30, 2023 at 3:25 PM Ojaswin Mujoo <ojaswin@linux.ibm.com> wrote:
> >
> > This reverts commit 32c0869370194ae5ac9f9f501953ef693040f6a1.
> >
> > The reverted commit was intended to remove a dead check; however, it was observed
> > that this check was actually being used to exit early instead of looping
> > sbi->s_mb_max_to_scan times when we are able to find a free extent bigger than
> > the goal extent. Due to this, my performance tests (fsmark, parallel file
> > writes in a highly fragmented FS) were seeing a 2x-3x regression.
> >
> > Example, the default value of the following variables is:
> >
> > sbi->s_mb_max_to_scan = 200
> > sbi->s_mb_min_to_scan = 10
> >
> > In ext4_mb_check_limits() if we find an extent smaller than goal, then we return
> > early and try again. This loop will go on until we have processed
> > sbi->s_mb_max_to_scan(=200) number of free extents at which point we exit and
> > just use whatever we have even if it is smaller than goal extent.
> >
> > Now, the regression comes when we find an extent bigger than goal. Earlier, in
> > this case we would loop only sbi->s_mb_min_to_scan(=10) times and then just use
> > the bigger extent. However with commit 32c08693 that check was removed and hence
> > we would loop sbi->s_mb_max_to_scan(=200) times even though we have a big enough
> > free extent to satisfy the request. The only time we would exit early would be
> > when the free extent is *exactly* the size of our goal, which is a pretty uncommon
> > occurrence and so we would almost always end up looping 200 times.
> >
> > Hence, revert the commit by adding the check back to fix the regression. Also
> > add a comment to outline this policy.
> >
> 
> Hi,
> 
> I applied this single patch of your series v2 on top of Linux v6.4-rc4.
> 
> So, if this is a regression I ask myself if this is material for Linux 6.4?
> 
> Can you comment on this, please?
> 
> Thanks.
> 
> Regards,
> -Sedat-

Hi Sedat,

Since this patch fixes a regression, I think it should ideally go in
Linux 6.4

Regards,
ojaswin
> 
> 
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> > Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
> > ---
> >  fs/ext4/mballoc.c | 16 +++++++++++++++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index d4b6a2c1881d..7ac6d3524f29 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -2063,7 +2063,7 @@ static void ext4_mb_check_limits(struct ext4_allocation_context *ac,
> >         if (bex->fe_len < gex->fe_len)
> >                 return;
> >
> > -       if (finish_group)
> > +       if (finish_group || ac->ac_found > sbi->s_mb_min_to_scan)
> >                 ext4_mb_use_best_found(ac, e4b);
> >  }
> >
> > @@ -2075,6 +2075,20 @@ static void ext4_mb_check_limits(struct ext4_allocation_context *ac,
> >   * in the context. Later, the best found extent will be used, if
> >   * mballoc can't find good enough extent.
> >   *
> > + * The algorithm used is roughly as follows:
> > + *
> > + * * If free extent found is exactly as big as goal, then
> > + *   stop the scan and use it immediately
> > + *
> > + * * If free extent found is smaller than goal, then keep retrying
> > + *   upto a max of sbi->s_mb_max_to_scan times (default 200). After
> > + *   that stop scanning and use whatever we have.
> > + *
> > + * * If free extent found is bigger than goal, then keep retrying
> > + *   upto a max of sbi->s_mb_min_to_scan times (default 10) before
> > + *   stopping the scan and using the extent.
> > + *
> > + *
> >   * FIXME: real allocation policy is to be designed yet!
> >   */
> >  static void ext4_mb_measure_extent(struct ext4_allocation_context *ac,
> > --
> > 2.31.1
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 01/12] Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits"
  2023-05-31  8:57     ` Ojaswin Mujoo
@ 2023-06-02 13:45       ` Thorsten Leemhuis
  2023-06-02 16:45         ` Theodore Ts'o
  0 siblings, 1 reply; 26+ messages in thread
From: Thorsten Leemhuis @ 2023-06-02 13:45 UTC (permalink / raw)
  To: Ojaswin Mujoo, Sedat Dilek
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-fsdevel,
	linux-kernel, Jan Kara, Kemeng Shi, Ritesh Harjani

On 31.05.23 10:57, Ojaswin Mujoo wrote:
> On Tue, May 30, 2023 at 06:28:22PM +0200, Sedat Dilek wrote:
>> On Tue, May 30, 2023 at 3:25 PM Ojaswin Mujoo <ojaswin@linux.ibm.com> wrote:
>>>
>>> This reverts commit 32c0869370194ae5ac9f9f501953ef693040f6a1.
>>>
>>> The reverted commit was intended to remove a dead check; however, it was observed
>>> that this check was actually being used to exit early instead of looping
>>> sbi->s_mb_max_to_scan times when we are able to find a free extent bigger than
>>> the goal extent. Due to this, my performance tests (fsmark, parallel file
>>> writes in a highly fragmented FS) were seeing a 2x-3x regression.
>>>
>>> For example, the default values of the following variables are:
>>>
>>> sbi->s_mb_max_to_scan = 200
>>> sbi->s_mb_min_to_scan = 10
>>>
>>> In ext4_mb_check_limits(), if we find an extent smaller than the goal, then we return
>>> early and try again. This loop will go on until we have processed
>>> sbi->s_mb_max_to_scan(=200) free extents, at which point we exit and
>>> just use whatever we have, even if it is smaller than the goal extent.
>>>
>>> Now, the regression comes when we find an extent bigger than the goal. Earlier, in
>>> this case we would loop only sbi->s_mb_min_to_scan(=10) times and then just use
>>> the bigger extent. However, with commit 32c08693 that check was removed and hence
>>> we would loop sbi->s_mb_max_to_scan(=200) times even though we have a big enough
>>> free extent to satisfy the request. The only time we would exit early would be
>>> when the free extent is *exactly* the size of our goal, which is a pretty uncommon
>>> occurrence, and so we would almost always end up looping 200 times.
>>>
>>> Hence, revert the commit by adding the check back to fix the regression. Also
>>> add a comment to outline this policy.
>>
>> I applied this single patch of your series v2 on top of Linux v6.4-rc4.
>>
>> So, if this is a regression I ask myself if this is material for Linux 6.4?
>>
>> Can you comment on this, please?
> 
> Since this patch fixes a regression, I think it should ideally go in
> Linux 6.4.

Ted can speak up for himself, but maybe this might speed things up:

A lot of maintainers in a case like this want fixes (like this)
submitted separately from other changes (like the rest of this series).

/me hopes this will help and not confuse anything

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 01/12] Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits"
  2023-06-02 13:45       ` Thorsten Leemhuis
@ 2023-06-02 16:45         ` Theodore Ts'o
  0 siblings, 0 replies; 26+ messages in thread
From: Theodore Ts'o @ 2023-06-02 16:45 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Ojaswin Mujoo, Sedat Dilek, linux-ext4, Ritesh Harjani,
	linux-fsdevel, linux-kernel, Jan Kara, Kemeng Shi,
	Ritesh Harjani

On Fri, Jun 02, 2023 at 03:45:52PM +0200, Thorsten Leemhuis wrote:
> > 
> > Since this patch fixes a regression, I think it should ideally go in
> > Linux 6.4.
> 
> Ted can speak up for himself, but maybe this might speed things up:
> 
> A lot of maintainers in a case like this want fixes (like this)
> submitted separately from other changes (like the rest of this series).

While it's nice to do that in the future (since I would have noticed
this earlier, it could have gone into my regression fixes push to
Linus last week), in this particular case I've already noted this
particular issue, and per the discussion in the last weekly ext4 video
conference chat, I will be reordering the patches so I can send a
secondary regression fix to Linus very shortly.

Thanks,

						- Ted

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 04/12] ext4: Convert mballoc cr (criteria) to enum
  2023-05-30 12:33 ` [PATCH v2 04/12] ext4: Convert mballoc cr (criteria) to enum Ojaswin Mujoo
@ 2023-06-06 13:13   ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2023-06-06 13:13 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-fsdevel,
	linux-kernel, Jan Kara, Kemeng Shi, Ritesh Harjani

On Tue 30-05-23 18:03:42, Ojaswin Mujoo wrote:
> Convert criteria to be an enum so it is easier to maintain and
> update the tracefiles to use enum names. This change also makes
> it easier to insert new criterias in the future.
> 
> There is no functional change in this patch.
> 
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

...

> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index c075da665ec1..f9a4eaa10c6a 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -127,6 +127,23 @@ enum SHIFT_DIRECTION {
>  	SHIFT_RIGHT,
>  };
>  
> +/*
> + * Number of criterias defined. For each criteria, mballoc has slightly
> + * different way of finding the required blocks nad usually, higher the
> + * criteria the slower the allocation. We start at lower criterias and keep
> + * falling back to higher ones if we are not able to find any blocks.
> + */
> +#define EXT4_MB_NUM_CRS 4
> +/*
> + * All possible allocation criterias for mballoc
> + */
> +enum criteria {
> +	CR0,
> +	CR1,
> +	CR2,
> +	CR3,
> +};

Usually we define EXT4_MB_NUM_CRS like:

enum criteria {
	CR0,
	CR1,
	CR2,
	CR3,
	EXT4_MB_NUM_CRS
};

> @@ -2626,7 +2626,7 @@ static noinline_for_stack int
>  ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  {
>  	ext4_group_t prefetch_grp = 0, ngroups, group, i;
> -	int cr = -1, new_cr;
> +	enum criteria cr, new_cr;
>  	int err = 0, first_err = 0;
>  	unsigned int nr = 0, prefetch_ios = 0;
>  	struct ext4_sb_info *sbi;

This can cause uninitialized use of 'cr' variable in the 'out:' label.
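
For illustration, a condensed sketch of the problematic flow (control flow
heavily simplified; 'early_failure' is a stand-in for the real early-exit
paths, but the mb_debug() at the 'out:' label really does read cr when
CONFIG_EXT4_DEBUG is enabled):

        enum criteria cr, new_cr;               /* was: int cr = -1; */
        int err = 0;

        if (early_failure)                      /* hypothetical early exit */
                goto out;                       /* cr never assigned on this path */

        for (cr = ac->ac_2order ? CR0 : CR1; cr < EXT4_MB_NUM_CRS; cr++) {
                /* normal group scan ... */
        }
out:
        /* with CONFIG_EXT4_DEBUG this logs cr, which may be uninitialized */
        mb_debug(sb, "... cr %d ret %d\n", cr, err);
        return err;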

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 09/12] ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs
  2023-05-30 12:33 ` [PATCH v2 09/12] ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs Ojaswin Mujoo
@ 2023-06-06 14:00   ` Guoqing Jiang
  2023-06-27  6:51     ` Ojaswin Mujoo
  0 siblings, 1 reply; 26+ messages in thread
From: Guoqing Jiang @ 2023-06-06 14:00 UTC (permalink / raw)
  To: Ojaswin Mujoo, linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, Jan Kara,
	Kemeng Shi, Ritesh Harjani

Hello,

On 5/30/23 20:33, Ojaswin Mujoo wrote:
> Before this patch, the call stack in ext4_run_li_request is as follows:
>
>    /*
>     * nr = no. of BGs we want to fetch (=s_mb_prefetch)
>     * prefetch_ios = no. of BGs not uptodate after
>     * 		    ext4_read_block_bitmap_nowait()
>     */
>    next_group = ext4_mb_prefetch(sb, group, nr, prefetch_ios);
>    ext4_mb_prefetch_fini(sb, next_group, prefetch_ios);
>
> ext4_mb_prefetch_fini() will only try to initialize buddies for BGs in
> range [next_group - prefetch_ios, next_group). This is incorrect since
> sometimes (prefetch_ios < nr), which causes ext4_mb_prefetch_fini() to
> incorrectly ignore some of the BGs that might need initialization. This
> issue is more notable now with the previous patch enabling "fetching" of
> BLOCK_UNINIT BGs which are marked buffer_uptodate by default.
>
> Fix this by passing nr to ext4_mb_prefetch_fini() instead of
> prefetch_ios so that it considers the right range of groups.

Thanks for the series.

> Similarly, make sure we don't pass nr=0 to ext4_mb_prefetch_fini() in
> ext4_mb_regular_allocator() since we might have prefetched BLOCK_UNINIT
> groups that would need buddy initialization.

Seems ext4_mb_prefetch_fini can't be called by ext4_mb_regular_allocator
if nr is 0.

https://elixir.bootlin.com/linux/v6.4-rc5/source/fs/ext4/mballoc.c#L2816

Am I missing something?

Thanks,
Guoqing

> Signed-off-by: Ojaswin Mujoo<ojaswin@linux.ibm.com>
> Reviewed-by: Ritesh Harjani (IBM)<ritesh.list@gmail.com>
> Reviewed-by: Jan Kara<jack@suse.cz>
> ---
>   fs/ext4/mballoc.c |  4 ----
>   fs/ext4/super.c   | 11 ++++-------
>   2 files changed, 4 insertions(+), 11 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 79455c7e645b..6775d73dfc68 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2735,8 +2735,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>   			if ((prefetch_grp == group) &&
>   			    (cr > CR1 ||
>   			     prefetch_ios < sbi->s_mb_prefetch_limit)) {
> -				unsigned int curr_ios = prefetch_ios;
> -
>   				nr = sbi->s_mb_prefetch;
>   				if (ext4_has_feature_flex_bg(sb)) {
>   					nr = 1 << sbi->s_log_groups_per_flex;
> @@ -2745,8 +2743,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>   				}
>   				prefetch_grp = ext4_mb_prefetch(sb, group,
>   							nr, &prefetch_ios);
> -				if (prefetch_ios == curr_ios)
> -					nr = 0;
>   			}
>   
>   			/* This now checks without needing the buddy page */
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 2da5476fa48b..27c1dabacd43 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3692,16 +3692,13 @@ static int ext4_run_li_request(struct ext4_li_request *elr)
>   	ext4_group_t group = elr->lr_next_group;
>   	unsigned int prefetch_ios = 0;
>   	int ret = 0;
> +	int nr = EXT4_SB(sb)->s_mb_prefetch;
>   	u64 start_time;
>   
>   	if (elr->lr_mode == EXT4_LI_MODE_PREFETCH_BBITMAP) {
> -		elr->lr_next_group = ext4_mb_prefetch(sb, group,
> -				EXT4_SB(sb)->s_mb_prefetch, &prefetch_ios);
> -		if (prefetch_ios)
> -			ext4_mb_prefetch_fini(sb, elr->lr_next_group,
> -					      prefetch_ios);
> -		trace_ext4_prefetch_bitmaps(sb, group, elr->lr_next_group,
> -					    prefetch_ios);
> +		elr->lr_next_group = ext4_mb_prefetch(sb, group, nr, &prefetch_ios);
> +		ext4_mb_prefetch_fini(sb, elr->lr_next_group, nr);
> +		trace_ext4_prefetch_bitmaps(sb, group, elr->lr_next_group, nr);
>   		if (group >= elr->lr_next_group) {
>   			ret = 1;
>   			if (elr->lr_first_not_zeroed != ngroups &&


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 11/12] ext4: Add allocation criteria 1.5 (CR1_5)
  2023-05-30 12:33 ` [PATCH v2 11/12] ext4: Add allocation criteria 1.5 (CR1_5) Ojaswin Mujoo
@ 2023-06-07 10:21   ` Jan Kara
  2023-06-08 14:45     ` Theodore Ts'o
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2023-06-07 10:21 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-fsdevel,
	linux-kernel, Jan Kara, Kemeng Shi, Ritesh Harjani

On Tue 30-05-23 18:03:49, Ojaswin Mujoo wrote:
> CR1_5 aims to optimize allocations which can't be satisfied in CR1. The
> fact that we couldn't find a group in CR1 suggests that it would be
> difficult to find a continuous extent to completely satisfy our
> allocations. So before falling to the slower CR2, in CR1.5 we
> proactively trim the preallocations so we can find a group with
> (free / fragments) big enough.  This speeds up our allocation at the
> cost of slightly reduced preallocation.
> 
> The patch also adds a new sysfs tunable:
> 
> * /sys/fs/ext4/<partition>/mb_cr1_5_max_trim_order
> 
> This controls how much CR1.5 can trim a request before falling to CR2.
> For example, for a request of order 7 and max trim order 2, CR1.5 can
> trim this up to order 5.
> 
> Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> 
> ext4 squash

Why is this here?

> +/*
> + * We couldn't find a group in CR1 so try to find the highest free fragment
> + * order we have and proactively trim the goal request length to that order to
> + * find a suitable group faster.
> + *
> + * This optimizes allocation speed at the cost of slightly reduced
> + * preallocations. However, we make sure that we don't trim the request too
> + * much and fall to CR2 in that case.
> + */
> +static void ext4_mb_choose_next_group_cr1_5(struct ext4_allocation_context *ac,
> +		enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> +	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> +	struct ext4_group_info *grp = NULL;
> +	int i, order, min_order;
> +	unsigned long num_stripe_clusters = 0;
> +
> +	if (unlikely(ac->ac_flags & EXT4_MB_CR1_5_OPTIMIZED)) {
> +		if (sbi->s_mb_stats)
> +			atomic_inc(&sbi->s_bal_cr1_5_bad_suggestions);
> +	}
> +
> +	/*
> +	 * mb_avg_fragment_size_order() returns order in a way that makes
> +	 * retrieving back the length using (1 << order) inaccurate. Hence, use
> +	 * fls() instead since we need to know the actual length while modifying
> +	 * goal length.
> +	 */
> +	order = fls(ac->ac_g_ex.fe_len);
> +	min_order = order - sbi->s_mb_cr1_5_max_trim_order;
> +	if (min_order < 0)
> +		min_order = 0;
> +
> +	if (1 << min_order < ac->ac_o_ex.fe_len)
> +		min_order = fls(ac->ac_o_ex.fe_len) + 1;
> +
> +	if (sbi->s_stripe > 0) {
> +		/*
> +		 * We are assuming that stripe size is always a multiple of
> +		 * cluster ratio otherwise __ext4_fill_super exists early.
> +		 */
> +		num_stripe_clusters = EXT4_NUM_B2C(sbi, sbi->s_stripe);
> +		if (1 << min_order < num_stripe_clusters)
> +			min_order = fls(num_stripe_clusters);
> +	}
> +
> +	for (i = order; i >= min_order; i--) {
> +		int frag_order;
> +		/*
> +		 * Scale down goal len to make sure we find something
> +		 * in the free fragments list. Basically, reduce
> +		 * preallocations.
> +		 */
> +		ac->ac_g_ex.fe_len = 1 << i;

I smell some off-by-one issues here. Look fls(1) == 1 so (1 << fls(n)) > n.
Hence this loop will actually *grow* the goal allocation length. Also I'm
not sure why you have +1 in min_order = fls(ac->ac_o_ex.fe_len) + 1.
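
(A quick userspace illustration of the fls() point; fls_demo() below is a
stand-in for the kernel's fls(), built on __builtin_clz(), and the sample
lengths are made up:)

#include <stdio.h>

/* stand-in for the kernel's fls(): position of the highest set bit, 1-based */
static int fls_demo(unsigned int n)
{
        return n ? 32 - __builtin_clz(n) : 0;
}

int main(void)
{
        unsigned int lens[] = { 1, 5, 8, 200 };

        for (int i = 0; i < 4; i++) {
                unsigned int n = lens[i];

                /*
                 * 1 << fls(n) is strictly greater than n, so starting the
                 * trim loop at i = fls(fe_len) grows the goal length.
                 */
                printf("fe_len=%3u  fls=%d  1<<fls=%u\n",
                       n, fls_demo(n), 1u << fls_demo(n));
        }
        return 0;
}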

> +
> +		if (num_stripe_clusters > 0) {
> +			/*
> +			 * Try to round up the adjusted goal to stripe size
						        ^^^ goal length?

> +			 * (in cluster units) multiple for efficiency.
> +			 *
> +			 * XXX: Is s->stripe always a power of 2? In that case
> +			 * we can use the faster round_up() variant.
> +			 */

I don't think s->stripe has to be a power of 2. E.g. when you have three
data disks in a RAID config.
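
(For reference, the difference between the two helpers, slightly simplified
from include/linux/math.h — roundup() divides and works for any granularity,
while round_up() relies on bit masking and is only correct when the
granularity is a power of 2:)

        #define roundup(x, y)  ((((x) + ((y) - 1)) / (y)) * (y))  /* any y > 0     */
        #define round_up(x, y) ((((x) - 1) | ((y) - 1)) + 1)      /* y must be 2^n */

        /* e.g. rounding a goal length to a 3-disk stripe needs the generic variant: */
        len = roundup(len, num_stripe_clusters);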

Otherwise the patch looks good to me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 12/12] ext4: Give symbolic names to mballoc criterias
  2023-05-30 12:33 ` [PATCH v2 12/12] ext4: Give symbolic names to mballoc criterias Ojaswin Mujoo
@ 2023-06-07 10:39   ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2023-06-07 10:39 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-fsdevel,
	linux-kernel, Jan Kara, Kemeng Shi

On Tue 30-05-23 18:03:50, Ojaswin Mujoo wrote:
> mballoc criterias have historically been called by numbers
> like CR0, CR1... however this makes it confusing to understand
> what each criteria is about.
> 
> Change these criterias from numbers to symbolic names and add
> relevant comments. While we are at it, also reformat and add some
> comments to ext4_seq_mb_stats_show() for better readability.
> 
> Additionally, define CR_FAST which signifies the criteria
> below which we can make quicker decisions like:
>   * quitting early if (free block < requested len)
>   * avoiding to scan free extents smaller than required len.
>   * avoiding to initialize buddy cache and work with existing cache
>   * limiting prefetches
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Thanks for doing this!

> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 942e97026a60..c29a4e1fcd5d 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -135,16 +135,45 @@ enum SHIFT_DIRECTION {
>   */
>  #define EXT4_MB_NUM_CRS 5
>  /*
> - * All possible allocation criterias for mballoc
> + * All possible allocation criterias for mballoc. Lower are faster.
>   */
>  enum criteria {
> -	CR0,
> -	CR1,
> -	CR1_5,
> -	CR2,
> -	CR3,
> +	/*
> +	 * Used when number of blocks needed is a power of 2. This doesn't
> +	 * trigger any disk IO except prefetch and is the fastest criteria.
> +	 */
> +	CR_POWER2_ALIGNED,
> +
> +	/*
> +	 * Tries to lookup in-memory data structures to find the most suitable
> +	 * group that satisfies goal request. No disk IO except block prefetch.
> +	 */
> +	CR_GOAL_LEN_FAST,
> +
> +        /*
> +	 * Same as CR_GOAL_LEN_FAST but is allowed to reduce the goal length to
> +         * the best available length for faster allocation.

Some whitespace damage here...

> +	 */
> +	CR_BEST_AVAIL_LEN,
> +
> +	/*
> +	 * Reads each block group sequentially, performing disk IO if necessary, to
> +	 * find find_suitable block group. Tries to allocate goal length but might trim

Too long line here.

> +	 * the request if nothing is found after enough tries.
> +	 */
> +	CR_GOAL_LEN_SLOW,
> +
> +	/*
> +	 * Finds the first free set of blocks and allocates those. This is only
> +	 * used in rare cases when CR_GOAL_LEN_SLOW also fails to allocate
> +	 * anything.
> +	 */
> +	CR_ANY_FREE,
>  };
>  
> +/* criteria below which we use fast block scanning and avoid unnecessary IO */
> +#define CR_FAST CR_GOAL_LEN_SLOW
> +

Maybe instead of defining CR_FAST value we could define

static inline bool mballoc_cr_expensive(enum criteria cr)
{
	return cr >= CR_GOAL_LEN_SLOW;
}

And use this. I think it will make the conditions more understandable.

...

> @@ -1064,7 +1068,7 @@ static inline int should_optimize_scan(struct ext4_allocation_context *ac)
>  {
>  	if (unlikely(!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN)))
>  		return 0;
> -	if (ac->ac_criteria >= CR2)
> +	if (ac->ac_criteria >= CR_GOAL_LEN_SLOW)

Maybe we should use CR_FAST (or the new function) here?
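
i.e., something along these lines, assuming the mballoc_cr_expensive() helper
sketched above is added:

        if (unlikely(!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN)))
                return 0;
        /* expensive criterias walk the groups linearly anyway */
        if (mballoc_cr_expensive(ac->ac_criteria))
                return 0;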

Otherwise the patch looks good!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 11/12] ext4: Add allocation criteria 1.5 (CR1_5)
  2023-06-07 10:21   ` Jan Kara
@ 2023-06-08 14:45     ` Theodore Ts'o
  2023-06-09 10:57       ` Ojaswin Mujoo
  0 siblings, 1 reply; 26+ messages in thread
From: Theodore Ts'o @ 2023-06-08 14:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ojaswin Mujoo, linux-ext4, Ritesh Harjani, linux-fsdevel,
	linux-kernel, Kemeng Shi, Ritesh Harjani

Jan, thanks for the comments to Ojaswin's patch series.  Since I had
already landed his patch series in my tree and have been testing it,
I've fixed the obvious issues you've raised in a fixup patch
(attached).

There is one issue which I have not fixed:

On Wed, Jun 07, 2023 at 12:21:03PM +0200, Jan Kara wrote:
> > +	for (i = order; i >= min_order; i--) {
> > +		int frag_order;
> > +		/*
> > +		 * Scale down goal len to make sure we find something
> > +		 * in the free fragments list. Basically, reduce
> > +		 * preallocations.
> > +		 */
> > +		ac->ac_g_ex.fe_len = 1 << i;
> 
> I smell some off-by-one issues here. Look fls(1) == 1 so (1 << fls(n)) > n.
> Hence this loop will actually *grow* the goal allocation length. Also I'm
> not sure why you have +1 in min_order = fls(ac->ac_o_ex.fe_len) + 1.

Ojaswin, could you take a look at this?  Thanks!!

	       	   	       	      - Ted

commit 182d2d90a180838789ed5a19e08c333043d1617a
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Jun 8 10:39:35 2023 -0400

    ext4: clean up mballoc criteria comments
    
    Line wrap and slightly clarify the comments describing mballoc's
    criteria.
    
    Define EXT4_MB_NUM_CRS as part of the enum, so that it will
    automatically get updated when criteria is added or removed.
    
    Also fix a potential uninitialized use of the 'cr' variable if
    CONFIG_EXT4_DEBUG is enabled.
    
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6a1f013d23f7..45a531446ea2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -128,47 +128,52 @@ enum SHIFT_DIRECTION {
 };
 
 /*
- * Number of criterias defined. For each criteria, mballoc has slightly
- * different way of finding the required blocks nad usually, higher the
- * criteria the slower the allocation. We start at lower criterias and keep
- * falling back to higher ones if we are not able to find any blocks.
- */
-#define EXT4_MB_NUM_CRS 5
-/*
- * All possible allocation criterias for mballoc. Lower are faster.
+ * For each criteria, mballoc has slightly different way of finding
+ * the required blocks nad usually, higher the criteria the slower the
+ * allocation.  We start at lower criterias and keep falling back to
+ * higher ones if we are not able to find any blocks.  Lower (earlier)
+ * criteria are faster.
  */
 enum criteria {
 	/*
-	 * Used when number of blocks needed is a power of 2. This doesn't
-	 * trigger any disk IO except prefetch and is the fastest criteria.
+	 * Used when number of blocks needed is a power of 2. This
+	 * doesn't trigger any disk IO except prefetch and is the
+	 * fastest criteria.
 	 */
 	CR_POWER2_ALIGNED,
 
 	/*
-	 * Tries to lookup in-memory data structures to find the most suitable
-	 * group that satisfies goal request. No disk IO except block prefetch.
+	 * Tries to lookup in-memory data structures to find the most
+	 * suitable group that satisfies goal request. No disk IO
+	 * except block prefetch.
 	 */
 	CR_GOAL_LEN_FAST,
 
         /*
-	 * Same as CR_GOAL_LEN_FAST but is allowed to reduce the goal length to
-         * the best available length for faster allocation.
+	 * Same as CR_GOAL_LEN_FAST but is allowed to reduce the goal
+         * length to the best available length for faster allocation.
 	 */
 	CR_BEST_AVAIL_LEN,
 
 	/*
-	 * Reads each block group sequentially, performing disk IO if necessary, to
-	 * find find_suitable block group. Tries to allocate goal length but might trim
-	 * the request if nothing is found after enough tries.
+	 * Reads each block group sequentially, performing disk IO if
+	 * necessary, to find find_suitable block group. Tries to
+	 * allocate goal length but might trim the request if nothing
+	 * is found after enough tries.
 	 */
 	CR_GOAL_LEN_SLOW,
 
 	/*
-	 * Finds the first free set of blocks and allocates those. This is only
-	 * used in rare cases when CR_GOAL_LEN_SLOW also fails to allocate
-	 * anything.
+	 * Finds the first free set of blocks and allocates
+	 * those. This is only used in rare cases when
+	 * CR_GOAL_LEN_SLOW also fails to allocate anything.
 	 */
 	CR_ANY_FREE,
+
+	/*
+	 * Number of criterias defined.
+	 */
+	EXT4_MB_NUM_CRS
 };
 
 /* criteria below which we use fast block scanning and avoid unnecessary IO */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8a6896d4e9b0..2f9f5dc720cc 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2759,7 +2759,7 @@ static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
 	ext4_group_t prefetch_grp = 0, ngroups, group, i;
-	enum criteria cr, new_cr;
+	enum criteria new_cr, cr = CR_GOAL_LEN_FAST;
 	int err = 0, first_err = 0;
 	unsigned int nr = 0, prefetch_ios = 0;
 	struct ext4_sb_info *sbi;
@@ -2816,12 +2816,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		spin_unlock(&sbi->s_md_lock);
 	}
 
-	/* Let's just scan groups to find more-less suitable blocks */
-	cr = ac->ac_2order ? CR_POWER2_ALIGNED : CR_GOAL_LEN_FAST;
 	/*
-	 * cr == CR_POWER2_ALIGNED try to get exact allocation,
-	 * cr == CR_ANY_FREE try to get anything
+	 * Let's just scan groups to find more-less suitable blocks We
+	 * start with CR_GOAL_LEN_FAST, unless it is power of 2
+	 * aligned, in which case let's do that faster approach first.
 	 */
+	if (ac->ac_2order)
+		cr = CR_POWER2_ALIGNED;
 repeat:
 	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
 		ac->ac_criteria = cr;

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 00/12] multiblock allocator improvements
  2023-05-30 12:33 [PATCH v2 00/12] multiblock allocator improvements Ojaswin Mujoo
                   ` (11 preceding siblings ...)
  2023-05-30 12:33 ` [PATCH v2 12/12] ext4: Give symbolic names to mballoc criterias Ojaswin Mujoo
@ 2023-06-09  3:14 ` Theodore Ts'o
  12 siblings, 0 replies; 26+ messages in thread
From: Theodore Ts'o @ 2023-06-09  3:14 UTC (permalink / raw)
  To: linux-ext4, Ojaswin Mujoo
  Cc: Theodore Ts'o, Ritesh Harjani, linux-fsdevel, linux-kernel,
	Jan Kara, Kemeng Shi


On Tue, 30 May 2023 18:03:38 +0530, Ojaswin Mujoo wrote:
> ** Changed since v1 [2] **
> 
>  1. Rebase over Kemeng's recent mballoc patchset [3]
>  2. Picked up Kemeng's RVB on patch 1/12
> 
>  [2] https://lore.kernel.org/all/cover.1685009579.git.ojaswin@linux.ibm.com/
>  [3] https://lore.kernel.org/all/20230417110617.2664129-1-shikemeng@huaweicloud.com/
> 
> [...]

Applied, thanks!

[01/12] Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits"
        commit: 3582e74599d376bc18cae123045cd295360d885b
[02/12] ext4: mballoc: Remove useless setting of ac_criteria
        commit: fb665804fd62e600b5c2350ea69295261ce8374d
[03/12] ext4: Remove unused extern variables declaration
        commit: 3086ed54c0e65c60b0fb142e181e7dd4e3b7b1e0
[04/12] ext4: Convert mballoc cr (criteria) to enum
        commit: eb7d4a8b9510887fb690a6b912d80cb0bce21387
[05/12] ext4: Add per CR extent scanned counter
        commit: 9e97d81a1fa105b80583b5152e4b9cb794734585
[06/12] ext4: Add counter to track successful allocation of goal length
        commit: af97bca67ff63191d44023f895b6033eb7d3423a
[07/12] ext4: Avoid scanning smaller extents in BG during CR1
        commit: caf886aecd608a8ef05ab10957cf4b9fd9564712
[08/12] ext4: Don't skip prefetching BLOCK_UNINIT groups
        commit: bf912c937ed41c4581d77806b003f22625eee0b5
[09/12] ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs
        commit: 64f6fb876cedc30fc1430b96eb442bd84bc61459
[10/12] ext4: Abstract out logic to search average fragment list
        commit: 1918cdc99d125c275dcdd4527520c78bb1a3c1ef
[11/12] ext4: Add allocation criteria 1.5 (CR1_5)
        commit: 7b748ea2a6ad2bda304553b5cf8745f542af6b34
[12/12] ext4: Give symbolic names to mballoc criterias
        commit: c9f19daa1824a73218526650a9aade17536527c8

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 11/12] ext4: Add allocation criteria 1.5 (CR1_5)
  2023-06-08 14:45     ` Theodore Ts'o
@ 2023-06-09 10:57       ` Ojaswin Mujoo
  0 siblings, 0 replies; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-06-09 10:57 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, linux-ext4, Ritesh Harjani, linux-fsdevel,
	linux-kernel, Kemeng Shi, Ritesh Harjani

On Thu, Jun 08, 2023 at 10:45:05AM -0400, Theodore Ts'o wrote:
> Jan, thanks for the comments to Ojaswin's patch series.  Since I had
> already landed his patch series in my tree and have been testing it,
> I've fixed the obvious issues you've raised in a fixup patch
> (attached).
> 
> There is one issue which I have not fixed:
> 
> On Wed, Jun 07, 2023 at 12:21:03PM +0200, Jan Kara wrote:
> > > +	for (i = order; i >= min_order; i--) {
> > > +		int frag_order;
> > > +		/*
> > > +		 * Scale down goal len to make sure we find something
> > > +		 * in the free fragments list. Basically, reduce
> > > +		 * preallocations.
> > > +		 */
> > > +		ac->ac_g_ex.fe_len = 1 << i;
> > 
> > I smell some off-by-one issues here. Look fls(1) == 1 so (1 << fls(n)) > n.
> > Hence this loop will actually *grow* the goal allocation length. Also I'm
> > not sure why you have +1 in min_order = fls(ac->ac_o_ex.fe_len) + 1.
> 
> Ojaswin, could you take a look at this?  Thanks!!
> 
> 	       	   	       	      - Ted
> 
> commit 182d2d90a180838789ed5a19e08c333043d1617a
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Thu Jun 8 10:39:35 2023 -0400
> 
>     ext4: clean up mballoc criteria comments
>     
>     Line wrap and slightly clarify the comments describing mballoc's
>     criteria.
>     
>     Define EXT4_MB_NUM_CRS as part of the enum, so that it will
>     automatically get updated when criteria is added or removed.
>     
>     Also fix a potential uninitialized use of the 'cr' variable if
>     CONFIG_EXT4_DEBUG is enabled.
>     
>     Signed-off-by: Theodore Ts'o <tytso@mit.edu>

Hi Ted, 

Patch looks good, thanks for doing this. I've sent the fix
for the off-by-one issue here:

https://lore.kernel.org/linux-ext4/20230609103403.112807-1-ojaswin@linux.ibm.com/T/#u

Jan, thanks for the review. I've addressed the bug for now. Since
I'm on vacation for the next one and a half weeks, I might not be able to
address the other cleanups. I'll get them done once I'm back.

Regards,
ojaswin
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 6a1f013d23f7..45a531446ea2 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -128,47 +128,52 @@ enum SHIFT_DIRECTION {
>  };
>  
>  /*
> - * Number of criterias defined. For each criteria, mballoc has slightly
> - * different way of finding the required blocks nad usually, higher the
> - * criteria the slower the allocation. We start at lower criterias and keep
> - * falling back to higher ones if we are not able to find any blocks.
> - */
> -#define EXT4_MB_NUM_CRS 5
> -/*
> - * All possible allocation criterias for mballoc. Lower are faster.
> + * For each criteria, mballoc has slightly different way of finding
> + * the required blocks nad usually, higher the criteria the slower the
> + * allocation.  We start at lower criterias and keep falling back to
> + * higher ones if we are not able to find any blocks.  Lower (earlier)
> + * criteria are faster.
>   */
>  enum criteria {
>  	/*
> -	 * Used when number of blocks needed is a power of 2. This doesn't
> -	 * trigger any disk IO except prefetch and is the fastest criteria.
> +	 * Used when number of blocks needed is a power of 2. This
> +	 * doesn't trigger any disk IO except prefetch and is the
> +	 * fastest criteria.
>  	 */
>  	CR_POWER2_ALIGNED,
>  
>  	/*
> -	 * Tries to lookup in-memory data structures to find the most suitable
> -	 * group that satisfies goal request. No disk IO except block prefetch.
> +	 * Tries to lookup in-memory data structures to find the most
> +	 * suitable group that satisfies goal request. No disk IO
> +	 * except block prefetch.
>  	 */
>  	CR_GOAL_LEN_FAST,
>  
>          /*
> -	 * Same as CR_GOAL_LEN_FAST but is allowed to reduce the goal length to
> -         * the best available length for faster allocation.
> +	 * Same as CR_GOAL_LEN_FAST but is allowed to reduce the goal
> +         * length to the best available length for faster allocation.
>  	 */
>  	CR_BEST_AVAIL_LEN,
>  
>  	/*
> -	 * Reads each block group sequentially, performing disk IO if necessary, to
> -	 * find find_suitable block group. Tries to allocate goal length but might trim
> -	 * the request if nothing is found after enough tries.
> +	 * Reads each block group sequentially, performing disk IO if
> +	 * necessary, to find find_suitable block group. Tries to
> +	 * allocate goal length but might trim the request if nothing
> +	 * is found after enough tries.
>  	 */
>  	CR_GOAL_LEN_SLOW,
>  
>  	/*
> -	 * Finds the first free set of blocks and allocates those. This is only
> -	 * used in rare cases when CR_GOAL_LEN_SLOW also fails to allocate
> -	 * anything.
> +	 * Finds the first free set of blocks and allocates
> +	 * those. This is only used in rare cases when
> +	 * CR_GOAL_LEN_SLOW also fails to allocate anything.
>  	 */
>  	CR_ANY_FREE,
> +
> +	/*
> +	 * Number of criterias defined.
> +	 */
> +	EXT4_MB_NUM_CRS
>  };
>  
>  /* criteria below which we use fast block scanning and avoid unnecessary IO */
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 8a6896d4e9b0..2f9f5dc720cc 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2759,7 +2759,7 @@ static noinline_for_stack int
>  ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  {
>  	ext4_group_t prefetch_grp = 0, ngroups, group, i;
> -	enum criteria cr, new_cr;
> +	enum criteria new_cr, cr = CR_GOAL_LEN_FAST;
>  	int err = 0, first_err = 0;
>  	unsigned int nr = 0, prefetch_ios = 0;
>  	struct ext4_sb_info *sbi;
> @@ -2816,12 +2816,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  		spin_unlock(&sbi->s_md_lock);
>  	}
>  
> -	/* Let's just scan groups to find more-less suitable blocks */
> -	cr = ac->ac_2order ? CR_POWER2_ALIGNED : CR_GOAL_LEN_FAST;
>  	/*
> -	 * cr == CR_POWER2_ALIGNED try to get exact allocation,
> -	 * cr == CR_ANY_FREE try to get anything
> +	 * Let's just scan groups to find more-less suitable blocks We
> +	 * start with CR_GOAL_LEN_FAST, unless it is power of 2
> +	 * aligned, in which case let's do that faster approach first.
>  	 */
> +	if (ac->ac_2order)
> +		cr = CR_POWER2_ALIGNED;
>  repeat:
>  	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
>  		ac->ac_criteria = cr;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 09/12] ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs
  2023-06-06 14:00   ` Guoqing Jiang
@ 2023-06-27  6:51     ` Ojaswin Mujoo
  2023-06-28  1:33       ` Guoqing Jiang
  0 siblings, 1 reply; 26+ messages in thread
From: Ojaswin Mujoo @ 2023-06-27  6:51 UTC (permalink / raw)
  To: Guoqing Jiang
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-fsdevel,
	linux-kernel, Jan Kara, Kemeng Shi, Ritesh Harjani

On Tue, Jun 06, 2023 at 10:00:57PM +0800, Guoqing Jiang wrote:
> Hello,
> 
> On 5/30/23 20:33, Ojaswin Mujoo wrote:
> > Before this patch, the call stack in ext4_run_li_request is as follows:
> > 
> >    /*
> >     * nr = no. of BGs we want to fetch (=s_mb_prefetch)
> >     * prefetch_ios = no. of BGs not uptodate after
> >     * 		    ext4_read_block_bitmap_nowait()
> >     */
> >    next_group = ext4_mb_prefetch(sb, group, nr, prefetch_ios);
> >    ext4_mb_prefetch_fini(sb, next_group, prefetch_ios);
> > 
> > ext4_mb_prefetch_fini() will only try to initialize buddies for BGs in
> > range [next_group - prefetch_ios, next_group). This is incorrect since
> > sometimes (prefetch_ios < nr), which causes ext4_mb_prefetch_fini() to
> > incorrectly ignore some of the BGs that might need initialization. This
> > issue is more notable now with the previous patch enabling "fetching" of
> > BLOCK_UNINIT BGs which are marked buffer_uptodate by default.
> > 
> > Fix this by passing nr to ext4_mb_prefetch_fini() instead of
> > prefetch_ios so that it considers the right range of groups.
> 
> Thanks for the series.
> 
> > Similarly, make sure we don't pass nr=0 to ext4_mb_prefetch_fini() in
> > ext4_mb_regular_allocator() since we might have prefetched BLOCK_UNINIT
> > groups that would need buddy initialization.
> 
> Seems ext4_mb_prefetch_fini can't be called by ext4_mb_regular_allocator
> if nr is 0.

Hi Guoqing,

Sorry, I was on vacation so I didn't get a chance to reply to this sooner.
Let me explain what I meant by that statement in the commit message.

So basically, the prefetch_ios output argument is incremented whenever
ext4_mb_prefetch() reads a block group with !buffer_uptodate(bh).
However, for BLOCK_UNINIT BGs the buffer is marked uptodate after
initialization and hence prefetch_ios is not incremented when such BGs
are prefetched. 

This leads to nr becoming 0 due to the following line (removed in this patch):

				if (prefetch_ios == curr_ios)
					nr = 0;

hence ext4_mb_prefetch_fini() would never pre-initialize the corresponding
buddy structures. Instead, these structures would probably get initialized
at a later point, during the slower allocation criterias. The
motivation for making sure the BLOCK_UNINIT BGs' buddies are
pre-initialized is so that the faster allocation criterias can use this
data to make better decisions.
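
For illustration, this is roughly how the pre-patch flow looked (simplified,
with unrelated logic dropped):

        /* in the group scan loop of ext4_mb_regular_allocator(), before this patch */
        unsigned int curr_ios = prefetch_ios;

        nr = sbi->s_mb_prefetch;
        prefetch_grp = ext4_mb_prefetch(sb, group, nr, &prefetch_ios);
        /*
         * BLOCK_UNINIT groups come back buffer_uptodate without any IO, so
         * prefetch_ios does not move for them and nr gets zeroed:
         */
        if (prefetch_ios == curr_ios)
                nr = 0;

        /* later, near the end of the function: */
        if (nr)
                ext4_mb_prefetch_fini(sb, prefetch_grp, nr);    /* skipped when nr == 0 */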

Regards,
ojaswin

> 
> https://elixir.bootlin.com/linux/v6.4-rc5/source/fs/ext4/mballoc.c#L2816
> 
> Am I missing something?
> 
> Thanks,
> Guoqing
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 09/12] ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs
  2023-06-27  6:51     ` Ojaswin Mujoo
@ 2023-06-28  1:33       ` Guoqing Jiang
  0 siblings, 0 replies; 26+ messages in thread
From: Guoqing Jiang @ 2023-06-28  1:33 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-fsdevel,
	linux-kernel, Jan Kara, Kemeng Shi, Ritesh Harjani

Hi Ojaswin,

On 6/27/23 14:51, Ojaswin Mujoo wrote:
> On Tue, Jun 06, 2023 at 10:00:57PM +0800, Guoqing Jiang wrote:
>> Hello,
>>
>> On 5/30/23 20:33, Ojaswin Mujoo wrote:
>>> Before this patch, the call stack in ext4_run_li_request is as follows:
>>>
>>>     /*
>>>      * nr = no. of BGs we want to fetch (=s_mb_prefetch)
>>>      * prefetch_ios = no. of BGs not uptodate after
>>>      * 		    ext4_read_block_bitmap_nowait()
>>>      */
>>>     next_group = ext4_mb_prefetch(sb, group, nr, prefetch_ios);
>>>     ext4_mb_prefetch_fini(sb, next_group, prefetch_ios);
>>>
>>> ext4_mb_prefetch_fini() will only try to initialize buddies for BGs in
>>> range [next_group - prefetch_ios, next_group). This is incorrect since
>>> sometimes (prefetch_ios < nr), which causes ext4_mb_prefetch_fini() to
>>> incorrectly ignore some of the BGs that might need initialization. This
>>> issue is more notable now with the previous patch enabling "fetching" of
>>> BLOCK_UNINIT BGs which are marked buffer_uptodate by default.
>>>
>>> Fix this by passing nr to ext4_mb_prefetch_fini() instead of
>>> prefetch_ios so that it considers the right range of groups.
>> Thanks for the series.
>>
>>> Similarly, make sure we don't pass nr=0 to ext4_mb_prefetch_fini() in
>>> ext4_mb_regular_allocator() since we might have prefetched BLOCK_UNINIT
>>> groups that would need buddy initialization.
>> Seems ext4_mb_prefetch_fini can't be called by ext4_mb_regular_allocator
>> if nr is 0.
> Hi Guoqing,
>
> Sorry, I was on vacation so I didn't get a chance to reply to this sooner.
> Let me explain what I meant by that statement in the commit message.
>
> So basically, the prefetch_ios output argument is incremented whenever
> ext4_mb_prefetch() reads a block group with !buffer_uptodate(bh).
> However, for BLOCK_UNINIT BGs the buffer is marked uptodate after
> initialization and hence prefetch_ios is not incremented when such BGs
> are prefetched.
>
> This leads to nr becoming 0 due to the following line (removed in this patch):
>
> 				if (prefetch_ios == curr_ios)
> 					nr = 0;
>
> hence ext4_mb_prefetch_fini() would never pre-initialize the corresponding
> buddy structures. Instead, these structures would probably get initialized
> at a later point, during the slower allocation criterias. The
> motivation for making sure the BLOCK_UNINIT BGs' buddies are
> pre-initialized is so that the faster allocation criterias can use this
> data to make better decisions.

Got it, thanks for the detailed explanation!

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2023-06-28  1:39 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-30 12:33 [PATCH v2 00/12] multiblock allocator improvements Ojaswin Mujoo
2023-05-30 12:33 ` [PATCH v2 01/12] Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits" Ojaswin Mujoo
2023-05-30 16:28   ` Sedat Dilek
2023-05-31  8:57     ` Ojaswin Mujoo
2023-06-02 13:45       ` Thorsten Leemhuis
2023-06-02 16:45         ` Theodore Ts'o
2023-05-30 12:33 ` [PATCH v2 02/12] ext4: mballoc: Remove useless setting of ac_criteria Ojaswin Mujoo
2023-05-30 12:33 ` [PATCH v2 03/12] ext4: Remove unused extern variables declaration Ojaswin Mujoo
2023-05-30 12:33 ` [PATCH v2 04/12] ext4: Convert mballoc cr (criteria) to enum Ojaswin Mujoo
2023-06-06 13:13   ` Jan Kara
2023-05-30 12:33 ` [PATCH v2 05/12] ext4: Add per CR extent scanned counter Ojaswin Mujoo
2023-05-30 12:33 ` [PATCH v2 06/12] ext4: Add counter to track successful allocation of goal length Ojaswin Mujoo
2023-05-30 12:33 ` [PATCH v2 07/12] ext4: Avoid scanning smaller extents in BG during CR1 Ojaswin Mujoo
2023-05-30 12:33 ` [PATCH v2 08/12] ext4: Don't skip prefetching BLOCK_UNINIT groups Ojaswin Mujoo
2023-05-30 12:33 ` [PATCH v2 09/12] ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs Ojaswin Mujoo
2023-06-06 14:00   ` Guoqing Jiang
2023-06-27  6:51     ` Ojaswin Mujoo
2023-06-28  1:33       ` Guoqing Jiang
2023-05-30 12:33 ` [PATCH v2 10/12] ext4: Abstract out logic to search average fragment list Ojaswin Mujoo
2023-05-30 12:33 ` [PATCH v2 11/12] ext4: Add allocation criteria 1.5 (CR1_5) Ojaswin Mujoo
2023-06-07 10:21   ` Jan Kara
2023-06-08 14:45     ` Theodore Ts'o
2023-06-09 10:57       ` Ojaswin Mujoo
2023-05-30 12:33 ` [PATCH v2 12/12] ext4: Give symbolic names to mballoc criterias Ojaswin Mujoo
2023-06-07 10:39   ` Jan Kara
2023-06-09  3:14 ` [PATCH v2 00/12] multiblock allocator improvements Theodore Ts'o
